index.Rmd

--- 
title: "Poverty and Inequality with Complex Survey Data"
author: "By Guilherme Jacob, Anthony Damico, and Djalma Pessoa.  The authors received no external funding for the `convey` software and this accompanying textbook."
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
  bookdown::tufte_html_book:
    toc: yes
    css: toc.css
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apa
link-citations: yes
github-repo: guilhermejacob/context
description: "A book about the R convey package"
delete_merged_file: yes
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  cache=TRUE, cache.lazy=FALSE
)
```

```{r results='hide', echo=FALSE}
set.seed(2017)

```

# Intro

The R `convey` library estimates measures of poverty, inequality and richness/affluence.  Two other R libraries cover this subject, [vardpoor](https://CRAN.R-project.org/package=vardpoor) [@R-vardpoor] and [laeken](https://CRAN.R-project.org/package=laeken) [@R-laeken]; however, only `convey` integrates seamlessly with the [R survey package](https://CRAN.R-project.org/package=survey) [@R-survey-article;@R-survey-book;@R-survey].

`convey` is free and open-source software that runs inside the [R environment for statistical computing](https://www.r-project.org/).  Anyone can review and propose changes to [the source code](https://github.com/ajdamico/convey) for this software.  Readers are welcome to [propose changes to this book](https://github.com/guilhermejacob/context/) as well.

As a companion guide, this flowchart clarifies the various options in this software:

[![](https://mermaid.ink/img/pako:eNrNWWlv20gS_SsNYQa2AUlraZL4wCQDyc41k8xmEy-yyXqxaJEtqcdkN4fdlCwn-e_7qg-SohXnmC8LA7ZE9vGq6tWr6vaHXqJT0TvtXapFyYsluzi_VPixk39f9s412-iKLflKMM4SnReZuGaG019mqnIlNiwVRi4U07M_RGJZUgpuRcrW0i6ZwC9RMrPa-EH7B0zT11IU9YNfLnv_CRuyweAR-_i7_sim2Ps1d5PtkivsrLBVn_4ambqngmVcXAlFK654mRYaH16zTM5KXkphht113wnzkZ01Rq25ssxqxhXPNjeCFXolSotNpBJ_VjyT9BlrljJZKmHM3_h8nlVCJaKG7H_smdvgHEu_8mvUW5_7N4_x6jlb6ypLAfBK0LbCWJnDVczonHyZ57zcYO9UJtzqsoU_LPLki4tURsyrjLkwtuc_8Qs8xQI-ZqenCEmSzlkC384otGGq1Tpr-2RkLFxADg_eYanOpeJwQlyfwTy__jOs_5JLVY_NBcd-wriRGPfMj3tLOFab-cLus-FwyPpswR6ywwPHr8oK46K7FDxNdIUYxeVKGNr_eVY-4swkuoS9CiEB6UC32QZ2SLWg1z-MKGw_jA4P8SzTa7dcXCRDdD1yG_H8ayee0TYety-W4AvBlC5z8OMG-8ZlF7xw0NZLsIVZfoVZUpEnE28EYUilseQ68nAX0ynNLoSmxLoDekhFGryUi2UrMN4lLlvsUhvhPfG5dYaM3j_Ta4FnfXqNydIw8CnLEH4mLCzhGStKPctE7uDdhZ8mc_eJl2xeqcRKrYY062KJV6CCIj9yy-TcpR_5iOVaiQ0Nmpc6D2vCDLCOq5QtJGyVlKQ0xKU88tANYfskMLoiwdHGhMh3QR30u08DJRtT6SX0rMuJdzs5Mf48JwwkowQh7uSGcBFC_GZ4xUqxgKmGjLQl3DMXpRtGsPCiyuArRbGQilTVCKbncduOPUP2HELo9HKNfDVuvwWHpjwcBzbuNbLGmhzd22JpxwnvvRPW3Fqz3zKd126kR7zkYIhztKfF17nEm4oiwRk0TAysHohraQdxCHCJsiiF5cQkx79KkRI5qduvFFySgZAy6bMZeOAV7IBxAy0taI73QipXMgU_HLS3ZEqNHqLhvUsrcqXgHYinXgMUaQ17TfM3UmSp2TJMVfkMQBCNDdhuHH9ByUzwFW1UZySgV6alk--8Wyc1t1KRdOiFJ3Ap0tfvWNOVggSgqH3Iv75zDiqh3bg82RnZIftn1PRazr0IQTNLWgaBdMlDmg7Zjkjt-XuPc9oKPyH9v0C3VXF-o4ojUsnVYIb8aPhll0igpc7SWHp-8xP-4U3iZWH3O_WmnkJxhZJBNEntUrQWFGA8zKQLr1SB9tJoFaVwIRS4kcUxcaYJU50i00woX1k3TtwCtKUh-5DaDPgdy-wAS18N9Hxwy5yDHU54Dpte-FaoyZ3YRTTke-5Hv8ToPeg-y7lCfLwg4TfNBIZf9giOd1HZdZFZ0ngvQtirqDKXmzQj2N0Um7usiDwLkH6PkFLvS-SbKLNN8LLvc7DCBrUC74yoMZa57GKkNyWhIpi5o0ZMcD1vlVivwCsTEKO3BUTqXZDfHXh_j_BmckF1w4Ubkk7hhETlhrZKdZZBCtg-dMDrcFKVxOPNQQtusei6tBTkw1Wb-DR6JuxaCBWybNsKFkxwyGN9u8vdrkyptMNyaY3I5h1bX9XsqNDGhC1TXee7Q9LIn6XSvmNriGhVuFalNp6KNuZ37d82jqMfWDj2QrOhGeY7qdXpzF37XatQ3Rf7_Hk8-Z6-mmFemB7kPEPfom5gHzomoHdl0qCuo0uzqA5rqIFfQSbkuqWPndODJpXojSt81AJ5ShjmeqNClNTqso4VwOFhTL_7iIElpmGNSSMlW6q9pSY0fhInBOMXX81sZ6ALduR4LpSrE2s8VIGPrR1CFfrT3JKjOs_DgqjFSji2zz
ae7Lpg48MfkdNUQHwP3qQRfZtpa-FcDGr8GZ0xjceZliu2TjR-dD2cXDF1VUj7xH3hGEFCsBK152rDGt9JJbumPcUzPBHzuUykUDYcezAGGZJtBhVts-I4jlKcQbaklDPnX1RohSNa0kbt2u61ICZCrCqbSaKVe51thrugBaffCLXgXWzv6WH0RABWd1TRfh5aWmr9IrG7Phnu8uK07cXHypa62NSkaA2LzhMydlCiMDLDJBwl2e2icLEUMhu8IDKL636rhTWuVXFNuPb9YqsDcXrki7frVSFBUD7PIbML1vQOWKPPwrr4i7DCUfAOXGce1x9Q1m48fx3goSgXtHbcvs7U5iakgaa061lIelGl20DjhQ8qadbI7LTFrIb1Qra63-3YbbeZt6IX-82a3-QD6UUkioqrQHVH6yLTxOgLEEZ3QLj4KxDOmnBMmnjsaq1pw18H57fi8s37bifXWTu53uhEomN9iy6Auqwn4czeiHyDN8Ll9qrLnom9grZqVTN4l16iBpQoXmCLqwrU6PpWDwUJ8gcUFQ7jNIG6NroTaG4O0BknsTqK64JOzGFrRI97bcvp6sPNa6qve7b2xvkTIgZyIMmhrWZJLQUd8txGb7S_c8ndSdldMjT5Rn0eFVAsv2fiijhmEgZYSH0kmp1WbaFrEshx5W5AdjYkdJX3Ol4rTuK1YmCKv617Qi5_E0p2p-xgzCQMolGvG5--0uhC5Q1vXb7QIV0laMX66DOat4Mmb-MFZ70NTWwuHcIhHY69dV1RK5FfJBowqdEF3tDLmGd2U4j_Rl48ZHtnS76HhAuNNb5dlXzF3X1fB5cziE1CB87o4iEX6JXYwjWolBdzxJbEyoY7J7cJDqN0gNfUgpkarFu5uWR0zcTwtgXTL1vw5OnFxahlQ_A326dLgQX7ma4SMWbg2rJvNupw2xrz7ea0-TLd5kvTFbfZsqJEbrL4i1SBdrkrQD4n4O5y8ivY0jh7GsFNvs7Z421nA2709aMv-7rxSdh1jE2f-sYcQnShdWZqZOM4ZnLr7pyE0N-eu2YMzlxJU7l7LgdsXvIkno5vn5nDAXTfCzfaigOkGPy0cK1pDbw-1wQ97_V7OY6cXKa9094HkunLHham68tTfEzFnFeZvexdqk8Yyiur32xU0ju1ZSX6vapIEfhzyXEOyeNDHMDQ2r_0_wNy_wrq9wqu3mtdD8HX3umH3nXvdHz0YDh-cHLv8GR0dPLg-Oh-v7fpnQ7Gh8fDwwejeyfjk9HJyXh0fPyp37txKxwN7z_46f7R8fHo3v2f7t07vv_pf82INM0?type=png)](https://mermaid.live/edit#pako:eNrNWWlv20gS_SsNYQa2AUlraZL4wCQDyc41k8xmEy-yyXqxaJEtqcdkN4fdlCwn-e_7qg-SohXnmC8LA7ZE9vGq6tWr6vaHXqJT0TvtXapFyYsluzi_VPixk39f9s412-iKLflKMM4SnReZuGaG019mqnIlNiwVRi4U07M_RGJZUgpuRcrW0i6ZwC9RMrPa-EH7B0zT11IU9YNfLnv_CRuyweAR-_i7_sim2Ps1d5PtkivsrLBVn_4ambqngmVcXAlFK654mRYaH16zTM5KXkphht113wnzkZ01Rq25ssxqxhXPNjeCFXolSotNpBJ_VjyT9BlrljJZKmHM3_h8nlVCJaKG7H_smdvgHEu_8mvUW5_7N4_x6jlb6ypLAfBK0LbCWJnDVczonHyZ57zcYO9UJtzqsoU_LPLki4tURsyrjLkwtuc_8Qs8xQI-ZqenCEmSzlkC384otGGq1Tpr-2RkLFxADg_eYanOpeJwQlyfwTy__jOs_5JLVY_NBcd-wriRGPfMj3tLOFab-cLus-FwyPpswR6ywwPHr8oK46K7FDxNdIUYxeVKGNr_eVY-4swkuoS9CiEB6UC32QZ2SLWg1z-MKGw_jA4P8SzTa7dcXCRDdD1yG_H8ayee0TYety-W4AvB
lC5z8OMG-8ZlF7xw0NZLsIVZfoVZUpEnE28EYUilseQ68nAX0ynNLoSmxLoDekhFGryUi2UrMN4lLlvsUhvhPfG5dYaM3j_Ta4FnfXqNydIw8CnLEH4mLCzhGStKPctE7uDdhZ8mc_eJl2xeqcRKrYY062KJV6CCIj9yy-TcpR_5iOVaiQ0Nmpc6D2vCDLCOq5QtJGyVlKQ0xKU88tANYfskMLoiwdHGhMh3QR30u08DJRtT6SX0rMuJdzs5Mf48JwwkowQh7uSGcBFC_GZ4xUqxgKmGjLQl3DMXpRtGsPCiyuArRbGQilTVCKbncduOPUP2HELo9HKNfDVuvwWHpjwcBzbuNbLGmhzd22JpxwnvvRPW3Fqz3zKd126kR7zkYIhztKfF17nEm4oiwRk0TAysHohraQdxCHCJsiiF5cQkx79KkRI5qduvFFySgZAy6bMZeOAV7IBxAy0taI73QipXMgU_HLS3ZEqNHqLhvUsrcqXgHYinXgMUaQ17TfM3UmSp2TJMVfkMQBCNDdhuHH9ByUzwFW1UZySgV6alk--8Wyc1t1KRdOiFJ3Ap0tfvWNOVggSgqH3Iv75zDiqh3bg82RnZIftn1PRazr0IQTNLWgaBdMlDmg7Zjkjt-XuPc9oKPyH9v0C3VXF-o4ojUsnVYIb8aPhll0igpc7SWHp-8xP-4U3iZWH3O_WmnkJxhZJBNEntUrQWFGA8zKQLr1SB9tJoFaVwIRS4kcUxcaYJU50i00woX1k3TtwCtKUh-5DaDPgdy-wAS18N9Hxwy5yDHU54Dpte-FaoyZ3YRTTke-5Hv8ToPeg-y7lCfLwg4TfNBIZf9giOd1HZdZFZ0ngvQtirqDKXmzQj2N0Um7usiDwLkH6PkFLvS-SbKLNN8LLvc7DCBrUC74yoMZa57GKkNyWhIpi5o0ZMcD1vlVivwCsTEKO3BUTqXZDfHXh_j_BmckF1w4Ubkk7hhETlhrZKdZZBCtg-dMDrcFKVxOPNQQtusei6tBTkw1Wb-DR6JuxaCBWybNsKFkxwyGN9u8vdrkyptMNyaY3I5h1bX9XsqNDGhC1TXee7Q9LIn6XSvmNriGhVuFalNp6KNuZ37d82jqMfWDj2QrOhGeY7qdXpzF37XatQ3Rf7_Hk8-Z6-mmFemB7kPEPfom5gHzomoHdl0qCuo0uzqA5rqIFfQSbkuqWPndODJpXojSt81AJ5ShjmeqNClNTqso4VwOFhTL_7iIElpmGNSSMlW6q9pSY0fhInBOMXX81sZ6ALduR4LpSrE2s8VIGPrR1CFfrT3JKjOs_DgqjFSji2zzae7Lpg48MfkdNUQHwP3qQRfZtpa-FcDGr8GZ0xjceZliu2TjR-dD2cXDF1VUj7xH3hGEFCsBK152rDGt9JJbumPcUzPBHzuUykUDYcezAGGZJtBhVts-I4jlKcQbaklDPnX1RohSNa0kbt2u61ICZCrCqbSaKVe51thrugBaffCLXgXWzv6WH0RABWd1TRfh5aWmr9IrG7Phnu8uK07cXHypa62NSkaA2LzhMydlCiMDLDJBwl2e2icLEUMhu8IDKL636rhTWuVXFNuPb9YqsDcXrki7frVSFBUD7PIbML1vQOWKPPwrr4i7DCUfAOXGce1x9Q1m48fx3goSgXtHbcvs7U5iakgaa061lIelGl20DjhQ8qadbI7LTFrIb1Qra63-3YbbeZt6IX-82a3-QD6UUkioqrQHVH6yLTxOgLEEZ3QLj4KxDOmnBMmnjsaq1pw18H57fi8s37bifXWTu53uhEomN9iy6Auqwn4czeiHyDN8Ll9qrLnom9grZqVTN4l16iBpQoXmCLqwrU6PpWDwUJ8gcUFQ7jNIG6NroTaG4O0BknsTqK64JOzGFrRI97bcvp6sPNa6qve7b2xvkTIgZyIMmhrWZJLQUd8txGb7S_c8ndSdldMjT5Rn0eFVAsv2fiijhmEgZYSH0kmp1WbaFrEshx5W5AdjYkdJX3
Ol4rTuK1YmCKv617Qi5_E0p2p-xgzCQMolGvG5--0uhC5Q1vXb7QIV0laMX66DOat4Mmb-MFZ70NTWwuHcIhHY69dV1RK5FfJBowqdEF3tDLmGd2U4j_Rl48ZHtnS76HhAuNNb5dlXzF3X1fB5cziE1CB87o4iEX6JXYwjWolBdzxJbEyoY7J7cJDqN0gNfUgpkarFu5uWR0zcTwtgXTL1vw5OnFxahlQ_A326dLgQX7ma4SMWbg2rJvNupw2xrz7ea0-TLd5kvTFbfZsqJEbrL4i1SBdrkrQD4n4O5y8ivY0jh7GsFNvs7Z421nA2709aMv-7rxSdh1jE2f-sYcQnShdWZqZOM4ZnLr7pyE0N-eu2YMzlxJU7l7LgdsXvIkno5vn5nDAXTfCzfaigOkGPy0cK1pDbw-1wQ97_V7OY6cXKa9094HkunLHham68tTfEzFnFeZvexdqk8Yyiur32xU0ju1ZSX6vapIEfhzyXEOyeNDHMDQ2r_0_wNy_wrq9wqu3mtdD8HX3umH3nXvdHz0YDh-cHLv8GR0dPLg-Oh-v7fpnQ7Gh8fDwwejeyfjk9HJyXh0fPyp37txKxwN7z_46f7R8fHo3v2f7t07vv_pf82INM0)

Individuals getting started in the field of poverty and inequality statistics might find the number of techniques described in this textbook overwhelming, especially when choosing which method is most appropriate for a particular research question.  The authors of this textbook consider Dr. Ija Trapeznikova's article [Measuring income inequality](https://wol.iza.org/articles/measuring-income-inequality/long) an important summary of how to choose among the available techniques.


## Installation {#install}

In order to work with the `convey` library, you will need to have R running on your machine.  If you have never used R before, you will need to [install that software](https://www.r-project.org/) before `convey` can be accessed.  Once you have R loaded on your machine, you can install:

* the latest released version from [CRAN](https://CRAN.R-project.org/package=convey) with

```R
install.packages("convey")
```

* the latest development version from GitHub with

```R
# requires the `remotes` package: install.packages("remotes")
remotes::install_github("ajdamico/convey")
```

In order to know how to cite this package, run `citation("convey")`.

## Complex surveys and statistical inference {#survey}

In this book, we demonstrate how to estimate poverty and inequality measures in a population using microdata collected from a complex survey sample.  Most surveys administered by government agencies or larger research organizations utilize a sampling design that violates the assumption of simple random sampling (SRS), including:

1. Unequal selection probabilities across units;
2. Clustering of units;
3. Stratification of clusters;
4. Reweighting to compensate for missing values and other adjustments.

Therefore, basic unweighted R commands such as `mean()` or `glm()` will account for neither the survey weights nor the design-based measures of uncertainty (such as sampling variance estimates and confidence intervals).  For some examples of publicly-available complex survey data sets, see [http://asdfree.com](http://asdfree.com).  
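
As a minimal sketch of why unweighted commands can mislead, the toy data below oversamples a high-income group.  The values and the names `toy`, `income`, and `w` are made up for illustration and do not come from any survey.  Base R's `weighted.mean()` corrects only the point estimate; design-based standard errors still require the `survey` package.

```R
# toy sample: the two high-income records were oversampled, so each one
# represents fewer people in the population than the other records do
toy <- data.frame(
  income = c(10, 12, 11, 50, 55),
  w = c(100, 100, 100, 10, 10)
)

# the unweighted mean treats every record equally
unweighted <- mean(toy$income)

# the weighted mean lets each record count for `w` people
weighted <- weighted.mean(toy$income, toy$w)

c(unweighted = unweighted, weighted = weighted)
```

Here the unweighted mean sits well above the weighted mean, because the oversampled records pull it upward.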

Unlike other software, the R `convey` package does not require that the user specify these parameters throughout the analysis.  So long as the [svydesign object](http://r-survey.r-forge.r-project.org/survey/html/svydesign.html) or [svrepdesign object](http://r-survey.r-forge.r-project.org/survey/html/svrepdesign.html) has been constructed properly at the outset of the analysis, the `convey` package will incorporate the survey design automatically and produce statistics and variances that take the complex sample into account.

Survey analysts familiar with the R `dplyr` syntax implemented by the `survey` library's wrapper `srvyr` package might be interested in implementing specific `convey` functions by following the [`svygini()` example](http://gdfe.co/srvyr/articles/extending-srvyr.html) published by `srvyr` author Greg Freedman Ellis.  Note that the full design stored by `convey_prep()` may in some cases complicate this extension.


## Usage Examples


In the following example, we've loaded the data set `eusilc` from the R library [laeken](https://CRAN.R-project.org/package=laeken) [@R-laeken].

```{r results='hide', message=FALSE, warning=FALSE}
library(laeken)
data(eusilc)
```
Next, we create an object of class `survey.design` using the function `svydesign` of the `survey` library:

```{r results='hide', message=FALSE, warning=FALSE}
library(survey)
des_eusilc <-
  svydesign(
    ids = ~ rb030,
    strata =  ~ db040,
    weights = ~ rb050,
    data = eusilc
  )
```

Immediately after creating the design object `des_eusilc`, we should run the function `convey_prep`, which adds an attribute to the survey design that stores information based upon the whole sample.  This information is needed to work correctly with subsetted design objects.

```{r}
library(convey)
des_eusilc <- convey_prep(des_eusilc)
```

To estimate the at-risk-of-poverty rate, we use the function `svyarpr`:

```{r comment=NA}
svyarpr( ~ eqIncome, design = des_eusilc)
```
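
To make the quantity concrete, here is a hedged base-R sketch of what an at-risk-of-poverty rate measures: the weighted share of people whose income falls below 60% of the weighted median, which corresponds to the `svyarpr()` defaults.  The helper `weighted_median()` and the toy vectors are hypothetical, and the quantile rule used inside `convey` may differ in detail, so this is an illustration of the idea and not the package's implementation.

```R
# hypothetical helper: weighted median as the smallest value whose
# cumulative weight share reaches one half
weighted_median <- function(x, w) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  x[which(cumsum(w) / sum(w) >= 0.5)[1]]
}

# made-up incomes and sampling weights
income <- c(5, 8, 20, 22, 25, 40)
w <- c(2, 1, 1, 2, 1, 1)

# poverty threshold: 60% of the weighted median income
threshold <- 0.6 * weighted_median(income, w)

# at-risk-of-poverty rate: weighted share of people below the threshold
arpr <- sum(w[income < threshold]) / sum(w)

c(threshold = threshold, arpr = arpr)
```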

To estimate the at-risk-of-poverty rate across domains defined by the variable `db040` we use:

```{r comment=NA}
svyby(
  ~ eqIncome,
  by = ~ db040,
  design = des_eusilc,
  FUN = svyarpr,
  deff = FALSE
)
```

Using the same data set, we estimate the quintile share ratio: 

```{r comment=NA}
# for the whole population
svyqsr( ~ eqIncome, design = des_eusilc, alpha1 = .20)

# for domains
svyby(
  ~ eqIncome,
  by = ~ db040,
  design = des_eusilc,
  FUN = svyqsr,
  alpha1 = .20,
  deff = FALSE
)
```
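
As a rough sketch of the definition, the quintile share ratio divides the total income held by the richest fifth of the population by the total income held by the poorest fifth.  The toy vector below uses ten equally weighted, made-up incomes so that each fifth contains exactly two people; the survey estimator instead locates the cut points with weighted quantiles, which this example sidesteps.

```R
# made-up incomes for ten equally weighted people
income <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sorted <- sort(income)

# number of people in each fifth of the distribution
n_fifth <- length(sorted) / 5

# total income of the poorest and richest fifths
bottom_total <- sum(head(sorted, n_fifth))
top_total <- sum(tail(sorted, n_fifth))

qsr <- top_total / bottom_total
qsr
```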

These functions can be used as S3 methods for the classes `survey.design` and `svyrep.design`.

Let's create a design object of class `svyrep.design` and run the function `convey_prep` on it:

```{r}
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
des_eusilc_rep <- convey_prep(des_eusilc_rep)
```

The function `svyarpr` produces matching coefficients and near-identical standard errors on the replication design:

```{r comment=NA}
svyarpr( ~ eqIncome, design = des_eusilc_rep)

svyby(
  ~ eqIncome,
  by = ~ db040,
  design = des_eusilc_rep,
  FUN = svyarpr,
  deff = FALSE
)
```

The functions of the `convey` library are called in a similar way to the functions in the `survey` library.

It is also possible to discard missing values by using the argument `na.rm`:

```{r comment=NA}
# survey.design using a variable with missings
svygini( ~ py010n , design = des_eusilc)
svygini( ~ py010n , design = des_eusilc , na.rm = TRUE)

# svyrep.design using a variable with missings
svygini( ~ py010n , design = des_eusilc_rep)
svygini( ~ py010n , design = des_eusilc_rep , na.rm = TRUE)
```

## Current Population Survey - Annual Social and Economic Supplement (CPS-ASEC)

Sponsored jointly by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics (BLS), the CPS-ASEC is the primary source of labor force statistics for the population of the United States.

This section downloads, imports, and prepares the most current microdata for analysis, then reproduces some statistics and margin of error terms from the U.S. Census Bureau.


Download and unzip the 2023 file:

```{r results='hide', message=FALSE, warning=FALSE}
library(httr)

tf <- tempfile()

this_url <-
  "https://www2.census.gov/programs-surveys/cps/datasets/2023/march/asecpub23sas.zip"

GET(this_url , write_disk(tf), progress())

unzipped_files <- unzip(tf , exdir = tempdir())
```

Import all four files:

```{r results='hide', message=FALSE, warning=FALSE}
library(haven)

four_tbl <- lapply(unzipped_files , read_sas)

four_df <- lapply(four_tbl , data.frame)

four_df <-
  lapply(four_df , function(w) {
    names(w) <- tolower(names(w))
    w
  })

household_df <-
  four_df[[grep('hhpub' , basename(unzipped_files))]]
family_df <-
  four_df[[grep('ffpub' , basename(unzipped_files))]]
person_df <-
  four_df[[grep('pppub' , basename(unzipped_files))]]
repwgts_df <-
  four_df[[grep('repwgt' , basename(unzipped_files))]]
```


```{r results='hide', echo=FALSE}
rm( four_tbl , four_df ) ; gc()
```

Divide weights:

```{r results='hide', message=FALSE, warning=FALSE}
household_df[, 'hsup_wgt'] <- household_df[, 'hsup_wgt'] / 100
family_df[, 'fsup_wgt'] <- family_df[, 'fsup_wgt'] / 100
for (j in c('marsupwt' , 'a_ernlwt' , 'a_fnlwgt'))
  person_df[, j] <- person_df[, j] / 100
```

Merge these four files:

```{r results='hide', message=FALSE, warning=FALSE}
names(family_df)[names(family_df) == 'fh_seq'] <- 'h_seq'
names(person_df)[names(person_df) == 'ph_seq'] <- 'h_seq'
names(person_df)[names(person_df) == 'phf_seq'] <- 'ffpos'

hh_fm_df <- merge(household_df , family_df)
hh_fm_pr_df <- merge(hh_fm_df , person_df)
cps_df <- merge(hh_fm_pr_df , repwgts_df)

stopifnot(nrow(cps_df) == nrow(person_df))
```


```{r results='hide', echo=FALSE}
rm( household_df , family_df , person_df , hh_fm_df , hh_fm_pr_df , repwgts_df ) ; gc()

# variables to keep
cps_df <- cps_df[ c( 'a_age' , 'a_sex' , 'moop' , 'a_maritl' , 'a_famrel' , 'gestfips' , 'ftotval' , 'htotval' , 'pearnval' , 'a_exprrp' , 'a_famtyp' , 'wewkrs' , 'marsupwt' , paste0( 'pwwgt' , 1:160 ) ) ] ; gc()
```


Construct a complex sample survey design:

```{r results='hide', message=FALSE, warning=FALSE}
library(survey)

cps_design <-
  svrepdesign(
    weights = ~ marsupwt ,
    repweights = "pwwgt[1-9]" ,
    type = "Fay" ,
    rho = (1 - 1 / sqrt(4)) ,
    data = cps_df ,
    combined.weights = TRUE ,
    mse = TRUE
  )
```
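
The `type = "Fay"` and `rho` arguments control how the replicate estimates are combined into a variance.  As a hedged sketch, Fay's method scales the sum of squared deviations by `1 / (R * (1 - rho)^2)`, which with `rho = 0.5` and 160 replicate weights works out to `4 / 160`, and `mse = TRUE` centers the deviations on the full-sample estimate.  The helper `fay_variance()` and the replicate estimates below are hypothetical and only illustrate the arithmetic.

```R
# hypothetical helper: Fay's method variance from replicate estimates,
# centered on the full-sample estimate (the mse = TRUE convention)
fay_variance <- function(theta_full, theta_reps, rho) {
  n_reps <- length(theta_reps)
  sum((theta_reps - theta_full)^2) / (n_reps * (1 - rho)^2)
}

# the setting used above: rho = 0.5 and 160 replicates
rho <- 1 - 1 / sqrt(4)
multiplier <- 1 / (160 * (1 - rho)^2)

# made-up numbers with only four replicates, to keep the arithmetic visible
toy_variance <- fay_variance(
  theta_full = 10,
  theta_reps = c(9, 11, 10.5, 9.5),
  rho = rho
)

c(multiplier = multiplier, toy_se = sqrt(toy_variance))
```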

Add a sex variable:

```{r results='hide', message=FALSE, warning=FALSE}
cps_design <-
  update(cps_design, sex = factor(
    a_sex ,
    levels = 1:2 ,
    labels = c('male' , 'female')
  ))
```

Run the `convey_prep()` function on the full design:

```{r results='hide', message=FALSE, warning=FALSE}
cps_design <- convey_prep(cps_design)
```


```{r results='hide', echo=FALSE}
rm( cps_df ) ; gc()
```


### Analysis Examples with the `survey` library \ {-}

Add new columns to the data set:
```{r results='hide', message=FALSE, warning=FALSE}
cps_design <-
  update(
    cps_design ,
    
    one = 1 ,
    
    a_maritl =
      factor(
        a_maritl ,
        labels =
          c(
            "married - civilian spouse present" ,
            "married - AF spouse present" ,
            "married - spouse absent" ,
            "widowed" ,
            "divorced" ,
            "separated" ,
            "never married"
          )
      ) ,
    
    state_name =
      factor(
        gestfips ,
        levels =
          c(
            1L,
            2L,
            4L,
            5L,
            6L,
            8L,
            9L,
            10L,
            11L,
            12L,
            13L,
            15L,
            16L,
            17L,
            18L,
            19L,
            20L,
            21L,
            22L,
            23L,
            24L,
            25L,
            26L,
            27L,
            28L,
            29L,
            30L,
            31L,
            32L,
            33L,
            34L,
            35L,
            36L,
            37L,
            38L,
            39L,
            40L,
            41L,
            42L,
            44L,
            45L,
            46L,
            47L,
            48L,
            49L,
            50L,
            51L,
            53L,
            54L,
            55L,
            56L
          ) ,
        labels =
          c(
            "Alabama",
            "Alaska",
            "Arizona",
            "Arkansas",
            "California",
            "Colorado",
            "Connecticut",
            "Delaware",
            "District of Columbia",
            "Florida",
            "Georgia",
            "Hawaii",
            "Idaho",
            "Illinois",
            "Indiana",
            "Iowa",
            "Kansas",
            "Kentucky",
            "Louisiana",
            "Maine",
            "Maryland",
            "Massachusetts",
            "Michigan",
            "Minnesota",
            "Mississippi",
            "Missouri",
            "Montana",
            "Nebraska",
            "Nevada",
            "New Hampshire",
            "New Jersey",
            "New Mexico",
            "New York",
            "North Carolina",
            "North Dakota",
            "Ohio",
            "Oklahoma",
            "Oregon",
            "Pennsylvania",
            "Rhode Island",
            "South Carolina",
            "South Dakota",
            "Tennessee",
            "Texas",
            "Utah",
            "Vermont",
            "Virginia",
            "Washington",
            "West Virginia",
            "Wisconsin",
            "Wyoming"
          )
      ) ,
    
    male = as.numeric(a_sex == 1)
  )
```

Count the unweighted number of records in the survey sample, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
sum( weights( cps_design , "sampling" ) != 0 )

svyby( ~ one , ~ state_name , cps_design , unwtd.count )
```

Count the weighted size of the generalizable population, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ one , cps_design )

svyby( ~ one , ~ state_name , cps_design , svytotal )
```


Calculate the mean (average) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ pearnval , cps_design )

svyby( ~ pearnval , ~ state_name , cps_design , svymean )
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ a_maritl , cps_design )

svyby( ~ a_maritl , ~ state_name , cps_design , svymean )
```

Calculate the sum of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ pearnval , cps_design )

svyby( ~ pearnval , ~ state_name , cps_design , svytotal )
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ a_maritl , cps_design )

svyby( ~ a_maritl , ~ state_name , cps_design , svytotal )
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svyquantile( ~ pearnval , cps_design , 0.5 )

svyby( 
	~ pearnval , 
	~ state_name , 
	cps_design , 
	svyquantile , 
	0.5 ,
	ci = TRUE 
)
```

Estimate a ratio:
```{r results='hide', message=FALSE, warning=FALSE}
svyratio( 
	numerator = ~ moop , 
	denominator = ~ pearnval , 
	cps_design 
)
```


Restrict the survey design to persons aged 18-64:
```{r results='hide', message=FALSE, warning=FALSE}
sub_cps_design <- subset( cps_design , a_age %in% 18:64 )
```
Calculate the mean (average) of this subset:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ pearnval , sub_cps_design )
```


Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
this_result <- svymean( ~ pearnval , cps_design )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
	svyby( 
		~ pearnval , 
		~ state_name , 
		cps_design , 
		svymean 
	)
	
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
```

Calculate the degrees of freedom of any survey design object:
```{r results='hide', message=FALSE, warning=FALSE}
degf( cps_design )
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r results='hide', message=FALSE, warning=FALSE}
svyvar( ~ pearnval , cps_design )
```

Include the complex sample design effect in the result for a specific statistic:
```{r results='hide', message=FALSE, warning=FALSE}
# SRS without replacement
svymean( ~ pearnval , cps_design , deff = TRUE )

# SRS with replacement
svymean( ~ pearnval , cps_design , deff = "replace" )
```

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See `?svyciprop` for alternatives:
```{r results='hide', message=FALSE, warning=FALSE}
svyciprop( ~ male , cps_design ,
	method = "likelihood" )
```

Perform a design-based t-test:
```{r results='hide', message=FALSE, warning=FALSE}
svyttest( pearnval ~ male , cps_design )
```

Perform a chi-squared test of association for survey data:
```{r results='hide', message=FALSE, warning=FALSE}
svychisq( 
	~ male + a_maritl , 
	cps_design 
)
```

Perform a survey-weighted generalized linear model:
```{r results='hide', message=FALSE, warning=FALSE}
glm_result <- 
	svyglm( 
		pearnval ~ male + a_maritl , 
		cps_design 
	)

summary( glm_result )
```


### Household Income

Limit the CPS-ASEC person-level design to the household reference person, then calculate the Gini coefficient with total household income:

```{r}
cps_household_design <- subset(cps_design , a_exprrp %in% 1:2)

(cps_household_gini <- svygini( ~ htotval , cps_household_design))

# match 2022 household gini coefficient
# https://www2.census.gov/programs-surveys/cps/tables/time-series/historical-income-households/h04.xlsx
stopifnot(round(coef(cps_household_gini) , 3) == 0.488)

# match 2022 household gini margin of error
# https://www.census.gov/content/dam/Census/newsroom/press-kits/2023/iphi/20230912-iphi-slides-income.pdf#page=13
stopifnot(round(
  coef(cps_household_gini) - confint(cps_household_gini , level = 0.9)[1] ,
  4
) == 0.0033) 
```
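
The margin of error checked above is the half-width of a 90% confidence interval.  Assuming a normal-approximation interval, that half-width equals the standard error multiplied by the 95th percentile of the standard normal distribution.  The estimate and standard error below are made-up values chosen only to show the arithmetic.

```R
# made-up estimate and standard error, for illustration only
estimate <- 0.5
standard_error <- 0.01

# a 90% interval leaves 5% in each tail of the normal distribution
z_90 <- qnorm(0.95)

margin_of_error <- z_90 * standard_error
lower_bound <- estimate - margin_of_error

# the estimate minus the lower bound recovers the margin of error
estimate - lower_bound
```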

Calculate matching Theil and Atkinson measures:
```{r}
cps_household_design <-
  update(cps_household_design , htotval_ones = ifelse(htotval < 0 , NA , htotval))
cps_household_design <-
  update(cps_household_design ,
         htotval_ones = ifelse(htotval_ones < 1 , 1 , htotval_ones))

(cps_household_theil <-
    svygei(~ htotval_ones , cps_household_design , na.rm = TRUE))
(
  cps_household_atkinson <-
    svyatk(
      ~ htotval_ones ,
      cps_household_design ,
      epsilon = 0.5 ,
      na.rm = TRUE
    )
)

# https://www2.census.gov/programs-surveys/cps/tables/time-series/historical-income-households/h04.xlsx
stopifnot(round(coef(cps_household_theil) , 3) == 0.440)
stopifnot(round(coef(cps_household_atkinson), 3) == 0.207)
```
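
For intuition, the two measures can be sketched in base R on a small, equally weighted, made-up income vector.  The Theil T index corresponds to `svygei()` with its default `epsilon = 1`, and the Atkinson index below uses `epsilon = 0.5` as in the chunk above.  Both formulas are defined only for strictly positive incomes, which is one plausible reason the recode above drops negative values and floors the remaining values at 1; that motivation is an assumption and is not stated in the source.

```R
# made-up, equally weighted incomes
income <- c(1, 2, 3, 4, 10)
mu <- mean(income)

# Theil T index: mean of (x / mu) * log(x / mu); log() requires x > 0
theil_t <- mean((income / mu) * log(income / mu))

# Atkinson index with inequality aversion epsilon = 0.5
epsilon <- 0.5
atkinson <- 1 - mean((income / mu)^(1 - epsilon))^(1 / (1 - epsilon))

c(theil_t = theil_t, atkinson = atkinson)

# a zero income breaks the Theil term: 0 * log(0) evaluates to NaN in R
zero_term <- 0 * log(0)
zero_term
```

Both indices equal zero when every income is identical and grow as incomes spread out.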


### Family Income

Limit the CPS-ASEC person-level design to the family reference person of primary families, then calculate the Gini coefficient with total family income:

```{r}
cps_family_design <-
  subset(cps_design , a_famrel %in% 1 & a_famtyp %in% 1)

(cps_family_gini <- svygini( ~ ftotval , cps_family_design))

# match 2022 family gini coefficient
# https://www2.census.gov/programs-surveys/cps/tables/time-series/historical-income-families/f04.xlsx
stopifnot(round(coef(cps_family_gini) , 3) == 0.458)
```


### Worker Earnings


The `convey_prep()` function sets the population of reference for poverty threshold estimation.  For relative poverty measures, it is important to consider which population should serve as the reference for the poverty threshold.  For instance, if one is interested in computing poverty rates across regions using a poverty line computed at a nationwide level (represented by a full sample design object), it is important to run `convey_prep()` immediately after creating the full object.  However, imagine that one wants to compute the poverty line using the annual earnings among only full-time, full-year workers.  In this case, subsetting the full object after running `convey_prep()` will compute the poverty line using the entire sample, including the earnings of part-time workers, part-year workers, and also zeroes for the non-working population.  So in this (non-standard) case, we suggest subsetting the main object and then running `convey_prep()` again to re-set the reference population:


Limit the CPS-ASEC person-level design to full-time, full year workers:

```{r}
cps_ftfy_worker_design <- subset(cps_design , wewkrs %in% 1)
```

Re-set the population of reference for poverty threshold estimation:

```{r}
cps_ftfy_worker_design <- convey_prep(cps_ftfy_worker_design)
```

Calculate the Gini coefficient with total earnings:
```{r}
svygini( ~ pearnval , cps_ftfy_worker_design)
```
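
The effect of the reference population on a relative poverty line can be sketched with made-up, equally weighted earnings.  The vectors below are hypothetical, and the threshold rule (60% of the median) is used only as a familiar example; the point is that including zeroes and part-time earnings lowers the median and therefore the threshold.

```R
# made-up annual earnings; zeroes stand in for people without earnings
earnings <- c(0, 0, 0, 15, 20, 30, 40, 50, 60, 80)
full_time_full_year <-
  c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)

# threshold when the reference population is everyone
threshold_all <- 0.6 * median(earnings)

# threshold when the reference population is full-time, full-year workers
threshold_ftfy <- 0.6 * median(earnings[full_time_full_year])

c(all = threshold_all, ftfy = threshold_ftfy)
```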


```{r results='hide', echo=FALSE}
rm( cps_design ) ; gc()
```


## Pesquisa Nacional por Amostra de Domicílios Contínua (PNAD Contínua)

Administered by the Instituto Brasileiro de Geografia e Estatística (IBGE), the PNAD Contínua is the primary source of labor force statistics for the population of Brazil.

This section downloads, imports, and prepares the most current microdata for analysis, then reproduces some statistics and margins of error from IBGE.^[See [this link](https://agenciadenoticias.ibge.gov.br/agencia-noticias/2012-agencia-de-noticias/noticias/36857-em-2022-mercado-de-trabalho-e-auxilio-brasil-permitem-recuperacao-dos-rendimentos) for IBGE Gini coefficient estimates reproduced below.]


Download and import the 2022 5th interview file:

```{r results='hide', message=FALSE, warning=FALSE}
library(PNADcIBGE)

pnadc_df <-
  get_pnadc(2022 ,
            interview = 5 ,
            design = FALSE ,
            labels = FALSE)

names(pnadc_df) <- tolower(names(pnadc_df))
```

Recode a number of variables:
```{r results='hide', message=FALSE, warning=FALSE}
pnadc_df <-
  transform(
    pnadc_df ,
    
    household_id = paste0(upa , v1008 , v1014) ,
    
    deflated_labor_income = vd4019 * co2 ,
    
    deflated_other_source_income = vd4048 * co2e
  )

labor_income_sum_df <-
  aggregate(
    cbind(household_deflated_labor_income = deflated_labor_income) ~ household_id ,
    data = pnadc_df[!(pnadc_df[, 'vd2002'] %in% 17:19) ,] ,
    sum ,
    na.rm = TRUE
  )

other_income_sum_df <-
  aggregate(
    cbind(household_deflated_other_source_income = deflated_other_source_income) ~ household_id ,
    data = pnadc_df[!(pnadc_df[, 'vd2002'] %in% 17:19) ,] ,
    sum ,
    na.rm = TRUE
  )

before_nrow <- nrow(pnadc_df)
pnadc_df <- merge(pnadc_df , labor_income_sum_df , all.x = TRUE)
pnadc_df <- merge(pnadc_df , other_income_sum_df , all.x = TRUE)
stopifnot(nrow(pnadc_df) == before_nrow)

pnadc_df[is.na(pnadc_df[, 'household_deflated_labor_income']) , 'household_deflated_labor_income'] <-
  0
pnadc_df[is.na(pnadc_df[, 'household_deflated_other_source_income']) , 'household_deflated_other_source_income'] <-
  0


pnadc_df <-
  transform(
    pnadc_df ,
    
    deflated_per_capita_income =
      (
        household_deflated_labor_income + household_deflated_other_source_income
      ) / vd2003 ,
    
    sex = factor(
      v2007 ,
      levels = 1:2 ,
      labels = c('male' , 'female')
    )
    
  )
```
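
The aggregate-then-merge pattern used above can be tried on a small made-up data frame.  The names `persons`, `income`, and `household_size` are hypothetical; in particular, the chunk above reads household size from `vd2003`, whereas this sketch counts rows per household.

```R
# made-up person-level records: two households, one person with missing income
persons <- data.frame(
  household_id = c("a", "a", "a", "b", "b"),
  income = c(100, 50, NA, 30, 10)
)

# household totals; the formula interface drops rows with missing income
household_sum <- aggregate(
  cbind(household_income = income) ~ household_id,
  data = persons,
  sum
)

# household size, counted from the person records
persons$household_size <-
  ave(rep(1, nrow(persons)), persons$household_id, FUN = sum)

# merge totals back onto persons and confirm that no records were lost
before_nrow <- nrow(persons)
persons <- merge(persons, household_sum, all.x = TRUE)
stopifnot(nrow(persons) == before_nrow)

persons$per_capita_income <-
  persons$household_income / persons$household_size

persons
```

The `all.x = TRUE` argument keeps people whose household has no aggregated row, and the row-count check guards against accidental duplication during the merge.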


```{r results='hide', echo=FALSE}
# variables to keep
pnadc_df <- pnadc_df[ c( 'vd4002' , 'uf' , 'v2009' , 'v2007' , 'vd4015' , 'vd4016' , 'vd4017' , 'vd4019' , 'vd4020' , 'deflated_per_capita_income' , 'deflated_labor_income' , 'sex' , 'v1032' , grep( 'v1032[0-9]{3}' , names( pnadc_df ) , value = TRUE ) ) ] ; gc()
```


Construct a complex sample survey design:

```{r results='hide', message=FALSE, warning=FALSE}
library(survey)

pnadc_design <-
  svrepdesign(
    data = pnadc_df ,
    weights = ~ v1032 ,
    type = "bootstrap" ,
    repweights = "v1032[0-9]{3}" ,
    mse = TRUE
  )
```

Run the `convey_prep()` function on the full design:

```{r results='hide', message=FALSE, warning=FALSE}
pnadc_design <- convey_prep(pnadc_design)
```

### Analysis Examples with the `survey` library \ {-}

Add new columns to the data set:
```{r results='hide', message=FALSE, warning=FALSE}
pnadc_design <-
  update(pnadc_design ,
         
         pia = as.numeric(v2009 >= 14))

pnadc_design <-
  update(
    pnadc_design ,
    
    ocup_c = ifelse(pia == 1 , as.numeric(vd4002 %in% 1) , NA) ,
    
    desocup30 = ifelse(pia == 1 , as.numeric(vd4002 %in% 2) , NA)
  )

pnadc_design <-
  
  update(
    pnadc_design ,
    
    one = 1 ,
    
    uf_name =
      
      factor(
        as.numeric(uf) ,
        
        levels =
          c(
            11L,
            12L,
            13L,
            14L,
            15L,
            16L,
            17L,
            21L,
            22L,
            23L,
            24L,
            25L,
            26L,
            27L,
            28L,
            29L,
            31L,
            32L,
            33L,
            35L,
            41L,
            42L,
            43L,
            50L,
            51L,
            52L,
            53L
          ) ,
        
        labels =
          c(
            "Rondonia",
            "Acre",
            "Amazonas",
            "Roraima",
            "Para",
            "Amapa",
            "Tocantins",
            "Maranhao",
            "Piaui",
            "Ceara",
            "Rio Grande do Norte",
            "Paraiba",
            "Pernambuco",
            "Alagoas",
            "Sergipe",
            "Bahia",
            "Minas Gerais",
            "Espirito Santo",
            "Rio de Janeiro",
            "Sao Paulo",
            "Parana",
            "Santa Catarina",
            "Rio Grande do Sul",
            "Mato Grosso do Sul",
            "Mato Grosso",
            "Goias",
            "Distrito Federal"
          )
        
      ) ,
    
    age_categories = factor(1 + findInterval(v2009 , seq(5 , 60 , 5))) ,
    
    male = as.numeric(v2007 == 1) ,
    
    region = substr(uf , 1 , 1) ,
    
    # calculate usual income from main job
    # (rendimento habitual do trabalho principal)
    vd4016n = ifelse(pia %in% 1 & vd4015 %in% 1 , vd4016 , NA) ,
    
    # calculate effective income from main job
    # (rendimento efetivo do trabalho principal)
    vd4017n = ifelse(pia %in% 1 & vd4015 %in% 1 , vd4017 , NA) ,
    
    # calculate usual income from all jobs
    # (variavel rendimento habitual de todos os trabalhos)
    vd4019n = ifelse(pia %in% 1 & vd4015 %in% 1 , vd4019 , NA) ,
    
    # calculate effective income from all jobs
    # (rendimento efetivo do todos os trabalhos)
    vd4020n = ifelse(pia %in% 1 & vd4015 %in% 1 , vd4020 , NA) ,
    
    # determine the labor force (economically active population)
    pea_c = as.numeric(ocup_c == 1 | desocup30 == 1)
    
  )
```
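
The `age_categories` column relies on `findInterval()` to bin ages into five-year bands. As a small illustration with made-up ages (not PNAD-Contínua records), the breakpoints `seq(5 , 60 , 5)` place ages 0 to 4 in category 1, ages 5 to 9 in category 2, and so on, with everyone aged 60 or older in category 13:

```r
# hypothetical ages, chosen only to show where the bin edges fall
toy_ages <- c(0 , 4 , 5 , 9 , 14 , 59 , 60 , 90)

# same expression as in the update() call, without the factor() wrapper
toy_categories <- 1 + findInterval(toy_ages , seq(5 , 60 , 5))

data.frame(age = toy_ages , category = toy_categories)
```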

Count the unweighted number of records in the survey sample, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
sum( weights( pnadc_design , "sampling" ) != 0 )

svyby( ~ one , ~ uf_name , pnadc_design , unwtd.count )
```

Count the weighted size of the generalizable population, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ one , pnadc_design )

svyby( ~ one , ~ uf_name , pnadc_design , svytotal )
```

Calculate the mean (average) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ vd4020n , pnadc_design , na.rm = TRUE )

svyby( ~ vd4020n , ~ uf_name , pnadc_design , svymean , na.rm = TRUE )
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ age_categories , pnadc_design )

svyby( ~ age_categories , ~ uf_name , pnadc_design , svymean )
```

Calculate the sum of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ vd4020n , pnadc_design , na.rm = TRUE )

svyby( ~ vd4020n , ~ uf_name , pnadc_design , svytotal , na.rm = TRUE )
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svytotal( ~ age_categories , pnadc_design )

svyby( ~ age_categories , ~ uf_name , pnadc_design , svytotal )
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
svyquantile( ~ vd4020n , pnadc_design , 0.5 , na.rm = TRUE )

svyby( 
	~ vd4020n , 
	~ uf_name , 
	pnadc_design , 
	svyquantile , 
	0.5 ,
	ci = TRUE , na.rm = TRUE
)
```

Estimate a ratio:
```{r results='hide', message=FALSE, warning=FALSE}
svyratio( 
	numerator = ~ ocup_c , 
	denominator = ~ pea_c , 
	pnadc_design ,
	na.rm = TRUE
)
```
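
For intuition about the point estimate, a survey-weighted ratio is the weighted total of the numerator divided by the weighted total of the denominator. The sketch below uses invented weights and indicators (every `toy_` name is hypothetical) and drops records missing either variable, which is meant to mimic `na.rm = TRUE`. It does not reproduce the replicate-based standard error that `svyratio()` reports:

```r
# hypothetical sampling weights and indicators for four persons
toy_w <- c(100 , 250 , 150 , 300)
toy_ocup_c <- c(1 , 0 , 1 , NA)   # employed
toy_pea_c <- c(1 , 1 , 1 , NA)    # in the labor force

# keep only records observed on both variables
keep <- !is.na(toy_ocup_c) & !is.na(toy_pea_c)

toy_ratio <-
  sum(toy_w[keep] * toy_ocup_c[keep]) / sum(toy_w[keep] * toy_pea_c[keep])

toy_ratio
```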

Restrict the survey design to employed persons:
```{r results='hide', message=FALSE, warning=FALSE}
sub_pnadc_design <- subset( pnadc_design , ocup_c == 1 )
```
Calculate the mean (average) of this subset:
```{r results='hide', message=FALSE, warning=FALSE}
svymean( ~ vd4020n , sub_pnadc_design , na.rm = TRUE )
```

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
this_result <- svymean( ~ vd4020n , pnadc_design , na.rm = TRUE )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
	svyby( 
		~ vd4020n , 
		~ uf_name , 
		pnadc_design , 
		svymean ,
		na.rm = TRUE 
	)
	
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
```

Calculate the (nominal) design degrees of freedom of any survey design object:
```{r results='hide', message=FALSE, warning=FALSE}
degf( pnadc_design )
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r results='hide', message=FALSE, warning=FALSE}
svyvar( ~ vd4020n , pnadc_design , na.rm = TRUE )
```

Include the complex sample design effect in the result for a specific statistic:
```{r results='hide', message=FALSE, warning=FALSE}
# SRS without replacement
svymean( ~ vd4020n , pnadc_design , na.rm = TRUE , deff = TRUE )

# SRS with replacement
svymean( ~ vd4020n , pnadc_design , na.rm = TRUE , deff = "replace" )
```

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See `?svyciprop` for alternatives:
```{r results='hide', message=FALSE, warning=FALSE}
svyciprop( ~ male , pnadc_design ,
	method = "likelihood" )
```

Perform a design-based t-test:
```{r results='hide', message=FALSE, warning=FALSE}
svyttest( vd4020n ~ male , pnadc_design )
```

Perform a chi-squared test of association for survey data:
```{r results='hide', message=FALSE, warning=FALSE}
svychisq( 
	~ male + age_categories , 
	pnadc_design 
)
```

Fit a survey-weighted generalized linear model:
```{r results='hide', message=FALSE, warning=FALSE}
glm_result <- 
	svyglm( 
		vd4020n ~ male + age_categories , 
		pnadc_design 
	)

summary( glm_result )
```

### Per Capita Income

Calculate the Gini coefficient with per capita income:

```{r}
# https://sidra.ibge.gov.br/tabela/7435#/n1/all/v/all/p/last%201/d/v10681%203,v10682%201/l/v,p,t/resultado

(
  pnadc_per_capita_gini <-
    svygini( ~ deflated_per_capita_income , pnadc_design , na.rm = TRUE)
)

# match 2022 per_capita gini coefficient
stopifnot(round(coef(pnadc_per_capita_gini) , 3) == 0.518)

# match 2022 per_capita gini coefficient of variation
stopifnot(round(cv(pnadc_per_capita_gini), 3) == 0.006) 
```
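
As a rough guide to what the point estimate measures, the sketch below computes a weighted Gini coefficient from the textbook trapezoid rule: one minus twice the area under the Lorenz curve. The helper `toy_gini()` and its inputs are hypothetical, and this is not necessarily the exact estimator inside `svygini()`, which also produces design-based standard errors:

```r
# toy_gini() is a hypothetical helper, not part of convey
toy_gini <- function(x , w = rep(1 , length(x))) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  p <- w / sum(w)                         # population share of each record
  lorenz <- cumsum(w * x) / sum(w * x)    # cumulative income share
  lorenz_lag <- c(0 , head(lorenz ,-1))   # cumulative share one record earlier
  1 - sum(p * (lorenz + lorenz_lag))      # one minus twice the trapezoid area
}

toy_gini(c(10 , 10 , 10 , 10))   # everyone holds the same amount
toy_gini(c(0 , 0 , 0 , 40))      # a single record holds everything
```

With only four records, the most unequal case yields 0.75 rather than 1, because the finite-sample maximum of this formula is (n - 1) / n.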

### Worker Earnings

Estimate the Gini coefficient with total earnings:

```{r}
# https://sidra.ibge.gov.br/tabela/7453#/n1/all/v/all/p/last%201/d/v10806%203,v10807%201/l/v,p,t/resultado

(pnadc_earnings_gini <-
   svygini( ~ deflated_labor_income , pnadc_design , na.rm = TRUE))

# match 2022 earnings gini coefficient
stopifnot(round(coef(pnadc_earnings_gini) , 3) == 0.486)

# match 2022 earnings gini coefficient of variation
stopifnot(round(cv(pnadc_earnings_gini), 3) == 0.007) 
```


```{r results='hide', echo=FALSE}
rm( pnadc_df ) ; gc()
```


## Survey of Consumer Finances (SCF)


The SCF studies net worth across the United States by asking respondents about both active and passive income, mortgages, pensions, credit card debt, even car leases.  Administered by the [Board of Governors of the Federal Reserve System](https://www.federalreserve.gov/) triennially since 1989, this complex sample survey generalizes to the civilian non-institutional population and comprehensively assesses household wealth.   

This section downloads, imports, and prepares the most current microdata for analysis, then reproduces some statistics and margin of error terms from the Federal Reserve.

This survey uses a multiply-imputed variance estimation technique described in the [2004 Codebook](https://www.federalreserve.gov/econres/files/2004_codebk2004.txt). Define a combining function specific to this dataset; most users do not need to study it carefully:

```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine <-
  function (results,
            variances,
            call = sys.call(),
            df.complete = Inf,
            ...) {
    m <- length(results)
    oldcall <- attr(results, "call")
    if (missing(variances)) {
      variances <- suppressWarnings(lapply(results, vcov))
      results <- lapply(results, coef)
    }
    vbar <- variances[[1]]
    cbar <- results[[1]]
    for (i in 2:m) {
      cbar <- cbar + results[[i]]
      # MODIFICATION:
      # vbar <- vbar + variances[[i]]
    }
    cbar <- cbar / m
    # MODIFICATION:
    # vbar <- vbar/m
    evar <- var(do.call("rbind", results))
    r <- (1 + 1 / m) * evar / vbar
    df <- (m - 1) * (1 + 1 / r) ^ 2
    if (is.matrix(df))
      df <- diag(df)
    if (is.finite(df.complete)) {
      dfobs <- ((df.complete + 1) / (df.complete + 3)) * df.complete *
        vbar / (vbar + evar)
      if (is.matrix(dfobs))
        dfobs <- diag(dfobs)
      df <- 1 / (1 / dfobs + 1 / df)
    }
    if (is.matrix(r))
      r <- diag(r)
    rval <- list(
      coefficients = cbar,
      variance = vbar + evar *
        (m + 1) / m,
      call = c(oldcall, call),
      nimp = m,
      df = df,
      missinfo = (r + 2 / (df + 3)) / (r + 1)
    )
    class(rval) <- "MIresult"
    rval
  }
```
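
The arithmetic at the core of `scf_MIcombine()` can be followed with a handful of numbers. Because the two `MODIFICATION` lines are commented out, `vbar` stays equal to the sampling variance from the first implicate instead of the average across implicates, while `evar` captures the variance between implicates. The values below are invented purely for illustration:

```r
# hypothetical point estimates from five implicates
toy_estimates <- c(10.2 , 9.8 , 10.1 , 10.4 , 9.5)

# hypothetical sampling variance taken from the first implicate only
toy_within <- 0.30

m <- length(toy_estimates)

toy_point <- mean(toy_estimates)     # plays the role of cbar
toy_between <- var(toy_estimates)    # plays the role of evar

# same expression as the `variance` element of the returned list
toy_total <- toy_within + toy_between * (m + 1) / m

c(estimate = toy_point , se = sqrt(toy_total))
```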


Define a function to download and import each Stata file:

```{r results='hide', message=FALSE, warning=FALSE}
library(haven)

scf_dta_import <-
  function(this_url) {
    this_tf <- tempfile()
    
    download.file(this_url , this_tf , mode = 'wb')
    
    this_tbl <- read_dta(this_tf)
    
    this_df <- data.frame(this_tbl)
    
    file.remove(this_tf)
    
    names(this_df) <- tolower(names(this_df))
    
    this_df
  }
```

Download and import the full, summary extract, and replicate weights tables:

```{r results='hide', message=FALSE, warning=FALSE}
scf_df <-
  scf_dta_import("https://www.federalreserve.gov/econres/files/scf2022s.zip")

ext_df <-
  scf_dta_import("https://www.federalreserve.gov/econres/files/scfp2022s.zip")

scf_rw_df <-
  scf_dta_import("https://www.federalreserve.gov/econres/files/scf2022rw1s.zip")
```

Confirm both the full public data and the summary extract contain five records per family:
```{r results='hide', message=FALSE, warning=FALSE}
stopifnot(nrow(scf_df) == nrow(scf_rw_df) * 5)
stopifnot(nrow(scf_df) == nrow(ext_df))
```

Confirm that the only columns shared across tables are the primary economic unit identifier (`yy1`) and the implicate identifier (`y1`):
```{r results='hide', message=FALSE, warning=FALSE}
stopifnot(all(sort(intersect(
  names(scf_df) , names(ext_df)
)) == c('y1' , 'yy1')))
stopifnot(all(sort(intersect(
  names(scf_df) , names(scf_rw_df)
)) == c('y1' , 'yy1')))
stopifnot(all(sort(intersect(
  names(ext_df) , names(scf_rw_df)
)) == c('y1' , 'yy1')))
```

Remove the implicate identifier from the replicate weights table, then add a column of fives for weighting:
```{r results='hide', message=FALSE, warning=FALSE}
scf_rw_df[, 'y1'] <- NULL

scf_df[, 'five'] <- 5
```

Construct a multiply-imputed, complex sample survey design:

Break the main table into five different implicates based on the final character of the column `y1`:
```{r results='hide', message=FALSE, warning=FALSE}
library(stringr)

s1_df <- scf_df[str_sub(scf_df[, 'y1'] ,-1 ,-1) == 1 ,]
s2_df <- scf_df[str_sub(scf_df[, 'y1'] ,-1 ,-1) == 2 ,]
s3_df <- scf_df[str_sub(scf_df[, 'y1'] ,-1 ,-1) == 3 ,]
s4_df <- scf_df[str_sub(scf_df[, 'y1'] ,-1 ,-1) == 4 ,]
s5_df <- scf_df[str_sub(scf_df[, 'y1'] ,-1 ,-1) == 5 ,]
```
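
The same split can be sketched on a toy table using only base R. This assumes `y1` is built as the family identifier `yy1` followed by a single implicate digit; the rows below are invented, not SCF records:

```r
# hypothetical identifiers: two families, five implicates each
toy_df <- data.frame(yy1 = rep(1:2 , each = 5))
toy_df$y1 <- toy_df$yy1 * 10 + rep(1:5 , times = 2)

# the final character of y1 identifies the implicate
toy_implicate <-
  substr(toy_df$y1 , nchar(toy_df$y1) , nchar(toy_df$y1))

# one data.frame per implicate, named "1" through "5"
toy_imp <- split(toy_df , toy_implicate)

sapply(toy_imp , nrow)
```

Each element of `toy_imp` holds one record per family, which is the structure the five `s*_df` tables are expected to have.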

Combine these into a single `list`, then merge each implicate with the summary extract:
```{r results='hide', message=FALSE, warning=FALSE}
scf_imp <- list(s1_df , s2_df , s3_df , s4_df , s5_df)

scf_list <- lapply(scf_imp , merge , ext_df)

```


```{r results='hide', echo=FALSE}
# variables to keep
scf_list <-
	lapply(
		scf_list ,
		function( w ) w[ c( 'yy1' , 'wgt' , 'networth' , 'hhsex' , 'married' , 'edcl' , 'five' , 'x8022' , 'income' , 'lf' ) ]
	)

gc()
```


Replace all missing values in the replicate weights table with zeroes, multiply the replicate weights by the multiplication factor, then only keep the unique identifier and the final (combined) replicate weights:
```{r results='hide', message=FALSE, warning=FALSE}
scf_rw_df[is.na(scf_rw_df)] <- 0

scf_rw_df[, paste0('wgt' , 1:999)] <-
  scf_rw_df[, paste0('wt1b' , 1:999)] * scf_rw_df[, paste0('mm' , 1:999)]

scf_rw_df <- scf_rw_df[, c('yy1' , paste0('wgt' , 1:999))]
```
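
A scaled-down version of the same three steps, with two replicates instead of 999 and invented values, shows how a missing entry ends up as a final replicate weight of zero:

```r
# hypothetical replicate weights table with only two replicates
toy_rw <- data.frame(
  yy1 = 1:3 ,
  wt1b1 = c(100 , NA , 50) ,
  wt1b2 = c(80 , 120 , NA) ,
  mm1 = c(1 , NA , 2) ,
  mm2 = c(2 , 1 , NA)
)

# missing values become zeroes
toy_rw[is.na(toy_rw)] <- 0

# final replicate weight = replicate weight * multiplication factor
toy_rw[, paste0('wgt' , 1:2)] <-
  toy_rw[, paste0('wt1b' , 1:2)] * toy_rw[, paste0('mm' , 1:2)]

toy_rw[, c('yy1' , paste0('wgt' , 1:2))]
```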

Sort both the five implicates and also the replicate weights table by the unique identifier:

```{r results='hide', message=FALSE, warning=FALSE}
scf_list <-
  lapply(scf_list , function(w)
    w[order(w[, 'yy1']) ,])

scf_rw_df <- scf_rw_df[order(scf_rw_df[, 'yy1']) ,]
```

Define the design:
```{r results='hide', message=FALSE, warning=FALSE}
library(survey)
library(mitools)

scf_design <-
  svrepdesign(
    weights = ~ wgt ,
    repweights = scf_rw_df[,-1] ,
    data = imputationList(scf_list) ,
    scale = 1 ,
    rscales = rep(1 / 998 , 999) ,
    mse = FALSE ,
    type = "other" ,
    combined.weights = TRUE
  )

```
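
To see what `scale = 1`, `rscales = rep(1 / 998 , 999)` and `mse = FALSE` imply, consider the usual replicate-weights variance: the scaled sum of squared deviations of the replicate estimates from their mean. The sketch below assumes that formula (see `?svrepdesign` for the package's own description) and uses simulated replicate estimates, not SCF results. With 999 replicates and a divisor of 998, the result coincides with the ordinary sample variance of the replicate estimates:

```r
set.seed(1)

# hypothetical estimates of one statistic, one per replicate weight
toy_thetas <- rnorm(999 , mean = 100 , sd = 5)

toy_scale <- 1
toy_rscales <- rep(1 / 998 , 999)

# mse = FALSE centers on the mean of the replicate estimates
toy_variance <-
  toy_scale * sum(toy_rscales * (toy_thetas - mean(toy_thetas)) ^ 2)

all.equal(toy_variance , var(toy_thetas))
```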


Run the `convey_prep()` function on the full design:

```{r results='hide', message=FALSE, warning=FALSE}
scf_design$designs <- lapply(scf_design$designs , convey_prep)
```

This example matches the "Table 4" tab's cell Y6 of the [Excel Based on Public Data](https://www.federalreserve.gov/econres/files/scf2022_tables_public_nominal_historical.xlsx):

```{r results='hide', message=FALSE, warning=FALSE}
mean_net_worth <-
  scf_MIcombine(with(scf_design , svymean(~ networth)))

stopifnot(round(coef(mean_net_worth) / 1000 , 1) == 1059.5)
```

This example comes within $500 of the standard error of mean net worth from Table 2 of the [Federal Reserve Bulletin](https://www.federalreserve.gov/publications/files/scf23.pdf#page=18), displaying the minor differences between the [Internal Data](https://www.federalreserve.gov/econres/files/scf2022_tables_internal_nominal_historical.xlsx) and [Public Data](https://www.federalreserve.gov/econres/files/scf2022_tables_public_nominal_historical.xlsx):
```{r results='hide', message=FALSE, warning=FALSE}
stopifnot(abs(23.2 - round(SE(mean_net_worth) / 1000 , 1)) < 0.5)
```

This example matches the "Table 4" tab's cell X6 of the [Excel Based on Public Data](https://www.federalreserve.gov/econres/files/scf2022_tables_public_nominal_historical.xlsx):

```{r results='hide', message=FALSE, warning=FALSE}
# compute quantile with all five implicates stacked (not the recommended technique)
fake_design <-
  svydesign(~ 1 , data = ext_df[c('networth' , 'wgt')] , weights = ~ wgt)

median_net_worth_incorrect_errors <-
  svyquantile(~ networth , fake_design , 0.5)

stopifnot(round(coef(median_net_worth_incorrect_errors) / 1000 , 2) == 192.7)
```


```{r results='hide', echo=FALSE}
rm( scf_rw_df , ext_df , s1_df , s2_df , s3_df , s4_df , s5_df , scf_imp , scf_list ) ; gc()
```


### Analysis Examples with the `survey` library

Add new columns to the data set:
```{r results='hide', message=FALSE, warning=FALSE}
scf_design <-
  update(
    scf_design ,
    
    hhsex = factor(
      hhsex ,
      levels = 1:2 ,
      labels = c("male" , "female")
    ) ,
    
    married = as.numeric(married == 1) ,
    
    edcl =
      factor(
        edcl ,
        levels = 1:4 ,
        labels =
          c(
            "less than high school" ,
            "high school or GED" ,
            "some college" ,
            "college degree"
          )
      )
    
  )
```


Count the unweighted number of records in the survey sample, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svyby(~ five , ~ five , unwtd.count)))

scf_MIcombine(with(scf_design , svyby(~ five , ~ hhsex , unwtd.count)))
```

Count the weighted size of the generalizable population, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svytotal(~ five)))

scf_MIcombine(with(scf_design ,
                   svyby(~ five , ~ hhsex , svytotal)))
```

Calculate the mean (average) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svymean(~ networth)))

scf_MIcombine(with(scf_design ,
                   svyby(~ networth , ~ hhsex , svymean)))
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svymean(~ edcl)))

scf_MIcombine(with(scf_design ,
                   svyby(~ edcl , ~ hhsex , svymean)))
```

Calculate the sum of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svytotal(~ networth)))

scf_MIcombine(with(scf_design ,
                   svyby(~ networth , ~ hhsex , svytotal)))
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svytotal(~ edcl)))

scf_MIcombine(with(scf_design ,
                   svyby(~ edcl , ~ hhsex , svytotal)))
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(
  scf_design ,
  svyquantile(~ networth ,
              0.5 , se = TRUE , interval.type = 'quantile')
))

scf_MIcombine(with(
  scf_design ,
  svyby(
    ~ networth ,
    ~ hhsex ,
    svyquantile ,
    0.5 ,
    se = TRUE ,
    interval.type = 'quantile' ,
    ci = TRUE
  )
))
```

Estimate a ratio:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(
  scf_design ,
  svyratio(numerator = ~ income , denominator = ~ networth)
))
```

Restrict the survey design to labor force participants:
```{r results='hide', message=FALSE, warning=FALSE}
sub_scf_design <- subset(scf_design , lf == 1)
```
Calculate the mean (average) of this subset:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(sub_scf_design , svymean(~ networth)))
```

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r results='hide', message=FALSE, warning=FALSE}
this_result <-
  scf_MIcombine(with(scf_design ,
                     svymean(~ networth)))

coef(this_result)
SE(this_result)
confint(this_result)
cv(this_result)

grouped_result <-
  scf_MIcombine(with(scf_design ,
                     svyby(~ networth , ~ hhsex , svymean)))

coef(grouped_result)
SE(grouped_result)
confint(grouped_result)
cv(grouped_result)
```

Calculate the degrees of freedom of any survey design object:
```{r results='hide', message=FALSE, warning=FALSE}
degf(scf_design$designs[[1]])
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r results='hide', message=FALSE, warning=FALSE}
scf_MIcombine(with(scf_design , svyvar(~ networth)))
```

Include the complex sample design effect in the result for a specific statistic:
```{r results='hide', message=FALSE, warning=FALSE}
# SRS without replacement
scf_MIcombine(with(scf_design ,
                   svymean(~ networth , deff = TRUE)))

# SRS with replacement
scf_MIcombine(with(scf_design ,
                   svymean(~ networth , deff = "replace")))
```


Fit a survey-weighted generalized linear model:
```{r results='hide', message=FALSE, warning=FALSE}
glm_result <-
  scf_MIcombine(with(scf_design ,
                     svyglm(networth ~ married + edcl)))

summary(glm_result)
```


### Family Net Worth

Calculate the Gini coefficient with family net worth:

```{r}
scf_MIcombine(with(scf_design , svygini(~ networth)))
```

### Family Income

Calculate the Gini coefficient with income:

```{r}
scf_MIcombine(with(scf_design , svygini(~ income)))
```


## Real World Examples

In 2006, [the Financial Times reported](https://www.ft.com/content/41470ec0-845b-11db-87e0-0000779e2340) on a team of researchers finding that "Personal wealth is distributed so unevenly across the world that the richest two per cent of adults own more than 50 per cent of the world's assets while the poorest half hold only 1 per cent of wealth."  Although the original publication presented a global estimate, we have chosen to reproduce this calculation using nationally representative surveys from both the United States and Brazil.  We reproduce this inequality statistic with a variety of surveys and levels of analysis to highlight how this software can be used not only to estimate a number but also to understand the _uncertainty_ around that number.

To understand the construction of each survey design object and respective variables of interest, please refer to [section 1.4](https://guilhermejacob.github.io/context/1.4-current-population-survey---annual-social-and-economic-supplement-cps-asec.html) for CPS-ASEC, [section 1.5](https://guilhermejacob.github.io/context/1.5-pesquisa-nacional-por-amostra-de-domic%C3%ADlios-cont%C3%ADnua-pnad-cont%C3%ADnua.html) for PNAD-Contínua, and [section 1.6](https://guilhermejacob.github.io/context/1.6-survey-of-consumer-finances-scf.html) for SCF.
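
Each example in this section reads two points off a Lorenz curve: the share of the total held by the poorest half, and one minus the share held by the poorest 98 percent. The sketch below illustrates that reading with a hypothetical helper, `toy_lorenz()`, and five invented records. It interpolates linearly between observed points, which may differ from the exact estimator inside `svylorenz()`, and it produces no standard errors:

```r
# toy_lorenz() is a hypothetical helper, not part of convey
toy_lorenz <- function(x , w , p) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  pop_share <- cumsum(w) / sum(w)              # cumulative population share
  total_share <- cumsum(w * x) / sum(w * x)    # cumulative share of the total
  # interpolate the curve at the requested population shares
  approx(c(0 , pop_share) , c(0 , total_share) , xout = p)$y
}

# invented amounts and equal weights
toy_x <- c(1 , 2 , 3 , 4 , 90)
toy_w <- rep(1 , 5)

toy_result <- toy_lorenz(toy_x , toy_w , c(0.5 , 0.98))

toy_result[1]        # share held by the poorest half
1 - toy_result[2]    # share held by the richest two percent
```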

### CPS-ASEC Household Income

```{r}
result <-
  svylorenz( ~ htotval ,
             cps_household_design ,
             quantile = c(0.5 , 0.98) ,
             plot = FALSE)
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```

### CPS-ASEC Family Income

```{r}
result <-
  svylorenz( ~ ftotval ,
             cps_family_design ,
             quantile = c(0.5 , 0.98) ,
             plot = FALSE)
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```


### CPS-ASEC Worker Earnings

```{r}
result <-
  svylorenz(~ pearnval ,
            cps_ftfy_worker_design ,
            quantile = c(0.5 , 0.98) ,
            plot = FALSE)
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```


### PNAD-Contínua Per Capita Income

```{r}
result <- svylorenz(
  ~ deflated_per_capita_income ,
  pnadc_design ,
  na.rm = TRUE ,
  quantile = c(0.5 , 0.98) ,
  plot = FALSE
)
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```

### PNAD-Contínua Worker Earnings

```{r}
result <- svylorenz(
  ~ deflated_labor_income ,
  pnadc_design ,
  na.rm = TRUE ,
  quantile = c(0.5 , 0.98) ,
  plot = FALSE
)
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```


### SCF Family Net Worth

```{r}
result <-
  scf_MIcombine(with(scf_design , svylorenz(
    ~ networth , quantile = c(0.5 , 0.98) , plot = FALSE
  )))
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```


### SCF Family Income

```{r}
result <-
  scf_MIcombine(with(scf_design , svylorenz(
    ~ income , quantile = c(0.5 , 0.98) , plot = FALSE
  )))
```

The richest two percent:
```{r}
1 - coef(result)[2]
1 - confint(result)[2, ]
```

The poorest half:
```{r}
coef(result)[1]
confint(result)[1, ]
```