************************************************************
************************************************************
***                                                      ***
***        Do-file for working with pairfam data         ***
***                      WEIGHTING                       ***
***                ANCHOR DATA WAVES 1-12                ***
***                     Release 12.0                     ***
***                                                      ***
***                       May 2021                       ***
***                                                      ***
***                Author: Martin Wetzel                 ***
***                                                      ***
************************************************************
************************************************************

/*
With release 12.0, new and improved weights are distributed.
The weighting concept comprises two types of weights:

  * Design weights (d$weight), which adjust for differences between
    the population and the gross sample
  * Combined calibration weights (cd$weight), which already include
    the design weight and adjust characteristics of the sample to
    characteristics of the population (gender, federal state,
    education level, migration background, settlement structure bik,
    family status, no. of children in household).

Both types of weights are available for four subsamples:

  * c/dweight:  pairfam base
  * c/d1weight: pairfam base and DemoDiff
  * c/d2weight: pairfam base, DemoDiff, and refreshment sample (w11)
  * c/d3weight: refreshment sample (w11)

This do-file shows some examples for using weights.

The use of the design weights is more often appropriate than not.
They are needed in particular if users analyze data pooled over the
cohorts and over samples (pairfam base, DemoDiff, and refreshment
sample of wave 11).

The pairfam study is subject to typical patterns of selective
participation at the first observation and of panel attrition (see
Technical Paper #1). The combined calibration weights are one way of
reducing the resulting bias. As the extent of the selection bias also
depends on the particular research question, we recommend running
analyses with and without cd$weights to evaluate the direction and
extent of selection.

For further information on weights please refer to the pairfam Data
Manual, Release 12.0, section "Weights" and to the pairfam Technical
Paper No. 17 "New weights for the pairfam anchor data" (Wetzel,
Schumann, & Schmiedeberg 2021; available on the pairfam website:
https://www.pairfam.de/dokumentation/technical-papers/).

If you need further help please do not hesitate to contact us:
support@pairfam.de

Structure of the Quick-Start:
***** I)  Using weights for cross-sectional analyses *****
***** II) Using weights for long-format data         *****
*/

***************************************************************************
***  PRELIMINARIES                                                      ***
***************************************************************************

clear all
set more off                                // tells Stata not to pause for --more-- messages
set maxvar 15000                            // increases maximal number of variables
global inpath "insert your datapath here"   // directory of original data
global oupath "insert your datapath here"   // working directory

******************************************
***  Using weights                     ***
******************************************

***** I) Analyses for the full sample of wave 1 or wave 11 *****

*** Load data
cd "$inpath"
use id sample nkidsbioalv nkids cohort relstat yeduc sex_gen dweight d1weight cdweight cd1weight ///
    using anchor1, clear                    // load Anchor data wave 1
append using anchor1_DD, keep(id sample nkidsbioalv nkids cohort ///
    relstat yeduc sex_gen dweight d1weight cdweight cd1weight)
label language de                           // use German labels

*** Remember: c/dweight is applicable for the pairfam base sample only
*** while c/d1weight includes also the DemoDiff sample

*# Example: Average number of kids
replace nkidsbioalv=nkids if nkidsbioalv==-7
mean nkidsbioalv, over(cohort)                      // unweighted (N=13,891)
mean nkidsbioalv [pweight=d1weight], over(cohort)   // design weight (N=13,891)
mean nkidsbioalv [pweight=dweight], over(cohort)    // design weight (N=12,402) - pairfam only

* Note: The impact of design weights on cohort-specific analysis
* at wave 1 is small.
* However, for combined analyses over cohorts it is important.
mean nkidsbioalv
mean nkidsbioalv [pweight=d1weight]                 // design weight

* Now control for selective participation using the combined calibration weight:
mean nkidsbioalv [pweight=cd1weight], over(cohort)  // combined calibrated design weight (N=13,891)

* Note: Calibrated design weights correct for selective participation.
* Because the number of children (strictly speaking, the number living in the
* household, "nkidsliv") is controlled for in the weights, the distribution of
* this variable is very close to that of the population (Mikrozensus).

* Alternatively, use the svy command:
svyset [pweight=cd1weight]                          // combined calibration weight
svy: mean nkidsbioalv, over(cohort)                 // weighted (using svy)

*# Example: Distribution of number of kids
tab nkidsbioalv if cohort==3                        // unweighted
tab nkidsbioalv [iweight=cd1weight] if cohort==3    // combined calibration weight

svyset [pweight=cd1weight]
proportion nkidsbioalv if cohort==3                 // unweighted
svy, subpop(if cohort==3): proportion nkidsbioalv   // weighted (exact case selection, df correct)
svy: proportion nkidsbioalv if cohort==3            // weighted (sloppy case selection, df wrong)
svy, subpop(if cohort==3): tab nkidsbioalv          // weighted (if you need only proportions)

*# Example: Regression on number of kids
gen woman = sex_gen==2
recode yeduc -7/0=.
reg nkidsbioalv woman yeduc if cohort==3                        // unweighted
reg nkidsbioalv woman yeduc if cohort==3 [pweight=cd1weight]    // weighted
svy, subpop(if cohort==3): reg nkidsbioalv woman yeduc          // weighted

* Instead of running the analyses separately by cohort, cohort (and its interactions) can be included.
reg nkidsbioalv (i.woman c.yeduc)##cohort [pweight=cd1weight]   // weighted
svy: reg nkidsbioalv (i.woman c.yeduc)##cohort                  // weighted

*# Example: Using the wave 11 refreshment sample
use id sample nkidsbioalv nkids cohort relstat yeduc sex_gen d*weight cd*weight ///
    using anchor11, clear                             // load anchor data wave 11
mean nkidsbioalv [pweight=d2weight], over(cohort)     // design weight all samples (N=9,435)
mean nkidsbioalv [pweight=d3weight], over(cohort)     // design weight only refreshment (N=5,021)
mean nkidsbioalv [pweight=cd2weight], over(cohort)    // calibration weight all samples (N=9,435)
mean nkidsbioalv [pweight=cd3weight], over(cohort)    // calibration weight only refreshment (N=5,021)

***** II) Using weights for long-format data *****

*** Data preparation: Extracting weight variables and pooling waves
use id wave cohort sample d*weight cd*weight relstat age mardur sat6 nkidsbioalv using anchor1, clear
* Append waves 2 to 11 (forvalues replaces the outdated "for num" prefix)
forvalues w = 2/11 {
    quietly append using anchor`w'.dta, keep(id wave cohort sample d*weight cd*weight ///
        relstat age mardur sat6 nkidsbioalv)
}

* Some variables needed below
mvdecode _all, mv(-1=.a\-2=.b\-3=.c\-4=.d\-5=.e\-6=.f\-7/-11=.g)    // Define missings

*# Example: Mean number of children over waves
sort id wave
tabdisp id wave in 1/110, cellvar(nkidsbioalv)    // first 110 rows (10 persons if each has 11 observations)
svyset [pweight=cd2weight]                        // combined calibration weight of all subsamples
svy, subpop(if cohort == 2): mean nkidsbioalv, over(wave)
tabstat nkidsbioalv [aweight=cd2weight] if cohort == 2, s(mean semean n) by(wave)

*# Example: Fixed-Effects Panel Regression of Marriage on Life Satisfaction
* Life satisfaction
rename sat6 happy                                 // rename sat6 to happy
tab happy, missing

* Marital status indicator (0=never married, 1=married, 2=divorced or widowed)
recode relstat 1/3=0 4/5=1 6/11=2, gen(marry)
lab var marry "Marriage"

* Sample definition
* Exclude person-years with missing values on the outcome or the event variable
drop if mi(happy)
drop if mi(marry)

* Data preparation: Only persons who were never
* married when first observed
bysort id (wave): gen pynr = _n               // person-year ID (within person)
gen help=0
replace help=1 if marry>0 & pynr==1
bysort id (wave): replace help = sum(help)    // ==1 for all pys of persons already (or previously) married at first observation
keep if help==0
drop help

* All person-years after first marriage are excluded
gen help=0
replace help=1 if marry>1                     // flag pys after first marriage (divorced, widowed)
bysort id (wave): replace help=sum(help)      // flag all following pys (could be a second marriage)
keep if help==0                               // all pys after first marriage are dropped
drop help pynr
bysort id (wave): gen pynr = _n

* Restricting the estimation sample to those with at least 2 observations
bysort id: gen pycount = _N                   // # of person-years (within person)
tab pycount if pynr==1                        // Length of the panels
keep if pycount>1

xtset id wave                                 // Information on panel data structure

* Example: Effect of marriage (event) on life satisfaction (outcome)
recode mardur .=0

* Fixed effects & areg regression without using weights
xtreg happy i.marry age, fe vce(cluster id)
est store FE1
areg happy i.marry age, absorb(id) vce(cluster id)
est store AR1

* Areg regression using panel weights
areg happy i.marry age [pw=d2weight], absorb(id) vce(cluster id)
est store DW1
areg happy i.marry age [pw=cd2weight], absorb(id) vce(cluster id)
est store CW1

estimates table FE1 AR1 DW1 CW1, b(%7.3f) se t stfmt(%6.0f) stats(N N_clust)
esttab FE1 AR1 DW1 CW1, mtitle                // requires the user-written estout package (ssc install estout)

* We see that FE1 and AR1 produce the same point estimates with small differences in the t statistics.
* With design weights DW1: Benefit of marriage and losses over age are slightly higher
* --> Due to the sampling design (cohort sizes & East-West), the unweighted sample underestimates those effects.
* With calibrated design weights CW1: Benefits of marriage and losses over age are actually lower.
* --> Selective participation seems to overrepresent those with stronger gains from marriage (happy couples)
*     and to underrepresent those with stable life satisfaction over age
* [Caution: This is only an example and the results are highly overinterpreted!]
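
* Illustrative addition (a sketch, not part of the original pairfam do-file):
* The header recommends running analyses with and without weights. Assuming the
* four models above are still stored under the names FE1, AR1, DW1, and CW1, the
* loop below prints the marriage coefficient (1.marry) and its standard error
* for each model, which makes the direction and size of the weighting effect
* easier to compare at a glance.
foreach m in FE1 AR1 DW1 CW1 {
    quietly estimates restore `m'
    display "`m': b[1.marry] = " %7.3f _b[1.marry] "   se = " %7.3f _se[1.marry]
}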