************************************************************
************************************************************
***														 ***
***		    Do-file for working with pairfam data   	 ***
***        	 		   WEIGHTING			 			 ***
***           	  ANCHOR DATA WAVES 1-12           		 ***
***                   Release 12.0			             ***
***	  													 ***
***					    May 2021	                 	 ***
***														 ***
***				 Author: Martin Wetzel					 ***
***														 ***			
************************************************************
************************************************************

/*
With release 12.0, new and improved weights are distributed. The weighting concept 
comprises two types of weights: 
	* Design weights (d$weight), which adjust for differences between the population 
			and the gross sample
	* Combined calibration weights (cd$weight), which already include the design 
			weight and adjust the sample to the population distribution of selected
			characteristics (gender, federal state, education level, migration 
			background, settlement structure (BIK), family status, no. of children
			in household).

Both types of weights are available for four subsamples: 
	* c/dweight: pairfam base
	* c/d1weight: pairfam base and DemoDiff
	* c/d2weight: pairfam base, DemoDiff, and refreshment sample (w11)
	* c/d3weight: refreshment sample (w11)
			
This do-file shows some examples for using weights. 

Using the design weights is appropriate in most analyses. They are particularly
needed when users analyze data pooled over cohorts and over samples (pairfam base, 
DemoDiff, and refreshment sample of wave 11).

The pairfam study is subject to typical patterns of selective participation at the
first observation and of panel attrition (see Technical Paper #1). The combined
calibration weights are one way of reducing the resulting bias. As the extent of the 
selection bias also depends on the particular research question, we recommend running
analyses with and without cd$weights to evaluate the direction and extent of selection.

For further information on weights please refer to the pairfam Data Manual, 
Release 12.0, section "Weights" and to the pairfam Technical Paper No. 17 "New weights 
for the pairfam anchor data" (Wetzel, Schumann, & Schmiedeberg 2021; available on the
pairfam website: https://www.pairfam.de/dokumentation/technical-papers/). 

If you need further help please do not hesitate to contact us:
support@pairfam.de

Structure of the Quick-Start:
***** I) Using weights for cross-sectional analyses *****
***** II) Using weights for long-format data  *****
*/

***************************************************************************
***                     PRELIMINARIES                                   ***
***************************************************************************

clear all
set more off		// tells Stata not to pause for --more-- messages
set maxvar 15000	// raises the maximum number of variables (Stata/SE and MP only)


global inpath "insert your datapath here"  // directory of original data
global oupath "insert your datapath here"  // working directory	


******************************************
***			 Using weights             ***
******************************************


***** I) Using weights for cross-sectional analyses (full sample of wave 1 or wave 11) *****

*** Load data
cd "$inpath"
use id sample nkidsbioalv nkids cohort relstat yeduc sex_gen dweight d1weight cdweight cd1weight ///	
		using anchor1, clear 				// load Anchor data wave 1
append using anchor1_DD, keep(id sample nkidsbioalv nkids cohort ///
			relstat yeduc sex_gen dweight d1weight cdweight cd1weight)

label language de									// use German labels

*** Remember: c/dweight is applicable for the pairfam base sample only 
***		while c/d1weight includes also the DemoDiff sample 

*# Example: Average number of kids 
replace nkidsbioalv=nkids if nkidsbioalv==-7
mean nkidsbioalv, over(cohort)                     // unweighted	(N=13,891)
mean nkidsbioalv [pweight=d1weight], over(cohort)  // design weight (N=13,891)
mean nkidsbioalv [pweight=dweight], over(cohort)   // design weight (N=12,402) - pairfam only
*		Note: The impact of design weights on cohort-specific analysis at wave 1 is small.
*			  However, for combined analyses over cohorts it is important.
mean nkidsbioalv
mean nkidsbioalv [pweight=d1weight]					// design weight 

* Now control for selective participation using the combined calibration weight:
mean nkidsbioalv [pweight=cd1weight], over(cohort) // combined calibrated design weight (N=13,891)
*		Note: Calibrated design weights correct for selective participation.
*			  Because the number of children (more precisely, the number living in the 
*			  household, "nkidsliv") is used in the calibration, the weighted distribution 
*			  of this variable is very close to that of the population (Mikrozensus).
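
* Sketch (not part of the original examples): inspect the weights before using them.
* It shows their distribution, counts cases without a calibration weight, and computes 
* Kish's approximate design effect due to weighting, n*sum(w^2)/(sum(w))^2.
* The names w2_chk and sumw2 are hypothetical helpers introduced only for this check.
summarize d1weight cd1weight, detail               // spread and extreme values of the weights
count if missing(cd1weight)                        // cases without a calibration weight
generate double w2_chk = cd1weight^2               // squared weight (helper variable)
quietly summarize w2_chk
scalar sumw2 = r(sum)                              // sum of squared weights
quietly summarize cd1weight
display "Kish design effect (cd1weight): " %5.3f r(N)*sumw2/r(sum)^2
drop w2_chk
scalar drop sumw2
* 		Note: A value close to 1 means the weights cost little precision; larger values
*			  indicate a smaller effective sample size.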

* Alternatively, use the svy command:
svyset [pweight=cd1weight]                         // combined calibration weight 
svy: mean nkidsbioalv, over(cohort)                // weighted (using svy)

*# Example: Distribution of number of kids 
tab nkidsbioalv if cohort==3					   // unweighted	
tab nkidsbioalv [iweight=cd1weight] if cohort==3   // combined calibration weight	

svyset [pweight=cd1weight] 
proportion nkidsbioalv if cohort==3                // unweighted
svy, subpop(if cohort==3): proportion nkidsbioalv  // weighted (exact case selection, df correct)
svy: proportion nkidsbioalv if cohort==3           // weighted (sloppy case selection, df wrong)
svy, subpop(if cohort==3): tab nkidsbioalv		   // weighted (if you need only proportions)

*# Example: Regression on number of kids
gen woman = sex_gen==2
recode yeduc -7/0=.
reg nkidsbioalv woman yeduc if cohort==3                 		// unweighted
reg nkidsbioalv woman yeduc if cohort==3 [pweight=cd1weight]    // weighted
svy, subpop(if cohort==3): reg nkidsbioalv woman yeduc   		// weighted
* Instead of running the analyses separately by cohort, cohort (and its interactions) can be included as a covariate.
reg nkidsbioalv (i.woman c.yeduc)##i.cohort [pweight=cd1weight]    	// weighted
svy: reg nkidsbioalv (i.woman c.yeduc)##i.cohort  					// weighted
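
* Sketch (not part of the original examples): the header recommends running analyses with 
* and without the calibration weights. One way to do so is to store the unweighted, 
* design-weighted, and calibrated models and tabulate them side by side. 
* The model names U1, D1, and C1 are arbitrary.
quietly reg nkidsbioalv woman yeduc if cohort==3                        // unweighted
est store U1
quietly reg nkidsbioalv woman yeduc if cohort==3 [pweight=d1weight]     // design weight
est store D1
quietly reg nkidsbioalv woman yeduc if cohort==3 [pweight=cd1weight]    // combined calibration weight
est store C1
estimates table U1 D1 C1, b(%7.3f) se stats(N)
* 		Note: Coefficients that shift noticeably between U1 and C1 point to variables whose
*			  estimates are sensitive to selective participation.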

*# Example: Using the wave 11 refreshment sample
use id sample nkidsbioalv nkids cohort relstat yeduc sex_gen d*weight cd*weight ///	
		using anchor11, clear 				// load anchor data wave 11

mean nkidsbioalv [pweight=d2weight], over(cohort)   // design weight all samples (N=9,435)
mean nkidsbioalv [pweight=d3weight], over(cohort) 	// design weight only refreshment (N=5,021)

mean nkidsbioalv [pweight=cd2weight], over(cohort)  // calibration weight all samples (N=9,435)
mean nkidsbioalv [pweight=cd3weight], over(cohort) 	// calibration weight only refreshment (N=5,021)
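
* Sketch (not part of the original examples): check which subsamples each weight covers.
* This assumes that a weight is missing for cases outside its subsample, as the differing
* N's above suggest; the table reports the number of non-missing weights per subsample.
tabstat d2weight d3weight cd2weight cd3weight, by(sample) statistics(n mean)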




***** II) Using weights for long-format data *****

*** Data Preparation: Extracting weight variables and pooling waves 

use id wave cohort sample d*weight cd*weight relstat age mardur sat6 nkidsbioalv using anchor1, clear
forvalues w = 2/11 {
	quietly append using anchor`w'.dta, keep(id wave cohort sample d*weight cd*weight ///
											 relstat age mardur sat6 nkidsbioalv)
}

* Recode the negative missing codes to extended missing values (.a-.g)
mvdecode _all, mv(-1=.a\-2=.b\-3=.c\-4=.d\-5=.e\-6=.f\-7/-11=.g)    	// Define missings


*# Example: Mean number of children across waves 
sort id wave			
tabdisp id wave in 1/110, cellvar(nkidsbioalv)			// first 110 person-years (10 persons if each has 11 observations)

svyset [pweight=cd2weight]                         		// combined calibration weight of all subsamples
svy, sub(if cohort == 2): mean nkidsbioalv , over(wave) 
tabstat nkidsbioalv [aweight=cd2weight]  if cohort == 2, s(mean semean n) by(wave)


*# Example: Fixed-Effects Panel Regression of Life Satisfaction on Marriage

* Life satisfaction
rename sat6 happy							// rename sat6 to happy
tab happy, missing

* Marital status indicator (0=never-married 1=married 2=divorced, widowed) 
recode relstat 1/3=0 4/5=1 6/11=2, gen(marry)
lab var marry "Marriage"


* Sample Definition             
* Exclude person-years with missing values on the outcome or the event variable 
drop if mi(happy)
drop if mi(marry)

* Data preparation: Only persons who were never married when first observed 
bysort id (wave): gen pynr    = _n   		// person-year ID (within person)
gen     help=0
replace help=1 if marry>0 & pynr==1       
bysort id (wave): replace help = sum(help) 	// ==1 for all pys of persons who were not never-married when first observed
keep if help==0   
drop   help  

* All person-years from the end of the first marriage onward are excluded 
gen     help=0
replace help=1 if marry>1                  	// flag pys in which the first marriage has ended (divorced, widowed)
bysort id (wave): replace help=sum(help)   	// flag all following pys (could be a second marriage)
keep if help==0                            	// all pys from the end of the first marriage onward are dropped
drop   help pynr
bysort id (wave): gen pynr = _n 

* Restricting the estimation sample to those with at least 2 observations
bysort id:        gen pycount = _N   		// # of person-years (within person)
tab pycount if pynr==1               		// Length of the panels
keep if pycount>1

xtset id wave                        		// Information on panel data structure
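
* Sketch (not part of the original examples): check whether the calibration weight varies 
* within persons. xtreg, fe accepts weights only if they are constant within the panel 
* unit, which is one reason to fall back on areg when weighting.
* The name wsd_chk is a hypothetical helper variable introduced only for this check.
bysort id (wave): egen double wsd_chk = sd(cd2weight)   // within-person SD of the weight
count if wsd_chk > 0 & !missing(wsd_chk) & pynr==1      // persons whose weight changes over waves
drop wsd_chk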


* Example: Effect of marriage (event) on life satisfaction (outcome)
* Marriage duration: set "does not apply" (-3, recoded to .c by mvdecode above) to 0
recode mardur .c=0

* Fixed effects & areg regression without using weights
xtreg happy i.marry age, fe vce(cluster id)  
est store FE1

areg  happy i.marry age, absorb(id) vce(cluster id) 
est store AR1

* areg regression using weights (xtreg, fe requires weights that are constant within 
* persons; areg has no such restriction)
areg  happy i.marry age [pw=d2weight], absorb(id) vce(cluster id)   
est store DW1 

areg  happy i.marry age [pw=cd2weight], absorb(id) vce(cluster id)
est store CW1 
estimates table FE1 AR1 DW1 CW1, b(%7.3f) se t stfmt(%6.0f) stats(N N_clust) 
esttab FE1 AR1 DW1 CW1, mtitle										// requires the user-written estout package (ssc install estout)
* FE1 and AR1 produce the same point estimates, with small differences in the t-statistics.
*	With design weights (DW1): the benefit of marriage and the losses over age are slightly higher. 
*	--> Due to the sampling design (cohort sizes & East-West), the unweighted sample underestimates those effects.
*	With calibrated design weights (CW1): the benefit of marriage and the losses over age are actually lower.
*	--> Selective participation seems to overrepresent those with stronger gains from marriage (happy couples)
*								  and to underrepresent those with stable life satisfaction over age.
*	[Caution: This is only an example; the results are heavily overinterpreted!]