************************************************************
************************************************************
***                                                      ***
***        Do-file for working with pairfam data         ***
***                  SAMPLE DEFINITION                   ***
***                     ANCHOR DATA                      ***
***                     Release 13.0                     ***
***                                                      ***
***                       May 2022                       ***
***                                                      ***
***                Author: Michel Herzig                 ***
***                                                      ***
************************************************************
************************************************************

/*
This do-file presents some suggestions for defining samples of analysis in a
longitudinal perspective (anchor data). It uses waves 1 to 11; for the handling
of waves 12 and 13, see the Data Manual.

If you need further help please do not hesitate to contact:
support@pairfam.de
*/

*****************************************************************************
***                            PRELIMINARIES                              ***
*****************************************************************************

clear all
set more off                        // tells Stata not to pause for --more-- messages
set maxvar 15000                    // increases maximal number of variables
global inpath `""Insert path""'     // directory of original data
global outpath `""Insert path""'    // saving directory

*****************************************************************************
***                 Loading the anchor data (wave 1)                      ***
*****************************************************************************

// load anchor data wave 1 (only necessary variables)
cd $inpath
use id wave sample demodiff original_sex sex_gen cohort int1 int2 nat1 nat2 east cob using anchor1, clear
label language de                   // use German labels
*label language en                  // use English labels (optional)

*****************************************************************************
* Version 1: Create longitudinal data set                                   *
*            pairfam + demodiff anchors + refreshment                       *
*            (reduced number of variables)                                  *
*****************************************************************************

* Looping syntax for combining the data sets
// create panel dataset (long form) & suppress label notices (via quietly)
quietly: for num 2/11: append using anchorX.dta, keep(id wave sample demodiff original_sex sex_gen cohort nat1 nat2 east cob ethni)

// add DemoDiff (DD) sample
quietly: append using anchor1_DD.dta, keep(id wave sample demodiff original_sex sex_gen cohort nat1 nat2 east cob ethni)
tab wave sample, miss               // recognize that DD wave 1 corresponds to years 2009/10!
recode wave (1=2) if sample==2      // therefore set starting year of DD to pairfam wave 2 (2009/10)
tab wave sample, miss               // check result

des, short                          // show basic information for new data set (N, N of vars)
sort id wave                        // sort the observations
bysort id (wave): gen pynr = _n     // flag person years
lab var pynr "person years"

cd $outpath
save anchorall.dta, replace         // save longitudinal combined data set (optional)

*****************************************************************************
* Version 2: Create longitudinal data set                                   *
*            pairfam anchors & refreshment                                  *
*            (full number of variables)                                     *
*****************************************************************************

cd $inpath                          // the macro already contains the quotes around the path
use anchor1, clear                  // load anchor data wave 1

* Alternative looping syntax for combining the data sets
forvalues x = 2/11 {
    quietly: append using anchor`x'.dta
}

tab wave sample, miss               // from wave 3 on DemoDiff is included in anchor dataset
drop if sample == 2                 // drop demodiff anchors, but keep refreshment (in W11)

des, short                          // show basic information for new data set (N, N of vars)
sort id wave                        // sort the observations

*cd $outpath
*save anchorpairfam.dta, replace    // save longitudinal pairfam data set (optional)

*****************************************************************************
* Defining (some) analysis samples                                          *
*****************************************************************************

/*
Hint: The following sample selections cannot be executed in a row, as the
dataset in memory changes after keep/drop commands.
There are 2 ways to deal with changing sample definitions:
  a) save the longitudinal/appended data set on disk and load it before
     sample selection (recommended)
  b) Use: "preserve" before sample selection; "restore" to undo selection
     commands (must be executed at once in do-files; alternatively use the
     command window)
*/

*****************************************************************************
* Version 1.1: Create an unbalanced sample:                                 *
*              females from cohort 1                                        *
*              participated at least once                                   *
*****************************************************************************

cd $outpath
use anchorall.dta, clear            // load longitudinal basic data set (reduced to necessary variables!)

tab wave original_sex               // distribution of cross-sectional sex information (original_sex)
tab original_sex sex_gen            // distribution of "corrected" (sex_gen) & cross-sectional sex information
                                    // recommendation: use *_gen vars to minimize measurement error

keep if sex_gen==2                  // select only women
keep if cohort==1                   // select only the youngest cohort
tab wave                            // case numbers per wave: only women from cohort 1 (= unbalanced)

*****************************************************************************
* Version 1.2: Create a balanced sample:                                    *
*              females from cohort 1,                                       *
*              participated all waves                                       *
*****************************************************************************

cd $outpath
use anchorall.dta, clear            // load longitudinal basic data set

tab sex_gen                         // note that 2 persons changed their sex within the observation period!
tab original_sex wave if sex_gen==-4    // check wave specific sex of those persons

/*
How to deal with these persons depends on your research question.
You could either replace "sex_gen" with the wave-specific sex information
(original_sex) or decide to drop them
*/

keep if sex_gen==2                  // select only women
keep if cohort==1                   // select only the youngest cohort

bysort id: gen pycount=_N           // counter: N of person-years (rows in data) per person
lab var pycount "number of participations"
tab pycount wave                    // distribution over waves

keep if pycount == 11               // select those who participated in all waves (= balanced)

*****************************************************************************
** Version 2.1: Selection over characteristics collected only once          *
**              (stable in the data set)                                    *
**              e.g. migration background, language skills                  *
*****************************************************************************

cd $outpath
use anchorall.dta, clear            // load longitudinal basic data set

* Drop all persons with very low German language skills (speaking & understanding)
tab int1 wave, m                    // int1 & int2 available in wave 1 only

// fill all person years with wave 1 information about very low German language skills
bysort id (wave): replace int1=1 if int1[_n-1]==1
bysort id: egen inttest = min(int1)
bysort id (wave): replace int2=1 if int2[_n-1]==1
tab int1 wave, m                    // show/compare results

drop if int1==1 | int2==1           // drop all persons with low/bad German language skills

* Restrict sample to German citizens without migration background (i.e.
* with German parents)
gen german = nat1==1 & ethni==1     // citizenship is a time-constant measure in pairfam
tab german if pynr==1
keep if german==1                   // select Germans

tab cob if pynr==1                  // control result 1: some anchors not born in Germany
tab nat2 if pynr==1                 // control result 2: some anchors with 2nd citizenship

* Very strict classification of Germans: anchor born in Germany & no 2nd citizenship
keep if inlist(cob,1,2) & nat2==-3  // use of (time-constant) gen vars => no further recoding

*****************************************************************************
* Version 2.2: Selection over time-varying characteristics                  *
*                                                                           *
*              e.g. homosexuality, residence                                *
*****************************************************************************

* Sample: Persons residing in East Germany at the start of the pairfam panel (DemoDiff starts in wave 2!)
cd $outpath
use anchorall.dta, clear            // load longitudinal basic data set

* drop if east==0                   // only deletes person years

gen starteast = east==1 if wave==1              // pairfam anchors living in East Germany in wave 1
replace starteast = 1 if demodiff==1 & wave==2  // demodiff anchors (living in East Germany at start by definition)
tab wave starteast, m

// mark all person years as East Germany
bysort id (wave): replace starteast = starteast[_n-1] if starteast[_n-1]!=.
tab wave starteast, m
tab starteast east                  // tab time-varying vs. fixed residence information

keep if starteast == 1              // keep relevant anchors
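
*****************************************************************************
* Optional sketch: trying out a selection with preserve/restore             *
*****************************************************************************

* The lines below are an illustration only and not part of the original
* suggestions. They sketch option b) from the hint above, assuming that
* anchorall.dta was saved in $outpath as in Version 1. The selection itself
* (women from cohort 1) is just an example; any keep/drop commands could
* stand in its place.
* Note: in a do-file, preserve and restore have to be run in one go.

cd $outpath
use anchorall.dta, clear            // load longitudinal basic data set

preserve                            // remember the full data set
    keep if sex_gen==2              // example selection: only women
    keep if cohort==1               // example selection: only the youngest cohort
    tab wave                        // case numbers per wave for the selected sample
restore                             // undo the selection, full data set is back in memory

tab wave sample, miss               // check: all samples and waves are available again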