I am a graduate student working on my thesis. I'm studying how the quality of school infrastructure affects student performance (test scores, attendance) in rural Angola. I have data collected in two phases (baseline and endline). I want to perform a diff-in-diff, but before doing that I need to organize my dataset properly. I have two datasets I want to merge (the variable the two datasets have in common is school_id):
- one dataset in long format, 59,000 obs, with student-level characteristics:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long school_id int student_id float gender_S long age_S
1 35 0 .
1 23 0 .
1 27 0 .
1 30 0 .
1 2 0 .
1 24 0 .
1 24 0 .
1 2 0 .
1 43 0 .
1 42 0 .
1 18 0 .
1 9 0 .
1 29 0 .
1 13 0 .
1 7 0 .
1 23 0 .
1 19 0 .
1 7 0 .
1 42 0 .
1 8 0 .
1 . 0 .
1 36 0 .
1 16 0 .
1 36 0 .
1 12 0 .
1 16 0 .
1 5 0 .
1 30 0 .
1 11 0 .
1 26 0 .
1 40 0 .
1 . 0 12
1 27 0 .
1 7 0 .
1 . 0 .
1 . 0 .
1 14 0 .
1 . 0 .
1 25 0 .
1 37 0 .
1 34 0 .
1 23 0 .
1 22 0 .
1 16 0 .
1 24 0 .
1 2 0 .
1 7 0 .
1 44 0 .
1 26 0 .
1 16 0 .
1 10 0 .
1 3 0 .
1 19 0 .
1 6 0 .
1 36 0 .
1 40 0 .
1 13 0 .
1 4 0 .
1 48 0 .
1 36 0 .
1 41 0 .
1 33 0 .
1 47 0 .
1 28 0 .
1 22 0 .
1 42 0 .
1 11 0 .
1 13 0 .
1 22 0 .
1 7 0 .
1 13 0 .
1 10 0 .
1 8 0 .
1 12 0 .
1 39 0 .
1 42 0 .
1 38 0 .
1 26 0 .
1 19 0 .
1 42 0 .
1 23 0 .
1 6 0 .
1 36 0 .
1 2 0 .
1 33 0 .
1 26 0 .
1 1 0 .
1 10 0 .
1 44 0 .
1 4 0 .
1 31 0 .
1 35 0 .
1 38 0 .
1 9 0 .
1 37 0 .
1 33 0 .
1 38 0 .
1 16 0 .
1 24 0 .
1 40 0 .
end
- one dataset in long format containing the average test score of each school in both the baseline (2016) and endline (2018) phases, collected in one variable called 'test'. I also have a variable for the year (2016/2018) and the school_id:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long school_id float(test time)
1 35.693993 0
1 51.16608 1
2 48.04455 0
2 59.34124 1
3 48.02045 0
3 58.74098 1
4 46.77065 0
4 49.02344 1
5 37.7081 0
5 48.65749 1
6 59.57922 0
6 68.95295 1
7 22.8125 0
7 69.07217 1
8 31.959213 0
8 44.28571 1
9 44.7 0
9 71.31195 1
10 56.50097 0
10 59.56289 1
11 41.18038 0
11 39.90138 1
12 52.53336 0
12 69.228966 1
13 53.78653 0
13 75.81396 1
14 37.916664 0
14 81.23595 1
15 65.38461 0
15 61.28039 1
16 39.66835 0
16 47.20472 1
17 33.210228 0
17 56.44737 1
18 36.833332 0
18 30.465117 1
19 38.98624 0
19 63.75571 1
20 57.44048 0
20 62.63514 1
21 49.22457 0
21 61.26515 1
22 60.18293 0
22 70.76655 1
23 45.59028 0
23 41 1
24 58.56349 0
24 74.27728 1
25 22.71739 0
25 23.76923 1
26 51.91993 0
26 50.95238 1
27 42.2586 0
27 54 1
28 25.069445 0
28 32.97872 1
29 39.3125 0
29 44.29824 1
30 47.72187 0
30 58.64078 1
31 48.375 0
31 65.25641 1
32 44.61121 0
32 67.17391 1
33 60.23297 0
33 46.11465 1
34 56.14584 0
34 52.40385 1
35 59.94257 0
35 80.03456 1
36 52.78233 0
36 47.45614 1
37 55.39412 0
37 29.23729 1
38 46.10526 0
38 68.13067 1
39 33.499027 0
39 36.913185 1
40 56.37134 0
40 71.58184 1
41 28.48035 0
41 48.61371 1
42 25.462963 0
42 47.13636 1
43 33.82611 0
43 56.21519 1
44 34.721645 0
44 57 1
45 54.30785 0
45 69.21283 1
46 38.115253 0
46 41.32308 1
47 53.17122 0
47 32.455357 1
48 36.628788 0
48 32.75 1
49 76.92538 0
49 56.68317 1
50 63.04464 0
50 60.88123 1
end
My regression in Stata would be:
diff test, t(treated) p(time)
where treated is a dummy equal to 1 if the school is treated, and time is a dummy equal to 1 if year == 2018.
I need the average test scores for 2016 and 2018 to be organized in one variable ('test'), since it will be my outcome variable. However, I am having trouble merging the two datasets: an m:m merge is not a good idea, and a 1:1 merge won't work because school_id does not uniquely identify observations in either dataset. How should I merge the two datasets?
Thank you in advance
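For reference, the merge problem described above could be approached along the following lines. This is only a sketch under assumptions, not a tested solution. The file names students.dta and school_scores.dta are hypothetical stand-ins for the two datasets shown, and n_students is an illustrative variable name. Option A collapses the student file so that school_id becomes unique and then uses a 1:m merge. Option B keeps the student rows and uses joinby, which forms all pairwise combinations within school_id.

```stata
* Sketch only: students.dta and school_scores.dta are hypothetical file names.

* Option A: one row per school and period.
* Collapse the student file to school level so school_id is unique,
* then attach both periods with a one-to-many merge.
use students, clear
collapse (mean) gender_S age_S (count) n_students = student_id, by(school_id)
merge 1:m school_id using school_scores
tabulate _merge

* Option B: keep the student rows.
* joinby pairs every student with every row of the same school,
* so each student appears once per period (time = 0 and time = 1).
use students, clear
joinby school_id using school_scores
```

Either way 'test' stays in a single variable alongside time. In Option B the outcome still varies only at the school level, so the student rows repeat the same value within a school and period. By default joinby drops unmatched observations; the unmatched() option can keep them.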