Hy everybody.
I am a graduate student working on the thesis. I'm studying how the quality of school infrastructure affects students performances (test score, attendance) in rural Angola. I have data collected in two phases (baseline and endline). I want to perform a diff-in-diff, but before doing that I need to properly organize my dataset. I have two data sets I want to merge (variable in common in the two dataset is schhol_id):

- one dataset in a long format, 59,000 obs, at a student level characteristics:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long school_id int student_id float gender_S long age_S
1 35 0  .
1 23 0  .
1 27 0  .
1 30 0  .
1  2 0  .
1 24 0  .
1 24 0  .
1  2 0  .
1 43 0  .
1 42 0  .
1 18 0  .
1  9 0  .
1 29 0  .
1 13 0  .
1  7 0  .
1 23 0  .
1 19 0  .
1  7 0  .
1 42 0  .
1  8 0  .
1  . 0  .
1 36 0  .
1 16 0  .
1 36 0  .
1 12 0  .
1 16 0  .
1  5 0  .
1 30 0  .
1 11 0  .
1 26 0  .
1 40 0  .
1  . 0 12
1 27 0  .
1  7 0  .
1  . 0  .
1  . 0  .
1 14 0  .
1  . 0  .
1 25 0  .
1 37 0  .
1 34 0  .
1 23 0  .
1 22 0  .
1 16 0  .
1 24 0  .
1  2 0  .
1  7 0  .
1 44 0  .
1 26 0  .
1 16 0  .
1 10 0  .
1  3 0  .
1 19 0  .
1  6 0  .
1 36 0  .
1 40 0  .
1 13 0  .
1  4 0  .
1 48 0  .
1 36 0  .
1 41 0  .
1 33 0  .
1 47 0  .
1 28 0  .
1 22 0  .
1 42 0  .
1 11 0  .
1 13 0  .
1 22 0  .
1  7 0  .
1 13 0  .
1 10 0  .
1  8 0  .
1 12 0  .
1 39 0  .
1 42 0  .
1 38 0  .
1 26 0  .
1 19 0  .
1 42 0  .
1 23 0  .
1  6 0  .
1 36 0  .
1  2 0  .
1 33 0  .
1 26 0  .
1  1 0  .
1 10 0  .
1 44 0  .
1  4 0  .
1 31 0  .
1 35 0  .
1 38 0  .
1  9 0  .
1 37 0  .
1 33 0  .
1 38 0  .
1 16 0  .
1 24 0  .
1 40 0  .
end

- one dataset in a long format containing a the test score averages of each school both in the baseline(2016) and endline phases(2018), collected in one variable called 'test'. I also have a variable for the year(2016/2018) and the school_id:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long school_id float(test time)
 1 35.693993 0
 1  51.16608 1
 2  48.04455 0
 2  59.34124 1
 3  48.02045 0
 3  58.74098 1
 4  46.77065 0
 4  49.02344 1
 5   37.7081 0
 5  48.65749 1
 6  59.57922 0
 6  68.95295 1
 7   22.8125 0
 7  69.07217 1
 8 31.959213 0
 8  44.28571 1
 9      44.7 0
 9  71.31195 1
10  56.50097 0
10  59.56289 1
11  41.18038 0
11  39.90138 1
12  52.53336 0
12 69.228966 1
13  53.78653 0
13  75.81396 1
14 37.916664 0
14  81.23595 1
15  65.38461 0
15  61.28039 1
16  39.66835 0
16  47.20472 1
17 33.210228 0
17  56.44737 1
18 36.833332 0
18 30.465117 1
19  38.98624 0
19  63.75571 1
20  57.44048 0
20  62.63514 1
21  49.22457 0
21  61.26515 1
22  60.18293 0
22  70.76655 1
23  45.59028 0
23        41 1
24  58.56349 0
24  74.27728 1
25  22.71739 0
25  23.76923 1
26  51.91993 0
26  50.95238 1
27   42.2586 0
27        54 1
28 25.069445 0
28  32.97872 1
29   39.3125 0
29  44.29824 1
30  47.72187 0
30  58.64078 1
31    48.375 0
31  65.25641 1
32  44.61121 0
32  67.17391 1
33  60.23297 0
33  46.11465 1
34  56.14584 0
34  52.40385 1
35  59.94257 0
35  80.03456 1
36  52.78233 0
36  47.45614 1
37  55.39412 0
37  29.23729 1
38  46.10526 0
38  68.13067 1
39 33.499027 0
39 36.913185 1
40  56.37134 0
40  71.58184 1
41  28.48035 0
41  48.61371 1
42 25.462963 0
42  47.13636 1
43  33.82611 0
43  56.21519 1
44 34.721645 0
44        57 1
45  54.30785 0
45  69.21283 1
46 38.115253 0
46  41.32308 1
47  53.17122 0
47 32.455357 1
48 36.628788 0
48     32.75 1
49  76.92538 0
49  56.68317 1
50  63.04464 0
50  60.88123 1
end

My regression in stata woud be:
diff test, t(treated*) p(time**)

*treated is a dummy=1 if the school is trated
** time=1 if year==2018

I need the averages of test scores in 2016 and 2018 to be organized in one variable('test'), since it will be my outcome variable. However I am having trouble merging the two dataset, since m:m merge is not a good idea, 1:1 merge won't work because in both dataset the school_id is not uniquely identified. How shoud I merge thetwo datasets?
Thank you in advance