Hi

I have a dataset that combines weekly variables with a quarterly variable, quarterly_macro, which takes the same value in every week of its quarter (i.e., the macro value for quarter Q is attached to weeks 1-12 of quarter Q).

I want to fit a model on one sample and test its performance out of sample. My problem is as follows:

When I run the same loop on my data, it produces different results depending on the -sort- I use. I am not sure what the -sort- actually does in each case. I elaborate below.


First alternative:
Code:
frames reset

use quarterly_weekly.dta, clear


// The issue is here:
sort  quarter_date myWEEK // this sort changes the result


gen SUBsample=.
replace SUBsample=1 if quarter_date<tq(1999q4)  // a sample to fit the model
replace SUBsample=2  if SUBsample==.   // a sample to evaluate the model (out of sample)


gen prediction=.

set seed 12345 // this is done for the results to be reproducible 

levelsof myWEEK, local(levels)

foreach x of local levels {
    lasso linear quarterly_macro weekly_var1-weekly_var30 if SUBsample == 1
    estimates store LASSOrev
    predict temp if SUBsample==2 & myWEEK==`x', postselection
    replace prediction=temp if myWEEK==`x'
    drop temp
}
Under this sort, the data look like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(quarter_date myWEEK quarterly_macro) double(weekly_var1 weekly_var2)
88 1 .234 .10359720885753632 .8679591417312622
88 2 .234 .1684819906949997 .6942142844200134
88 3 .234 .1135205626487732 .78067946434021
88 4 .234 .1124100349843502 .7621394395828247
88 5 .234 .11129950731992722 .7485765814781189
88 6 .234 .12139459326863289 .7513777315616608
88 7 .234 .12892714142799377 .7541788816452026
88 8 .234 .1302659884095192 .7332549393177032
88 9 .234 .12449266016483307 .7331060171127319
88 10 .234 .12892714142799377 .7332549393177032
88 11 .234 .13228518515825272 .7331060171127319
88 12 .234 .13228518515825272 .7365813255310059
89 1 .442 .07843606919050217 .7152390778064728
89 2 .442 .08434794098138809 .6936862468719482
89 3 .442 .1293327808380127 .7534686923027039
89 4 .442 .1293327808380127 .7293541431427002
89 5 .442 .1374310553073883 .7074033617973328
89 6 .442 .135581336915493 .7195637226104736
89 7 .442 .13115884363651276 .7097733020782471
89 8 .442 .12929266691207886 .7085883319377899
89 9 .442 .12363984063267708 .7193162739276886
89 10 .442 .12925255298614502 .7097733020782471
89 11 .442 .11802712827920914 .7097733020782471
89 12 .442 .11715778335928917 .7193162739276886
90 1 -.027 .05430242419242859 .6410727500915527
90 2 -.027 .04069159924983978 .6852232813835144
90 3 -.027 .07825291529297829 .7406743466854095
90 4 -.027 .0726819857954979 .7218160629272461
90 5 -.027 .07203014194965363 .7287604510784149
90 6 -.027 .07235606387257576 .7396003007888794
90 7 -.027 .07137066125869751 .7391292452812195
90 8 -.027 .0707111805677414 .7340230643749237
90 10 -.027 .0707111805677414 .7287604510784149
90 11 -.027 .0726819857954979 .7287604510784149
90 12 -.027 .07203014194965363 .7340230643749237
91 1 1.466 .12981104850769043 .6597848534584045
91 2 1.466 .09218613058328629 .6661829948425293
91 3 1.466 .06856242567300797 .6784851551055908
91 4 1.466 .06452743709087372 .6872596144676208
91 5 1.466 .052778784185647964 .6839054822921753
91 6 1.466 .05186103284358978 .6872596144676208
91 7 1.466 .052778784185647964 .6872596144676208
91 8 1.466 .0509432815015316 .6872596144676208
91 9 1.466 .05037875659763813 .6872596144676208
91 10 1.466 .0509432815015316 .6872596144676208
91 11 1.466 .05037875659763813 .6872596144676208
91 12 1.466 .05037875659763813 .6880350112915039

end
format %tq quarter_date



Second alternative (the same code; I only change the sort):
Code:
frames reset

use quarterly_weekly.dta, clear


// The issue is here: THE CHANGE
sort  myWEEK quarter_date // this sort changes the result


gen SUBsample=.
replace SUBsample=1 if quarter_date<tq(1999q4)  // a sample to fit the model
replace SUBsample=2  if SUBsample==.   // a sample to evaluate the model (out of sample)


gen prediction=.

set seed 12345 // this is done for the results to be reproducible 

levelsof myWEEK, local(levels)

foreach x of local levels {
    lasso linear quarterly_macro weekly_var1-weekly_var30 if SUBsample == 1
    estimates store LASSOrev
    predict temp if SUBsample==2 & myWEEK==`x', postselection
    replace prediction=temp if myWEEK==`x'
    drop temp
}
Under this sort, the data look like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(quarter_date myWEEK quarterly_macro) double(weekly_var1 weekly_var2)
 88 1   .234   .10359720885753632  .8679591417312622
 89 1   .442   .07843606919050217  .7152390778064728
 90 1  -.027   .05430242419242859  .6410727500915527
 91 1  1.466   .12981104850769043  .6597848534584045
 92 1  -.498    .1568187177181244   .752483606338501
 93 1  1.061   .19480887055397034  .6877254247665405
 94 1  -.277  -.22961216419935226  .6739154607057571
 95 1   .533   .20222391188144684  .6799057126045227
 96 1  1.357   .16426868736743927   .607530027627945
 97 1  -.366   .15438511967658997  .6481877565383911
 98 1 -1.091    .5329873561859131  .5102922320365906
 99 1   .326   .36372411251068115    .56190025806427
100 1 -1.089    .2837163358926773   .749171257019043
101 1   .146   .10795796103775501  .5764660388231277
103 1 -1.603   .07519269734621048    .59975266456604
104 1  -.262   .04535355046391487    .68744957447052
105 1  -.474 -.061102996580302715     .7769795358181
106 1   .345  -.03943045064806938  .7835010290145874
107 1   -.69    .1510646566748619  .7343248724937439
108 1    .47   .12457912415266037  .8115817308425903
109 1  -.119   .09771569073200226  .6026725769042969
110 1    .51  .027549786493182182  .8069815039634705
111 1   .614   .23088725749403238  .7811554968357086
112 1  1.348   .06547258608043194  .5904433727264404
113 1  -.114   .04756193794310093  .6583850979804993
114 1   .214   .09916634485125542  .8127434849739075
115 1   .426   .09765896201133728  .6433976590633392
116 1   -1.1   .18991564214229584  .7861429452896118
117 1   .851   .07597751915454865   .746464192867279
118 1   .474   .07555382326245308  .7181055545806885
119 1    .58   .13181869685649872  .6889693140983582
120 1  -.234   .17234022915363312  .6527844965457916
121 1  -.786   .04685705155134201  .7141219973564148
122 1  -.351   .07157503068447113  .8004719018936157
123 1   .548   .09237945079803467  .6117232441902161
124 1   .019  .017031755298376083  .7257599234580994
125 1  -.921   .09774142131209373  .6640038192272186
126 1  -.541   .08078432828187943   .650404155254364
127 1    .14  -.05550992488861084  .6119781732559204
128 1   .753 -.005241314647719264   .670651912689209
129 1   .158  .054290205240249634  .6977491974830627
130 1    .77    .2013719528913498  .6792399883270264
131 1    .92    .0713081881403923  .5789836049079895
132 1 -1.068   .06416945718228817  .6183741390705109
133 1   .319  .038945429027080536  .7245792746543884
134 1   .024   .18857061862945557  .6980202794075012
135 1  1.101   .05807130690664053  .5858350396156311
136 1   .822   .04747616872191429  .5992066860198975
137 1   .383   .09270800650119781  .6318810880184174
138 1   .604   .13145361468195915    .71683070063591
139 1    .54    .1347259134054184   .590053141117096
140 1   -.12   .16695279628038406  .6224272847175598
end
format %tq quarter_date

I would really appreciate it if someone could explain how the loop estimates the model and produces the predictions in each case. For example, does it estimate the model week by week? That is, does either alternative regress the quarterly dependent variable on the week 1 data in SUBsample 1, use those parameters to create predictions for week 1 in SUBsample 2, and then repeat this for each week?
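
To make the question concrete, this is roughly what I mean by estimating "on a weekly basis". It is only a sketch of the intended behaviour, not something I am claiming either alternative does. It assumes SUBsample and myWEEK are defined as in my code; prediction_wk is a hypothetical new variable, and restricting the fit to one week with myWEEK==`x' and passing lasso's rseed() option are my own assumptions about how this would be written.

```stata
* Sketch only: one lasso per week, fitted in SUBsample 1 and
* used to predict the same week in SUBsample 2.
* prediction_wk is a hypothetical name, not in my original code.
sort myWEEK quarter_date              // keep one fixed sort order for all runs
capture drop prediction_wk
gen double prediction_wk = .

levelsof myWEEK, local(levels)

foreach x of local levels {
    // fit on the estimation sample for week `x' only
    lasso linear quarterly_macro weekly_var1-weekly_var30 ///
        if SUBsample == 1 & myWEEK == `x', rseed(12345)

    // post-selection predictions for the same week, out of sample
    predict double temp if SUBsample == 2 & myWEEK == `x', postselection
    replace prediction_wk = temp if SUBsample == 2 & myWEEK == `x'
    drop temp
}
```

In my own loop the lasso line does not refer to `x' at all, which is part of why I am unsure what is actually being estimated in each iteration.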


I look forward to reading your contributions.

Thanks