Dear All,

To estimate a regression discontinuity in time model I want to use cross validation to determine the optimal bandwidth. I have 7 weeks of pre-intervention and 7 weeks of post-intervention data. For cross validation, I retain only the 7 weeks pre-intervention.
I use the leave-on-out procedure
and then iterate through 2, 3, 4, 5, 6 , 7 weeks to see which bandwidth gives me the smallest mean square error.

Following is the data:

Code:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 npi int year float week int userTRA
"J338339LLR" 2014 4 0
"J338339LLR" 2014 6 0
"J338339LLR" 2014 7 0
"J33833J3J3" 2014 2 0
"J33833J99R" 2014 2 1
"J33833JOLJ" 2014 4 0
"J33833NF9L" 2014 5 0
"J33833R8F8" 2014 1 0
"J33833RLFF" 2014 7 0
"J33833RO8R" 2014 2 0
"J338383FRV" 2014 7 0
"J338383R89" 2014 2 0
"J33838FR9R" 2014 6 0
"J33838LJ8R" 2014 3 0
"J33838LVFO" 2014 6 1
"J33838RFNL" 2014 1 1
"J338393FOR" 2014 7 0
"J338398J88" 2014 6 0
"J3383998JF" 2014 4 0
"J338399JRF" 2014 5 0
"J338399N33" 2014 2 1
"J338399V3R" 2014 7 0
"J33839F99O" 2014 6 1
"J33839F99O" 2014 7 0
"J33839FNL3" 2014 5 0
"J33839JFRV" 2014 6 0
"J33839JLRL" 2014 5 0
"J33839NNOL" 2014 6 0
"J33839O383" 2014 4 0
"J33839O8R8" 2014 2 0
"J33839OR8R" 2014 6 2
"J3383F33NJ" 2014 2 0
"J3383F38JN" 2014 4 0
"J3383F988V" 2014 2 0
"J3383FN3VR" 2014 2 1
"J3383FNFNL" 2014 1 0
"J3383FNFNL" 2014 5 0
"J3383FR8L9" 2014 2 0
"J3383FROOF" 2014 2 0
"J3383FVRVO" 2014 5 0
"J3383J3983" 2014 1 0
"J3383J3JV8" 2014 3 1
"J3383J88FO" 2014 3 0
"J3383J8RJV" 2014 3 0
"J3383J8RJV" 2014 4 0
"J3383J8VFV" 2014 5 0
"J3383JONVF" 2014 5 0
"J3383JRLJ8" 2014 2 1
"J3383JRLJ8" 2014 7 1
"J3383L3VJV" 2014 7 0
"J3383L88NV" 2014 5 0
"J3383LF888" 2014 1 0
"J3383LFR3J" 2014 7 0
"J3383LJJFO" 2014 2 0
"J3383LL9RN" 2014 6 1
"J3383LLN8N" 2014 5 0
"J3383LLVFJ" 2014 1 0
"J3383LLVFJ" 2014 5 0
"J3383LRVO8" 2014 2 0
"J3383LVFOR" 2014 7 0
"J3383LVN93" 2014 3 0
"J3383N83R8" 2014 5 0
"J3383N888L" 2014 2 0
"J3383N9LFJ" 2014 5 0
"J3383NL93R" 2014 2 0
"J3383NLV8O" 2014 7 1
"J3383NNJFV" 2014 1 0
"J3383NNJFV" 2014 6 1
"J3383NO3RF" 2014 6 0
"J3383NVJNJ" 2014 4 0
"J3383O3LVV" 2014 7 0
"J3383OFJJL" 2014 5 0
"J3383ON8LO" 2014 3 1
"J3383OOO9F" 2014 1 2
"J3383OORLN" 2014 1 0
"J3383ORLLO" 2014 2 0
"J3383OVN8F" 2014 4 0
"J3383OVRF3" 2014 6 0
"J3383R3F9N" 2014 2 0
"J3383RF9O9" 2014 6 0
"J3383RF9O9" 2014 7 0
"J3383RNV9R" 2014 1 0
"J3383ROLLV" 2014 6 0
"J3383V3OOO" 2014 4 0
"J3383V3OOO" 2014 7 0
"J3383V83FL" 2014 6 0
"J3383V83N9" 2014 7 0
"J3383V9J89" 2014 3 0
"J3383VL398" 2014 1 1
"J3383VL398" 2014 2 1
"J338F8FJFV" 2014 4 0
"J338FLNVF3" 2014 5 0
"J338FR9RLL" 2014 3 0
"J338FRR3LF" 2014 6 0
"J338FRV38F" 2014 5 0
"J338J33FOV" 2014 3 0
"J338J33RNO" 2014 5 1
"J338J38JR3" 2014 5 1
"J338J398ON" 2014 1 1
"J338J39OV3" 2014 5 0
end
Then the code I am trying to run is:

Code:
. // PREPARING DATA FOR CV FOR FEDERAL SCHEDULING
.
. use "C:\Users\Sumedha\Documents\OPTUM\fillsProviderPanel_weekly_DS_10.dta", clear

. drop if dateWeek<2834 |dateWeek>2840 // only 7 weeks pre-intervention
(5,265,166 observations deleted)

. gen event_time_dateweek=dateWeek-2840

. rename  prov_state state

. sort state

. drop _merge

. merge m:1  state using "G:\Misc. Data\stateFIPS.dta"

    Result                           # of obs.
    -----------------------------------------
    not matched                         3,459
        from master                     3,459  (_merge==1)
        from using                          0  (_merge==2)

    matched                            66,609  (_merge==3)
    -----------------------------------------

. drop if _merge~=3
(3,459 observations deleted)

. drop _merge

. rename stateFIPS st_fips

.
.         gen ReschedTRA_treat=0

.         replace ReschedTRA_treat=1 if st_fips==5| st_fips==13|st_fips==17|st_fips==21|
> ///
>         st_fips==       28|st_fips==35|st_fips==36 |st_fips==38|st_fips==40|st_fips==47
> |st_fips==56|st_fips==39
(15,658 real changes made)

.
. keep if ReschedTRA_treat==0
(15,658 observations deleted)

. gen week=dateWeek-2833

.
. /*
> NOTES: cllr_crossval
> The goal is to estimate the bandwidth that minimizes the IMSE of a local linear regress
> ion.
> A grid search is used and estimation is based on the cllr program described above.
>
> Arguments
>   outcome: a stata variable containing the dependent variable
>         x: a stata variable containing the independent variable
>     start: a hardcoded number or local variable defining start of a sequence candidate
> bandwidths
>      step: a hardcoded number or local variable defining the stepsize of the sequence o
> f candidate bandwidth
>      stop: a hardcoded number or local variable defining the end of a sequence of candi
> date bandwidths.
>       sub: a stata variable set to 1 if the observation should be included in the analy
> sis
>
> Returns
>   A stata matrix and set of stata variables that contain the estimated IMSE for each ca
> ndidate bandwidth.
>
> */
.
.
. sort npi

. gen N=_n if npi[_n]~=npi[_n-1]
(6,945 missing values generated)

. bysort npi: egen maxN=max(N)

. replace N=maxN if N==.
(6,945 real changes made)

. bysort N week: gen counter=_n

. drop if counter>1
(141 observations deleted)

. xtset N week
       panel variable:  N (unbalanced)
        time variable:  week, 1 to 7, but with gaps
                delta:  1 unit

.
. gen outcome = userTRA

. gen x = week

.
. capture program drop cllr_crossval

. program define cllr_crossval
  1.         set more off
  2.         args outcome x start step stop sub narrowsub
  3.         tempvar cx ew e2 e2n
  4.
.         local stop = 7
  5.         local start = 1
  6.         local step = 1
  7.         *make a matrix to store the estimated IMSE
.         local size = ((`stop' - `start')/`step')+1
  8.         matrix M = J(`size', 3, .)
  9.        
.        
.         *Iterate over candidate bandwidths
.         local count = 0
 10.         forvalues h = `start'(`step')`stop'{
 11.        
.                 *increment counter
.                 local count = `count' + 1
 12.                
.                 *store location on the bandwidth grid
.                 matrix M[`count', 1] = `h'
 13.                
.                 *initialize the residual variable
.                 gen `e2' = .
 14.                
.                
.                 *Iterate over observations
.                 forvalues i = 1(1)`N'{
 15.                 capture quietly reghdfe /*regress*/ `outcome' `x' if  _n~=`i' & week
> =<`h', absorb(npi)
 16.                                 replace `e2' = (`outcome' - _b[_cons])^2 in `i'
 17.                                         }
 18.                
.                 *compute IMSE for the candidate bandwidth
.                 su `e2'
 19.                 matrix M[`count',2] = r(mean)
 20.
.                                
.                 drop `e2'
 21.         }
 22.
.         matrix list M
 23.         svmat M
 24. end

.
end of do-file

.
But it doesn't run at all! I will be grateful for any help in identifying what I am doing wrong.

Sincerely,
Sumedha.