Dear STATA users,

I’m currently facing the problem of a too-long computation time.
I have a large dataset with observations over several periods.
There are three variables I use for each period for computing: x, y, and b. (150 periods, 500 levels of x, 10000+ levels of y, for each x,y one level of b)

I have to calculate for every possible combination of x, with the data points matched on y to calculate my measure k using b.

The formula for each period is:

K_vx = ∑_y(b_v * b_x) / ∑_y(b_v * b_v)

In the following I am going to describe my approach, and my code.


I, therefore, looped over each level of period, kept only one period, and saved this data set as a new dataset. (Single_period.dta)
Then I looped over each level of a, keeping only this level of a.
Renamed variables x and b to v and b_from. Calculate each b_from^2 and the sum of all b-from ^2 for v.
Saved this dataset to a new dataset (single_period_x.dta)
Restored dataset single_period.
Then I loop again over each level of x and keep only one level of x.
I join this dataset with single_period_x on y. for the case that there is no fit in y I open an if loop for x is not 0
Then I calculate b_v * b_s and the sum of it. Finally I calculate with both sums K_vx = ∑_y(b_v * b_x) / ∑_y(b_v * b_v).

To save this Output I collapse the dataset to k by (v x period) and append this to an output file.
Then I use clear and restore the dataset with all x using single_period_x.dta

This program has a runtime for each period of 1,5 - 3h on my system. I am looking for a way to improve this calculation.
Does someone have an idea how to calculate this faster or can identify a bottleneck in this calculation?


Code:
clear
use data.dta

levelsof period, l(period_levels)
 foreach q of local period_levels{
     keep if period == `q'
    save single_period, replace
    levelsof x, l(x_levels)
    foreach p of local x_levels{
        keep if x == `p'
        rename x v                
        rename b b_v
        
        rename period period_v
        
        generate b_v_sq = b_v * b_v
        egen b_v_sq_sum = sum(b_v_sq)
        
        save single_x_period, replace
        clear
        use single_period.dta
            foreach r of local x_levels{
                keep if x == `r'
                joinby y using single_x_period.dta, unmatched(none)
                
                if v != .{
                
                    generate b_sq = b * b
                    generate b_b_v = b * b_v
                    
                    egen b_sq_sum = sum(b_sq)
                    egen b_b_v_sum = sum(b_b_v)
                    
                    generate k = b_b_v_sum / b_v_sq_sum
                    
                    collapse k, by(period_v period v x)

                    append using data/results/k_L1_L2.dta
                    save data/results/k_L1_L2.dta, replace
                }
                
                clear
                use single_period.dta    
            }
        use single_period.dta
        }
    use data.dta
    }