Hi there, Stata brethren.

Recently I have been trying to use the new nonparametric regression feature in Stata 16, npregress series, on different subsamples of my data. I found it to be slow. After digging in, I think I've discovered a strange behavior: npregress becomes much slower as the dataset in memory grows, even when the estimation sample stays the same size.

Consider the example below.
Toy example
Code:
clear
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
npregress series y x1 x2 if _n < 1001, polynomial
This takes my computer about 60 seconds to run. Now I use the exact same sample, but drop the unused observations.

Code:
drop if _n >=1001
npregress series y x1 x2 if _n < 1001, polynomial
This takes about 2 seconds. I did not expect this, because if I run a similar experiment with regress instead of npregress, the two run times are roughly the same.
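
For anyone who wants to reproduce the comparison, here is a rough sketch of how the four timings could be collected with Stata's timer commands. The seed and the timer numbers (1 to 4) are arbitrary choices for illustration and were not part of the runs described above, so exact times will differ.

Code:
clear
set seed 12345                      // arbitrary seed, only for reproducibility
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
timer clear
* full dataset in memory, estimation on the first 1,000 observations
timer on 1
quietly npregress series y x1 x2 if _n < 1001, polynomial
timer off 1
timer on 2
quietly regress y x1 x2 if _n < 1001
timer off 2
* same estimation sample, unused observations dropped
keep in 1/1000
timer on 3
quietly npregress series y x1 x2, polynomial
timer off 3
timer on 4
quietly regress y x1 x2
timer off 4
timer list                          // timers 1 vs 3: npregress; 2 vs 4: regress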

Can someone explain why this is happening? Is npregress somehow using the observations outside the estimation sample? I was hoping to run npregress repeatedly on subsamples of my data to construct nonparametric predictions, without having to repeatedly shuffle the data in memory (which also takes a long time, given that I am using a moderately large dataset).
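
For reference, a minimal sketch of one possible stopgap, assuming Stata 16's frames feature: copy only the estimation sample into a separate frame and estimate there, leaving the full dataset untouched in the default frame. The frame name sub is a placeholder, and whether this actually avoids the slowdown is untested here.

Code:
* copy only the estimation sample into a new frame (the name "sub" is arbitrary)
frame put y x1 x2 if _n < 1001, into(sub)
* estimate inside the small frame; the default frame is unchanged
frame sub: npregress series y x1 x2, polynomial
* free the copy once the results are no longer needed
frame drop sub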

Best,
Rustin