I want to estimate a probit regression on a large dataset.
The outcome is 1 when there is a positive trade flow from country i to country j for a particular product. Any directed country pair can trade many products, so link = 1 if i exports that particular good to j. No 0's are recorded. For any given year in the data, this gives around 5 million observations.
I want to estimate the probability that a link is present given covariates:
probit link x1 x2..., vce(cluster ...)
Creating all possible i-j-product combinations results in 228 million observations. Hence, I would like to keep all observed 1's, draw a random subsample of the 0's, estimate the probit on that reduced sample, and then reweight the coefficients to correct for the true number of 0's in the data. I can create and store the full dataset of 228 million observations, but the server chokes on the probit. What would be the correct way to proceed?
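To make the reweighting idea concrete, here is a sketch of what I have in mind, written as a weighted probit log-likelihood. The symbols are my own assumptions for illustration: N_0 is the number of 0's in the full data (roughly 228 million minus 5 million, so about 223 million), m is the number of 0's I would sample, q = m / N_0 is the resulting sampling fraction, and w_i is the weight on observation i. I am not sure this is the right estimator, which is part of the question.

```latex
% Sampling weights: every observed 1 is kept, each 0 is kept with probability q
w_i =
\begin{cases}
  1           & \text{if } y_i = 1 \\
  1/q         & \text{if } y_i = 0
\end{cases}
\qquad q = \frac{m}{N_0}

% Weighted probit log-likelihood over the reduced sample S
\ln L_w(\beta) = \sum_{i \in S} w_i
  \Bigl[ y_i \ln \Phi(x_i'\beta) + (1 - y_i) \ln\bigl(1 - \Phi(x_i'\beta)\bigr) \Bigr]
```

In Stata terms, I assume this would correspond to passing w_i as probability weights to probit and keeping the clustered standard errors, rather than adjusting the coefficients after an unweighted fit.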