Dear Statalists,

I'd very much like to hear your opinion on the following.

What I’m trying to do:
I’m trying to build a forecasting model for firm entry rates (i.e. the number of newly created firms as a share of incumbents). The dependent variable is the entry rate per region, industry branch and year; the independent variables are several regional institutional factors.

I intend to fit the model on training data for 33 regions, 66 branches and 4 years, and then assess its quality by comparing the predicted values with the actual values for year 5.

Currently I’m using a GLM with logit link and binomial family. A tobit model censored at 0 and 1 gives similar results, but they are not quite as good as those from the GLM.

Code:
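* GLM with logit link and binomial family, fitted on the training years only
glm entry_rate x1 x2 x3 if training_data==1, link(logit) family(binomial) robust nolog

* Two-limit tobit, censored at 0 and 1, fitted on the same sample
tobit entry_rate x1 x2 x3 if training_data==1, ll(0) ul(1) robust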
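
A minimal sketch of how the year-5 comparison of predicted and actual values could be run after the GLM. It assumes that training_data==0 marks exactly the year-5 holdout observations; yhat_glm and sqerr are placeholder names introduced here for illustration only.

```stata
* Sketch only: assumes training_data==0 identifies the year-5 holdout rows
glm entry_rate x1 x2 x3 if training_data==1, link(logit) family(binomial) robust nolog

* Predicted mean entry rate for all observations, including the holdout year
predict double yhat_glm, mu

* Root mean squared prediction error on the holdout year
generate double sqerr = (entry_rate - yhat_glm)^2 if training_data==0
summarize sqerr, meanonly
display "Holdout RMSE: " sqrt(r(mean))

* Squared correlation between actual and predicted values on the holdout year
correlate entry_rate yhat_glm if training_data==0
display "Holdout squared correlation: " r(rho)^2
```

The same steps could be repeated after the tobit fit to put both models on an equal out-of-sample footing.
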
Now to the problem:
Unfortunately, entry rates are distributed very differently across industries. Prediction works quite well for industries whose entry rates are approximately normally distributed (~ 30 branches). However, there are many zero-inflated industries (~ 36 branches), i.e. most observations in these industries have an entry rate of 0. Because of these zeros, overall predictions are rather dispersed and the R2 is quite low.
The following selected histograms illustrate one branch for which prediction works quite well (upper left) and two with numerous zero observations.

[Image: histograms of entry rates for three selected branches]
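
The extent of zero inflation per branch could be quantified along the following lines. This is only a sketch: branch is a hypothetical name for the industry identifier, and is_zero is a helper variable created here.

```stata
* Sketch only: "branch" is a hypothetical identifier for the industry branch
* Indicator for a zero entry rate; left missing where entry_rate is missing
generate byte is_zero = (entry_rate == 0) if !missing(entry_rate)

* Share of zero observations and number of observations per branch
tabstat is_zero, by(branch) statistics(mean n)
```

The mean of is_zero within each branch is the share of zero entry rates, which gives a simple way to separate the zero-inflated branches from the rest.
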

Now my question:
Is it possible to model entry differently for different industries? Alternatively, how can I adequately account for the many zero rates in the regression model?

Thanks a lot & regards,
Michael