Hello Statalist Community,

I am testing how well Stata 15's fmm command estimates the parameters of a zero-inflated distribution of a proportion. To do so, I simulate the underlying data and then estimate the parameters. FMM does a good job of recovering the latent class marginal probabilities, yielding estimates of 38.8% for class 1 (true prob = 39%) and 61.2% for class 2 (true prob = 61%). It captures the size of the point mass at 0 well, but it seems to have a hard time recovering the parameters of the proportion within the (0, 1) interval, which is generated as a logistic transform of a linear index. In contrast, a simple GLM fitted only to the class 2 data within the (0, 1) interval successfully recovers the slope parameter b (estimate = 0.3903188; the true value is 0.4).

In particular, the code I am running is:

Code:
clear all
set more off
set obs 2000
set seed 12345

// generate class indicator (coded 0/1; fmm labels these latent classes 1 and 2)
gen class = (_n > 780)      // obs 1-780 (39%): class 1, coded 0; obs 781-2000 (61%): class 2, coded 1

// set parameters
scalar mu = -0.1
scalar sx = 0.3
scalar se = 0.1
scalar b = 0.4

// generate random Normal variables
gen x = rnormal(0, sx)
gen e = rnormal(0, se)

// generate simulated series for Y
gen y = 0 if class == 0
replace y = 1/(1 + exp(-(mu + b*x + e))) if class == 1

// plot the ys versus the x
twoway scatter y x, by(class) name(y_by_x, replace)
histogram y, frequency by(class) width(0.03) fcolor(forest_green%50) name(y_by_class, replace)
histogram y, frequency width(0.03) fcolor(navy%50) name(y_hist, replace)

// estimate parameters using the known class and a single GLM
glm y x if class == 1, family(binomial) link(logit)

sort y
// use FMM with a pointmass at 0 and a GLM to estimate the parameters
fmm, difficult :    (pointmass y, value(0)) ///
                    (glm y x, family(binomial) link(logit))
predict exp_y*
predict pr*, classposteriorpr
format %4.3f pr*
estat lcprob
estat lcmean

// compute the predicted values
gen y_hat = pr1*0 + pr2*exp_y1
// summarize the dependent variable and its fitted values
summarize y y_hat exp_y*

The distributions of the simulated dependent variable (y) and its FMM-fitted values (y_hat) are quite different in the (0, 1) interval. Do you know what might be going on, and why FMM has such a hard time estimating the parameters of a GLM with family(binomial) and link(logit) in this example?
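
One way to narrow this down is to check whether the problem lies in the class assignment or in the within-class fit. The lines below are only a minimal sketch, meant to be run after the code above: they assume that pr2 (created by the predict ..., classposteriorpr line) is the posterior probability of the GLM component, and modal_class is a hypothetical helper variable that is not part of the original code.

Code:
// modal_class is a hypothetical helper: 1 if the GLM component is the more likely class
generate byte modal_class = (pr2 > 0.5) if !missing(pr2)

// cross-tabulate the simulated class (0/1) against the modal posterior class
tabulate class modal_class, row

// posterior probability of the GLM component, separately for each simulated class
summarize pr2 if class == 0
summarize pr2 if class == 1

If the cross-tabulation is close to diagonal, the classes are being separated well and the discrepancy would have to come from the GLM component itself rather than from the mixing step.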