Hello,

I have a question about which type of regression model is appropriate for a dependent variable with a zero-inflated distribution.

Some info about the data:

- The dependent variable for one of my hypotheses is ‘distvolatility’ (shown in the table below).
- Its distribution is heavily zero-inflated (1304 out of 1459 observations are 0) and positively skewed. These zeroes are real/true values (not censored/truncated).
- Respondents with a non-zero value for ‘distvolatility’ take one of only 15 possible scores, ranging from .2959995 to 32.373 (the table below lists every possible value).
Code:
 tab distvolatility 

distvolatil |
        ity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,304       89.38       89.38
   .2959995 |         10        0.69       90.06
   .8690004 |         39        2.67       92.73
      4.661 |          7        0.48       93.21
     11.673 |          7        0.48       93.69
     12.542 |         12        0.82       94.52
     14.874 |         17        1.17       95.68
      15.17 |          4        0.27       95.96
     16.334 |          5        0.34       96.30
     17.203 |         11        0.75       97.05
     19.535 |          8        0.55       97.60
     19.831 |          3        0.21       97.81
     31.208 |          9        0.62       98.42
     31.504 |          3        0.21       98.63
     32.077 |         15        1.03       99.66
     32.373 |          5        0.34      100.00
------------+-----------------------------------
      Total |      1,459      100.00

My question is which type of regression model is best suited to this kind of zero-inflated distribution. I have done a lot of reading and have seen many different suggestions, but none seems entirely appropriate for my data:

- Zero-inflated Poisson/zero-inflated negative binomial - These both assume count data. My data are discrete but not counts, so would using one of these models (most likely zinb, as the variance is much higher than the mean) severely bias the results?

- Two-step generalised linear model - Another option is to model whether distvolatility is zero or non-zero with a binary logistic regression, and then fit a GLM to the non-zero values only.

- Tobit regression - I have also seen this mentioned as an option for zero-inflated distributions, although it assumes the zeroes are censored, which is not the case here.
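
For reference, this is roughly how I understand the two-step option would look in Stata. It is only a minimal sketch: x1 and x2 are hypothetical placeholders for my covariates, anyvol is a helper variable I would create, and the gamma family with a log link is just one common choice for a skewed positive outcome, not something I have settled on.
Code:
 * anyvol = 1 if distvolatility is non-zero, 0 if it is zero
 generate byte anyvol = distvolatility > 0 if !missing(distvolatility)

 * Part 1: probability of a non-zero value (x1, x2 are placeholders)
 logit anyvol x1 x2

 * Part 2: size of distvolatility among the non-zero observations only
 glm distvolatility x1 x2 if distvolatility > 0, family(gamma) link(log)
As far as I can tell, the user-written twopm command (from SSC) fits both parts in one step, but I have not tried it on these data.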


The zero-inflated negative binomial seems to be the best option at the moment, but any advice would be greatly appreciated!