This is a new topic for me, so please pardon my basic understanding of the Heckman model. It's possible I should not be using a selection model at all, so I wanted to check first.

My dependent variable (measured at the individual level) is the species diversity recorded by a birdwatcher in a district-time period. The independent variable is deforestation in that district-time period. Many individuals go looking for specific birds, so their observations are less useful for estimating the general effect of deforestation on species diversity. The goal is to identify the individuals who record everything in sight and focus the analysis on them.

I plan to define a "veteran" birdwatcher as someone who captures a reasonably representative measure of diversity, based on some predefined criteria. These could include the total number of trips they take, the number of months per year they go out, and whether they report all species seen during a trip. These variables would predict veteran status but would not enter the outcome equation.
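The kind of rule I have in mind could be sketched like this in Python. This is only an illustration: the field names and the cutoffs (20 trips, 6 months) are hypothetical placeholders, not values I have settled on.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration.
MIN_TRIPS = 20
MIN_MONTHS_PER_YEAR = 6


@dataclass
class Birdwatcher:
    total_trips: int             # total trips taken
    months_active_per_year: int  # months per year with at least one trip
    reports_all_species: bool    # reports every species seen on a trip


def is_veteran(b: Birdwatcher) -> int:
    """Return 1 if the birdwatcher meets every criterion, else 0."""
    meets_all = (
        b.total_trips >= MIN_TRIPS
        and b.months_active_per_year >= MIN_MONTHS_PER_YEAR
        and b.reports_all_species
    )
    return int(meets_all)


# Three made-up observers: only the first satisfies all three criteria.
observers = [
    Birdwatcher(45, 10, True),
    Birdwatcher(45, 10, False),
    Birdwatcher(3, 1, True),
]
veteran_flags = [is_veteran(b) for b in observers]
```

The resulting 0/1 flag is what I would use as the selection indicator.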

Initially, I dropped all observations that didn't meet the selection criteria. My understanding is that doing this can bias my coefficients because it truncates the distribution of the error terms. Can I treat my problem as a sample selection problem?
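To make the concern concrete for myself, I wrote a toy simulation (Python, standard library only). Everything in it is invented: x stands in for deforestation, the true slope is set to 2, and selection is assumed to depend both on x and on an unobservable that is correlated (rho = 0.8) with the outcome error. Whether my veteran rule actually behaves like this is exactly what I am unsure about.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope from a simple one-regressor OLS fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, rho=0.8, seed=0):
    """Compare the OLS slope on the full sample and on the selected sample."""
    rng = random.Random(seed)
    xs, ys, kept = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)  # stand-in for deforestation
        u = rng.gauss(0, 1)  # outcome error
        # Selection error, correlated with the outcome error (assumed).
        v = rho * u + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        y = 1.0 + true_slope * x + u
        xs.append(x)
        ys.append(y)
        kept.append(x + v > 0)  # assumed selection rule
    xs_sel = [x for x, k in zip(xs, kept) if k]
    ys_sel = [y for y, k in zip(ys, kept) if k]
    return ols_slope(xs, ys), ols_slope(xs_sel, ys_sel)


slope_full, slope_selected = simulate()
print(f"full sample slope:     {slope_full:.3f}")
print(f"selected sample slope: {slope_selected:.3f}")
```

In this setup the slope estimated on the kept observations falls below the slope from the full sample, which is the kind of bias I am worried about when I simply drop non-veterans.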

My idea is to generate a dummy equal to 1 for veterans and 0 for non-veterans, based on the above criteria. Then I would estimate coefficients and standard errors with the heckman command. Is this the right approach? Is it a problem that I am "incidentally truncating" the data myself and then correcting for it? Thanks.