Dear experts,

I have a panel data set with 77 variables and about 57,000 observations for the years 2014-2018. I want to measure the effect of firm size on the effective tax rate (ETR), so I use dummy variables for the independent variable firm size (small, medium, large).

Unfortunately, I am not sure whether I have coded the size classification correctly. Could you please check it?

The classification should correspond to the following:

(1) Small corporations are those that do not exceed at least two of the following three characteristics:

1. Balance sheet total of 6,000,000 euros.
2. Turnover of 12,000,000 euros.
3. 50 employees.

(2) Medium-sized corporations are those which exceed at least two of the three characteristics referred to in subsection (1) and do not exceed at least two of the following three characteristics each:

1. Balance sheet total of 20,000,000 euros.
2. Turnover of 40,000,000 euros.
3. 250 employees.

(3) Large corporations are those which exceed at least two of the three characteristics referred to in subsection (2).
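
As I read it, the rule can be restated as a count of how many thresholds a firm exceeds: small means at most one of the first three thresholds is exceeded, large means at least two of the second three are exceeded, and medium is everything in between. The following is only a minimal sketch on made-up toy data, assuming the variable names turnover, total_assets and employees from my data; complete, n_small, n_medium, size_chk and size_lbl are hypothetical names chosen for illustration.

```stata
* Toy data, made up for illustration only
clear
input double(turnover total_assets employees)
 5000000  3000000   20
50000000 10000000   30
30000000 15000000  100
60000000 25000000  300
20000000  8000000    .
end

* Flag observations where all three criteria are observed
generate byte complete = !missing(turnover, total_assets, employees)

* Number of small and medium thresholds exceeded (left missing if incomplete)
generate byte n_small  = (turnover > 12000000) + (total_assets > 6000000)  + (employees > 50)  if complete
generate byte n_medium = (turnover > 40000000) + (total_assets > 20000000) + (employees > 250) if complete

* inrange() returns 0 for missing counts, so incomplete rows stay unclassified
generate byte size_chk = 1 if inrange(n_small, 0, 1)
replace size_chk = 2 if inrange(n_small, 2, 3) & inrange(n_medium, 0, 1)
replace size_chk = 3 if inrange(n_medium, 2, 3)

label define size_lbl 1 "small" 2 "medium" 3 "large"
label values size_chk size_lbl
tabulate size_chk, missing
```

Observations with a missing criterion are left unclassified on purpose here, because Stata treats a missing value as larger than any number in a > comparison.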

This is how I wrote the code in Stata:

Code:
capture drop size

* small: two of the three small thresholds are not exceeded
generate size = "small firms" if ///
    ((turnover <= 12000000) & (total_assets <= 6000000)) | ///
    ((turnover <= 12000000) & (employees <= 50)) | ///
    ((total_assets <= 6000000) & (employees <= 50))
* medium: two criteria lie between the small and the medium thresholds
replace size = "medium firms" if ///
    ((turnover > 12000000) & (turnover <= 40000000) & (total_assets > 6000000) & (total_assets <= 20000000)) | ///
    ((turnover > 12000000) & (turnover <= 40000000) & (employees > 50) & (employees <= 250)) | ///
    ((total_assets > 6000000) & (total_assets <= 20000000) & (employees > 50) & (employees <= 250))
* large: two of the three medium thresholds are exceeded
replace size = "large firms" if ///
    ((turnover > 40000000) & (total_assets > 20000000)) | ///
    ((turnover > 40000000) & (employees > 250)) | ///
    ((total_assets > 20000000) & (employees > 250))

by size, sort: summarize ETR

encode size, generate (size_new)

label variable size_new  "firm size"

tab size_new

numlabel, add 

tab size_new, gen(firm_size)

describe firm_size*

tab size_new firm_size1
tab size_new firm_size2
tab size_new firm_size3

* encode numbers the labels alphabetically: 1 = large, 2 = medium, 3 = small
rename (firm_size1 firm_size2 firm_size3) (large medium small)

describe large medium small

tabstat ETR, statistics (count mean median sd max min range) by(size_new)

With the classification above, 8,945 observations are not assigned to any size class. Is this because none of the conditions apply to these companies?

Code:
-> firm_size = 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         ETR |      8,945    27.15413    15.14579   4.25e-06   99.85857
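
To see where unassigned observations come from, they could be split into those with a missing criterion and those with complete data that match none of the conditions. This is a minimal sketch on made-up toy data, assuming the same variable names as in my data set; the size values are typed in by hand here rather than computed.

```stata
* Toy data, made up for illustration only; size == "" means unassigned
clear
input double(turnover total_assets employees) str12 size
 5000000  3000000  20 "small firms"
50000000 10000000  30 ""
20000000  3000000   . ""
end

* Unassigned observations with at least one missing criterion
count if size == "" & missing(turnover, total_assets, employees)

* Unassigned observations with complete data that match no condition
count if size == "" & !missing(turnover, total_assets, employees)
```

If the second count is greater than zero, some complete observations fall through every condition, which would point to a gap in the conditions rather than to missing data.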

Many thanks.