Hello everyone,

I am analyzing a large dataset in Stata, using OLS regression to understand how the median age of the workers in an industry and the median number of workers per firm in an industry affect the percentage of employees in the workforce.

I started with a longitudinal dataset of over 20 million rows. After collapsing by year and Industry_ID, I obtained a dataset like the one below. Industry_ID is the unique ID of each industry, age_median is the median age of the workers in the industry in each year, nemp_median is the median number of workers per firm in the industry in each year, and Percentage_employees_per_industry is the ratio of employees to total workers in the industry in each year. This is only an excerpt, as the full dataset has 107 different industry IDs:
Industry_ID year age_median nemp_median Percentage_employees_per_industry
1 2007 43 7 6.1584667
1 2008 43 7 6.3488696
1 2009 43 7 6.6453313
1 2010 43 8 6.0971645
1 2011 43 8 6.8209944
1 2012 43 8 7.2148137
1 2013 43 9 7.6716896
1 2014 43 8 7.7114815
1 2015 42 9 7.9195938
1 2016 42 10 7.8262298
1 2017 43 10 7.8576314
1 2018 42 10 7.3446328
1 2019 41 12 6.9016757
2 2007 39 7 10.932619
2 2008 39 7 11.627907
2 2009 40 7 11.778952
2 2010 40 7 9.8929845
2 2011 40 7 10.824859
2 2012 41 7 10.758377
2 2013 41 8 10.984848
2 2014 41 8 11.038062
2 2015 42 8 10.876434
2 2016 43 8 10.933797
2 2017 43 8 11.466373
2 2018 43 8 11.333044
2 2019 43 8 11.588974
3 2007 45 13 7.722245
3 2008 45 14 8.2989884
3 2009 46 12 9.1343025
3 2010 47 14 6.7039106
3 2011 47 14 5.7249712
3 2012 47 14 6.2974417
3 2013 47 14 6.9543705
3 2014 47 11 7.165838
3 2015 47 14 7.0180229
3 2016 48 13 6.8453171
3 2017 47.5 13 7.038961
3 2018 47 14 5.8881016
3 2019 46 15 5.5490517
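
A minimal sketch of a collapse of this kind, run on toy data rather than on the real file. The worker-level variable names (age, nemp, is_employee) and the short name pct_employees are assumptions for illustration only, and the (count) statistic is added so that the number of workers behind each industry-year row is kept alongside the medians:

```stata
* Toy worker-level data: 3 industries, 100 workers each, spread over 3 years.
* All variable names and values here are hypothetical placeholders.
clear
set seed 12345
set obs 300
gen Industry_ID = ceil(_n/100)
gen year = 2007 + mod(_n, 3)
gen age = 20 + floor(45*runiform())
gen nemp = 1 + floor(30*runiform())
gen byte is_employee = runiform() < 0.1

* One row per industry-year: medians, share of employees, and cell size.
collapse (median) age_median = age nemp_median = nemp ///
         (mean) share_emp = is_employee               ///
         (count) n_workers = age, by(Industry_ID year)

* Express the share as a percentage, as in the table above.
gen pct_employees = 100*share_emp
list, sepby(Industry_ID)
```

In this sketch n_workers counts the non-missing values of age in each cell, so it equals the number of workers only if age is never missing.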

The command I am using to run the OLS is:

reg Percentage_employees_per_industry age_median nemp_median i.year

Everything runs without problems, and all coefficients are statistically significant.

My question concerns what collapse does to produce the dataset above, and whether I need to weight the data before analyzing it. The industries (given by Industry_ID) are very heterogeneous in the number of workers and employees they have: Industry 1 has 2001 workers in 2007, 2095 in 2008 and 2118 in 2009; Industry 2 has 170 workers in 2007, 171 in 2008 and 166 in 2009; Industry 3 has 11021 workers in 2007, 12111 in 2008 and 14206 in 2009.

Given this, I suspect that running the OLS as it stands is not quite correct. What I am trying to ascertain is whether the age of the workers in an industry and the number of workers per firm affect the percentage of employees in the workforce. Given the heterogeneity of the industries, should I not apply some form of weighting to this regression? As it is, each value of Percentage_employees_per_industry has the same impact on the OLS regression as every other. In fact, some industries account for nearly 10% of all workers (and employers) in the workforce, while others account for less than 0.01%. The collapse command, however, does not take this into account: it produces one row for industry 1 in 2007 with one value of Percentage_employees_per_industry, and does exactly the same for industry 3, even though industry 3 has more than five times as many workers (11021 versus 2001 in 2007).
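
To make this concrete, the kind of weighting I have in mind is sketched below, using only the nine industry-year rows for which I quoted worker counts above. The variable n_workers and the short name pct_employees (standing in for Percentage_employees_per_industry) are assumptions for illustration, and I am not sure that analytic weights are the right choice here:

```stata
* The nine industry-year rows (2007-2009) with the worker counts quoted above.
* n_workers and pct_employees are hypothetical names used for this sketch.
clear
input Industry_ID year age_median nemp_median pct_employees n_workers
1 2007 43  7  6.1584667  2001
1 2008 43  7  6.3488696  2095
1 2009 43  7  6.6453313  2118
2 2007 39  7 10.932619    170
2 2008 39  7 11.627907    171
2 2009 40  7 11.778952    166
3 2007 45 13  7.722245  11021
3 2008 45 14  8.2989884 12111
3 2009 46 12  9.1343025 14206
end

* Unweighted: every industry-year row counts equally.
regress pct_employees age_median nemp_median i.year

* Weighted: each row counts in proportion to its number of workers.
regress pct_employees age_median nemp_median i.year [aweight = n_workers]
```

With [aweight = n_workers], industry 3 would dominate the fit and industry 2 would contribute very little, which may or may not be what I want depending on whether the question is about the typical industry or the typical worker.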

After this long explanation, what should I do? Is there a way to assign weights in this procedure? Is this even necessary, or am I misinterpreting the problem?

Thank you very much. Any help will be much appreciated!