I am analyzing a large dataset in Stata and using OLS regression to try to understand how the median age of the workers in an industry and the median number of workers per firm in an industry affect the percentage of employees in the workforce.
I started with a longitudinal dataset of over 20 million rows. After collapsing by year and Industry_ID, I obtained a dataset that looks like the following. Industry_ID is the unique ID of each industry; age_median is the median age of the workers in the industry, per year; nemp_median is the median number of workers in the firms of each industry, per year; and Percentage_employees_per_industry is the ratio of employees to total workers in an industry, per year. This is only part of the full dataset, which has 107 different industry IDs, but the rest looks the same.
Industry_ID | year | age_median | nemp_median | Percentage_employees_per_industry |
1 | 2007 | 43 | 7 | 6.1584667 |
1 | 2008 | 43 | 7 | 6.3488696 |
1 | 2009 | 43 | 7 | 6.6453313 |
1 | 2010 | 43 | 8 | 6.0971645 |
1 | 2011 | 43 | 8 | 6.8209944 |
1 | 2012 | 43 | 8 | 7.2148137 |
1 | 2013 | 43 | 9 | 7.6716896 |
1 | 2014 | 43 | 8 | 7.7114815 |
1 | 2015 | 42 | 9 | 7.9195938 |
1 | 2016 | 42 | 10 | 7.8262298 |
1 | 2017 | 43 | 10 | 7.8576314 |
1 | 2018 | 42 | 10 | 7.3446328 |
1 | 2019 | 41 | 12 | 6.9016757 |
2 | 2007 | 39 | 7 | 10.932619 |
2 | 2008 | 39 | 7 | 11.627907 |
2 | 2009 | 40 | 7 | 11.778952 |
2 | 2010 | 40 | 7 | 9.8929845 |
2 | 2011 | 40 | 7 | 10.824859 |
2 | 2012 | 41 | 7 | 10.758377 |
2 | 2013 | 41 | 8 | 10.984848 |
2 | 2014 | 41 | 8 | 11.038062 |
2 | 2015 | 42 | 8 | 10.876434 |
2 | 2016 | 43 | 8 | 10.933797 |
2 | 2017 | 43 | 8 | 11.466373 |
2 | 2018 | 43 | 8 | 11.333044 |
2 | 2019 | 43 | 8 | 11.588974 |
3 | 2007 | 45 | 13 | 7.722245 |
3 | 2008 | 45 | 14 | 8.2989884 |
3 | 2009 | 46 | 12 | 9.1343025 |
3 | 2010 | 47 | 14 | 6.7039106 |
3 | 2011 | 47 | 14 | 5.7249712 |
3 | 2012 | 47 | 14 | 6.2974417 |
3 | 2013 | 47 | 14 | 6.9543705 |
3 | 2014 | 47 | 11 | 7.165838 |
3 | 2015 | 47 | 14 | 7.0180229 |
3 | 2016 | 48 | 13 | 6.8453171 |
3 | 2017 | 47.5 | 13 | 7.038961 |
3 | 2018 | 47 | 14 | 5.8881016 |
3 | 2019 | 46 | 15 | 5.5490517 |
The line of code I am using to run the OLS is: reg Percentage_employees_per_industry age_median nemp_median i.year
This runs without any issues, and all coefficients are statistically significant.
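For context, a collapse of this kind can be sketched as follows. This is only an illustration: the worker-level variable names (age, nemp, is_employee) are hypothetical placeholders rather than the actual names in my data, and n_workers is an extra cell count that is not part of the table above.

```stata
* Sketch only: age, nemp and is_employee are hypothetical worker-level variables
* (is_employee is assumed to be a 0/1 indicator).
collapse (median) age_median = age nemp_median = nemp             ///
         (mean)   Percentage_employees_per_industry = is_employee ///
         (count)  n_workers = age,                                ///
         by(Industry_ID year)

* Convert the share of employees into a percentage
replace Percentage_employees_per_industry = 100 * Percentage_employees_per_industry
```

The (count) statistic records how many non-missing observations went into each Industry_ID-year cell, so the collapsed data would still carry the size of each industry.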
My question is about what collapse does to produce the dataset above, and whether I need to weight the data in order to analyze it. The industries (given by Industry_ID) are very heterogeneous in the number of workers and the number of employees they have:

Industry 1 has 2001 workers in 2007, 2095 workers in 2008 and 2118 workers in 2009.
Industry 2 has 170 workers in 2007, 171 workers in 2008 and 166 workers in 2009.
Industry 3 has 11021 workers in 2007, 12111 workers in 2008 and 14206 workers in 2009.

Given this, I suspect that running the OLS just as it is may not be exactly correct. What I am trying to ascertain is whether the age of the workers in an industry and the number of workers per firm affect the percentage of employees in the workforce. But, given the heterogeneity of the industries, should I not apply some form of weighting to this regression? As it stands, each value of Percentage_employees_per_industry has the same impact on the OLS regression as all the others. In fact, some industries account for nearly 10% of the total workers (and employers) in the workforce, while others account for less than 0.01%.

The collapse command, however, does not take this into account. It produces one row for industry 1 in 2007, with one value of Percentage_employees_per_industry, and does exactly the same for industry 3, even though industry 3 is more than five times larger in that year (11021 workers versus 2001).
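To make concrete what I mean by weighting, here is a minimal sketch of the comparison I have in mind. It assumes the collapsed data contain a variable n_workers holding the number of workers in each Industry_ID-year cell (a hypothetical name; that variable is not in the table above), and it uses Stata's analytic weights, which I am not sure are the right choice here.

```stata
* Unweighted regression: every industry-year row counts equally
regress Percentage_employees_per_industry age_median nemp_median i.year
estimates store unweighted

* Weighted regression: n_workers is a hypothetical cell-size variable,
* used here as an analytic weight so larger industries count for more
regress Percentage_employees_per_industry age_median nemp_median i.year [aweight = n_workers]
estimates store weighted

* Coefficients and standard errors of both models side by side
estimates table unweighted weighted, b se
```

If the two sets of coefficients differ a lot, I assume that would indicate the results are being driven by the many small industries.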
After this long explanation, what should I do? Is there a method for assigning weights in this procedure? Is this even necessary, or am I misinterpreting the problem?
Thank you very much, any help will be much appreciated!