I'm a first year grad student that is fairly new to Stata. I'm working on a project in which I have to clean and analyze a secondary dataset found online. The dataset I am using is Cook County, IL sentencing data. The link to the data is: https://datacatalog.cookcountyil.gov...tg8v-tm6u/data.
The dataset has 221,170 observations, with each observation being a separate charge. Multiple charges can belong to the same court case, so multiple rows can pertain to the same value for the case_id variable.There are 37 variables, but many are repetitive for the analysis I will be doing (e.g. arrest date/arraignment date/sentencing date when I only need the year). I do not need to use all of them. Essentially the data look something like this:
Charge_Id | Case_Id | Offense Category | Race | Age | Gender | Charge Count |
1 | 1 | Narcotics | White | 27 | Male | 2 |
2 | 1 | Narcotics | White | 27 | Male | 1 |
3 | 1 | Robbery | White | 27 | Male | 1 |
4 | 2 | Retail Theft | Black | 33 | Female | 3 |
5 | 2 | Burglary | Black | 33 | Female | 1 |
6 | 3 | Homicide | Hispanic | 46 | Male | 1 |
7 | 4 | Battery | White | 22 | Female | 1 |
My question is: "Does the total amount of charges that a defendant receives on average vary by race?" My unit of analysis would be case_id, in other words each individual court case possibly containing multiple charges and offense categories. My DV would be total charge count, and my IV would be defendant race. I would then control for variables such as: offense category, law enforcement agency that made the arrest, age of defendant, gender, etc. The dataset is made entirely of string variables. So I destrung the ones I will use, and dummied out the non-numeric ones such as gender and race.
Where I am having difficulty is aggregating the data so that each row is a single case ID. As I said, each row represents a charge that is filed, and each charge filed could contain multiple counts. So one row might be the charge of robbery, but you were charged on 3 counts of it. Multiple rows, or charges, can pertain to a single case_id number. Therefore, each case_id number can contain charges from different offense categories (e.g. narcotics and robbery).
In an ideal world, my data would look something like this:
Case_Id | Offense Category | Race | Age | Gender | Charge Count |
1 | Narcotics, Robbery | White | 27 | Male | 4 |
2 | Retail theft, Burglary | Black | 33 | Female | 4 |
3 | Homicide | Hispanic | 46 | Male | 1 |
4 | Battery | White | 22 | Female | 1 |
What I have tried is collapsing by case_id, but that seems to mess with some of these data values. For example, I coded race as white=0; black=1; hispanic=2, But some of the values are showing up as decimals. Because all of the values should be the same for the same case_id, I figured taking the mean value would ensure the values stayed the same. I suspect missing values or errors in the recording are responsible for the values changing when I collapse. For reference, here is my code after destringing and/or dummying out the variables:
Code:
collapse (mean) sent_phase (mean) age (mean) sex ///
(mean) def_race (sum) charge_count (mean) arrest_year, by(case_id)
Thank you in advance for your time.
0 Response to Difficulty aggregating variables to the unit of analysis without losing or altering data
Post a Comment