Hello,

I'm a first year grad student that is fairly new to Stata. I'm working on a project in which I have to clean and analyze a secondary dataset found online. The dataset I am using is Cook County, IL sentencing data. The link to the data is: https://datacatalog.cookcountyil.gov...tg8v-tm6u/data.

The dataset has 221,170 observations, with each observation being a separate charge. Multiple charges can belong to the same court case, so multiple rows can pertain to the same value for the case_id variable.There are 37 variables, but many are repetitive for the analysis I will be doing (e.g. arrest date/arraignment date/sentencing date when I only need the year). I do not need to use all of them. Essentially the data look something like this:

Charge_Id Case_Id Offense Category Race Age Gender Charge Count
1 1 Narcotics White 27 Male 2
2 1 Narcotics White 27 Male 1
3 1 Robbery White 27 Male 1
4 2 Retail Theft Black 33 Female 3
5 2 Burglary Black 33 Female 1
6 3 Homicide Hispanic 46 Male 1
7 4 Battery White 22 Female 1

My question is: "Does the total amount of charges that a defendant receives on average vary by race?" My unit of analysis would be case_id, in other words each individual court case possibly containing multiple charges and offense categories. My DV would be total charge count, and my IV would be defendant race. I would then control for variables such as: offense category, law enforcement agency that made the arrest, age of defendant, gender, etc. The dataset is made entirely of string variables. So I destrung the ones I will use, and dummied out the non-numeric ones such as gender and race.

Where I am having difficulty is aggregating the data so that each row is a single case ID. As I said, each row represents a charge that is filed, and each charge filed could contain multiple counts. So one row might be the charge of robbery, but you were charged on 3 counts of it. Multiple rows, or charges, can pertain to a single case_id number. Therefore, each case_id number can contain charges from different offense categories (e.g. narcotics and robbery).

In an ideal world, my data would look something like this:
Case_Id Offense Category Race Age Gender Charge Count
1 Narcotics, Robbery White 27 Male 4
2 Retail theft, Burglary Black 33 Female 4
3 Homicide Hispanic 46 Male 1
4 Battery White 22 Female 1
My main issue is aggregating the data to the case_id level without altering the values for age, race, gender, etc. I am also unsure if it is possible to combine the different offense categories from multiple charges into one variable that contains them all.

What I have tried is collapsing by case_id, but that seems to mess with some of these data values. For example, I coded race as white=0; black=1; hispanic=2, But some of the values are showing up as decimals. Because all of the values should be the same for the same case_id, I figured taking the mean value would ensure the values stayed the same. I suspect missing values or errors in the recording are responsible for the values changing when I collapse. For reference, here is my code after destringing and/or dummying out the variables:

Code:
collapse (mean) sent_phase (mean) age (mean) sex ///
 (mean) def_race (sum) charge_count (mean) arrest_year, by(case_id)
This method does give me the total charge count I want, but at the expense of altering other variables which is obviously a serious issue. I am pretty stuck with this right now and am not sure exactly the best way to proceed. Should I be approaching the collapse command differently, or not be using collapse at all to solve this issue?

Thank you in advance for your time.