Difficulty aggregating variables to the unit of analysis without losing or altering data

Hello,

I'm a first year grad student that is fairly new to Stata. I'm working on a project in which I have to clean and analyze a secondary dataset found online. The dataset I am using is Cook County, IL sentencing data. The link to the data is: https://datacatalog.cookcountyil.gov...tg8v-tm6u/data.

The dataset has 221,170 observations, with each observation being a separate charge. Multiple charges can belong to the same court case, so multiple rows can pertain to the same value for the case_id variable.There are 37 variables, but many are repetitive for the analysis I will be doing (e.g. arrest date/arraignment date/sentencing date when I only need the year). I do not need to use all of them. Essentially the data look something like this:

Charge_Id	Case_Id	Offense Category	Race	Age	Gender	Charge Count
1	1	Narcotics	White	27	Male	2
2	1	Narcotics	White	27	Male	1
3	1	Robbery	White	27	Male	1
4	2	Retail Theft	Black	33	Female	3
5	2	Burglary	Black	33	Female	1
6	3	Homicide	Hispanic	46	Male	1
7	4	Battery	White	22	Female	1

My question is: "Does the total amount of charges that a defendant receives on average vary by race?" My unit of analysis would be case_id, in other words each individual court case possibly containing multiple charges and offense categories. My DV would be total charge count, and my IV would be defendant race. I would then control for variables such as: offense category, law enforcement agency that made the arrest, age of defendant, gender, etc. The dataset is made entirely of string variables. So I destrung the ones I will use, and dummied out the non-numeric ones such as gender and race.

Where I am having difficulty is aggregating the data so that each row is a single case ID. As I said, each row represents a charge that is filed, and each charge filed could contain multiple counts. So one row might be the charge of robbery, but you were charged on 3 counts of it. Multiple rows, or charges, can pertain to a single case_id number. Therefore, each case_id number can contain charges from different offense categories (e.g. narcotics and robbery).

In an ideal world, my data would look something like this:

Case_Id	Offense Category	Race	Age	Gender	Charge Count
1	Narcotics, Robbery	White	27	Male	4
2	Retail theft, Burglary	Black	33	Female	4
3	Homicide	Hispanic	46	Male	1
4	Battery	White	22	Female	1

My main issue is aggregating the data to the case_id level without altering the values for age, race, gender, etc. I am also unsure if it is possible to combine the different offense categories from multiple charges into one variable that contains them all.

What I have tried is collapsing by case_id, but that seems to mess with some of these data values. For example, I coded race as white=0; black=1; hispanic=2, But some of the values are showing up as decimals. Because all of the values should be the same for the same case_id, I figured taking the mean value would ensure the values stayed the same. I suspect missing values or errors in the recording are responsible for the values changing when I collapse. For reference, here is my code after destringing and/or dummying out the variables:

Code:

collapse (mean) sent_phase (mean) age (mean) sex ///
 (mean) def_race (sum) charge_count (mean) arrest_year, by(case_id)

This method does give me the total charge count I want, but at the expense of altering other variables which is obviously a serious issue. I am pretty stuck with this right now and am not sure exactly the best way to proceed. Should I be approaching the collapse command differently, or not be using collapse at all to solve this issue?

Thank you in advance for your time.

BJ Data Tech Solution

Home / Data Cleaning / Data management / Data Processing / Difficulty aggregating variables to the unit of analysis without losing or altering data
Difficulty aggregating variables to the unit of analysis without losing or altering data

0 Response to Difficulty aggregating variables to the unit of analysis without losing or altering data

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Difficulty aggregating variables to the unit of analysis without losing or altering data Difficulty aggregating variables to the unit of analysis without losing or altering data

Related Posts with Difficulty aggregating variables to the unit of analysis without losing or altering data

0 Response to Difficulty aggregating variables to the unit of analysis without losing or altering data

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Difficulty aggregating variables to the unit of analysis without losing or altering data
Difficulty aggregating variables to the unit of analysis without losing or altering data