I have run into what looks like a bug in Stata 17 MP. My dataset has hundreds of millions of observations and tens of millions of unique ID strings (six letters, hashed). I want to convert the ID string variable into a numeric variable because my aim is to run a panel regression, and xtset requires the panel variable to be numeric. If one tries
Code:
encode id, g(id_num)
Stata tells you that there are too many values. I tried
Code:
egen id_num = group(id)
instead, but after running
Code:
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
tab n, miss

by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
I see different ids mapped to the same id_num!
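One hypothesis I cannot yet confirm: egen with group() creates a float variable by default, and a 32-bit float cannot distinguish consecutive integers above 2^24 (about 16.7 million), which is exactly the scale of my ID count. A minimal Python sketch of that precision limit:

```python
import numpy as np

# float32 has a 24-bit significand: integers above 2**24 get rounded
a = np.float32(16_777_216)   # 2**24, exactly representable
b = np.float32(16_777_217)   # 2**24 + 1, NOT representable, rounds down
print(a == b)                # True: two distinct group numbers collapse into one
```

If this is what is happening, it is a storage-type issue rather than a flaw in group() itself.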

A colleague recommended replacing the group() function with an explicit cumulative sum:
Code:
cap drop id_num n n_max
by id, sort: gen id_num = 1 if _n == 1
replace id_num = sum(id_num)      // running sum: numbers the groups in order
replace id_num = . if missing(id)
format id_num %12.0f

by id_num tax_year, sort: gen n = _n
tab n, miss

by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
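My guess (unverified) is that the replacement shares the same weakness: gen without an explicit type also defaults to float, and a float running sum stops increasing once it reaches 2^24, so every later ID would receive the same number. A minimal sketch of the stall:

```python
import numpy as np

total = np.float32(2 ** 24)    # running sum has reached 16,777,216
nxt = total + np.float32(1)    # add the next group's 1
print(nxt == total)            # True: the increment is lost to rounding
```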
However, I get the same problem. Can anyone confirm this with a test dataset? I cannot share the data I am working on, as it is sensitive.
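If it helps, a synthetic panel along these lines should reproduce the setup. This is only a sketch: the variable names (id, tax_year) and the output file name are placeholders matching my real data, and n_ids should be raised above 16,777,216 for a full-scale test.

```python
import numpy as np
import pandas as pd

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def make_test_panel(n_ids, years=(2019, 2020)):
    # Base-26 encode each index into six letters, so every ID is unique
    ids = ["".join(LETTERS[(i // 26**k) % 26] for k in range(6))
           for i in range(n_ids)]
    return pd.DataFrame({
        "id": np.repeat(ids, len(years)),   # each ID appears once per year
        "tax_year": np.tile(years, n_ids),
    })

# A small value here just shows the shape of the data; use tens of
# millions of IDs (slow to build this way) to reproduce the problem.
panel = make_test_panel(100)
panel.to_stata("test_panel.dta", write_index=False)
```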

If this is a problem with Stata itself, it is probably due to the vast number of unique IDs. Is this the correct platform for providing feedback to StataCorp about a shortcoming in the program?

In Python, the following works (after which I need to re-save the result as a .dta file so that xtset works properly):
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ip = pd.read_stata('Income_panel.dta')
print(ip.columns)
# LabelEncoder maps each unique string to a consecutive integer
ip['id_num'] = LabelEncoder().fit_transform(ip['id'])
ip.to_stata('Income_panel_Python.dta')
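For completeness, pandas alone can do the same mapping: factorize() returns 64-bit integer codes (ordered by first appearance rather than sorted, unlike LabelEncoder), so no separate scikit-learn step is needed. A small sketch with toy data:

```python
import pandas as pd

ip = pd.DataFrame({"id": ["abcdef", "ghijkl", "abcdef", "mnopqr"]})
# factorize returns int64 codes plus the array of unique labels
ip["id_num"], uniques = pd.factorize(ip["id"])
print(ip["id_num"].tolist())   # [0, 1, 0, 2]
```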