I have run into what looks like a bug in Stata 17 MP. My dataset has hundreds of millions of observations and tens of millions of unique ID strings (six letters, hashed). I want to convert the ID string variable into a numeric variable because my aim is to run a panel regression, and xtset requires the panel variable to be numeric. If one tries
Code:
encode id, g(id_num)
Stata tells you that there are too many values. I tried
Code:
egen id_num = group(id)
instead, but after running
Code:
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
tab n, miss

by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
I see different ids mapped to the same id_num!
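One hypothesis I cannot yet confirm: egen with group() creates a float variable by default, and a 32-bit float cannot distinguish consecutive integers above 2^24 (about 16.7 million), which is exactly the scale of my ID count. A minimal Python sketch of that precision limit:

```python
import numpy as np

# float32 has a 24-bit significand: integers above 2**24 get rounded
a = np.float32(16_777_216)   # 2**24, exactly representable
b = np.float32(16_777_217)   # 2**24 + 1, NOT representable, rounds down
print(a == b)                # True: two distinct group numbers collapse into one
```

If this is what is happening, it is a storage-type issue rather than a flaw in group() itself.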

A colleague recommended replacing the group() function with an explicit cumulative sum:
Code:
cap drop id_num n n_max
by id, sort: gen id_num = 1 if _n == 1
replace id_num = sum(id_num)      // running sum: numbers the groups in order
replace id_num = . if missing(id)
format id_num %12.0f

by id_num tax_year, sort: gen n = _n
tab n, miss

by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
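My guess (unverified) is that the replacement shares the same weakness: gen without an explicit type also defaults to float, and a float running sum stops increasing once it reaches 2^24, so every later ID would receive the same number. A minimal sketch of the stall:

```python
import numpy as np

total = np.float32(2 ** 24)    # running sum has reached 16,777,216
nxt = total + np.float32(1)    # add the next group's 1
print(nxt == total)            # True: the increment is lost to rounding
```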
However, I get the same problem. Can anyone confirm this with a test dataset? I cannot share the data I am working on, as it is sensitive.
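If it helps, a synthetic panel along these lines should reproduce the setup. This is only a sketch: the variable names (id, tax_year) and the output file name are placeholders matching my real data, and n_ids should be raised above 16,777,216 for a full-scale test.

```python
import numpy as np
import pandas as pd

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def make_test_panel(n_ids, years=(2019, 2020)):
    # Base-26 encode each index into six letters, so every ID is unique
    ids = ["".join(LETTERS[(i // 26**k) % 26] for k in range(6))
           for i in range(n_ids)]
    return pd.DataFrame({
        "id": np.repeat(ids, len(years)),   # each ID appears once per year
        "tax_year": np.tile(years, n_ids),
    })

# A small value here just shows the shape of the data; use tens of
# millions of IDs (slow to build this way) to reproduce the problem.
panel = make_test_panel(100)
panel.to_stata("test_panel.dta", write_index=False)
```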

If this is a problem with Stata itself, it is probably due to the vast number of unique IDs. Is this the correct platform for providing feedback to StataCorp about a shortcoming in the program?

In Python, the following works (after which I need to re-save the result as a .dta file so that xtset works properly):
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ip = pd.read_stata('Income_panel.dta')
print(ip.columns)
# LabelEncoder maps each unique string to a consecutive integer
ip['id_num'] = LabelEncoder().fit_transform(ip['id'])
ip.to_stata('Income_panel_Python.dta')
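For completeness, pandas alone can do the same mapping: factorize() returns 64-bit integer codes (ordered by first appearance rather than sorted, unlike LabelEncoder), so no separate scikit-learn step is needed. A small sketch with toy data:

```python
import pandas as pd

ip = pd.DataFrame({"id": ["abcdef", "ghijkl", "abcdef", "mnopqr"]})
# factorize returns int64 codes plus the array of unique labels
ip["id_num"], uniques = pd.factorize(ip["id"])
print(ip["id_num"].tolist())   # [0, 1, 0, 2]
```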