Code:
encode id, g(id_num)   // note: encode is documented to allow at most 65,536 distinct values
Code:
egen id_num = group(id)
Code:
format id_num %12.0f
by id_num tax_year, sort: gen n = _n      // count observations per id_num-year cell
tab n, miss
by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1  // distinct ids sharing one id_num
A colleague recommended replacing the group() function with a slower, step-by-step equivalent:
Code:
cap drop id_num n n_max
by id, sort: gen id_num = 1 if _n==1
replace id_num = sum(id_num)          // cumulative sum
replace id_num = . if missing(id)
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
tab n, miss
by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
If this is a problem with the Stata program, it is probably due to the vast number of unique IDs. Is this the correct platform to provide feedback to StataCorp about a shortcoming in the program?
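One possibility I cannot rule out (an assumption on my part, not anything confirmed by StataCorp) is that the culprit is the default storage type rather than the number of IDs as such: egen and gen create float variables by default, and a float stores integers exactly only up to 2^24 = 16,777,216, so with tens of millions of groups distinct IDs could be rounded to the same code. A minimal sketch of that hypothesis, requesting a long type instead:

Code:
* Sketch, assuming the collisions come from the default float type
* (exact only up to 2^24 = 16,777,216; the same limit would apply to
* the gen-based workaround above). A long holds integers to ~2.1 billion.
cap drop id_num n n_max
egen long id_num = group(id)
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1   // should now show nothing

If that is the explanation, the same rerun with a double storage type should also come back clean.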
In Python, the following works (after which I need to re-save the data in Stata format so that xtset works properly):
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder as le

ip = pd.read_stata('Income_panel.dta')
print(ip.columns)
ip['id_num'] = le().fit_transform(ip['id'])
ip.to_stata('Income_panel_Python.dta')
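Back in Stata, the re-load step is then just the following (a sketch; tax_year as the panel time variable is assumed from the diagnostic code above):

Code:
* Load the file written by Python and declare the panel structure.
use Income_panel_Python.dta, clear
xtset id_num tax_year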