I am having troubles in "cleaning" my database, whose initial size is 22GBs (71 millions rows and 9 variables - 7 of which are strL variables).
I first run this code to detect which observation of my string variables are made of a "crazy high number" of characters.
Code:
* Tabulate number of characters of each obs for each string variable: foreach var of varlist _all { local variable_type: type `var' if substr("`variable_type'",1,3)=="str" { display in smcl as text "`var'" gen long l_var = length(`var') if !missing(`var') tab1 l_var, m drop l_var } }
The presence of those few "crazy" observations force Stata to store that variable in strL format, even if the most of the observations could have been stored with a less heavy format, say str30, thus saving disk space and boosting efficiency.
Therefore, I run the following code to detect and drop those observations which are "crazy", that is, whose number of characters exceed a certain arbitrary threshold.
Code:
gen long l_var = length(var1) if !missing(var1) drop if l_var>85 & l_var !=. gen str85 X = var1 drop var1 l_var rename X var1
- The database now weighs more than the initial one, namely 27GBs.
- The database rather than having missing values has empty cells.
Any advice and help is much appreciated.
Cordially,
Edoardo
0 Response to Reducing the size of your database.
Post a Comment