Dear all,
I am having troubles in "cleaning" my database, whose initial size is 22GBs (71 millions rows and 9 variables - 7 of which are strL variables).
I first run this code to detect which observation of my string variables are made of a "crazy high number" of characters.

Code:
* Tabulate number of characters of each obs for each string variable:
foreach var of varlist _all {
    local variable_type: type `var'
    if substr("`variable_type'",1,3)=="str" {
        
        display in smcl as text "`var'"
        gen long l_var = length(`var') if !missing(`var')
        tab1 l_var, m
        drop l_var
    }
}
From the output of the above code, I learn that only very few observations per strL variable are made of a "crazy high number" of characters. For instance, for one of my string variables, 99.9999999% of its observations have less than 30 characters, while the remaining ones have 2000 characters).
The presence of those few "crazy" observations force Stata to store that variable in strL format, even if the most of the observations could have been stored with a less heavy format, say str30, thus saving disk space and boosting efficiency.

Therefore, I run the following code to detect and drop those observations which are "crazy", that is, whose number of characters exceed a certain arbitrary threshold.
Code:
gen long l_var = length(var1) if !missing(var1)
drop if l_var>85 & l_var !=.
gen str85 X = var1
drop var1 l_var
rename X var1
Then I save my "cleaned" database. Outcome:
  1. The database now weighs more than the initial one, namely 27GBs.
  2. The database rather than having missing values has empty cells.
Anybody knows what is going on?

Any advice and help is much appreciated.
Cordially,
Edoardo