Dear Statalist,

I hope this finds you well and rested after the weekend.
I've recently been trying to generate a Stata dataset from SPSS files (.DAT and .SPS). The files are generated by a database package called OpenClinica, and this is the code I run (please note that it also needs -ingap.ado- installed).

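* Turn on -pause- so the checkpoints below actually halt the do-file
pause on

* Generate Variable labels:
* (assumes exactly one .sps file in the current directory)
local SpsFile : dir . files "*.sps", respectcase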
insheet using `SpsFile', delimiter(" ") clear

gen Keep = 1 if v1=="VARIABLE" & v2=="LABELS"
replace Keep=0 if v1=="VALUE" & v2=="LABELS"
replace Keep=Keep[_n-1] if Keep==.
keep if Keep==1
keep if v3=="/" | v4=="/"
drop if v1=="VARIABLE" & v2=="LABELS"

capture noisily assert _N==0
if _rc==0 {
set obs 1
gen CodeVar = "* No Variable Labels defined, or error in do file, please check"
di as err "No Variable Labels defined, or error in do file, please check"
pause
}

capture gen CodeVar=""
replace v3=v2 if v3=="/"
replace CodeVar= "lab var " + v1 +`" ""' + v3 + `"""'
list v1 v2 v3 v4 CodeVar
keep CodeVar
ingap
replace CodeVar=`"* Generate Variable labels from `SpsFile' "' in 1
save VariableLabels.dta, replace

* Generate value labels:
insheet using `SpsFile', delimiter(" ") clear
gen Keep = 1 if v1=="VARIABLE" & v2=="LABELS"
replace Keep=0 if v1=="VALUE" & v2=="LABELS"
replace Keep=Keep[_n-1] if Keep==.
keep if Keep==0
drop if v1=="VALUE" & v2=="LABELS"
drop if v1=="."
drop if v1=="EXECUTE."

capture noisily assert _N==0
if _rc==0 {
set obs 1
gen CodeVar = "* No Value Labels defined, or error in do file, please check"
di as err "No Value Labels defined, or error in do file, please check"
pause
}

capture gen CodeVar=""

gen Var=v1 if v2=="" & v1~="/"
replace Var=Var[_n-1] if Var==""

replace CodeVar="label define " + v1 if v2=="" & v1~="/" & CodeVar==""
drop if CodeVar=="label define "
gen Quot=`"""'
replace CodeVar= " " + v1 + " " + Quot + v2 + Quot if CodeVar==""
replace CodeVar= "; " + "label values " + Var + " " + Var + " " + ";" if v1=="/"
ingap
replace CodeVar="#delimit ;" in 1

ingap -1, after
replace CodeVar="#delimit cr ;" in l

list v1 v2 CodeVar, sepby(Var)

keep CodeVar
ingap
replace CodeVar=`"* Generate Value Labels from `SpsFile' "' in 1
save ValueLabels.dta, replace


clear

use VariableLabels.dta
gen VarLab=1
append using ValueLabels.dta
gen ValLabNum=_n if strmatch(CodeVar, "*label define *")
replace ValLabNum=ValLabNum[_n-1] if ValLabNum==.

capture erase VariableLabels.dta
capture erase ValueLabels.dta

* Get rid of prefixed $ symbol in varnames
replace CodeVar=subinstr(CodeVar, "v$", "v", .)

* Get rid of ' quotation mark around numerical codes
* (below needs changing if you have coded variables <-1000 or >1000)
forvalues Num = -1000/1000 {
replace CodeVar=subinstr(CodeVar, "'`Num''", "`Num'", .)
}

* Implement Dataset specific label changes below if required:

replace CodeVar=subinstr(CodeVar, "InterviewDateE", "InterviewDate_E", .)

format CodeVar %-20s
* Visual inspection:
list CodeVar if VarLab==1
pause Please check whether variable labelling commands look ok!

list CodeVar if VarLab==., sepby(ValLabNum)
pause Please check whether value labelling commands look ok!


drop VarLab ValLabNum
outfile using "LabVarsAndValues.do", noquote replace

clear
local DatFile : dir . files "*.dat", respectcase
insheet using `DatFile', clear case

* Implement Dataset specific Variable name changes below if required:



do "LabVarsAndValues.do"

local StataFile=subinstr(`DatFile', ".dat", ".dta",1)
save "`StataFile'", replace



Between generating the variable and value labels and importing the data, I'm unsure how best to deal with non-evaluable entries, e.g. an "UNKNOWN" string in a date or binary field. Would it be better to handle this when the variable and value labels are generated (i.e. add a code such as 5 "UNK" to the value labels), or to replace all "UNK" fields in the data?
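
For what it's worth, here is a minimal sketch of the second option (replacing the "UNK" fields), run on made-up data rather than the real export. It assumes the placeholder arrives as the literal string "UNK" or "UNKNOWN" and that dates are written like 12-Mar-2010; the variable names Smoker and VisitDate and the label name YesNo are invented for illustration. The idea is to map the placeholder to the extended missing value .u, so it stays distinguishable from ordinary missing data, and to attach a value label to .u for the binary field.

```stata
* Toy data standing in for the imported .dat file (hypothetical variables)
clear
input str10 Smoker str11 VisitDate
"1"   "12-Mar-2010"
"0"   "UNK"
"UNK" "03-Apr-2010"
end

* Flag the placeholder strings before converting anything
foreach v in Smoker VisitDate {
    gen byte `v'_unk = inlist(upper(trim(`v')), "UNK", "UNKNOWN")
}

* Binary field: real() returns . for non-numeric text, then recode to .u
gen byte Smoker_n = real(Smoker)
replace Smoker_n = .u if Smoker_unk
label define YesNo 0 "No" 1 "Yes" .u "Unknown"
label values Smoker_n YesNo

* Date field: date() returns . for unparseable text, then recode to .u
* (the "DMY" mask is an assumption about how the dates are written)
gen long VisitDate_n = date(VisitDate, "DMY")
replace VisitDate_n = .u if VisitDate_unk
format VisitDate_n %td

list Smoker Smoker_n VisitDate VisitDate_n
```

Because .u counts as missing, commands such as -summarize- and -regress- would still exclude those observations, while -tabulate, missing- would show "Unknown" as its own category.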

Any advice or code on handling this would be welcome.

Kind regards,
Marcus