I am trying to use a list of two-digit numeric prefixes to tag observations whose Tax ID does not meet the validity criteria. TaxID is a string variable with the format NN-NNNNNNN. The valid prefixes, while numeric, are not all sequential, and six of them have leading zeroes: 01, 02, 03, 04, 05 and 06. I plan to use the strmatch() function in a loop to identify the observations, but I am getting tripped up on the 01-06 prefixes. The TaxIDReview variable indicates the validity of the first two digits of the Tax ID: if the first two digits are not in the specified ranges, TaxIDReview should be 1. A simplifying assumption here is that an out-of-range prefix and a failure to match the general NN-NNNNNNN pattern are the only ways a Tax ID can be invalid. The isEIN variable flags Tax IDs that are EINs (not SSNs). The command thus far is:
capture gen TaxIDReview = .
foreach i of numlist 01/06 10/16 21/25 30 32 34/38 40 44 50/61 65 67 {
    local `i' "`i'"
    replace TaxIDReview = 0 if strmatch(TaxID, "`i'-???????") == 1 & isEIN == 1
}
Before:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 TaxID float(isEIN TaxIDReview)
"12-442362"   1 .
"62-1334690"  1 .
"99-9345585"  1 .
"32-3245682"  1 .
"35255"       0 .
"01-5188825"  1 .
"234-55-0049" 0 .
end
After:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 TaxID float(isEIN TaxIDReview)
"12-442362"   1 .
"62-1334690"  1 .
"99-9345585"  1 .
"32-3245682"  1 0
"35255"       0 .
"01-5188825"  1 .
"234-55-0049" 0 .
end
For example, 12-442362 fails the condition (it doesn't match the string pattern), as does 99-9345585 (prefix out of range), so nothing is replaced on TaxIDReview for either, whereas 32-3245682 meets the condition and TaxIDReview is replaced with 0. The problem is 01-5188825: it meets the condition and should be set to 0 as well, but it is not. Stata expands 01/06 in the numlist to the numbers 1 through 6 rather than to two-digit strings, so the pattern being matched is "1-???????" instead of "01-???????".
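To make the issue concrete, here is a minimal sketch that reproduces it outside my data. It uses only a shortened numlist (01/03 10, chosen for illustration) and the one Tax ID from the example above; nothing else is assumed.

```stata
* A numlist holds numbers, so 01/03 is expanded to 1 2 3
* and the leading zero never reaches the strmatch() pattern.
foreach i of numlist 01/03 10 {
    display "pattern used: `i'-???????"
}

* The pattern the loop builds for prefix 01 does not match ...
display strmatch("01-5188825", "1-???????")     // 0

* ... while the pattern I actually want does.
display strmatch("01-5188825", "01-???????")    // 1
```

So the mismatch comes from how the numlist is expanded, not from strmatch() itself.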
My two questions:
- Is there a way to do this in one loop (and is that even desirable for efficiency's sake), versus just adding a separate loop with a foreach i in 01 02 03 04 05 06 { command?
- How do I replace TaxIDReview = 1 if the first two digits are not in the numlist above? The range of possible values is [00-99], so I could just run the same loop with a numlist of the "invalid" ranges. This strikes me as kludgy and error-prone, so I'd appreciate any guidance on the approach.