Hi Statalist,

I am new here, but I have learned a lot from you so far, so thank you all.

I am using Stata/IC 15.1, and the problem I am facing is as follows:

There are 17 variables, e.g. n_1_0 to n_3_2. The values in these variables are numbers ranging from 1001 to 99999, or missing (.).
The coding I want is:
Cases: 1044 recorded in any variable
Controls: missing (.) in all variables
Other diseases: any other number, i.e. excluding 1044, 99999 and missing (.)

The data look like this:
id  n_1_0  n_1_2  n_1_3  n_1_4  n_1_5
1   -      -      -      -      -
2   1022   1075   -      -      -
3   -      -      99999  -      -
4   -      1044   -      1044   -
5   1044   -      -      -      1006
6   -      -      1044   -      -
7   1010   -      -      1044   -


etc.

I have coded the cases just fine. The code is:

gen status = .
replace status = 1 if n_1_0==1044 | n_1_2==1044 | n_1_3==1044 | n_1_4==1044 | n_1_5==1044 // 1044 recorded in any variable, which is why I used | (OR)

and I got 4,123 hits

Similarly for controls:
replace status = 2 if n_1_0==. & n_1_2==. & n_1_3==. & n_1_4==. & n_1_5==. // must be missing in all variables to be a control, which is why I used & (AND)

and I got 457,300
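
For reference, the two rules above (1044 in any variable, missing in all variables) can also be written without listing every variable by hand. This is only a minimal sketch on the seven example rows from this post: the "-" entries are entered as missing (.), the varlist n_1_0-n_1_5 assumes the variables sit next to each other in the dataset, and is_case and n_recorded are names made up for the illustration.

```stata
* Toy data: the seven example rows from the post, "-" entered as missing
clear
input id n_1_0 n_1_2 n_1_3 n_1_4 n_1_5
1 . . . . .
2 1022 1075 . . .
3 . . 99999 . .
4 . 1044 . 1044 .
5 1044 . . . 1006
6 . . 1044 . .
7 1010 . . 1044 .
end

* is_case (hypothetical name) = 1 if any listed variable equals 1044
egen byte is_case = anymatch(n_1_0-n_1_5), values(1044)

* n_recorded (hypothetical name) = number of nonmissing values in the row
egen n_recorded = rownonmiss(n_1_0-n_1_5)

gen status = .
replace status = 1 if is_case == 1      // case: 1044 anywhere
replace status = 2 if n_recorded == 0   // control: nothing recorded at all

list id status, clean
```

With 17 variables the same two egen lines would take the full varlist (for example n_1_0-n_3_2, again assuming the variables are contiguous in the dataset).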

The problem arises every time I try to code the other diseases. The code I used is:
replace status = 3 if n_1_0 >= 1001 & n_1_0 < 99999 & n_1_0 != 1044
and I repeat it for each of the other variables.

What happens after this command is that the number of cases drops from 4,123 to 3,745. I think the issue comes from rows like id 4, where 1044 occurs twice, and id 5, where 1044 appears together with a different number such as 1006 (1044 comes first in id 5 and last in id 7).
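
One possible explanation for the drop, offered as a guess rather than a diagnosis: each replace status = 3 looks at a single variable, so a row that was already coded 1 gets overwritten as soon as any of its variables holds another disease code (ids 5 and 7 in the example). A minimal sketch of one way around this, on the same seven example rows, is to leave rows alone once they are cases by adding & status != 1. The "-" entries are entered as missing (.), the varlist assumes the variables are contiguous, and n_recorded is a made-up name.

```stata
* Toy data: the seven example rows from the post, "-" entered as missing
clear
input id n_1_0 n_1_2 n_1_3 n_1_4 n_1_5
1 . . . . .
2 1022 1075 . . .
3 . . 99999 . .
4 . 1044 . 1044 .
5 1044 . . . 1006
6 . . 1044 . .
7 1010 . . 1044 .
end

gen status = .

* Cases: 1044 recorded in any variable
foreach v of varlist n_1_0-n_1_5 {
    replace status = 1 if `v' == 1044
}

* Controls: no value recorded in any variable
egen n_recorded = rownonmiss(n_1_0-n_1_5)
replace status = 2 if n_recorded == 0

* Other diseases: any code from 1001 to 99998 other than 1044,
* but never overwriting a row already coded as a case
foreach v of varlist n_1_0-n_1_5 {
    replace status = 3 if inrange(`v', 1001, 99998) & `v' != 1044 & status != 1
}

tabulate status, missing
```

In this sketch a row whose only value is 99999 (id 3) stays missing, because it is neither a case, a control, nor another disease under the rules stated above. Whether a person with both 1044 and another code should count as a case is a design choice; here cases take priority.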


I hope someone can help me with this problem and suggest the best way to solve it, as I am going to work with much larger data sets like this.

Thanks!