I have some questions regarding an issue I have encountered when trying to change a string variable into a numeric (categorical) one. I am writing my question in a series of commands to show my steps.
My variable is called "confused" - with a 5-point Likert scale response for possible answers (1=Very unlikely, 2=Unlikely, 3=Neutral, 4=Likely, 5=Very likely). By chance, the category "Unlikely" appears not to have been selected by survey respondents for this particular variable, and there were 23 respondents who didn't complete the question.
Here is my variable:
. tab confused, missing
Consider a |
patient who |
you think |
might have |
delirium. How |
likely do the |
words li | Freq. Percent Cum.
--------------+-----------------------------------
| 23 10.13 10.13
Likely | 66 29.07 39.21
Neutral | 2 0.88 40.09
Very likely | 135 59.47 99.56
Very unlikely | 1 0.44 100.00
--------------+-----------------------------------
Total | 227 100.00
For analysis, I want to change the categorical Likert responses into numeric data with labels.
1) First, I tried destring:
. destring confused, replace
The error message was:
confused: contains nonnumeric characters; no replace
2) I wasn't sure why I got this message, so I tried encode:
encode confused, generate(confused2)
. d confused2
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------------------------------
confused2 long %13.0g confused2
Consider a patient who you think might have delirium. How likely do the words li
3) This looks good, but I would like to ensure that the correct categories have the correct numeric labels for future analysis. I prefer to have numbers as the primary data source so that if I need to do things quickly with future commands, I can write <if confused ==1> instead of <if confused =="Very unlikely">.
Initially I thought:
. label define likertlbl 1 "Very unlikely" 2 "Unlikely" 3 "Neutral" 4 "Likely" 5 "Very likely"
. label values confused2 likertlbl
. tab confused2, missing
Consider a |
patient who |
you think |
might have |
delirium. How |
likely do the |
words li | Freq. Percent Cum.
--------------+-----------------------------------
Very unlikely | 66 29.07 29.07
Unlikely | 2 0.88 29.96
Neutral | 135 59.47 89.43
Likely | 1 0.44 89.87
. | 23 10.13 100.00
--------------+-----------------------------------
Total | 227 100.00
But the output "confused2" here doesn't match the original "confused" categories - the response "likely" should have had n=66 from the original data, but here it has been made to match the "very unlikely" response.
I think this is because the responses were not in hierarchical order in the "confused" variable as they appear alphabetical in the list. Therefore when I used the <label define> command, it assumed that everything was already in hierarchical order in "confused2" when it assigned labels.
FIRST QUESTION: Is my interpretation of this mismatch correct?
4) So, to try again, I performed the following commands
. generate confused3 = 1 if confused =="Very unlikely"
(226 missing values generated)
. replace confused3 = 2 if confused =="Unlikely"
(0 real changes made)
. replace confused3 = 3 if confused =="Neutral"
(2 real changes made)
. replace confused3 = 4 if confused =="Likely"
(66 real changes made)
. replace confused3 = 5 if confused =="Very likely"
(135 real changes made)
. tab confused3, missing
confused3 | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 0.44 0.44
3 | 2 0.88 1.32
4 | 66 29.07 30.40
5 | 135 59.47 89.87
. | 23 10.13 100.00
------------+-----------------------------------
Total | 227 100.00
Now I have the correct "match" - I know there were 66 responses that selected "likely" where 4=likely.
I can now attach my labels like this:
. label values confused3 likertlbl
. tab confused3, missing
confused3 | Freq. Percent Cum.
--------------+-----------------------------------
Very unlikely | 1 0.44 0.44
Neutral | 2 0.88 1.32
Likely | 66 29.07 30.40
Very likely | 135 59.47 89.87
. | 23 10.13 100.00
--------------+-----------------------------------
Total | 227 100.00
So in the end, I have achieved what I wanted. However this was very labor intensive!
SECOND QUESTION: Is there an easier way to achieve what I wanted than going through the code in section 4?
Previously I have made the changes for the numeric categories in Excel, so that when I import into Stata, they are already numeric. And then I simply assign the variable labels. I was hoping to do everything in Stata at once so that there was a clear log file for all commands with data cleaning and manipulation.
Apologies if these are simple questions and I have made it a long-winded scenario!
Any assistance appreciated.
Related Posts with Recoding a string variable into numeric (instead of destring) and subsequent mismatch of values
Dropping overlapping eventsI am conducting an event study for my study, the effect of football transfers on stock prices. The p…
Creating new variables using the subset of an existing variableHello, I am pretty new user of Stata. I would like to create new variables by using the subset of a…
Generating Quaterly Data, Format RangeDear All, I would like to set a time series quaterly data from 1995q1 to 2019q4: (Here is my code)…
Difference in proportions between two samplesHello, I have a question about proportions. How do I check for difference in proportions between tw…
Cluster firms based on wage distribution on panel dataHi. I want to create 15 groups of firms based on their wage distribution. I have a 10-year long mont…
Subscribe to:
Post Comments (Atom)
0 Response to Recoding a string variable into numeric (instead of destring) and subsequent mismatch of values
Post a Comment