I am a young and new Stata user trying to learn how to draw conclusions from datasets.
I have a small set of survey responses (attached) that I am trying to analyze.
As I am self-taught, could a more experienced user review my work/reasoning and offer corrections/suggestions?

I am interested to know:
  • What errors I am making
  • What issues I am not addressing
  • How my thinking can be more sophisticated
  • Whether conclusions can be drawn from this regression model
The dataset is attached and my do-file is below:


I will explore descriptive statistics to become familiar with the data.

Code:
codebook
sum
misstable sum
I am interested in exploring the following variables:

Code:
tab1 Q2 Q11 Q14 Q15, miss
bysort Q11: tab Q2 Q14
bysort Q15: tab Q2 Q11
I will focus on the following variables:

Code:
sum Q14, detail
sum Q15, detail
tab Q14 Q15, chi2 lrchi2
I see that missing responses are coded as negative values, so I will recode them to Stata's system missing value (.) before running correlations and regressions.

Code:
replace Q14 = . if Q14 < 0
replace Q15 = . if Q15 < 0
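A quick check, sketched here on the assumption that every negative value in Q14 and Q15 is a missing-response code, can confirm that the recode worked. Because the summaries above were run while those codes were still in the data, re-running them afterwards shows the distributions without the negative values.

Code:
* confirm that no negative codes remain (missing values are allowed)
assert Q14 >= 0 | missing(Q14)
assert Q15 >= 0 | missing(Q15)
* re-run the earlier summaries on the recoded variables
sum Q14 Q15, detail
tab Q14 Q15, chi2 lrchi2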
I am interested in how strongly related these variables are.

Code:
pwcorr Q14 Q15, sig star(.05)
The Pearson correlation between Q14 and Q15 is moderate and positive (r = .6, p < .00005), so using social media as a news source (Q14) accounts for about 36% (r² = .36) of the variation in trusting the news (Q15).
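
The 36% figure is the squared correlation. As a small check, using the rounded r = .6 reported above rather than the exact coefficient:

Code:
* share of variance accounted for is r squared; this displays .36
display 0.6^2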

I will further explore this relationship by building the following models:


Code:
set showbaselevels on
asdoc reg Q15 i.Q14, robust nest append
asdoc reg Q15 i.Q14 i.ideocat, robust nest append
asdoc reg Q15 i.Q14 i.ideocat i.gencat i.racecat i.gender, robust nest append
estat vif
I run the regressions with robust standard errors to account for possible heteroskedasticity. The model explains nearly 40% of the variance in trust (Q15) and shows a statistically significant relationship between Q15 and Q14 (p < .00005). The root MSE (.61) means that predictions are typically off by about .61 units of Q15, which I interpret as fairly accurate prediction.
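
Because Q14 enters the model as a set of indicator variables, its overall significance is a joint test of all its coefficients rather than a single p-value. The sketch below is one possible way to check this, assuming the full model above and using regress directly without asdoc. testparm gives the joint test, and estat hettest (the Breusch-Pagan test), run here after a fit with conventional standard errors, is one way to see whether heteroskedasticity is present.

Code:
* full model with robust standard errors, then a joint test of the Q14 indicators
regress Q15 i.Q14 i.ideocat i.gencat i.racecat i.gender, vce(robust)
testparm i.Q14
* same model with conventional standard errors, then the Breusch-Pagan test
regress Q15 i.Q14 i.ideocat i.gencat i.racecat i.gender
estat hettest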

The degree to which news is acquired from social media, along with ideology, race, gender, and generation (the Generation X and Baby Boomer categories), are statistically significant predictors of trust.

None of the predictors has a VIF > 10 or a 1/VIF < .1, suggesting that multicollinearity is not a problem.


Code:
pwcorr Q15 Q14 gencat ideocat racecat gender, sig star(.05)
I run a correlation matrix for all variables in the model. The correlation between using social media as a news source and trusting the news is the strongest, the correlations for generation and ideology are more moderate, and the correlations for gender and race are negative.

I conclude that trust in the news (Q15) is associated with the degree to which one uses social media as a news source (Q14), as well as with race, gender, generation, and ideology.