Regression problem with categorical/dummy variables that take on more than two values

Hi members,

I have a cross-sectional dataset with 167 observations, on which I am trying to run an OLS regression of greenpremium (the y-variable) on rating, currency, sector, maturity and issue amount (the x-variables). Rating, currency and sector are qualitative variables that are stored as strings in Stata. Maturity and issue amount are quantitative variables.

So, first let's consider the case of rating. The <tabulate> command gives the following output:

Code:

tabulate rating

     Rating |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |         48       28.74       28.74
         AA |         37       22.16       50.90
        AAA |         58       34.73       85.63
         BB |          1        0.60       86.23
        BBB |         23       13.77      100.00
------------+-----------------------------------
      Total |        167      100.00

But remember, since rating is stored as a string, we cannot use it in this format in Stata. Thus, I use the following command to encode it as a numeric variable, from which the dummy variables are generated:

Code:

 encode rating, generate(ratingdummy)

This command assigns A=1, AA=2, AAA=3, BB=4 and BBB=5.
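The mapping can be checked directly. This is a minimal sketch, assuming encode was run as above without a label() option, in which case the value label it creates takes the name of the new variable:

```stata
* Show the integer code attached to each rating string
label list ratingdummy

* Tabulate the underlying integer codes instead of the labels
tabulate ratingdummy, nolabel
```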

Then, because I want rating AAA to be my base/reference category, I execute the following command:

Code:

 fvset base 3 ratingdummy
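As an alternative sketch, the base level can also be chosen inside the regression itself with the ib#. factor-variable prefix, without storing a setting on the variable. Note that this form includes every other level of rating, BB among them:

```stata
* ib3. makes level 3 (AAA) the base; levels 1, 2, 4 and 5 each get a coefficient
regress greenpremium ib3.ratingdummy
```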

Now, let's say I only want to regress greenpremium on ratings A, AA and BBB, whilst leaving out the BB rating. I use the following command:

Code:

 regress greenpremium 1.ratingdummy 2.ratingdummy 5.ratingdummy

This results in the following output:

Code:

regress greenpremium 1.ratingdummy 2.ratingdummy 5.ratingdummy

      Source |       SS           df       MS      Number of obs   =       167
-------------+----------------------------------   F(3, 163)       =      1.48
       Model |  .027896206         3  .009298735   Prob > F        =    0.2226
    Residual |  1.02580512       163  .006293283   R-squared       =    0.0265
-------------+----------------------------------   Adj R-squared   =    0.0086
       Total |  1.05370132       166  .006347598   Root MSE        =    .07933

------------------------------------------------------------------------------
greenpremium |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 ratingdummy |
          A  |  -.0246367     .01542    -1.60   0.112    -.0550854     .005812
         AA  |   .0015211   .0166359     0.09   0.927    -.0313286    .0343709
        BBB  |  -.0272488   .0195009    -1.40   0.164    -.0657559    .0112582
             |
       _cons |   .0013823   .0103279     0.13   0.894    -.0190115     .021776
------------------------------------------------------------------------------

Now my issue is with the number of observations that Stata reports. Remember from the <tabulate> output above that there is only one observation with rating BB. Thus, should the number of observations not be 166?

I tested this further by also excluding rating A (48 observations) in addition to BB (1 observation). This gives the following output:
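One way to see what the 1., 2. and 5. terms do is to build the same indicators by hand. In the sketch below, isA, isAA and isBBB are hypothetical variable names introduced only for this comparison. The assumption being checked is that each #.ratingdummy term is simply a 0/1 indicator, so that no observation is dropped and the BB observation ends up in the comparison group together with AAA:

```stata
* Hypothetical 0/1 indicators, one per level named in the regression
generate byte isA   = (ratingdummy == 1) if !missing(ratingdummy)
generate byte isAA  = (ratingdummy == 2) if !missing(ratingdummy)
generate byte isBBB = (ratingdummy == 5) if !missing(ratingdummy)

* If the assumption holds, this matches the coefficients and N reported above
regress greenpremium isA isAA isBBB

* Observations that are 0 on all three indicators (AAA and BB)
count if isA == 0 & isAA == 0 & isBBB == 0
```

Given the tabulation above, the final count should be 59 (58 AAA plus 1 BB), which would mean _cons is the mean of greenpremium over AAA and BB pooled rather than over AAA alone.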

Code:

regress greenpremium 2.ratingdummy 5.ratingdummy

      Source |       SS           df       MS      Number of obs   =       167
-------------+----------------------------------   F(2, 164)       =      0.93
       Model |  .011831414         2  .005915707   Prob > F        =    0.3962
    Residual |  1.04186991       164  .006352865   R-squared       =    0.0112
-------------+----------------------------------   Adj R-squared   =   -0.0008
       Total |  1.05370132       166  .006347598   Root MSE        =     .0797

------------------------------------------------------------------------------
greenpremium |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 ratingdummy |
         AA  |   .0125731    .015201     0.83   0.409    -.0174419    .0425881
        BBB  |  -.0161968    .018319    -0.88   0.378    -.0523682    .0199746
             |
       _cons |  -.0096697   .0077054    -1.25   0.211    -.0248842    .0055448
------------------------------------------------------------------------------

Again, we get 167 observations. Should we not get 167-48-1=118 observations? I thought that with the command I used, we are only including the AA, AAA and BBB ratings. Thus, it should be 118 observations.
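For comparison, restricting the estimation sample is normally done with an if qualifier rather than by choosing which levels to list. A minimal sketch, assuming the codes A=1, AA=2, AAA=3, BB=4 and BBB=5 from encode:

```stata
* Keep only AA (2), AAA (3) and BBB (5); ib3. makes AAA the base level
regress greenpremium ib3.ratingdummy if inlist(ratingdummy, 2, 3, 5)

* Drop only the single BB observation instead
regress greenpremium ib3.ratingdummy if ratingdummy != 4
```

Based on the tabulation above, the first command should use 37+58+23=118 observations and the second 167-1=166.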

Any ideas on why it is like this?

Best regards,
Akshil Shah
