Hello,

my following problem seems to me as being pretty complicated. However, I will try to explain it as goood and structured as possible. I hope you can help me.

My dataset consists of the following variables: fund (identifier for each fund) mdate (yearmonth) number (number of stocks the fund holds) stock_id (identifier for each stock) markcap (market capitalization of each fund. Telling me basically how much the stock is worth at the stock market).
So if fund A holds lets say 5 stocks at January2005, there will be 5 observations for fund A in January2005, given that each stock is considered seperately within the stock_id column.

My goal: Each year in June (overall period is January2005 till December2018), use the median of markcap to split the stocks in two groups, named "small" and "big". If markcap > median = "big". If markcap < median = "small".

If markcap > median in June(t), the stock should then be labelled as "big" from July(t) till June(t+1). Then in June (t+1) stocks are divided again in "small"/"big" and then labelled from July(t+1) till June (t+2) and so on.

However, since one stock can be hold by multiple funds at each date, there are duplicates stock_id mdate markcap in my dataset. Therefore, before dividing the stocks into "small" and "big", I do need to exclude these duplicates so that every stock is considered only once per mdate. Therefore, I would either need to mark the duplicates and exclude them when performing the median calculation (but I do not know how to do this), or I copy the dataset, delete all duplicates, do the median calculation and labelling and then merge the datasets back together.

I am sorry that I can not provide one single code so far. But my stata knowledge for now is simply not good enough to know how to solve this problem. I guess for the labelling from July(t) till June(t+1) I could use rangejoin or a code that looks something like this (assuming that I did already define a small_big variable, indicating to which group each stock_id belongs)?

Code:
by stock_id(mdate), sort: replace small_big = small_big [_n-1] if missing(small_big)

And the creation of the small_big variable could perhaps look something like this (assuming that I did already create a markcap decile for each stock_id per year in June). Eventhough I am not sure if using deciles is the smartest way.
Code:
label define small_big 0 "small" 1 "big"
gen byte small_big= 1 if inlist(markcap_decile_june, 6,7,8,9, 10)
replace small_big= 0 if inlist(marpcap_decile_june, 1, 2,3,4,5)
label values small_big small_big
I would reall appreciate any help a lot.
Thank you very much.

Tim Wolf