This is relevant to the question I asked some days ago and can be found here:
I have encountered some difficulties due to the size of my database. Hence, I am starting a new thread regarding that topic.
I will start with the following example. Assume that we have a database with the following variables: id, year, treated, x1, and x2.
Treated is an indicator and informs us that a firm had an event at a specific year. My aim is to to construct a new variable that shows me the id of matched firms based on specific criteria that I name below.
There are three requirments that need to hold in order to match a firm.
1. If a company is treated at time t, the matched company must not be treated in years (-5, +5) from the event year of the treated firm. For example, if a treated firm had an event in 1990, the matched company must not be treated in the period [1985, 1995].
2. Create a variable z that is the ratio of x1 of the untreated firm to the value of x1 of the treated firm one year before the event. Then, based on z, the matched company should have values of z in the range [0.70 to 1.3]. When we compare z, we care about the values one year before the event.
3. The treated company's x2 one year before the event is closest to the x2 of the matched company one year before the event. So, for two companies that qualify for points 1 and 2, we choose as a match the company with the closest x2 to our treated company one year before the event.
In my previous thread, a solution was provided using the rangejoin command. Alas, although this works for this small sample, there are problems for large databases. In my case, there are more than 5,000 unique ids, 30+ years, and more than 20,000 events.
The problem arises with the rangejoin command and I get a Op. sys. refuses to provide memory error.
Is there a way to do the comparisons without the rangejoin command?
Thank you for your time.
A simplistic example is the following:
Code:
input id year treated x1 x2 1 1980 0 157 18 1 1981 0 165 6 1 1982 0 28 4 1 1983 0 167 13 1 1984 0 77 8 1 1985 0 25 3 1 1986 0 156 11 1 1987 0 159 2 1 1988 0 101 2 1 1989 0 128 2 1 1990 1 44 15 1 1991 0 161 15 1 1992 0 195 20 1 1993 0 10 19 1 1994 0 34 7 1 1995 0 88 4 1 1996 0 83 12 1 1997 0 191 7 1 1998 0 180 20 1 1999 0 96 7 1 2000 0 177 10 2 1980 0 119 20 2 1981 0 71 18 2 1982 0 192 15 2 1983 0 142 4 2 1984 0 187 12 2 1985 0 41 8 2 1986 0 68 17 2 1987 0 74 20 2 1988 0 192 3 2 1989 0 178 7 2 1990 0 159 6 2 1991 0 43 16 2 1992 1 146 20 2 1993 0 103 2 2 1994 0 169 19 2 1995 0 99 11 2 1996 0 194 9 2 1997 0 48 19 2 1998 0 103 4 2 1999 0 106 18 2 2000 0 65 11 3 1980 0 15 8 3 1981 0 188 7 3 1982 0 138 7 3 1983 0 85 18 3 1984 1 114 7 3 1985 0 50 17 3 1986 0 122 1 3 1987 0 180 13 3 1988 0 67 4 3 1989 0 48 13 3 1990 0 63 18 3 1991 0 23 11 3 1992 0 36 4 3 1993 0 95 16 3 1994 0 188 18 3 1995 0 31 18 3 1996 1 22 3 3 1997 0 10 1 3 1998 1 11 1 3 1999 0 60 7 3 2000 0 25 7 4 1980 0 112 18 4 1981 0 197 6 4 1982 0 96 2 4 1983 0 165 2 4 1984 0 64 4 4 1985 0 71 19 4 1986 0 15 17 4 1987 0 158 7 4 1988 0 140 11 4 1989 0 156 3 4 1990 0 38 14 4 1991 0 33 14 4 1992 0 17 10 4 1993 0 183 1 4 1994 0 111 20 4 1995 0 69 12 4 1996 0 127 15 4 1997 0 24 12 4 1998 0 102 2 4 1999 0 130 17 4 2000 0 130 15 end
0 Response to Identifying matched observations in a large database
Post a Comment