Dear Stata Users,

I am in the process of cleaning data on historical greenhouse gas emissions. Since the data came formatted with a separate row for each gas and each sector for a given country and year there are multiple lines for a single country and year (i.e. many duplicates). I created variables for each sector and gas combinations. Eventually, I hope to have one line for each year by country (i.e. no duplicates) with these variables of each sector/gas combo.

The problem is that I'm stuck halfway in cleaning. My data looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str41 country int year float PIK_CH4_Agr
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850   7.60e-07
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1850          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851  8.075e-07
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1851          .
"Aruba" 1852  8.625e-07
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1852          .
"Aruba" 1853  9.275e-07
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1853          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854 1.0025e-06
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1854          .
"Aruba" 1855          .
"Aruba" 1855          .
"Aruba" 1855          .
"Aruba" 1855          .
"Aruba" 1855          .
end
Here you see CH4 emissions data for Aruba for a few early years. There is only one value for a year. It is periodic because the other missing values are lines corresponding to other sectors and gas combinations. My plan was to fill everything out by country and year for each variable and then trim it so there are no duplicates (maybe there is a better way to do this. But I've done similar things with panel data before).

My problem is filling out the data. There are a number of commands to fill down values including carryforward. But I can't fill backward. I tried Nick Cox's simple trick to use gsort -year and then carryforward but suffice to say this doesn't work. I am also unable to use anything with tsset because I have so many duplicates.

I am also limited in do this:

Code:
bysort ISO year : replace PIK_CH4_Agr = PIK_CH4_Agr[_n-1] if missing(PIK_CH4_Agr)
Because it will only fill the previous year. How many years I need to backfill, however, varies a lot.

One solution I came up with was to 1.) identify the first value where the value is not missing, 2.) identify the first occurance and 3.) replace the first occurrence with the first non-missing value 4.) from there I can carryforward. I'm stuck on #3.

My code so far is:

Code:
bysort country year: gen first_PIK_CH4_Agr = sum(PIK_CH4_Agr !=. ) == 1
bysort country year: carryforward PIK_CH4_Agr, replace
bysort country year: gen start_PIK_CH4_Agr = sum(PIK_CH4_Agr==. ) == 1
If someone can either help me finish this last step so I can replace the first occurrence with the first non-missing value that would be great. If someone has an even simpler solution to my whole cleaning approach, that would also be warmly welcomed.