My research team coded/classified a large set of documents. I took a random subset of 1,000 of the coded documents to code again and check reliability. So each document was coded by one of eight team members and a second time by me. However, because of the nature of the documents, some documents should be coded more than once because they meet multiple classification qualifications. But not all documents were necessarily coded the same number of times by me and by the other coder (sometimes I coded for two or more classifications and the other coder did not, or vice versa, depending on how thorough each of us was). About 6% of documents were coded more than once.

If the coder answered that the document was type 1, then they coded the T1 variables and the T2 variables were left missing. If it was type 2, then they coded the T2 variables. If it was both type 1 and type 2, the coder should have coded it twice, once for each type. As the data are arranged now, each variable has four columns: the RA's first coding, my first coding, the RA's second coding (if applicable), and my second coding (if applicable):

doc_id type_ra1 type_me1 type_ra2 type_me2 T1_ra1 T1_me1 T1_ra2 T1_me2 T2_ra1 T2_me1 T2_ra2 T2_me2
1      1        1        .        .        2      2      .      .      .      .      .      .
2      2        2        .        .        .      .      .      .      1      2      .      .
3      1        1        .        2        1      1      .      .      .      .      .      2
4      1        2        .        1        2      .      .      2      .      2      .      .
(Not all the codes are necessarily 0 or 1; some variables go up to 6. All of the variables are categorical, not ordinal.)
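
For anyone who wants to reproduce the example, the toy data above can be entered like this (just a sketch of the four rows; the variable names are the ones from my real data set):

clear
input doc_id type_ra1 type_me1 type_ra2 type_me2 T1_ra1 T1_me1 T1_ra2 T1_me2 T2_ra1 T2_me1 T2_ra2 T2_me2
1 1 1 . . 2 2 . . . . . .
2 2 2 . . . . . . 1 2 . .
3 1 1 . 2 1 1 . . . . . 2
4 1 2 . 1 2 . . 2 . 2 . .
end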

So, in the example above, the first row should have perfect reliability between me and the RA.
In the second row, we agree on the type variable but not on the T2 variable.
In the third row, both coders coded it as type 1 in the same way, but I also coded it a second time and the RA did not.
In the fourth row, my second coding matched the RA's first coding, which should still count toward higher reliability, but it would have been better if the RA had also coded it a second time in the same way as my first coding.

For the intercoder reliability (ICR), I've been grouping all coding attempts on comparable variables like this:

kappaetc type_ra1 type_ra2 type_me1 type_me2
kappaetc T1_ra1 T1_ra2 T1_me1 T1_me2
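
and the same for the T2 variables (adding the analogous call here for completeness):

kappaetc T2_ra1 T2_ra2 T2_me1 T2_me2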



So, I have 2 questions:

1) Is there a better way to account for the missing values here? When I make the missing values 0, reliability comes out around .7, depending on the variable; when I leave them as missing, it is around .85.
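
To be clear about what I did there: by "make the missing values 0" I mean recoding the missings to an explicit 0 category before running kappaetc, roughly like this (using mvencode; if 0 already occurs as a real code in a variable, an unused number would have to be substituted):

* treat "not coded for this type" as its own category before computing agreement
mvencode T1_ra1 T1_me1 T1_ra2 T1_me2 T2_ra1 T2_me1 T2_ra2 T2_me2, mv(0)
kappaetc T1_ra1 T1_ra2 T1_me1 T1_me2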

2) Is there a good way to measure overall reliability? I want to be able to see whether the document as a whole was reliably coded, not just the individual variables. Because there are a lot of variables coded for each document, it could be that most documents have something off when the variables are looked at collectively, even though the individual variables tend to have high reliability.
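
To make what I mean by "the document as a whole" concrete, the kind of summary I have in mind is a per-document flag that is 1 only when every code matches, something like the naive sketch below. (It only compares first codings to first codings, so it would miss the cross-matching in row 4, and jointly missing codes count as agreement because . == . is true in Stata.)

* naive document-level agreement: 1 only if every first-pass code matches
generate byte all_match = (type_ra1 == type_me1) & (T1_ra1 == T1_me1) & (T2_ra1 == T2_me1)
tabulate all_match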