Hi all! I've got a longitudinal dataset of binary "ratings" collected from participants each day via two different methods, and I'm trying to figure out the best way to assess day-level agreement between the two methods. The data are long with respect to day but wide with respect to method: for each outcome, the two methods' ratings are recorded in separate variables (liver1/liver2 and dbs1/dbs2) for each participant (id) on each of 30 study days (studyday). So, basically the data look like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double id float(studyday svysubmitdate liver1 liver2) byte(dbs1 dbs2)
 9  1 20607 0 1 0 0
 9  2 20608 1 . 0 .
 9  3 20609 1 . 0 .
 9  4 20610 0 . 0 .
 9  5 20611 0 . 0 .
 9  6 20612 0 0 0 0
 9  7 20613 0 1 0 0
 9  8 20614 1 1 0 0
 9  9 20615 1 1 0 0
 9 10 20616 0 1 0 0
 9 11 20617 0 0 0 0
 9 12 20618 0 0 0 0
 9 13 20619 0 0 0 0
 9 14 20620 0 1 0 0
 9 15 20621 0 1 0 0
 9 16 20622 1 1 0 0
 9 17 20623 1 1 0 0
 9 18 20624 0 0 0 0
 9 19 20625 0 0 0 0
 9 20 20626 0 0 0 0
 9 21 20627 0 0 0 0
 9 22 20628 0 1 0 0
 9 23 20629 1 1 0 0
 9 24 20630 1 1 0 0
 9 25 20631 1 0 0 0
 9 26 20632 0 0 0 0
 9 27 20633 0 0 0 0
 9 28 20634 1 0 0 0
 9 29 20635 1 1 0 0
17  1 20612 0 1 0 0
17  2 20613 0 0 0 0
17  3 20614 0 0 1 0
17  4 20615 0 1 1 0
17  5 20616 0 1 0 0
17  6 20617 0 1 1 0
17  7 20618 0 1 0 0
17  8 20619 0 0 0 0
17  9 20620 0 0 0 0
17 10 20621 0 0 0 0
17 11 20622 1 1 0 0
17 12 20623 0 1 0 0
17 13 20624 1 0 0 0
17 14 20625 0 0 1 0
17 15 20626 0 0 0 0
17 16 20627 0 0 0 0
17 17 20628 0 0 0 0
17 18 20629 0 1 0 0
17 19 20630 0 1 0 0
17 20 20631 0 0 0 0
17 21 20632 0 0 1 0
17 22 20633 0 0 1 0
17 23 20634 0 0 1 0
17 24 20635 0 0 1 0
17 25 20636 0 0 0 0
17 26 20637 0 1 1 0
17 27 20638 0 1 0 0
17 28 20639 0 1 0 0
17 29 20640 0 0 0 0
41  1 20607 1 1 1 0
41  2 20608 1 . 1 .
41  3 20609 1 . 1 .
41  4 20610 0 . 0 .
41  5 20611 1 . 1 .
41  6 20612 1 1 1 0
41  7 20613 1 1 1 0
41  8 20614 1 1 1 0
41  9 20615 1 1 1 0
41 10 20616 1 0 1 0
41 11 20617 0 0 1 0
41 12 20618 1 1 1 0
41 13 20619 1 1 1 0
41 14 20620 1 1 1 0
41 15 20621 1 1 1 0
41 16 20622 0 1 1 0
41 17 20623 1 1 1 0
41 18 20624 0 1 0 0
41 19 20625 . 1 . 0
41 20 20626 1 1 1 0
41 21 20627 1 1 1 0
41 22 20628 0 1 1 0
41 23 20629 0 0 1 0
41 24 20630 . 1 . 0
41 25 20631 1 1 1 0
41 26 20632 1 1 1 0
41 27 20633 . 1 . 0
41 28 20634 1 1 1 0
41 29 20635 1 1 1 0
end
format %tdNN/DD/YY svysubmitdate
Basically, what I'd like to know is the extent to which the ratings from these two methods agree on a given day. But there are a ton of different ways of calculating ICCs/kappas, so I guess I'm most curious which one might be best given these repeated-measures, longitudinal data. For example, one route might be to reshape long to get a variable that reflects assessment method (1, 2) and calculate ICCs from a mixed model like:

Code:
reshape long liver dbs, i(id studyday) j(method)
xtmelogit liver || id: || method: , variance   // id is the higher level; method nested within id
estat icc
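(As a crude baseline alongside any model-based ICC, raw day-level percent agreement is easy to get in the wide layout before reshaping. This is just a sketch; agree_liver is a scratch variable I made up, not part of the real dataset:)

Code:
* crude baseline: raw agreement rate between the two methods, by study day
* agree_liver is a scratch variable for illustration only
gen byte agree_liver = (liver1 == liver2) if !missing(liver1, liver2)
tabstat agree_liver, by(studyday) stat(mean n)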
Or, maybe ICC(3) could be a decent fit, since I at least have a "random sample" and fixed raters? So something like:

Code:
kappaetc liver1 liver2, icc(mixed)
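(For a quick sanity check next to kappaetc, Stata's built-in -kap- command gives a pooled two-rater Cohen's kappa across all person-days, though note it treats observations as independent and ignores the clustering within participants:)

Code:
* pooled Cohen's kappa across all person-days; ignores clustering within id
kap liver1 liver2
kap dbs1 dbs2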
Of course, these two approaches (as well as the more general -icc- command) produce pretty wildly different estimates. So, any thoughts on which might be the best fit here, or other approaches I should explore a bit more? Thanks!