I am wondering whether anyone has seen any kind of examination of the various matching methods available in the -matchit- command. I don't really understand the difference between bigram, ngram, ngram_circ, token, soundex and token_soundex. Also, where do the different scoring options (jaccard, simple, and minsimple) excel?
Specifically, I have a dataset where I'm trying to match on business name and address. Does one of the options do better at ignoring small (in my mind, anyway) differences such as "LLC" vs. "Inc" vs. no modifier, which often arise when business names are recorded? Does another give greater weight to differences in numbers, such as those I'm seeing in addresses? For example, "123 Main Street." should not be matched with "321 Main Street.", but should be matched with both "123 Main" and "123 Main St". Can someone tell me, either from experience or by reference, where I might get the best results? Has Julio Raffo written about this before? If so, I can't seem to find it.
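To make the address example concrete, here is a rough Python sketch of how a character-bigram comparison and a whole-word (token) comparison could treat those strings. This is only an illustration, not -matchit-'s code: the helper names (`bigrams`, `tokens`, `jaccard`) are hypothetical, and the score is the textbook set-based Jaccard index, which may well differ from what -matchit- computes under `score(jaccard)`.

```python
def bigrams(text):
    """Set of overlapping two-character substrings, case-insensitive."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def tokens(text):
    """Set of whitespace-separated words, lowercased, with periods removed."""
    return set(text.lower().replace(".", "").split())


def jaccard(a, b):
    """Textbook Jaccard index: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


reference = "123 Main Street"
candidates = ["321 Main Street", "123 Main", "123 Main St"]

for candidate in candidates:
    by_bigram = jaccard(bigrams(reference), bigrams(candidate))
    by_token = jaccard(tokens(reference), tokens(candidate))
    print(f"{candidate!r}: bigram={by_bigram:.3f} token={by_token:.3f}")
```

Under these assumptions, the bigram comparison rates "321 Main Street" as closer to "123 Main Street" than "123 Main" is, because most of the characters are shared, while the token comparison ranks them the other way round, because the whole house number counts as a single mismatching word. That is the kind of difference between methods I am hoping someone can confirm or correct for -matchit- itself.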
Thanks for any help you can provide.