I have a dataset of electronic medical records, and I am trying to categorize diagnoses ("dx1") by body system ("system").
The problem is that there are misspellings and variants of certain words.
I am looking for a way to assign all variants of a string to one category. For example, if dx1 contains some variant of "mammary" (e.g., "mammery", "mamery", "mamary"), system should be set to 8 (based on the value labels defined below).
So far I have been doing this manually, listing every known variant (e.g., replace system=8 if strpos(dx1, "mammary")|strpos(dx1, "mammery")|strpos(dx1, "mamary") ), but that is taking far too long, and the dataset is very large.
I've tried using matchit, but I can only find documentation on using that to merge two datasets. Here, my strings are all part of one variable within one dataset.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str517 dx1 float system
"abdominal masses cranial to right nipple (13.4x14.8mm, 7.6x8.1mm, 8.8x8.8mm) - r/o fibroadenoma, fibrosarcoma, lipoma, mammary carcinoma" .
"adenocarcinoma mammary gland" .
"adenocarcinoma mammary gland e" .
"adenocarcinoma mammery gland" .
"adenocarnicoma of the mammary glands" .
"adenoma mammary gland benign" .
"adenoma mammary gland benign (multiple)" .
"adenoma mammary gland benign (removed 5/15/13)" .
"adenosquamous carcinoma mammary gland" .
"adenosquamous carcinoma, left fourth mammary gland (surgical excision 1/24/14, scar revision 2/7/14)" .
"benign adenoma--mammary gland" .
"benign mammary masses (adenoma)-surgically excised on 2/9/15" .
"benign mammary tumors (adenomas)" .
"bilateral high grade mammary adenocarcinoma" .
"bilateral high-grade mamary gland adenocarcinoma" .
"bleeding from mass suspected to be mammary gland adenocarcinoma" .
"complex adenoma with atypia - right caudal/inguinal mammary gland" .
"incision recheck - mammary mass excision l3, low-grade adenocarcinoma completely excised" .
"inflammatory mammery adenosquamous carcinoma (l4, excised)" .
"inguinal mass - r/o mammary adenocarcinoma" .
"intraductal mammary papillary adenocarcinoma (r4)- excised 10/16/13" .
"mammary adenocarcinoma (right 5th gland) with lymph node involvement" .
"mammary adenocarcinoma (right fifth mammary gland)" .
"mammary adenocarcinoma - right axillary gland (excised 11/2010)" .
"mammary adenocarcinoma, grade 2" .
"mammary adenocarinoma with metastasis to axillary lymph node" .
"mammary adenoma (excised)" .
"mammary adenosquamous carcinoma (l4, excised), with recurrence mammary nodules in r2,3,4 and l2" .
"mammary cystadenocarcinoma" .
"mammary gland adenocarcinoma - surgically excised 3/2012" .
"mammary gland adenoma - left 4th gland, surgically removed 02/21/17" .
"mammary gland fibroadenoma - right caudal mammary chain" .
"mammary glands enlargement - likely mammary fibroadenomatous hyperplasia" .
"mammary intraductal papillary adenoma (multiple) - l4 mammary gland - regional mastectomy 09/03/2014" .
"mammary mass (likely adenocarcinoma)" .
"mammary masses -- suspect adenocarcinoma" .
"mammary masses-suspect mammary adenocarcinoma recurrence" .
"mammary tumor: adenoma with an area with more malignant appearance. clean margins." .
"mammary tumors- adenomas and carcinoma" .
"mass along right caudal mammary chain (excised - adenoma)" .
"mesenteric/jejunal lymphadenopathy" .
"multiple mammary adenomas - completely excised 12/1" .
"papillary mammary adenocarcinoma, glands 3 and 4 on the right side" .
"post-op mammary adenoma and anal sac adenocarcinoma removal (3/14/12)" .
"primary mammary adenocarcinoma" .
"right maxillary mammary adenoma - removed 10/23/13" .
"simple mammary adenocarcinoma" .
end
label define System 1 "Oropharyngeal/nasal" 2 "Ocular" 3 "Aural" 4 "Respiratory" 5 "Cardiovascular/Hematological" 6 "Gastrointestinal" 7 "Hepatobiliary" 8 "Urogenital" 9 "Musculoskeletal" 10 "Integument" 11 "Neurological" 12 "Behavior" 13 "Other" 14 "Healthy"
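For illustration, the chain of strpos() calls described above could possibly be collapsed into a single regular expression. This is only a sketch, and it assumes the "mammary" variants differ solely by a single versus double "m" and an a/e swap; misspellings outside that pattern (transposed or dropped letters elsewhere) would not be caught and would need their own pattern or a string-distance approach. It uses the dx1 and system variables and the System value label from the example above.

```stata
* Sketch only: assumes variants differ by one/two "m" and by a/e before "ry".
* The pattern "mamm?[ae]ry" matches mammary, mammery, mamary, and mamery.
replace system = 8 if regexm(dx1, "mamm?[ae]ry")

* Attach the value labels defined above and check what was assigned.
label values system System
tabulate system, missing

* Inspect diagnoses that are still uncategorized to spot remaining variants.
list dx1 if missing(system)
```

On the example data this should leave only "mesenteric/jejunal lymphadenopathy" uncategorized, since it is the one diagnosis without a "mammary" variant. Listing the leftovers after each rule is one way to discover misspellings that the pattern does not yet cover.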