Importing delimited csv with special characters including double quotes

I am trying to import a large comma-delimited csv (~1M rows). Values are bound by double quotes because some of them contain commas. Additionally, some of the rows contain the two-character combination \" (a backslash followed by a double quote), and occasionally this is the last sequence in the value before the final binding double quote, so that two double quotes appear next to each other (\"").

I am struggling to find a way to get Stata to ignore these \" combinations without also giving up the binding quotes, which would cause interior commas to be read as delimiters. It may be possible in another language to do a replace-all of \" in the raw csvs, but as I am not very familiar with e.g. Python, I am hoping there is a complete solution in Stata.

Here is an example mytest1.csv with six rows, including some of the problem rows:

Code:

"Kuzmin, S","UKR","0.0","Rakhmanin, Y","Lugansk","8","2019-07-21","14129876","Kuzmin, S","1741","20","14126630","Rakhmanin, Y","2053","0.00","-2.80","b","0","r"
"Kuzmin, S","UKR","0.0","Medvedsky, V","Lugansk","9","2019-07-21","14129876","Kuzmin, S","1741","20","14138395","Medvedsky, V","1817","0.00","-8.00","b","0","r"
"Vysochin, S","UKR","1.0","Drobot, S","\"Cup Independence - 2019 - \"A\" \"Open\"","1","2019-08-23","14103516","Vysochin, S","2493","10","14131129","Drobot, S","2093","1.00","0.80","b","0","r"
"Piesik, P","POL","1.0","Kaluzny, K","Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik, P","2197","15","1147285","Kaluzny, K","1847","1.00","1.65","w","0","r"
"Gopal, , K.n.","IND","1.0","Karthik, P","Namuduru,","1","2009-01-27","5001447","Gopal, , K.n.","2204","15","5089719","Karthik, P","1854","1.00","1.65","w","0","r"
"Blackman, J","BAR","1.0","Wilson, A",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N

I have played around with the quote options for import delimited, but with no success. For example, leaving the default options generates a dataset with problems in rows 3 and 4; here are the relevant variables:

Code:

import delimited "mytest1",  clear

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str12 v4 strL v5 str34 v6 str36 v7
"Rakhmanin, Y" "Lugansk"                                                                                                              "8"                                    "2019-07-21"                            
"Medvedsky, V" "Lugansk"                                                                                                              "9"                                    "2019-07-21"                            
"Drobot, S"    `"\"Cup Independence - 2019 - \"A\" \"Open\","1","2019-08-23","14103516","Vysochin"'                                   `" S","2493","10","14131129","Drobot"' `" S","2093","1.00","0.80","b","0","r""'
"Kaluzny, K"   `"Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik"' `" P","2197","15","1147285","Kaluzny"' `" K","1847","1.00","1.65","w","0","r""'
"Karthik, P"   "Namuduru,"                                                                                                            "1"                                    "2009-01-27"                            
"Wilson, A"    "\N"                                                                                                                   "\N"                                   "\N"                                    
end

Specifying bindquotes(nobind) fixes this issue:

Code:

import delimited "mytest1", bindquotes(nobind)  clear

...but causes Stata to interpret the commas inside quoted values as delimiters, most visibly in row 5, where the extra commas shift the columns, generating:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 v1 str3 v2 str6 v3 str5 v4 str10 v5 str8 v6 str80 v7
`""Kuzmin"'   `" S""' `""UKR""'  `""0.0""' `""Rakhmanin"' `" Y""'      `""Lugansk""'                                                                       
`""Kuzmin"'   `" S""' `""UKR""'  `""0.0""' `""Medvedsky"' `" V""'      `""Lugansk""'                                                                       
`""Vysochin"' `" S""' `""UKR""'  `""1.0""' `""Drobot"'    `" S""'      `""\"Cup Independence - 2019 - \"A\" \"Open\""'                                     
`""Piesik"'   `" P""' `""POL""'  `""1.0""' `""Kaluzny"'   `" K""'      `""Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka""'
`""Gopal"'    " "     `" K.n.""' `""IND""' `""1.0""'      `""Karthik"' `" P""'                                                                             
`""Blackman"' `" J""' `""BAR""'  `""1.0""' `""Wilson"'    `" A""'      "\N"                                                                                
end

I also tried importing without stripping the quotes:

Code:

import delimited "mytest1", stripquotes(no)  clear

...but this didn't help:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str15 v1 str5(v2 v3) str14 v4 strL v5 str34 v6
`""Kuzmin, S""'     `""UKR""' `""0.0""' `""Rakhmanin, Y""' `""Lugansk""'                                                                                                           `""8""'                               
`""Kuzmin, S""'     `""UKR""' `""0.0""' `""Medvedsky, V""' `""Lugansk""'                                                                                                           `""9""'                               
`""Vysochin, S""'   `""UKR""' `""1.0""' `""Drobot, S""'    `""\"Cup Independence - 2019 - \"A\" \"Open\"","1","2019-08-23","14103516","Vysochin"'                                  `" S","2493","10","14131129","Drobot"'
`""Piesik, P""'     `""POL""' `""1.0""' `""Kaluzny, K""'   `""Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik"' `" P","2197","15","1147285","Kaluzny"'
`""Gopal, , K.n.""' `""IND""' `""1.0""' `""Karthik, P""'   `""Namuduru,""'                                                                                                         `""1""'                               
`""Blackman, J""'   `""BAR""' `""1.0""' `""Wilson, A""'    "\N"                                                                                                                    "\N"                                  
end
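
A minimal sketch of how the replace-all idea might be done without leaving Stata, using -filefilter- to write a cleaned copy of the file before importing. The output filename mytest1_clean.csv and the choice of a single quote as the replacement are assumptions made for illustration, as is the premise that every \" in the raw file is an escaped quote inside a value:

```stata
* Sketch only: assumes every \" in the raw file is an escaped quote inside a value.
* mytest1_clean.csv is a hypothetical name for the filtered copy.
* In -filefilter- patterns, \BS stands for a backslash, \Q for a double quote,
* and \RQ for a right single quote (').
filefilter mytest1.csv mytest1_clean.csv, from("\BS\Q") to("\RQ") replace

* With the \" pairs replaced, the default quote binding should keep
* interior commas inside their values.
import delimited "mytest1_clean.csv", clear
```

Under this sketch the \N placeholders in row 6 are left untouched, since only a backslash immediately followed by a double quote is replaced. A value that legitimately ends in a backslash just before its closing quote would be altered as well, so the cleaned file is worth spot-checking against the raw one.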
