Removing middle of a string between certain characters

Hello Statalist,

I have a string variable that is interspersed with HTML tags (e.g. " " or ""). I want to get rid of all of these tags, which can be identified by their angle brackets.

To make things complicated:
1) There is a large variety of these tags, so I cannot simply run a "subinstr()" for a select list of them - I need something that catches them in an automated way via the angled brackets.
2) There can be more than one of these tags per observation.

I tried the following code (looping it 9 times to remove up to 9 tags):

Code:

foreach num of numlist 1/9 {
   gen htmltag`num'=substr(textwithtags,strpos(textwithtags,"<"),strpos(textwithtags,">"))
   replace textwithtags=subinstr(textwithtags,htmltag`num',"",.)
}

But this doesn't work well for cases with multiple tags. Take the following example: " Does NAME have an AGREEMENT or CONTRACT to return?" The approach doesn't know which pair of brackets belongs together as one tag, so some of the text between tags is removed as well.

Any help would be much appreciated!

Best,
Felix
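
One possible approach to the question above is a regular expression that matches each tag separately, rather than pairing the first "<" with the first ">" by position. The sketch below assumes Stata 14 or later, where the Unicode regex function ustrregexra() is available; clean_text is a hypothetical name for the new variable, and textwithtags is the variable from the question. It also assumes that no literal "<" or ">" appears in the text outside of tags. Note that the third argument of substr() is a length, not an end position, which is likely part of why the loop in the question removes too much.

```stata
* Sketch only: assumes Stata 14 or later, where ustrregexra() is available.
* Pattern "<[^>]*>": a "<", any run of characters that are not ">", then ">".
* Because a match cannot cross a ">", each tag is matched on its own,
* and ustrregexra() replaces every match, so no loop is needed.
* clean_text is a hypothetical name for the new variable.
generate clean_text = ustrregexra(textwithtags, "<[^>]*>", "")

* Optional: trim the ends and collapse runs of internal blanks
* that removed tags can leave behind.
replace clean_text = stritrim(strtrim(clean_text))
```

For a hypothetical value such as "<b>Does NAME have an <u>AGREEMENT</u>?</b>", this should leave "Does NAME have an AGREEMENT?", since only the bracketed pieces are removed and the text between tags is kept.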

BJ Data Tech Solution