SOLVED Exclude set of keywords from string (using Regex)

Hello community!

I am currently using this piece of code

import System.Text.RegularExpressions;

var theString : String = "word1 and word3 in the word6 blah blah";
var editString : String = Regex.Replace(theString, "( and )+", " ");
editString = Regex.Replace(editString, "( in )+", " ");
editString = Regex.Replace(editString, "( the )+", " ");
editString = Regex.Replace(editString, "( )+", " ");

in order to exclude some common words from a string, which I then split at the white spaces to get an array of the words. Despite trying several syntaxes (and doing some research), I couldn’t figure out how to combine the above “Replace” lines (at least the first three) into one. Is it possible? As the code sample indicates, I am using Unity Javascript and the Regex class from System.Text.RegularExpressions. Suggestions to generally optimize the method are welcome, of course.

EDIT: Just noticed that the ( keyword )+ method will replace only the first match, so please let me know how I would be able to replace all the matches in the string.

EDIT2: My actual goal is to create a keyword search method, where from a string input, which represents several words, I get every single keyword in a different string (so let’s say an array of the substring keywords), excluding some predefined terms. I don’t necessarily want to use regular expressions, but I thought it would be the most straightforward way to do it. I am now thinking that I might create the array with ALL the keyword substrings first and then edit this array to remove unwanted inputs… I’ll give it a try, but if a regex could do the job, I’d be pleased to learn something new! :)
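A minimal sketch of the split-then-filter idea described in EDIT2, written in Python rather than UnityScript purely for illustration. The function name `extract_keywords` and the `EXCLUDED` set are hypothetical; the post only names "and", "in" and "the" as unwanted terms, and the case-insensitive comparison is an assumption.

```python
import re

# Hypothetical exclusion list; the post only mentions these three words.
EXCLUDED = {"and", "in", "the"}

def extract_keywords(text, excluded=EXCLUDED):
    # Split on runs of non-word characters, so punctuation also acts as a separator.
    words = re.split(r"\W+", text)
    # Drop empty pieces (from leading/trailing separators) and excluded terms.
    # Lower-casing before the lookup is an assumption, not something the post asks for.
    return [w for w in words if w and w.lower() not in excluded]

print(extract_keywords("word1 and word3 in the word6 blah blah"))
# ['word1', 'word3', 'word6', 'blah', 'blah']
```

Filtering after the split avoids the question of how adjacent keywords share their surrounding spaces, because each word is checked on its own.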

FINAL EDIT: Note that UnityScript will not accept a single backslash in a string literal, so it needs a second backslash. In another case, for example, "\\n" would be needed instead of "\n" to represent the line break character.

I think I’ve got this right, but I’m by no means a regex expert, and I’m also not sure exactly what you want.

I do know that you can combine different regex matches using a group with the alternation operator, “|”. I also think you might want to use something other than the space character to represent word boundaries: “\W” matches any non-word character. On top of that, one of the words might start or end the string, where there is no surrounding character at all, so the anchors “^” and “$” are needed as alternatives. And what if the words appear in sequence (“…in the…”)? You might want to also match a single space character in your sequence. “+” gets you one or more of the preceding expression, which will be useful for all of these elements. Take that all together, and you get something like:

(\W|^)+(and|in|the| )+(\W|$)+

Check out this link, where I tested the Regex out.
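As a quick check of the pattern above, here is a sketch in Python (not UnityScript) that applies it to the string from the question. The helper name `strip_keywords` is made up for this example. The pattern syntax used here is shared with .NET, so the same expression should carry over to `Regex.Replace`, though that is an assumption worth testing in Unity.

```python
import re

# Same pattern as above; a raw string keeps the backslashes literal in Python.
PATTERN = r"(\W|^)+(and|in|the| )+(\W|$)+"

def strip_keywords(text):
    # Replace each run of excluded words, plus its surrounding separators, with one space.
    return re.sub(PATTERN, " ", text)

print(strip_keywords("word1 and word3 in the word6 blah blah"))
# word1 word3 word6 blah blah
print(strip_keywords("the cat and the dog in the house").split())
# ['cat', 'dog', 'house']
```

When an excluded word starts the string, the replacement leaves a leading space, so trimming or splitting on whitespace afterwards is still needed. Words that merely contain a keyword, such as "sand" or "hand", are left alone.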

In implementation, the backslash "\" character is actually the escape character. So you’ll need to escape it by using a second "\" character, such as the following:

var editKeyword : String = Regex.Replace(newKeyword, "(\\W|^)+(and|in|the| )+(\\W|$)+", " ");
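To illustrate the escaping point, this small Python sketch compares the doubled-backslash literal from the line above with a raw string. Python is used only for illustration: raw strings are a Python feature, and the names `escaped` and `raw` are hypothetical.

```python
import re

# Doubled backslashes, exactly as in the UnityScript line above.
escaped = "(\\W|^)+(and|in|the| )+(\\W|$)+"
# Python raw string: backslashes are taken literally, so no doubling is needed.
raw = r"(\W|^)+(and|in|the| )+(\W|$)+"

print(escaped == raw)  # True: both literals produce the same pattern text
print(len("\\W"))      # 2: the regex engine receives a single backslash and a W
```

In both cases the regex engine receives `\W` with a single backslash; the doubling exists only to get that backslash past the string-literal parser.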

Edit: Revised to include the final scripted version.