Unable to use RegularExpression Pattern in Unity

Hi Guys,

I have a simple Regex.Match call that uses this pattern: <\?PDA No="(.+?)". The problem is that the compiler keeps saying: error CS1009: Unrecognized escape sequence `?'.

I’ve tried splitting the pattern into several strings and concatenating them, but that doesn’t seem to work either. Does anyone know of a workaround?

Thanks!

If you’re looking to capture a question mark, you need to either use a double backslash, "\\?", or use a verbatim string literal, @"\?":

Regex regex = new Regex(@"\?");
Match match = regex.Match("example?");
Debug.Log(match.Value);

I think kru’s suggestion is likely correct, but for syntax errors, it’s pretty much required to see the entire line of code causing the error in order to really know where the problem is.

The problem here is C#'s parsing of your regex string:

In code, I imagine it looks like this:

Regex regex = new Regex("<\?PDA No=\"(.+?)\"");

Forget that this is a regex and try to interpret this string as the C# parser.

First, while scanning, the parser will look for quotation marks. Once it finds one, it will switch modes to label any characters as part of a string. If it finds another quotation mark, it will switch back into regular mode.

Following just that simple rule, this part appears to be a string:

<\?PDA No=\

Followed by some nonsensical characters:

(.+?)\

And followed finally with an empty string:

""

Hopefully you spotted the problem: that simple rule doesn’t allow quotation marks to exist within strings. For instance, how would you make this a string in C#?

"

The answer is what we call escape characters. In C#, the escape character is the backslash. Whenever the parser is in “string mode” and comes across a backslash, it expects the next character(s) to combine with the \ to form a special “escape sequence”. This escape sequence is then translated into a literal character rather than having its usual effect (like a " ending the string).
So this:

"\""

is actually interpreted simply as "
And this:

"\\"

is actually interpreted simply as \
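If you want to check those two rules yourself, here is a minimal sketch. The class and field names (EscapeDemo, Quote, Backslash) are made up for illustration; only the string literals matter.

```csharp
public static class EscapeDemo
{
    // Regular literal: the escape sequence \" yields one quotation mark.
    public static readonly string Quote = "\"";

    // Regular literal: the escape sequence \\ yields one backslash.
    public static readonly string Backslash = "\\";

    // True when each literal holds exactly one character, as described above.
    public static bool EachIsOneCharacter()
    {
        return Quote.Length == 1 && Backslash.Length == 1;
    }
}
```

Calling EscapeDemo.EachIsOneCharacter() from anywhere (a Unity Start() method, for example) should return true: two characters in the source, one character in the resulting string.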

So considering these two rules, let’s take another look at your regex

Regex regex = new Regex("<\?PDA No=\"(.+?)\"");

Scanning from left to right, let’s take up the role of the parser just before hitting the first quotation mark.

The first character we spot is ", so switch to string mode
Next, we spot <, so add < to the string we are building
Spotted: \, so interpret the next character as an escape sequence.
Spotted: ?, unfortunately, none of the known escape sequences start with ?, so we’ve come across an “unrecognized escape sequence”
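The walk-through above can be sketched as a toy scanner. To be clear, this is not how the real C# compiler is written; ToyStringScanner and Unescape are hypothetical names, and the sketch only knows the two escapes discussed so far (\" and \\), whereas the real compiler knows many more (\n, \t and so on).

```csharp
using System;
using System.Text;

public static class ToyStringScanner
{
    // Takes the text between the outer quotes of a regular string literal
    // and returns the characters that text stands for.
    public static string Unescape(string literalBody)
    {
        var result = new StringBuilder();
        for (int i = 0; i < literalBody.Length; i++)
        {
            char c = literalBody[i];
            if (c != '\\')
            {
                // Ordinary character: add it to the string being built.
                result.Append(c);
                continue;
            }

            // Backslash: the next character must complete an escape sequence.
            if (i + 1 >= literalBody.Length)
                throw new FormatException("Backslash at end of literal");

            char next = literalBody[++i];
            if (next == '"' || next == '\\')
                result.Append(next);
            else
                throw new FormatException("Unrecognized escape sequence `" + next + "'");
        }
        return result.ToString();
    }
}
```

Feeding it the body <\?PDA No=\"(.+?)\" throws on the ?, just like the compiler error. Feeding it <\\?PDA No=\"(.+?)\" returns <\?PDA No="(.+?)".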

I’m guessing that what you really wanted is for the backslash to be a part of the regex, in which case you should do:

Regex regex = new Regex("<\\?PDA No=\"(.+?)\"");

However, I think you’ll start realizing there might be more problems with your regex… let’s parse it again:
Spotted: ", so switch to string mode
Spotted: <, so add < to string (<)
Spotted: \, so interpret the next characters as an escape sequence
Spotted: \, full escape sequence so add \ to string (<\)
Spotted: ?, so add ? to string (<\?)
… let’s skip to just before the = symbol
Spotted: =, so add = to string (<\?PDA No=)
Spotted: \, so interpret the next characters as an escape sequence
Spotted: ", full escape sequence so add " to string (<\?PDA No=")

I’ll stop here because I think this is not what you expected. The parser has eaten your \ in order to stay in “string mode”. That is why you don’t get weird compile errors from the following characters. However, it will also result in a string you didn’t expect.

I think you might already have an understanding of escape characters because you have slashes in the regex itself. So now you’ve come across one of the unfortunate side effects of embedding an escaped string within a language which itself uses escape characters. The problem is compounded because the two languages (regex and C#) happen to use the same escape character. To see this compound effect, check out this C# code that just tries to find a single, literal backslash using regex:

Regex regex = new Regex("\\\\");
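As a rough check of that four-backslash literal, the sketch below counts backslashes in an input string. BackslashRegexDemo and CountBackslashes are names invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class BackslashRegexDemo
{
    // Four backslashes in the C# source become a two-character string,
    // which the regex engine reads as "match one literal backslash".
    public static readonly Regex LiteralBackslash = new Regex("\\\\");

    // Counts how many backslashes appear in the input.
    public static int CountBackslashes(string input)
    {
        return LiteralBackslash.Matches(input).Count;
    }
}
```

For example, BackslashRegexDemo.CountBackslashes(@"C:\Temp\file.txt") should return 2.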

The result of patching up your regex is then,

Regex regex = new Regex("<\\?PDA No=\\\"(.+?)\\\"");

kru showed you another technique you can use to eliminate some of the confusion. C# provides the @ symbol to mark a “verbatim” string. Verbatim strings are not processed for escape sequences, so a backslash in one is just a backslash. The one exception is the double quote character, which is escaped by placing two double quotes back to back.

The verbatim string will also fix the second problem I demonstrated, with the backslashes which appear later in your regex, but you’ll need to use the double quotation trick to add the quotation marks to your regex:

Regex regex = new Regex(@"<\?PDA No=\""(.+?)\""");

You’ll notice that the code highlighting system on this forum fails for verbatim strings…
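If you’d like to convince yourself that the regular and verbatim versions describe the same pattern, a small sketch like this should do it. LiteralForms and its members are hypothetical names; the two literals are the ones from this post.

```csharp
public static class LiteralForms
{
    // Regular literal: every backslash and quote meant for the regex is escaped for C#.
    public const string Regular = "<\\?PDA No=\\\"(.+?)\\\"";

    // Verbatim literal: backslashes are left alone, quotes are doubled.
    public const string Verbatim = @"<\?PDA No=\""(.+?)\""";

    // Compares the two strings character by character.
    public static bool SamePattern()
    {
        return Regular == Verbatim;
    }
}
```

LiteralForms.SamePattern() should return true, since both literals produce the text <\?PDA No=\"(.+?)\" once the compiler is done with them.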

Additional reading:
https://msdn.microsoft.com/en-us/library/362314fe.aspx

Thank you so much for the long reply there! That’s the gist of where I found my problem: both C# and regular expressions use the backslash. Putting the @ symbol before the string eliminated the \? escape issue, but it wouldn’t let me escape the speech marks. For now, I’ve shortened the expression to PDA No="(.+?)". However, I don’t generally like hacks or workarounds.

When you put the @ symbol in front of a string, you get what is called a ‘verbatim literal’, which turns off \ as the escape character (the @ symbol is used because these strings are usually used for folder paths, which use \ to denote folders… get it, ‘AT address’). Thing is… we still need an escape character, so in @ or verbatim literal mode, the " becomes the escape character.

So, just like in a regular string, where you’d escape \ by saying \\.

In a verbatim literal string, you’d escape " by saying "":

Regex rx = new Regex(@"<\?PDA No=\""(.+?)\""");
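Here is a hedged sketch of using that pattern to pull out the captured value. PdaExtractor, ExtractNo and the sample input are invented for illustration. The PlainQuotes variant rests on one extra assumption worth stating: as far as I know, the quotation mark has no special meaning in .NET regular expressions, so the backslash in front of it can be dropped at the regex level.

```csharp
using System.Text.RegularExpressions;

public static class PdaExtractor
{
    // Pattern from the thread: the regex itself sees \" for each quotation mark.
    public const string EscapedQuotes = @"<\?PDA No=\""(.+?)\""";

    // Assumption: " is not a regex metacharacter, so a bare " works as well.
    public const string PlainQuotes = @"<\?PDA No=""(.+?)""";

    // Returns the text captured by group 1, or null when nothing matches.
    public static string ExtractNo(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With a made-up input such as <?PDA No="12345"?>, both patterns should hand back 12345 from ExtractNo.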

Note, ironically, VB.Net operates in verbatim literal mode by default, which makes swapping between VB.Net and C# rather confusing at times.

Awesome, thanks a lot! :)