Non-Greedy Regex

So I have a number of items being returned in this general format:

<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>

<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>

The problem I’m having with a Regex for this is keeping it from grabbing both objects in a single, giant sweep, since it matches everything from the first line to the last line. I know *? matches as little as possible, but I’m not clear on how that can be applied to an entire grouping, since I obviously want to stop at the 1st “</itemData>” as soon as it appears so I wind up with 2 objects in my MatchCollection.

For those who might run into a similar situation :)
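
The pattern itself is not shown in the post, so here is a minimal sketch of how a non-greedy match might look. It is written in Python for brevity, and the `data` string and variable names are assumptions modelled on the sample in the question, not the original poster's code.

```python
import re

# Sample input modelled on the two <itemData> blocks in the question.
data = """<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>

<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>"""

# Greedy: .* runs on to the LAST </itemData>, so both blocks become one match.
greedy = re.findall(r"<itemData>.*</itemData>", data, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </itemData> after each opening tag.
lazy = re.findall(r"<itemData>.*?</itemData>", data, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

In .NET the same pattern should behave the same way with `RegexOptions.Singleline`, which plays the role of `re.DOTALL` by letting `.` match newlines.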

I’ll ask the same question I asked before - why are you running Regular expressions against this instead of parsing it as an XML string?

There’s no guarantee of the data being returned staying in the same format. As this is beyond our control, it makes more sense to code using a Regex which can handle any kind of formatting they might use in the future.

No it does not. When the question is “How do I parse XML?” the answer is never with RegEx.

You do know that it is literally (and I mean, literally) impossible to parse XML properly with regular expressions?

In fact, XML is no more than structured plain text. It’s not elegant to use regular expressions on XML, but it’s not impossible. I’ve never had trouble parsing XML with regular expressions.

Chomsky hierarchies. blah blah. type 3 grammars. yada yada. infinite recursion. wah wah.

This is a good regex resource: http://www.regular-expressions.info/

There are also a few good books on regular expressions.

You should never really use regular expressions to parse any kind of XML-style language. It’s just a really bad idea. Possible to do, but a really bad idea. Regular expressions are not capable of parsing regular grammars, of which, XML is included in that set.

Anybody that purports to parse any (X)ML-like regular grammar using a regular expression** is deluding themselves. They are parsing only a small subset of the grammar. You can do it in a pinch, but why bother constructing a brittle solution when so many better solutions are freely and readily available?
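
To make the "small subset" point concrete, here is a rough sketch in Python. The `<item>` inputs are made up for illustration; the idea is only that neither lazy nor greedy matching can track nesting depth.

```python
import re

# Hypothetical nested input: an <item> that contains another <item>.
nested = "<item>outer <item>inner</item> tail</item>"

# Lazy matching stops at the first closing tag, which belongs to the INNER element.
lazy = re.search(r"<item>(.*?)</item>", nested).group(1)

# Greedy matching has the opposite problem: it merges flat siblings into one match.
siblings = "<item>a</item><item>b</item>"
greedy = re.search(r"<item>(.*)</item>", siblings).group(1)

print(lazy)    # outer <item>inner
print(greedy)  # a</item><item>b
```

Either quantifier works only as long as the input never nests an element inside another of the same name, which is exactly the kind of restriction that makes the solution brittle.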

I would recommend any of the jQuery-like libraries for .NET that offer powerful CSS selector support for pulling out the data you need. Heck, use LINQ if you need to, but don’t use regex. You can NuGet SharpKit or any one of a half-dozen jQuery and jQuery-like packages, or even dedicated CSS selector packages, that will solve your problem inside of 2 minutes. There are of course the venerable Fizzler and HtmlAgilityPack, which are awesome solutions and are what I use on my own projects. http://code.google.com/p/fizzler/ SharpQuery and SgmlReader are also good in combination and probably more lightweight for use in Unity. http://code.google.com/p/sharp-query/
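
The libraries named above are .NET-specific. As a language-neutral illustration of the parser route, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The wrapping `<root>` element is an assumption: the sample in the question has several top-level elements, so it is not a well-formed document by itself.

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the question.
fragment = """<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>

<itemData>
     <dataItem>Stuff here</dataItem>
</itemData>"""

# Wrap the fragment in a made-up <root> so the parser sees a single document element.
root = ET.fromstring("<root>" + fragment + "</root>")

# One entry per <itemData>, holding the text of its <dataItem> child.
items = [item.findtext("dataItem") for item in root.iter("itemData")]

print(items)  # ['Stuff here', 'Stuff here']
```

The parser copes with whitespace, attributes, and nesting, so the extraction line does not need to change if the formatting of the response does.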

Note the “properly” clause. fholm speaks the truth.

Search SO for more concrete proof of these arguments.

In other words; Just… just don’t.

** unless they have a computer equipped with infinite memory, in which case, all bets are off.

This throws up a red flag for me. Are you saying you can’t guarantee that you will always get XML? You can design around that, so that you can swap the XML parser for a JSON parser (or one for whatever other format they choose) and still keep the request and end data structure intact. It would be far simpler to support the formats you know than to try to build something that could parse any format they might change to, which strikes me as an inordinately difficult way to protect against something that may or may not happen. Regardless, you are going to have to recode the parsing if they give you a new format. I can’t see how using Regex would improve the situation should they change formats.
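
A rough sketch of that design, in Python: each format gets its own parser, and all of them return the same end data structure. The function names, the `PARSERS` table, and the JSON shape are all hypothetical, chosen only to illustrate the idea.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


def parse_xml(payload: str) -> List[str]:
    # Wrap in a made-up <root> because the sample has several top-level elements.
    root = ET.fromstring("<root>" + payload + "</root>")
    return [el.text or "" for el in root.iter("dataItem")]


def parse_json(payload: str) -> List[str]:
    # Hypothetical JSON shape: {"itemData": [{"dataItem": "..."}, ...]}
    return [entry["dataItem"] for entry in json.loads(payload)["itemData"]]


# Supporting a new format means adding one entry here; callers do not change.
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "xml": parse_xml,
    "json": parse_json,
}


def load_items(payload: str, fmt: str) -> List[str]:
    return PARSERS[fmt](payload)


xml_payload = "<itemData><dataItem>Stuff here</dataItem></itemData>"
json_payload = '{"itemData": [{"dataItem": "Stuff here"}]}'

print(load_items(xml_payload, "xml") == load_items(json_payload, "json"))  # True
```

The rest of the program only ever sees the list returned by `load_items`, so a format change touches one parser function and nothing else.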

Not true … you may want to do so for performance reasons.

If you know the rules of your input and those rules are more constraining than pure XML, then you can be much more efficient, particularly when you only need a subset of the information contained in an XML document.

Rather topical for me at the moment as the major bottleneck in a messaging solution I did some remediation work on was the XML parsing.

EDIT: I’m not saying this is applicable in this scenario, and probably never crops up in Games Development.

True, but you won’t be parsing a regular grammar, XML or otherwise, except in a specifically defined case for a non-generic scenario. Which is then not XML or a regular grammar. You’re not parsing XML. You’re parsing something that looks like XML.

Quite so, but then you’re not parsing XML. You’re parsing something that looks like XML.

It’s rather common to (for example) have a grammar defined by an XML schema (or even a document, an agreement over the water cooler, whatever). In fact, without some kind of restriction it’s quite difficult to find a use for XML.

Also, isn’t there an equivalent regex for any regular grammar…

EDIT: Sorry, being a bit cheeky. I think the issue is that XML is not regular; however, some subset of XML as defined by some schema (in whatever flavour you prefer) may be.
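
As a sketch of what such a regular subset might look like, assume a schema that allows only a flat sequence of `<itemData>` blocks, each holding one `<dataItem>` with plain text and no nested markup. That schema is an assumption made up for this example, as are the names `ITEM`, `DOCUMENT`, `is_valid`, and `items`. Under it, a single regex describes the whole language:

```python
import re

# Assumed schema: zero or more <itemData> blocks, each holding exactly one
# <dataItem> whose text contains no markup. No nesting, no attributes.
ITEM = r"<itemData>\s*<dataItem>([^<]*)</dataItem>\s*</itemData>"
DOCUMENT = re.compile(r"\s*(?:" + ITEM + r"\s*)*")


def is_valid(text: str) -> bool:
    # fullmatch succeeds only if the ENTIRE input fits the restricted grammar.
    return DOCUMENT.fullmatch(text) is not None


def items(text: str) -> list:
    # With one capture group, findall returns the captured text of each item.
    return re.findall(ITEM, text)


flat = (
    "<itemData>\n  <dataItem>Stuff here</dataItem>\n</itemData>\n"
    "<itemData><dataItem>More</dataItem></itemData>"
)
nested = "<itemData><dataItem><b>bold</b></dataItem></itemData>"

print(is_valid(flat), items(flat))  # True ['Stuff here', 'More']
print(is_valid(nested))             # False
```

The moment the input steps outside the agreed subset (markup inside `<dataItem>` here), validation fails rather than silently producing a wrong match, which is about the best a regex can offer in this situation.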