Personally I wouldn’t use regex, even for multiple lines.
Usually a language uses a statement terminator… like a ‘;’ or a linefeed (just end of line). The semicolon is nice because you can spread one statement over multiple (human-readable) lines by ignoring linefeeds.
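To make the terminator idea concrete, here is a minimal sketch of splitting input on ‘;’ while treating linefeeds as ordinary whitespace. The names (StatementReader, ReadStatements) are made up for illustration, and it deliberately ignores string literals and comments that might contain a ‘;’.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StatementReader
{
    // Yields one statement per ';', treating line breaks as plain whitespace.
    // Hypothetical helper: it does not handle ';' inside string literals or comments.
    public static IEnumerable<string> ReadStatements(TextReader reader)
    {
        var sb = new StringBuilder();
        int i;
        while ((i = reader.Read()) >= 0)
        {
            char c = (char)i;
            if (c == ';')
            {
                // The terminator ends the current statement.
                yield return sb.ToString().Trim();
                sb.Clear();
            }
            else if (c == '\r' || c == '\n')
            {
                // A linefeed does not end the statement; it is just spacing.
                sb.Append(' ');
            }
            else
            {
                sb.Append(c);
            }
        }

        // Trailing text without a terminator is returned as a final statement.
        string rest = sb.ToString().Trim();
        if (rest.Length > 0) yield return rest;
    }
}
```

Feeding it "a = 1 +" followed by a line break and " 2; b = 3;" gives two statements, the first one spanning both lines.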
I’ve written a few parsers over the years, and there are several trade-offs to make in the design. A huge one can be memory… take for example this parser I wrote here:
This only evaluates arithmetic (with some functions tossed in and the ability to access properties of a single contextual object).
Parts of the code may seem very ad hoc, but I wrote it that way to avoid as much memory allocation and GC as possible. I use a TextReader because with that I can evaluate text files without having to load the entire thing (streaming string data is more efficient than loading it all at once). Usually the input still comes as strings from the inspector, though, so I use a StringReader for that (a custom one I wrote that can be recycled).
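As a rough illustration of the streaming idea (not the Evaluator’s actual code), this sketch parses an unsigned integer straight off a TextReader one character at a time, without building any intermediate strings. StreamingDemo and ReadInteger are hypothetical names.

```csharp
using System;
using System.IO;

public static class StreamingDemo
{
    // Skips leading whitespace, then parses an unsigned integer digit by digit.
    // Only Peek() and Read() are used, so nothing beyond the reader's own
    // buffer is allocated and the source never has to be fully loaded.
    public static long ReadInteger(TextReader reader)
    {
        while (reader.Peek() >= 0 && char.IsWhiteSpace((char)reader.Peek()))
        {
            reader.Read();
        }

        if (!IsAsciiDigit(reader.Peek()))
        {
            throw new InvalidOperationException("Expected a digit.");
        }

        long value = 0;
        while (IsAsciiDigit(reader.Peek()))
        {
            value = value * 10 + (reader.Read() - '0');
        }
        return value;
    }

    // Peek() returns -1 at end of input, which falls outside this range.
    private static bool IsAsciiDigit(int c)
    {
        return c >= '0' && c <= '9';
    }
}
```

Because it only depends on TextReader, the same method works on a StringReader wrapping an inspector string or a StreamReader over a file.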
…
A while back for a company I worked for I wrote another one.
That one required the ability to define functions and object types. It was a more robust language, so I tokenized everything. There were actual object types that represented these things, like ‘functions’. So as I parsed, I chained together objects that represented the code… this of course used more memory, but that project wasn’t so GC sensitive, so it wasn’t a big deal.
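To give a feel for what “objects that represent the code” can look like, here is a tiny sketch, assuming a language with just numbers and binary operators. It is not the code from that project; Node, NumberNode and BinaryNode are names invented for the example.

```csharp
using System;

// Base type for anything the parser produces.
public abstract class Node
{
    public abstract double Evaluate();
}

// A literal number in the source text.
public sealed class NumberNode : Node
{
    private readonly double _value;

    public NumberNode(double value) { _value = value; }

    public override double Evaluate() { return _value; }
}

// An operator with a left and right operand, each of which is itself a Node.
public sealed class BinaryNode : Node
{
    private readonly char _op;
    private readonly Node _left;
    private readonly Node _right;

    public BinaryNode(char op, Node left, Node right)
    {
        _op = op;
        _left = left;
        _right = right;
    }

    public override double Evaluate()
    {
        double l = _left.Evaluate();
        double r = _right.Evaluate();
        switch (_op)
        {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
            default: throw new InvalidOperationException("Unknown operator: " + _op);
        }
    }
}
```

A parser would build new BinaryNode('+', new NumberNode(2), new BinaryNode('*', new NumberNode(3), new NumberNode(4))) for the text 2 + 3 * 4. Every node is a heap allocation, which is the memory cost mentioned above.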
…
But in both, the gritty parsing was basically the same. I read the characters one by one and branch from there through a state machine.
When a line starts I expect only a handful of things.
Like in my Evaluator: I skip over whitespace (because it’s ignored) and get to the first character I find. And what do I do?
if (!IsValidWordPrefix(_current)) throw new System.InvalidOperationException("Failed to parse the command.");
//....
private static bool IsValidWordPrefix(char c)
{
    return char.IsLetterOrDigit(c) || c == '$' || c == '_' || c == '+' || c == '-' || c == '(';
}
I test whether it's a valid word prefix... a valid word MUST start with one of those few options. It can't start with a multiplication sign; that doesn't make sense. We know where we are in the flow of the program at that point, so we branch accordingly.
Let's take C# as a more complicated example...
When we start parsing a class file we really only have a small handful of options to start with...
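Here is one way the “skip whitespace, look at the first character, branch” step could be sketched. The categories in StartKind are my guesses at what each prefix might lead to for a small arithmetic language, not the Evaluator’s real structure, and the sketch only covers a subset of prefixes.

```csharp
using System;
using System.IO;

public static class LineStart
{
    // Hypothetical categories for what the first character tells us.
    public enum StartKind { Number, Identifier, Sign, Group }

    public static StartKind Classify(TextReader reader)
    {
        // Whitespace is ignored, so consume it before looking at anything.
        while (reader.Peek() >= 0 && char.IsWhiteSpace((char)reader.Peek()))
        {
            reader.Read();
        }

        int i = reader.Peek();
        if (i < 0) throw new InvalidOperationException("Failed to parse the command.");

        // Branch on the first meaningful character; it is left unread
        // so the next stage can consume it.
        char c = (char)i;
        if (char.IsDigit(c)) return StartKind.Number;
        if (char.IsLetter(c) || c == '_') return StartKind.Identifier;
        if (c == '+' || c == '-') return StartKind.Sign;
        if (c == '(') return StartKind.Group;

        // Anything else (a '*', for instance) cannot start a word.
        throw new InvalidOperationException("Failed to parse the command.");
    }
}
```

Each returned kind would then hand off to its own reading routine, which is where the state machine part comes in.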
// - comments
# - preprocessor directives (#if, #region, etc.)
using - for using statement
namespace - for namespacing
public/private/protected - for modifiers
class/enum/interface/struct/etc - for type declaration
That's all we look for contextually, because that's all that's allowed. And their first letters are pretty unique... u,n,p,c,e,i,s.
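The top-level options listed above could be sketched like this. For brevity it works on a whole line as a string rather than streaming characters, and TopLevel, Kind and Classify are hypothetical names. It only knows the keywords from the list, so real C# files would need more cases (attributes, other modifiers, and so on).

```csharp
using System;

public static class TopLevel
{
    public enum Kind { Comment, Directive, Using, Namespace, Modifier, TypeDeclaration }

    // Classifies what a top-level line starts with, using only the
    // handful of options that are legal in that context.
    public static Kind Classify(string line)
    {
        string s = line.TrimStart();
        if (s.StartsWith("//", StringComparison.Ordinal)) return Kind.Comment;
        if (s.StartsWith("#", StringComparison.Ordinal)) return Kind.Directive;

        // Read the leading run of letters as the first word.
        int end = 0;
        while (end < s.Length && char.IsLetter(s[end])) end++;

        switch (s.Substring(0, end))
        {
            case "using": return Kind.Using;
            case "namespace": return Kind.Namespace;
            case "public":
            case "private":
            case "protected": return Kind.Modifier;
            case "class":
            case "enum":
            case "interface":
            case "struct": return Kind.TypeDeclaration;
            default:
                throw new InvalidOperationException("Unexpected start of line: " + s);
        }
    }
}
```

A character-by-character version would branch on the first letter (u, n, p, c, e, i, s) and then confirm the rest of the keyword, but the set of accepted starts is the same.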
Once we get into a class, our context has changed (our state has changed), and we know what we're looking for now...
public/private/protected - for modifiers
... - type declarations for fields
method/property declaration
etc
Once in the body of a method we again are only looking for specifically contextual things. A line must start with one of a handful of things...
variable declaration
variable/field assignment
loop/eval statement
method call
goto
etc
So really we're only hunting for what we need when we need it.
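Put as a sketch, “hunting for what we need when we need it” amounts to keeping a current context and only accepting what that context allows. The keyword sets below are deliberately tiny and partly my own picks (for, while, if and return stand in for “loop/eval statement”); ParseContext and ContextRules are hypothetical names.

```csharp
using System.Collections.Generic;

// The state the parser is currently in.
public enum ParseContext { File, TypeBody, MethodBody }

public static class ContextRules
{
    // Which keywords may start a line in each context (illustrative subset only).
    private static readonly Dictionary<ParseContext, HashSet<string>> _allowed =
        new Dictionary<ParseContext, HashSet<string>>
        {
            { ParseContext.File, new HashSet<string> { "using", "namespace", "public", "private", "protected", "class", "enum", "interface", "struct" } },
            { ParseContext.TypeBody, new HashSet<string> { "public", "private", "protected", "class", "enum", "interface", "struct" } },
            { ParseContext.MethodBody, new HashSet<string> { "for", "while", "if", "goto", "return" } },
        };

    public static bool IsAllowed(ParseContext context, string keyword)
    {
        return _allowed[context].Contains(keyword);
    }
}
```

So "namespace" is accepted while in ParseContext.File but rejected in ParseContext.MethodBody, and "goto" is the other way around. A real parser would also have to treat arbitrary identifiers (type names, variables, method calls) as valid starts in the type and method contexts.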
...
Anyway, hopefully that gives you a start.