Parsing a Custom Scripting Language?

I want to extract and/or convert specific parts of data from a custom scripting language into Unity. However, I’m very new at this and not exactly sure what is the best way to go about doing it. So I’ve come to the community for advice. Here’s a basic example of the data structures I’m working with:

VERTEX   -18.700 23.668 0.000;#0
VERTEX   24.113 22.306 0.000;#1

WALL    wall1 32 0    0    1    0.000 0.000;#0
WALL    wall1 26 1    0    1    0.000 0.000;#1

BMAP wall1_map, <black.pcx>,0,0,10,10;
TEXTURE wall1_tex { BMAPS wall1_map; }
WALL wall1 { TEXTURE wall1_tex; }

SKILL startScreenSize     { VAL 0; }

ACTION init_game {
    SET    SOUND_VOL,    1;
    CALL    initParameters;

    LOAD_INFO "info", 0;
    IF (RESULT != -1) {
        SET    curResolution, defResolution;
    }
    IF (curResolution == 1) {
        SET    COMMAND_LINE,    lowres_str;
    }

IFDEF    SUPPORT_MEDRES;
    SET    glbSupportMedRes,    1;
ENDIF;

    SET    EACH_TICK.6,    NULL;
}

Extracting data from a single line of code like the “WALL” keyword is super easy. Using System.IO and the StreamReader, I simply read each line, use the String.StartsWith() function to find the correct line, and then split that line of text on whitespace.
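For reference, the line-by-line approach described above could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and it assumes that anything after the semicolon (such as "#0") is a trailing comment that can be dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical helper, not part of Unity or of the original script format.
public static class VertexLineReader
{
    // Reads "VERTEX x y z;#n" lines from any TextReader and returns the coordinates.
    public static List<float[]> ReadVertices(TextReader reader)
    {
        var vertices = new List<float[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (!line.StartsWith("VERTEX", StringComparison.Ordinal)) continue;

            // Assumption: everything after ';' (for example "#0") is a comment.
            int end = line.IndexOf(';');
            if (end >= 0) line = line.Substring(0, end);

            // A null separator splits on any whitespace; empty entries are dropped.
            string[] parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length != 4) continue; // not "VERTEX x y z", skip it

            var xyz = new float[3];
            for (int i = 0; i < 3; i++)
            {
                // InvariantCulture so "23.668" parses the same on every machine.
                xyz[i] = float.Parse(parts[i + 1], CultureInfo.InvariantCulture);
            }
            vertices.Add(xyz);
        }
        return vertices;
    }
}
```

Passing a StreamReader for the file (or a StringReader for a test string) should both work, since the method only depends on TextReader.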

However, the more complex code is a lot harder for me to figure out. For example, how can I extract just the data between the brackets while breaking each part at a comma or semicolon, but also remove those? I’m now looking into regular expressions since it seems like the best way to go here, but I’m very new to all of this and could use any advice people can offer me on the topic.

I primarily just want to extract the simpler VERTEX, WALL and SKILL data types, but understanding how to also extract ACTION data would be super helpful too. Although I think it would probably be best to convert that type of data directly into a JS or C# script, and that might be a bit too much for me right now.

Note: I only really need to import this data at editor time.

String searches and regular expressions are often enough to grab simple information from this kind of data. Particularly the simple samples you mentioned: VERTEX, WALL and SKILL. However, things get quite a bit more complex for something like ACTION.

The general solution to what you are trying to solve is called the Interpreter design pattern. You may also have luck searching for “lexer and parser”.

The gist of this technique is to imagine the text as a series of symbols (words or numbers in some sensible arrangement). For instance, VERTEX appears to be a keyword which is always followed by 3 floating point numbers and then a semicolon. WALL appears to be a keyword which is always followed by a texture identifier, then 4 integers, then 2 floating point numbers, and finally a semicolon. SKILL appears to be followed by an identifier, and then some more text in brackets.

In general, these “acceptable patterns” of symbols make up something called a language’s grammar. That is, they are the rules that describe what is a well formed statement and what is not. Writing down a language’s grammar is the first step in building an interpreter.
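To make the "series of symbols" idea a bit more concrete, here is a minimal lexer sketch that turns the text into tokens. The Token and ScriptLexer names are hypothetical, and the token rules are guesses based only on the sample in the question: '#' is assumed to start a comment that runs to the end of the line, and "<black.pcx>" is assumed to be a file reference.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token type: a kind ("number", "word", "text", "symbol") plus the raw text.
public sealed class Token
{
    public readonly string Kind;
    public readonly string Text;
    public Token(string kind, string text) { Kind = kind; Text = text; }
}

public static class ScriptLexer
{
    // Alternatives are tried left to right at each position.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<comment>#[^\r\n]*)" +                 // assumed: '#' comment to end of line
        @"|(?<number>-?\d+(?:\.\d+)?)" +           // 32, -1, 23.668
        @"|(?<word>[A-Za-z_][A-Za-z0-9_.]*)" +     // VERTEX, wall1_map, EACH_TICK.6
        @"|(?<text>""[^""]*""|<[^>\r\n]*>)" +      // "info" or <black.pcx>
        @"|(?<symbol>==|!=|\S)");                  // any other non-space character

    public static List<Token> Tokenize(string source)
    {
        var tokens = new List<Token>();
        foreach (Match m in TokenPattern.Matches(source))
        {
            if (m.Groups["comment"].Success) continue; // comments are dropped

            string kind = m.Groups["number"].Success ? "number"
                        : m.Groups["word"].Success ? "word"
                        : m.Groups["text"].Success ? "text"
                        : "symbol";
            tokens.Add(new Token(kind, m.Value));
        }
        return tokens;
    }
}
```

With tokens in hand, a rule like "VERTEX is followed by three numbers and a semicolon" becomes a check on token kinds rather than on raw characters.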

Once you have the grammar, it is possible to build a state machine that will scan across each symbol and build an expression tree. This is where a familiarity with various data structures and programming techniques is really important. I’m not sure what your level of expertise is but if you want to pursue this, you’ll need to understand what a state machine is, what a data tree is and how to traverse trees using visitors.

Ultimately, the interpreter’s job is to take this flat file of text and build a tree of various expression types. You could then traverse this tree and query those expressions for their properties.
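As a rough illustration of "flat text in, tree out", the sketch below builds a tree from nothing more than braces and semicolons and then walks it. It is far simpler than a real interpreter: ScriptNode and BlockTreeBuilder are hypothetical names, nodes only hold their raw text, and the code assumes there are no comments or string literals containing '{', '}' or ';'.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical tree node: a block header or a statement, plus nested children.
public sealed class ScriptNode
{
    public string Text = "";
    public readonly List<ScriptNode> Children = new List<ScriptNode>();
}

public static class BlockTreeBuilder
{
    public static ScriptNode Parse(string source)
    {
        var root = new ScriptNode { Text = "<root>" };
        var open = new Stack<ScriptNode>();   // blocks that are still waiting for '}'
        open.Push(root);
        var pending = new StringBuilder();    // text collected since the last delimiter

        foreach (char c in source)
        {
            if (c == '{')
            {
                // Text collected so far is the block header, e.g. "ACTION init_game".
                var block = new ScriptNode { Text = Clean(pending) };
                open.Peek().Children.Add(block);
                open.Push(block);
            }
            else if (c == '}')
            {
                if (open.Count == 1) throw new FormatException("Unmatched '}'.");
                AddStatement(open.Peek(), pending); // a last statement without ';'
                open.Pop();
            }
            else if (c == ';')
            {
                AddStatement(open.Peek(), pending);
            }
            else
            {
                pending.Append(c);
            }
        }
        if (open.Count != 1) throw new FormatException("Missing '}'.");
        return root;
    }

    // Depth-first walk: visit is called with each node and its nesting depth.
    public static void Walk(ScriptNode node, int depth, Action<ScriptNode, int> visit)
    {
        visit(node, depth);
        foreach (ScriptNode child in node.Children) Walk(child, depth + 1, visit);
    }

    private static void AddStatement(ScriptNode parent, StringBuilder pending)
    {
        string text = Clean(pending);
        if (text.Length > 0) parent.Children.Add(new ScriptNode { Text = text });
    }

    // Collapses whitespace runs to single spaces and resets the buffer.
    private static string Clean(StringBuilder pending)
    {
        string text = string.Join(" ",
            pending.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
        pending.Length = 0;
        return text;
    }
}
```

Fed the ACTION sample, this should give an "ACTION init_game" node whose children are the SET/CALL statements and the nested IF blocks. A fuller interpreter would replace the raw text with typed expression nodes.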


Thank you for the very detailed reply @eisenpony ! I understand what a state machine is, but I’m not familiar with data trees. I’ll do some research on data trees and the Interpreter design pattern. However, if parsing something like the ACTION above is this much work I may just ignore that for now.

In the meantime, do you think regular expressions are a good use for parsing simple data types like TEXTURE or SKILL that contain data between brackets?

Regular expressions should be able to handle simple data like the example you’ve already given. On the other hand, a simple string.Split can probably handle what you’ve shown. Is there a more complex example you think requires a regex?

Guess I’m just looking for the most elegant way to extract the content between the brackets and/or text spread across multiple lines.

Right now this seems to work fine for a single line element:

string[] texturePrams = text.Split(new[] { "TEXTURE", "{", "}", ";" }, StringSplitOptions.RemoveEmptyEntries);

However, regular expressions seem better for multiple lines, since I can provide the whole text file to the regex matcher like this:

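            // Singleline lets '.' match line breaks, so a block may span several lines.
            var regex = new Regex(@"\{.*?\}", RegexOptions.Singleline);
            var matches = regex.Matches(input);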

I think this returns all content between all the curly brackets, but I haven’t verified that yet. Still learning how to use them and need to match the keyword along with it. Just don’t know how to do that yet.
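One possible way to match the keyword together with its block is to use named groups. This is only a sketch: BlockRegexExample and the group names are made up, and the keyword list is limited to SKILL and TEXTURE from the sample. Note that '.' does not match line breaks unless RegexOptions.Singleline is set, which matters when the whole file is passed in.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for blocks of the form: KEYWORD name { body }
public static class BlockRegexExample
{
    // Keyword, name, then everything up to the first '}'.
    // Singleline lets '.' match newlines, so a body may span several lines.
    private static readonly Regex Block = new Regex(
        @"\b(?<keyword>SKILL|TEXTURE)\s+(?<name>\w+)\s*\{(?<body>.*?)\}",
        RegexOptions.Singleline);

    // Returns one { keyword, name, body } triple per matched block.
    public static List<string[]> FindBlocks(string input)
    {
        var found = new List<string[]>();
        foreach (Match m in Block.Matches(input))
        {
            found.Add(new[]
            {
                m.Groups["keyword"].Value,
                m.Groups["name"].Value,
                m.Groups["body"].Value.Trim()
            });
        }
        return found;
    }
}
```

Because the body stops at the first closing brace, this only works for blocks without nested braces. Something like ACTION, with IF blocks inside it, needs a different approach.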

That’s true, given the correct regex, you can keep your C# code pretty short. The trouble with regular expressions is that they hide a lot of complexity. Building or understanding a sophisticated regex takes a lot of time and effort, even for an experienced programmer. Even though it would be more lines of C# to iterate over the entire text, it might be easier to read and interpret.

I’m not a regex master myself, but I believe something like this will get you started:

new Regex(@"VERTEX\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+);");
new Regex(@"TEXTURE\s+(\w+)\s*\{\s*(\w+)\s+(\w+);\s*\}");

Personally I wouldn’t use regex, even for multiple lines.

Usually a language uses a line terminator… like a ‘;’ or a linefeed (just end of line). The semicolon is nice because you can have multi-line (human readable) statements by ignoring linefeeds.
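As a small sketch of that idea, splitting on ‘;’ and then collapsing whitespace treats a statement spread over several lines as one. StatementSplitter is a made-up name, and this assumes no semicolons appear inside string literals or comments.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: treats ';' as the terminator and ignores linefeeds.
public static class StatementSplitter
{
    public static List<string> Split(string source)
    {
        var statements = new List<string>();
        foreach (string raw in source.Split(';'))
        {
            // Collapse line breaks and runs of spaces into single spaces.
            string text = string.Join(" ",
                raw.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
            if (text.Length > 0) statements.Add(text);
        }
        return statements;
    }
}
```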

I’ve written a few parsers over the years, and there’s several trade-offs to make in design. A huge one could be memory… take for example this parser I wrote here:

This only evaluates arithmetic (with some functions tossed in and the ability to access properties of a single contextual object).

The code of it may seem very ad hoc in part. But I wrote it as such to avoid as much memory allocation and GC as possible. I use a TextReader because with that I can even evaluate text files without having to load the entire thing (streaming string data is more efficient than loading it entirely). Though usually it still comes as strings from the inspector, so I use a StringReader for that (a custom one I wrote that can be recycled).

A while back for a company I worked for I wrote another one.

That one required the ability to define functions and object types. Was a more robust language. So I tokenized everything. There were actual object types that represented these things, like ‘functions’. So as I parsed I chained together objects that represented the code… this of course used more memory, but wasn’t so GC sensitive so it wasn’t a big deal.

But in both the gritty parsing was basically the same. I read each character one by one and branched from there through a state machine.

When a line starts I expect only a handful of things.

Like in my Evaluator, I skip over whitespace (because it’s ignored) and get to the first character I find. And what do I do?

if(!IsValidWordPrefix(_current)) throw new System.InvalidOperationException("Failed to parse the command.");

//....

        private static bool IsValidWordPrefix(char c)
        {
            return char.IsLetterOrDigit(c) || c == '$' || c == '_' || c == '+' || c == '-' || c == '(';
        }


I test if it's a valid word... and a valid word MUST start with 1 of those few options. We can't start with a multiplication, that doesn't make sense. We know at what point in the flow of the program we are right now, so we branch accordingly.

Let's take C# as a more complicated example...

When we start parsing a class file we really only have a small handful of options to start with...
// - comments
# - inline commands
using - for using statement
namespace - for namespacing
public/private/protected - for modifiers
class/enum/interface/struct/etc - for type declaration

That's all we look for contextually, because that's all that's allowed. And they all are pretty unique... u,n,p,c,e,i,s.

Once we get into a class, our context has changed (our state has changed), and we know what we're looking for now...

public/private/protected - for modifiers
... - type declarations for fields
method/property declaration
etc

Once in the body of a method we again are only looking for specifically contextual things. Like a line must start with a handful of things...

variable declaration
variable/field assignment
loop/eval statement
method call
goto
etc

So really we're only hunting for what we need when we need it.
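Applied to the script format from the question, that context-driven approach might look something like the sketch below. It is not the Evaluator code mentioned above. TopLevelScanner is a hypothetical class, the list of allowed keywords is taken from the sample in the question, and '#' is assumed to start a comment that runs to the end of the line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical scanner: reads one character at a time and only accepts
// what is valid in the current context (here, the top level of the file).
public sealed class TopLevelScanner
{
    private readonly TextReader _reader;
    public TopLevelScanner(TextReader reader) { _reader = reader; }

    // Returns the keyword of each top-level statement, in order.
    public List<string> ScanKeywords()
    {
        var keywords = new List<string>();
        while (true)
        {
            SkipWhitespace();
            int next = _reader.Peek();
            if (next < 0) break;                                // end of input
            if (next == '#') { _reader.ReadLine(); continue; }  // assumed comment
            if (!char.IsLetter((char)next))
                throw new FormatException("Expected a keyword but found '" + (char)next + "'.");

            string word = ReadWord();
            switch (word)
            {
                case "VERTEX":
                case "WALL":
                case "BMAP":
                case "TEXTURE":
                case "SKILL":
                case "ACTION":
                    keywords.Add(word);
                    SkipStatement(); // a real parser would branch into a per-keyword state here
                    break;
                default:
                    throw new FormatException("Unexpected keyword: " + word);
            }
        }
        return keywords;
    }

    private void SkipWhitespace()
    {
        while (_reader.Peek() >= 0 && char.IsWhiteSpace((char)_reader.Peek())) _reader.Read();
    }

    private string ReadWord()
    {
        var word = new StringBuilder();
        while (_reader.Peek() >= 0 &&
               (char.IsLetterOrDigit((char)_reader.Peek()) || _reader.Peek() == '_'))
        {
            word.Append((char)_reader.Read());
        }
        return word.ToString();
    }

    // Consumes up to the ';' that ends a plain statement, or to the '}' that closes a block.
    private void SkipStatement()
    {
        int depth = 0;
        int c;
        while ((c = _reader.Read()) >= 0)
        {
            if (c == '{') depth++;
            else if (c == '}')
            {
                depth--;
                if (depth < 0) throw new FormatException("Unmatched '}'.");
                if (depth == 0) return;
            }
            else if (c == ';' && depth == 0) return;
        }
        throw new FormatException("Unexpected end of input.");
    }
}
```

Since it reads from a TextReader, the same scanner should work on a StreamReader over the file without loading the whole text into memory.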

...

Anyways, hopefully that could give you a start.

So far regex seems a lot simpler (to use, not build) than standard string methods to me. My code started to feel very repetitive when using the latter, but maybe I’m just doing it wrong.

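text = text.Replace("VERTEX", " ");
text = text.Replace(";", " ");
// A null separator splits on any whitespace; RemoveEmptyEntries drops the blanks,
// so no per-element Trim() is needed (Trim() returns a new string anyway).
string[] split = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);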

With regex I can do something like this to get only the elements I actually want all at once:

Regex.Matches(text, @"([+-]?[A-Za-z0-9_]+(?:\.[0-9]*)?)")

However, I guess which method is better is a bit subjective?