Parsing localization from a CSV file

Hello, I need help with parsing a CSV file.
I added several lines that need to be parsed, but either characters I don't want are left in the result, or characters I do need disappear. I am not very familiar with regex.

Moon_Description,"""Moon description""","""Луна описание"""
Coin_Name,Coin,Монета
Coin_Description,"Coin description A = {0}, B = {2} C = {1}","Монета описание A = {0}, B = {2} C = {1}"

If the CSV export adds " characters when the file is downloaded, then after parsing the " at the end of the line remain, while the other " disappear.

"""Moon description"""

turns into

Moon description

and should turn into

"Moon description"

"""Луна описание"""

turns into

Луна описание"

and should turn into

"Луна описание"

Also, the line

"Монета описание A = {0}, B = {2} C = {1}"

turns into

Монета описание A = {0}, B = {2} C = {1}

and should turn into

Монета описание A = {0}, B = {2} C = {1}"

Here is the parsing code:

var languageValues = Enum.GetValues(typeof(ELanguage));
var csvParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
var rows = csvRawData.Split(GetPlatformSpecificLineEnd());
var data = new Dictionary<ELanguage, Dictionary<string, string>>();

foreach (var lang in languageValues.Cast<ELanguage>())            
    data.Add(lang, new Dictionary<string, string>());

for (var rowId = 1; rowId < rows.Length; rowId++)
{
    var cells = csvParser.Split(rows[rowId]);
    var id = cells[0];
    for(var cellId = 1; cellId < math.min(cells.Length, languageValues.Length + 1); cellId++)
    {
        var cell = cells[cellId];
        cell = cell.Trim('"');
        cell = cell.Replace("\"\"", "\"");
        data[(ELanguage)cellId].Add(id, cell);
    }
}

return data;

Using regular expressions to parse CSV, especially when quotes are optional, probably won't work out in the end. Here's a whole SO question with countless different answers and suggestions, and pretty much all of them have edge cases where the solution fails.

It would be much easier and much more reliable to just parse the input string character by character while keeping track of the quoted state. So you simply have a token StringBuilder to which you add the next character unless it's a comma or a double quote. When you hit a double quote and you're currently not inside a quoted state (a boolean), then you toggle the quoted state on. If you hit a double quote while inside the quoted state, you look at the next character. If it's also a double quote, you add a single double quote to the token, skip those two double quotes in the input string and continue. If it's anything else, the double quote closes the quoted section and you toggle the quoted state off. Of course, while the quoted state is on, you don't treat commas and newline characters as separators and simply add them to the token. When the quoted state is off and you hit a comma, you take the token, make it a value, clear out the StringBuilder and continue after the comma. Likewise, when hitting a newline, you start a new row. That is quite simple logic, will be extremely fast, works as it should and can be adjusted easily for any additional edge cases.
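The approach described above could look roughly like this. It is only a minimal sketch: the names `CsvSketch` and `Parse` are made up for illustration, and it assumes the doubled-quote escaping variant ("" inside a quoted field means one literal ") with a single-character delimiter.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Hypothetical helper: parses CSV text into rows of field values.
    // Assumes doubled-quote escaping; backslash escaping is NOT handled.
    public static List<List<string>> Parse(string input, char delimiter = ',')
    {
        var rows = new List<List<string>>();
        var row = new List<string>();
        var token = new StringBuilder();
        var inQuotes = false;

        for (var i = 0; i < input.Length; i++)
        {
            var c = input[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    token.Append(c); // delimiters and newlines are literal here
                }
                else if (i + 1 < input.Length && input[i + 1] == '"')
                {
                    token.Append('"'); // "" -> one literal quote
                    i++;               // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;  // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;       // opening quote (lenient: allowed mid-field)
            }
            else if (c == delimiter)
            {
                row.Add(token.ToString());
                token.Clear();
            }
            else if (c == '\n' || c == '\r')
            {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < input.Length && input[i + 1] == '\n') i++;
                row.Add(token.ToString());
                token.Clear();
                rows.Add(row);
                row = new List<string>();
            }
            else
            {
                token.Append(c);
            }
        }

        // Flush the last row when the input does not end with a line break.
        if (token.Length > 0 || row.Count > 0)
        {
            row.Add(token.ToString());
            rows.Add(row);
        }
        return rows;
    }
}
```

With this sketch, the Moon_Description row from the question yields the values `"Moon description"` and `"Луна описание"` with one pair of literal quotes each, and the Coin_Description values come out without surrounding quotes. Note that an empty line produces a row with a single empty field, so you may want to skip those.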

A huge issue with CSV is that it's not really a standard. There are many different variations out there. Some use double quote duplication, some use backslash escaping, others don't support quoted text at all. So there's no general “right” way to do it. An additional issue is that the CSV format may even vary depending on the culture settings. Since here in Germany we use a comma as the decimal separator, an Excel CSV export will use a semicolon as the field separator and not a comma. So providing general CSV import functionality is almost impossible. That's why a lot of software that can import CSV allows you to choose the settings and delimiters.
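If you can't ask the user for the delimiter, one option is to guess it from the first line. The following is just a heuristic sketch, not a reliable detector: `DelimiterGuess` and `Guess` are hypothetical names, and it simply picks whichever candidate character occurs most often outside of quoted sections.

```csharp
using System;

public static class DelimiterGuess
{
    // Hypothetical helper: returns the candidate that appears most often
    // outside double-quoted sections of the first line, or the fallback
    // when none of the candidates appears at all.
    public static char Guess(string firstLine, char[] candidates, char fallback = ',')
    {
        var counts = new int[candidates.Length];
        var inQuotes = false;

        foreach (var c in firstLine)
        {
            if (c == '"')
            {
                // A doubled quote toggles twice, so the state stays correct.
                inQuotes = !inQuotes;
                continue;
            }
            if (inQuotes) continue;

            var index = Array.IndexOf(candidates, c);
            if (index >= 0) counts[index]++;
        }

        var best = -1;
        for (var i = 0; i < candidates.Length; i++)
        {
            if (counts[i] > 0 && (best < 0 || counts[i] > counts[best])) best = i;
        }
        return best < 0 ? fallback : candidates[best];
    }
}
```

For example, `DelimiterGuess.Guess(headerLine, new[] { ',', ';', '\t' })` would cover the comma and semicolon exports mentioned above. An explicit setting is still the safer choice whenever you control the import.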

Some of your examples and what it “should” look like confuse me a bit. Why should there be a single double quote at the end of your third example when the whole text is just quoted once?

So when you approach this, you should define for yourself which cases you actually want to support. Note that the way double quotes should be handled in your case seems to be similar to how C#'s verbatim strings (@"...") work. So when you have two double quotes right next to each other inside the string, it results in a literal double quote in the actual string.
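Applied to a single, already separated field, that rule could be sketched as follows. `CsvField.Unquote` is a made-up helper name. The difference from `cell.Trim('"')` is that `Trim` removes every leading and trailing quote character, while this removes exactly one enclosing pair before collapsing the doubled quotes.

```csharp
public static class CsvField
{
    // Hypothetical helper: removes ONE pair of enclosing quotes (if present)
    // and turns each doubled quote inside into a single literal quote.
    // Unquoted fields are returned unchanged.
    public static string Unquote(string field)
    {
        var isQuoted = field.Length >= 2
                       && field[0] == '"'
                       && field[field.Length - 1] == '"';
        if (!isQuoted) return field;

        var inner = field.Substring(1, field.Length - 2);
        return inner.Replace("\"\"", "\"");
    }
}
```

This assumes the field has no stray whitespace or line-ending characters after the closing quote. A leftover "\r" at the end of a row, for instance, would stop the field from being recognized as quoted.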

string s1 = @"He said: ""This is an example"".";
string s2 = "He said: \"This is an example\".";

Here s1 and s2 result in the exact same string. Verbatim strings have the advantage that no other characters need to be escaped and that you can have actual newline characters inside the string.

string s1 = @"He said: ""This is an example
and it spans over
multiple lines"".
The End";
string s2 = "He said: \"This is an example\nand it spans over\nmultiple lines\".\nThe End";

Here again s1 and s2 are the same thing. Be careful in this case, though: depending on the file's line endings, the newlines may be "\n" or "\r\n" or, in rare cases, even just "\r".
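One simple way to stop caring about which variant you got is to normalize all line endings to "\n" before doing anything else with the text. A minimal sketch, with `LineEndings.Normalize` being a hypothetical helper name:

```csharp
public static class LineEndings
{
    // Hypothetical helper: converts \r\n and lone \r into \n.
    // \r\n has to be replaced first, otherwise each \r\n would
    // end up as two line breaks.
    public static string Normalize(string text)
    {
        return text.Replace("\r\n", "\n").Replace('\r', '\n');
    }
}
```

After that, a stray "\r" can no longer end up attached to the last value of a row, and line breaks inside quoted values are consistent as well.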