XML and BOM encoding.

Hi, I’m downloading an XML file from a server, saving it locally and then parsing it. When I try to load it, I receive this error:

Text node cannot appear in this state. Line 1, position 1.

Apparently this is to do with BOM encoding as discussed here:

http://answers.unity3d.com/questions/10904/xmlexception-text-node-canot-appear-in-this-state.html

http://answers.unity3d.com/questions/349977/upgrading-to-40-breaks-my-xml-reader.html

Here is the code I’m using:

	IEnumerator DownloadXML ()
	{
		// If there is an existing file, delete it.
		if (System.IO.File.Exists (localFile)) {
			Debug.Log ("Exists: " + localFile);
			System.IO.File.Delete (localFile);
			print ("Deleted: " + localFile);
		}

		WWW wwwfile = new WWW (remoteFile);
		yield return wwwfile;

		// After it's downloaded.
		print ("File Size : " + wwwfile.bytes.Length);
		print ("Cache Location:" + Application.temporaryCachePath);

		// Write to local file.
		System.IO.File.WriteAllBytes (localFile, wwwfile.bytes);

		Debug.Log ("Cache saved: " + localFile);
		Debug.Log ("File downloaded");

		if (System.IO.File.Exists (localFile)) {
			Debug.Log (" file does exist");
		} else {
			Debug.Log (" file does not exist");
		}

		ReadXML ();
	}

So I write this file out locally, and when I open it in Notepad I can see it’s fine. But when I come to load it:

	XmlDocument xmlDoc = new XmlDocument (); // xmlDoc is the new XML document.
	xmlDoc.LoadXml (localFile); // load the file.

I receive the ‘Text node cannot appear in this state. Line 1, position 1.’ error.

Apparently BOM encoding is stored within the first byte (and this causes the issue) so I then tried to strip out the first byte:

		// Create an array with one less element than the file
		byte[] fileWithoutBom = new byte[wwwfile.bytes.Length-1];
		
		for (int index=1; index<fileWithoutBom.Length; index++)
		{
			fileWithoutBom[index-1] = wwwfile.bytes[index];	
			
		}

But the error persists. Can anyone help? Thanks in advance!

You’re calling LoadXml with a file path, but LoadXml expects the XML content itself as a string.

So you should do either

xmlDoc.LoadXml(System.IO.File.ReadAllText(localFile));

or

xmlDoc.Load(localFile);

because Load expects a file name. See the documentation for XmlDocument.LoadXml and XmlDocument.Load.
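To make the difference concrete, here is a minimal sketch in plain .NET (no Unity types), so it can be run as a console program. The class name, method names and the temp file name are made up for illustration. It loads the same file both ways and then repeats the mistake from the question, which ends in an XmlException because the path itself is parsed as markup.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class LoadVersusLoadXml
{
    // Load takes a file name and reads the file itself.
    public static XmlDocument FromPath(string path)
    {
        var doc = new XmlDocument();
        doc.Load(path);
        return doc;
    }

    // LoadXml takes the markup, so the file has to be read into a string first.
    public static XmlDocument FromMarkup(string path)
    {
        var doc = new XmlDocument();
        doc.LoadXml(File.ReadAllText(path));
        return doc;
    }

    public static void Main()
    {
        // Hypothetical sample file, deliberately written with a UTF-8 BOM.
        string path = Path.Combine(Path.GetTempPath(), "load-demo.xml");
        File.WriteAllText(path, "<root><item>1</item></root>", new UTF8Encoding(true));

        Console.WriteLine(FromPath(path).DocumentElement.Name);
        Console.WriteLine(FromMarkup(path).DocumentElement.Name);

        try
        {
            // The mistake from the question: the path is treated as XML text.
            new XmlDocument().LoadXml(path);
        }
        catch (XmlException e)
        {
            Console.WriteLine("LoadXml(path) failed: " + e.Message);
        }
    }
}
```

The exact exception message differs between runtimes, so the sketch only relies on the exception type.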

First of all, the Byte Order Mark (BOM) is not an encoding. It’s a special character (U+FEFF) at the start of the text which tells the receiving side in which order the bytes form integer values (endianness).

Secondly, in UTF-8 the BOM is made up of three bytes (EF BB BF), not one.
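If you do want to strip the BOM at the byte level, all three bytes have to go, and only when they are actually there. The following is a rough sketch under the assumption that the download is UTF-8; the class and method names are invented for this example. It also prints the preamble that .NET itself uses for UTF-8, so you can check the three bytes.

```csharp
using System;
using System.Text;

public static class Utf8Bom
{
    // True if the data begins with the UTF-8 BOM bytes EF BB BF.
    public static bool StartsWithBom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    // Returns a copy without the leading BOM, or the input unchanged if there is none.
    public static byte[] StripBom(byte[] data)
    {
        if (!StartsWithBom(data))
            return data;

        byte[] result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }

    public static void Main()
    {
        byte[] preamble = new UTF8Encoding(true).GetPreamble();
        Console.WriteLine(BitConverter.ToString(preamble)); // EF-BB-BF

        // Build a small payload that starts with the BOM.
        byte[] body = Encoding.UTF8.GetBytes("<root/>");
        byte[] data = new byte[preamble.Length + body.Length];
        preamble.CopyTo(data, 0);
        body.CopyTo(data, preamble.Length);

        Console.WriteLine(data.Length);           // 10
        Console.WriteLine(StripBom(data).Length); // 7
    }
}
```

Array.Copy avoids the hand-written loop, which is where off-by-one mistakes tend to creep in.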

To actually fix your problem, save your file without a BOM. Just open the XML file in Notepad++ and save it as “UTF8 without BOM”.
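If you would rather do the same thing in code than in an editor, something along these lines should work. It is only a sketch: the class name, method name and temp file name are placeholders, and it assumes the file is small enough to read into memory in one go.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResaveWithoutBom
{
    public static void Resave(string path)
    {
        // ReadAllText detects a BOM, uses it to pick the encoding, and drops it.
        string text = File.ReadAllText(path);

        // UTF8Encoding(false) writes UTF-8 without a BOM.
        File.WriteAllText(path, text, new UTF8Encoding(false));
    }

    public static void Main()
    {
        // Hypothetical sample file, written with a BOM on purpose.
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.xml");
        File.WriteAllText(path, "<root/>", new UTF8Encoding(true));
        Console.WriteLine(File.ReadAllBytes(path).Length); // 10: 3 BOM bytes + 7 characters

        Resave(path);
        Console.WriteLine(File.ReadAllBytes(path).Length); // 7
    }
}
```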

Alternatively, use the text property of your “wwwfile”, which is interpreted as a Unicode string; there the BOM, if present, should be represented by a single character (U+FEFF). That’s not the case in a byte array, since UTF8 / UTF16 / … use multiple bytes to encode some characters.
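Here is a small sketch of that idea in plain .NET. It decodes the bytes with Encoding.UTF8.GetString, which keeps a leading BOM as the single character U+FEFF, drops that character if it is there, and then hands the string to LoadXml. The class and method names are made up, and it assumes the data really is UTF-8. I haven't checked whether the text property of WWW already removes the BOM, so treat the check as a defensive measure.

```csharp
using System;
using System.Text;
using System.Xml;

public static class BomAsCharacter
{
    public static XmlDocument ParseUtf8(byte[] data)
    {
        // After decoding, a BOM is one character (U+FEFF), however many bytes it took.
        string text = Encoding.UTF8.GetString(data);
        if (text.Length > 0 && text[0] == '\uFEFF')
            text = text.Substring(1);

        var doc = new XmlDocument();
        doc.LoadXml(text);
        return doc;
    }

    public static void Main()
    {
        // Build a sample payload: UTF-8 BOM followed by a tiny document.
        byte[] preamble = new UTF8Encoding(true).GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<root/>");
        byte[] data = new byte[preamble.Length + body.Length];
        preamble.CopyTo(data, 0);
        body.CopyTo(data, preamble.Length);

        Console.WriteLine(Encoding.UTF8.GetString(data).Length); // 8: U+FEFF plus 7 characters
        Console.WriteLine(ParseUtf8(data).DocumentElement.Name); // root
    }
}
```

In the question's coroutine the equivalent would be to pass the downloaded string, with the first character removed if it is U+FEFF, to LoadXml instead of the file path.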