How often does downloaded content get corrupted?

My app will download some data files (not asset bundles). Just wondering how important it is to do some kind of check on the files.

If the file download completes without error then what’s the chance of it being corrupted?

It probably depends on the mechanism you’re using. If it’s a standard WWW or web request then it’s TCP/IP and if it completes successfully then you’re probably ok. Of course this can also depend on how you’re serving up the data. It can also depend on whether you’re downloading it all in one go or chunking it and reassembling client side.

One thing you can do is calculate an MD5 hash of the downloaded file and compare it with a hash provided by the server (or with a previously known hash, if the content isn’t dynamic) to see if they match. Generating and verifying the checksum isn’t terribly slow, but it’s not something you’d want to do for time-critical data.
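
As a rough sketch of that check (written in Python for brevity; the same idea applies with the .NET MD5 classes), something like the following would do. The function names and the chunk size are just illustrative assumptions, not part of any particular API:

```python
import hashlib


def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so large downloads don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Compare the file's digest with a known-good hex digest
    (hypothetical: embedded in the app or fetched from the server)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If `is_intact` returns False, the simplest reaction is to throw the file away and download it again.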

Check this out. A video artist created a video and then uploaded and downloaded it 1000 times to see what happens.

This is more an artefact of switching video formats 2000 times, and has little to do with the actual data transfer. You could get the same effect just by switching formats on your own PC.

Yeah, I get that this has very little to do with the kind of uploading/downloading we do, but as I understand it, it was more about the Internet’s own compression than a change in format. But still, it shows the extreme. I’d bet that if you upload/download the same exe 1000 times, the likelihood of it becoming corrupt beyond use is pretty high. However, that is not very likely to happen.

It is, however, creepy as hell and quite entertaining. Would actually be a great way to generate cyber-demon or alien voices for a game!

As long as there’s not a fatal error during transfer, the likelihood is exactly zero because you download exactly the same bytes that you upload. In the example above if the artist actually uploaded and downloaded the same video there would have been no change. It would have been byte for byte identical. So, either he’s uploading to a service (like YouTube) that re-encodes the video or he’s downloading by streaming from a service that’s compressing it. If he were downloading the raw video itself, even if it were gzipped and uncompressed, there would be no difference.

I thought that all file transfer protocols compressed files to some degree.

Nope. In fact, the transfer protocols themselves are just specifications and implementations; they don’t compress anything. HTTP, for example, is a protocol, but it sits on top of TCP/IP. There is no inherent compression there. However, servers have implemented GZip and Deflate compression, which they apply on top of HTTP. They send a header that tells the client what compression was used, and the client decompresses.

There’s also a difference between regular file compression and audio/video compression. The concepts are the same but the implementation is different. As an example, you can take a video file and zip and unzip it 1000 times… the unzipped version will be identical to the original version every time. However, zipping the file will not yield you very good results, especially in compact formats like MP4. Those already implement compression, but it’s permanent and not reversible. They do things like varying the framerate so they can combine frames where there is no motion and averaging pixels, etc. Just like how image compression works.
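
To illustrate the lossless side of that, here is a small sketch using Python’s zlib (the same Deflate family that zip and gzip use). The payload is made up for the example: some random bytes standing in for already-compact data, plus a highly repetitive run.

```python
import os
import zlib

# Hypothetical payload: 1 KB of random bytes plus 100 KB of repetition.
original = os.urandom(1024) + b"A" * 100_000

# Compress and decompress 1000 times; lossless compression must
# hand back exactly the bytes that went in, every time.
data = original
for _ in range(1000):
    data = zlib.decompress(zlib.compress(data))

print(data == original)  # True

# Compressing data that is already compressed gains very little.
once = zlib.compress(original)
twice = zlib.compress(once)
print(len(original), len(once), len(twice))
```

The last line prints the three sizes; the first pass shrinks the repetitive part dramatically, while the second pass has almost nothing left to squeeze.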

Yes, this is what I’ve been looking at. Using the .NET MD5 functionality seems quick enough and I can just embed the hashes in my app.

One thought that did cross my mind: if I query the hash from a server, how do I know that the hash I downloaded from the server is itself valid?

At some point you just have to trust what comes from the server. Personally, I probably wouldn’t even bother with the hash unless you’re trying to verify whether a save file was tampered with or something.

From http://noahdavids.org/self_published/CRC_and_checksum.html :

“In “Performance of Checksums and CRCs over Real Data” [1] Stone and Partridge estimated that between 1 in 16 million and 1 in 10 billion TCP segments will have corrupt data and a correct TCP checksum. This estimate is based on their analysis of TCP segments with invalid checksums taken from several very different types of networks. The wide range of the estimate reflects the wide range of traffic patterns and hardware in those networks. One in 10 billion sounds like a lot until you realize that 10 billion maximum length Ethernet frames (1526 bytes including Ethernet Preamble) can be sent in a little over 33.91 hours on a gigabit network (10 * 10^9 * 1526 * 8 / 10^9 / 60 / 60 = 33.91 hours), or about 26 days over a T3.”

I also found this discussion about it: http://stackoverflow.com/questions/3830206/can-a-tcp-checksum-produce-a-false-positive-if-yes-how-is-this-dealt-with

There is also the possibility that the data will be corrupted in RAM or on the HD.

What conclusion you draw from this would depend on the amount of data you download, and how critical it is that it’s correct.
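
To put rough numbers on that, here is a back-of-the-envelope sketch. It takes the two per-segment rates from the quote above and assumes (my assumptions, not from the paper) that each TCP segment carries about 1460 bytes of payload and that segments fail independently:

```python
import math


def p_undetected(download_bytes, p_segment, payload_bytes=1460):
    """Chance that at least one segment in the download is corrupt
    but still passes the TCP checksum.

    Assumes independent segments and a fixed payload size per segment."""
    segments = -(-download_bytes // payload_bytes)  # ceiling division
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(segments * math.log1p(-p_segment))


for size_mb in (1, 100, 10_000):
    size = size_mb * 1_000_000
    best = p_undetected(size, 1 / 10e9)   # 1 in 10 billion segments
    worst = p_undetected(size, 1 / 16e6)  # 1 in 16 million segments
    print(f"{size_mb:>6} MB: {best:.2e} to {worst:.2e}")
```

Under those assumptions a small download is very unlikely to be affected, while multi-gigabyte downloads on a bad network start to look less comfortable.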

For your average game, I’m with @Dustin-Horne: I’d just skip the hash. Another option is a repair utility (a separate program or a menu option) that checks hashes and re-downloads corrupt data if the player runs into any problems.

This is saying that the data is corrupt but the hash is still correct, which can happen. It is possible (and it does happen) that two different pieces of data result in the same hash. The odds are pretty low that you’re going to get corrupt data, and even if you are hashing, it’s possible that your checksum could falsely claim the data isn’t corrupted (even though those odds are even lower, as seen above).

That all being said, if you want you can include the checksum with your response from the server. You can stick it in the ETag header value. So you send the checksum back with the response headers along with the chunk of data.
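
On the client side, checking that could look roughly like the sketch below. It assumes your own server puts the hex MD5 of the body in the ETag header, which is a convention you would have to set up yourself (many servers generate ETags some other way). The function name and the plain dict of headers are hypothetical.

```python
import hashlib


def etag_matches_body(headers, body):
    """Return True if the ETag header holds the hex MD5 of the body.

    Assumes `headers` is a plain dict with an "ETag" key and that the
    server was configured to send an MD5 digest as the ETag value."""
    etag = headers.get("ETag", "")
    if etag.startswith("W/"):
        # Weak validators don't promise byte-for-byte equality.
        return False
    expected = etag.strip().strip('"').lower()  # ETag values are quoted
    return expected != "" and hashlib.md5(body).hexdigest() == expected
```

If the check fails for a chunk, you only need to re-request that chunk rather than the whole file.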