Would it be possible to look into changing the Unity package format from TGZ to something a bit more sane, such as ZIP?
TGZ is an archaic format made specifically for saving a compressed archive to a tape drive. (That’s the T in TGZ.) While TGZ defenders claim that it has a better compression ratio than more modern archive formats due to treating the entire archive as a single block for compression purposes, whereas ZIP compresses every file separately, this is actually of minimal benefit to Unity for two reasons:
The sliding window used in the compressor is only 32 KB in size (that is the limit for gzip’s DEFLATE), so you only get this benefit at the edges of files. When dealing with large quantities of small text files, which is what tar was invented for back in the 70s on a UNIX system where everything is text, that can be a significant benefit. When dealing with multi-megabyte binary assets… not so much.
Compression works mostly by finding and “squeezing out” repetition in the data it’s given. This means that large amounts of homogeneous data compress extremely well. But a Unity package isn’t homogeneous; it contains several different types of files, both text and binary, all mixed together.
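To put a rough number on the sliding-window point, here is a minimal Python sketch using the standard `zlib` module (the same DEFLATE algorithm gzip uses). It compresses a random block followed by an exact copy of itself, once with the copy 16 KiB away and once with it 64 KiB away. The block sizes and the fixed seed are arbitrary choices for illustration, not anything taken from Unity.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_block(size: int) -> bytes:
    """Return `size` pseudo-random bytes (effectively incompressible)."""
    return rng.getrandbits(8 * size).to_bytes(size, "little")

def ratio(data: bytes) -> float:
    """Compressed size divided by original size, at the highest level."""
    return len(zlib.compress(data, 9)) / len(data)

near = random_block(16 * 1024)  # the copy starts 16 KiB back: inside the window
far = random_block(64 * 1024)   # the copy starts 64 KiB back: outside the window

near_ratio = ratio(near * 2)
far_ratio = ratio(far * 2)

print(f"copy 16 KiB away: {near_ratio:.2f} of original size")
print(f"copy 64 KiB away: {far_ratio:.2f} of original size")
```

The first case should come out at roughly half size, because the second copy can be encoded as back-references. The second should barely shrink at all, because the compressor can no longer see the first copy by the time it reaches the second.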
So the technical advantages of TGZ are negligible, but the disadvantage it poses is massive: linearity. Anyone who’s ever downloaded a large asset from the Asset Store knows that it takes forever to open the package and pull up the list of files inside. Why? Because it has to decompress and scan the entire archive to produce that list; it has to read every file one-by-one to see what’s even in the archive. Meanwhile in a ZIP file, each file inside the archive is compressed separately and an index is written at the end of the archive that can be read immediately. Switching to ZIP would speed up the common case noticeably, and immensely speed up other cases such as only wanting to grab a few items out of a package rather than the entire thing, or indexing packages with something like Asset Inventory.
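As a rough illustration of the difference in how a file list is produced, here is a Python sketch using only the standard `zipfile` and `tarfile` modules. The file names and sizes are made up for the example, and this says nothing about how Unity itself reads packages.

```python
import io
import tarfile
import zipfile

# Hypothetical package contents, used only for illustration.
FILES = {
    "Assets/readme.txt": b"hello",
    "Assets/model.bin": bytes(200_000),
    "Assets/script.cs": b"class Example {}",
}

def build_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

def build_tgz(files):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_zip(blob):
    # ZipFile reads the central directory stored at the end of the archive;
    # no member data is decompressed to produce the listing.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()

def list_tgz(blob):
    # tarfile walks the member headers in order, and everything in front of
    # a header has to be gunzipped before that header can be read.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tf:
        return tf.getnames()

zip_names = list_zip(build_zip(FILES))
tgz_names = list_tgz(build_tgz(FILES))
print(zip_names)
print(tgz_names)
```

Both calls return the same names. The difference is in how much of the archive has to be read and decompressed to get them, which only becomes noticeable once the package is large.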
It’s just like how everyone uses array-backed lists instead of linked lists these days. Random access has definitively proven it has overwhelming advantages over linear-only access. Unity’s package system would be greatly improved if it were to upgrade to a random-access-capable format.
Regarding tar as the format… What is wrong with it? It’s being updated regularly unlike zip, so why change it?
Guess I misunderstood this thread…
Edit: I mean like new/better compression algorithms, and not bug fixes, etc…
Edit 2: Technically just changing the algorithm should fix your problems, no? As tar itself is just for archiving and not compression…
I don’t think they will touch anything until they move the store over to UPM packages. Although AFAIK those are TGZ too.
I only see one way for them to change the compression, I think: if they move to NuGet, which won’t happen.
The problem with TGZ isn’t the GZ component; the problem is that tar is a linear-only archive format, and linear archives are obsolete garbage.
Tar was designed as a Tape ARchive, which means the entire thing has to be able to be processed conceptually as a tape, i.e. in a single, forward-only pass. And that’s a terrible way to construct an archive of non-trivial size, for the reasons I mentioned in my first post. The random-accessibility of the ZIP format is the principal reason why it’s taken over the world and become the de facto standard for file archival over the past ~35 years. TAR is a format that no one should be using anymore in the 21st century unless they are literally doing tape backups. (Which do in fact still exist, but they’re quite rare, for plenty of good reasons.) It’s incredibly ill-suited to something like Unity packages, and replacing gzip with more modern compression won’t fix that.
Oh so do you want something like in .rar or .zip where one can pull out a single file without having to decompress everything? So exactly the opposite of what solid compression does…
I can understand where you’re coming from, but the issue here mainly lies in decompression speed, no? If you can decompress files faster than zip (especially when wanting to decompress a whole child folder) then it’s better…
Tar has better compression ratios anyways (probably one of the reasons Unity is using it) and sure if you want to extract a single script file out of a whole package with multiple high quality models then it will be a hassle, but how often do you do that anyways?
I think our differences lie in what has more precedence for us…
Either slow single file decompression or fast multiple file decompression…
And I’m on the fast multiple file decompression side!
@MasonWheeler In my experience, the long load time of the unitypackage index isn’t solely caused by TGZ compression. Unity extracts the assets’ icons from the archive (or maybe even extracts the whole archive in advance?) and that alone takes a lot of time. I’ve observed this while writing my custom unitypackage extractor. You can give it a try to see the time difference for yourself.
Yeah I think that is the OP’s issue with .tar…
Tar requires first dearchiving everything, before decompression…
But that makes it extremely fast…
As a reference:
Archiving (what tar does) is the act of taking folders and outputting a single file without compressing it.
Compression takes either a folder (zip) or a single file (the resulting tar file) and losslessly minimizes bytes stored on the disk.
Tar just provides a compression algorithm with a singular file to compress. Every compression algorithm has to somehow archive a folder, but may decide to do it differently!
I addressed this in my initial post. This is theoretically true, but in practice, due to the details of how a compressor works, you’ll only notice the difference if you have a large quantity of small, homogeneous files, which is not what Unity packages typically look like.
Tell me you’ve never used Asset Inventory without saying you’ve never used Asset Inventory.
This is entirely backwards. A TGZ is a group of files that have been tar’d first, and the resulting archive is then compressed with gzip. So it requires decompressing everything first before dearchiving anything, and that makes it extremely slow.
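The layering can be sketched in a few lines of Python with the standard `tarfile` and `gzip` modules. The two file names are placeholders for illustration; the point is only the order of the steps.

```python
import gzip
import io
import tarfile

payload = {"a.txt": b"alpha", "b.txt": b"beta"}

# Step 1: archive only. Mode "w" writes a plain tar with no compression.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tf:
    for name, data in payload.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
tar_bytes = tar_buf.getvalue()

# Step 2: compress the finished archive as one single stream.
tgz_bytes = gzip.compress(tar_bytes)

# Reading reverses the order: gunzip first, then parse the tar inside.
restored_tar = gzip.decompress(tgz_bytes)
with tarfile.open(fileobj=io.BytesIO(restored_tar), mode="r:") as tf:
    restored = {m.name: tf.extractfile(m).read() for m in tf.getmembers()}

print(tgz_bytes[:2].hex())  # gzip magic bytes on the outside, not a tar header
print(restored)
```

In practice the two steps are usually interleaved as a stream rather than done as two full passes, but the nesting is the same: the tar headers live inside the gzip stream, so they cannot be read without decompressing what comes before them.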
Yes, O(n) to do something like extracting a certain file can be a pain in the rear. I get it. But don’t describe the tape archive with factual inaccuracies just to make a point.
It’s funny to make it sound bad that the “tape archive” format is designed for magnetic tape in the 70s. Like, those machines had only a few kilobytes of RAM and were happy to break 1MHz, and they still got their work done. And the entire design philosophy of the Unix operating system centered around “everything is a file, including devices and processes” and “every process uses one-directional streams to communicate as a pipeline.” This makes network sockets trivial, because they work just like any other stream. Even over long distance wired or undersea or radio or satellite links, because they work just like any other stream. There’s a reason industrial servers today like Unix [Linux], there’s a reason Steve Jobs liked Unix [NeXT, MacOSX], there’s a reason both iOS and Android have their underpinnings based on Unix [FreeBSD, Linux] and they use compressed tar streams for everything because it’s tried and true and optimized like hell for a ton of uses.
This is actually not true. In a tarball, everything necessary to decompress during extraction is provided ahead of the files. You decompress as you go, and when you find the file entries you want, you remember that part of the stream. If you only need something that is 30% into the tarball, you only read 30% of the tarball to get it. Written correctly, you never need to back up and start over to do any read operation twice in a tarball. It behaves like a tape drive; it recognizes that the physical limitations of tape drives mean it’s difficult or impossible to go backwards without manual intervention.
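Here is a hedged Python sketch of that forward-only read, using `tarfile` in its streaming mode (`"r|gz"`). `CountingReader`, `noise`, and the member names are helpers invented for this example; they just make it visible how many compressed bytes were consumed before the wanted file turned up.

```python
import io
import random
import tarfile

class CountingReader:
    """File-like wrapper that records how many compressed bytes were read."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_read += len(chunk)
        return chunk

def noise(rng, size):
    # Random bytes barely compress, so positions in the .tgz track
    # positions in the underlying tar closely.
    return rng.getrandbits(8 * size).to_bytes(size, "little")

def build_tgz(members):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def fetch_streaming(blob, wanted):
    """Scan forward once and stop as soon as the wanted member is read."""
    reader = CountingReader(blob)
    with tarfile.open(fileobj=reader, mode="r|gz") as tf:
        for member in tf:
            if member.name == wanted:
                return tf.extractfile(member).read(), reader.bytes_read
    return None, reader.bytes_read

rng = random.Random(0)
members = [
    ("icon.png", noise(rng, 1_000)),
    ("big1.bin", noise(rng, 300_000)),
    ("big2.bin", noise(rng, 300_000)),
]
blob = build_tgz(members)

_, early = fetch_streaming(blob, "icon.png")
_, late = fetch_streaming(blob, "big2.bin")
print(f"archive: {len(blob)} bytes")
print(f"icon.png (first member): read {early} bytes")
print(f"big2.bin (last member):  read {late} bytes")
```

A member near the front costs only a small read, while one at the end costs roughly the whole archive. Producing a complete file list falls into the second category, since the last header sits near the end of the stream.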
Now, could Unity packages do something smart and put all of the icons at the beginning of the stream? Sure. Could they provide a metadata or manifest file up front, like Android APKs do, to make it trivial to fetch the summary first? Sure. For your use case, suddenly it’s a lot faster. But it’s not the tarball format that is the problem, it’s that Unity didn’t think of your use case when they decided to use an industry standard.
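A manifest-first layout could look something like the following Python sketch. To be clear, this is a hypothetical design, not how Unity packages are actually laid out; the `manifest.json` name and the JSON list format are assumptions made up for the example.

```python
import io
import json
import tarfile

def add_bytes(tf, name, data):
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

def build_package(files):
    """Write a .tgz whose first member is a JSON list of the other members."""
    manifest = json.dumps(sorted(files)).encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        add_bytes(tf, "manifest.json", manifest)  # hypothetical member name
        for name in sorted(files):
            add_bytes(tf, name, files[name])
    return buf.getvalue()

def read_manifest(blob):
    """Read only the first member; later members are never visited."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|gz") as tf:
        first = tf.next()
        if first is None or first.name != "manifest.json":
            raise ValueError("package has no leading manifest")
        return json.loads(tf.extractfile(first).read())

files = {"Assets/b.png": bytes(50_000), "Assets/a.cs": b"class A {}"}
listing = read_manifest(build_package(files))
print(listing)
```

This gets the file list cheaply, but pulling out one specific file afterwards is still a forward scan to wherever that file sits in the stream.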
Yes. But we’re not laboring under those limitations anymore, so we’re able to afford the luxury of random access.
Yeah, we can do that too. And the really good streams provide for random access. There are some streams that are linear-only, and a lot of code will break if you use them, so you have to read them out into a MemoryStream first, which adds a big linear scan of the entire stream’s length worth of overhead before you can do anything. (Hmm, sounds familiar…)
Theoretically, yes. But we’re not talking about such abstract use cases; we’re talking about Unity packages. What is the first thing Unity does when you open a Unity package? It gets a list of all of the files. On a modern archive format, this is instantaneous. On a TGZ it takes a full scan. Even if you only want one file, it still takes a full scan to produce the list of files, because 1) that’s the way Unity works and 2) you probably don’t know precisely what that one file you want is until you’re able to see the list anyway.
You really ought to look up the implementation details of the APK file format. Or the JAR format. Or ODF, or OOXML, or… well, you get the picture. Virtually every modern archival system is built on top of ZIP. There is no “metadata up front file”; there’s the index in a well-defined location in the ZIP archive, which makes fetching any file trivial. This is what I meant when I said that ZIP has taken over the world. When Sun developed Java on Solaris Unix systems, it used ZIP, not TGZ, for its files. When Google built Android on top of Linux, it used ZIP, not TGZ, for its packages. It’s been the same story again and again for decades now.
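For anyone curious what "the index in a well-defined location" means at the byte level, here is a small Python sketch that finds it by hand with `struct`. It assumes a small archive with no archive comment and no ZIP64 extensions, so the End of Central Directory record is exactly the last 22 bytes; real readers search backwards for the signature instead of assuming that.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"beta")
blob = buf.getvalue()

# End of Central Directory record: signature, four 16-bit counters,
# central directory size and offset, and the comment length.
EOCD = struct.Struct("<4sHHHHIIH")
(sig, _disk, _cd_disk, _entries_here, total_entries,
 cd_size, cd_offset, _comment_len) = EOCD.unpack(blob[-EOCD.size:])

# Fixed-size part of each central directory entry (46 bytes).
CDH = struct.Struct("<4s6H3I5H2I")
names = []
pos = cd_offset
for _ in range(total_entries):
    fields = CDH.unpack_from(blob, pos)
    name_len, extra_len, entry_comment_len = fields[10], fields[11], fields[12]
    start = pos + CDH.size
    names.append(blob[start:start + name_len].decode("utf-8"))
    pos = start + name_len + extra_len + entry_comment_len

print(sig, total_entries, cd_offset, cd_size)
print(names)
```

Nothing in the loop touches the compressed file data. The listing comes entirely from the tail of the archive, and each entry also records the offset of its file, which is what makes jumping straight to one member possible.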
The second technology advanced far enough to make random access feasible, TAR became obsolete. And that was over 30 years ago. Today, if you’re using it for anything at all other than literally archiving to tape, you are doing it wrong.
Well yeah, I agree with you, MasonWheeler, about going to the ZIP format. I don’t understand it as well as you do, but I understand what you mean, and it is very logical. I have checked that even Microsoft decided to use ZIP for NuGet and not tar, and in my opinion, the very best thing Unity can do in this area is to switch to NuGet. It would be just perfect.
I don’t understand why not. Are there any reasons not to use NuGet?
I believe Microsoft would maybe be willing to improve NuGet to suit Unity’s needs.
The big obvious one is authorization. NuGet is a system designed primarily to freely distribute open-source packages. I’m not 100% sure about this, but as far as I can tell, it doesn’t have an authorization system at all; if a package is on NuGet and you know the name of it, you can download it.
Unity, on the other hand, is running a store. Making people unable to download something they haven’t purchased is fundamentally important to the system. It might be possible for Unity to spin up their own NuGet repo and bolt on the same authorization that they use for the existing Asset Store, but I wouldn’t count on it being easy to get right.
Well, that is because it was originally designed for MS-DOS, which is why Microsoft supported it in the beginning and is now locked in to doing so… (Also, for the sake of using the same argument… DOS is a very gosh darn buggy and ancient piece of an OS.)
See .zip under this section: https://en.wikipedia.org/wiki/List_of_archive_formats#Archiving_and_compression
Counterpoint: as I mentioned earlier, even major *NIX projects are choosing ZIP for their packaging needs, and have been for a long time now (Java was developed over 30 years ago!) because it’s just an objectively better format.
@MasonWheeler, yeah, the packages on NuGet are easy to get, but don’t assume that everything on NuGet is free and open source. NuGet packages can still include a license, so you can download them easily but have to pay to be able to use them. So yes, exactly, this is something Unity could do: they can create their own authorization on top of NuGet. Or, as I said, Microsoft may be willing to cooperate. As far as I know, they are already cooperating with Unity, and it is incredibly good for them to have such a big and popular game engine running .NET, and soon even running on top of modern .NET with the latest C#. They are surely very happy about it. They also own Xbox, they can run it through the cloud, they have invested lots and lots of money in gaming, and surely they really appreciate their close relationship with Unity. So I can’t believe that using NuGet in Unity is not possible.
And it would have many great advantages: just one simple, unified packaging system to get all the code you need into Unity. I actually came across a use case in which I needed to install a standard NuGet package, because the required functionality was not provided by any Unity package… It is possible to install NuGet packages into Unity, but you first need to install another package for it to work, and even then it is far from the user experience NuGet provides in Visual Studio…
Pleeeeeaaaase, think about it. Many people would surely appreciate such a clean, easy user experience with packages. You are already modernizing Unity so much, so why not make another nice modernization?