Zip vs Tar.gz Files Explained and Compared (Archiving and the DEFLATE algorithm)

Captions
Hi! If you're into computers, you've probably heard of a couple of file formats: maybe zip if you're on Windows, and tar.gz if you're on Linux. And if you've done a bit of digging, you might have noticed that both of these happen to use the same compression algorithm. That algorithm is called DEFLATE, and it's probably the most common lossless compression algorithm used in the world. So I guess let's start with the obvious question: what is a zip file and a tar file?

If you're on Windows, you've probably used a zip file for one of two things. First of all, you can take large folders and compress them into a single item. This is obviously easier to send over the internet, and at the other end it can be extracted and you get the full folder structure out of it. There's also a second reason you would zip something, and that is for compression: if you have a lot of small text files, they take up more space than they need to, and you can losslessly compress all of them. That's what a zip file does.

A tar file on Linux is something similar. When you tar a directory, it takes all the items in the directory and essentially appends them to the end of each other, so you end up with one continuous archive. This has the benefit of making it very easy to back large directories up. Tar stands for "tape archive", and that's presumably why it was done this way: if you have a large linear magnetic tape, you want to take all your files, append them to each other, and write them out in a linear fashion. This does what the first part of zip does: it takes large amounts of files, or directories, or entire systems, and basically turns them into a linear format, which again makes them easy to back up or send.

Now, tar files on their own are actually not compressed. This actually lends itself to one of the benefits of the Unix philosophy, where we can take the same tar file and compress it in a bunch of different ways. The most common, and one of the oldest, compression methods in use is gzip. gzip uses the DEFLATE algorithm just like zip files do. It's one of the most widely used lossless compression algorithms: it's fast and it has a decent compression ratio, especially when you're giving it text files. Obviously it has been superseded by algorithms that are faster or compress better these days, and you would probably pick one of those instead. Nonetheless, gzip remains installed on almost every Unix-based or Linux computer, which makes it very useful because it's so widely compatible: you can be assured that if you send your fellow Linux friend a gzip file, they'll probably be able to open it. In fact, modern browsers also support gzipping entire web pages: if you have a large web page, you can just losslessly compress it and send it. This makes sense when your CPU is faster than your internet link, because you use less bandwidth to transfer a large amount of data; it just takes a CPU trade-off to compress and decompress it on each end.

So at this point it's probably sounding like both of these are similar.
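To make the two formats concrete, here is a minimal sketch of building both kinds of archive from the same directory with Python's standard library; the folder name "my_folder" and the output names are just hypothetical examples.

```python
# Minimal sketch: the same directory packed as a .tar.gz and as a .zip.
# "my_folder", "backup.tar.gz" and "backup.zip" are hypothetical names.
import tarfile
import zipfile
from pathlib import Path

src = Path("my_folder")

# tar.gz: everything is appended into one linear tar stream first,
# then the whole stream is compressed with gzip (DEFLATE).
with tarfile.open("backup.tar.gz", "w:gz") as tar:
    tar.add(src)

# zip: each file is DEFLATE-compressed on its own and then appended,
# with a central directory recording where every entry lives.
with zipfile.ZipFile("backup.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for path in src.rglob("*"):
        if path.is_file():
            zf.write(path)
```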
Oh, and there are a couple of other things to tar. The reason you would use a tar file on Linux instead of a zip file mainly comes down to the fact that tar files preserve all the file attributes properly, where zip files don't. This is mainly because Unix-based systems have more file attributes than Windows filesystems do. On a Unix system you want to save the creation date, last access date, and last modified date; you also have user permissions, group permissions, and permissions for everyone else. All of these need to be saved as part of the archive so that when you expand it, you get all of it back. This might not matter so much if you're just sending a file to a friend, but it matters a lot when you're backing up entire systems. If I back up a server, I want to be able to restore from that backup while maintaining all the file permissions; it wouldn't make much sense if I restored from a backup and everything in my system was accessible to everyone. That would kind of ruin the whole point of having permissions in the first place. So tar files, being developed for that purpose, store more of this metadata than zip does.

So both of these are similar formats, but how do they differ beyond one storing more filesystem attributes? Well, one difference is that when you make a tar.gz, you first archive everything into a tar file and then compress the result. Step one is to append all the files and directories to each other so that it becomes a linear format, and then you take this linear format and often just stream it, or pipe it, into a compression algorithm such as gzip. By the way, the fact that this is done in two steps, first turning it into an archive and then compressing it, is also what makes it very easy to substitute compression algorithms: take out gzip and you can put in bzip2, or on a modern computer you can use xz, or Zstandard, or lz4. Point being, there are a lot of different compression algorithms you can use, and they're essentially interchangeable; the first step is separate from the second.

But where zip and tar.gz files truly differ is the order in which those two steps are done. A zip file is much more like a gz.tar file than it is a tar.gz. What does this mean? Well, in a zip file you first compress each file individually and then append the compressed versions, whereas in a tar.gz file you first append everything and then compress the result. These are two different approaches, and they both have their benefits and drawbacks.

Let's talk about why you might want to compress everything individually first and then append it. First of all, it means that if I want to extract a particular file, I don't need to decompress the entire archive. If I have files numbered 1 through 100 and I want to access the 50th file, you just look up where the 50th file is, decompress it, and view it. This makes it a lot faster when you want to decompress individual files so you can look at them. This is useful because if I sent you 100 pictures and you wanted to view the 50th picture, you just take the 50th picture and decompress it; you don't need to decompress everything leading up to that point. Part of this is because zip files store an index of the archive, the central directory, with metadata for every entry, which means you know precisely where each file starts and how long it is. So to read the 50th file, I can just look it up in that table: it starts here and it's this long, I read for that length, decompress it, and view it.
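Here is a minimal sketch of that random-access difference, again with Python's standard library; the archive names come from the earlier sketch and the member path is hypothetical.

```python
# Reading one member out of each archive type; names are hypothetical.
import tarfile
import zipfile

# zip: the central directory tells us exactly where the entry starts,
# so only that one entry is inflated.
with zipfile.ZipFile("backup.zip") as zf:
    data = zf.read("my_folder/file_050.txt")

# tar.gz: there is no such index, so the gzip stream is decompressed
# from the beginning until the requested member is reached.
with tarfile.open("backup.tar.gz", "r:gz") as tar:
    member = tar.extractfile("my_folder/file_050.txt")
    data = member.read() if member else None
```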
Now, another advantage, at least on modern computers, is that compressing each file individually is very highly parallelizable. If I have a thousand files and eight CPU cores, we can compress eight files at a time without really impacting the compression ratio. Whereas if I have one large contiguous thing to compress, you can still parallelize it, but due to the way you have to divide it into chunks, you're going to lose some of the compression ratio and get slightly worse compression. Is it a big deal on modern systems? No, but it is still something to consider.

Now let's talk about some of the drawbacks of compressing each file individually. The most obvious one is that you lose out on compression ratio. If I have two text files that are identical, back to back, and I compress them individually, I'm going to compress both files and just end up appending them to each other, when that redundancy could have been removed. This is an advantage of appending first and compressing the linear data later: in that case, since all the text files were appended first, when gzip runs through the data and sees those redundancies, they get removed. I'm not implying that compression is a deduplicator; it in no way takes the place of deduplication, and you should still deduplicate your files to save on space. But if two files are very similar and within the same window size, that redundancy will be removed.

As I just said, this only works if the two files are within the same window. If I'm only looking back over the past megabyte and an identical file happened four megabytes ago, obviously the compressor will have no idea about it; that data has already gone out of the compression window, and you don't get the benefit. This is where modern systems with a lot of RAM come in, because you can have much larger windows these days, since we can store and index that much data properly. So in short, the main advantage is that you save on these redundancies.

The second advantage is that it's a uniform system: you can just decompress it and get the full archive back in its original form. This makes it easy if, for example, you're backing up to tape: you compress the entire archive, back it up, and later you decompress it first and you have your original backup file, which might be useful.

Now, the obvious disadvantage of this approach is that if I want to get the 50th file again, I need to decompress everything leading up to that point. This makes tar files slower to work with, usually. Different compression algorithms obviously fare differently here: something like bzip2 is slower than gzip, xz is actually often faster, and something with a very high compression ratio is probably going to be a lot slower. So if I have this one contiguous compressed archive and I want to get to a particular point, I first have to decompress everything leading up to it, and depending on the algorithm that can be a painfully slow process. That's one thing this approach is lacking.

Part of the reason this is also bad is that a tar file stores each file's metadata inline, next to the file itself, and because compression happens after everything is appended, a tar file cannot keep an index at the start saying where each file is going to begin; depending on your compression, that isn't really known. Whereas something like 7-Zip uses continuous compression just like tar files do, but it still saves an index, which is why you can view the directory listing of a 7z file.
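As a small illustration of that redundancy point, here's a sketch with Python's zlib (raw DEFLATE, the same algorithm zip and gzip use); the data is made up and the printed sizes are only approximate.

```python
# Two identical "files": compressed separately vs. appended first ("solid").
import os
import zlib

file_a = os.urandom(16 * 1024)   # 16 KiB of incompressible data
file_b = file_a                  # an identical second file

individually = len(zlib.compress(file_a)) + len(zlib.compress(file_b))
solid = len(zlib.compress(file_a + file_b))

print(individually)  # roughly 32 KiB: each copy knows nothing about the other
print(solid)         # roughly 16 KiB: the second copy becomes back-references
                     # into the first, which is still inside DEFLATE's 32 KiB window
```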
There are actually ways of getting that zip-like effect in tar files. An unusual convention, I guess, that some people use is gz.tar files: you gzip each file individually and then make a tar file out of the result. This behaves like zip, with the advantages and disadvantages you would expect. Unfortunately, most archivers cannot deal with these files well: if you extract one and want to get the 50th item, you'll probably have to run the command to decompress it manually. I don't know of many archive managers that can decompress such a file transparently; they'll open the tar file no problem, but then you'll just have a bunch of files with .gz appended to the end, and to view any of them you have to decompress that file individually anyway. It is something to consider, it is fast and it can be parallelized, but it has hardly any of the benefits a zip file has while keeping the drawbacks of compressing each file individually, so I don't really recommend it.

Instead, another approach is to use indexed tar files. This comes in the form of several tools, for example pigz, a parallel implementation of gzip, and pixz, parallel indexed xz. These have a couple of advantages. First, they're multi-threaded implementations: normally your regular tar.gz could only be compressed or decompressed on a single CPU core, which is the main drawback of compressing as a contiguous archive, but these days there are multi-threaded versions. For example, pigz will split your data up and compress it on multiple CPU cores; you lose a bit of ratio, but it's still way better than compressing each file individually. The second benefit is indexing where each of the files is, so you don't need to read the entire tar file to get at one file. Similarly, pixz is a parallel implementation of xz; xz on its own already happens to be pretty good at being multi-threaded, and I believe those changes have since been merged, but pixz also indexes the tar file. These files are usually completely compatible with regular tar: if you just have regular gzip or xz installed, you will still be able to decompress them, but you'll have to decompress the full archive, whereas if you have the indexed version installed, it can look at the index, basically fast-forward to where it needs to go, and read from there so you get the file you want.

This is actually a similar approach to what 7-Zip does. The way 7-Zip works is that it uses solid blocks, just like tar does: it stores files in large chunks and then compresses the whole chunk. Once again, this has the benefit that it essentially gets rid of redundancies between multiple files; if I have source code files that are all similar but slightly different, it can remove that redundancy, or at least a good compression algorithm would. But while it's doing that, 7-Zip also stores the index of all the files up front. This is why you can open up a 7z file and it shows the directory and file listing instantly; you don't have to wait for it to figure out where each file is, because that's already stored. Whereas with a tar file, if I want the listing of all the files and directories in it, you need to wait for it to read the whole archive; as I just said, the exception is if you're using an indexed tar such as pixz.
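To show the idea behind those parallel tools, here is a rough conceptual sketch in Python: split the data into chunks, compress the chunks on separate cores, and concatenate the gzip members. This is only the concept, not how pigz actually works internally; real pigz is considerably smarter about dictionaries and ratio, and the chunk size here is an arbitrary choice.

```python
# Conceptual pigz-style parallel gzip: chunk, compress in a process pool, join.
# Concatenated gzip members still form a valid gzip stream, so plain gzip
# can decompress the result, just without any index.
import gzip
from multiprocessing import Pool

CHUNK = 1 << 20  # 1 MiB per chunk; arbitrary for this sketch


def compress_chunk(chunk: bytes) -> bytes:
    return gzip.compress(chunk)


def parallel_gzip(data: bytes) -> bytes:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with Pool() as pool:
        # Each chunk starts with an empty dictionary, which is where the
        # small loss in compression ratio comes from.
        return b"".join(pool.map(compress_chunk, chunks))


if __name__ == "__main__":
    blob = b"some example data to compress\n" * 500_000
    packed = parallel_gzip(blob)
    assert gzip.decompress(packed) == blob  # still one valid gzip stream
```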
With 7-Zip you could disable that solid-block feature, but then your compression ratio would be just like zip's. 7-Zip, for those of you curious, just uses LZMA2 by default, and surprise, LZMA2 is the same compression algorithm that xz happens to use. So if you have a tar.xz file, it's essentially the same algorithm that 7z files are using, and this is why you get very similar compression ratios out of the two: they're both turning files into solid blocks and compressing that. I guess the difference is that 7-Zip will open slightly faster, and usually it can get a slightly better ratio because it doesn't have to support all the other attributes that Linux and other Unix-based systems do.

So what should you use? If you're on Linux, you can probably keep using tar.xz, though I would recommend looking at tar.zst first, because Zstandard happens to be a lot faster than xz while maintaining pretty much the same compression ratio. Compression is a bit faster, but the main advantage is that decompression is almost five to six times faster than regular xz, so I strongly recommend you take a look at that. If you're going to make backups on a Linux system, you don't want to use 7-Zip or anything else that will get rid of all your file attributes, which would be a separate headache on its own, so you should stick to tar. If you're intent on using xz, though, do take a look at pixz, which is parallel indexed xz; it might make things easier when you need to decompress just one file, and you'll be able to get those listings pretty much instantaneously if you have it installed.

For those of you who want an easier approach, I guess you can use 7-Zip, because pigz and pixz don't have very good GUIs, if they have them at all, on *nix systems. 7z is cross-compatible with Windows, Mac, and Linux, which happens to be a great point in its favor, and it also gets a lot better compression ratio than regular zip files do, so that's another great benefit. I think the case for using zip, or gz.tar-like files, is gone, because modern compression algorithms like Zstandard and xz tend to scale well to multiple processors on their own anyway, so you don't need to compress each file individually and then append the results. The parallelization advantage is kind of gone, and the advantage of decompressing files individually is shrinking, especially now that we have indexed tar files and 7-Zip, which is also indexed.
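As a rough illustration of the "same algorithm" point, here's a small size comparison with Python's standard library; the sample data is invented, and real-world ratios depend entirely on the input.

```python
# DEFLATE (what zip and gzip produce) vs. the xz/LZMA2 format (what tar.xz
# uses, and the same family 7-Zip defaults to). Sample data is made up.
import lzma
import zlib

data = b"an example log line that repeats with minor variation 0123456789\n" * 50_000

print("DEFLATE bytes:", len(zlib.compress(data, 9)))
print("xz/LZMA2 bytes:", len(lzma.compress(data)))
```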
Thank you so much for watching this rather long video on compression systems and the advantages and drawbacks of compressing each file individually and appending it, versus making a linear format first and then compressing the result as a solid archive. Hopefully you learned something and found this interesting; hopefully it sheds some light on why things are the way they are, why you might have experienced slowdowns on tar.gz, for example, versus why zip might open instantaneously, and maybe you also picked up some trivia about what DEFLATE or LZMA2 is. For more reading on those, check out the Wikipedia articles; I will have them linked down below so you can go learn more. If you enjoyed this video and learned something, consider leaving a like, and I guess you can subscribe to the channel as well if you want to see more videos like this. Thanks for watching, hope you enjoyed it, see you next time!
Info
Channel: Tony Tascioglu
Views: 9,355
Keywords: Technology, Linux, Compression, Archive, Data, Tar, Gzip, Deflate, gz, Zip, 7zip, dar, tape, file, file compression, linux files, linux archive, linux, tar, xz, bz, lz77, zip vs tar, tar vs zip, what is tar, what is zip, 7z, files, filesystem, send, zip, unix, posix, gnu
Id: nO27DqT9RCQ
Length: 16min 32sec (992 seconds)
Published: Mon May 16 2022