What does gzip format mean




















Of course, the decoder has the easy part. The encoder is responsible for assigning the high-frequency symbols to short codes and the low-frequency symbols to longer codes to achieve optimal compression. For a decoder to build a Huffman tree, then, all it needs as input is a list of ranges and bit lengths.
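As a rough illustration of how a decoder might turn ranges and bit lengths into codes, here is a minimal Python sketch. It assumes each range is a hypothetical (last symbol, bit length) pair and follows the code-assignment procedure from RFC 1951; the function names are invented for this example and are not the article's listing.

```python
def lengths_from_ranges(ranges):
    """Expand (last_symbol, bit_length) pairs into one bit length per symbol."""
    lengths = []
    for last_symbol, bit_length in ranges:
        lengths.extend([bit_length] * (last_symbol + 1 - len(lengths)))
    return lengths


def build_codes(lengths):
    """Map each symbol with a nonzero bit length to its code as a bit string."""
    max_len = max(lengths)
    # Count how many symbols use each bit length (length 0 means "unused").
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Compute the smallest code value for each bit length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes to symbols of the same length, in symbol order.
    codes = {}
    for symbol, n in enumerate(lengths):
        if n:
            codes[symbol] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes
```

For example, `build_codes(lengths_from_ranges([(4, 3), (5, 2), (7, 4)]))` gives the two-bit code `00` to symbol 5, the most frequent symbol in this made-up alphabet.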

Define a new range structure as shown in the listing. Once these ranges are defined, building the Huffman tree, following the rules described above, is done as described in RFC 1951. There's actually one more bit of "header" data that precedes even the lengths of the Huffman codes - a set of fixed-length declarations that describe how many Huffman codes follow, to allow interpretation of the two tables of Huffman code lengths that follow.

This isn't strange by itself - this is just little-endian ordering, which Intel-based processors use by default. However, the contents themselves are interpreted in big-endian format! So, a five-bit code followed by a three-bit code would be packed into a single 8-bit byte as shown in the figure. If codes cross byte boundaries (which is the rule rather than the exception for variable-length codes), the bytes themselves should be read sequentially (of course), but the bits within them are still read right-to-left and then reversed for interpretation.

For instance, a 6-bit code followed by a 5-bit code followed by a 3-bit code would be streamed as the two bytes shown in the figure. All of this is fairly confusing, but fortunately you can hide the complexity in a few convenience routines. Finally, since the bits should be read in little-endian form but interpreted in big-endian form, define a convenience method that reads the bits and then inverts them for interpretation, as shown in the listing. Now you can start actually reading in compressed data.
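The convenience routines might look something like the following Python sketch. The class and method names are hypothetical and this is not the article's listing. It assumes, per RFC 1951, that bits are consumed from the least-significant end of each byte, that fixed-width fields put the first bit read in the lowest position, and that Huffman codes put the first bit read in the highest position.

```python
class BitReader:
    """Reads single bits from a byte string, least-significant bit first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # index of the next unread bit in the whole stream

    def read_bit(self):
        byte = self.data[self.pos >> 3]        # which byte holds the bit
        bit = (byte >> (self.pos & 7)) & 1     # bit 0 of each byte comes first
        self.pos += 1
        return bit

    def read_bits(self, count):
        """Fixed-width field: the first bit read is the least significant."""
        value = 0
        for i in range(count):
            value |= self.read_bit() << i
        return value

    def read_code_bits(self, count):
        """Huffman code bits: the first bit read is the most significant."""
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value
```

Keeping the two interpretations in separate methods means the rest of a decoder never has to reason about bit order directly.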

The first section of a GZIP-compressed block is three integers indicating the number of literal and length codes, the number of distance codes, and the number of code length codes. Attachment 1 is a gzipped input file (it's the gzipped representation of the source code for this article). These codes describe the literal, length and distance codes that actually make up the gzipped payload.
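A sketch of reading those three counts is shown below. The field widths (5, 5 and 4 bits), the offsets added to them (257, 1 and 4), and the 3-bit block header that precedes them are taken from RFC 1951 rather than from the text above; the function names and the returned dictionary keys are made up for this example.

```python
def read_bits(data, pos, count):
    """Read `count` bits starting at bit offset `pos`, least-significant first."""
    value = 0
    for i in range(count):
        bit = (data[(pos + i) >> 3] >> ((pos + i) & 7)) & 1
        value |= bit << i
    return value, pos + count


def read_dynamic_block_header(data, pos=0):
    """Parse the block header and the 14-bit pre-header of a dynamic block."""
    final, pos = read_bits(data, pos, 1)   # 1 if this is the last block
    btype, pos = read_bits(data, pos, 2)   # 2 means dynamic Huffman codes
    hlit, pos = read_bits(data, pos, 5)
    hdist, pos = read_bits(data, pos, 5)
    hclen, pos = read_bits(data, pos, 4)
    return {
        "final": final,
        "btype": btype,
        "literal_length_codes": hlit + 257,
        "distance_codes": hdist + 1,
        "code_length_codes": hclen + 4,
        "next_bit": pos,
    }
```

The offsets exist because the stored fields are too narrow to hold the counts directly: a 5-bit field plus 257 covers 257 through 288.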

Where does the magic number 257 come from? Well, that's 256 literal codes (the 8-bit input range) plus one "stop" code. There will normally be at least one length code on top of that (if there isn't at least one length code, then there's no point in applying LZ77 in the first place). However, even Huffman-encoded, 256 literal codes is quite a few, especially considering that many input documents will not include every possible 8-bit combination of input bytes.

For space efficiency, then, it's necessary to make it easy to exclude long ranges of the input space and not use up a Huffman code on literals that will never appear. So, the Huffman code lengths that follow the code length bits can fall into two separate categories. The first is a literal length - declaring, for instance, that the Huffman code for the literal byte '32' is 9 bits long.

The second category is a run-length declaration; this tells the decoder that either the previous code, or the value "0" (indicating that the given literal never appears in the gzipped document), occurs multiple times. This method of using a single value to indicate multiple consecutive values in the data is a third compression technique, referred to as "run-length encoding" [4]. Therefore, the hclen code length bits that follow the pre-header are either length codes or repetition codes.

The number n follows the repeat codes 16, 17 and 18 and is encoded without compression in 2, 3 or 7 bits, respectively. To make this all just a little bit more confusing, these length codes are given out of order. The first code given isn't the code for the length '0'; it's actually the code for "repeat the previous character n times" (16). This is followed by 17 and 18, then 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1 and finally 15. Why the strange order?

Because codes of lengths 15, 1, 14 and 2 are likely to be very rare in real-world data - the codes themselves are given in order of expected frequency. This way, if the literal and distance code tables that follow don't have any codes of length 1, there's no need to even declare them.

In fact, if you notice, attachment 1 only declares 11 code length codes - this means that codes of length 12, 3, 13, 2, 14, 1 and 15 just don't occur in the following table, so there's no need to waste output space declaring them.
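The out-of-order declaration can be sketched in a few lines of Python. The order table is the one given above; the function name is hypothetical, and the input is assumed to be the declared lengths exactly as they appear in the stream (RFC 1951 stores each one in 3 bits).

```python
# Order in which the code length codes are declared in the stream.
CODE_LENGTH_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]


def place_code_lengths(declared):
    """Return a 19-entry list indexed by code length symbol (0 to 18).

    `declared` holds the bit lengths in stream order. Symbols that were
    not declared keep a length of 0, meaning they never occur.
    """
    if len(declared) > len(CODE_LENGTH_ORDER):
        raise ValueError("at most 19 code length codes can be declared")
    lengths = [0] * len(CODE_LENGTH_ORDER)
    for position, bit_length in enumerate(declared):
        lengths[CODE_LENGTH_ORDER[position]] = bit_length
    return lengths
```

Because rarely used symbols sit at the end of the order, an encoder can simply stop declaring early and the decoder fills in zeros.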

After all, a 15-bit code would have to almost never occur in order to be space efficient; a 1-bit code would have to occur in almost every position for the encoding to be worth using up a whole prefix for a single value. This whole process probably seems a little abstract at this point - hopefully an example will help clear things up.

Attachment 1 starts with the fixed 14-bit pre-header. Remember that the bits are read left-to-right, but then swapped before interpretation, so they appear in little-endian form. The code length declarations come next, and to interpret them, a range must be built.

In this range, a bit-length of 0 indicates that the length doesn't occur in the table that follows. Following the rules of Huffman codes described above, the range works out to a Huffman tree. What this means is that, in the data that follows, two consecutive zero bits indicate a value of 7. You might want to work through this to make sure you understand how the range table converts to the Huffman table before continuing. Listing 15 illustrates how to read this first part of the deflated input and build a Huffman tree from it.

I won't reproduce the whole thing here, but I'll illustrate the first few values, so you can see how it's interpreted. Interpreting the beginning of the table via the Huffman table decoded in table 1, you see this works out to: 17 (repeat 10), 7, 18 (repeat 21), 5, 9, 8. Aside: why 11?

The repeat count for 18 starts at 11 because 10 is the longest run that can be encoded using the repeat count declarator 17, which is followed by a 3-bit count to which 3 is added. Next are 5, 9, 8, and so on. These bit lengths will be used to generate another Huffman table. The first 256 codes in this array are the literal bytes, which appeared in the input that was supplied to the LZ77 compression algorithm in the first place.
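The repeat declarators can be expanded with a short loop. The sketch below assumes the Huffman decoding has already happened and takes hypothetical (symbol, extra) pairs, where `extra` is the raw value of the bits that follow a repeat code. The base counts (3 for 16 and 17, 11 for 18) are from RFC 1951.

```python
def expand_code_lengths(symbols):
    """Turn decoded code length symbols into a flat list of bit lengths."""
    lengths = []
    for symbol, extra in symbols:
        if 0 <= symbol <= 15:
            lengths.append(symbol)                       # a literal bit length
        elif symbol == 16:
            if not lengths:
                raise ValueError("16 needs a previous length to repeat")
            lengths.extend([lengths[-1]] * (3 + extra))  # 2 extra bits: 3 to 6
        elif symbol == 17:
            lengths.extend([0] * (3 + extra))            # 3 extra bits: 3 to 10
        elif symbol == 18:
            lengths.extend([0] * (11 + extra))           # 7 extra bits: 11 to 138
        else:
            raise ValueError("unknown code length symbol")
    return lengths
```

Feeding it the values from the example, `[(17, 7), (7, 0), (18, 10), (5, 0), (9, 0), (8, 0)]`, yields ten zeros, a 7, twenty-one zeros, and then 5, 9 and 8, so the entry at index 32 is 5.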

This is followed by the special "stop" code and up to 30 length declarators, indicating backpointers. Finally, there will be up to 32 distance code lengths. This is a bit dense, but implements the logic for using the code lengths Huffman tree to build two new Huffman trees, which can then be used to decode the final LZ77-compressed data. Each decoded code is then handled in turn. Check to see if it's a literal (less than 256); if so, output it.

If it's the stop code (256), stop. If it's a length code (greater than 256), interpret it as a backpointer. Interpreting backpointers is the most complex part of inflating gzipped input.

Similar to the "sliding scales" that were used by the code lengths Huffman tree, where a 17 was followed by three bits indicating the actual length and 18 was followed by 7 bits, different length codes are followed by variable numbers of extra bits. If the length code is between 257 and 264, subtract 254 from it - this is the length of the backpointed range, without any extra length bits. Larger length codes are followed by extra bits. These bits are then added to a base value that depends on the length code itself to get the actual length of the range.
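The mapping from length code to actual length is easiest to see as a pair of lookup tables. The values below are the ones defined in RFC 1951 for codes 257 through 285; the function names are hypothetical, and the extra bits are assumed to have been read from the stream already.

```python
# Base length and number of extra bits for length codes 257..285.
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA_BITS = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                     3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]


def extra_bits_for(code):
    """How many extra bits follow the given length code."""
    return LENGTH_EXTRA_BITS[_index(code)]


def decode_length(code, extra_value=0):
    """Combine a length code with the value of its extra bits."""
    return LENGTH_BASE[_index(code)] + extra_value


def _index(code):
    if not 257 <= code <= 285:
        raise ValueError("length codes run from 257 to 285")
    return code - 257
```

Codes 257 through 264 need no extra bits, so `decode_length(257)` is simply 3, which matches the "subtract 254" shortcut.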

In this way, very large backpointers can be represented, but the common cases of short lengths can be coded efficiently. A length code is always followed by a distance code, indicating how far back in the input buffer the matched range was found.
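Once a length and a distance are known, applying the backpointer is a plain copy from earlier output. The following sketch assumes the stream has already been decoded into a hypothetical list of tokens, where an integer is a literal byte and a tuple is a (length, distance) pair; it is meant to show the copy step only, not a full inflate routine.

```python
def apply_tokens(tokens):
    """Rebuild the original bytes from literals and (length, distance) pairs."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)                 # literal byte, copied as is
            continue
        length, distance = token
        if distance < 1 or distance > len(out):
            raise ValueError("backpointer reaches before the start of output")
        start = len(out) - distance
        # Copy byte by byte: the source range may overlap the bytes being
        # written, which is how a short pattern gets repeated many times.
        for i in range(length):
            out.append(out[start + i])
    return bytes(out)
```

For instance, `apply_tokens([ord("a"), ord("b"), (4, 2)])` returns `b"ababab"`: the length is greater than the distance, so the copy reads bytes it has only just written.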

The length codes can describe lengths from 3 to 258, but the distance codes can describe distances from 1 to 32,768 - which means that, while decompressing, it's necessary to keep track of at least the previous 32,768 input characters. Listing 17 should be fairly straightforward to understand at this point - read Huffman codes, one after another, and interpret them as literals or backpointers as described previously. Note that this routine, as implemented, can only decode a limited number of bytes of output, and that it assumes that the output is ASCII-formatted text.

A more general-purpose gunzip routine would handle larger volumes of data and would return the result back to the caller for interpretation. If you've made it this far, you're through the hard parts. Everything else involved in unzipping a gzipped file is boilerplate.

Corel WinZip 16 Pro. This software allows users to choose the level of compression and the compression method that they want to integrate into their files and folders.

All major compressed formats can be extracted by this application, and this compression and decompression software runs on Microsoft Windows XP, Vista and Windows 7. Internet connectivity is needed to activate this program. Corel WinZip 16 Pro can provide users with access to Zipsend, which is used to compress and send large files through email.

This software may also provide users with access to Zipshare, which is used to upload compressed files to various social websites. Backups are necessary for the user's important data, and Corel WinZip 16 Pro provides an automated process for backing up files. Other features include Windows shell integration, a file manager, a command-line version, a FAR Manager plug-in and localizations for multiple languages. This will not change the file type.

Only special conversion software can change a file from one file type to another. Windows often associates a default program with each file extension, so that when you double-click the file, the program launches automatically.

When that program is no longer on your PC, you can sometimes get an error when you try to open the associated file.

Compliance

Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any file that conforms to all the specifications presented here; a compliant compressor must produce files that conform to all the specifications presented here.

The material in the appendices is not part of the specification per se and is not relevant to compliance.

Definitions of terms and conventions used

byte: 8 bits stored or transmitted as a unit (same as an octet).

For this specification, a byte is exactly 8 bits, even on machines which store a character on a number of bits different from 8. See below for the numbering of bits within a byte.

Changes from previous versions

There have been no technical changes to the gzip format since version 4.1 of this specification.

Detailed specification

Bytes stored within a computer do not have a "bit order", since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most- and least-significant bit, and since we write numbers with the most-significant digit on the left, we also write bytes with the most-significant bit on the left.

In the diagrams below, we number the bits of a byte so that bit 0 is the least-significant bit, i.e., the bits are numbered from 7 on the left down to 0 on the right. Within a computer, a number may occupy multiple bytes.

All multi-byte numbers in the format described here are stored with the least-significant byte first (at the lower memory address).

File format

A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section.

The members simply appear one after another in the file, with no additional information before, between, or after them. A member's header can carry XLEN bytes of "extra field".

CM (Compression Method): This identifies the compression method used in the file.
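To make the byte-order convention concrete, here is a minimal Python sketch that reads the fixed part of a member header. Only CM and XLEN are named in the text above; the other field names (ID1, ID2, FLG, MTIME, XFL, OS), the magic bytes 0x1f 0x8b, and the FEXTRA flag bit are taken from RFC 1952, and the function name is hypothetical.

```python
import struct


def parse_member_header(data):
    """Parse the 10 fixed header bytes at the start of a gzip member."""
    if len(data) < 10:
        raise ValueError("a gzip member starts with at least 10 header bytes")
    # "<" selects little-endian: least-significant byte first, no padding.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip member")
    header = {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}
    if flg & 0x04:
        # FEXTRA is set: a 2-byte XLEN follows, then XLEN bytes of extra field.
        (xlen,) = struct.unpack("<H", data[10:12])
        header["extra"] = data[12:12 + xlen]
    return header
```

Running it on the output of Python's own `gzip.compress(b"hello")` should report a `cm` of 8, the value RFC 1952 assigns to deflate.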


