aboutsummaryrefslogtreecommitdiff
path: root/doc/xz-file-format.txt
diff options
context:
space:
mode:
authorLasse Collin <lasse.collin@tukaani.org>2009-05-01 11:20:23 +0300
committerLasse Collin <lasse.collin@tukaani.org>2009-05-01 11:20:23 +0300
commit1496ff437c46f38303e0e94c511ca604b3a11f85 (patch)
tree9bff5cc1342c15f1313d871bdd5a8fe52b98cd43 /doc/xz-file-format.txt
parentFixed a crash in liblzma. (diff)
downloadxz-1496ff437c46f38303e0e94c511ca604b3a11f85.tar.xz
Renamed the file format specification to xz-file-format.txt
which is the filename used on the WWW.
Diffstat (limited to 'doc/xz-file-format.txt')
-rw-r--r--doc/xz-file-format.txt1127
1 files changed, 1127 insertions, 0 deletions
diff --git a/doc/xz-file-format.txt b/doc/xz-file-format.txt
new file mode 100644
index 00000000..fa2b3340
--- /dev/null
+++ b/doc/xz-file-format.txt
@@ -0,0 +1,1127 @@
+
+The .xz File Format
+===================
+
+Version 1.0.0 (2009-01-14)
+
+
+ 0. Preface
+ 0.1. Notices and Acknowledgements
+ 0.2. Getting the Latest Version
+ 0.3. Version History
+ 1. Conventions
+ 1.1. Byte and Its Representation
+ 1.2. Multibyte Integers
+ 2. Overall Structure of .xz File
+ 2.1. Stream
+ 2.1.1. Stream Header
+ 2.1.1.1. Header Magic Bytes
+ 2.1.1.2. Stream Flags
+ 2.1.1.3. CRC32
+ 2.1.2. Stream Footer
+ 2.1.2.1. CRC32
+ 2.1.2.2. Backward Size
+ 2.1.2.3. Stream Flags
+ 2.1.2.4. Footer Magic Bytes
+ 2.2. Stream Padding
+ 3. Block
+ 3.1. Block Header
+ 3.1.1. Block Header Size
+ 3.1.2. Block Flags
+ 3.1.3. Compressed Size
+ 3.1.4. Uncompressed Size
+ 3.1.5. List of Filter Flags
+ 3.1.6. Header Padding
+ 3.1.7. CRC32
+ 3.2. Compressed Data
+ 3.3. Block Padding
+ 3.4. Check
+ 4. Index
+ 4.1. Index Indicator
+ 4.2. Number of Records
+ 4.3. List of Records
+ 4.3.1. Unpadded Size
+ 4.3.2. Uncompressed Size
+ 4.4. Index Padding
+ 4.5. CRC32
+ 5. Filter Chains
+ 5.1. Alignment
+ 5.2. Security
+ 5.3. Filters
+ 5.3.1. LZMA2
+ 5.3.2. Branch/Call/Jump Filters for Executables
+ 5.3.3. Delta
+ 5.3.3.1. Format of the Encoded Output
+ 5.4. Custom Filter IDs
+ 5.4.1. Reserved Custom Filter ID Ranges
+ 6. Cyclic Redundancy Checks
+ 7. References
+
+
+0. Preface
+
+ This document describes the .xz file format (filename suffix
+ ".xz", MIME type "application/x-xz"). It is intended that this
+ this format replace the old .lzma format used by LZMA SDK and
+ LZMA Utils.
+
+
+0.1. Notices and Acknowledgements
+
+ This file format was designed by Lasse Collin
+ <lasse.collin@tukaani.org> and Igor Pavlov.
+
+ Special thanks for helping with this document goes to
+ Ville Koskinen. Thanks for helping with this document goes to
+ Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
+
+ This document has been put into the public domain.
+
+
+0.2. Getting the Latest Version
+
+ The latest official version of this document can be downloaded
+ from <http://tukaani.org/xz/xz-file-format.txt>.
+
+ Specific versions of this document have a filename
+ xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
+ For example, the version 1.0.0 of this document is available
+ at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
+
+
+0.3. Version History
+
+ Version Date Description
+ 1.0.0 2008-01-14 The first official version
+
+
+1. Conventions
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
+ "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in [RFC-2119].
+
+ Indicating a warning means displaying a message, returning
+ appropriate exit status, or doing something else to let the
+ user know that something worth warning occurred. The operation
+ SHOULD still finish if a warning is indicated.
+
+ Indicating an error means displaying a message, returning
+ appropriate exit status, or doing something else to let the
+ user know that something prevented successfully finishing the
+ operation. The operation MUST be aborted once an error has
+ been indicated.
+
+
+1.1. Byte and Its Representation
+
+ In this document, byte is always 8 bits.
+
+ A "null byte" has all bits unset. That is, the value of a null
+ byte is 0x00.
+
+ To represent byte blocks, this document uses notation that
+ is similar to the notation used in [RFC-1952]:
+
+ +-------+
+ | Foo | One byte.
+ +-------+
+
+ +---+---+
+ | Foo | Two bytes; that is, some of the vertical bars
+ +---+---+ can be missing.
+
+ +=======+
+ | Foo | Zero or more bytes.
+ +=======+
+
+ In this document, a boxed byte or a byte sequence declared
+ using this notation is called "a field". The example field
+ above would be called "the Foo field" or plain "Foo".
+
+ If there are many fields, they may be split to multiple lines.
+ This is indicated with an arrow ("--->"):
+
+ +=====+
+ | Foo |
+ +=====+
+
+ +=====+
+ ---> | Bar |
+ +=====+
+
+ The above is equivalent to this:
+
+ +=====+=====+
+ | Foo | Bar |
+ +=====+=====+
+
+
+1.2. Multibyte Integers
+
+ Multibyte integers of static length, such as CRC values,
+ are stored in little endian byte order (least significant
+ byte first).
+
+ When smaller values are more likely than bigger values (for
+ example file sizes), multibyte integers are encoded in a
+ variable-length representation:
+ - Numbers in the range [0, 127] are copied as is, and take
+ one byte of space.
+ - Bigger numbers will occupy two or more bytes. All but the
+ last byte of the multibyte representation have the highest
+ (eighth) bit set.
+
+ For now, the value of the variable-length integers is limited
+ to 63 bits, which limits the encoded size of the integer to
+ nine bytes. These limits may be increased in future if needed.
+
+ The following C code illustrates encoding and decoding of
+ variable-length integers. The functions return the number of
+ bytes occupied by the integer (1-9), or zero on error.
+
+ #include <stddef.h>
+ #include <inttypes.h>
+
+ size_t
+ encode(uint8_t buf[static 9], uint64_t num)
+ {
+ if (num > UINT64_MAX / 2)
+ return 0;
+
+ size_t i = 0;
+
+ while (num >= 0x80) {
+ buf[i++] = (uint8_t)(num) | 0x80;
+ num >>= 7;
+ }
+
+ buf[i++] = (uint8_t)(num);
+
+ return i;
+ }
+
+ size_t
+ decode(const uint8_t buf[], size_t size_max, uint64_t *num)
+ {
+ if (size_max == 0)
+ return 0;
+
+ if (size_max > 9)
+ size_max = 9;
+
+ *num = buf[0] & 0x7F;
+ size_t i = 0;
+
+ while (buf[i++] & 0x80) {
+ if (i >= size_max || buf[i] == 0x00)
+ return 0;
+
+ *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
+ }
+
+ return i;
+ }
+
+
+2. Overall Structure of .xz File
+
+ A standalone .xz files consist of one or more Streams which may
+ have Stream Padding between or after them:
+
+ +========+================+========+================+
+ | Stream | Stream Padding | Stream | Stream Padding | ...
+ +========+================+========+================+
+
+ While a typical file contains only one Stream and no Stream
+ Padding, a decoder handling standalone .xz files SHOULD support
+ files that have more than one Stream or Stream Padding.
+
+ In contrast to standalone .xz files, when the .xz file format
+ is used as an internal part of some other file format or
+ communication protocol, it usually is expected that the decoder
+ stops after the first Stream, and doesn't look for Stream
+ Padding or possibly other Streams.
+
+
+2.1. Stream
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
+ | Stream Header | Block | Block | ... | Block |
+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
+
+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+
+ ---> | Index | Stream Footer |
+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ All the above fields have a size that is a multiple of four. If
+ Stream is used as an internal part of another file format, it
+ is RECOMMENDED to make the Stream start at an offset that is
+ a multiple of four bytes.
+
+ Stream Header, Index, and Stream Footer are always present in
+ a Stream. The maximum size of the Index field is 16 GiB (2^34).
+
+ There are zero or more Blocks. The maximum number of Blocks is
+ limited only by the maximum size of the Index field.
+
+ Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
+ The same limit applies to the total amount of uncompressed
+ data stored in a Stream.
+
+ If an implementation supports handling .xz files with multiple
+ concatenated Streams, it MAY apply the above limits to the file
+ as a whole instead of limiting per Stream basis.
+
+
+2.1.1. Stream Header
+
+ +---+---+---+---+---+---+-------+------+--+--+--+--+
+ | Header Magic Bytes | Stream Flags | CRC32 |
+ +---+---+---+---+---+---+-------+------+--+--+--+--+
+
+
+2.1.1.1. Header Magic Bytes
+
+ The first six (6) bytes of the Stream are so called Header
+ Magic Bytes. They can be used to identify the file type.
+
+ Using a C array and ASCII:
+ const uint8_t HEADER_MAGIC[6]
+ = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
+
+ In plain hexadecimal:
+ FD 37 7A 58 5A 00
+
+ Notes:
+ - The first byte (0xFD) was chosen so that the files cannot
+ be erroneously detected as being in .lzma format, in which
+ the first byte is in the range [0x00, 0xE0].
+ - The sixth byte (0x00) was chosen to prevent applications
+ from misdetecting the file as a text file.
+
+ If the Header Magic Bytes don't match, the decoder MUST
+ indicate an error.
+
+
+2.1.1.2. Stream Flags
+
+ The first byte of Stream Flags is always a null byte. In future
+ this byte may be used to indicate new Stream version or other
+ Stream properties.
+
+ The second byte of Stream Flags is a bit field:
+
+ Bit(s) Mask Description
+ 0-3 0x0F Type of Check (see Section 3.4):
+ ID Size Check name
+ 0x00 0 bytes None
+ 0x01 4 bytes CRC32
+ 0x02 4 bytes (Reserved)
+ 0x03 4 bytes (Reserved)
+ 0x04 8 bytes CRC64
+ 0x05 8 bytes (Reserved)
+ 0x06 8 bytes (Reserved)
+ 0x07 16 bytes (Reserved)
+ 0x08 16 bytes (Reserved)
+ 0x09 16 bytes (Reserved)
+ 0x0A 32 bytes SHA-256
+ 0x0B 32 bytes (Reserved)
+ 0x0C 32 bytes (Reserved)
+ 0x0D 64 bytes (Reserved)
+ 0x0E 64 bytes (Reserved)
+ 0x0F 64 bytes (Reserved)
+ 4-7 0xF0 Reserved for future use; MUST be zero for now.
+
+ Implementations SHOULD support at least the Check IDs 0x00
+ (None) and 0x01 (CRC32). Supporting other Check IDs is
+ OPTIONAL. If an unsupported Check is used, the decoder SHOULD
+ indicate a warning or error.
+
+ If any reserved bit is set, the decoder MUST indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Stream Header
+ incorrectly.
+
+
+2.1.1.3. CRC32
+
+ The CRC32 is calculated from the Stream Flags field. It is
+ stored as an unsigned 32-bit little endian integer. If the
+ calculated value does not match the stored one, the decoder
+ MUST indicate an error.
+
+ The idea is that Stream Flags would always be two bytes, even
+ if new features are needed. This way old decoders will be able
+ to verify the CRC32 calculated from Stream Flags, and thus
+ distinguish between corrupt files (CRC32 doesn't match) and
+ files that the decoder doesn't support (CRC32 matches but
+ Stream Flags has reserved bits set).
+
+
+2.1.2. Stream Footer
+
+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+
+ | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+
+
+
+2.1.2.1. CRC32
+
+ The CRC32 is calculated from the Backward Size and Stream Flags
+ fields. It is stored as an unsigned 32-bit little endian
+ integer. If the calculated value does not match the stored one,
+ the decoder MUST indicate an error.
+
+ The reason to have the CRC32 field before the Backward Size and
+ Stream Flags fields is to keep the four-byte fields aligned to
+ a multiple of four bytes.
+
+
+2.1.2.2. Backward Size
+
+ Backward Size is stored as a 32-bit little endian integer,
+ which indicates the size of the Index field as multiple of
+ four bytes, minimum value being four bytes:
+
+ real_backward_size = (stored_backward_size + 1) * 4;
+
+ If the stored value does not match the real size of the Index
+ field, the decoder MUST indicate an error.
+
+ Using a fixed-size integer to store Backward Size makes
+ it slightly simpler to parse the Stream Footer when the
+ application needs to parse the Stream backwards.
+
+
+2.1.2.3. Stream Flags
+
+ This is a copy of the Stream Flags field from the Stream
+ Header. The information stored to Stream Flags is needed
+ when parsing the Stream backwards. The decoder MUST compare
+ the Stream Flags fields in both Stream Header and Stream
+ Footer, and indicate an error if they are not identical.
+
+
+2.1.2.4. Footer Magic Bytes
+
+ As the last step of the decoding process, the decoder MUST
+ verify the existence of Footer Magic Bytes. If they don't
+ match, an error MUST be indicated.
+
+ Using a C array and ASCII:
+ const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
+
+ In hexadecimal:
+ 59 5A
+
+ The primary reason to have Footer Magic Bytes is to make
+ it easier to detect incomplete files quickly, without
+ uncompressing. If the file does not end with Footer Magic Bytes
+ (excluding Stream Padding described in Section 2.2), it cannot
+ be undamaged, unless someone has intentionally appended garbage
+ after the end of the Stream.
+
+
+2.2. Stream Padding
+
+ Only the decoders that support decoding of concatenated Streams
+ MUST support Stream Padding.
+
+ Stream Padding MUST contain only null bytes. To preserve the
+ four-byte alignment of consecutive Streams, the size of Stream
+ Padding MUST be a multiple of four bytes. Empty Stream Padding
+ is allowed.
+
+ Note that non-empty Stream Padding is allowed at the end of the
+ file; there doesn't need to be a new Stream after non-empty
+ Stream Padding. This can be convenient in certain situations
+ [GNU-tar].
+
+ The possibility of Stream Padding MUST be taken into account
+ when designing an application that parses Streams backwards,
+ and the application supports concatenated Streams.
+
+
+3. Block
+
+ +==============+=================+===============+=======+
+ | Block Header | Compressed Data | Block Padding | Check |
+ +==============+=================+===============+=======+
+
+
+3.1. Block Header
+
+ +-------------------+-------------+=================+
+ | Block Header Size | Block Flags | Compressed Size |
+ +-------------------+-------------+=================+
+
+ +===================+======================+
+ ---> | Uncompressed Size | List of Filter Flags |
+ +===================+======================+
+
+ +================+--+--+--+--+
+ ---> | Header Padding | CRC32 |
+ +================+--+--+--+--+
+
+
+3.1.1. Block Header Size
+
+ This field overlaps with the Index Indicator field (see
+ Section 4.1).
+
+ This field contains the size of the Block Header field,
+ including the Block Header Size field itself. Valid values are
+ in the range [0x01, 0xFF], which indicate the size of the Block
+ Header as multiples of four bytes, minimum size being eight
+ bytes:
+
+ real_header_size = (encoded_header_size + 1) * 4;
+
+ If bigger Block Header is needed in future, a new field can be
+ added between the current Block Header and Compressed Data
+ fields. The presence of this new field would be indicated in
+ the Block Header.
+
+
+3.1.2. Block Flags
+
+ The first byte of the Block Flags field is a bit field:
+
+ Bit(s) Mask Description
+ 0-1 0x03 Number of filters (1-4)
+ 2-5 0x3C Reserved for future use; MUST be zero for now.
+ 6 0x40 The Compressed Size field is present.
+ 7 0x80 The Uncompressed Size field is present.
+
+ If any reserved bit is set, the decoder MUST indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Block Header
+ incorrectly.
+
+
+3.1.3. Compressed Size
+
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.2).
+
+ The Compressed Size field contains the size of the Compressed
+ Data field, which MUST be non-zero. Compressed Size is stored
+ using the encoding described in Section 1.2. If the Compressed
+ Size doesn't match the size of the Compressed Data field, the
+ decoder MUST indicate an error.
+
+
+3.1.4. Uncompressed Size
+
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.2).
+
+ The Uncompressed Size field contains the size of the Block
+ after uncompressing. Uncompressed Size is stored using the
+ encoding described in Section 1.2. If the Uncompressed Size
+ does not match the real uncompressed size, the decoder MUST
+ indicate an error.
+
+ Storing the Compressed Size and Uncompressed Size fields serves
+ several purposes:
+ - The decoder knows how much memory it needs to allocate
+ for a temporary buffer in multithreaded mode.
+ - Simple error detection: wrong size indicates a broken file.
+ - Seeking forwards to a specific location in streamed mode.
+
+ It should be noted that the only reliable way to determine
+ the real uncompressed size is to uncompress the Block,
+ because the Block Header and Index fields may contain
+ (intentionally or unintentionally) invalid information.
+
+
+3.1.5. List of Filter Flags
+
+ +================+================+ +================+
+ | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
+ +================+================+ +================+
+
+ The number of Filter Flags fields is stored in the Block Flags
+ field (see Section 3.1.2).
+
+ The format of each Filter Flags field is as follows:
+
+ +===========+====================+===================+
+ | Filter ID | Size of Properties | Filter Properties |
+ +===========+====================+===================+
+
+ Both Filter ID and Size of Properties are stored using the
+ encoding described in Section 1.2. Size of Properties indicates
+ the size of the Filter Properties field as bytes. The list of
+ officially defined Filter IDs and the formats of their Filter
+ Properties are described in Section 5.3.
+
+ Filter IDs greater than or equal to 0x4000_0000_0000_0000
+ (2^62) are reserved for implementation-specific internal use.
+ These Filter IDs MUST never be used in List of Filter Flags.
+
+
+3.1.6. Header Padding
+
+ This field contains as many null byte as it is needed to make
+ the Block Header have the size specified in Block Header Size.
+ If any of the bytes are not null bytes, the decoder MUST
+ indicate an error. It is possible that there is a new field
+ present which the decoder is not aware of, and can thus parse
+ the Block Header incorrectly.
+
+
+3.1.7. CRC32
+
+ The CRC32 is calculated over everything in the Block Header
+ field except the CRC32 field itself. It is stored as an
+ unsigned 32-bit little endian integer. If the calculated
+ value does not match the stored one, the decoder MUST indicate
+ an error.
+
+ By verifying the CRC32 of the Block Header before parsing the
+ actual contents allows the decoder to distinguish between
+ corrupt and unsupported files.
+
+
+3.2. Compressed Data
+
+ The format of Compressed Data depends on Block Flags and List
+ of Filter Flags. Excluding the descriptions of the simplest
+ filters in Section 5.3, the format of the filter-specific
+ encoded data is out of scope of this document.
+
+
+3.3. Block Padding
+
+ Block Padding MUST contain 0-3 null bytes to make the size of
+ the Block a multiple of four bytes. This can be needed when
+ the size of Compressed Data is not a multiple of four.
+
+
+3.4. Check
+
+ The type and size of the Check field depends on which bits
+ are set in the Stream Flags field (see Section 2.1.1.2).
+
+ The Check, when used, is calculated from the original
+ uncompressed data. If the calculated Check does not match the
+ stored one, the decoder MUST indicate an error. If the selected
+ type of Check is not supported by the decoder, it SHOULD
+ indicate a warning or error.
+
+
+4. Index
+
+ +-----------------+===================+
+ | Index Indicator | Number of Records |
+ +-----------------+===================+
+
+ +=================+===============+-+-+-+-+
+ ---> | List of Records | Index Padding | CRC32 |
+ +=================+===============+-+-+-+-+
+
+ Index serves several purporses. Using it, one can
+ - verify that all Blocks in a Stream have been processed;
+ - find out the uncompressed size of a Stream; and
+ - quickly access the beginning of any Block (random access).
+
+
+4.1. Index Indicator
+
+ This field overlaps with the Block Header Size field (see
+ Section 3.1.1). The value of Index Indicator is always 0x00.
+
+
+4.2. Number of Records
+
+ This field indicates how many Records there are in the List
+ of Records field, and thus how many Blocks there are in the
+ Stream. The value is stored using the encoding described in
+ Section 1.2. If the decoder has decoded all the Blocks of the
+ Stream, and then notices that the Number of Records doesn't
+ match the real number of Blocks, the decoder MUST indicate an
+ error.
+
+
+4.3. List of Records
+
+ List of Records consists of as many Records as indicated by the
+ Number of Records field:
+
+ +========+========+
+ | Record | Record | ...
+ +========+========+
+
+ Each Record contains information about one Block:
+
+ +===============+===================+
+ | Unpadded Size | Uncompressed Size |
+ +===============+===================+
+
+ If the decoder has decoded all the Blocks of the Stream, it
+ MUST verify that the contents of the Records match the real
+ Unpadded Size and Uncompressed Size of the respective Blocks.
+
+ Implementation hint: It is possible to verify the Index with
+ constant memory usage by calculating for example SHA-256 of
+ both the real size values and the List of Records, then
+ comparing the hash values. Implementing this using
+ non-cryptographic hash like CRC32 SHOULD be avoided unless
+ small code size is important.
+
+ If the decoder supports random-access reading, it MUST verify
+ that Unpadded Size and Uncompressed Size of every completely
+ decoded Block match the sizes stored in the Index. If only
+ partial Block is decoded, the decoder MUST verify that the
+ processed sizes don't exceed the sizes stored in the Index.
+
+
+4.3.1. Unpadded Size
+
+ This field indicates the size of the Block excluding the Block
+ Padding field. That is, Unpadded Size is the size of the Block
+ Header, Compressed Data, and Check fields. Unpadded Size is
+ stored using the encoding described in Section 1.2. The value
+ MUST never be zero; with the current structure of Blocks, the
+ actual minimum value for Unpadded Size is five.
+
+ Implementation note: Because the size of the Block Padding
+ field is not included in Unpadded Size, calculating the total
+ size of a Stream or doing random-access reading requires
+ calculating the actual size of the Blocks by rounding Unpadded
+ Sizes up to the next multiple of four.
+
+ The reason to exclude Block Padding from Unpadded Size is to
+ ease making a raw copy of Compressed Data without Block
+ Padding. This can be useful, for example, if someone wants
+ to convert Streams to some other file format quickly.
+
+
+4.3.2. Uncompressed Size
+
+ This field indicates the Uncompressed Size of the respective
+ Block as bytes. The value is stored using the encoding
+ described in Section 1.2.
+
+
+4.4. Index Padding
+
+ This field MUST contain 0-3 null bytes to pad the Index to
+ a multiple of four bytes.
+
+
+4.5. CRC32
+
+ The CRC32 is calculated over everything in the Index field
+ except the CRC32 field itself. The CRC32 is stored as an
+ unsigned 32-bit little endian integer. If the calculated
+ value does not match the stored one, the decoder MUST indicate
+ an error.
+
+
+5. Filter Chains
+
+ The Block Flags field defines how many filters are used. When
+ more than one filter is used, the filters are chained; that is,
+ the output of one filter is the input of another filter. The
+ following figure illustrates the direction of data flow.
+
+ v Uncompressed Data ^
+ | Filter 0 |
+ Encoder | Filter 1 | Decoder
+ | Filter n |
+ v Compressed Data ^
+
+
+5.1. Alignment
+
+ Alignment of uncompressed input data is usually the job of
+ the application producing the data. For example, to get the
+ best results, an archiver tool should make sure that all
+ PowerPC executable files in the archive stream start at
+ offsets that are multiples of four bytes.
+
+ Some filters, for example LZMA2, can be configured to take
+ advantage of specified alignment of input data. Note that
+ taking advantage of aligned input can be benefical also when
+ a filter is not the first filter in the chain. For example,
+ if you compress PowerPC executables, you may want to use the
+ PowerPC filter and chain that with the LZMA2 filter. Because
+ not only the input but also the output alignment of the PowerPC
+ filter is four bytes, it is now benefical to set LZMA2 settings
+ so that the LZMA2 encoder can take advantage of its
+ four-byte-aligned input data.
+
+ The output of the last filter in the chain is stored to the
+ Compressed Data field, which is is guaranteed to be aligned
+ to a multiple of four bytes relative to the beginning of the
+ Stream. This can increase
+ - speed, if the filtered data is handled multiple bytes at
+ a time by the filter-specific encoder and decoder,
+ because accessing aligned data in computer memory is
+ usually faster; and
+ - compression ratio, if the output data is later compressed
+ with an external compression tool.
+
+
+5.2. Security
+
+ If filters would be allowed to be chained freely, it would be
+ possible to create malicious files, that would be very slow to
+ decode. Such files could be used to create denial of service
+ attacks.
+
+ Slow files could occur when multiple filters are chained:
+
+ v Compressed input data
+ | Filter 1 decoder (last filter)
+ | Filter 0 decoder (non-last filter)
+ v Uncompressed output data
+
+ The decoder of the last filter in the chain produces a lot of
+ output from little input. Another filter in the chain takes the
+ output of the last filter, and produces very little output
+ while consuming a lot of input. As a result, a lot of data is
+ moved inside the filter chain, but the filter chain as a whole
+ gets very little work done.
+
+ To prevent this kind of slow files, there are restrictions on
+ how the filters can be chained. These restrictions MUST be
+ taken into account when designing new filters.
+
+ The maximum number of filters in the chain has been limited to
+ four, thus there can be at maximum of three non-last filters.
+ Of these three non-last filters, only two are allowed to change
+ the size of the data.
+
+ The non-last filters, that change the size of the data, MUST
+ have a limit how much the decoder can compress the data: the
+ decoder SHOULD produce at least n bytes of output when the
+ filter is given 2n bytes of input. This limit is not
+ absolute, but significant deviations MUST be avoided.
+
+ The above limitations guarantee that if the last filter in the
+ chain produces 4n bytes of output, the chain as a whole will
+ produce at least n bytes of output.
+
+
+5.3. Filters
+
+5.3.1. LZMA2
+
+ LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
+ compression algorithm with high compression ratio and fast
+ decompression. LZMA is based on LZ77 and range coding
+ algorithms.
+
+ LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
+ LZMA internally, but adds support for flushing the encoder,
+ uncompressed chunks, eases stateful decoder implementations,
+ and improves support for multithreading. Thus, the plain LZMA
+ will not be supported in this file format.
+
+ Filter ID: 0x21
+ Size of Filter Properties: 1 byte
+ Changes size of data: Yes
+ Allow as a non-last filter: No
+ Allow as the last filter: Yes
+
+ Preferred alignment:
+ Input data: Adjustable to 1/2/4/8/16 byte(s)
+ Output data: 1 byte
+
+ The format of the one-byte Filter Properties field is as
+ follows:
+
+ Bits Mask Description
+ 0-5 0x3F Dictionary Size
+ 6-7 0xC0 Reserved for future use; MUST be zero for now.
+
+ Dictionary Size is encoded with one-bit mantissa and five-bit
+ exponent. The smallest dictionary size is 4 KiB and the biggest
+ is 4 GiB.
+
+ Raw value Mantissa Exponent Dictionary size
+ 0 2 11 4 KiB
+ 1 3 11 6 KiB
+ 2 2 12 8 KiB
+ 3 3 12 12 KiB
+ 4 2 13 16 KiB
+ 5 3 13 24 KiB
+ 6 2 14 32 KiB
+ ... ... ... ...
+ 35 3 27 768 MiB
+ 36 2 28 1024 MiB
+ 37 3 29 1536 MiB
+ 38 2 30 2048 MiB
+ 39 3 30 3072 MiB
+ 40 2 31 4096 MiB - 1 B
+
+ Instead of having a table in the decoder, the dictionary size
+ can be decoded using the following C code:
+
+ const uint8_t bits = get_dictionary_flags() & 0x3F;
+ if (bits > 40)
+ return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
+
+ uint32_t dictionary_size;
+ if (bits == 40) {
+ dictionary_size = UINT32_MAX;
+ } else {
+ dictionary_size = 2 | (bits & 1);
+ dictionary_size <<= bits / 2 + 11;
+ }
+
+
+5.3.2. Branch/Call/Jump Filters for Executables
+
+ These filters convert relative branch, call, and jump
+ instructions to their absolute counterparts in executable
+ files. This conversion increases redundancy and thus
+ compression ratio.
+
+ Size of Filter Properties: 0 or 4 bytes
+ Changes size of data: No
+ Allow as a non-last filter: Yes
+ Allow as the last filter: No
+
+ Below is the list of filters in this category. The alignment
+ is the same for both input and output data.
+
+ Filter ID Alignment Description
+ 0x04 1 byte x86 filter (BCJ)
+ 0x05 4 bytes PowerPC (big endian) filter
+ 0x06 16 bytes IA64 filter
+ 0x07 4 bytes ARM (little endian) filter
+ 0x08 2 bytes ARM Thumb (little endian) filter
+ 0x09 4 bytes SPARC filter
+
+ If the size of Filter Properties is four bytes, the Filter
+ Properties field contains the start offset used for address
+ conversions. It is stored as an unsigned 32-bit little endian
+ integer. If the size of Filter Properties is zero, the start
+ offset is zero.
+
+ Setting the start offset may be useful if an executable has
+ multiple sections, and there are many cross-section calls.
+ Taking advantage of this feature usually requires usage of
+ the Subblock filter, whose design is not complete yet.
+
+
+5.3.3. Delta
+
+ The Delta filter may increase compression ratio when the value
+ of the next byte correlates with the value of an earlier byte
+ at specified distance.
+
+ Filter ID: 0x03
+ Size of Filter Properties: 1 byte
+ Changes size of data: No
+ Allow as a non-last filter: Yes
+ Allow as the last filter: No
+
+ Preferred alignment:
+ Input data: 1 byte
+ Output data: Same as the original input data
+
+ The Properties byte indicates the delta distance, which can be
+ 1-256 bytes backwards from the current byte: 0x00 indicates
+ distance of 1 byte and 0xFF distance of 256 bytes.
+
+
+5.3.3.1. Format of the Encoded Output
+
+ The code below illustrates both encoding and decoding with
+ the Delta filter.
+
+ // Distance is in the range [1, 256].
+ const unsigned int distance = get_properties_byte() + 1;
+ uint8_t pos = 0;
+ uint8_t delta[256];
+
+ memset(delta, 0, sizeof(delta));
+
+ while (1) {
+ const int byte = read_byte();
+ if (byte == EOF)
+ break;
+
+ uint8_t tmp = delta[(uint8_t)(distance + pos)];
+ if (is_encoder) {
+ tmp = (uint8_t)(byte) - tmp;
+ delta[pos] = (uint8_t)(byte);
+ } else {
+ tmp = (uint8_t)(byte) + tmp;
+ delta[pos] = tmp;
+ }
+
+ write_byte(tmp);
+ --pos;
+ }
+
+
+5.4. Custom Filter IDs
+
+ If a developer wants to use custom Filter IDs, he has two
+ choices. The first choice is to contact Lasse Collin and ask
+ him to allocate a range of IDs for the developer.
+
+ The second choice is to generate a 40-bit random integer,
+ which the developer can use as his personal Developer ID.
+ To minimalize the risk of collisions, Developer ID has to be
+ a randomly generated integer, not manually selected "hex word".
+ The following command, which works on many free operating
+ systems, can be used to generate Developer ID:
+
+ dd if=/dev/urandom bs=5 count=1 | hexdump
+
+ The developer can then use his Developer ID to create unique
+ (well, hopefully unique) Filter IDs.
+
+ Bits Mask Description
+ 0-15 0x0000_0000_0000_FFFF Filter ID
+ 16-55 0x00FF_FFFF_FFFF_0000 Developer ID
+ 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F
+
+ The resulting 63-bit integer will use 9 bytes of space when
+ stored using the encoding described in Section 1.2. To get
+ a shorter ID, see the beginning of this Section how to
+ request a custom ID range.
+
+
+5.4.1. Reserved Custom Filter ID Ranges
+
+ Range Description
+ 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility
+ 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
+ 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
+
+
+6. Cyclic Redundancy Checks
+
+ There are several incompatible variations to calculate CRC32
+ and CRC64. For simplicity and clarity, complete examples are
+ provided to calculate the checks as they are used in this file
+ format. Implementations MAY use different code as long as it
+ gives identical results.
+
+ The program below reads data from standard input, calculates
+ the CRC32 and CRC64 values, and prints the calculated values
+ as big endian hexadecimal strings to standard output.
+
+ #include <stddef.h>
+ #include <inttypes.h>
+ #include <stdio.h>
+
+ uint32_t crc32_table[256];
+ uint64_t crc64_table[256];
+
+ void
+ init(void)
+ {
+ static const uint32_t poly32 = UINT32_C(0xEDB88320);
+ static const uint64_t poly64
+ = UINT64_C(0xC96C5795D7870F42);
+
+ for (size_t i = 0; i < 256; ++i) {
+ uint32_t crc32 = i;
+ uint64_t crc64 = i;
+
+ for (size_t j = 0; j < 8; ++j) {
+ if (crc32 & 1)
+ crc32 = (crc32 >> 1) ^ poly32;
+ else
+ crc32 >>= 1;
+
+ if (crc64 & 1)
+ crc64 = (crc64 >> 1) ^ poly64;
+ else
+ crc64 >>= 1;
+ }
+
+ crc32_table[i] = crc32;
+ crc64_table[i] = crc64;
+ }
+ }
+
+ uint32_t
+ crc32(const uint8_t *buf, size_t size, uint32_t crc)
+ {
+ crc = ~crc;
+ for (size_t i = 0; i < size; ++i)
+ crc = crc32_table[buf[i] ^ (crc & 0xFF)]
+ ^ (crc >> 8);
+ return ~crc;
+ }
+
+ uint64_t
+ crc64(const uint8_t *buf, size_t size, uint64_t crc)
+ {
+ crc = ~crc;
+ for (size_t i = 0; i < size; ++i)
+ crc = crc64_table[buf[i] ^ (crc & 0xFF)]
+ ^ (crc >> 8);
+ return ~crc;
+ }
+
+ int
+ main()
+ {
+ init();
+
+ uint32_t value32 = 0;
+ uint64_t value64 = 0;
+ uint64_t total_size = 0;
+ uint8_t buf[8192];
+
+ while (1) {
+ const size_t buf_size
+ = fread(buf, 1, sizeof(buf), stdin);
+ if (buf_size == 0)
+ break;
+
+ total_size += buf_size;
+ value32 = crc32(buf, buf_size, value32);
+ value64 = crc64(buf, buf_size, value64);
+ }
+
+ printf("Bytes: %" PRIu64 "\n", total_size);
+ printf("CRC-32: 0x%08" PRIX32 "\n", value32);
+ printf("CRC-64: 0x%016" PRIX64 "\n", value64);
+
+ return 0;
+ }
+
+
+7. References
+
+ LZMA SDK - The original LZMA implementation
+ http://7-zip.org/sdk.html
+
+ LZMA Utils - LZMA adapted to POSIX-like systems
+ http://tukaani.org/lzma/
+
+ XZ Utils - The next generation of LZMA Utils
+ http://tukaani.org/xz/
+
+ [RFC-1952]
+ GZIP file format specification version 4.3
+ http://www.ietf.org/rfc/rfc1952.txt
+ - Notation of byte boxes in section "2.1. Overall conventions"
+
+ [RFC-2119]
+ Key words for use in RFCs to Indicate Requirement Levels
+ http://www.ietf.org/rfc/rfc2119.txt
+
+ [GNU-tar]
+ GNU tar 1.21 manual
+ http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
+ - Node 9.4.2 "Blocking Factor", paragraph that begins
+ "gzip will complain about trailing garbage"
+ - Note that this URL points to the latest version of the
+ manual, and may some day not contain the note which is in
+ 1.21. For the exact version of the manual, download GNU
+ tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
+