Renamed the file format specification to xz-file-format.txt

which is the filename used on the WWW.
author: Lasse Collin <lasse.collin@tukaani.org> 2009-05-01 11:20:23 +0300
committer: Lasse Collin <lasse.collin@tukaani.org> 2009-05-01 11:20:23 +0300
commit: 1496ff437c46f38303e0e94c511ca604b3a11f85 (patch)
tree: 9bff5cc1342c15f1313d871bdd5a8fe52b98cd43 /doc/file-format.txt
parent: Fixed a crash in liblzma. (diff)
download: xz-1496ff437c46f38303e0e94c511ca604b3a11f85.tar.xz
1 files changed, 0 insertions, 1127 deletions
diff --git a/doc/file-format.txt b/doc/file-format.txt
deleted file mode 100644
index fa2b3340..00000000
--- a/doc/file-format.txt
+++ /dev/null
@@ -1,1127 +0,0 @@
-
-The .xz File Format
-===================
-
-Version 1.0.0 (2009-01-14)
-
-
-        0. Preface
-           0.1. Notices and Acknowledgements
-           0.2. Getting the Latest Version
-           0.3. Version History
-        1. Conventions
-           1.1. Byte and Its Representation
-           1.2. Multibyte Integers
-        2. Overall Structure of .xz File
-           2.1. Stream
-                2.1.1. Stream Header
-                       2.1.1.1. Header Magic Bytes
-                       2.1.1.2. Stream Flags
-                       2.1.1.3. CRC32
-                2.1.2. Stream Footer
-                       2.1.2.1. CRC32
-                       2.1.2.2. Backward Size
-                       2.1.2.3. Stream Flags
-                       2.1.2.4. Footer Magic Bytes
-           2.2. Stream Padding
-        3. Block
-           3.1. Block Header
-                3.1.1. Block Header Size
-                3.1.2. Block Flags
-                3.1.3. Compressed Size
-                3.1.4. Uncompressed Size
-                3.1.5. List of Filter Flags
-                3.1.6. Header Padding
-                3.1.7. CRC32
-           3.2. Compressed Data
-           3.3. Block Padding
-           3.4. Check
-        4. Index
-           4.1. Index Indicator
-           4.2. Number of Records
-           4.3. List of Records
-                4.3.1. Unpadded Size
-                4.3.2. Uncompressed Size
-           4.4. Index Padding
-           4.5. CRC32
-        5. Filter Chains
-           5.1. Alignment
-           5.2. Security
-           5.3. Filters
-                5.3.1. LZMA2
-                5.3.2. Branch/Call/Jump Filters for Executables
-                5.3.3. Delta
-                       5.3.3.1. Format of the Encoded Output
-           5.4. Custom Filter IDs
-                5.4.1. Reserved Custom Filter ID Ranges
-        6. Cyclic Redundancy Checks
-        7. References
-
-
-0. Preface
-
-        This document describes the .xz file format (filename suffix
-        ".xz", MIME type "application/x-xz"). It is intended that this
-        this format replace the old .lzma format used by LZMA SDK and
-        LZMA Utils.
-
-
-0.1. Notices and Acknowledgements
-
-        This file format was designed by Lasse Collin
-        <lasse.collin@tukaani.org> and Igor Pavlov.
-
-        Special thanks for helping with this document goes to
-        Ville Koskinen. Thanks for helping with this document goes to
-        Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
-
-        This document has been put into the public domain.
-
-
-0.2. Getting the Latest Version
-
-        The latest official version of this document can be downloaded
-        from <http://tukaani.org/xz/xz-file-format.txt>.
-
-        Specific versions of this document have a filename
-        xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
-        For example, the version 1.0.0 of this document is available
-        at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
-
-
-0.3. Version History
-
-        Version   Date          Description
-        1.0.0     2008-01-14    The first official version
-
-
-1. Conventions
-
-        The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
-        "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
-        document are to be interpreted as described in [RFC-2119].
-
-        Indicating a warning means displaying a message, returning
-        appropriate exit status, or doing something else to let the
-        user know that something worth warning occurred. The operation
-        SHOULD still finish if a warning is indicated.
-
-        Indicating an error means displaying a message, returning
-        appropriate exit status, or doing something else to let the
-        user know that something prevented successfully finishing the
-        operation. The operation MUST be aborted once an error has
-        been indicated.
-
-
-1.1. Byte and Its Representation
-
-        In this document, byte is always 8 bits.
-
-        A "null byte" has all bits unset. That is, the value of a null
-        byte is 0x00.
-
-        To represent byte blocks, this document uses notation that
-        is similar to the notation used in [RFC-1952]:
-
-            +-------+
-            |  Foo  |   One byte.
-            +-------+
-
-            +---+---+
-            |  Foo  |   Two bytes; that is, some of the vertical bars
-            +---+---+   can be missing.
-
-            +=======+
-            |  Foo  |   Zero or more bytes.
-            +=======+
-
-        In this document, a boxed byte or a byte sequence declared
-        using this notation is called "a field". The example field
-        above would be called "the Foo field" or plain "Foo".
-
-        If there are many fields, they may be split to multiple lines.
-        This is indicated with an arrow ("--->"):
-
-            +=====+
-            | Foo |
-            +=====+
-
-                 +=====+
-            ---> | Bar |
-                 +=====+
-
-        The above is equivalent to this:
-
-            +=====+=====+
-            | Foo | Bar |
-            +=====+=====+
-
-
-1.2. Multibyte Integers
-
-        Multibyte integers of static length, such as CRC values,
-        are stored in little endian byte order (least significant
-        byte first).
-
-        When smaller values are more likely than bigger values (for
-        example file sizes), multibyte integers are encoded in a
-        variable-length representation:
-          - Numbers in the range [0, 127] are copied as is, and take
-            one byte of space.
-          - Bigger numbers will occupy two or more bytes. All but the
-            last byte of the multibyte representation have the highest
-            (eighth) bit set.
-
-        For now, the value of the variable-length integers is limited
-        to 63 bits, which limits the encoded size of the integer to
-        nine bytes. These limits may be increased in future if needed.
-
-        The following C code illustrates encoding and decoding of
-        variable-length integers. The functions return the number of
-        bytes occupied by the integer (1-9), or zero on error.
-
-            #include <stddef.h>
-            #include <inttypes.h>
-
-            size_t
-            encode(uint8_t buf[static 9], uint64_t num)
-            {
-                if (num > UINT64_MAX / 2)
-                    return 0;
-
-                size_t i = 0;
-
-                while (num >= 0x80) {
-                    buf[i++] = (uint8_t)(num) | 0x80;
-                    num >>= 7;
-                }
-
-                buf[i++] = (uint8_t)(num);
-
-                return i;
-            }
-
-            size_t
-            decode(const uint8_t buf[], size_t size_max, uint64_t *num)
-            {
-                if (size_max == 0)
-                    return 0;
-
-                if (size_max > 9)
-                    size_max = 9;
-
-                *num = buf[0] & 0x7F;
-                size_t i = 0;
-
-                while (buf[i++] & 0x80) {
-                    if (i >= size_max || buf[i] == 0x00)
-                        return 0;
-
-                    *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
-                }
-
-                return i;
-            }
-
-
-2. Overall Structure of .xz File
-
-        A standalone .xz files consist of one or more Streams which may
-        have Stream Padding between or after them:
-
-            +========+================+========+================+
-            | Stream | Stream Padding | Stream | Stream Padding | ...
-            +========+================+========+================+
-
-        While a typical file contains only one Stream and no Stream
-        Padding, a decoder handling standalone .xz files SHOULD support
-        files that have more than one Stream or Stream Padding.
-
-        In contrast to standalone .xz files, when the .xz file format
-        is used as an internal part of some other file format or
-        communication protocol, it usually is expected that the decoder
-        stops after the first Stream, and doesn't look for Stream
-        Padding or possibly other Streams.
-
-
-2.1. Stream
-
-        +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
-        |     Stream Header     | Block | Block | ... | Block |
-        +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
-
-             +=======+-+-+-+-+-+-+-+-+-+-+-+-+
-        ---> | Index |     Stream Footer     |
-             +=======+-+-+-+-+-+-+-+-+-+-+-+-+
-
-        All the above fields have a size that is a multiple of four. If
-        Stream is used as an internal part of another file format, it
-        is RECOMMENDED to make the Stream start at an offset that is
-        a multiple of four bytes.
-
-        Stream Header, Index, and Stream Footer are always present in
-        a Stream. The maximum size of the Index field is 16 GiB (2^34).
-
-        There are zero or more Blocks. The maximum number of Blocks is
-        limited only by the maximum size of the Index field.
-
-        Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
-        The same limit applies to the total amount of uncompressed
-        data stored in a Stream.
-
-        If an implementation supports handling .xz files with multiple
-        concatenated Streams, it MAY apply the above limits to the file
-        as a whole instead of limiting per Stream basis.
-
-
-2.1.1. Stream Header
-
-        +---+---+---+---+---+---+-------+------+--+--+--+--+
-        |  Header Magic Bytes   | Stream Flags |   CRC32   |
-        +---+---+---+---+---+---+-------+------+--+--+--+--+
-
-
-2.1.1.1. Header Magic Bytes
-
-        The first six (6) bytes of the Stream are so called Header
-        Magic Bytes. They can be used to identify the file type.
-
-            Using a C array and ASCII:
-            const uint8_t HEADER_MAGIC[6]
-                    = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
-
-            In plain hexadecimal:
-            FD 37 7A 58 5A 00
-
-        Notes:
-          - The first byte (0xFD) was chosen so that the files cannot
-            be erroneously detected as being in .lzma format, in which
-            the first byte is in the range [0x00, 0xE0].
-          - The sixth byte (0x00) was chosen to prevent applications
-            from misdetecting the file as a text file.
-
-        If the Header Magic Bytes don't match, the decoder MUST
-        indicate an error.
-
-
-2.1.1.2. Stream Flags
-
-        The first byte of Stream Flags is always a null byte. In future
-        this byte may be used to indicate new Stream version or other
-        Stream properties.
-
-        The second byte of Stream Flags is a bit field:
-
-            Bit(s)  Mask  Description
-             0-3    0x0F  Type of Check (see Section 3.4):
-                              ID    Size      Check name
-                              0x00   0 bytes  None
-                              0x01   4 bytes  CRC32
-                              0x02   4 bytes  (Reserved)
-                              0x03   4 bytes  (Reserved)
-                              0x04   8 bytes  CRC64
-                              0x05   8 bytes  (Reserved)
-                              0x06   8 bytes  (Reserved)
-                              0x07  16 bytes  (Reserved)
-                              0x08  16 bytes  (Reserved)
-                              0x09  16 bytes  (Reserved)
-                              0x0A  32 bytes  SHA-256
-                              0x0B  32 bytes  (Reserved)
-                              0x0C  32 bytes  (Reserved)
-                              0x0D  64 bytes  (Reserved)
-                              0x0E  64 bytes  (Reserved)
-                              0x0F  64 bytes  (Reserved)
-             4-7    0xF0  Reserved for future use; MUST be zero for now.
-
-        Implementations SHOULD support at least the Check IDs 0x00
-        (None) and 0x01 (CRC32). Supporting other Check IDs is
-        OPTIONAL. If an unsupported Check is used, the decoder SHOULD
-        indicate a warning or error.
-
-        If any reserved bit is set, the decoder MUST indicate an error.
-        It is possible that there is a new field present which the
-        decoder is not aware of, and can thus parse the Stream Header
-        incorrectly.
-
-
-2.1.1.3. CRC32
-
-        The CRC32 is calculated from the Stream Flags field. It is
-        stored as an unsigned 32-bit little endian integer. If the
-        calculated value does not match the stored one, the decoder
-        MUST indicate an error.
-
-        The idea is that Stream Flags would always be two bytes, even
-        if new features are needed. This way old decoders will be able
-        to verify the CRC32 calculated from Stream Flags, and thus
-        distinguish between corrupt files (CRC32 doesn't match) and
-        files that the decoder doesn't support (CRC32 matches but
-        Stream Flags has reserved bits set).
-
-
-2.1.2. Stream Footer
-
-        +-+-+-+-+---+---+---+---+-------+------+----------+---------+
-        | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
-        +-+-+-+-+---+---+---+---+-------+------+----------+---------+
-
-
-2.1.2.1. CRC32
-
-        The CRC32 is calculated from the Backward Size and Stream Flags
-        fields. It is stored as an unsigned 32-bit little endian
-        integer. If the calculated value does not match the stored one,
-        the decoder MUST indicate an error.
-
-        The reason to have the CRC32 field before the Backward Size and
-        Stream Flags fields is to keep the four-byte fields aligned to
-        a multiple of four bytes.
-
-
-2.1.2.2. Backward Size
-
-        Backward Size is stored as a 32-bit little endian integer,
-        which indicates the size of the Index field as multiple of
-        four bytes, minimum value being four bytes:
-
-            real_backward_size = (stored_backward_size + 1) * 4;
-
-        If the stored value does not match the real size of the Index
-        field, the decoder MUST indicate an error.
-
-        Using a fixed-size integer to store Backward Size makes
-        it slightly simpler to parse the Stream Footer when the
-        application needs to parse the Stream backwards.
-
-
-2.1.2.3. Stream Flags
-
-        This is a copy of the Stream Flags field from the Stream
-        Header. The information stored to Stream Flags is needed
-        when parsing the Stream backwards. The decoder MUST compare
-        the Stream Flags fields in both Stream Header and Stream
-        Footer, and indicate an error if they are not identical.
-
-
-2.1.2.4. Footer Magic Bytes
-
-        As the last step of the decoding process, the decoder MUST
-        verify the existence of Footer Magic Bytes. If they don't
-        match, an error MUST be indicated.
-
-            Using a C array and ASCII:
-            const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
-
-            In hexadecimal:
-            59 5A
-
-        The primary reason to have Footer Magic Bytes is to make
-        it easier to detect incomplete files quickly, without
-        uncompressing. If the file does not end with Footer Magic Bytes
-        (excluding Stream Padding described in Section 2.2), it cannot
-        be undamaged, unless someone has intentionally appended garbage
-        after the end of the Stream.
-
-
-2.2. Stream Padding
-
-        Only the decoders that support decoding of concatenated Streams
-        MUST support Stream Padding.
-
-        Stream Padding MUST contain only null bytes. To preserve the
-        four-byte alignment of consecutive Streams, the size of Stream
-        Padding MUST be a multiple of four bytes. Empty Stream Padding
-        is allowed.
-
-        Note that non-empty Stream Padding is allowed at the end of the
-        file; there doesn't need to be a new Stream after non-empty
-        Stream Padding. This can be convenient in certain situations
-        [GNU-tar].
-
-        The possibility of Stream Padding MUST be taken into account
-        when designing an application that parses Streams backwards,
-        and the application supports concatenated Streams.
-
-
-3. Block
-
-        +==============+=================+===============+=======+
-        | Block Header | Compressed Data | Block Padding | Check |
-        +==============+=================+===============+=======+
-
-
-3.1. Block Header
-
-        +-------------------+-------------+=================+
-        | Block Header Size | Block Flags | Compressed Size |
-        +-------------------+-------------+=================+
-
-             +===================+======================+
-        ---> | Uncompressed Size | List of Filter Flags |
-             +===================+======================+
-
-             +================+--+--+--+--+
-        ---> | Header Padding |   CRC32   |
-             +================+--+--+--+--+
-
-
-3.1.1. Block Header Size
-
-        This field overlaps with the Index Indicator field (see
-        Section 4.1).
-
-        This field contains the size of the Block Header field,
-        including the Block Header Size field itself. Valid values are
-        in the range [0x01, 0xFF], which indicate the size of the Block
-        Header as multiples of four bytes, minimum size being eight
-        bytes:
-
-            real_header_size = (encoded_header_size + 1) * 4;
-
-        If bigger Block Header is needed in future, a new field can be
-        added between the current Block Header and Compressed Data
-        fields. The presence of this new field would be indicated in
-        the Block Header.
-
-
-3.1.2. Block Flags
-
-        The first byte of the Block Flags field is a bit field:
-
-            Bit(s)  Mask  Description
-             0-1    0x03  Number of filters (1-4)
-             2-5    0x3C  Reserved for future use; MUST be zero for now.
-              6     0x40  The Compressed Size field is present.
-              7     0x80  The Uncompressed Size field is present.
-
-        If any reserved bit is set, the decoder MUST indicate an error.
-        It is possible that there is a new field present which the
-        decoder is not aware of, and can thus parse the Block Header
-        incorrectly.
-
-
-3.1.3. Compressed Size
-
-        This field is present only if the appropriate bit is set in
-        the Block Flags field (see Section 3.1.2).
-
-        The Compressed Size field contains the size of the Compressed
-        Data field, which MUST be non-zero. Compressed Size is stored
-        using the encoding described in Section 1.2. If the Compressed
-        Size doesn't match the size of the Compressed Data field, the
-        decoder MUST indicate an error.
-
-
-3.1.4. Uncompressed Size
-
-        This field is present only if the appropriate bit is set in
-        the Block Flags field (see Section 3.1.2).
-
-        The Uncompressed Size field contains the size of the Block
-        after uncompressing. Uncompressed Size is stored using the
-        encoding described in Section 1.2. If the Uncompressed Size
-        does not match the real uncompressed size, the decoder MUST
-        indicate an error.
-
-        Storing the Compressed Size and Uncompressed Size fields serves
-        several purposes:
-          - The decoder knows how much memory it needs to allocate
-            for a temporary buffer in multithreaded mode.
-          - Simple error detection: wrong size indicates a broken file.
-          - Seeking forwards to a specific location in streamed mode.
-
-        It should be noted that the only reliable way to determine
-        the real uncompressed size is to uncompress the Block,
-        because the Block Header and Index fields may contain
-        (intentionally or unintentionally) invalid information.
-
-
-3.1.5. List of Filter Flags
-
-        +================+================+     +================+
-        | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
-        +================+================+     +================+
-
-        The number of Filter Flags fields is stored in the Block Flags
-        field (see Section 3.1.2).
-
-        The format of each Filter Flags field is as follows:
-
-            +===========+====================+===================+
-            | Filter ID | Size of Properties | Filter Properties |
-            +===========+====================+===================+
-
-        Both Filter ID and Size of Properties are stored using the
-        encoding described in Section 1.2. Size of Properties indicates
-        the size of the Filter Properties field as bytes. The list of
-        officially defined Filter IDs and the formats of their Filter
-        Properties are described in Section 5.3.
-
-        Filter IDs greater than or equal to 0x4000_0000_0000_0000
-        (2^62) are reserved for implementation-specific internal use.
-        These Filter IDs MUST never be used in List of Filter Flags.
-
-
-3.1.6. Header Padding
-
-        This field contains as many null byte as it is needed to make
-        the Block Header have the size specified in Block Header Size.
-        If any of the bytes are not null bytes, the decoder MUST
-        indicate an error. It is possible that there is a new field
-        present which the decoder is not aware of, and can thus parse
-        the Block Header incorrectly.
-
-
-3.1.7. CRC32
-
-        The CRC32 is calculated over everything in the Block Header
-        field except the CRC32 field itself. It is stored as an
-        unsigned 32-bit little endian integer. If the calculated
-        value does not match the stored one, the decoder MUST indicate
-        an error.
-
-        By verifying the CRC32 of the Block Header before parsing the
-        actual contents allows the decoder to distinguish between
-        corrupt and unsupported files.
-
-
-3.2. Compressed Data
-
-        The format of Compressed Data depends on Block Flags and List
-        of Filter Flags. Excluding the descriptions of the simplest
-        filters in Section 5.3, the format of the filter-specific
-        encoded data is out of scope of this document.
-
-
-3.3. Block Padding
-
-        Block Padding MUST contain 0-3 null bytes to make the size of
-        the Block a multiple of four bytes. This can be needed when
-        the size of Compressed Data is not a multiple of four.
-
-
-3.4. Check
-
-        The type and size of the Check field depends on which bits
-        are set in the Stream Flags field (see Section 2.1.1.2).
-
-        The Check, when used, is calculated from the original
-        uncompressed data. If the calculated Check does not match the
-        stored one, the decoder MUST indicate an error. If the selected
-        type of Check is not supported by the decoder, it SHOULD
-        indicate a warning or error.
-
-
-4. Index
-
-        +-----------------+===================+
-        | Index Indicator | Number of Records |
-        +-----------------+===================+
-
-             +=================+===============+-+-+-+-+
-        ---> | List of Records | Index Padding | CRC32 |
-             +=================+===============+-+-+-+-+
-
-        Index serves several purporses. Using it, one can
-          - verify that all Blocks in a Stream have been processed;
-          - find out the uncompressed size of a Stream; and
-          - quickly access the beginning of any Block (random access).
-
-
-4.1. Index Indicator
-
-        This field overlaps with the Block Header Size field (see
-        Section 3.1.1). The value of Index Indicator is always 0x00.
-
-
-4.2. Number of Records
-
-        This field indicates how many Records there are in the List
-        of Records field, and thus how many Blocks there are in the
-        Stream. The value is stored using the encoding described in
-        Section 1.2. If the decoder has decoded all the Blocks of the
-        Stream, and then notices that the Number of Records doesn't
-        match the real number of Blocks, the decoder MUST indicate an
-        error.
-
-
-4.3. List of Records
-
-        List of Records consists of as many Records as indicated by the
-        Number of Records field:
-
-            +========+========+
-            | Record | Record | ...
-            +========+========+
-
-        Each Record contains information about one Block:
-
-            +===============+===================+
-            | Unpadded Size | Uncompressed Size |
-            +===============+===================+
-
-        If the decoder has decoded all the Blocks of the Stream, it
-        MUST verify that the contents of the Records match the real
-        Unpadded Size and Uncompressed Size of the respective Blocks.
-
-        Implementation hint: It is possible to verify the Index with
-        constant memory usage by calculating for example SHA-256 of
-        both the real size values and the List of Records, then
-        comparing the hash values. Implementing this using
-        non-cryptographic hash like CRC32 SHOULD be avoided unless
-        small code size is important.
-
-        If the decoder supports random-access reading, it MUST verify
-        that Unpadded Size and Uncompressed Size of every completely
-        decoded Block match the sizes stored in the Index. If only
-        partial Block is decoded, the decoder MUST verify that the
-        processed sizes don't exceed the sizes stored in the Index.
-
-
-4.3.1. Unpadded Size
-
-        This field indicates the size of the Block excluding the Block
-        Padding field. That is, Unpadded Size is the size of the Block
-        Header, Compressed Data, and Check fields. Unpadded Size is
-        stored using the encoding described in Section 1.2. The value
-        MUST never be zero; with the current structure of Blocks, the
-        actual minimum value for Unpadded Size is five.
-
-        Implementation note: Because the size of the Block Padding
-        field is not included in Unpadded Size, calculating the total
-        size of a Stream or doing random-access reading requires
-        calculating the actual size of the Blocks by rounding Unpadded
-        Sizes up to the next multiple of four.
-
-        The reason to exclude Block Padding from Unpadded Size is to
-        ease making a raw copy of Compressed Data without Block
-        Padding. This can be useful, for example, if someone wants
-        to convert Streams to some other file format quickly.
-
-
-4.3.2. Uncompressed Size
-
-        This field indicates the Uncompressed Size of the respective
-        Block as bytes. The value is stored using the encoding
-        described in Section 1.2.
-
-
-4.4. Index Padding
-
-        This field MUST contain 0-3 null bytes to pad the Index to
-        a multiple of four bytes.
-
-
-4.5. CRC32
-
-        The CRC32 is calculated over everything in the Index field
-        except the CRC32 field itself. The CRC32 is stored as an
-        unsigned 32-bit little endian integer. If the calculated
-        value does not match the stored one, the decoder MUST indicate
-        an error.
-
-
-5. Filter Chains
-
-        The Block Flags field defines how many filters are used. When
-        more than one filter is used, the filters are chained; that is,
-        the output of one filter is the input of another filter. The
-        following figure illustrates the direction of data flow.
-
-                    v   Uncompressed Data   ^
-                    |       Filter 0        |
-            Encoder |       Filter 1        | Decoder
-                    |       Filter n        |
-                    v    Compressed Data    ^
-
-
-5.1. Alignment
-
-        Alignment of uncompressed input data is usually the job of
-        the application producing the data. For example, to get the
-        best results, an archiver tool should make sure that all
-        PowerPC executable files in the archive stream start at
-        offsets that are multiples of four bytes.
-
-        Some filters, for example LZMA2, can be configured to take
-        advantage of specified alignment of input data. Note that
-        taking advantage of aligned input can be benefical also when
-        a filter is not the first filter in the chain. For example,
-        if you compress PowerPC executables, you may want to use the
-        PowerPC filter and chain that with the LZMA2 filter. Because
-        not only the input but also the output alignment of the PowerPC
-        filter is four bytes, it is now benefical to set LZMA2 settings
-        so that the LZMA2 encoder can take advantage of its
-        four-byte-aligned input data.
-
-        The output of the last filter in the chain is stored to the
-        Compressed Data field, which is is guaranteed to be aligned
-        to a multiple of four bytes relative to the beginning of the
-        Stream. This can increase
-          - speed, if the filtered data is handled multiple bytes at
-            a time by the filter-specific encoder and decoder,
-            because accessing aligned data in computer memory is
-            usually faster; and
-          - compression ratio, if the output data is later compressed
-            with an external compression tool.
-
-
-5.2. Security
-
-        If filters would be allowed to be chained freely, it would be
-        possible to create malicious files, that would be very slow to
-        decode. Such files could be used to create denial of service
-        attacks.
-
-        Slow files could occur when multiple filters are chained:
-
-            v   Compressed input data
-            |   Filter 1 decoder (last filter)
-            |   Filter 0 decoder (non-last filter)
-            v   Uncompressed output data
-
-        The decoder of the last filter in the chain produces a lot of
-        output from little input. Another filter in the chain takes the
-        output of the last filter, and produces very little output
-        while consuming a lot of input. As a result, a lot of data is
-        moved inside the filter chain, but the filter chain as a whole
-        gets very little work done.
-
-        To prevent this kind of slow files, there are restrictions on
-        how the filters can be chained. These restrictions MUST be
-        taken into account when designing new filters.
-
-        The maximum number of filters in the chain has been limited to
-        four, thus there can be at maximum of three non-last filters.
-        Of these three non-last filters, only two are allowed to change
-        the size of the data.
-
-        The non-last filters, that change the size of the data, MUST
-        have a limit how much the decoder can compress the data: the
-        decoder SHOULD produce at least n bytes of output when the
-        filter is given 2n bytes of input. This  limit is not
-        absolute, but significant deviations MUST be avoided.
-
-        The above limitations guarantee that if the last filter in the
-        chain produces 4n bytes of output, the chain as a whole will
-        produce at least n bytes of output.
-
-
-5.3. Filters
-
-5.3.1. LZMA2
-
-        LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse
-        compression algorithm with high compression ratio and fast
-        decompression. LZMA is based on LZ77 and range coding
-        algorithms.
-
-        LZMA2 is an extensions on top of the original LZMA. LZMA2 uses
-        LZMA internally, but adds support for flushing the encoder,
-        uncompressed chunks, eases stateful decoder implementations,
-        and improves support for multithreading. Thus, the plain LZMA
-        will not be supported in this file format.
-
-            Filter ID:                  0x21
-            Size of Filter Properties:  1 byte
-            Changes size of data:       Yes
-            Allow as a non-last filter: No
-            Allow as the last filter:   Yes
-
-            Preferred alignment:
-                Input data:             Adjustable to 1/2/4/8/16 byte(s)
-                Output data:            1 byte
-
-        The format of the one-byte Filter Properties field is as
-        follows:
-
-            Bits   Mask   Description
-            0-5    0x3F   Dictionary Size
-            6-7    0xC0   Reserved for future use; MUST be zero for now.
-
-        Dictionary Size is encoded with one-bit mantissa and five-bit
-        exponent. The smallest dictionary size is 4 KiB and the biggest
-        is 4 GiB.
-
-            Raw value   Mantissa   Exponent   Dictionary size
-                0           2         11         4 KiB
-                1           3         11         6 KiB
-                2           2         12         8 KiB
-                3           3         12        12 KiB
-                4           2         13        16 KiB
-                5           3         13        24 KiB
-                6           2         14        32 KiB
-              ...         ...        ...      ...
-               35           3         27       768 MiB
-               36           2         28      1024 MiB
-               37           3         29      1536 MiB
-               38           2         30      2048 MiB
-               39           3         30      3072 MiB
-               40           2         31      4096 MiB - 1 B
-
-        Instead of having a table in the decoder, the dictionary size
-        can be decoded using the following C code:
-
-            const uint8_t bits = get_dictionary_flags() & 0x3F;
-            if (bits > 40)
-                return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
-
-            uint32_t dictionary_size;
-            if (bits == 40) {
-                dictionary_size = UINT32_MAX;
-            } else {
-                dictionary_size = 2 | (bits & 1);
-                dictionary_size <<= bits / 2 + 11;
-            }
-
-
-5.3.2. Branch/Call/Jump Filters for Executables
-
-        These filters convert relative branch, call, and jump
-        instructions to their absolute counterparts in executable
-        files. This conversion increases redundancy and thus
-        compression ratio.
-
-            Size of Filter Properties:  0 or 4 bytes
-            Changes size of data:       No
-            Allow as a non-last filter: Yes
-            Allow as the last filter:   No
-
-        Below is the list of filters in this category. The alignment
-        is the same for both input and output data.
-
-            Filter ID   Alignment   Description
-              0x04       1 byte     x86 filter (BCJ)
-              0x05       4 bytes    PowerPC (big endian) filter
-              0x06      16 bytes    IA64 filter
-              0x07       4 bytes    ARM (little endian) filter
-              0x08       2 bytes    ARM Thumb (little endian) filter
-              0x09       4 bytes    SPARC filter
-
-        If the size of Filter Properties is four bytes, the Filter
-        Properties field contains the start offset used for address
-        conversions. It is stored as an unsigned 32-bit little endian
-        integer. If the size of Filter Properties is zero, the start
-        offset is zero.
-
-        Setting the start offset may be useful if an executable has
-        multiple sections, and there are many cross-section calls.
-        Taking advantage of this feature usually requires usage of
-        the Subblock filter, whose design is not complete yet.
-
-
-5.3.3. Delta
-
-        The Delta filter may increase compression ratio when the value
-        of the next byte correlates with the value of an earlier byte
-        at specified distance.
-
-            Filter ID:                  0x03
-            Size of Filter Properties:  1 byte
-            Changes size of data:       No
-            Allow as a non-last filter: Yes
-            Allow as the last filter:   No
-
-            Preferred alignment:
-                Input data:             1 byte
-                Output data:            Same as the original input data
-
-        The Properties byte indicates the delta distance, which can be
-        1-256 bytes backwards from the current byte: 0x00 indicates
-        distance of 1 byte and 0xFF distance of 256 bytes.
-
-
-5.3.3.1. Format of the Encoded Output
-
-        The code below illustrates both encoding and decoding with
-        the Delta filter.
-
-            // Distance is in the range [1, 256].
-            const unsigned int distance = get_properties_byte() + 1;
-            uint8_t pos = 0;
-            uint8_t delta[256];
-
-            memset(delta, 0, sizeof(delta));
-
-            while (1) {
-                const int byte = read_byte();
-                if (byte == EOF)
-                    break;
-
-                uint8_t tmp = delta[(uint8_t)(distance + pos)];
-                if (is_encoder) {
-                    tmp = (uint8_t)(byte) - tmp;
-                    delta[pos] = (uint8_t)(byte);
-                } else {
-                    tmp = (uint8_t)(byte) + tmp;
-                    delta[pos] = tmp;
-                }
-
-                write_byte(tmp);
-                --pos;
-            }
-
-
-5.4. Custom Filter IDs
-
-        If a developer wants to use custom Filter IDs, he has two
-        choices. The first choice is to contact Lasse Collin and ask
-        him to allocate a range of IDs for the developer.
-
-        The second choice is to generate a 40-bit random integer,
-        which the developer can use as his personal Developer ID.
-        To minimalize the risk of collisions, Developer ID has to be
-        a randomly generated integer, not manually selected "hex word".
-        The following command, which works on many free operating
-        systems, can be used to generate Developer ID:
-
-            dd if=/dev/urandom bs=5 count=1 | hexdump
-
-        The developer can then use his Developer ID to create unique
-        (well, hopefully unique) Filter IDs.
-
-            Bits    Mask                    Description
-             0-15   0x0000_0000_0000_FFFF   Filter ID
-            16-55   0x00FF_FFFF_FFFF_0000   Developer ID
-            56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
-
-        The resulting 63-bit integer will use 9 bytes of space when
-        stored using the encoding described in Section 1.2. To get
-        a shorter ID, see the beginning of this Section how to
-        request a custom ID range.
-
-
-5.4.1. Reserved Custom Filter ID Ranges
-
-        Range                       Description
-        0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
-        0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
-        0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
-
-
-6. Cyclic Redundancy Checks
-
-        There are several incompatible variations to calculate CRC32
-        and CRC64. For simplicity and clarity, complete examples are
-        provided to calculate the checks as they are used in this file
-        format. Implementations MAY use different code as long as it
-        gives identical results.
-
-        The program below reads data from standard input, calculates
-        the CRC32 and CRC64 values, and prints the calculated values
-        as big endian hexadecimal strings to standard output.
-
-            #include <stddef.h>
-            #include <inttypes.h>
-            #include <stdio.h>
-
-            uint32_t crc32_table[256];
-            uint64_t crc64_table[256];
-
-            void
-            init(void)
-            {
-                static const uint32_t poly32 = UINT32_C(0xEDB88320);
-                static const uint64_t poly64
-                        = UINT64_C(0xC96C5795D7870F42);
-
-                for (size_t i = 0; i < 256; ++i) {
-                    uint32_t crc32 = i;
-                    uint64_t crc64 = i;
-
-                    for (size_t j = 0; j < 8; ++j) {
-                        if (crc32 & 1)
-                            crc32 = (crc32 >> 1) ^ poly32;
-                        else
-                            crc32 >>= 1;
-
-                        if (crc64 & 1)
-                            crc64 = (crc64 >> 1) ^ poly64;
-                        else
-                            crc64 >>= 1;
-                    }
-
-                    crc32_table[i] = crc32;
-                    crc64_table[i] = crc64;
-                }
-            }
-
-            uint32_t
-            crc32(const uint8_t *buf, size_t size, uint32_t crc)
-            {
-                crc = ~crc;
-                for (size_t i = 0; i < size; ++i)
-                    crc = crc32_table[buf[i] ^ (crc & 0xFF)]
-                            ^ (crc >> 8);
-                return ~crc;
-            }
-
-            uint64_t
-            crc64(const uint8_t *buf, size_t size, uint64_t crc)
-            {
-                crc = ~crc;
-                for (size_t i = 0; i < size; ++i)
-                    crc = crc64_table[buf[i] ^ (crc & 0xFF)]
-                            ^ (crc >> 8);
-                return ~crc;
-            }
-
-            int
-            main()
-            {
-                init();
-
-                uint32_t value32 = 0;
-                uint64_t value64 = 0;
-                uint64_t total_size = 0;
-                uint8_t buf[8192];
-
-                while (1) {
-                    const size_t buf_size
-                            = fread(buf, 1, sizeof(buf), stdin);
-                    if (buf_size == 0)
-                        break;
-
-                    total_size += buf_size;
-                    value32 = crc32(buf, buf_size, value32);
-                    value64 = crc64(buf, buf_size, value64);
-                }
-
-                printf("Bytes:  %" PRIu64 "\n", total_size);
-                printf("CRC-32: 0x%08" PRIX32 "\n", value32);
-                printf("CRC-64: 0x%016" PRIX64 "\n", value64);
-
-                return 0;
-            }
-
-
-7. References
-
-        LZMA SDK - The original LZMA implementation
-        http://7-zip.org/sdk.html
-
-        LZMA Utils - LZMA adapted to POSIX-like systems
-        http://tukaani.org/lzma/
-
-        XZ Utils - The next generation of LZMA Utils
-        http://tukaani.org/xz/
-
-        [RFC-1952]
-        GZIP file format specification version 4.3
-        http://www.ietf.org/rfc/rfc1952.txt
-          - Notation of byte boxes in section "2.1. Overall conventions"
-
-        [RFC-2119]
-        Key words for use in RFCs to Indicate Requirement Levels
-        http://www.ietf.org/rfc/rfc2119.txt
-
-        [GNU-tar]
-        GNU tar 1.21 manual
-        http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
-          - Node 9.4.2 "Blocking Factor", paragraph that begins
-            "gzip will complain about trailing garbage"
-          - Note that this URL points to the latest version of the
-            manual, and may some day not contain the note which is in
-            1.21. For the exact version of the manual, download GNU
-            tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
-
author	Lasse Collin <lasse.collin@tukaani.org>	2009-05-01 11:20:23 +0300
committer	Lasse Collin <lasse.collin@tukaani.org>	2009-05-01 11:20:23 +0300
commit	1496ff437c46f38303e0e94c511ca604b3a11f85 (patch)
tree	9bff5cc1342c15f1313d871bdd5a8fe52b98cd43 /doc/file-format.txt
parent	Fixed a crash in liblzma. (diff)
download	xz-1496ff437c46f38303e0e94c511ca604b3a11f85.tar.xz