path: root/doc/file-format.txt
author Lasse Collin <lasse.collin@tukaani.org> 2008-06-17 15:03:46 +0300
committer Lasse Collin <lasse.collin@tukaani.org> 2008-06-17 15:03:46 +0300
commit bf6348d1a3ff09fdc06940468f318f75ffa6af11 (patch)
tree 60db0660cd88e208997d1133a8bf089c83ab2ec8 /doc/file-format.txt
parent Fix uninitialized variable in LZMA encoder. This was (diff)
download xz-bf6348d1a3ff09fdc06940468f318f75ffa6af11.tar.xz
Update the file format specification draft. The new one is
a lot simpler than the previous versions, but it also means that the existing code will change a lot.
Diffstat (limited to '')
-rw-r--r-- doc/file-format.txt 1794
1 files changed, 508 insertions, 1286 deletions
diff --git a/doc/file-format.txt b/doc/file-format.txt
index 2c8cd486..49c9a75f 100644
--- a/doc/file-format.txt
+++ b/doc/file-format.txt
@@ -3,82 +3,54 @@ The .lzma File Format
---------------------
0. Preface
- 0.1. Copyright Notices
- 0.2. Changes
+ 0.1. Copyright Notices
+ 0.2. Changes
1. Conventions
- 1.1. Byte and Its Representation
- 1.2. Multibyte Integers
- 2. Stream
- 2.1. Stream Types
- 2.1.1. Single-Block Stream
- 2.1.2. Multi-Block Stream
- 2.2. Stream Header
- 2.2.1. Header Magic Bytes
- 2.2.2. Stream Flags
- 2.2.3. CRC32
+ 1.1. Byte and Its Representation
+ 1.2. Multibyte Integers
+ 2. Overall Structure of .lzma File
+ 2.1. Stream
+ 2.1.1. Stream Header
+ 2.1.1.1. Header Magic Bytes
+ 2.1.1.2. Stream Flags
+ 2.1.1.3. CRC32
+ 2.1.2. Stream Footer
+ 2.1.2.1. CRC32
+ 2.1.2.2. Backward Size
+ 2.1.2.3. Stream Flags
+ 2.1.2.4. Footer Magic Bytes
+ 2.2. Stream Padding
3. Block
- 3.1. Block Header
- 3.1.1. Block Flags
- 3.1.2. Compressed Size
- 3.1.3. Uncompressed Size
- 3.1.4. List of Filter Flags
- 3.1.4.1. Misc
- 3.1.4.2. External ID
- 3.1.4.3. External Size of Properties
- 3.1.4.4. Filter Properties
- 3.1.5. CRC32
- 3.1.6. Header Padding
- 3.2. Compressed Data
- 3.3. Block Footer
- 3.3.1. Check
- 3.3.2. Stream Footer
- 3.3.2.1. Uncompressed Size
- 3.3.2.2. Backward Size
- 3.3.2.3. Stream Flags
- 3.3.2.4. Footer Magic Bytes
- 3.3.3. Footer Padding
- 4. Filters
- 4.1. Detecting when All Data Has Been Decoded
- 4.1.1. With Uncompressed Size
- 4.1.2. With End of Input
- 4.1.3. With End of Payload Marker
- 4.2. Alignment
- 4.3. Filters
- 4.3.1. Copy
- 4.3.2. Subblock
- 4.3.2.1. Format of the Encoded Output
- 4.3.3. Delta
- 4.3.3.1. Format of the Encoded Output
- 4.3.4. LZMA
- 4.3.4.1. LZMA Properties
- 4.3.4.2. Dictionary Flags
- 4.3.5. Branch/Call/Jump Filters for Executables
- 5. Metadata
- 5.1. Metadata Flags
- 5.2. Size of Header Metadata Block
- 5.3. Total Size
- 5.4. Uncompressed Size
- 5.5. Index
- 5.5.1. Number of Data Blocks
- 5.5.2. Total Sizes
- 5.5.3. Uncompressed Sizes
- 5.6. Extra
- 5.6.1. 0x00: Dummy/Padding
- 5.6.2. 0x01: OpenPGP Signature
- 5.6.3. 0x02: Filter Information
- 5.6.4. 0x03: Comment
- 5.6.5. 0x04: List of Checks
- 5.6.6. 0x05: Original Filename
- 5.6.7. 0x07: Modification Time
- 5.6.8. 0x09: High-Resolution Modification Time
- 5.6.9. 0x0B: MIME Type
- 5.6.10. 0x0D: Homepage URL
- 6. Custom Filter and Extra Record IDs
- 6.1. Reserved Custom Filter ID Ranges
- 7. Cyclic Redundancy Checks
- 8. References
- 8.1. Normative References
- 8.2. Informative References
+ 3.1. Block Header
+ 3.1.1. Block Header Size
+ 3.1.2. Block Flags
+ 3.1.3. Compressed Size
+ 3.1.4. Uncompressed Size
+ 3.1.5. List of Filter Flags
+ 3.1.6. Header Padding
+ 3.1.7. CRC32
+ 3.2. Compressed Data
+ 3.3. Check
+ 4. Index
+ 4.1. Index Indicator
+ 4.2. Number of Records
+ 4.3. List of Records
+ 4.3.1. Total Size
+ 4.3.2. Uncompressed Size
+ 4.4. Index Padding
+ 4.5. CRC32
+ 5. Filter Chains
+ 5.1. Alignment
+ 5.2. Security
+ 5.3. Filters
+ 5.3.1. LZMA2
+ 5.3.2. Branch/Call/Jump Filters for Executables
+ 5.3.3. Delta
+ 5.3.3.1. Format of the Encoded Output
+ 5.4. Custom Filter IDs
+ 5.4.1. Reserved Custom Filter ID Ranges
+ 6. Cyclic Redundancy Checks
+ 7. References
0. Preface
@@ -95,7 +67,7 @@ The .lzma File Format
0.1. Copyright Notices
- Copyright (C) 2006, 2007 Lasse Collin <lasse.collin@tukaani.org>
+ Copyright (C) 2006-2008 Lasse Collin <lasse.collin@tukaani.org>
Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
Copying and distribution of this file, with or without
@@ -106,13 +78,14 @@ The .lzma File Format
All source code examples given in this document are put into
the public domain by the authors of this document.
- Thanks for helping with this document goes to Igor Pavlov,
- Mark Adler and Mikko Pouru.
+ Special thanks for helping with this document go to
+ Igor Pavlov. Thanks for helping with this document also go to
+ Mark Adler, H. Peter Anvin, and Mikko Pouru.
0.2. Changes
- Last modified: 2008-02-01 19:25+0200
+ Last modified: 2008-06-17 14:10+0300
(A changelog will be kept once the first official version
is made.)
@@ -161,7 +134,7 @@ The .lzma File Format
In this document, a boxed byte or a byte sequence declared
using this notation is called `a field'. The example field
- above would be called called `the Foo field' or plain `Foo'.
+ above would be called `the Foo field' or plain `Foo'.
1.2. Multibyte Integers
@@ -170,39 +143,22 @@ The .lzma File Format
are stored in little endian byte order (least significant
byte first).
- When smaller values are more likely than bigger values (e.g.
- file sizes), multibyte integers are encoded in a simple
+ When smaller values are more likely than bigger values (for
+ example file sizes), multibyte integers are encoded in a
variable-length representation:
- Numbers in the range [0, 127] are copied as is, and take
one byte of space.
- - Bigger numbers will occupy two or more bytes. The lowest
- seven bits of every byte are used for data; the highest
- (eighth) bit indicates either that
- 0) the byte is in the middle of the byte sequence, or
- 1) the byte is the first or the last byte.
+ - Bigger numbers will occupy two or more bytes. All but the
+ last byte of the multibyte representation have the highest
+ (eighth) bit set.
For now, the value of the variable-length integers is limited
to 63 bits, which limits the encoded size of the integer to
nine bytes. These limits may be increased in future if needed.
- Note that the encoding is not as optimal as it could be. For
- example, it is possible to encode the number 42 using any
- number of bytes between one and nine. This is convenient
- for non-streamed encoders, that write Compressed Size or
- Uncompressed Size fields to the Block Header (see Section 3.1)
- after the Compressed Data field is written to the disk.
-
- In several situations, the decoder needs to compare that two
- fields contain identical information. When comparing fields
- using the encoding described in this Section, the decoder must
- consider two fields identical if their decoded values are
- identical; it does not matter if the encoded variable-length
- representations differ.
-
- The following C code illustrates encoding and decoding 63-bit
- variables; the highest bit of uint64_t must be unset. The
- functions return the number of bytes occupied by the integer
- (1-9), or zero on error.
+ The following C code illustrates encoding and decoding of
+ variable-length integers. The functions return the number of
+ bytes occupied by the integer (1-9), or zero on error.
#include <sys/types.h>
#include <inttypes.h>
@@ -210,20 +166,18 @@ The .lzma File Format
size_t
encode(uint8_t buf[static 9], uint64_t num)
{
- if (num >= (UINT64_C(1) << (9 * 7)))
+ if (num > UINT64_MAX / 2)
return 0;
- if (num <= 0x7F) {
- buf[0] = num;
- return 1;
- }
- buf[0] = (num & 0x7F) | 0x80;
- num >>= 7;
- size_t i = 1;
+
+ size_t i = 0;
+
while (num >= 0x80) {
- buf[i++] = num & 0x7F;
+ buf[i++] = (uint8_t)(num) | 0x80;
num >>= 7;
}
- buf[i++] = num | 0x80;
+
+ buf[i++] = (uint8_t)(num);
+
return i;
}
@@ -232,46 +186,29 @@ The .lzma File Format
{
if (size_max == 0)
return 0;
+
if (size_max > 9)
size_max = 9;
+
*num = buf[0] & 0x7F;
- if (!(buf[0] & 0x80))
- return 1;
- size_t i = 1;
- do {
- if (i == size_max)
- return 0;
- *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
- } while (!(buf[i++] & 0x80));
- return i;
- }
+ size_t i = 0;
- size_t
- decode_reverse(const uint8_t buf[], size_t size_max,
- uint64_t *num)
- {
- if (size_max == 0)
- return 0;
- const size_t end = size_max > 9 ? size_max - 9 : 0;
- size_t i = size_max - 1;
- *num = buf[i] & 0x7F;
- if (!(buf[i] & 0x80))
- return 1;
- do {
- if (i-- == end)
+ while (buf[i++] & 0x80) {
+ if (i >= size_max || buf[i] == 0x00)
return 0;
- *num <<= 7;
- *num |= buf[i] & 0x7F;
- } while (!(buf[i] & 0x80));
- return size_max - i;
+
+ *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
+ }
+
+ return i;
}
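
Because every encoded byte carries seven bits of payload, the
encoded size of a value can be computed without producing the
bytes. The following sketch illustrates this; the name vli_size
is only an illustration and is not defined by this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the number of bytes (1-9) needed to encode num as a
 * variable-length integer, or zero if num does not fit in 63 bits.
 * vli_size is a hypothetical helper name, not part of the format. */
size_t
vli_size(uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t size = 1;

    /* Every byte except the last one carries seven payload bits
     * and has its highest bit set. */
    while (num >= 0x80) {
        num >>= 7;
        ++size;
    }

    return size;
}
```

For example, values up to 127 need one byte, 128 needs two
bytes, and the largest 63-bit value needs nine bytes.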
-2. Stream
+2. Overall Structure of .lzma File
- +========+========+========+
- | Stream | Stream | Stream | ...
- +========+========+========+
+ +========+================+========+================+
+ | Stream | Stream Padding | Stream | Stream Padding | ...
+ +========+================+========+================+
A file usually contains only one Stream. However, it is
possible to concatenate multiple Streams together with no
@@ -280,53 +217,44 @@ The .lzma File Format
Stream once the end of the first Stream has been reached.
-2.1. Stream Types
+2.1. Stream
- There are two types of Streams: Single-Block Streams and
- Multi-Block Streams. Decoders conforming to this specification
- must support at least Single-Block Streams. Supporting
- Multi-Block Streams is optional. If the decoder supports only
- Single-Block Streams, the documentation of the decoder should
- mention this fact clearly.
+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
+ | Stream Header | Block | Block | ... | Block |
+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+
+ ---> | Index | Stream Footer |
+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+
-2.1.1. Single-Block Stream
+ All the above fields have a size that is a multiple of four bytes. If
+ Stream is used as an internal part of another file format, it
+ is recommended to make the Stream start at an offset that is
+ a multiple of four bytes.
- +===============+============+
- | Stream Header | Data Block |
- +===============+============+
+ Stream Header, Index, and Stream Footer are always present in
+ a Stream. The maximum size of the Index field is 16 GiB (2^34 bytes).
- As the name says, a Single-Block Stream has exactly one Block.
- The Block must be a Data Block; Metadata Blocks are not allowed
- in Single-Block Streams.
+ There are zero or more Blocks. The maximum number of Blocks is
+ limited only by the maximum size of the Index field.
+ Total size of a Stream must be less than 8 EiB (2^63 bytes).
+ The same limit applies to the total amount of uncompressed
+ data stored in a Stream.
-2.1.2. Multi-Block Stream
+ If an implementation supports handling .lzma files with
+ multiple concatenated Streams, it may apply the above limits
+ to the file as a whole instead of applying them to each Stream.
- +===============+=======================+
- | Stream Header | Header Metadata Block |
- +===============+=======================+
-
- +============+ +============+=======================+
- ---> | Data Block | ... | Data Block | Footer Metadata Block |
- +============+ +============+=======================+
-
- Notes:
- - Stream Header is mandatory.
- - Header Metadata Block is optional.
- - Each Multi-Block Stream has at least one Data Block. The
- maximum number of Data Blocks is not limited.
- - Footer Metadata Block is mandatory.
+2.1.1. Stream Header
-2.2. Stream Header
-
- +---+---+---+---+---+---+--------------+--+--+--+--+
+ +---+---+---+---+---+---+-------+------+--+--+--+--+
| Header Magic Bytes | Stream Flags | CRC32 |
- +---+---+---+---+---+---+--------------+--+--+--+--+
+ +---+---+---+---+---+---+-------+------+--+--+--+--+
-2.2.1. Header Magic Bytes
+2.1.1.1. Header Magic Bytes
The first six (6) bytes of the Stream are so called Header
Magic Bytes. They can be used to identify the file type.
@@ -341,33 +269,47 @@ The .lzma File Format
Notes:
- The first byte (0xFF) was chosen so that the files cannot
be erroneously detected as being in LZMA_Alone format, in
- which the first byte is in the the range [0x00, 0xE0].
+ which the first byte is in the range [0x00, 0xE0].
- The sixth byte (0x00) was chosen to prevent applications
from misdetecting the file as a text file.
+ If the Header Magic Bytes don't match, the decoder must
+ indicate an error.
+
+
+2.1.1.2. Stream Flags
+
+ The first byte of Stream Flags is always a nul byte. In the
+ future, this byte may be used to indicate a new Stream version
+ or other Stream properties.
+
+ The second byte of Stream Flags is a bit field:
-2.2.2. Stream Flags
-
- Bit(s) Mask Description
- 0-2 0x07 Type of Check (see Section 3.3.1):
- ID Size Check name
- 0x00 0 bytes None
- 0x01 4 bytes CRC32
- 0x02 4 bytes (Reserved)
- 0x03 8 bytes CRC64
- 0x04 16 bytes (Reserved)
- 0x05 32 bytes SHA-256
- 0x06 32 bytes (Reserved)
- 0x07 64 bytes (Reserved)
- 3 0x08 The CRC32 field is present in Block Headers.
- 4 0x10 If unset, this is a Single-Block Stream; if set,
- this is a Multi-Block Stream.
- 5-7 0xE0 Reserved for future use; must be zero for now.
+ Bit(s) Mask Description
+ 0-3 0x0F Type of Check (see Section 3.3):
+ ID Size Check name
+ 0x00 0 bytes None
+ 0x01 4 bytes CRC32
+ 0x02 4 bytes (Reserved)
+ 0x03 4 bytes (Reserved)
+ 0x04 8 bytes CRC64
+ 0x05 8 bytes (Reserved)
+ 0x06 8 bytes (Reserved)
+ 0x07 16 bytes (Reserved)
+ 0x08 16 bytes (Reserved)
+ 0x09 16 bytes (Reserved)
+ 0x0A 32 bytes SHA-256
+ 0x0B 32 bytes (Reserved)
+ 0x0C 32 bytes (Reserved)
+ 0x0D 64 bytes (Reserved)
+ 0x0E 64 bytes (Reserved)
+ 0x0F 64 bytes (Reserved)
+ 4-7 0xF0 Reserved for future use; must be zero for now.
Implementations must support at least the Check IDs 0x00 (None)
- and 0x01 (CRC32). Supporting other Check IDs is optional. If an
- unsupported Check is used, the decoder must indicate a warning
- or error.
+ and 0x01 (CRC32). Supporting other Check IDs is optional. If
+ an unsupported Check is used, the decoder should indicate a
+ warning or error.
If any reserved bit is set, the decoder must indicate an error.
It is possible that there is a new field present which the
@@ -375,256 +317,259 @@ The .lzma File Format
incorrectly.
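
The table of Check IDs above maps directly to a lookup table.
The following sketch shows one way a decoder could derive the
size of the Check field from the second byte of Stream Flags.
The function and array names are illustrations only.

```c
#include <stdint.h>

/* Size in bytes of the Check field for each Check ID (0x00-0x0F),
 * copied from the Stream Flags table in this section. */
static const int check_sizes[16] = {
    0, 4, 4, 4, 8, 8, 8, 16, 16, 16, 32, 32, 32, 64, 64, 64
};

/* Return the size of the Check field in bytes, or -1 if any of the
 * reserved bits (mask 0xF0) is set in the second byte of Stream
 * Flags. check_size_from_flags is a hypothetical name. */
int
check_size_from_flags(uint8_t flags_byte2)
{
    if (flags_byte2 & 0xF0)
        return -1;

    return check_sizes[flags_byte2 & 0x0F];
}
```

Since the table gives a size even for the reserved IDs, a
decoder that does not support a particular Check presumably
still knows how many bytes to skip.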
-2.2.3. CRC32
+2.1.1.3. CRC32
The CRC32 is calculated from the Stream Flags field. It is
stored as an unsigned 32-bit little endian integer. If the
calculated value does not match the stored one, the decoder
must indicate an error.
- Note that this field is always present; the bit in Stream Flags
- controls only presence of CRC32 in Block Headers.
+ The idea is that Stream Flags would always be two bytes, even
+ if new features are needed. This way old decoders will be able
+ to verify the CRC32 calculated from Stream Flags, and thus
+ distinguish between corrupt files (CRC32 doesn't match) and
+ files that the decoder doesn't support (CRC32 matches but
+ Stream Flags has reserved bits set).
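
The order of the checks matters for the distinction described
above: the CRC32 is verified first, and only then are the
reserved bits examined. The following sketch assumes that CRC32
uses the common reflected polynomial 0xEDB88320 (the section on
Cyclic Redundancy Checks is authoritative); all names are
illustrations only.

```c
#include <stddef.h>
#include <stdint.h>

enum flags_status { FLAGS_OK, FLAGS_CORRUPT, FLAGS_UNSUPPORTED };

/* Bitwise CRC32. Assumption: reflected polynomial 0xEDB88320,
 * initial value and final XOR of 0xFFFFFFFF. */
static uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return ~crc;
}

/* in points at Stream Flags (2 bytes) followed by the CRC32 field
 * (4 bytes, little endian). */
enum flags_status
check_stream_flags(const uint8_t in[6])
{
    const uint32_t stored = (uint32_t)in[2]
            | ((uint32_t)in[3] << 8)
            | ((uint32_t)in[4] << 16)
            | ((uint32_t)in[5] << 24);

    /* A CRC32 mismatch means the file is corrupt. */
    if (crc32_calc(in, 2) != stored)
        return FLAGS_CORRUPT;

    /* The CRC32 matches, so a non-nul first byte or a reserved bit
     * means a feature that this decoder does not support. */
    if (in[0] != 0x00 || (in[1] & 0xF0))
        return FLAGS_UNSUPPORTED;

    return FLAGS_OK;
}
```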
-3. Block
+2.1.2. Stream Footer
- +==============+=================+==============+
- | Block Header | Compressed Data | Block Footer |
- +==============+=================+==============+
+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+
+ | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+
- There are two types of Blocks:
- - Data Blocks hold the actual compressed data.
- - Metadata Blocks hold the Index, Extra, and a few other
- non-data fields (see Section 5).
- The type of the Block is indicated by the corresponding bit
- in the Block Flags field (see Section 3.1.1).
+2.1.2.1. CRC32
+ The CRC32 is calculated from the Backward Size and Stream Flags
+ fields. It is stored as an unsigned 32-bit little endian
+ integer. If the calculated value does not match the stored one,
+ the decoder must indicate an error.
-3.1. Block Header
+ The reason to have the CRC32 field before the Backward Size and
+ Stream Flags fields is to keep the four-byte fields aligned to
+ a multiple of four bytes.
- +------+------+=================+===================+
- | Block Flags | Compressed Size | Uncompressed Size |
- +------+------+=================+===================+
- +======================+--+--+--+--+================+
- ---> | List of Filter Flags | CRC32 | Header Padding |
- +======================+--+--+--+--+================+
+2.1.2.2. Backward Size
+ Backward Size is stored as a 32-bit little endian integer,
+ which indicates the size of the Index field as a multiple of
+ four bytes, the minimum value being four bytes:
-3.1.1. Block Flags
+ real_backward_size = (stored_backward_size + 1) * 4;
- The first byte of the Block Flags field is a bit field:
+ Using a fixed-size integer to store this value makes it
+ slightly simpler to parse the Stream Footer when the
+ application needs to parse the Stream backwards.
- Bit(s) Mask Description
- 0-2 0x07 Number of filters (0-7)
- 3 0x08 Use End of Payload Marker (even if
- Uncompressed Size is stored to Block Header).
- 4 0x10 The Compressed Size field is present.
- 5 0x20 The Uncompressed Size field is present.
- 6 0x40 Reserved for future use; must be zero for now.
- 7 0x80 This is a Metadata Block.
- The second byte of the Block Flags field is also a bit field:
+2.1.2.3. Stream Flags
- Bit(s) Mask Description
- 0-4 0x1F Size of the Header Padding field (0-31 bytes)
- 5-7 0xE0 Reserved for future use; must be zero for now.
+ This is a copy of the Stream Flags field from the Stream
+ Header. The information stored to Stream Flags is needed
+ when parsing the Stream backwards. The decoder must compare
+ the Stream Flags fields in both Stream Header and Stream
+ Footer, and indicate an error if they are not identical.
- The decoder must indicate an error if End of Payload Marker
- is not used and Uncompressed Size is not stored to the Block
- Header. Because of this, the first byte of Block Flags can
- never be a nul byte. This is useful when detecting beginning
- of the Block after Footer Padding (see Section 3.3.3).
- If any reserved bit is set, the decoder must indicate an error.
- It is possible that there is a new field present which the
- decoder is not aware of, and can thus parse the Block Header
- incorrectly.
+2.1.2.4. Footer Magic Bytes
+ As the last step of the decoding process, the decoder must
+ verify the existence of Footer Magic Bytes. If they don't
+ match, an error must be indicated.
-3.1.2. Compressed Size
+ Using a C array and ASCII:
+ const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
- This field is present only if the appropriate bit is set in
- the Block Flags field (see Section 3.1.1).
+ In hexadecimal:
+ 59 5A
- This field contains the size of the Compressed Data field.
- The size is stored using the encoding described in Section 1.2.
- If the Compressed Size does not match the real size of the
- Compressed Data field, the decoder must indicate an error.
+ The primary reason to have Footer Magic Bytes is to make
+ it easier to detect incomplete files quickly, without
+ uncompressing. If the file does not end with Footer Magic Bytes
+ (excluding Stream Padding described in Section 2.2), it cannot
+ be undamaged, unless someone has intentionally appended garbage
+ after the end of the Stream.
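
The Stream Footer fields described above can be parsed from a
12-byte buffer as in the following sketch. It assumes that
CRC32 uses the common reflected polynomial 0xEDB88320; the
function names are illustrations only. Comparing the returned
Stream Flags against the Stream Header is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32. Assumption: reflected polynomial 0xEDB88320. */
static uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return ~crc;
}

static uint32_t
read32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse a Stream Footer: CRC32 (4 bytes), Backward Size (4 bytes),
 * Stream Flags (2 bytes), Footer Magic Bytes (2 bytes).
 * Returns 1 on success and 0 on error. */
int
parse_stream_footer(const uint8_t in[12], uint64_t *index_size,
        uint8_t flags[2])
{
    /* Footer Magic Bytes: 'Y', 'Z' */
    if (in[10] != 0x59 || in[11] != 0x5A)
        return 0;

    /* The CRC32 covers Backward Size and Stream Flags. */
    if (crc32_calc(in + 4, 6) != read32le(in))
        return 0;

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    *index_size = ((uint64_t)read32le(in + 4) + 1) * 4;

    flags[0] = in[8];
    flags[1] = in[9];
    return 1;
}
```

With the largest stored value, 0xFFFFFFFF, the result is 2^34
bytes, which matches the 16 GiB limit of the Index field.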
- Having the Compressed Size field in the Block Header can be
- useful for multithreaded decoding when seeking is not possible.
- If the Blocks are small enough, the decoder can read multiple
- Blocks into its internal buffer, and decode the Blocks in
- parallel.
- Compressed Size can also be useful when seeking forwards to
- a specific location in streamed mode: the decoder can quickly
- skip over irrelevant Blocks, without decoding them.
+2.2. Stream Padding
+ Only the decoders that support decoding of concatenated Streams
+ must support Stream Padding.
-3.1.3. Uncompressed Size
+ Stream Padding must contain only nul bytes. Any non-nul byte
+ should be considered as the beginning of a new Stream. To
+ preserve the four-byte alignment of consecutive Streams, the
+ size of Stream Padding must be a multiple of four bytes. Empty
+ Stream Padding is allowed.
- This field is present only if the appropriate bit is set in
- the Block Flags field (see Section 3.1.1).
+ Note that non-empty Stream Padding is allowed at the end of the
+ file; there doesn't need to be a new Stream after non-empty
+ Stream Padding. This can be convenient in certain situations
+ [GNU-tar].
- The Uncompressed Size field contains the size of the Block
- after uncompressing.
+ The possibility of Padding should be taken into account when
+ designing an application that parses the Stream backwards.
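
The Stream Padding rules can be sketched in both directions as
follows. The backward variant relies on the Stream Footer
ending with the non-nul Footer Magic Bytes, so all trailing nul
bytes belong to Stream Padding. The function names are
illustrations only.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward direction: count the nul bytes at the beginning of buf,
 * which starts right after the end of a Stream. Returns 1 and
 * stores the padding size if it is a multiple of four bytes,
 * otherwise returns 0. */
int
stream_padding_size(const uint8_t *buf, size_t size, size_t *padding)
{
    size_t i = 0;

    /* Any non-nul byte is taken as the beginning of a new Stream. */
    while (i < size && buf[i] == 0x00)
        ++i;

    if (i % 4 != 0)
        return 0;

    *padding = i;
    return 1;
}

/* Backward direction: find the offset just past the last Stream
 * Footer by skipping the nul bytes at the end of the file. */
int
strip_trailing_padding(const uint8_t *buf, size_t size, size_t *end)
{
    size_t i = size;

    while (i > 0 && buf[i - 1] == 0x00)
        --i;

    if ((size - i) % 4 != 0)
        return 0;

    *end = i;
    return 1;
}
```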
- Storing Uncompressed Size serves several purposes:
- - The decoder will know when all of the data has been
- decoded without an explicit End of Payload Marker.
- - The decoder knows how much memory it needs to allocate
- for a temporary buffer in multithreaded mode.
- - Simple error detection: wrong size indicates a broken file.
- - Sometimes it is useful to know the file size without
- uncompressing the file.
- It should be noted that the only reliable way to find out what
- the real uncompressed size is is to uncompress the Block,
- because the Block Header and Metadata Block fields may contain
- (intentionally or unintentionally) invalid information.
+3. Block
- Uncompressed Size is stored using the encoding described in
- Section 1.2. If the Uncompressed Size does not match the
- real uncompressed size, the decoder must indicate an error.
+ +==============+=================+=======+
+ | Block Header | Compressed Data | Check |
+ +==============+=================+=======+
-3.1.4. List of Filter Flags
+3.1. Block Header
- +================+================+ +================+
- | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
- +================+================+ +================+
+ +-------------------+-------------+=================+
+ | Block Header Size | Block Flags | Compressed Size |
+ +-------------------+-------------+=================+
- The number of Filter Flags fields is stored in the Block Flags
- field (see Section 3.1.1). As a special case, if the number of
- Filter Flags fields is zero, it is equivalent to having the
- Copy filter as the only filter.
+ +===================+======================+
+ ---> | Uncompressed Size | List of Filter Flags |
+ +===================+======================+
- The format of each Filter Flags field is as follows:
+ +================+--+--+--+--+
+ ---> | Header Padding | CRC32 |
+ +================+--+--+--+--+
- +------+=============+=============================+
- | Misc | External ID | External Size of Properties |
- +------+=============+=============================+
- +===================+
- ---> | Filter Properties |
- +===================+
+3.1.1. Block Header Size
- The list of officially defined Filter IDs and the formats of
- their Filter Properties are described in Section 4.3.
+ This field overlaps with the Index Indicator field (see
+ Section 4.1).
+ This field contains the size of the Block Header field,
+ including the Block Header Size field itself. Valid values are
+ in the range [0x01, 0xFF], which indicate the size of the Block
+ Header as multiples of four bytes, minimum size being eight
+ bytes:
-3.1.4.1. Misc
+ real_header_size = (encoded_header_size + 1) * 4;
- To save space, the most commonly used Filter IDs and the
- Size of Filter Properties are encoded in a single byte.
- Depending on the contents of the Misc field, Filter ID is
- the value of the Misc or External ID field.
+ If a bigger Block Header is needed in the future, a new field can be
+ added between the current Block Header and Compressed Data
+ fields. The presence of this new field would be indicated in
+ the Block Header.
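
Since this field overlaps with Index Indicator, whose value is
always 0x00, a decoder can use the first byte to tell a Block
from the Index. The following sketch illustrates the size
calculation; the function name is an illustration only.

```c
#include <stdint.h>

/* Return the real size of the Block Header in bytes (8-1024), or
 * zero if the byte is 0x00, which is the Index Indicator and thus
 * marks the beginning of the Index instead of a Block. */
unsigned
block_header_size(uint8_t encoded_header_size)
{
    if (encoded_header_size == 0x00)
        return 0;

    /* real_header_size = (encoded_header_size + 1) * 4 */
    return ((unsigned)encoded_header_size + 1) * 4;
}
```

The smallest valid value, 0x01, gives eight bytes and the
largest, 0xFF, gives 1024 bytes.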
- Value Filter ID Size of Filter Properties
- 0x00 - 0x1F Misc 0 bytes
- 0x20 - 0x3F Misc 1 byte
- 0x40 - 0x5F Misc 2 bytes
- 0x60 - 0x7F Misc 3 bytes
- 0x80 - 0x9F Misc 4 bytes
- 0xA0 - 0xBF Misc 5 bytes
- 0xC0 - 0xDF Misc 6 bytes
- 0xE0 - 0xFE External ID 0-30 bytes
- 0xFF External ID External Size of Properties
- The following code demonstrates parsing the Misc field and,
- when needed, the External ID and External Size of Properties
- fields.
+3.1.2. Block Flags
- uint64_t id;
- uint64_t properties_size;
- uint8_t misc = read_byte();
+ The first byte of the Block Flags field is a bit field:
- if (misc >= 0xE0) {
- id = read_variable_length_integer();
+ Bit(s) Mask Description
+ 0-1 0x03 Number of filters (1-4)
+ 2-5 0x3C Reserved for future use; must be zero for now.
+ 6 0x40 The Compressed Size field is present.
+ 7 0x80 The Uncompressed Size field is present.
- if (misc == 0xFF)
- properties_size = read_variable_length_integer();
- else
- properties_size = misc - 0xE0;
+ If any reserved bit is set, the decoder must indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Block Header
+ incorrectly.
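
The bit field above can be decoded as in the following sketch.
It assumes that the two lowest bits store the number of filters
minus one, since two bits have to cover the range 1-4; the
struct and function names are illustrations only.

```c
#include <stdbool.h>
#include <stdint.h>

struct block_flags {
    unsigned filter_count;          /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* Decode the Block Flags byte. Returns false if any reserved bit
 * (mask 0x3C) is set, in which case the decoder must indicate
 * an error. */
bool
parse_block_flags(uint8_t byte, struct block_flags *out)
{
    if (byte & 0x3C)
        return false;

    /* Assumption: the stored value 0-3 means 1-4 filters. */
    out->filter_count = (byte & 0x03) + 1u;
    out->has_compressed_size = (byte & 0x40) != 0;
    out->has_uncompressed_size = (byte & 0x80) != 0;

    return true;
}
```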
- } else {
- id = misc;
- properties_size = misc / 0x20;
- }
+3.1.3. Compressed Size
-3.1.4.2. External ID
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.2).
- This field is present only if the Misc field contains a value
- that indicates usage of External ID. The External ID is stored
- using the encoding described in Section 1.2.
+ This field contains the size of the Compressed Data field as
+ a multiple of four bytes, the minimum value being four bytes:
+ real_compressed_size = (stored_compressed_size + 1) * 4;
-3.1.4.3. External Size of Properties
+ The size is stored using the encoding described in Section 1.2.
+ If the Compressed Size does not match the real size of the
+ Compressed Data field, the decoder must indicate an error.
- This field is present only if the Misc field contains a value
- that indicates usage of External Size of Properties. The size
- of Filter Properties is stored using the encoding described in
- Section 1.2.
+3.1.4. Uncompressed Size
-3.1.4.4. Filter Properties
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.2).
- Size of this field depends on the Misc field (Section 3.1.4.1)
- and, if present, External Size of Properties field (Section
- 3.1.4.3). The format of this field is depends on the selected
- filter; see Section 4.3 for details.
+ The Uncompressed Size field contains the size of the Block
+ after uncompressing. Uncompressed Size is stored using the
+ encoding described in Section 1.2. If the Uncompressed Size
+ does not match the real uncompressed size, the decoder must
+ indicate an error.
+ Storing the Compressed Size and Uncompressed Size fields serves
+ several purposes:
+ - The decoder knows how much memory it needs to allocate
+ for a temporary buffer in multithreaded mode.
+ - Simple error detection: wrong size indicates a broken file.
+ - Seeking forwards to a specific location in streamed mode.
-3.1.5. CRC32
+ It should be noted that the only reliable way to determine
+ the real uncompressed size is to uncompress the Block,
+ because the Block Header and Index fields may contain
+ (intentionally or unintentionally) invalid information.
- This field is present only if the appropriate bit is set in
- the Stream Flags field (see Section 2.2.2).
- The CRC32 is calculated over everything in the Block Header
- field except the Header Padding field and the CRC32 field
- itself. It is stored as an unsigned 32-bit little endian
- integer. If the calculated value does not match the stored
- one, the decoder must indicate an error.
+3.1.5. List of Filter Flags
+
+ +================+================+ +================+
+ | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
+ +================+================+ +================+
+
+ The number of Filter Flags fields is stored in the Block Flags
+ field (see Section 3.1.2).
+
+ The format of each Filter Flags field is as follows:
+
+ +===========+====================+===================+
+ | Filter ID | Size of Properties | Filter Properties |
+ +===========+====================+===================+
+
+ Both Filter ID and Size of Properties are stored using the
+ encoding described in Section 1.2. Size of Properties indicates
+ the size of the Filter Properties field as bytes. The list of
+ officially defined Filter IDs and the formats of their Filter
+ Properties are described in Section 5.3.
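
A single Filter Flags field can be parsed as in the following
sketch. It carries its own copy of a variable-length integer
decoder in the style of Section 1.2 so that it is
self-contained; the struct and function names, and the Filter
ID values used with it, are illustrations only.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a variable-length integer (see Section 1.2). Returns the
 * number of bytes used (1-9), or zero on error. */
static size_t
vli_decode(const uint8_t *buf, size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;

    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;

    while (buf[i++] & 0x80) {
        if (i >= size_max || buf[i] == 0x00)
            return 0;

        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }

    return i;
}

struct filter_flags {
    uint64_t id;            /* Filter ID */
    uint64_t props_size;    /* Size of Properties in bytes */
    const uint8_t *props;   /* Points into the input buffer */
};

/* Parse one Filter Flags field from buf. Returns the number of
 * bytes consumed, or zero if the field is invalid or truncated. */
size_t
parse_filter_flags(const uint8_t *buf, size_t size,
        struct filter_flags *out)
{
    size_t pos = vli_decode(buf, size, &out->id);
    if (pos == 0)
        return 0;

    const size_t n = vli_decode(buf + pos, size - pos,
            &out->props_size);
    if (n == 0)
        return 0;

    pos += n;

    /* Filter Properties must fit in the remaining input. */
    if (out->props_size > size - pos)
        return 0;

    out->props = buf + pos;
    return pos + (size_t)out->props_size;
}
```

A decoder would call this once for each filter indicated in
Block Flags, advancing the buffer by the returned amount.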
3.1.6. Header Padding
- This field contains as many nul bytes as indicated by the value
- stored in the Header Flags field. If the Header Padding field
- contains any non-nul bytes, the decoder must indicate an error.
+ This field contains as many nul bytes as are needed to make
+ the Block Header have the size specified in Block Header Size.
+ If any of the bytes are not nul bytes, the decoder must
+ indicate an error. It is possible that there is a new field
+ present which the decoder is not aware of, and can thus parse
+ the Block Header incorrectly.
- The intent of the Header Padding field is to allow alignment
- of Compressed Data. The usefulness of alignment is described
- in Section 4.3.
+3.1.7. CRC32
-3.2. Compressed Data
+ The CRC32 is calculated over everything in the Block Header
+ field except the CRC32 field itself. It is stored as an
+ unsigned 32-bit little endian integer. If the calculated
+ value does not match the stored one, the decoder must indicate
+ an error.
- The format of Compressed Data depends on Block Flags and List
- of Filter Flags. Excluding the descriptions of the simplest
- filters in Section 4, the format of the filter-specific encoded
- data is out of scope of this document.
+ Verifying the CRC32 of the Block Header before parsing the
+ actual contents allows the decoder to distinguish between
+ corrupt and unsupported files.
- Note a special case: if End of Payload Marker (see Section
- 3.1.1) is not used and Uncompressed Size is zero, the size
- of the Compressed Data field is always zero.
+3.2. Compressed Data
-3.3. Block Footer
+ The format of Compressed Data depends on Block Flags and List
+ of Filter Flags. Excluding the descriptions of the simplest
+ filters in Section 5.3, the format of the filter-specific
+ encoded data is out of scope of this document.
- +=======+===============+================+
- | Check | Stream Footer | Footer Padding |
- +=======+===============+================+
+ If the natural size of Compressed Data is not a multiple of
+ four bytes, it must be padded with 1-3 nul bytes to make it
+ a multiple of four bytes.
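
The padding rule and the Compressed Size encoding from
Section 3.1.3 fit together as in the following sketch. The
function names are illustrations only.

```c
#include <stdint.h>

/* Round the natural size of Compressed Data up to the next
 * multiple of four bytes; the difference (0-3 bytes) is filled
 * with nul bytes. */
uint64_t
padded_size(uint64_t natural_size)
{
    return (natural_size + 3) & ~UINT64_C(3);
}

/* real_compressed_size = (stored_compressed_size + 1) * 4 */
uint64_t
real_compressed_size(uint64_t stored_compressed_size)
{
    return (stored_compressed_size + 1) * 4;
}

/* Inverse of the above. The argument must be a nonzero multiple
 * of four, for example a value returned by padded_size() for
 * a nonzero natural size. */
uint64_t
stored_compressed_size(uint64_t real_size)
{
    return real_size / 4 - 1;
}
```

For example, five bytes of compressed data are padded to eight
bytes, which is stored in the Compressed Size field as 1.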
-3.3.1. Check
+3.3. Check
The type and size of the Check field depends on which bits
- are set in the Stream Flags field (see Section 2.2.2).
+ are set in the Stream Flags field (see Section 2.1.1.2).
The Check, when used, is calculated from the original
uncompressed data. If the calculated Check does not match the
@@ -633,101 +578,106 @@ The .lzma File Format
a warning or error.
-3.3.2. Stream Footer
+4. Index
- +===================+===============+--------------+
- | Uncompressed Size | Backward Size | Stream Flags |
- +===================+===============+--------------+
+ +-----------------+===================+
+ | Index Indicator | Number of Records |
+ +-----------------+===================+
- +----------+---------+
- ---> | Footer Magic Bytes |
- +----------+---------+
+ +=================+===============+-+-+-+-+
+ ---> | List of Records | Index Padding | CRC32 |
+ +=================+===============+-+-+-+-+
- Stream Footer is present only in
- - Data Block of a Single-Block Stream; and
- - Footer Metadata Block of a Multi-Block Stream.
+ Index serves several purposes. Using it, one can
+ - verify that all Blocks in a Stream have been processed;
+ - find out the uncompressed size of a Stream; and
+ - quickly access the beginning of any Block (random access).
- The Stream Footer field is placed inside Block Footer, because
- no padding is allowed between Check and Stream Footer.
+4.1. Index Indicator
-3.3.2.1. Uncompressed Size
+ This field overlaps with the Block Header Size field (see
+ Section 3.1.1). The value of Index Indicator is always 0x00.
- This field is present only in the Data Block of a Single-Block
- Stream if Uncompressed Size is not stored to the Block Header
- (see Section 3.1.1). Without the Uncompressed Size field in
- Stream Footer it would not be possible to quickly find out
- the Uncompressed Size of the Stream in all cases.
- Uncompressed Size is stored using the encoding described in
- Section 1.2. If the stored value does not match the real
- uncompressed size of the Single-Block Stream, the decoder must
- indicate an error.
+4.2. Number of Records
+ This field indicates how many Records there are in the List
+ of Records field, and thus how many Blocks there are in the
+ Stream. The value is stored using the encoding described in
+ Section 1.2. If the decoder has decoded all the Blocks of the
+ Stream, and then notices that the Number of Records doesn't
+ match the real number of Blocks, the decoder must indicate an
+ error.
-3.3.2.2. Backward Size
- This field contains the total size of the Block Header,
- Compressed Data, Check, and Uncompressed Size fields. The
- value is stored using the encoding described in Section 1.2.
- If the Backward Size does not match the real total size of
- the appropriate fields, the decoder must indicate an error.
+4.3. List of Records
- Implementations reading the Stream backwards should notice
- that the value in this field can never be zero.
+ List of Records consists of as many Records as indicated by the
+ Number of Records field:
+ +========+========+
+ | Record | Record | ...
+ +========+========+
-3.3.2.3. Stream Flags
+ Each Record contains two fields:
- This is a copy of the Stream Flags field from the Stream
- Header. The information stored to Stream Flags is needed
- when parsing the Stream backwards.
+ +============+===================+
+ | Total Size | Uncompressed Size |
+ +============+===================+
+ If the decoder has decoded all the Blocks of the Stream, it
+ must verify that the contents of the Records match the real
+ Total Size and Uncompressed Size of the respective Blocks.
-3.3.2.4. Footer Magic Bytes
+ Implementation hint: It is possible to verify the Index with
+ constant memory usage by calculating, for example, SHA256 of
+ both the real size values and the List of Records, then
+ comparing the check values. Implementing this using a
+ non-cryptographic check like CRC32 should be avoided unless
+ small code size is important.
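The shape of such a constant-memory verification might look like the sketch below. Note the substitution: to stay short and self-contained, it uses 64-bit FNV-1a as a placeholder where the hint suggests SHA256, so it shows only the structure and is not a recommended check. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Running summary of a sequence of (Total Size, Uncompressed Size)
 * pairs. Memory usage does not depend on the number of Records. */
struct size_digest {
	uint64_t count; /* Number of pairs fed so far */
	uint64_t hash;  /* FNV-1a state; placeholder for SHA256 */
};

static void size_digest_init(struct size_digest *d)
{
	d->count = 0;
	d->hash = UINT64_C(0xCBF29CE484222325); /* FNV-1a offset basis */
}

/* Feed one 64-bit value, least significant byte first. */
static void size_digest_feed(struct size_digest *d, uint64_t value)
{
	for (int i = 0; i < 8; ++i) {
		d->hash ^= (uint8_t)(value >> (8 * i));
		d->hash *= UINT64_C(0x100000001B3); /* FNV-1a prime */
	}
}

static void size_digest_add(struct size_digest *d,
		uint64_t total_size, uint64_t uncompressed_size)
{
	size_digest_feed(d, total_size);
	size_digest_feed(d, uncompressed_size);
	++d->count;
}

static bool size_digest_equal(const struct size_digest *a,
		const struct size_digest *b)
{
	return a->count == b->count && a->hash == b->hash;
}
```

The decoder would keep two digests: one updated with the real sizes after each decoded Block, the other updated with each Record as the Index is parsed. A mismatch at the end of the Index means the Index does not describe the Blocks that were decoded.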
- As the last step of the decoding process, the decoder must
- verify the existence of Footer Magic Bytes. If they are not
- found, an error must be indicated.
+ If the decoder supports random-access reading, it must verify
+ that Total Size and Uncompressed Size of every completely
+ decoded Block match the sizes stored in the Index. If only
+ part of a Block is decoded, the decoder must verify that the
+ processed sizes don't exceed the sizes stored in the Index.
- Using a C array and ASCII:
- const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
- In hexadecimal:
- 59 5A
+4.3.1. Total Size
- The primary reason to have Footer Magic Bytes is to make
- it easier to detect incomplete files quickly, without
- uncompressing. If the file does not end with Footer Magic Bytes
- (excluding Footer Padding described in Section 3.3.3), it
- cannot be undamaged, unless someone has intentionally appended
- garbage after the end of the Stream. (Appending garbage at the
- end of the file does not prevent uncompressing the file, but
- may give a warning or error depending on the decoder
- implementation.)
+ This field indicates the encoded size of the respective Block
+ as multiples of four bytes, minimum value being four bytes:
+ real_total_size = (stored_total_size + 1) * 4;
-3.3.3. Footer Padding
+ The value is stored using the encoding described in Section
+ 1.2.
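The conversion between the stored and the real value can be sketched as below. The function names are hypothetical, and the overflow guard is an assumption added for safety; this document does not state an upper limit here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stored value -> real size in bytes: (stored + 1) * 4.
 * Returns false if the result would not fit in 64 bits. */
static bool total_size_decode(uint64_t stored_total_size,
		uint64_t *real_total_size)
{
	if (stored_total_size > UINT64_MAX / 4 - 1)
		return false;

	*real_total_size = (stored_total_size + 1) * 4;
	return true;
}

/* Real size in bytes -> stored value. Returns false if the size is
 * zero or not a multiple of four, since neither can be represented. */
static bool total_size_encode(uint64_t real_total_size,
		uint64_t *stored_total_size)
{
	if (real_total_size == 0 || real_total_size % 4 != 0)
		return false;

	*stored_total_size = real_total_size / 4 - 1;
	return true;
}
```

For example, a stored value of 0 means a Block of four bytes, and a stored value of 3 means a Block of sixteen bytes.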
- In certain situations it is convenient to be able to pad
- Blocks or Streams to be multiples of, for example, 512 bytes.
- Footer Padding makes this possible. Note that this is in no
- way required to enforce alignment in the way described in
- Section 4.3; the Header Padding field is enough for that.
- When Footer Padding is used, it must contain only nul bytes.
- Any non-nul byte should be considered as the beginning of
- a new Block or Stream.
+4.3.2. Uncompressed Size
- The possibility of Padding should be taken into account when
- designing an application that wants to find out information
- about a Stream by parsing Footer Metadata Block.
+ This field indicates the Uncompressed Size of the respective
+ Block in bytes. The value is stored using the encoding
+ described in Section 1.2.
- Support for Padding was inspired by a related note in
- [GNU-tar].
+4.4. Index Padding
-4. Filters
+ This field must contain 0-3 nul bytes to pad the Index to
+ a multiple of four bytes.
+
+
+4.5. CRC32
+
+ The CRC32 is calculated over everything in the Index field
+ except the CRC32 field itself. The CRC32 is stored as an
+ unsigned 32-bit little endian integer. If the calculated
+ value does not match the stored one, the decoder must indicate
+ an error.
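A minimal sketch of the CRC32 check follows. It assumes the common CRC-32 variant (the IEEE 802.3 polynomial in reflected form, 0xEDB88320), which this section does not spell out; the function names are hypothetical. The bitwise loop favors clarity over speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32. Assumption: IEEE 802.3 polynomial, reflected. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t size)
{
	uint32_t crc = 0xFFFFFFFFu;
	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; ++bit)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Store a 32-bit value as little endian, independent of host order. */
static void store_le32(uint8_t *out, uint32_t value)
{
	out[0] = (uint8_t)(value);
	out[1] = (uint8_t)(value >> 8);
	out[2] = (uint8_t)(value >> 16);
	out[3] = (uint8_t)(value >> 24);
}

static uint32_t load_le32(const uint8_t *in)
{
	return (uint32_t)in[0] | ((uint32_t)in[1] << 8)
			| ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

/* index points to a whole Index field of index_size bytes, with the
 * CRC32 field as its last four bytes. */
static bool index_crc32_matches(const uint8_t *index, size_t index_size)
{
	if (index_size < 4)
		return false;

	return crc32_ieee(index, index_size - 4)
			== load_le32(index + index_size - 4);
}
```

Reading and writing the stored value byte by byte avoids any dependence on the byte order of the host.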
+
+
+5. Filter Chains
The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
@@ -737,116 +687,11 @@ The .lzma File Format
v Uncompressed Data ^
| Filter 0 |
Encoder | Filter 1 | Decoder
- | ... |
| Filter n |
v Compressed Data ^
- The filters are independent from each other, except that they
- must cooperate a little to make it possible, in all cases, to
- detect when all of the data has been decoded. In addition, the
- filters should cooperate in the encoder to keep the alignment
- optimal.
-
-
-4.1. Detecting when All Data Has Been Decoded
-
- There must be a way for the decoder to detect when all of the
- Compressed Data has been decoded. This is simple when only
- one filter is used, but a bit more complex when multiple
- filters are chained.
- This file format supports three methods to detect when all of
- the data has been decoded:
- - Uncompressed size
- - End of Input
- - End of Payload Marker
-
- In both encoder and decoder, filters are initialized starting
- from the first filter in the chain. For each filter, one of
- these three methods is used.
-
-
-4.1.1. With Uncompressed Size
-
- This method is the only method supported by all filters.
- It must be used when uncompressed size is known by the
- filter-specific encoder or decoder. In practice this means
- that Uncompressed Size has been stored to the Block Header.
-
- In case of the first filter in the chain, the uncompressed size
- given to the filter-specific encoder or decoder equals the
- Uncompressed Size stored in the Block Header. For the rest of
- the filters in the chain, uncompressed size is the size of the
- output data of the previous filter in the chain.
-
- Note that when Use End of Payload Marker bit is set in Block
- Flags, Uncompressed Size is considered to be unknown even if
- it was present in the Block Header. Thus, if End of Payload
- Marker is used, uncompressed size of all of the filters in
- the chain is unknown, and can never be used to detect when
- all of the data has been decoded.
-
- Once the correct number of bytes has been written out, the
- filter-specific decoder indicates to its caller that all of
- the data has been decoded. If the filter-specific decoder
- detects End of Input or End of Payload Marker before the
- correct number of bytes is decoded, the decoder must indicate
- an error.
-
-
-4.1.2. With End of Input
-
- Most filters will know that all of the data has been decoded
- when the End of Input data has been reached. Once the filter
- knows that it has received the input data in its entirety,
- it finishes its job, and indicates to its caller that all of
- the data has been decoded. The filter-specific decoder must
- indicate an error if it detects End of Payload Marker.
-
- Note that this method can work only when the filter is not
- the last filter in the chain, because only another filter
- can indicate the End of Input data. In practice this means,
- that a filter later in the chain must support embedding
- End of Payload Marker.
-
- When a filter that cannot embed End of Payload Marker is the
- last filter in the chain, Subblock filter is appended to the
- chain as an implicit filter. In the simplest case, this occurs
- when no filters are specified, and the End of Payload Marker
- bit is set in Block Flags.
-
-
-4.1.3. With End of Payload Marker
-
- End of Payload Marker is a filter-specific bit sequence that
- indicates the end of data. It is supported by only a few
- filters. It is used when uncompressed size is unknown, and
- the filter
- - doesn't support End of Input; or
- - is the last filter in the chain.
-
- End of Payload Marker is embedded at the end of the encoded
- data by the filter-specific encoder. When the filter-specific
- decoder detects the embedded End of Payload Marker, the decoder
- knows that all of the data has been decoded. Then it finishes
- its job, and indicates to its caller that all of the data has
- been decoded. If the filter-specific decoder detects End of
- Input before End of Payload Marker, the decoder must indicate
- an error.
-
- If the filter supports both End of Input and End of Payload
- Marker, the former is used, unless the filter is the last
- filter in the chain.
-
-
-4.2. Alignment
-
- Some filters give better compression ratio or are faster
- when the input or output data is aligned. For optimal results,
- the encoder should try to enforce proper alignment when
- possible. Not enforcing alignment in the encoder is not
- an error. Thus, the decoder must be able to handle files with
- suboptimal alignment.
+5.1. Alignment
Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
@@ -866,8 +711,9 @@ The .lzma File Format
four-byte-aligned input data.
The output of the last filter in the chain is stored to the
- Compressed Data field. Aligning Compressed Data appropriately
- can increase
+ Compressed Data field, which is guaranteed to be aligned
+ to a multiple of four bytes relative to the beginning of the
+ Stream. This can increase
- speed, if the filtered data is handled multiple bytes at
a time by the filter-specific encoder and decoder,
because accessing aligned data in computer memory is
@@ -875,308 +721,67 @@ The .lzma File Format
- compression ratio, if the output data is later compressed
with an external compression tool.
- Compressed Data in a Stream can be aligned by using the Header
- Padding field in the Block Header.
-
-
-4.3. Filters
-
-4.3.1. Copy
-
- This is a dummy filter that simply copies all data from input
- to output unmodified.
-
- Filter ID: 0x00
- Size of Filter Properties: 0 bytes
- Changes size of data: No
-
- Detecting when all of the data has been decoded:
- Uncompressed size: Yes
- End of Payload Marker: No
- End of Input: Yes
-
- Preferred alignment:
- Input data: 1 byte
- Output data: 1 byte
-
-
-4.3.2. Subblock
-
- The Subblock filter can be used to
- - embed End of Payload Marker when the otherwise last
- filter in the chain does not support embedding it; and
- - apply additional filters in the middle of a Block.
- Filter ID: 0x01
- Size of Filter Properties: 0 bytes
- Changes size of data: Yes, unpredictably
+5.2. Security
- Detecting when all of the data has been decoded:
- Uncompressed size: Yes
- End of Payload Marker: Yes
- End of Input: Yes
+ If filters were allowed to be chained freely, it would be
+ possible to create malicious files that would be very slow to
+ decode. Such files could be used to mount denial-of-service
+ attacks.
- Preferred alignment:
- Input data: 1 byte
- Output data: Freely adjustable
+ Slow files could occur when multiple filters are chained:
+ v Compressed input data
+ | Filter 1 decoder (last filter)
+ | Filter 0 decoder (non-last filter)
+ v Uncompressed output data
-4.3.2.1. Format of the Encoded Output
+ The decoder of the last filter in the chain produces a lot of
+ output from little input. Another filter in the chain takes the
+ output of the last filter, and produces very little output
+ while consuming a lot of input. As a result, a lot of data is
+ moved inside the filter chain, but the filter chain as a whole
+ gets very little work done.
- The encoded data from the Subblock filter consist of zero or
- more Subblocks:
+ To prevent such slow files, there are restrictions on
+ how the filters can be chained. These restrictions must be
+ taken into account when designing new filters.
- +==========+==========+
- | Subblock | Subblock | ...
- +==========+==========+
+ The maximum number of filters in the chain is limited to
+ four; thus there can be at most three non-last filters.
+ Of these three non-last filters, only two are allowed to change
+ the size of the data.
- Each Subblock contains two fields:
+ Non-last filters that change the size of the data must have
+ a limit on how much the decoder can compress the data: the
+ decoder should produce at least n bytes of output when the
+ filter is given 2n bytes of input. This limit is not
+ absolute, but significant deviations must be avoided.
- +----------------+===============+
- | Subblock Flags | Subblock Data |
- +----------------+===============+
+ The above limitations guarantee that if the last filter in the
+ chain produces 4n bytes of output, the chain as a whole will
+ produce at least n bytes of output: each of the at most two
+ size-changing non-last filters can at most halve the data.
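The chaining restrictions could be checked along the following lines. This is a sketch under stated assumptions: the struct, the names, and the rejection of an empty chain are hypothetical, and the three flags mirror the "Changes size of data" and "Allow as ..." properties listed for each filter in Section 5.3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-filter properties (hypothetical representation). */
struct filter_info {
	bool changes_size;
	bool allow_non_last;
	bool allow_last;
};

#define FILTERS_MAX 4
#define SIZE_CHANGING_NON_LAST_MAX 2

/* chain[0] is Filter 0; chain[count - 1] is the last filter. */
static bool chain_is_allowed(const struct filter_info *chain, size_t count)
{
	/* Assumption: an empty chain is rejected. */
	if (count == 0 || count > FILTERS_MAX)
		return false;

	size_t size_changing = 0;
	for (size_t i = 0; i + 1 < count; ++i) {
		if (!chain[i].allow_non_last)
			return false;
		if (chain[i].changes_size)
			++size_changing;
	}

	if (size_changing > SIZE_CHANGING_NON_LAST_MAX)
		return false;

	return chain[count - 1].allow_last;
}

/* Lower bound for the output of the whole chain when each
 * size-changing non-last decoder at most halves its input. */
static size_t min_chain_output(size_t last_filter_output,
		size_t size_changing_non_last)
{
	return last_filter_output >> size_changing_non_last;
}
```

With two size-changing non-last filters, 4000 bytes from the last filter give at least 1000 bytes from the chain, matching the 4n-to-n bound above.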
- Subblock Flags is a bitfield:
- Bits Mask Description
- 0-3 0x0F The interpretation of these bits depend on
- the Subblock Type:
- - 0x20 Bits 0-3 for Size
- - 0x30 Bits 0-3 for Repeat Count
- - Other These bits must be zero.
- 4-7 0xF0 Subblock Type:
- - 0x00: Padding
- - 0x10: End of Payload Marker
- - 0x20: Data
- - 0x30: Repeating Data
- - 0x40: Set Subfilter
- - 0x50: Unset Subfilter
- If some other value is detected, the decoder
- must indicate an error.
-
- The format of the Subblock Data field depends on Subblock Type.
-
- Subblocks with the Subblock Type 0x00 (Padding) don't have a
- Subblock Data field. These Subblocks can be useful for fixing
- alignment. There can be at maximum of 31 consecutive Subblocks
- with this Subblock Type; if there are more, the decoder must
- indicate an error.
+5.3. Filters
- Subblock with the Subblock Type 0x10 (End of Payload Marker)
- doesn't have a Subblock Data field. The decoder must indicate
- an error if this Subblock Type is detected when Subfilter is
- enabled, or when the Subblock filter is not supposed to embed
- the End of Payload Marker.
-
- Subblocks with the Subblock Type 0x20 (Data) contain the rest
- of the Size, which is followed by Size + 1 bytes in the Data
- field (that is, Data can never be empty):
-
- +------+------+------+======+
- | Bits 4-27 for Size | Data |
- +------+------+------+======+
-
- Subblocks with the Subblock Type 0x30 (Repeating Data) contain
- the rest of the Repeat Count, the Size of the Data, and finally
- the actual Data to be repeated:
-
- +---------+---------+--------+------+======+
- | Bits 4-27 for Repeat Count | Size | Data |
- +---------+---------+--------+------+======+
-
- The size of the Data field is Size + 1. It is repeated Repeat
- Count + 1 times. That is, the minimum size of Data is one byte;
- the maximum size of Data is 256 bytes. The minimum number of
- repeats is one; the maximum number of repeats is 2^28.
-
- If Subfilter is not used, the Data field of Subblock Types 0x20
- and 0x30 is the output of the decoded Subblock filter. If
- Subfilter is used, Data is the input of the Subfilter, and the
- decoded output of the Subfilter is the decoded output of the
- Subblock filter.
-
- Subblocks with the Subblock Type 0x40 (Set Subfilter) contain
- a Filter Flags field in Subblock Data:
-
- +==============+
- | Filter Flags |
- +==============+
-
- It is an error to set the Subfilter to Filter ID 0x00 (Copy)
- or 0x01 (Subblock). All the other Filter IDs are allowed.
- The decoder must indicate an error if this Subblock Type is
- detected when a Subfilter is already enabled.
-
- Subblocks with the Subblock Type 0x50 (Unset Subfilter) don't
- have a Subblock Data field. There must be at least one Subblock
- with Subblock Type 0x20 or 0x30 between Subblocks with Subblock
- Type 0x40 and 0x50; if there isn't, the decoder must indicate
- an error.
-
- Subblock Types 0x40 and 0x50 are always used as a pair: If the
- Subblock filter has been enabled with Subblock Type 0x40, it
- must always be disabled later with Subblock Type 0x50.
- Disabling must be done even if the Subfilter used End of
- Payload Marker; after the Subfilter has detected End of Payload
- Marker, the next Subblock that is not Padding must unset the
- Subfilter.
-
- When the Subblock filter is used as an implicit filter to embed
- End of Payload marker, the Subblock Types 0x40 and 0x50 (Set or
- Unset Subfilter) must not be used. The decoder must indicate an
- error if it detects any of these Subblock Types in an implicit
- Subblock filter.
-
- The following code illustrates the basic structure of a
- Subblock decoder.
-
- uint32_t consecutive_padding = 0;
- bool got_output_with_subfilter = false;
-
- while (true) {
- uint32_t size;
- uint32_t repeat;
- uint8_t flags = read_byte();
-
- if (flags != 0)
- consecutive_padding = 0;
-
- switch (flags >> 4) {
- case 0:
- // Padding
- if (flags & 0x0F)
- return DATA_ERROR;
- if (++consecutive_padding == 32)
- return DATA_ERROR;
- break;
-
- case 1:
- // End of Payload Marker
- if (flags & 0x0F)
- return DATA_ERROR;
- if (subfilter_enabled || !allow_eopm)
- return DATA_ERROR;
- break;
-
- case 2:
- // Data
- size = flags & 0x0F;
- for (size_t i = 4; i < 28; i += 8)
- size |= (uint32_t)(read_byte()) << i;
-
- // If any output is produced, this will
- // set got_output_with_subfilter to true.
- copy_data(size);
- break;
-
- case 3:
- // Repeating Data
- repeat = flags & 0x0F;
- for (size_t i = 4; i < 28; i += 8)
- repeat |= (uint32_t)(read_byte()) << i;
- size = read_byte();
-
- // If any output is produced, this will
- // set got_output_with_subfilter to true.
- copy_repeating_data(size, repeat);
- break;
-
- case 4:
- // Set Subfilter
- if (flags & 0x0F)
- return DATA_ERROR;
- if (subfilter_enabled)
- return DATA_ERROR;
- got_output_with_subfilter = false;
- set_subfilter();
- break;
-
- case 5:
- // Unset Subfilter
- if (flags & 0x0F)
- return DATA_ERROR;
- if (!subfilter_enabled)
- return DATA_ERROR;
- if (!got_output_with_subfilter)
- return DATA_ERROR;
- unset_subfilter();
- break;
-
- default:
- return DATA_ERROR;
- }
- }
-
-
-4.3.3. Delta
-
- The Delta filter may increase compression ratio when the value
- of the next byte correlates with the value of an earlier byte
- at specified distance.
-
- Filter ID: 0x20
- Size of Filter Properties: 1 byte
- Changes size of data: No
-
- Detecting when all of the data has been decoded:
- Uncompressed size: Yes
- End of Payload Marker: No
- End of Input: Yes
-
- Preferred alignment:
- Input data: 1 byte
- Output data: Same as the original input data
-
- The Properties byte indicates the delta distance, which can be
- 1-256 bytes backwards from the current byte: 0x00 indicates
- distance of 1 byte and 0xFF distance of 256 bytes.
-
-
-4.3.3.1. Format of the Encoded Output
-
- The code below illustrates both encoding and decoding with
- the Delta filter.
-
- // Distance is in the range [1, 256].
- const unsigned int distance = get_properties_byte() + 1;
- uint8_t pos = 0;
- uint8_t delta[256];
-
- memset(delta, 0, sizeof(delta));
-
- while (1) {
- const int byte = read_byte();
- if (byte == EOF)
- break;
-
- uint8_t tmp = delta[(uint8_t)(distance + pos)];
- if (is_encoder) {
- tmp = (uint8_t)(byte) - tmp;
- delta[pos] = (uint8_t)(byte);
- } else {
- tmp = (uint8_t)(byte) + tmp;
- delta[pos] = tmp;
- }
-
- write_byte(tmp);
- --pos;
- }
-
-
-4.3.4. LZMA
+5.3.1. LZMA2
LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
- decompression. LZMA based on LZ77 and range coding algorithms.
+ decompression. LZMA is based on LZ77 and range coding
+ algorithms.
- Filter ID: 0x40
- Size of Filter Properties: 2 bytes
- Changes size of data: Yes, unpredictably
+ LZMA2 uses LZMA internally, but adds support for uncompressed
+ chunks, eases stateful decoder implementations, and improves
+ support for multithreading. Thus, the plain LZMA will not be
+ supported in this file format.
- Detecting when all of the data has been decoded:
- Uncompressed size: Yes
- End of Payload Marker: Yes
- End of Input: No
+ Filter ID: 0x21
+ Size of Filter Properties: 1 byte
+ Changes size of data: Yes
+ Allow as a non-last filter: No
+ Allow as the last filter: Yes
Preferred alignment:
Input data: Adjustable to 1/2/4/8/16 byte(s)
@@ -1188,88 +793,45 @@ The .lzma File Format
a separate document, because including the documentation here
would lengthen this document considerably.
- The format of the Filter Properties field is as follows:
-
- +-----------------+------------------+
- | LZMA Properties | Dictionary Flags |
- +-----------------+------------------+
-
-
-4.3.4.1. LZMA Properties
-
- The LZMA Properties field contains three properties. An
- abbreviation is given in parentheses, followed by the value
- range of the property. The field consists of
-
- 1) the number of literal context bits (lc, [0, 8]);
- 2) the number of literal position bits (lp, [0, 4]); and
- 3) the number of position bits (pb, [0, 4]).
-
- They are encoded using the following formula:
-
- LZMA Properties = (pb * 5 + lp) * 9 + lc
-
- The following C code illustrates a straightforward way to
- decode the properties:
-
- uint8_t lc, lp, pb;
- uint8_t prop = get_lzma_properties() & 0xFF;
- if (prop > (4 * 5 + 4) * 9 + 8)
- return LZMA_PROPERTIES_ERROR;
-
- pb = prop / (9 * 5);
- prop -= pb * 9 * 5;
- lp = prop / 9;
- lc = prop - lp * 9;
-
-
-4.3.4.2. Dictionary Flags
-
- Currently the lowest six bits of the Dictionary Flags field
- are in use:
+ The format of the one-byte Filter Properties field is as
+ follows:
Bits Mask Description
0-5 0x3F Dictionary Size
6-7 0xC0 Reserved for future use; must be zero for now.
Dictionary Size is encoded with one-bit mantissa and five-bit
- exponent. To avoid wasting space, one-byte dictionary has its
- own special value.
+ exponent. The smallest dictionary size is 4 KiB and the biggest
+ is 4 GiB.
Raw value Mantissa Exponent Dictionary size
- 0 1 0 1 byte
- 1 2 0 2 bytes
- 2 3 0 3 bytes
- 3 2 1 4 bytes
- 4 3 1 6 bytes
- 5 2 2 8 bytes
- 6 3 2 12 bytes
- 7 2 3 16 bytes
- 8 3 3 24 bytes
- 9 2 4 32 bytes
+ 0 2 11 4 KiB
+ 1 3 11 6 KiB
+ 2 2 12 8 KiB
+ 3 3 12 12 KiB
+ 4 2 13 16 KiB
+ 5 3 13 24 KiB
+ 6 2 14 32 KiB
... ... ... ...
- 61 2 30 2 GiB
- 62 3 30 3 GiB
- 63 2 31 4 GiB (*)
-
- (*) The real maximum size of the dictionary is one byte
- less than 4 GiB, because the distance of 4 GiB is
- reserved for End of Payload Marker.
+ 35 3 28 768 MiB
+ 36 2 29 1024 MiB
+ 37 3 29 1536 MiB
+ 38 2 30 2048 MiB
+ 39 3 30 3072 MiB
+ 40 2 31 4096 MiB
Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:
- uint64_t dictionary_size;
const uint8_t bits = get_dictionary_flags() & 0x3F;
- if (bits == 0) {
- dictionary_size = 1;
- } else {
- dictionary_size = 2 | ((bits + 1) & 1);
- dictionary_size = dictionary_size << ((bits - 1) / 2);
- }
+ if (bits > 40)
+ return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
+
+ // 64 bits are needed: raw value 40 gives exactly 4 GiB (2^32).
+ uint64_t dictionary_size = 2 | (bits & 1);
+ dictionary_size <<= bits / 2 + 11;
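For the encoder side, the reverse mapping can be found with a short search over the 41 valid raw values. The sketch below is illustrative only; the function names are hypothetical, and a 64-bit type is used because raw value 40 corresponds to exactly 4 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Raw value (0-40) -> dictionary size in bytes. */
static uint64_t dict_size_from_bits(uint8_t bits)
{
	assert(bits <= 40);
	uint64_t dictionary_size = 2 | (bits & 1);
	dictionary_size <<= bits / 2 + 11;
	return dictionary_size;
}

/* Smallest raw value whose dictionary size is at least wanted bytes,
 * or -1 if wanted is bigger than 4 GiB. */
static int dict_bits_for_size(uint64_t wanted)
{
	for (uint8_t bits = 0; bits <= 40; ++bits)
		if (dict_size_from_bits(bits) >= wanted)
			return bits;

	return -1;
}
```

A request for 1 MiB maps to raw value 16 (mantissa 2, exponent 19); anything up to 4 KiB maps to raw value 0.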
-4.3.5. Branch/Call/Jump Filters for Executables
+5.3.2. Branch/Call/Jump Filters for Executables
These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
@@ -1278,6 +840,8 @@ The .lzma File Format
Size of Filter Properties: 0 or 4 bytes
Changes size of data: No
+ Allow as a non-last filter: Yes
+ Allow as the last filter: No
Detecting when all of the data has been decoded:
Uncompressed size: Yes
@@ -1307,378 +871,63 @@ The .lzma File Format
the Subblock filter.
-5. Metadata
-
- Metadata is stored in Metadata Blocks, which can be in the
- beginning or at the end of a Multi-Block Stream. Because of
- Blocks, it is possible to compress Metadata in the same way
- as the actual data is compressed. This Section describes the
- format of the data stored in Metadata Blocks.
-
- +----------------+===============================+
- | Metadata Flags | Size of Header Metadata Block |
- +----------------+===============================+
-
- +============+===================+=======+=======+
- ---> | Total Size | Uncompressed Size | Index | Extra |
- +============+===================+=======+=======+
-
- Stream must be parseable backwards. That is, there must be
- a way to locate the beginning of the Stream by starting from
- the end of the Stream. Thus, the Footer Metadata Block must
- contain the Total Size field or the Index field. If the Stream
- has Header Metadata Block, also the Size of Header Metadata
- Block field must be present in Footer Metadata Block.
-
- It must be possible to quickly locate the Blocks in
- non-streamed mode. Thus, the Index field must be present
- at least in one Metadata Block.
-
- If the above conditions are not met, the decoder must indicate
- an error.
-
- There should be no additional data after the last field. If
- there is, the decoder should indicate an error.
-
-
-5.1. Metadata Flags
-
- This field describes which fields are present in a Metadata
- Block:
-
- Bit(s) Mask Description
- 0 0x01 Size of Header Metadata Block is present.
- 1 0x02 Total Size is present.
- 2 0x04 Uncompressed Size is present.
- 3 0x08 Index is present.
- 4-6 0x70 Reserved for future use; must be zero for now.
- 7 0x80 Extra is present.
-
- If any reserved bit is set, the decoder must indicate an error.
- It is possible that there is a new field present which the
- decoder is not aware of, and can thus parse the Metadata
- incorrectly.
-
-
-5.2. Size of Header Metadata Block
-
- This field is present only if the appropriate bit is set in
- the Metadata Flags field (see Section 5.1).
-
- Size of Header Metadata Block is needed to make it possible to
- parse the Stream backwards. The size is stored using the
- encoding described in Section 1.2. The decoder must verify that
- the value stored in this field is non-zero. In Footer
- Metadata Block, the decoder must also verify that the stored
- size matches the real size of Header Metadata Block. In the
- Header Metadata Block, the value of this field is ignored as
- long as it is not zero.
-
-
-5.3. Total Size
-
- This field is present only if the appropriate bit is set in the
- Metadata Flags field (see Section 5.1).
-
- This field contains the total size of the Data Blocks in the
- Stream. Total Size is stored using the encoding described in
- Section 1.2. If the stored value does not match the real total
- size of the Data Blocks, the decoder must indicate an error.
- The value of this field must be non-zero.
-
- Total Size can be used to quickly locate the beginning or end
- of the Stream. This can be useful for example when doing
- random-access reading, and the Index field is not in the
- Metadata Block currently being read.
-
- It is useless to have both Total Size and Index in the same
- Metadata Block, because Total Size can be calculated from the
- Index field.
-
-
-5.4. Uncompressed Size
-
- This field is present only if the appropriate bit is set in the
- Metadata Flags field (see Section 5.1).
-
- This field contains the total uncompressed size of the Data
- Blocks in the Stream. Uncompressed Size is stored using the
- encoding described in Section 1.2. If the stored value does not
- match the real uncompressed size of the Data Blocks, the
- decoder must indicate an error.
-
- It is useless to have both Uncompressed Size and Index in
- the same Metadata Block, because Uncompressed Size can be
- calculated from the Index field.
-
-
-5.5. Index
-
- +=======================+=============+====================+
- | Number of Data Blocks | Total Sizes | Uncompressed Sizes |
- +=======================+=============+====================+
-
- Index serves several purposes. Using it, one can
- - verify that all Blocks in a Stream have been processed;
- - find out the Uncompressed Size of a Stream; and
- - quickly access the beginning of any Block (random access).
-
-
-5.5.1. Number of Data Blocks
-
- This field contains the number of Data Blocks in the Stream.
- The value is stored using the encoding described in Section
- 1.2. If the decoder has decoded all the Data Blocks of the
- Stream, and then notices that the Number of Records doesn't
- match the real number of Data Blocks, the decoder must
- indicate an error. The value of this field must be non-zero.
-
-
-5.5.2. Total Sizes
-
- +============+============+
- | Total Size | Total Size | ...
- +============+============+
-
- This field lists the Total Sizes of every Data Block in the
- Stream. There are as many Total Size fields as indicated by
- the Number of Data Blocks field.
-
- Total Size is the size of Block Header, Compressed Data, and
- Block Footer. It is stored using the encoding described in
- Section 1.2. If the Total Sizes do not match the real sizes
- of respective Blocks, the decoder should indicate an error.
- All the Total Size fields must have a non-zero value.
-
-
-5.5.3. Uncompressed Sizes
-
- +===================+===================+
- | Uncompressed Size | Uncompressed Size | ...
- +===================+===================+
-
- This field lists the Uncompressed Sizes of every Data Block
- in the Stream. There are as many Uncompressed Size fields as
- indicated by the Number of Records field.
-
- Uncompressed Sizes are stored using the encoding described
- in Section 1.2. If the Uncompressed Sizes do not match the
- real sizes of respective Blocks, the decoder should indicate
- an error.
-
-
-5.6. Extra
-
- This field is present only if the appropriate bit is set in the
- Metadata Flags field (see Section 5.1). Note that the bit does
- not indicate that there is any data in the Extra field; it only
- indicates that Extra may be non-empty.
-
- The Extra field contains only information that is not required
- to properly uncompress the Stream or to do random-access
- reading. Supporting the Extra field is optional. In case the
- decoder doesn't support the Extra field, it should silently
- ignore it.
-
- Extra consists of zero or more Records:
-
- +========+========+
- | Record | Record | ...
- +========+========+
-
- Excluding Records with Record ID 0x00, each Record contains
- three fields:
-
- +===========+==============+======+
- | Record ID | Size of Data | Data |
- +===========+==============+======+
-
- The Record ID and Size of Data are stored using the encoding
- described in Section 1.2. Data can be binary or UTF-8
- [RFC-3629] strings. Non-UTF-8 strings should be avoided.
- Because the Size of Data is known, there is no need to
- terminate strings with a nul byte, although doing so should
- not be considered an error.
-
- The Record IDs are divided in two categories:
- - Safe-to-Copy Records may be preserved as is when the
- Stream is modified in ways that don't change the actual
- uncompressed data. Examples of such operations include
- recompressing and adding, modifying, or deleting unrelated
- Extra Records.
- - Unsafe-to-Copy Records should be removed (and possibly
- recreated) when any kind of changes are made to the Stream.
-
- When the actual uncompressed data is modified, all Records
- should be removed (and possibly recreated), unless the
- application knows that the Data stored to the Record(s) is
- still valid.
-
- The following subsections describe the standard Record IDs and
- the format of their Data fields. Safe-to-Copy Records have an
- odd ID, while Unsafe-to-Copy Records have an even ID.
-
-
-5.6.1. 0x00: Dummy/Padding
-
- This Record is special, because it doesn't have the Size of
- Data or Data fields.
-
- Dummy Records can be used, for example, to fill a Metadata Block
- when a few bytes of extra space have been reserved for it. There
- can be any number of Dummy Records.
-
-
-5.6.2. 0x01: OpenPGP Signature
-
- The OpenPGP signature is computed from the uncompressed data. The
- signature can be used to verify that the contents of a Stream
- have been created by a trustworthy source.
-
- If the decoder supports decoding concatenated Streams, it
- must indicate an error when verifying OpenPGP signatures if
- there is more than one Stream.
-
- OpenPGP format is documented in [RFC-2440].
-
-
-5.6.3. 0x02: Filter Information
-
- The Filter Information Record contains information about the
- filters used in the Stream. This field can be used to quickly
- - display which filters are used in each Block;
- - check if all the required filters are supported by the
- current decoder version; and
- - check how much memory is required to decode each Block.
-
- The format of the Filter Information field is as follows:
-
- +=================+=================+
- | Block 0 Filters | Block 1 Filters | ...
- +=================+=================+
-
- There can be at most as many Block Filters fields as
- there are Data Blocks in the Stream. The format of the Block
- Filters field is as follows:
-
- +------------------+======================+============+
- | Block Properties | List of Filter Flags | Subfilters |
- +------------------+======================+============+
-
- Block Properties is a bitfield:
-
- Bit(s) Mask Description
- 0-2 0x07 Number of filters (0-7)
- 3 0x08 End of Payload Marker is used.
- 4 0x10 The Subfilters field is present.
- 5-7 0xE0 Reserved for future use; must be zero for now.
-
- The contents of the List of Filter Flags field must match the
- List of Filter Flags field in the respective Block Header.
-
- The Subfilters field may be present only if the List of Filter
- Flags contains a Filter Flags field for a Subblock filter. The
- format of the Subfilters field is as follows:
-
- +======================+=========================+
- | Number of Subfilters | List of Subfilter Flags |
- +======================+=========================+
-
- The value stored in the Number of Subfilters field is stored
- using the encoding described in Section 1.2. The List of
- Subfilter Flags field contains as many Filter Flags fields
- as indicated by the Number of Subfilters field. These Filter
- Flags fields list some or all the Subfilters used via the
- Subblock filter. The order of the listed Subfilters is not
- significant.
-
- Decoders supporting this Record should indicate a warning or
- error if this Record contains Filter Flags that are not
- actually used by the respective Blocks.
-
-
-5.6.4. 0x03: Comment
-
- Free-form comment is stored in UTF-8 [RFC-3629] encoding.
-
- The beginning of a new line should be indicated using the
- ASCII Line Feed character (0x0A). When the Line Feed character
- is not the native way to indicate new line in the underlying
- operating system, the encoder and decoder should convert the
- newline characters to and from Line Feeds.
-
-
-5.6.5. 0x04: List of Checks
-
- +=======+=======+
- | Check | Check | ...
- +=======+=======+
-
- There are as many Check fields as there are Blocks in the
- Stream. The size of Check fields depends on Stream Flags
- (see Section 2.2.2).
-
- Decoders supporting this Record should indicate a warning or
- error if the Checks don't match the respective Blocks.
-
-
-5.6.6. 0x05: Original Filename
-
- Original filename is stored in UTF-8 [RFC-3629] encoding.
-
- The filename must not include any path, only the filename
- itself. Special care must be taken to prevent directory
- traversal vulnerabilities.
-
- When files are moved between different operating systems, it
- is possible that a filename valid in the source system is not
- valid in the target system. It is implementation defined how
- the decoder handles such situations.
-
-
-5.6.7. 0x07: Modification Time
-
- Modification time is stored as POSIX time, as an unsigned
- little endian integer. The number of bits depends on the
- Size of Data field. Note that the usage of unsigned integer
- limits the earliest representable time to 1970-01-01T00:00:00.
+5.3.3. Delta
+ The Delta filter may increase compression ratio when the value
+ of the next byte correlates with the value of an earlier byte
+ at a specified distance.
-5.6.8. 0x09: High-Resolution Modification Time
+ Filter ID: 0x03
+ Size of Filter Properties: 1 byte
+ Changes size of data: No
+ Allow as a non-last filter: Yes
+ Allow as the last filter: No
- This Record extends the `0x07: Modification Time' Record with
- subsecond time information. There are two supported formats
- of this field, which can be distinguished by looking at the
- Size of Data field.
+ Preferred alignment:
+ Input data: 1 byte
+ Output data: Same as the original input data
- Size Data
- 3 [0; 9,999,999] times 100 nanoseconds
- 4 [0; 999,999,999] nanoseconds
+ The Properties byte indicates the delta distance, which can be
+ 1-256 bytes backwards from the current byte: 0x00 indicates
+ distance of 1 byte and 0xFF distance of 256 bytes.
- The value is stored as an unsigned 24-bit or 32-bit little
- endian integer.
+5.3.3.1. Format of the Encoded Output
-5.6.9. 0x0B: MIME Type
+ The code below illustrates both encoding and decoding with
+ the Delta filter.
- MIME type of the uncompressed Stream. This can be used to
- detect the content type. [IANA-MIME]
+ // Distance is in the range [1, 256].
+ const unsigned int distance = get_properties_byte() + 1;
+ uint8_t pos = 0;
+ uint8_t delta[256];
+ memset(delta, 0, sizeof(delta));
-5.6.10. 0x0D: Homepage URL
+ while (1) {
+ const int byte = read_byte();
+ if (byte == EOF)
+ break;
- This field can be used, for example, when distributing software
- packages (sources or binaries). The field would indicate the
- homepage of the program.
+ uint8_t tmp = delta[(uint8_t)(distance + pos)];
+ if (is_encoder) {
+ tmp = (uint8_t)(byte) - tmp;
+ delta[pos] = (uint8_t)(byte);
+ } else {
+ tmp = (uint8_t)(byte) + tmp;
+ delta[pos] = tmp;
+ }
- For details on how to encode URLs, see [RFC-1738].
+ write_byte(tmp);
+ --pos;
+ }
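The loop above can also be exercised on an in-memory buffer. The following is only a sketch of the same logic, not code from the specification: the function name delta_filter and its parameters are hypothetical, and the distance is passed directly instead of being read from the Properties byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the specification): applies the
 * Delta filter in place to buf. distance is the decoded distance in
 * the range [1, 256], that is, the Properties byte plus one. */
void delta_filter(uint8_t *buf, size_t len,
                  unsigned int distance, int is_encoder)
{
    assert(distance >= 1 && distance <= 256);

    /* Ring buffer holding the last 256 unfiltered bytes. */
    uint8_t history[256];
    uint8_t pos = 0;
    memset(history, 0, sizeof(history));

    for (size_t i = 0; i < len; ++i) {
        /* Byte seen `distance` positions earlier (zero at the start). */
        const uint8_t prev = history[(uint8_t)(distance + pos)];
        uint8_t out;

        if (is_encoder) {
            out = (uint8_t)(buf[i] - prev);
            history[pos] = buf[i];
        } else {
            out = (uint8_t)(buf[i] + prev);
            history[pos] = out;
        }

        buf[i] = out;
        --pos; /* Wraps from 0 to 255, as in the loop above. */
    }
}
```

With a distance of 1, the bytes 10, 12, 14, 16, 18 encode to 10, 2, 2, 2, 2, and decoding that output restores the original bytes.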
-6. Custom Filter and Extra Record IDs
+5.4. Custom Filter IDs
- If a developer wants to use custom Filter or Extra Record IDs,
- he has two choices. The first choice is to contact Lasse Collin
- and ask him to allocate a range of IDs for the developer.
+ If a developer wants to use custom Filter IDs, he has two
+ choices. The first choice is to contact Lasse Collin and ask
+ him to allocate a range of IDs for the developer.
The second choice is to generate a 40-bit random integer,
which the developer can use as his personal Developer ID.
@@ -1690,10 +939,10 @@ The .lzma File Format
dd if=/dev/urandom bs=5 count=1 | hexdump
The developer can then use his Developer ID to create unique
- (well, hopefully unique) Filter and Extra Record IDs.
+ (well, hopefully unique) Filter IDs.
Bits Mask Description
- 0-15 0x0000_0000_0000_FFFF Filter or Extra Record ID
+ 0-15 0x0000_0000_0000_FFFF Filter ID
16-55 0x00FF_FFFF_FFFF_0000 Developer ID
56-62 0x7F00_0000_0000_0000 Static prefix: 0x7F
@@ -1702,21 +951,15 @@ The .lzma File Format
a shorter ID, see the beginning of this Section how to
request a custom ID range.
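As an illustration of the bit layout in the table above, a custom Filter ID could be assembled as sketched below. The function name make_custom_filter_id and the example Developer ID are hypothetical; only the field positions and the 0x7F prefix come from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the specification): combines a
 * 40-bit Developer ID and a 16-bit developer-chosen ID into one
 * custom Filter ID. */
uint64_t make_custom_filter_id(uint64_t developer_id, uint16_t local_id)
{
    /* The Developer ID must fit into bits 16-55 (40 bits). */
    assert(developer_id <= UINT64_C(0xFFFFFFFFFF));

    return (UINT64_C(0x7F) << 56)     /* bits 56-62: static prefix */
            | (developer_id << 16)    /* bits 16-55: Developer ID */
            | (uint64_t)local_id;     /* bits 0-15: Filter ID */
}
```

For example, the made-up Developer ID 0x0123456789 combined with the ID 0x0001 gives 0x7F01_2345_6789_0001.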
- Note that Filter and Metadata Record IDs are in their own
- namespaces. That is, you can use the same ID value as Filter ID
- and Metadata Record ID, and the meanings of the IDs do not need
- to be related to each other.
-
-6.1. Reserved Custom Filter ID Ranges
+5.4.1. Reserved Custom Filter ID Ranges
Range Description
- 0x0000_0000 - 0x0000_00DF IDs fitting into the Misc field
0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
-7. Cyclic Redundancy Checks
+6. Cyclic Redundancy Checks
There are several incompatible variations to calculate CRC32
and CRC64. For simplicity and clarity, complete examples are
@@ -1811,32 +1054,7 @@ The .lzma File Format
}
-8. References
-
-8.1. Normative References
-
- [RFC-1738]
- Uniform Resource Locators (URL)
- http://www.ietf.org/rfc/rfc1738.txt
-
- [RFC-2119]
- Key words for use in RFCs to Indicate Requirement Levels
- http://www.ietf.org/rfc/rfc2119.txt
-
- [RFC-2440]
- OpenPGP Message Format
- http://www.ietf.org/rfc/rfc2440.txt
-
- [RFC-3629]
- UTF-8, a transformation format of ISO 10646
- http://www.ietf.org/rfc/rfc3629.txt
-
- [IANA-MIME]
- MIME Media Types
- http://www.iana.org/assignments/media-types/
-
-
-8.2. Informative References
+7. References
LZMA SDK - The original LZMA implementation
http://7-zip.org/sdk.html
@@ -1849,6 +1067,10 @@ The .lzma File Format
http://www.ietf.org/rfc/rfc1952.txt
- Notation of byte boxes in section `2.1. Overall conventions'
+ [RFC-2119]
+ Key words for use in RFCs to Indicate Requirement Levels
+ http://www.ietf.org/rfc/rfc2119.txt
+
[GNU-tar]
GNU tar 1.16.1 manual
http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html