diff options
author | Lasse Collin <lasse.collin@tukaani.org> | 2007-12-09 00:42:33 +0200 |
---|---|---|
committer | Lasse Collin <lasse.collin@tukaani.org> | 2007-12-09 00:42:33 +0200 |
commit | 5d018dc03549c1ee4958364712fb0c94e1bf2741 (patch) | |
tree | 1b211911fb33fddb3f04b77f99e81df23623ffc4 /doc | |
download | xz-5d018dc03549c1ee4958364712fb0c94e1bf2741.tar.xz |
Imported to git.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/bugs.txt | 46 | ||||
-rw-r--r-- | doc/faq.txt | 247 | ||||
-rw-r--r-- | doc/file-format.txt | 1861 | ||||
-rw-r--r-- | doc/history.txt | 140 | ||||
-rw-r--r-- | doc/liblzma-advanced.txt | 324 | ||||
-rw-r--r-- | doc/liblzma-hacking.txt | 112 | ||||
-rw-r--r-- | doc/liblzma-intro.txt | 188 | ||||
-rw-r--r-- | doc/liblzma-security.txt | 219 | ||||
-rw-r--r-- | doc/lzma-intro.txt | 107 |
9 files changed, 3244 insertions, 0 deletions
diff --git a/doc/bugs.txt b/doc/bugs.txt new file mode 100644 index 00000000..55579343 --- /dev/null +++ b/doc/bugs.txt @@ -0,0 +1,46 @@ + +Reporting bugs +-------------- + + Naturally it is easiest for me if you already know what causes the + unexpected behavior. Even better if you have a patch to propose. + However, quite often the reason for unexpected behavior is unknown, + so below are a few things what to do before sending a bug report. + + In case of a crash (usually segmentation violation): + + 1. Try to create a small example how to reprocude the issue. + + 2. If you are writing an application using liblzma or libzfile, + double check that you are using the libraries correctly (for + example, that you didn't forget to call lzma_init()). If it is + the command line tool included in LZMA Utils that is crashing, + ignore this step. + + 3. Compile LZMA Utils with debugging code using configure switch + `--enable-debug'. If you are using GCC as the compiler, use + CFLAGS='-O0 -ggdb'. Don't strip the resulting binaries. + + 4. Turn on core dumps. The exact command depends on your shell; + for example in GNU bash it is done with `ulimit -c unlimited', + and in tcsh with `limit coredumpsize unlimited'. + + 5. Try to reproduce the suspected bug. If you get `assertion failed' + message, be sure to include the complete message in your bug + report. If the application leaves a coredump, get a backtrace + using gdb: + $ gdb /path/to/app-binary # Loads the app to the debugger. + (gdb) core core # Opens the coredump. + (gdb) bt # Prints the backtrace. Copy & paste to bug report. + (gdb) quit # Quits gdb. + + Send your bug report to Lasse Collin <lasse.collin@tukaani.org>. Don't + send the core dump file or the actual executables. If you have a small + example file(s) (total size less than 100 KiB), please include it/them + as an attachment. + + Do NOT complain about problems with LZMA Utils to Igor Pavlov. + Although the code of LZMA Utils is derived from his code, there are + a lot of changes, which may have introduced bugs not present in + the original version. + diff --git a/doc/faq.txt b/doc/faq.txt new file mode 100644 index 00000000..d01cf91b --- /dev/null +++ b/doc/faq.txt @@ -0,0 +1,247 @@ + +LZMA Utils FAQ +-------------- + + Copyright (C) 2007 Lasse Collin + + Copying and distribution of this file, with or without modification, + are permitted in any medium without royalty provided the copyright + notice and this notice are preserved. + + +Q: What are LZMA, LZMA Utils, lzma, .lzma, liblzma, LZMA SDK, LZMA_Alone, + 7-Zip and p7zip? + +A: LZMA stands for Lempel-Ziv-Markov chain-Algorithm. LZMA is the name + of the compression algorithm designed by Igor Pavlov. He is the author + of 7-Zip, which is a great LGPL'd compression tool for Microsoft + Windows operating systems. In addition to 7-Zip itself, also LZMA SDK + is available on the website of 7-Zip. LZMA SDK contains LZMA + implementations in C++, Java and C#. The C++ version is the original + implementation which is used also in 7-Zip itself. + + Excluding the unrar plugin, 7-Zip is free software (free as in + freedom). Thanks to this, it was possible to port it to POSIX + platforms. The port was done and is maintained by myspace (TODO: + myspace's real name?). p7zip is a port of 7-Zip's command line version; + p7zip doesn't include the 7-Zip's GUI. + + In POSIX world, users are used to gzip and bzip2 command line tools. + Developers know APIs of zlib and libbzip2. LZMA Utils try to ease + adoption of LZMA on free operating systems by providing a compression + library and a set of command line tools. The library is called liblzma. + It provides a zlib-like API making it easy to adapt LZMA compression in + existing applications. The main command line tool is known as lzma, + whose command line syntax is very similar to that of gzip and bzip2. + + The original command line tool from LZMA SDK (lzma.exe) was found from + a directory called LZMA_Alone in the LZMA SDK. It used a simple header + format in .lzma files. This format was also used by LZMA Utils up to + and including 4.32.x. In LZMA Utils documentation, LZMA_Alone refers + to both the file format and the command line tool from LZMA SDK. + + Because of various limitations of the LZMA_Alone file format, a new + file format was developed. Extending some existing format such as .gz + used by gzip was considered, but these formats were found to be too + limited. The filename suffix for the new .lzma format is `.lzma'. The + same suffix is also used for files in the LZMA_Alone format. To make + the transition to the new format as transparent as possible, LZMA Utils + support both the new and old formats transparently. + + 7-Zip and LZMA SDK: <http://7-zip.org/> + p7zip: <http://p7zip.sourceforge.net/> + LZMA Utils: <http://tukaani.org/lzma/> + + +Q: What LZMA implementations there are available? + +A: LZMA SDK contains implementations in C++, Java and C#. The C++ version + is the original implementation which is part of 7-Zip. LZMA SDK + contains also a small LZMA decoder in C. + + A port of LZMA SDK to Pascal was made by Alan Birtles + <http://www.birtles.org.uk/programming/>. It should work with + multiple Pascal programming language implementations. + + LZMA Utils includes liblzma, which is directly based on LZMA SDK. + liblzma is written in C (C99, not C89). In contrast to C++ callback + API used by LZMA SDK, liblzma uses zlib-like stateful C API. I do not + want to comment whether both/former/latter/neither API(s) are good or + bad. The only reason to implement a zlib-like API was, that many + developers are already familiar with zlib, and very many applications + already use zlib. Having a similar API makes it easier to include LZMA + support in existing applications. + + See also <http://en.wikipedia.org/wiki/LZMA#External_links>. + + +Q: Which file formats are supported by LZMA Utils? + +A: Even when the raw LZMA stream is always the same, it can be wrapped + in different container formats. The preferred format is the new .lzma + format. It has magic bytes (the first six bytes: 0xFF 'L' 'Z' 'M' + 'A' 0x00). The format supports chaining up to seven filters filters, + splitting data to multiple blocks for easier multi-threading and rough + random-access reading. The file integrity is verified using CRC32, + CRC64, or SHA256, and by verifying the uncompressed size of the file. + + LZMA SDK includes a tool called LZMA_Alone. It supports uses a + primitive header which includes only the mandatory stream information + required by the LZMA decoder. This format can be both read and + written by liblzma and the command line tool (use --format=alone to + create such files). + + .7z is the native archive format used by 7-Zip. This format is not + supported by liblzma, and probably will never be supported. You + should use e.g. p7zip to extract .7z files. + + It is possible to implement custom file formats by using raw filter + mode in liblzma. In this mode the application needs to store the filter + properties and provide them to liblzma before starting to uncompress + the data. + + +Q: How can I identify files containing LZMA compressed data? + +A: The preferred filename suffix for .lzma files is `.lzma'. `.tar.lzma' + may be abbreviated to `.tlz'. The same suffixes are used for files in + LZMA_Alone format. In practice this should be no problem since tools + included in LZMA Utils support both formats transparently. + + Checking the magic bytes is easy way to detect files in the new .lzma + format (the first six bytes: 0xFF 'L' 'Z' 'M' 'A' 0x00). The "file" + command version FIXME contains magic strings for this format. + + The old LZMA_Alone format has no magic bytes. Its header cannot contain + arbitrary bytes, thus it is possible to make a guess. Unfortunately the + guessing is usually too hard to be reliable, so don't try it unless you + are desperate. + + +Q: Does the lzma command line tool support sparse files? + +A: Sparse files can (of course) be compressed like normal files, but + uncompression will not restore sparseness of the file. Use an archiver + tool to take care of sparseness before compressing the data with lzma. + + The reason for this is that archiver tools handle files, while + compression tools handle streams or buffers. Being a sparse file is + a property of the file on the disk, not a property of the stream or + buffer. + + +Q: Can I recover parts of a broken LZMA file (e.g. corrupted CD-R)? + +A: With LZMA_Alone and single-block .lzma files, you can uncompress the + file until you hit the first broken byte. The data after the broken + position is lost. LZMA relies on the uncompression history, and if + bytes are missing in the middle of the file, it is impossible to + reliably continue after the broken section. + + With multi-block .lzma files it may be possible to locale the next + block in the file and continue decoding there. A limited recovery + tool for this kind of situations is planned. + + +Q: Is LZMA patented? + +A: No, the authors are not aware of any patents that could affect LZMA. + However, due to nature of software patents, the authors cannot + guarantee, that LZMA isn't affected by any third party patent. + + +Q: Where can I find documentation about how LZMA works as an algorithm? + +A: Read the source code, Luke. There is no documentation about LZMA + internals. It is possible that Igor Pavlov is the only person on + the Earth that completely knows and understands the algorithm. + + You could begin by downloading LZMA SDK, and start reading from + the LZMA decoder to get some idea about the bitstream format. + Before you begin, you should know the basics of LZ77 and + range coding algorithms. LZMA is based on LZ77, but LZMA is + *a lot* more complex. Range coding is used to compress the + final bitstream like Huffman coding is used in Deflate. + + +Q: What are filters? + +A: In context of .lzma files, a filter means an implementation of a + compression algorithm. The primary filter is LZMA, which is why + the names of the tools contain the letters LZMA. + + liblzma and the new .lzma format support also other filters than LZMA. + There are different types of filters, which are suitable for different + types of data. Thus, to select the optimal filter and settings, the + type of the input data being compressed needs to be known. + + Some filters are most useful when combined with another filter like + LZMA. These filters increase redundancy in the data, without changing + the size of the data, by taking advantage of properties specific to + the data being compressed. + + So far, all the filters are always reversible. That is, no matter what + data you pass to a filter encoder, it can be always defiltered back to + the original form. Because of this, it is safe to compress for example + a software package that contains other file types than executables + using a filter specific to the architechture of the package being + compressed. + + The old LZMA_Alone format supports only the LZMA filter. + + +Q: I cannot find BCJ and BCJ2 filters. Don't they exist in liblzma? + +A: BCJ filter is called "x86" in liblzma. BCJ2 is not included, + because it requires using more than one encoded output stream. + + +Q: Can I use LZMA in proprietary, non-free applications? + +A: liblzma is under the GNU LGPL version 2.1 or (at your opinion) any + later version. To summarise (*NOTE* This summary is not legally + binding, that is, it doesn't give you any extra permissions compared + to the LGPL. Read the GNU LGPL carefully for the exact license + conditions.): + * All the changes made into the library itself must be published + under the same license. + * End users must be able to replace the used liblzma. Easiest way + to assure this is to link dynamically against liblzma so users + can replace the shared library file if they want. + * You must make it clear to your users, that your application uses + liblzma, and that liblzma is free software under the GNU LGPL. + A copy of GNU LGPL must be included. + + LZMA SDK contains a special exception which allows linking *unmodified* + code statically with a non-free application. This exception does *not* + apply to liblzma. + + As an alternative, you can support the development of LZMA and 7-Zip + by buying a proprietary license from Igor Pavlov. See homepage of + LZMA SDK <http://7-zip.org/sdk.html> for more information. Note that + having a proprietary license from Igor Pavlov doesn't allow you to use + liblzma in a way that contradicts with the GNU LGPL, because liblzma + contains code that is not copyrighted by Igor Pavlov. Please contact + both Lasse Collin and Igor Pavlov if the license conditions of liblzma + are not suitable for you. + + +Q: I would like to help. What can I do? + +A: See the TODO file. Please contact Lasse Collin before starting to do + anything, because it is possible that someone else is already working + on the same thing. + + +Q: How can I contact the authors? + +A: Lasse Collin is the maintainer of LZMA Utils. You can contact him + either via IRC (Larhzu on #tukaani at Freenode or IRCnet). Email + should work too, <lasse.collin@tukaani.org>. + + Igor Pavlov is the father of LZMA. He is the author of 7-Zip + and LZMA SDK. <http://7-zip.org/> + + NOTE: Please don't bother Igor Pavlov with questions specific + to LZMA Utils. + diff --git a/doc/file-format.txt b/doc/file-format.txt new file mode 100644 index 00000000..4a90a67d --- /dev/null +++ b/doc/file-format.txt @@ -0,0 +1,1861 @@ + +The .lzma File Format +--------------------- + + 0. Preface + 0.1. Copyright Notices + 0.2. Changes + 1. Conventions + 1.1. Byte and Its Representation + 1.2. Multibyte Integers + 2. Stream + 2.1. Stream Types + 2.1.1. Single-Block Stream + 2.1.2. Multi-Block Stream + 2.2. Stream Header + 2.2.1. Header Magic Bytes + 2.2.2. Stream Flags + 2.2.3. CRC32 + 3. Block + 3.1. Block Header + 3.1.1. Block Flags + 3.1.2. Compressed Size + 3.1.3. Uncompressed Size + 3.1.4. List of Filter Flags + 3.1.4.1. Misc + 3.1.4.2. External ID + 3.1.4.3. External Size of Properties + 3.1.4.4. Filter Properties + 3.1.5. CRC32 + 3.1.6. Header Padding + 3.2. Compressed Data + 3.3. Block Footer + 3.3.1. Check + 3.3.2. Stream Footer + 3.3.2.1. Uncompressed Size + 3.3.2.2. Backward Size + 3.3.2.3. Stream Flags + 3.3.2.4. Footer Magic Bytes + 3.3.3. Footer Padding + 4. Filters + 4.1. Detecting when All Data Has Been Decoded + 4.1.1. With Uncompressed Size + 4.1.2. With End of Input + 4.1.3. With End of Payload Marker + 4.2. Alignment + 4.3. Filters + 4.3.1. Copy + 4.3.2. Subblock + 4.3.2.1. Format of the Encoded Output + 4.3.3. Delta + 4.3.3.1. Format of the Encoded Output + 4.3.4. LZMA + 4.3.4.1. LZMA Properties + 4.3.4.2. Dictionary Flags + 4.3.5. Branch/Call/Jump Filters for Executables + 5. Metadata + 5.1. Metadata Flags + 5.2. Size of Header Metadata Block + 5.3. Total Size + 5.4. Uncompressed Size + 5.5. Index + 5.5.1. Number of Data Blocks + 5.5.2. Total Sizes + 5.5.3. Uncompressed Sizes + 5.6. Extra + 5.6.1. 0x00: Dummy/Padding + 5.6.2. 0x01: OpenPGP Signature + 5.6.3. 0x02: Filter Information + 5.6.4. 0x03: Comment + 5.6.5. 0x04: List of Checks + 5.6.6. 0x05: Original Filename + 5.6.7. 0x07: Modification Time + 5.6.8. 0x09: High-Resolution Modification Time + 5.6.9. 0x0B: MIME Type + 5.6.10. 0x0D: Homepage URL + 6. Custom Filter and Extra Record IDs + 6.1. Reserved Custom Filter ID Ranges + 7. Cyclic Redundancy Checks + 8. References + 8.1. Normative References + 8.2. Informative References + + +0. Preface + + This document describes the .lzma file format (filename suffix + `.lzma', MIME type `application/x-lzma'). It is intended that + this format replace the format used by the LZMA_Alone tool + included in LZMA SDK up to and including version 4.43. + + IMPORTANT: The version described in this document is a + draft, NOT a final, official version. Changes + are possible. + + +0.1. Copyright Notices + + Copyright (C) 2006, 2007 Lasse Collin <lasse.collin@tukaani.org> + Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi> + + Copying and distribution of this file, with or without + modification, are permitted in any medium without royalty + provided the copyright notice and this notice are preserved. + Modified versions must be marked as such. + + All source code examples given in this document are put into + the public domain by the authors of this document. + + Thanks for helping with this document goes to Igor Pavlov, + Mark Adler and Mikko Pouru. + + +0.2. Changes + + Last modified: 2007-12-02 22:40+0200 + + (A changelog will be kept once the first official version + is made.) + + +1. Conventions + + The keywords `must', `must not', `required', `should', + `should not', `recommended', `may', and `optional' in this + document are to be interpreted as described in [RFC-2119]. + These words are not capitalized in this document. + + Indicating a warning means displaying a message, returning + appropriate exit status, or something else to let the user + know that something worth warning occurred. The operation + should still finish if a warning is indicated. + + Indicating an error means displaying a message, returning + appropriate exit status, or something else to let the user + know that something prevented successfully finishing the + operation. The operation must be aborted once an error has + been indicated. + + +1.1. Byte and Its Representation + + In this document, byte is always 8 bits. + + A `nul byte' has all bits unset. That is, the value of a nul + byte is 0x00. + + To represent byte blocks, this document uses notation that + is similar to the notation used in [RFC-1952]: + + +-------+ + | Foo | One byte. + +-------+ + + +---+---+ + | Foo | Two bytes; that is, some of the vertical bars + +---+---+ can be missing. + + +=======+ + | Foo | Zero or more bytes. + +=======+ + + In this document, a boxed byte or a byte sequence declared + using this notation is called `a field'. The example field + above would be called called `the Foo field' or plain `Foo'. + + +1.2. Multibyte Integers + + Multibyte integers of static length, such as CRC values, + are stored in little endian byte order (least significant + byte first). + + When smaller values are more likely than bigger values (e.g. + file sizes), multibyte integers are encoded in a simple + variable-length representation: + - Numbers in the range [0, 127] are copied as is, and take + one byte of space. + - Bigger numbers will occupy two or more bytes. The lowest + seven bits of every byte are used for data; the highest + (eighth) bit indicates either that + 0) the byte is in the middle of the byte sequence, or + 1) the byte is the first or the last byte. + + For now, the value of the variable-length integers is limited + to 63 bits, which limits the encoded size of the integer to + nine bytes. These limits may be increased in future if needed. + + Note that the encoding is not as optimal as it could be. For + example, it is possible to encode the number 42 using any + number of bytes between one and nine. This is convenient + for non-streamed encoders, that write Compressed Size or + Uncompressed Size fields to the Block Header (see Section 3.1) + after the Compressed Data field is written to the disk. + + In several situations, the decoder needs to compare that two + fields contain identical information. When comparing fields + using the encoding described in this Section, the decoder must + consider two fields identical if their decoded values are + identical; it does not matter if the encoded variable-length + representations differ. + + The following C code illustrates encoding and decoding 63-bit + variables; the highest bit of uint64_t must be unset. The + functions return the number of bytes occupied by the integer + (1-9), or zero on error. + + #include <sys/types.h> + #include <inttypes.h> + + size_t + encode(uint8_t buf[static 9], uint64_t num) + { + if (num >= (UINT64_C(1) << (9 * 7))) + return 0; + if (num <= 0x7F) { + buf[0] = num; + return 1; + } + buf[0] = (num & 0x7F) | 0x80; + num >>= 7; + size_t i = 1; + while (num >= 0x80) { + buf[i++] = num & 0x7F; + num >>= 7; + } + buf[i++] = num | 0x80; + return i; + } + + size_t + decode(const uint8_t buf[], size_t size_max, uint64_t *num) + { + if (size_max == 0) + return 0; + if (size_max > 9) + size_max = 9; + *num = buf[0] & 0x7F; + if (!(buf[0] & 0x80)) + return 1; + size_t i = 1; + do { + if (i == size_max) + return 0; + *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i); + } while (!(buf[i++] & 0x80)); + return i; + } + + size_t + decode_reverse(const uint8_t buf[], size_t size_max, + uint64_t *num) + { + if (size_max == 0) + return 0; + const size_t end = size_max > 9 ? size_max - 9 : 0; + size_t i = size_max - 1; + *num = buf[i] & 0x7F; + if (!(buf[i] & 0x80)) + return 1; + do { + if (i-- == end) + return 0; + *num <<= 7; + *num |= buf[i] & 0x7F; + } while (!(buf[i] & 0x80)); + return size_max - i; + } + + +2. Stream + + +========+========+========+ + | Stream | Stream | Stream | ... + +========+========+========+ + + A file contains usually only one Stream. However, it is + possible to concatenate multiple Streams together with no + additional processing. It is up to the implementation to + decide if the decoder will continue decoding from the next + Stream once the end of the first Stream has been reached. + + +2.1. Stream Types + + There are two types of Streams: Single-Block Streams and + Multi-Block Streams. Decoders conforming to this specification + must support at least Single-Block Streams. Supporting + Multi-Block Streams is optional. If the decoder supports only + Single-Block Streams, the documentation of the decoder should + mention this fact clearly. + + +2.1.1. Single-Block Stream + + +===============+============+ + | Stream Header | Data Block | + +===============+============+ + + As the name says, a Single-Block Stream has exactly one Block. + The Block must be a Data Block; Metadata Blocks are not allowed + in Single-Block Streams. + + +2.1.2. Multi-Block Stream + + +===============+=======================+ + | Stream Header | Header Metadata Block | + +===============+=======================+ + + +============+ +============+=======================+ + ---> | Data Block | ... | Data Block | Footer Metadata Block | + +============+ +============+=======================+ + + Notes: + - Stream Header is mandatory. + - Header Metadata Block is optional. + - Each Multi-Block Stream has at least one Data Block. The + maximum number of Data Blocks is not limited. + - Footer Metadata Block is mandatory. + + +2.2. Stream Header + + +---+---+---+---+---+---+--------------+--+--+--+--+ + | Header Magic Bytes | Stream Flags | CRC32 | + +---+---+---+---+---+---+--------------+--+--+--+--+ + + +2.2.1. Header Magic Bytes + + The first six (6) bytes of the Stream are so called Header + Magic Bytes. They can be used to identify the file type. + + Using a C array and ASCII: + const uint8_t HEADER_MAGIC[6] + = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 }; + + In plain hexadecimal: + FF 4C 5A 4D 41 00 + + Notes: + - The first byte (0xFF) was chosen so that the files cannot + be erroneously detected as being in LZMA_Alone format, in + which the first byte is in the the range [0x00, 0xE0]. + - The sixth byte (0x00) was chosen to prevent applications + from misdetecting the file as a text file. + + +2.2.2. Stream Flags + + Bit(s) Mask Description + 0-2 0x07 Type of Check (see Section 3.3.1): + ID Size Check name + 0x00 0 bytes None + 0x01 4 bytes CRC32 + 0x02 4 bytes (Reserved) + 0x03 8 bytes CRC64 + 0x04 16 bytes (Reserved) + 0x05 32 bytes SHA-256 + 0x06 32 bytes (Reserved) + 0x07 64 bytes (Reserved) + 3 0x08 The CRC32 field is present in Block Headers. + 4 0x10 If unset, this is a Single-Block Stream; if set, + this is a Multi-Block Stream. + 5-7 0xE0 Reserved for future use; must be zero for now. + + Implementations must support at least the Check IDs 0x00 (None) + and 0x01 (CRC32). Supporting other Check IDs is optional. If an + unsupported Check is used, the decoder must indicate a warning + or error. + + If any reserved bit is set, the decoder must indicate an error. + It is possible that there is a new field present which the + decoder is not aware of, and can thus parse the Stream Header + incorrectly. + + +2.2.3. CRC32 + + The CRC32 is calculated from the Stream Flags field. It is + stored as an unsigned 32-bit little endian integer. If the + calculated value does not match the stored one, the decoder + must indicate an error. + + Note that this field is always present; the bit in Stream Flags + controls only presence of CRC32 in Block Headers. + + +3. Block + + +==============+=================+==============+ + | Block Header | Compressed Data | Block Footer | + +==============+=================+==============+ + + There are two types of Blocks: + - Data Blocks hold the actual compressed data. + - Metadata Blocks hold the Index, Extra, and a few other + non-data fields (see Section 5). + + The type of the Block is indicated by the corresponding bit + in the Block Flags field (see Section 3.1.1). + + +3.1. Block Header + + +------+------+=================+===================+ + | Block Flags | Compressed Size | Uncompressed Size | + +------+------+=================+===================+ + + +======================+--+--+--+--+================+ + ---> | List of Filter Flags | CRC32 | Header Padding | + +======================+--+--+--+--+================+ + + +3.1.1. Block Flags + + The first byte of the Block Flags field is a bit field: + + Bit(s) Mask Description + 0-2 0x07 Number of filters (0-7) + 3 0x08 Use End of Payload Marker (even if + Uncompressed Size is stored to Block Header). + 4 0x10 The Compressed Size field is present. + 5 0x20 The Uncompressed Size field is present. + 6 0x40 Reserved for future use; must be zero for now. + 7 0x80 This is a Metadata Block. + + The second byte of the Block Flags field is also a bit field: + + Bit(s) Mask Description + 0-4 0x1F Size of the Header Padding field (0-31 bytes) + 5-7 0xE0 Reserved for future use; must be zero for now. + + The decoder must indicate an error if End of Payload Marker + is not used and Uncompressed Size is not stored to the Block + Header. Because of this, the first byte of Block Flags can + never be a nul byte. This is useful when detecting beginning + of the Block after Footer Padding (see Section 3.3.3). + + If any reserved bit is set, the decoder must indicate an error. + It is possible that there is a new field present which the + decoder is not aware of, and can thus parse the Block Header + incorrectly. + + +3.1.2. Compressed Size + + This field is present only if the appropriate bit is set in + the Block Flags field (see Section 3.1.1). + + This field contains the size of the Compressed Data field. + The size is stored using the encoding described in Section 1.2. + If the Compressed Size does not match the real size of the + Compressed Data field, the decoder must indicate an error. + + Having the Compressed Size field in the Block Header can be + useful for multithreaded decoding when seeking is not possible. + If the Blocks are small enough, the decoder can read multiple + Blocks into its internal buffer, and decode the Blocks in + parallel. + + Compressed Size can also be useful when seeking forwards to + a specific location in streamed mode: the decoder can quickly + skip over irrelevant Blocks, without decoding them. + + +3.1.3. Uncompressed Size + + This field is present only if the appropriate bit is set in + the Block Flags field (see Section 3.1.1). + + The Uncompressed Size field contains the size of the Block + after uncompressing. + + Storing Uncompressed Size serves several purposes: + - The decoder will know when all of the data has been + decoded without an explicit End of Payload Marker. + - The decoder knows how much memory it needs to allocate + for a temporary buffer in multithreaded mode. + - Simple error detection: wrong size indicates a broken file. + - Sometimes it is useful to know the file size without + uncompressing the file. + + It should be noted that the only reliable way to find out what + the real uncompressed size is is to uncompress the Block, + because the Block Header and Metadata Block fields may contain + (intentionally or unintentionally) invalid information. + + Uncompressed Size is stored using the encoding described in + Section 1.2. If the Uncompressed Size does not match the + real uncompressed size, the decoder must indicate an error. + + +3.1.4. List of Filter Flags + + +================+================+ +================+ + | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | + +================+================+ +================+ + + The number of Filter Flags fields is stored in the Block Flags + field (see Section 3.1.1). As a special case, if the number of + Filter Flags fields is zero, it is equivalent to having the + Copy filter as the only filter. + + The format of each Filter Flags field is as follows: + + +------+=============+=============================+ + | Misc | External ID | External Size of Properties | + +------+=============+=============================+ + + +===================+ + ---> | Filter Properties | + +===================+ + + The list of officially defined Filter IDs and the formats of + their Filter Properties are described in Section 4.3. + + +3.1.4.1. Misc + + To save space, the most commonly used Filter IDs and the + Size of Filter Properties are encoded in a single byte. + Depending on the contents of the Misc field, Filter ID is + the value of the Misc or External ID field. + + Value Filter ID Size of Filter Properties + 0x00 - 0x1F Misc 0 bytes + 0x20 - 0x3F Misc 1 byte + 0x40 - 0x5F Misc 2 bytes + 0x60 - 0x7F Misc 3 bytes + 0x80 - 0x9F Misc 4 bytes + 0xA0 - 0xBF Misc 5 bytes + 0xC0 - 0xDF Misc 6 bytes + 0xE0 - 0xFE External ID 0-30 bytes + 0xFF External ID External Size of Properties + + The following code demonstrates parsing the Misc field and, + when needed, the External ID and External Size of Properties + fields. + + uint64_t id; + uint64_t properties_size; + uint8_t misc = read_byte(); + + if (misc >= 0xE0) { + id = read_variable_length_integer(); + + if (misc == 0xFF) + properties_size = read_variable_length_integer(); + else + properties_size = misc - 0xE0; + + } else { + id = misc; + properties_size = misc / 0x20; + } + + +3.1.4.2. External ID + + This field is present only if the Misc field contains a value + that indicates usage of External ID. The External ID is stored + using the encoding described in Section 1.2. + + +3.1.4.3. External Size of Properties + + This field is present only if the Misc field contains a value + that indicates usage of External Size of Properties. The size + of Filter Properties is stored using the encoding described in + Section 1.2. + + +3.1.4.4. Filter Properties + + Size of this field depends on the Misc field (Section 3.1.4.1) + and, if present, External Size of Properties field (Section + 3.1.4.3). The format of this field is depends on the selected + filter; see Section 4.3 for details. + + +3.1.5. CRC32 + + This field is present only if the appropriate bit is set in + the Stream Flags field (see Section 2.2.2). + + The CRC32 is calculated over everything in the Block Header + field except the Header Padding field and the CRC32 field + itself. It is stored as an unsigned 32-bit little endian + integer. If the calculated value does not match the stored + one, the decoder must indicate an error. + + +3.1.6. Header Padding + + This field contains as many nul bytes as indicated by the value + stored in the Header Flags field. If the Header Padding field + contains any non-nul bytes, the decoder must indicate an error. + + The intent of the Header Padding field is to allow alignment + of Compressed Data. The usefulness of alignment is described + in Section 4.3. + + +3.2. Compressed Data + + The format of Compressed Data depends on Block Flags and List + of Filter Flags. Excluding the descriptions of the simplest + filters in Section 4, the format of the filter-specific encoded + data is out of scope of this document. + + Note a special case: if End of Payload Marker (see Section + 3.1.1) is not used and Uncompressed Size is zero, the size + of the Compressed Data field is always zero. + + +3.3. Block Footer + + +=======+===============+================+ + | Check | Stream Footer | Footer Padding | + +=======+===============+================+ + + +3.3.1. Check + + The type and size of the Check field depends on which bits + are set in the Stream Flags field (see Section 2.2.2). + + The Check, when used, is calculated from the original + uncompressed data. If the calculated Check does not match the + stored one, the decoder must indicate an error. If the selected + type of Check is not supported by the decoder, it must indicate + a warning or error. + + +3.3.2. Stream Footer + + +===================+===============+--------------+ + | Uncompressed Size | Backward Size | Stream Flags | + +===================+===============+--------------+ + + +----------+---------+ + ---> | Footer Magic Bytes | + +----------+---------+ + + Stream Footer is present only in + - Data Block of a Single-Block Stream; and + - Footer Metadata Block of a Multi-Block Stream. + + The Stream Footer field is placed inside Block Footer, because + no padding is allowed between Check and Stream Footer. + + +3.3.2.1. Uncompressed Size + + This field is present only in the Data Block of a Single-Block + Stream if Uncompressed Size is not stored to the Block Header + (see Section 3.1.1). Without the Uncompressed Size field in + Stream Footer it would not be possible to quickly find out + the Uncompressed Size of the Stream in all cases. + + Uncompressed Size is stored using the encoding described in + Section 1.2. If the stored value does not match the real + uncompressed size of the Single-Block Stream, the decoder must + indicate an error. + + +3.3.2.2. Backward Size + + This field contains the total size of the Block Header, + Compressed Data, Check, and Uncompressed Size fields. The + value is stored using the encoding described in Section 1.2. + If the Backward Size does not match the real total size of + the appropriate fields, the decoder must indicate an error. + + Implementations reading the Stream backwards should notice + that the value in this field can never be zero. + + +3.3.2.3. Stream Flags + + This is a copy of the Stream Flags field from the Stream + Header. The information stored to Stream Flags is needed + when parsing the Stream backwards. + + +3.3.2.4. Footer Magic Bytes + + As the last step of the decoding process, the decoder must + verify the existence of Footer Magic Bytes. If they are not + found, an error must be indicated. + + Using a C array and ASCII: + const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; + + In hexadecimal: + 59 5A + + The primary reason to have Footer Magic Bytes is to make + it easier to detect incomplete files quickly, without + uncompressing. If the file does not end with Footer Magic Bytes + (excluding Footer Padding described in Section 3.3.3), it + cannot be undamaged, unless someone has intentionally appended + garbage after the end of the Stream. (Appending garbage at the + end of the file does not prevent uncompressing the file, but + may give a warning or error depending on the decoder + implementation.) + + +3.3.3. Footer Padding + + In certain situations it is convenient to be able to pad + Blocks or Streams to be multiples of, for example, 512 bytes. + Footer Padding makes this possible. Note that this is in no + way required to enforce alignment in the way described in + Section 4.3; the Header Padding field is enough for that. + + When Footer Padding is used, it must contain only nul bytes. + Any non-nul byte should be considered as the beginning of + a new Block or Stream. + + The possibility of Padding should be taken into account when + designing an application that wants to find out information + about a Stream by parsing Footer Metadata Block. + + Support for Padding was inspired by a related note in + [GNU-tar]. + + +4. Filters + + The Block Flags field defines how many filters are used. When + more than one filter is used, the filters are chained; that is, + the output of one filter is the input of another filter. The + following figure illustrates the direction of data flow. + + v Uncompressed Data ^ + | Filter 0 | + Encoder | Filter 1 | Decoder + | ... | + | Filter n | + v Compressed Data ^ + + The filters are independent from each other, except that they + must cooperate a little to make it possible, in all cases, to + detect when all of the data has been decoded. In addition, the + filters should cooperate in the encoder to keep the alignment + optimal. + + +4.1. Detecting when All Data Has Been Decoded + + There must be a way for the decoder to detect when all of the + Compressed Data has been decoded. This is simple when only + one filter is used, but a bit more complex when multiple + filters are chained. + + This file format supports three methods to detect when all of + the data has been decoded: + - Uncompressed size + - End of Input + - End of Payload Marker + + In both encoder and decoder, filters are initialized starting + from the first filter in the chain. For each filter, one of + these three methods is used. + + +4.1.1. With Uncompressed Size + + This method is the only method supported by all filters. + It must be used when uncompressed size is known by the + filter-specific encoder or decoder. In practice this means + that Uncompressed Size has been stored to the Block Header. + + In case of the first filter in the chain, the uncompressed size + given to the filter-specific encoder or decoder equals the + Uncompressed Size stored in the Block Header. For the rest of + the filters in the chain, uncompressed size is the size of the + output data of the previous filter in the chain. + + Note that when Use End of Payload Marker bit is set in Block + Flags, Uncompressed Size is considered to be unknown even if + it was present in the Block Header. Thus, if End of Payload + Marker is used, uncompressed size of all of the filters in + the chain is unknown, and can never be used to detect when + all of the data has been decoded. + + Once the correct number of bytes has been written out, the + filter-specific decoder indicates to its caller that all of + the data has been decoded. If the filter-specific decoder + detects End of Input or End of Payload Marker before the + correct number of bytes is decoded, the decoder must indicate + an error. + + +4.1.2. With End of Input + + Most filters will know that all of the data has been decoded + when the End of Input data has been reached. Once the filter + knows that it has received the input data in its entirety, + it finishes its job, and indicates to its caller that all of + the data has been decoded. The filter-specific decoder must + indicate an error if it detects End of Payload Marker. + + Note that this method can work only when the filter is not + the last filter in the chain, because only another filter + can indicate the End of Input data. In practice this means, + that a filter later in the chain must support embedding + End of Payload Marker. + + When a filter that cannot embed End of Payload Marker is the + last filter in the chain, Subblock filter is appended to the + chain as an implicit filter. In the simplest case, this occurs + when no filters are specified, and Uncompressed Size is unknown + or the End of Payload Marker bit is set in Block Flags. + + +4.1.3. With End of Payload Marker + + End of Payload Marker is a filter-specific bit sequence that + indicates the end of data. It is supported by only a few + filters. It is used when uncompressed size is unknown, and + the filter + - doesn't support End of Input; or + - is the last filter in the chain. + + End of Payload Marker is embedded at the end of the encoded + data by the filter-specific encoder. When the filter-specific + decoder detects the embedded End of Payload Marker, the decoder + knows that all of the data has been decoded. Then it finishes + its job, and indicates to its caller that all of the data has + been decoded. If the filter-specific decoder detects End of + Input before End of Payload Marker, the decoder must indicate + an error. + + If the filter supports both End of Input and End of Payload + Marker, the former is used, unless the filter is the last + filter in the chain. + + +4.2. Alignment + + Some filters give better compression ratio or are faster + when the input or output data is aligned. For optimal results, + the encoder should try to enforce proper alignment when + possible. Not enforcing alignment in the encoder is not + an error. Thus, the decoder must be able to handle files with + suboptimal alignment. + + Alignment of uncompressed input data is usually the job of + the application producing the data. For example, to get the + best results, an archiver tool should make sure that all + PowerPC executable files in the archive stream start at + offsets that are multiples of four bytes. + + Some filters, for example LZMA, can be configured to take + advantage of specified alignment of input data. Note that + taking advantage of aligned input can be benefical also when + a filter is not the first filter in the chain. For example, + if you compress PowerPC executables, you may want to use the + PowerPC filter and chain that with the LZMA filter. Because not + only the input but also the output alignment of the PowerPC + filter is four bytes, it is now benefical to set LZMA settings + so that the LZMA encoder can take advantage of its + four-byte-aligned input data. + + The output of the last filter in the chain is stored to the + Compressed Data field. Aligning Compressed Data appropriately + can increase + - speed, if the filtered data is handled multiple bytes at + a time by the filter-specific encoder and decoder, + because accessing aligned data in computer memory is + usually faster; and + - compression ratio, if the output data is later compressed + with an external compression tool. + + Compressed Data in a Stream can be aligned by using the Header + Padding field in the Block Header. + + +4.3. Filters + +4.3.1. Copy + + This is a dummy filter that simply copies all data from input + to output unmodified. + + Filter ID: 0x00 + Size of Filter Properties: 0 bytes + Changes size of data: No + + Detecting when all of the data has been decoded: + Uncompressed size: Yes + End of Payload Marker: No + End of Input: Yes + + Preferred alignment: + Input data: 1 byte + Output data: 1 byte + + +4.3.2. Subblock + + The Subblock filter can be used to + - embed End of Payload Marker when the otherwise last + filter in the chain does not support embedding it; and + - apply additional filters in the middle of a Block. + + Filter ID: 0x01 + Size of Filter Properties: 0 bytes + Changes size of data: Yes, unpredictably + + Detecting when all of the data has been decoded: + Uncompressed size: Yes + End of Payload Marker: Yes + End of Input: Yes + + Preferred alignment: + Input data: 1 byte + Output data: Freely adjustable + + +4.3.2.1. Format of the Encoded Output + + The encoded data from the Subblock filter consist of zero or + more Subblocks: + + +==========+==========+ + | Subblock | Subblock | ... + +==========+==========+ + + Each Subblock contains two fields: + + +----------------+===============+ + | Subblock Flags | Subblock Data | + +----------------+===============+ + + Subblock Flags is a bitfield: + + Bits Mask Description + 0-3 0x0F The interpretation of these bits depend on + the Subblock Type: + - 0x20 Bits 0-3 for Size + - 0x30 Bits 0-3 for Repeat Count + - Other These bits must be zero. + 4-7 0xF0 Subblock Type: + - 0x00: Padding + - 0x10: End of Payload Marker + - 0x20: Data + - 0x30: Repeating Data + - 0x40: Set Subfilter + - 0x50: Unset Subfilter + If some other value is detected, the decoder + must indicate an error. + + The format of the Subblock Data field depends on Subblock Type. + + Subblocks with the Subblock Type 0x00 (Padding) don't have a + Subblock Data field. These Subblocks can be useful for fixing + alignment. There can be at maximum of 31 consecutive Subblocks + with this Subblock Type; if there are more, the decoder must + indicate an error. + + Subblock with the Subblock Type 0x10 (End of Payload Marker) + doesn't have a Subblock Data field. The decoder must indicate + an error if this Subblock Type is detected when Subfilter is + enabled, or when the Subblock filter is not supposed to embed + the End of Payload Marker. + + Subblocks with the Subblock Type 0x20 (Data) contain the rest + of the Size, which is followed by Size + 1 bytes in the Data + field (that is, Data can never be empty): + + +------+------+------+======+ + | Bits 4-27 for Size | Data | + +------+------+------+======+ + + Subblocks with the Subblock Type 0x30 (Repeating Data) contain + the rest of the Repeat Count, the Size of the Data, and finally + the actual Data to be repeated: + + +---------+---------+--------+------+======+ + | Bits 4-27 for Repeat Count | Size | Data | + +---------+---------+--------+------+======+ + + The size of the Data field is Size + 1. It is repeated Repeat + Count + 1 times. That is, the minimum size of Data is one byte; + the maximum size of Data is 256 bytes. The minimum number of + repeats is one; the maximum number of repeats is 2^28. + + If Subfilter is not used, the Data field of Subblock Types 0x20 + and 0x30 is the output of the decoded Subblock filter. If + Subfilter is used, Data is the input of the Subfilter, and the + decoded output of the Subfilter is the decoded output of the + Subblock filter. + + Subblocks with the Subblock Type 0x40 (Set Subfilter) contain + a Filter Flags field in Subblock Data: + + +==============+ + | Filter Flags | + +==============+ + + It is an error to set the Subfilter to Filter ID 0x00 (Copy) + or 0x01 (Subblock). All the other Filter IDs are allowed. + The decoder must indicate an error if this Subblock Type is + detected when a Subfilter is already enabled. + + Subblocks with the Subblock Type 0x50 (Unset Subfilter) don't + have a Subblock Data field. There must be at least one Subblock + with Subblock Type 0x20 or 0x30 between Subblocks with Subblock + Type 0x40 and 0x50; if there isn't, the decoder must indicate + an error. + + Subblock Types 0x40 and 0x50 are always used as a pair: If the + Subblock filter has been enabled with Subblock Type 0x40, it + must always be disabled later with Subblock Type 0x50. + Disabling must be done even if the Subfilter used End of + Payload Marker; after the Subfilter has detected End of Payload + Marker, the next Subblock that is not Padding must unset the + Subfilter. + + When the Subblock filter is used as an implicit filter to embed + End of Payload marker, the Subblock Types 0x40 and 0x50 (Set or + Unset Subfilter) must not be used. The decoder must indicate an + error if it detects any of these Subblock Types in an implicit + Subblock filter. + + The following code illustrates the basic structure of a + Subblock decoder. + + uint32_t consecutive_padding = 0; + bool got_output_with_subfilter = false; + + while (true) { + uint32_t size; + uint32_t repeat; + uint8_t flags = read_byte(); + + if (flags != 0) + consecutive_padding = 0; + + switch (flags >> 4) { + case 0: + // Padding + if (flags & 0x0F) + return DATA_ERROR; + if (++consecutive_padding == 32) + return DATA_ERROR; + break; + + case 1: + // End of Payload Marker + if (flags & 0x0F) + return DATA_ERROR; + if (subfilter_enabled || !allow_eopm) + return DATA_ERROR; + break; + + case 2: + // Data + size = flags & 0x0F; + for (size_t i = 4; i < 28; i += 8) + size |= (uint32_t)(read_byte()) << i; + + // If any output is produced, this will + // set got_output_with_subfilter to true. + copy_data(size); + break; + + case 3: + // Repeating Data + repeat = flags & 0x0F; + for (size_t i = 4; i < 28; i += 8) + repeat |= (uint32_t)(read_byte()) << i; + size = read_byte(); + + // If any output is produced, this will + // set got_output_with_subfilter to true. + copy_repeating_data(size, repeat); + break; + + case 4: + // Set Subfilter + if (flags & 0x0F) + return DATA_ERROR; + if (subfilter_enabled) + return DATA_ERROR; + got_output_with_subfilter = false; + set_subfilter(); + break; + + case 5: + // Unset Subfilter + if (flags & 0x0F) + return DATA_ERROR; + if (!subfilter_enabled) + return DATA_ERROR; + if (!got_output_with_subfilter) + return DATA_ERROR; + unset_subfilter(); + break; + + default: + return DATA_ERROR; + } + } + + +4.3.3. Delta + + The Delta filter may increase compression ratio when the value + of the next byte correlates with the value of an earlier byte + at specified distance. + + Filter ID: 0x20 + Size of Filter Properties: 1 byte + Changes size of data: No + + Detecting when all of the data has been decoded: + Uncompressed size: Yes + End of Payload Marker: No + End of Input: Yes + + Preferred alignment: + Input data: 1 byte + Output data: Same as the original input data + + The Properties byte indicates the delta distance, which can be + 1-256 bytes backwards from the current byte: 0x00 indicates + distance of 1 byte and 0xFF distance of 256 bytes. + + +4.3.3.1. Format of the Encoded Output + + The code below illustrates both encoding and decoding with + the Delta filter. + + // Distance is in the range [1, 256]. + const unsigned int distance = get_properties_byte() + 1; + uint8_t pos = 0; + uint8_t delta[256]; + + memset(delta, 0, sizeof(delta)); + + while (1) { + const int byte = read_byte(); + if (byte == EOF) + break; + + uint8_t tmp = delta[(uint8_t)(distance + pos)]; + if (is_encoder) { + tmp = (uint8_t)(byte) - tmp; + delta[pos] = (uint8_t)(byte); + } else { + tmp = (uint8_t)(byte) + tmp; + delta[pos] = tmp; + } + + write_byte(tmp); + --pos; + } + + +4.3.4. LZMA + + LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purporse + compression algorithm with high compression ratio and fast + decompression. LZMA based on LZ77 and range coding algorithms. + + Filter ID: 0x40 + Size of Filter Properties: 2 bytes + Changes size of data: Yes, unpredictably + + Detecting when all of the data has been decoded: + Uncompressed size: Yes + End of Payload Marker: Yes + End of Input: No + + Preferred alignment: + Input data: Adjustable to 1/2/4/8/16 byte(s) + Output data: 1 byte + + At the time of writing, there is no other documentation about + how LZMA works than the source code in LZMA SDK. Once such + documentation gets written, it will probably be published as + a separate document, because including the documentation here + would lengthen this document considerably. + + The format of the Filter Properties field is as follows: + + +-----------------+------------------+ + | LZMA Properties | Dictionary Flags | + +-----------------+------------------+ + + +4.3.4.1. LZMA Properties + + The LZMA Properties bits contain three properties. An + abbreviation is given in parentheses, followed by the value + range of the property. The field consists of + + 1) the number of literal context bits (lc, [0, 8]); + 2) the number of literal position bits (lp, [0, 4]); and + 3) the number of position bits (pb, [0, 4]). + + They are encoded using the following formula: + + LZMA Properties = (pb * 5 + lp) * 9 + lc + + The following C code illustrates a straightforward way to + decode the properties: + + uint8_t lc, lp, pb; + uint8_t prop = get_lzma_properties() & 0xFF; + if (prop > (4 * 5 + 4) * 9 + 8) + return LZMA_PROPERTIES_ERROR; + + pb = prop / (9 * 5); + prop -= pb * 9 * 5; + lp = prop / 9; + lc = prop - lp * 9; + + +4.3.4.2. Dictionary Flags + + Currently the lowest six bits of the Dictionary Flags field + are in use: + + Bits Mask Description + 0-5 0x3F Dictionary Size + 6-7 0xC0 Reserved for future use; must be zero for now. + + Dictionary Size is encoded with one-bit mantissa and five-bit + exponent. To avoid wasting space, one-byte dictionary has its + own special value. + + Raw value Mantissa Exponent Dictionary size + 0 1 0 1 byte + 1 2 0 2 bytes + 2 3 0 3 bytes + 3 2 1 4 bytes + 4 3 1 6 bytes + 5 2 2 8 bytes + 6 3 2 12 bytes + 7 2 3 16 bytes + 8 3 3 24 bytes + 9 2 4 32 bytes + ... ... ... ... + 61 2 30 2 GiB + 62 3 30 3 GiB + 63 2 31 4 GiB (*) + + (*) The real maximum size of the dictionary is one byte + less than 4 GiB, because the distance of 4 GiB is + reserved for End of Payload Marker. + + Instead of having a table in the decoder, the dictionary size + can be decoded using the following C code: + + uint64_t dictionary_size; + const uint8_t bits = get_dictionary_flags() & 0x3F; + if (bits == 0) { + dictionary_size = 1; + } else { + dictionary_size = 2 | ((bits + 1) & 1); + dictionary_size = dictionary_size << ((bits - 1) / 2); + } + + +4.3.5. Branch/Call/Jump Filters for Executables + + These filters convert relative branch, call, and jump + instructions to their absolute counterparts in executable + files. This conversion increases redundancy and thus + compression ratio. + + Size of Filter Properties: 0 or 4 bytes + Changes size of data: No + + Detecting when all of the data has been decoded: + Uncompressed size: Yes + End of Payload Marker: No + End of Input: Yes + + Below is the list of filters in this category. The alignment + is the same for both input and output data. + + Filter ID Alignment Description + 0x04 1 byte x86 filter (BCJ) + 0x05 4 bytes PowerPC (big endian) filter + 0x06 16 bytes IA64 filter + 0x07 4 bytes ARM (little endian) filter + 0x08 2 bytes ARM Thumb (little endian) filter + 0x09 4 bytes SPARC filter + + If the size of Filter Properties is four bytes, the Filter + Properties field contains the start offset used for address + conversions. It is stored as an unsigned 32-bit little endian + integer. If the size of Filter Properties is zero, the start + offset is zero. + + Setting the start offset may be useful if an executable has + multiple sections, and there are many cross-section calls. + Taking advantage of this feature usually requires usage of + the Subblock filter. + + +5. Metadata + + Metadata is stored in Metadata Blocks, which can be in the + beginning or at the end of a Multi-Block Stream. Because of + Blocks, it is possible to compress Metadata in the same way + as the actual data is compressed. This Section describes the + format of the data stored in Metadata Blocks. + + +----------------+===============================+ + | Metadata Flags | Size of Header Metadata Block | + +----------------+===============================+ + + +============+===================+=======+=======+ + ---> | Total Size | Uncompressed Size | Index | Extra | + +============+===================+=======+=======+ + + Stream must be parseable backwards. That is, there must be + a way to locate the beginning of the Stream by starting from + the end of the Stream. Thus, the Footer Metadata Block must + contain the Total Size field or the Index field. If the Stream + has Header Metadata Block, also the Size of Header Metadata + Block field must be present in Footer Metadata Block. + + It must be possible to quickly locate the Blocks in + non-streamed mode. Thus, the Index field must be present + at least in one Metadata Block. + + If the above conditions are not met, the decoder must indicate + an error. + + There should be no additional data after the last field. If + there is, the the decoder should indicate an error. + + +5.1. Metadata Flags + + This field describes which fields are present in a Metadata + Block: + + Bit(s) Mask Desription + 0 0x01 Size of Header Metadata Block is present. + 1 0x02 Total Size is present. + 2 0x04 Uncompressed Size is present. + 3 0x08 Index is present. + 4-6 0x70 Reserve for future use; must be zero for now. + 7 0x80 Extra is present. + + If any reserved bit is set, the decoder must indicate an error. + It is possible that there is a new field present which the + decoder is not aware of, and can thus parse the Metadata + incorrectly. + + +5.2. Size of Header Metadata Block + + This field is present only if the appropriate bit is set in + the Metadata Flags field (see Section 5.1). + + Size of Header Metadata Block is needed to make it possible to + parse the Stream backwards. The size is stored using the + encoding described in Section 1.2. The decoder must verify that + that the value stored in this field is non-zero. In Footer + Metadata Block, the decoder must also verify that the stored + size matches the real size of Header Metadata Block. In the + Header Meatadata Block, the value of this field is ignored as + long as it is not zero. + + +5.3. Total Size + + This field is present only if the appropriate bit is set in the + Metadata Flags field (see Section 5.1). + + This field contains the total size of the Data Blocks in the + Stream. Total Size is stored using the encoding described in + Section 1.2. If the stored value does not match the real total + size of the Data Blocks, the decoder must indicate an error. + The value of this field must be non-zero. + + Total Size can be used to quickly locate the beginning or end + of the Stream. This can be useful for example when doing + random-access reading, and the Index field is not in the + Metadata Block currently being read. + + It is useless to have both Total Size and Index in the same + Metadata Block, because Total Size can be calculated from the + Index field. + + +5.4. Uncompressed Size + + This field is present only if the appropriate bit is set in the + Metadata Flags field (see Section 5.1). + + This field contains the total uncompressed size of the Data + Blocks in the Stream. Uncompresssed Size is stored using the + encoding described in Section 1.2. If the stored value does not + match the real uncompressed size of the Data Blocks, the + decoder must indicate an error. + + It is useless to have both Uncompressed Size and Index in + the same Metadata Block, because Uncompressed Size can be + calculated from the Index field. + + +5.5. Index + + +=======================+=============+====================+ + | Number of Data Blocks | Total Sizes | Uncompressed Sizes | + +=======================+=============+====================+ + + Index serves several purporses. Using it, one can + - verify that all Blocks in a Stream have been processed; + - find out the Uncompressed Size of a Stream; and + - quickly access the beginning of any Block (random access). + + +5.5.1. Number of Data Blocks + + This field contains the number of Data Blocks in the Stream. + The value is stored using the encoding described in Section + 1.2. If the decoder has decoded all the Data Blocks of the + Stream, and then notices that the Number of Records doesn't + match the real number of Data Blocks, the decoder must + indicate an error. The value of this field must be non-zero. + + +5.5.2. Total Sizes + + +============+============+ + | Total Size | Total Size | ... + +============+============+ + + This field lists the Total Sizes of every Data Block in the + Stream. There are as many Total Size fields as indicated by + the Number of Data Blocks field. + + Total Size is the size of Block Header, Compressed Data, and + Block Footer. It is stored using the encoding described in + Section 1.2. If the Total Sizes do not match the real sizes + of respective Blocks, the decoder should indicate an error. + All the Total Size fields must have a non-zero value. + + +5.5.3. Uncompressed Sizes + + +===================+===================+ + | Uncompressed Size | Uncompressed Size | ... + +===================+===================+ + + This field lists the Uncompressed Sizes of every Data Block + in the Stream. There are as many Uncompressed Size fields as + indicated by the Number of Records field. + + Uncompressed Sizes are stored using the encoding described + in Section 1.2. If the Uncompressed Sizes do not match the + real sizes of respective Blocks, the decoder shoud indicate + an error. + + +5.6. Extra + + This field is present only if the appropriate bit is set in the + Metadata Flags field (see Section 5.1). Note that the bit does + not indicate that there is any data in the Extra field; it only + indicates that Extra may be non-empty. + + The Extra field contains only information that is not required + to properly uncompress the Stream or to do random-access + reading. Supporting the Extra field is optional. In case the + decoder doesn't support the Extra field, it should silently + ignore it. + + Extra consists of zero or more Records: + + +========+========+ + | Record | Record | ... + +========+========+ + + Excluding Records with Record ID 0x00, each Record contains + three fields: + + +==========+==============+======+ + | Reord ID | Size of Data | Data | + +==========+==============+======+ + + The Record ID and Size of Data are stored using the encoding + described in Section 1.2. Data can be binary or UTF-8 + [RFC-3629] strings. Non-UTF-8 strings should be avoided. + Because the Size of Data is known, there is no need to + terminate strings with a nul byte, although doing so should + not be considered an error. + + The Record IDs are divided in two categories: + - Safe-to-Copy Records may be preserved as is when the + Stream is modified in ways that don't change the actual + uncompressed data. Examples of such operatings include + recompressing and adding, modifying, or deleting unrelated + Extra Records. + - Unsafe-to-Copy Records should be removed (and possibly + recreated) when any kind of changes are made to the Stream. + + When the actual uncompressed data is modified, all Records + should be removed (and possibly recreated), unless the + application knows that the Data stored to the Record(s) is + still valid. + + The following subsections describe the standard Record IDs and + the format of their Data fields. Safe-to-Copy Records have an + odd ID, while Unsafe-to-Copy Records have an even ID. + + +5.6.1. 0x00: Dummy/Padding + + This Record is special, because it doesn't have the Size of + Data or Data fields. + + Dummy Records can be used, for example, to fill Metadata Block + when a few bytes of extra space has been reserved for it. There + can be any number of Dummy Records. + + +5.6.2. 0x01: OpenPGP Signature + + OpenPGP signature is computed from uncompressed data. The + signature can be used to verify that the contents of a Stream + has been created by a trustworthy source. + + If the decoder supports decoding concatenated Streams, it + must indicate an error when verifying OpenPGP signatures if + there is more than one Stream. + + OpenPGP format is documented in [RFC-2440]. + + +5.6.3. 0x02: Filter Information + + The Filter Information Record contains information about the + filters used in the Stream. This field can be used to quickly + - display which filters are used in each Block; + - check if all the required filters are supported by the + current decoder version; and + - check how much memory is required to decode each Block. + + The format of the Filter Information field is as follows: + + +=================+=================+ + | Block 0 Filters | Block 1 Filters | ... + +=================+=================+ + + There can be at maximum of as many Block Filters fields as + there are Data Blocks in the Stream. The format of the Block + Filters field is as follows: + + +------------------+======================+============+ + | Block Properties | List of Filter Flags | Subfilters | + +------------------+======================+============+ + + Block Properties is a bitfield: + + Bit(s) Mask Description + 0-2 0x07 Number of filters (0-7) + 3 0x08 End of Payload Marker is used. + 4 0x10 The Subfilters field is present. + 5-7 0xE0 Reserved for future use; must be zero for now. + + The contents of the List of Filter Flags field must match the + List of Filter Flags field in the respective Block Header. + + The Subfilters field may be present only if the List of Filter + Flags contains a Filter Flags field for a Subblock filter. The + format of the Subfilters field is as follows: + + +======================+=========================+ + | Number of Subfilters | List of Subfilter Flags | + +======================+=========================+ + + The value stored in the Number of Subfilters field is stored + using the encoding described in Section 1.2. The List of + Subfilter Flags field contains as many Filter Flags fields + as indicated by the Number of Subfilters field. These Filter + Flags fields list some or all the Subfilters used via the + Subblock filter. The order of the listed Subfilters is not + significant. + + Decoders supporting this Record should indicate a warning or + error if this Record contains Filter Flags that are not + actually used by the respective Blocks. + + +5.6.4. 0x03: Comment + + Free-form comment is stored in UTF-8 [RFC-3629] encoding. + + The beginning of a new line should be indicated using the + ASCII Line Feed character (0x0A). When the Line Feed character + is not the native way to indicate new line in the underlying + operating system, the encoder and decoder should convert the + newline characters to and from Line Feeds. + + +5.6.5. 0x04: List of Checks + + +=======+=======+ + | Check | Check | ... + +=======+=======+ + + There are as many Check fields as there are Blocks in the + Stream. The size of Check fields depend on Stream Flags + (see Section 2.2.2). + + Decoders supporting this Record should indicate a warning or + error if the Checks don't match the respective Blocks. + + +5.6.6. 0x05: Original Filename + + Original filename is stored in UTF-8 [RFC-3629] encoding. + + The filename must not include any path, only the filename + itself. Special care must be taken to prevent directory + traversal vulnerabilities. + + When files are moved between different operating systems, it + is possible that filename valid in the source system is not + valid in the target system. It is implementation defined how + the decoder handles this kind of situations. + + +5.6.7. 0x07: Modification Time + + Modification time is stored as POSIX time, as an unsigned + little endian integer. The number of bits depends on the + Size of Data field. Note that the usage of unsigned integer + limits the earliest representable time to 1970-01-01T00:00:00. + + +5.6.8. 0x09: High-Resolution Modification Time + + This Record extends the `0x04: Modification time' Record with + a subsecond time information. There are two supported formats + of this field, which can be distinguished by looking at the + Size of Data field. + + Size Data + 3 [0; 9,999,999] times 100 nanoseconds + 4 [0; 999,999,999] nanoseconds + + The value is stored as an unsigned 24-bit or 32-bit little + endian integer. + + +5.6.9. 0x0B: MIME Type + + MIME type of the uncompressed Stream. This can be used to + detect the content type. [IANA-MIME] + + +5.6.10. 0x0D: Homepage URL + + This field can be used, for example, when distributing software + packages (sources or binaries). The field would indicate the + homepage of the program. + + For details on how to encode URLs, see [RFC-1738]. + + +6. Custom Filter and Extra Record IDs + + If a developer wants to use custom Filter or Extra Record IDs, + he has two choices. The first choice is to contact Lasse Collin + and ask him to allocate a range of IDs for the developer. + + The second choice is to generate a 40-bit random integer, + which the developer can use as his personal Developer ID. + To minimalize the risk of collisions, Developer ID has to be + a randomly generated integer, not manually selected "hex word". + The following command, which works on many free operating + systems, can be used to generate Developer ID: + + dd if=/dev/urandom bs=5 count=1 | hexdump + + The developer can then use his Developer ID to create unique + (well, hopefully unique) Filter and Extra Record IDs. + + Bits Mask Description + 0-15 0x0000_0000_0000_FFFF Filter or Extra Record ID + 16-55 0x00FF_FFFF_FFFF_0000 Developer ID + 56-62 0x7F00_0000_0000_0000 Static prefix: 0x7F + + The resulting 63-bit integer will use 9 bytes of space when + stored using the encoding described in Section 1.2. To get + a shorter ID, see the beginning of this Section how to + request a custom ID range. + + Note that Filter and Metadata Record IDs are in their own + namespaces. That is, you can use the same ID value as Filter ID + and Metadata Record ID, and the meanings of the IDs do not need + to be related to each other. + + +6.1. Reserved Custom Filter ID Ranges + + Range Description + 0x0000_0000 - 0x0000_00DF IDs fitting into the Misc field + 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility + 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility + + +7. Cyclic Redundancy Checks + + There are several incompatible variations to calculate CRC32 + and CRC64. For simplicity and clarity, complete examples are + provided to calculate the checks as they are used in this file + format. Implementations may use different code as long as it + gives identical results. + + The program below reads data from standard input, calculates + the CRC32 and CRC64 values, and prints the calculated values + as big endian hexadecimal strings to standard output. + + #include <sys/types.h> + #include <inttypes.h> + #include <stdio.h> + + uint32_t crc32_table[256]; + uint64_t crc64_table[256]; + + void + init(void) + { + static const uint32_t poly32 = UINT32_C(0xEDB88320); + static const uint64_t poly64 + = UINT64_C(0xC96C5795D7870F42); + + for (size_t i = 0; i < 256; ++i) { + uint32_t crc32 = i; + uint64_t crc64 = i; + + for (size_t j = 0; j < 8; ++j) { + if (crc32 & 1) + crc32 = (crc32 >> 1) ^ poly32; + else + crc32 >>= 1; + + if (crc64 & 1) + crc64 = (crc64 >> 1) ^ poly64; + else + crc64 >>= 1; + } + + crc32_table[i] = crc32; + crc64_table[i] = crc64; + } + } + + uint32_t + crc32(const uint8_t *buf, size_t size, uint32_t crc) + { + crc = ~crc; + for (size_t i = 0; i < size; ++i) + crc = crc32_table[buf[i] ^ (crc & 0xFF)] + ^ (crc >> 8); + return ~crc; + } + + uint64_t + crc64(const uint8_t *buf, size_t size, uint64_t crc) + { + crc = ~crc; + for (size_t i = 0; i < size; ++i) + crc = crc64_table[buf[i] ^ (crc & 0xFF)] + ^ (crc >> 8); + return ~crc; + } + + int + main() + { + init(); + + uint32_t value32 = 0; + uint64_t value64 = 0; + uint64_t total_size = 0; + uint8_t buf[8192]; + + while (1) { + const size_t buf_size = fread(buf, 1, 8192, stdin); + if (buf_size == 0) + break; + + total_size += buf_size; + value32 = crc32(buf, buf_size, value32); + value64 = crc64(buf, buf_size, value64); + } + + printf("Bytes: %" PRIu64 "\n", total_size); + printf("CRC-32: 0x%08" PRIX32 "\n", value32); + printf("CRC-64: 0x%016" PRIX64 "\n", value64); + + return 0; + } + + +8. References + +8.1. Normative References + + [RFC-1738] + Uniform Resource Locators (URL) + http://www.ietf.org/rfc/rfc1738.txt + + [RFC-2119] + Key words for use in RFCs to Indicate Requirement Levels + http://www.ietf.org/rfc/rfc2119.txt + + [RFC-2440] + OpenPGP Message Format + http://www.ietf.org/rfc/rfc2440.txt + + [RFC-3629] + UTF-8, a transformation format of ISO 10646 + http://www.ietf.org/rfc/rfc3629.txt + + [IANA-MIME] + MIME Media Types + http://www.iana.org/assignments/media-types/ + + +8.2. Informative References + + LZMA SDK - The original LZMA implementation + http://7-zip.org/sdk.html + + LZMA Utils - LZMA adapted to POSIX-like systems + http://tukaani.org/lzma/ + + [RFC-1952] + GZIP file format specification version 4.3 + http://www.ietf.org/rfc/rfc1952.txt + - Notation of byte boxes in section `2.1. Overall conventions' + + [GNU-tar] + GNU tar 1.16.1 manual + http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html + - Node 9.4.2 `Blocking Factor', paragraph that begins + `gzip will complain about trailing garbage' + - Note that this URL points to the latest version of the + manual, and may some day not contain the note which is in + 1.16.1. For the exact version of the manual, download GNU + tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz + diff --git a/doc/history.txt b/doc/history.txt new file mode 100644 index 00000000..55293062 --- /dev/null +++ b/doc/history.txt @@ -0,0 +1,140 @@ + +LZMA Utils history +------------------ + +Tukaani distribution + + In 2005, there was a small group working on Tukaani distribution, which + was a Slackware fork. One of the project goals was to fit the distro on + a single 700 MiB ISO-9660 image. Using LZMA instead of gzip helped a + lot. Roughly speaking, one could fit data that took 1000 MiB in gzipped + form into 700 MiB with LZMA. Naturally compression ratio varied across + packages, but this was what we got on average. + + Slackware packages have traditionally had .tgz as the filename suffix, + which is an abbreviation of .tar.gz. A logical naming for LZMA + compressed packages was .tlz, being an abbreviation of .tar.lzma. + + At the end of the year 2007, there's no distribution under the Tukaani + project anymore. Development of LZMA Utils still continues. Still, + there are .tlz packages around, because at least Vector Linux (a + Slackware based distribution) uses LZMA for its packages. + + First versions of the modified pkgtools used the LZMA_Alone tool from + Igor Pavlov's LZMA SDK as is. It was fine, because users wouldn't need + to interact with LZMA_Alone directly. But people soon wanted to use + LZMA for other files too, and the interface of LZMA_Alone wasn't + comfortable for those used to gzip and bzip2. + + +First steps of LZMA Utils + + The first version of LZMA Utils (4.22.0) included a shell script called + lzmash. It was wrapper that had gzip-like command line interface. It + used the LZMA_Alone tool from LZMA SDK to do all the real work. zgrep, + zdiff, and related scripts from gzip were adapted work with LZMA and + were part of the first LZMA Utils release too. + + LZMA Utils 4.22.0 included also lzmadec, which was a small (less than + 10 KiB) decoder-only command line tool. It was written on top of the + decoder-only C code found from the LZMA SDK. lzmadec was convenient in + situations where LZMA_Alone (a few hundred KiB) would be too big. + + lzmash and lzmadec were written by Lasse Collin. + + +Second generation + + The lzmash script was an ugly and not very secure hack. The last + version of LZMA Utils to use lzmash was 4.27.1. + + LZMA Utils 4.32.0beta1 introduced a new lzma command line tool written + by Ville Koskinen. It was written in C++, and used the encoder and + decoder from C++ LZMA SDK with little modifications. This tool replaced + both the lzmash script and the LZMA_Alone command line tool in LZMA + Utils. + + Introducing this new tool caused some temporary incompatibilities, + because LZMA_Alone executable was simply named lzma like the new + command line tool, but they had completely different command line + interface. The file format was still the same. + + Lasse wrote liblzmadec, which was a small decoder-only library based on + the C code found from LZMA SDK. liblzmadec had API similar to zlib, + although there were some significant differences, which made it + non-trivial to use it in some applications designed for zlib and + libbzip2. + + The lzmadec command line tool was converted to use liblzmadec. + + Alexandre Sauvé helped converting build system to use GNU Autotools. + This made is easier to test for certain less portable features needed + by the new command line tool. + + Since the new command line tool never got completely finished (for + example, it didn't support LZMA_OPT environment variable), the intent + was to not call 4.32.x stable. Similarly, liblzmadec wasn't polished, + but appeared to work well enough, so some people started using it too. + + Because the development of the third generation of LZMA Utils was + delayed considerably (roughly two years), the 4.32.x branch had to be + kept maintained. It got some bug fixes now and then, and finally it was + decided to call it stable, although most of the missing features were + never added. + + +File format problems + + The file format used by LZMA_Alone was primitive. It was designed for + embedded systems in mind, and thus provided only minimal set of + features. The two biggest problems for non-embedded use were lack of + magic bytes and integrity check. + + Igor and Lasse started developing a new file format with some help from + Ville Koskinen, Mark Adler and Mikko Pouru. Designing the new format + took quite a long time. It was mostly because Lasse was quite slow at + getting things done due to personal reasons. + + Near the end of the year 2007 the new format was practically finished. + Compared to LZMA_Alone format and the .gz format used by gzip, the new + .lzma format is quite complex as a whole. This means that tools having + *full* support for the new format would be larger and more complex than + the tools supporting only the old LZMA_Alone format. + + For the situations where the full support for the .lzma format wouldn't + be required (embedded systems, operating system kernels), the new + format has a well-defined subset, which is easy to support with small + amount of code. It wouldn't be as small as an implementation using the + LZMA_Alone format, but the difference shouldn't be significant. + + The new .lzma format allows dividing the data in multiple independent + blocks, which can be compressed and uncompressed independenly. This + makes multi-threading possible with algorithms that aren't inherently + parallel (such as LZMA). There's also a central index of the sizes of + the blocks, which makes it possible to do limited random-access reading + with granularity of the block size. + + The new .lzma format uses the same filename suffix that was used for + LZMA_Alone files. The advantage is that users using the new tools won't + notice the change to the new format. The disadvantage is that the old + tools won't work with the new files. + + +Third generation + + LZMA Utils 4.42.0alphas drop the rest of the C++ LZMA SDK. The LZMA and + other included filters (algorithm implementations) are still directly + based on LZMA SDK, but ported to C. + + liblzma is now the core of LZMA Utils. It has zlib-like API, which + doesn't suffer from the problems of the API of liblzmadec. liblzma + supports not only LZMA, but several other filters, which together + can improve compression ratio even further with certain file types. + + The lzma and lzmadec command line tools have been rewritten. They uses + liblzma to do the actual compressing or uncompressing. + + The development of LZMA Utils 4.42.x is still in alpha stage. Several + features are still missing or don't fully work yet. Documentation is + also very minimal. + diff --git a/doc/liblzma-advanced.txt b/doc/liblzma-advanced.txt new file mode 100644 index 00000000..d829a33a --- /dev/null +++ b/doc/liblzma-advanced.txt @@ -0,0 +1,324 @@ + +Advanced features of liblzma +---------------------------- + +0. Introduction + + Most developers need only the basic features of liblzma. These + features allow single-threaded encoding and decoding of .lzma files + in streamed mode. + + In some cases developers want more. The .lzma file format is + designed to allow multi-threaded encoding and decoding and limited + random-access reading. These features are possible in non-streamed + mode and limitedly also in streamed mode. + + To take advange of these features, the application needs a custom + .lzma file format handler. liblzma provides a set of tools to ease + this task, but it's still quite a bit of work to get a good custom + .lzma handler done. + + +1. Where to begin + + Start by reading the .lzma file format specification. Understanding + the basics of the .lzma file structure is required to implement a + custom .lzma file handler and to understand the rest of this document. + + +2. The basic components + +2.1. Stream Header and tail + + Stream Header begins the .lzma Stream and Stream tail ends it. Stream + Header is defined in the file format specification, but Stream tail + isn't (thus I write "tail" with a lower-case letter). Stream tail is + simply the Stream Flags and the Footer Magic Bytes fields together. + It was done this way in liblzma, because the Block coders take care + of the rest of the stuff in the Stream Footer. + + For now, the size of Stream Header is fixed to 11 bytes. The header + <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you + should use instead of a hardcoded number. Similarly, Stream tail + is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE. + + It is possible, that a future version of the .lzma format will have + variable-sized Stream Header and tail. As of writing, this seems so + unlikely though, that it was considered simplest to just use a + constant instead of providing a functions to get and store the sizes + of the Stream Header and tail. + + +2.x. Stream tail + + For now, the size of Stream tail is fixed to 3 bytes. The header + <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you + should use instead of a hardcoded number. + + +3. Keeping track of size information + + The lzma_info_* functions found from <lzma/info.h> should ease the + task of keeping track of sizes of the Blocks and also the Stream + as a whole. Using these functions is strongly recommended, because + there are surprisingly many situations where an error can occur, + and these functions check for possible errors every time some new + information becomes available. + + If you find lzma_info_* functions lacking something that you would + find useful, please contact the author. + + +3.1. Start offset of the Stream + + If you are storing the .lzma Stream inside anothe file format, or + for some other reason are placing the .lzma Stream to somewhere + else than to the beginning of the file, you should tell the starting + offset of the Stream using lzma_info_start_offset_set(). + + The start offset of the Stream is used for two distinct purporses. + First, knowing the start offset of the Stream allows + lzma_info_alignment_get() to correctly calculate the alignment of + every Block. This information is given to the Block encoder, which + will calculate the size of Header Padding so that Compressed Data + is alignment at an optimal offset. + + Another use for start offset of the Stream is in random-access + reading. If you set the start offset of the Stream, lzma_info_locate() + will be able to calculate the offset relative to the beginning of the + file containing the Stream (instead of offset relative to the + beginning of the Stream). + + +3.2. Size of Stream Header + + While the size of Stream Header is constant (11 bytes) in the current + version of the .lzma file format, this may change in future. + + +3.3. Size of Header Metadata Block + + This information is needed when doing random-access reading, and + to verify the value of this field stored in Footer Metadata Block. + + +3.4. Total Size of the Data Blocks + + +3.5. Uncompressed Size of Data Blocks + + +3.6. Index + + + + +x. Alignment + + There are a few slightly different types of alignment issues when + working with .lzma files. + + The .lzma format doesn't strictly require any kind of alignment. + However, if the encoder carefully optimizes the alignment in all + situations, it can improve compression ratio, speed of the encoder + and decoder, and slightly help if the files get damaged and need + recovery. + + Alignment has the most significant effect compression ratio FIXME + + +x.1. Compression ratio + + Some filters take advantage of the alignment of the input data. + To get the best compression ratio, make sure that you feed these + filters correctly aligned data. + + Some filters (e.g. LZMA) don't necessarily mind too much if the + input doesn't match the preferred alignment. With these filters + the penalty in compression ratio depends on the specific type of + data being compressed. + + Other filters (e.g. PowerPC executable filter) won't work at all + with data that is improperly aligned. While the data can still + be de-filtered back to its original form, the benefit of the + filtering (better compression ratio) is completely lost, because + these filters expect certain patterns at properly aligned offsets. + The compression ratio may even worse with incorrectly aligned input + than without the filter. + + +x.1.1. Inter-filter alignment + + When there are multiple filters chained, checking the alignment can + be useful not only with the input of the first filter and output of + the last filter, but also between the filters. + + Inter-filter alignment important especially with the Subblock filter. + + +x.1.2. Further compression with external tools + + This is relatively rare situation in practice, but still worth + understanding. + + Let's say that there are several SPARC executables, which are each + filtered to separate .lzma files using only the SPARC filter. If + Uncompressed Size is written to the Block Header, the size of Block + Header may vary between the .lzma files. If no Padding is used in + the Block Header to correct the alignment, the starting offset of + the Compressed Data field will be differently aligned in different + .lzma files. + + All these .lzma files are archived into a single .tar archive. Due + to nature of the .tar format, every file is aligned inside the + archive to an offset that is a multiple of 512 bytes. + + The .tar archive is compressed into a new .lzma file using the LZMA + filter with options, that prefer input alignment of four bytes. Now + if the independent .lzma files don't have the same alignment of + the Compressed Data fields, the LZMA filter will be unable to take + advantage of the input alignment between the files in the .tar + archive, which reduces compression ratio. + + Thus, even if you have only single Block per file, it can be good for + compression ratio to align the Compressed Data to optimal offset. + + +x.2. Speed + + Most modern computers are faster when multi-byte data is located + at aligned offsets in RAM. Proper alignment of the Compressed Data + fields can slightly increase the speed of some filters. + + +x.3. Recovery + + Aligning every Block Header to start at an offset with big enough + alignment may ease or at least speed up recovery of broken files. + + +y. Typical usage cases + +y.x. Parsing the Stream backwards + + You may need to parse the Stream backwards if you need to get + information such as the sizes of the Stream, Index, or Extra. + The basic procedure to do this follows. + + Locate the end of the Stream. If the Stream is stored as is in a + standalone .lzma file, simply seek to the end of the file and start + reading backwards using appropriate buffer size. The file format + specification allows arbitrary amount of Footer Padding (zero or more + NUL bytes), which you skip before trying to decode the Stream tail. + + Once you have located the end of the Stream (a non-NULL byte), make + sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the + Stream in a buffer. If there isn't enough bytes left from the file, + the file is too small to contain a valid Stream. Decode the Stream + tail using lzma_stream_tail_decoder(). Store the offset of the first + byte of the Stream tail; you will need it later. + + You may now want to do some internal verifications e.g. if the Check + type is supported by the liblzma build you are using. + + Decode the Backward Size field with lzma_vli_reverse_decode(). The + field is at maximum of LZMA_VLI_BYTES_MAX bytes long. Check that + Backward Size is not zero. Store the offset of the first byte of + the Backward Size; you will need it later. + + Now you know the Total Size of the last Block of the Stream. It's the + value of Backward Size plus the size of the Backward Size field. Note + that you cannot use lzma_vli_size() to calculate the size since there + might be padding; you need to use the real observed size of the + Backward Size field. + + At this point, the operation continues differently for Single-Block + and Multi-Block Streams. + + +y.x.1. Single-Block Stream + + There might be Uncompressed Size field present in the Stream Footer. + You cannot know it for sure unless you have already parsed the Block + Header earlier. For security reasons, you probably want to try to + decode the Uncompressed Size field, but you must not indicate any + error if decoding fails. Later you can give the decoded Uncompressed + Size to Block decoder if Uncopmressed Size isn't otherwise known; + this prevents it from producing too much output in case of (possibly + intentionally) corrupt file. + + Calculate the the start offset of the Stream: + + backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE + + backward_offset is the offset of the first byte of the Backward Size + field. Remember to check for integer overflows, which can occur with + invalid input files. + + Seek to the beginning of the Stream. Decode the Stream Header using + lzma_stream_header_decoder(). Verify that the decoded Stream Flags + match the values found from Stream tail. You can use the + lzma_stream_flags_is_equal() macro for this. + + Decode the Block Header. Verify that it isn't a Metadata Block, since + Single-Block Streams cannot have Metadata. If Uncompressed Size is + present in the Block Header, the value you tried to decode from the + Stream Footer must be ignored, since Uncompressed Size wasn't actually + present there. If Block Header doesn't have Uncompressed Size, and + decoding the Uncompressed Size field from the Stream Footer failed, + the file is corrupt. + + If you were only looking for the Uncompressed Size of the Stream, + you now got that information, and you can stop processing the Stream. + + To decode the Block, the same instructions apply as described in + FIXME. However, because you have some extra known information decoded + from the Stream Footer, you should give this information to the Block + decoder so that it can verify it while decoding: + - If Uncompressed Size is not present in the Block Header, set + lzma_options_block.uncompressed_size to the value you decoded + from the Stream Footer. + - Always set lzma_options_block.total_size to backward_size + + size_of_backward_size (you calculated this sum earlier already). + + +y.x.2. Multi-Block Stream + + Calculate the start offset of the Footer Metadata Block: + + backward_offset - backward_size + + backward_offset is the offset of the first byte of the Backward Size + field. Remember to check for integer overflows, which can occur with + broken input files. + + Decode the Block Header. Verify that it is a Metadata Block. Set + lzma_options_block.total_size to backward_size + size_of_backward_size + (you calculated this sum earlier already). Then decode the Footer + Metadata Block. + + Store the decoded Footer Metadata to lzma_info structure using + lzma_info_set_metadata(). Set also the offset of the Backward Size + field using lzma_info_size_set(). Then you can get the start offset + of the Stream using lzma_info_size_get(). Note that any of these steps + may fail so don't omit error checking. + + Seek to the beginning of the Stream. Decode the Stream Header using + lzma_stream_header_decoder(). Verify that the decoded Stream Flags + match the values found from Stream tail. You can use the + lzma_stream_flags_is_equal() macro for this. + + If you were only looking for the Uncompressed Size of the Stream, + it's possible that you already have it now. If Uncompressed Size (or + whatever information you were looking for) isn't available yet, + continue by decoding also the Header Metadata Block. (If some + information is missing, the Header Metadata Block has to be present.) + + Decoding the Data Blocks goes the same way as described in FIXME. + + +y.x.3. Variations + + If you know the offset of the beginning of the Stream, you may want + to parse the Stream Header before parsing the Stream tail. + diff --git a/doc/liblzma-hacking.txt b/doc/liblzma-hacking.txt new file mode 100644 index 00000000..64390bcb --- /dev/null +++ b/doc/liblzma-hacking.txt @@ -0,0 +1,112 @@ + +Hacking liblzma +--------------- + +0. Preface + + This document gives some overall information about the internals of + liblzma, which should make it easier to start reading and modifying + the code. + + +1. Programming language + + liblzma was written in C99. If you use GCC, this means that you need + at least GCC 3.x.x. GCC 2 isn't and won't be supported. + + Some GCC-specific extensions are used *conditionally*. They aren't + required to build a full-featured library. Don't make the code rely + on any non-standard compiler extensions or even C99 features that + aren't portable between almost-C99 compatible compilers (for example + non-static inlines). + + The public API headers are in C89. This is to avoid frustrating those + who maintain programs, which are strictly in C89 or C++. + + An assumption about sizeof(size_t) is made. If this assumption is + wrong, some porting is probably needed: + + sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t) + + +2. Internal vs. external API + + + + Input Output + v Application ^ + | liblzma public API | + | Stream coder | + | Block coder | + | Filter coder | + | ... | + v Filter coder ^ + + + Application + `-- liblzma public API + `-- Stream coder + |-- Stream info handler + |-- Stream Header coder + |-- Block Header coder + | `-- Filter Flags coder + |-- Metadata coder + | `-- Block coder + | `-- Filter 0 + | `-- Filter 1 + | ... + |-- Data Block coder + | `-- Filter 0 + | `-- Filter 1 + | ... + `-- Stream tail coder + + + +x. Designing new filters + + All filters must be designed so that the decoder cannot consume + arbitrary amount input without producing any decoded output. Failing + to follow this rule makes liblzma vulnerable to DoS attacks if + untrusted files are decoded (usually they are untrusted). + + An example should clarify the reason behind this requirement: There + are two filters in the chain. The decoder of the first filter produces + huge amount of output (many gigabytes or more) with a few bytes of + input, which gets passed to the decoder of the second filter. If the + data passed to the second filter is interpreted as something that + produces no output (e.g. padding), the filter chain as a whole + produces no output and consumes no input for a long period of time. + + The above problem was present in the first versions of the Subblock + filter. A tiny .lzma file could have taken several years to decode + while it wouldn't produce any output at all. The problem was fixed + by adding limits for number of consecutive Padding bytes, and requiring + that some decoded output must be produced between Set Subfilter and + Unset Subfilter. + + +x. Implementing new filters + + If the filter supports embedding End of Payload Marker, make sure that + when your filter detects End of Payload Marker, + - the usage of End of Payload Marker is actually allowed (i.e. End + of Input isn't used); and + - it also checks that there is no more input coming from the next + filter in the chain. + + The second requirement is slightly tricky. It's possible that the next + filter hasn't returned LZMA_STREAM_END yet. It may even need a few + bytes more input before it will do so. You need to give it as much + input as it needs, and verify that it doesn't produce any output. + + Don't call the next filter in the chain after it has returned + LZMA_STREAM_END (except in encoder if action == LZMA_SYNC_FLUSH). + It will result undefined behavior. + + Be pedantic. If the input data isn't exactly valid, reject it. + + At the moment, liblzma isn't modular. You will need to edit several + files in src/liblzma/common to include support for a new filter. grep + for LZMA_FILTER_LZMA to locate the files needing changes. + diff --git a/doc/liblzma-intro.txt b/doc/liblzma-intro.txt new file mode 100644 index 00000000..9cbd63a9 --- /dev/null +++ b/doc/liblzma-intro.txt @@ -0,0 +1,188 @@ + +Introduction to liblzma +----------------------- + +Writing applications to work with liblzma + + liblzma API is split in several subheaders to improve readability and + maintainance. The subheaders must not be #included directly; simply + use `#include <lzma.h>' instead. + + Those who have used zlib should find liblzma's API easy to use. + To developers who haven't used zlib before, I recommend learning + zlib first, because zlib has excellent documentation. + + While the API is similar to that of zlib, there are some major + differences, which are summarized below. + + For basic stream encoding, zlib has three functions (deflateInit(), + deflate(), and deflateEnd()). Similarly, there are three functions + for stream decoding (inflateInit(), inflate(), and inflateEnd()). + liblzma has only single coding and ending function. Thus, to + encode one may use, for example, lzma_stream_encoder_single(), + lzma_code(), and lzma_end(). Simlarly for decoding, one may + use lzma_auto_decoder(), lzma_code(), and lzma_end(). + + zlib has deflateReset() and inflateReset() to reset the stream + structure without reallocating all the memory. In liblzma, all + coder initialization functions are like zlib's reset functions: + the first-time initializations are done with the same functions + as the reinitializations (resetting). + + To make all this work, liblzma needs to know when lzma_stream + doesn't already point to an allocated and initialized coder. + This is achieved by initializing lzma_stream structure with + LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR + (for exampple when new lzma_stream has been allocated with malloc()). + This initialization should be done exactly once per lzma_stream + structure to avoid leaking memory. Calling lzma_end() will leave + lzma_stream into a state comparable to the state achieved with + LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR. + + Example probably clarifies a lot. With zlib, compression goes + roughly like this: + + z_stream strm; + deflateInit(&strm, level); + deflate(&strm, Z_RUN); + deflate(&strm, Z_RUN); + ... + deflate(&strm, Z_FINISH); + deflateEnd(&strm) or deflateReset(&strm) + + With liblzma, it's slightly different: + + lzma_stream strm = LZMA_STREAM_INIT; + lzma_stream_encoder_single(&strm, &options); + lzma_code(&strm, LZMA_RUN); + lzma_code(&strm, LZMA_RUN); + ... + lzma_code(&strm, LZMA_FINISH); + lzma_end(&strm) or reinitialize for new coding work + + Reinitialization in the last step can be any function that can + initialize lzma_stream; it doesn't need to be the same function + that was used for the previous initialization. If it is the same + function, liblzma will usually be able to re-use most of the + existing memory allocations (depends on how much the initialization + options change). If you reinitialize with different function, + liblzma will automatically free the memory of the previous coder. + + +File formats + + liblzma supports multiple container formats for the compressed data. + Different initialization functions initialize the lzma_stream to + process different container formats. See the details from the public + header files. + + The following functions are the most commonly used: + + - lzma_stream_encoder_single(): Encodes Single-Block Stream; this + the recommended format for most purporses. + + - lzma_alone_encoder(): Useful if you need to encode into the + legacy LZMA_Alone format. + + - lzma_auto_decoder(): Decoder that automatically detects the + file format; recommended when you decode compressed files on + disk, because this way compatibility with the legacy LZMA_Alone + format is transparent. + + - lzma_stream_decoder(): Decoder for Single- and Multi-Block + Streams; this is good if you want to accept only .lzma Streams. + + +Filters + + liblzma supports multiple filters (algorithm implementations). The new + .lzma format supports filter-chain having up to seven filters. In the + filter chain, the output of one filter is input of the next filter in + the chain. The legacy LZMA_Alone format supports only one filter, and + that must always be LZMA. + + General-purporse compression: + + LZMA The main algorithm of liblzma (surprise!) + + Branch/Call/Jump filters for executables: + + x86 This filter is known as BCJ in 7-Zip + IA64 IA-64 (Itanium) + PowerPC Big endian PowerPC + ARM + ARM-Thumb + SPARC + + Other filters: + + Copy Dummy filter that simply copies all the data + from input to output. + + Subblock Multi-purporse filter, that can + - embed End of Payload Marker if the previous + filter in the chain doesn't support it; and + - apply Subfilters, which filter only part + of the same compressed Block in the Stream. + + Branch/Call/Jump filters never change the size of the data. They + should usually be used as a pre-filter for some compression filter + like LZMA. + + +Integrity checks + + The .lzma Stream format uses CRC32 as the integrity check for + different file format headers. It is possible to omit CRC32 from + the Block Headers, but not from Stream Header. This is the reason + why CRC32 code cannot be disabled when building liblzma (in addition, + the LZMA encoder uses CRC32 for hashing, so that's another reason). + + The integrity check of the actual data is calculated from the + uncompressed data. This check can be CRC32, CRC64, or SHA256. + It can also be omitted completely, although that usually is not + a good thing to do. There are free IDs left, so support for new + checks algorithms can be added later. + + +API and ABI stability + + The API and ABI of liblzma isn't stable yet, although no huge + changes should happen. One potential place for change is the + lzma_options_subblock structure. + + In the 4.42.0alpha phase, the shared library version number won't + be updated even if ABI breaks. I don't want to track the ABI changes + yet. Just rebuild everything when you upgrade liblzma until we get + to the beta stage. + + +Size of the library + + While liblzma isn't huge, it is quite far from the smallest possible + LZMA implementation: full liblzma binary (with support for all + filters and other features) is way over 100 KiB, but the plain raw + LZMA decoder is only 5-10 KiB. + + To decrease the size of the library, you can omit parts of the library + by passing certain options to the `configure' script. Disabling + everything but the decoders of the require filters will usually give + you a small enough library, but if you need a decoder for example + embedded in the operating system kernel, the code from liblzma probably + isn't suitable as is. + + If you need a minimal implementation supporting .lzma Streams, you + may need to do partial rewrite. liblzma uses stateful API like zlib. + That increases the size of the library. Using callback API or even + simpler buffer-to-buffer API would allow smaller implementation. + + LZMA SDK contains smaller LZMA decoder written in ANSI-C than + liblzma, so you may want to take a look at that code. However, + it doesn't (at least not yet) support the new .lzma Stream format. + + +Documentation + + There's no other documentation than the public headers and this + text yet. Real docs will be written some day, I hope. + diff --git a/doc/liblzma-security.txt b/doc/liblzma-security.txt new file mode 100644 index 00000000..487637ed --- /dev/null +++ b/doc/liblzma-security.txt @@ -0,0 +1,219 @@ + +Using liblzma securely +---------------------- + +0. Introduction + + This document discusses how to use liblzma securely. There are issues + that don't apply to zlib or libbzip2, so reading this document is + strongly recommended even for those who are very familiar with zlib + or libbzip2. + + While making liblzma itself as secure as possible is essential, it's + out of scope of this document. + + +1. Memory usage + + The memory usage of liblzma varies a lot. + + +1.1. Problem sources + +1.1.1. Block coder + + The memory requirements of Block encoder depend on the used filters + and their settings. The memory requirements of the Block decoder + depend on the which filters and with which filter settings the Block + was encoded. Usually the memory requirements of a decoder are equal + or less than the requirements of the encoder with the same settings. + + While the typical memory requirements to decode a Block is from a few + hundred kilobytes to tens of megabytes, a maliciously constructed + files can require a lot more RAM to decode. With the current filters, + the maximum amount is about 7 GiB. If you use multi-threaded decoding, + every Block can require this amount of RAM, thus a four-threaded + decoder could suddenly try to allocate 28 GiB of RAM. + + If you don't limit the maximum memory usage in any way, and there are + no resource limits set on the operating system side, one malicious + input file can run the system out of memory, or at least make it swap + badly for a long time. This is exceptionally bad on servers e.g. + email server doing virus scanning on incoming messages. + + +1.1.2. Metadata decoder + + Multi-Block .lzma files contain at least one Metadata Block. + Externally the Metadata Blocks are similar to Data Blocks, so all + the issues mentioned about memory usage of Data Blocks applies to + Metadata Blocks too. + + The uncompressed content of Metadata Blocks contain information about + the Stream as a whole, and optionally some Extra Records. The + information about the Stream is kept in liblzma's internal data + structures in RAM. Extra Records can contain arbitrary data. They are + not interpreted by liblzma, but liblzma will provide them to the + application in uninterpreted form if the application wishes so. + + Usually the Uncompressed Size of a Metadata Block is small. Even on + extreme cases, it shouldn't be much bigger than a few megabytes. Once + the Metadata has been parsed into native data structures in liblzma, + it usually takes a little more memory than in the encoded form. For + all normal files, this is no problem, since the resulting memory usage + won't be too much. + + The problem is that a maliciously constructed Metadata Block can + contain huge amount of "information", which liblzma will try to store + in its internal data structures. This may cause liblzma to allocate + all the available RAM unless some kind of resource usage limits are + applied. + + Note that the Extra Records in Metadata are always parsed but, but + memory is allocated for them only if the application has requested + liblzma to provide the Extra Records to the application. + + +1.2. Solutions + + If you need to decode files from untrusted sources (most people do), + you must limit the memory usage to avoid denial of service (DoS) + conditions caused by malicious input files. + + The first step is to find out how much memory you are allowed consume + at maximum. This may be a hardcoded constant or derived from the + available RAM; whatever is appropriate in the application. + + The simplest solution is to use setrlimit() if the kernel supports + RLIMIT_AS, which limits the memory usage of the whole process. + For more portable and fine-grained limitting, you can use + memory limitter functions found from <lzma/memlimit.h>. + + +1.2.1. Encoder + + lzma_memory_usage() will give you a rough estimate about the memory + usage of the given filter chain. To dramatically simplify the internal + implementation, this function doesn't take into account all the small + helper data structures needed in various places; only the structures + with significant memory usage are taken into account. Still, the + accuracy of this function should be well within a mebibyte. + + The Subblock filter is a special case. If a Subfilter has been + specified, it isn't taken into account when lzma_memory_usage() + calculates the memory usage. You need to calculate the memory usage + of the Subfilter separately. + + Keeping track of Blocks in a Multi-Block Stream takes a few dozen + bytes of RAM per Block (size of the lzma_index structure plus overhead + of malloc()). It isn't a good idea to put tens of thousands of Blocks + into a Stream unless you have a very good reason to do so (compressed + dictionary could be an example of such situation). + + Also keep the number and sizes of Extra Records sane. If you produce + the list of Extra Records automatically from some untrusted source, + you should not only validate the content of these Records, but also + their memory usage. + + +1.2.2. Decoder + + A single-threaded decoder should simply use a memory limitter and + indicate an error if it runs out of memory. + + Memory-limitting with multi-threaded decoding is tricky. The simple + solution is to divide the maximum allowed memory usage with the + maximum allowed threads, and give each Block decoder their own + independent lzma_memory_limitter. The drawback is that if one Block + needs notably more RAM than any other Block, the decoder will run out + of memory when in reality there would be plenty of free RAM. + + An attractive alternative would be using shared lzma_memory_limitter. + Depending on the application and the expected type of input, this may + either be the best solution or a source of hard-to-repeat problems. + Consider the following requirements: + - You use at maximum of n threads. + - x(i) is the decoder memory requirements of the Block number i + in an expected input Stream. + - The memory limitter is set to higher value than the sum of n + highest values x(i). + + (If you are better at explaining the above conditions, please + contribute your improved version.) + + If the above conditions aren't met, it is possible that the decoding + will fail unpredictably. That is, on the same machine using the same + settings, the decoding may sometimes succeed and sometimes fail. This + is because sometimes threads may run so that the Blocks with highest + memory usage are tried to be decoded at the same time. + + Most .lzma files have all the Blocks encoded with identical settings, + or at least the memory usage won't vary dramatically. That's why most + multi-threaded decoders probably want to use the simple "separate + lzma_memory_limitter for each thread" solution, possibly fallbacking + to single-threaded mode in case the per-thread memory limits aren't + enough in multi-threaded mode. + +FIXME: Memory usage of Stream info. + +[ + +] + + +2. Huge uncompressed output + +2.1. Data Blocks + + Decoding a tiny .lzma file can produce huge amount of uncompressed + output. There is an example file of 45 bytes, which decodes to 64 PiB + (that's 2^56 bytes). Uncompressing such a file to disk is likely to + fill even a bigger disk array. If the data is written to a pipe, it + may not fill the disk, but would still take very long time to finish. + + To avoid denial of service conditions caused by huge amount of + uncompressed output, applications using liblzma should use some method + to limit the amount of output produced. The exact method depends on + the application. + + All valid .lzma Streams make it possible to find out the uncompressed + size of the Stream without actually uncompressing the data. This + information is available in at least one of the Metadata Blocks. + Once the uncompressed size is parsed, the decoder can verify that + it doesn't exceed certain limits (e.g. available disk space). + + When the uncompressed size is known, the decoder can actively keep + track of the amount of output produced so far, and that it doesn't + exceed the known uncompressed size. If it does exceed, the file is + known to be corrupt and an error should be indicated without + continuing to decode the rest of the file. + + Unfortunately, finding the uncompressed size beforehand is often + possible only in non-streamed mode, because the needed information + could be in the Footer Metdata Block, which (obviously) is at the + end of the Stream. In purely streamed mode decoding, one may need to + use some rough arbitrary limits to prevent the problems described in + the beginning of this section. + + +2.2. Metadata + + Metadata is stored in Metadata Blocks, which are very similar to + Data Blocks. Thus, the uncompressed size can be huge just like with + Data Blocks. The difference is, that the contents of Metadata Blocks + aren't given to the application as is, but parsed by liblzma. Still, + reading through a huge Metadata can take very long time, effectively + creating a denial of service like piping decoded a Data Block to + another process would do. + + At first it would seem that using a memory limitter would prevent + this issue as a side effect. But it does so only if the application + requests liblzma to allocate the Extra Records and provide them to + the application. If Extra Records aren't requested, they aren't + allocated either. Still, the Extra Records are being read through + to validate that the Metadata is in proper format. + + The solution is to limit the Uncompressed Size of a Metadata Block + to some relatively large value. This will make liblzma to give an + error when the given limit is reached. + diff --git a/doc/lzma-intro.txt b/doc/lzma-intro.txt new file mode 100644 index 00000000..bde8a059 --- /dev/null +++ b/doc/lzma-intro.txt @@ -0,0 +1,107 @@ + +Introduction to the lzma command line tool +------------------------------------------ + +Overview + + The lzma command line tool is similar to gzip and bzip2, but for + compressing and uncompressing .lzma files. + + +Supported file formats + + By default, the tool creates files in the new .lzma format. This can + be overriden with --format=FMT command line option. Use --format=alone + to create files in the old LZMA_Alone format. + + By default, the tool uncompresses both the new .lzma format and + LZMA_Alone format. This is to make it transparent to switch from + the old LZMA_Alone format to the new .lzma format. Since both + formats use the same filename suffix, average user should never + notice which format was used. + + +Differences to gzip and bzip2 + + Standard input and output + + Both gzip and bzip2 refuse to write compressed data to a terminal and + read compressed data from a terminal. With gzip (but not with bzip2), + this can be overriden with the `--force' option. lzma follows the + behavior of gzip here. + + Usage of LZMA_OPT environment variable + + gzip and bzip2 read GZIP and BZIP2 environment variables at startup. + These variables may contain extra command line options. + + gzip and bzip2 allow passing not only options, but also end-of-options + indicator (`--') and filenames via the environment variable. No quoting + is supported with the filenames. + + Here are examples with gzip. bzip2 behaves identically. + + bash$ echo asdf > 'foo bar' + bash$ GZIP='"foo bar"' gzip + gzip: "foo: No such file or directory + gzip: bar": No such file or directory + + bash$ GZIP=-- gzip --help + gzip: --help: No such file or directory + + lzma silently ignores all non-option arguments given via the + environment variable LZMA_OPT. Like on the command line, everything + after `--' is taken as non-options, and thus ignored in LZMA_OPT. + + bash$ LZMA_OPT='--help' lzma --version # Displays help + bash$ LZMA_OPT='-- --help' lzma --version # Displays version + + +Filter chain presets + + Like in gzip and bzip2, lzma supports numbered presets from 1 to 9 + where 1 is the fastest and 9 the best compression. 1 and 2 are for + fast compressing with small memory usage, 3 to 6 for good compression + ratio with medium memory usage, and 7 to 9 for excellent compression + ratio with higher memory requirements. The default is 7 if memory + usage limit allows. + + In future, there will probably be an option like --preset=NAME, which + will contain more special presets for specific file types. + + It's also possible that there will be some heuristics to select good + filters. For example, the tool could detect when a .tar archive is + being compressed, and enable x86 filter only for those files in the + .tar archive that are ELF or PE executables for x86. + + +Specifying custom filter chains + + Custom filter chains are specified by using long options with the name + of the filters in correct order. For example, to pass the input data to + the x86 filter and the output of that to the LZMA filter, the following + command will do: + + lzma --x86 --lzma filename + + Some filters accept options, which are specified as a comma-separated + list of key=value pairs: + + lzma --delta=distance=4 --lzma=dict=4Mi,lc=8,lp=2 filename + + +Memory usage control + + By default, the command line tool limits memory usage to 1/3 of the + available physical RAM. If no preset or custom filter chain has been + given, the default preset will be used. If the memory limit is too + low for the default preset, the tool will silently switch to lower + preset. + + When a preset or a custom filter chain has been specified and the + memory limit is too low, an error message is displayed and no files + are processed. + + If the decoder hits the memory usage limit, an error is displayed and + no more files are processed. + |