author Lasse Collin <lasse.collin@tukaani.org> 2007-12-09 00:42:33 +0200
committer Lasse Collin <lasse.collin@tukaani.org> 2007-12-09 00:42:33 +0200
commit 5d018dc03549c1ee4958364712fb0c94e1bf2741
tree 1b211911fb33fddb3f04b77f99e81df23623ffc4 /doc
Imported to git.
Diffstat (limited to '')
-rw-r--r-- doc/bugs.txt 46
-rw-r--r-- doc/faq.txt 247
-rw-r--r-- doc/file-format.txt 1861
-rw-r--r-- doc/history.txt 140
-rw-r--r-- doc/liblzma-advanced.txt 324
-rw-r--r-- doc/liblzma-hacking.txt 112
-rw-r--r-- doc/liblzma-intro.txt 188
-rw-r--r-- doc/liblzma-security.txt 219
-rw-r--r-- doc/lzma-intro.txt 107
9 files changed, 3244 insertions, 0 deletions
diff --git a/doc/bugs.txt b/doc/bugs.txt
new file mode 100644
index 00000000..55579343
--- /dev/null
+++ b/doc/bugs.txt
@@ -0,0 +1,46 @@
+
+Reporting bugs
+--------------
+
+ Naturally it is easiest for me if you already know what causes the
+ unexpected behavior. Even better if you have a patch to propose.
+ However, quite often the reason for unexpected behavior is unknown,
+ so below are a few things to do before sending a bug report.
+
+ In case of a crash (usually segmentation violation):
+
+ 1. Try to create a small example that reproduces the issue.
+
+ 2. If you are writing an application using liblzma or libzfile,
+ double check that you are using the libraries correctly (for
+ example, that you didn't forget to call lzma_init()). If it is
+ the command line tool included in LZMA Utils that is crashing,
+ ignore this step.
+
+ 3. Compile LZMA Utils with debugging code using the configure switch
+ `--enable-debug'. If you are using GCC as the compiler, use
+ CFLAGS='-O0 -ggdb'. Don't strip the resulting binaries.
+
+ 4. Turn on core dumps. The exact command depends on your shell;
+ for example in GNU bash it is done with `ulimit -c unlimited',
+ and in tcsh with `limit coredumpsize unlimited'.
+
+ 5. Try to reproduce the suspected bug. If you get an `assertion failed'
+ message, be sure to include the complete message in your bug
+ report. If the application leaves a coredump, get a backtrace
+ using gdb:
+ $ gdb /path/to/app-binary # Loads the app to the debugger.
+ (gdb) core core # Opens the coredump.
+ (gdb) bt # Prints the backtrace. Copy & paste to bug report.
+ (gdb) quit # Quits gdb.
+
+ Send your bug report to Lasse Collin <lasse.collin@tukaani.org>. Don't
+ send the core dump file or the actual executables. If you have a small
+ example file(s) (total size less than 100 KiB), please include it/them
+ as an attachment.
+
+ Do NOT complain about problems with LZMA Utils to Igor Pavlov.
+ Although the code of LZMA Utils is derived from his code, there are
+ a lot of changes, which may have introduced bugs not present in
+ the original version.
+
diff --git a/doc/faq.txt b/doc/faq.txt
new file mode 100644
index 00000000..d01cf91b
--- /dev/null
+++ b/doc/faq.txt
@@ -0,0 +1,247 @@
+
+LZMA Utils FAQ
+--------------
+
+ Copyright (C) 2007 Lasse Collin
+
+ Copying and distribution of this file, with or without modification,
+ are permitted in any medium without royalty provided the copyright
+ notice and this notice are preserved.
+
+
+Q: What are LZMA, LZMA Utils, lzma, .lzma, liblzma, LZMA SDK, LZMA_Alone,
+ 7-Zip and p7zip?
+
+A: LZMA stands for Lempel-Ziv-Markov chain-Algorithm. LZMA is the name
+ of the compression algorithm designed by Igor Pavlov. He is the author
+ of 7-Zip, which is a great LGPL'd compression tool for Microsoft
+ Windows operating systems. In addition to 7-Zip itself, LZMA SDK
+ is also available on the 7-Zip website. LZMA SDK contains LZMA
+ implementations in C++, Java and C#. The C++ version is the original
+ implementation, which is also used in 7-Zip itself.
+
+ Excluding the unrar plugin, 7-Zip is free software (free as in
+ freedom). Thanks to this, it was possible to port it to POSIX
+ platforms. The port was done and is maintained by myspace (TODO:
+ myspace's real name?). p7zip is a port of 7-Zip's command line version;
+ p7zip doesn't include 7-Zip's GUI.
+
+ In the POSIX world, users are used to the gzip and bzip2 command line
+ tools. Developers know the APIs of zlib and libbzip2. LZMA Utils try
+ to ease adoption of LZMA on free operating systems by providing a
+ compression library and a set of command line tools. The library is
+ called liblzma. It provides a zlib-like API, making it easy to adopt
+ LZMA compression in existing applications. The main command line tool
+ is known as lzma; its command line syntax is very similar to that of
+ gzip and bzip2.
+
+ The original command line tool from LZMA SDK (lzma.exe) was found in
+ a directory called LZMA_Alone in the LZMA SDK. It used a simple header
+ format in .lzma files. This format was also used by LZMA Utils up to
+ and including 4.32.x. In LZMA Utils documentation, LZMA_Alone refers
+ to both the file format and the command line tool from LZMA SDK.
+
+ Because of various limitations of the LZMA_Alone file format, a new
+ file format was developed. Extending some existing format such as .gz
+ used by gzip was considered, but these formats were found to be too
+ limited. The filename suffix for the new .lzma format is `.lzma'. The
+ same suffix is also used for files in the LZMA_Alone format. To make
+ the transition to the new format as smooth as possible, LZMA Utils
+ support both the new and old formats transparently.
+
+ 7-Zip and LZMA SDK: <http://7-zip.org/>
+ p7zip: <http://p7zip.sourceforge.net/>
+ LZMA Utils: <http://tukaani.org/lzma/>
+
+
+Q: What LZMA implementations are available?
+
+A: LZMA SDK contains implementations in C++, Java and C#. The C++ version
+ is the original implementation, which is part of 7-Zip. LZMA SDK
+ also contains a small LZMA decoder in C.
+
+ A port of LZMA SDK to Pascal was made by Alan Birtles
+ <http://www.birtles.org.uk/programming/>. It should work with
+ multiple Pascal programming language implementations.
+
+ LZMA Utils includes liblzma, which is directly based on LZMA SDK.
+ liblzma is written in C (C99, not C89). In contrast to the C++ callback
+ API used by LZMA SDK, liblzma uses a zlib-like stateful C API. I do not
+ want to comment on whether both/former/latter/neither API(s) are good or
+ bad. The only reason to implement a zlib-like API was that many
+ developers are already familiar with zlib, and very many applications
+ already use zlib. Having a similar API makes it easier to include LZMA
+ support in existing applications.
+
+ See also <http://en.wikipedia.org/wiki/LZMA#External_links>.
+
+
+Q: Which file formats are supported by LZMA Utils?
+
+A: Although the raw LZMA stream is always the same, it can be wrapped
+ in different container formats. The preferred format is the new .lzma
+ format. It has magic bytes (the first six bytes: 0xFF 'L' 'Z' 'M'
+ 'A' 0x00). The format supports chaining up to seven filters and
+ splitting data into multiple blocks for easier multi-threading and
+ rough random-access reading. The file integrity is verified using
+ CRC32, CRC64, or SHA256, and by verifying the uncompressed size of
+ the file.
+
+ LZMA SDK includes a tool called LZMA_Alone. It uses a
+ primitive header which includes only the mandatory stream information
+ required by the LZMA decoder. This format can be both read and
+ written by liblzma and the command line tool (use --format=alone to
+ create such files).
+
+ .7z is the native archive format used by 7-Zip. This format is not
+ supported by liblzma, and probably will never be supported. You
+ should use e.g. p7zip to extract .7z files.
+
+ It is possible to implement custom file formats by using raw filter
+ mode in liblzma. In this mode the application needs to store the filter
+ properties and provide them to liblzma before starting to uncompress
+ the data.
+
+
+Q: How can I identify files containing LZMA compressed data?
+
+A: The preferred filename suffix for .lzma files is `.lzma'. `.tar.lzma'
+ may be abbreviated to `.tlz'. The same suffixes are used for files in
+ LZMA_Alone format. In practice this should be no problem since tools
+ included in LZMA Utils support both formats transparently.
+
+ Checking the magic bytes is an easy way to detect files in the new .lzma
+ format (the first six bytes: 0xFF 'L' 'Z' 'M' 'A' 0x00). The "file"
+ command version FIXME contains magic strings for this format.
+
+ The old LZMA_Alone format has no magic bytes. Its header cannot contain
+ arbitrary bytes, thus it is possible to make a guess. Unfortunately the
+ guessing is usually too hard to be reliable, so don't try it unless you
+ are desperate.
+
+
+Q: Does the lzma command line tool support sparse files?
+
+A: Sparse files can (of course) be compressed like normal files, but
+ uncompression will not restore sparseness of the file. Use an archiver
+ tool to take care of sparseness before compressing the data with lzma.
+
+ The reason for this is that archiver tools handle files, while
+ compression tools handle streams or buffers. Being a sparse file is
+ a property of the file on the disk, not a property of the stream or
+ buffer.
+
+
+Q: Can I recover parts of a broken LZMA file (e.g. corrupted CD-R)?
+
+A: With LZMA_Alone and single-block .lzma files, you can uncompress the
+ file until you hit the first broken byte. The data after the broken
+ position is lost. LZMA relies on the uncompression history, and if
+ bytes are missing in the middle of the file, it is impossible to
+ reliably continue after the broken section.
+
+ With multi-block .lzma files it may be possible to locate the next
+ block in the file and continue decoding there. A limited recovery
+ tool for this kind of situation is planned.
+
+
+Q: Is LZMA patented?
+
+A: No, the authors are not aware of any patents that could affect LZMA.
+ However, due to the nature of software patents, the authors cannot
+ guarantee that LZMA isn't affected by any third party patent.
+
+
+Q: Where can I find documentation about how LZMA works as an algorithm?
+
+A: Read the source code, Luke. There is no documentation about LZMA
+ internals. It is possible that Igor Pavlov is the only person on
+ Earth who completely knows and understands the algorithm.
+
+ You could begin by downloading LZMA SDK, and start reading from
+ the LZMA decoder to get some idea about the bitstream format.
+ Before you begin, you should know the basics of LZ77 and
+ range coding algorithms. LZMA is based on LZ77, but LZMA is
+ *a lot* more complex. Range coding is used to compress the
+ final bitstream like Huffman coding is used in Deflate.
+
+
+Q: What are filters?
+
+A: In the context of .lzma files, a filter means an implementation of a
+ compression algorithm. The primary filter is LZMA, which is why
+ the names of the tools contain the letters LZMA.
+
+ liblzma and the new .lzma format also support filters other than LZMA.
+ There are different types of filters, which are suitable for different
+ types of data. Thus, to select the optimal filter and settings, the
+ type of the input data being compressed needs to be known.
+
+ Some filters are most useful when combined with another filter like
+ LZMA. These filters increase redundancy in the data, without changing
+ the size of the data, by taking advantage of properties specific to
+ the data being compressed.
+
+ So far, all the filters are always reversible. That is, no matter what
+ data you pass to a filter encoder, it can always be defiltered back to
+ the original form. Because of this, it is safe to compress for example
+ a software package that contains other file types than executables
+ using a filter specific to the architecture of the package being
+ compressed.
+
+ The old LZMA_Alone format supports only the LZMA filter.
+
+
+Q: I cannot find BCJ and BCJ2 filters. Don't they exist in liblzma?
+
+A: BCJ filter is called "x86" in liblzma. BCJ2 is not included,
+ because it requires using more than one encoded output stream.
+
+
+Q: Can I use LZMA in proprietary, non-free applications?
+
+A: liblzma is under the GNU LGPL version 2.1 or (at your option) any
+ later version. To summarise (*NOTE* This summary is not legally
+ binding, that is, it doesn't give you any extra permissions compared
+ to the LGPL. Read the GNU LGPL carefully for the exact license
+ conditions.):
+ * All the changes made to the library itself must be published
+ under the same license.
+ * End users must be able to replace the used liblzma. The easiest way
+ to ensure this is to link dynamically against liblzma so users
+ can replace the shared library file if they want.
+ * You must make it clear to your users that your application uses
+ liblzma, and that liblzma is free software under the GNU LGPL.
+ A copy of the GNU LGPL must be included.
+
+ LZMA SDK contains a special exception which allows linking *unmodified*
+ code statically with a non-free application. This exception does *not*
+ apply to liblzma.
+
+ As an alternative, you can support the development of LZMA and 7-Zip
+ by buying a proprietary license from Igor Pavlov. See the homepage of
+ LZMA SDK <http://7-zip.org/sdk.html> for more information. Note that
+ having a proprietary license from Igor Pavlov doesn't allow you to use
+ liblzma in a way that contradicts the GNU LGPL, because liblzma
+ contains code that is not copyrighted by Igor Pavlov. Please contact
+ both Lasse Collin and Igor Pavlov if the license conditions of liblzma
+ are not suitable for you.
+
+
+Q: I would like to help. What can I do?
+
+A: See the TODO file. Please contact Lasse Collin before starting to do
+ anything, because it is possible that someone else is already working
+ on the same thing.
+
+
+Q: How can I contact the authors?
+
+A: Lasse Collin is the maintainer of LZMA Utils. You can contact him
+ via IRC (Larhzu on #tukaani at Freenode or IRCnet) or by email at
+ <lasse.collin@tukaani.org>.
+
+ Igor Pavlov is the father of LZMA. He is the author of 7-Zip
+ and LZMA SDK. <http://7-zip.org/>
+
+ NOTE: Please don't bother Igor Pavlov with questions specific
+ to LZMA Utils.
+
diff --git a/doc/file-format.txt b/doc/file-format.txt
new file mode 100644
index 00000000..4a90a67d
--- /dev/null
+++ b/doc/file-format.txt
@@ -0,0 +1,1861 @@
+
+The .lzma File Format
+---------------------
+
+ 0. Preface
+ 0.1. Copyright Notices
+ 0.2. Changes
+ 1. Conventions
+ 1.1. Byte and Its Representation
+ 1.2. Multibyte Integers
+ 2. Stream
+ 2.1. Stream Types
+ 2.1.1. Single-Block Stream
+ 2.1.2. Multi-Block Stream
+ 2.2. Stream Header
+ 2.2.1. Header Magic Bytes
+ 2.2.2. Stream Flags
+ 2.2.3. CRC32
+ 3. Block
+ 3.1. Block Header
+ 3.1.1. Block Flags
+ 3.1.2. Compressed Size
+ 3.1.3. Uncompressed Size
+ 3.1.4. List of Filter Flags
+ 3.1.4.1. Misc
+ 3.1.4.2. External ID
+ 3.1.4.3. External Size of Properties
+ 3.1.4.4. Filter Properties
+ 3.1.5. CRC32
+ 3.1.6. Header Padding
+ 3.2. Compressed Data
+ 3.3. Block Footer
+ 3.3.1. Check
+ 3.3.2. Stream Footer
+ 3.3.2.1. Uncompressed Size
+ 3.3.2.2. Backward Size
+ 3.3.2.3. Stream Flags
+ 3.3.2.4. Footer Magic Bytes
+ 3.3.3. Footer Padding
+ 4. Filters
+ 4.1. Detecting when All Data Has Been Decoded
+ 4.1.1. With Uncompressed Size
+ 4.1.2. With End of Input
+ 4.1.3. With End of Payload Marker
+ 4.2. Alignment
+ 4.3. Filters
+ 4.3.1. Copy
+ 4.3.2. Subblock
+ 4.3.2.1. Format of the Encoded Output
+ 4.3.3. Delta
+ 4.3.3.1. Format of the Encoded Output
+ 4.3.4. LZMA
+ 4.3.4.1. LZMA Properties
+ 4.3.4.2. Dictionary Flags
+ 4.3.5. Branch/Call/Jump Filters for Executables
+ 5. Metadata
+ 5.1. Metadata Flags
+ 5.2. Size of Header Metadata Block
+ 5.3. Total Size
+ 5.4. Uncompressed Size
+ 5.5. Index
+ 5.5.1. Number of Data Blocks
+ 5.5.2. Total Sizes
+ 5.5.3. Uncompressed Sizes
+ 5.6. Extra
+ 5.6.1. 0x00: Dummy/Padding
+ 5.6.2. 0x01: OpenPGP Signature
+ 5.6.3. 0x02: Filter Information
+ 5.6.4. 0x03: Comment
+ 5.6.5. 0x04: List of Checks
+ 5.6.6. 0x05: Original Filename
+ 5.6.7. 0x07: Modification Time
+ 5.6.8. 0x09: High-Resolution Modification Time
+ 5.6.9. 0x0B: MIME Type
+ 5.6.10. 0x0D: Homepage URL
+ 6. Custom Filter and Extra Record IDs
+ 6.1. Reserved Custom Filter ID Ranges
+ 7. Cyclic Redundancy Checks
+ 8. References
+ 8.1. Normative References
+ 8.2. Informative References
+
+
+0. Preface
+
+ This document describes the .lzma file format (filename suffix
+ `.lzma', MIME type `application/x-lzma'). It is intended that
+ this format replace the format used by the LZMA_Alone tool
+ included in LZMA SDK up to and including version 4.43.
+
+ IMPORTANT: The version described in this document is a
+ draft, NOT a final, official version. Changes
+ are possible.
+
+
+0.1. Copyright Notices
+
+ Copyright (C) 2006, 2007 Lasse Collin <lasse.collin@tukaani.org>
+ Copyright (C) 2006 Ville Koskinen <w-ber@iki.fi>
+
+ Copying and distribution of this file, with or without
+ modification, are permitted in any medium without royalty
+ provided the copyright notice and this notice are preserved.
+ Modified versions must be marked as such.
+
+ All source code examples given in this document are put into
+ the public domain by the authors of this document.
+
+ Thanks for helping with this document go to Igor Pavlov,
+ Mark Adler and Mikko Pouru.
+
+
+0.2. Changes
+
+ Last modified: 2007-12-02 22:40+0200
+
+ (A changelog will be kept once the first official version
+ is made.)
+
+
+1. Conventions
+
+ The keywords `must', `must not', `required', `should',
+ `should not', `recommended', `may', and `optional' in this
+ document are to be interpreted as described in [RFC-2119].
+ These words are not capitalized in this document.
+
+ Indicating a warning means displaying a message, returning
+ appropriate exit status, or something else to let the user
+ know that something worth a warning occurred. The operation
+ should still finish if a warning is indicated.
+
+ Indicating an error means displaying a message, returning
+ appropriate exit status, or something else to let the user
+ know that something prevented successfully finishing the
+ operation. The operation must be aborted once an error has
+ been indicated.
+
+
+1.1. Byte and Its Representation
+
+ In this document, a byte is always 8 bits.
+
+ A `nul byte' has all bits unset. That is, the value of a nul
+ byte is 0x00.
+
+ To represent byte blocks, this document uses notation that
+ is similar to the notation used in [RFC-1952]:
+
+ +-------+
+ | Foo | One byte.
+ +-------+
+
+ +---+---+
+ | Foo | Two bytes; that is, some of the vertical bars
+ +---+---+ can be missing.
+
+ +=======+
+ | Foo | Zero or more bytes.
+ +=======+
+
+ In this document, a boxed byte or a byte sequence declared
+ using this notation is called `a field'. The example field
+ above would be called `the Foo field' or plain `Foo'.
+
+
+1.2. Multibyte Integers
+
+ Multibyte integers of static length, such as CRC values,
+ are stored in little endian byte order (least significant
+ byte first).
+
+ When smaller values are more likely than bigger values (e.g.
+ file sizes), multibyte integers are encoded in a simple
+ variable-length representation:
+ - Numbers in the range [0, 127] are copied as is, and take
+ one byte of space.
+ - Bigger numbers will occupy two or more bytes. The lowest
+ seven bits of every byte are used for data; the highest
+ (eighth) bit indicates either that
+ 0) the byte is in the middle of the byte sequence, or
+ 1) the byte is the first or the last byte.
+
+ For now, the value of the variable-length integers is limited
+ to 63 bits, which limits the encoded size of the integer to
+ nine bytes. These limits may be increased in the future if needed.
+
+ Note that the encoding is not as compact as it could be. For
+ example, it is possible to encode the number 42 using any
+ number of bytes between one and nine. This is convenient
+ for non-streamed encoders that write the Compressed Size or
+ Uncompressed Size fields to the Block Header (see Section 3.1)
+ after the Compressed Data field is written to the disk.
+
+ In several situations, the decoder needs to verify that two
+ fields contain identical information. When comparing fields
+ using the encoding described in this Section, the decoder must
+ consider two fields identical if their decoded values are
+ identical; it does not matter if the encoded variable-length
+ representations differ.
+
+ The following C code illustrates encoding and decoding 63-bit
+ variables; the highest bit of uint64_t must be unset. The
+ functions return the number of bytes occupied by the integer
+ (1-9), or zero on error.
+
+     #include <sys/types.h>
+     #include <inttypes.h>
+
+     /* Encodes num into buf; returns the encoded size or 0 on error. */
+     size_t
+     encode(uint8_t buf[static 9], uint64_t num)
+     {
+         /* Only 63-bit values can be encoded. */
+         if (num >= (UINT64_C(1) << (9 * 7)))
+             return 0;
+         /* Numbers in the range [0, 127] are copied as is. */
+         if (num <= 0x7F) {
+             buf[0] = num;
+             return 1;
+         }
+         /* First byte: the highest bit is set. */
+         buf[0] = (num & 0x7F) | 0x80;
+         num >>= 7;
+         size_t i = 1;
+         /* Middle bytes: the highest bit is unset. */
+         while (num >= 0x80) {
+             buf[i++] = num & 0x7F;
+             num >>= 7;
+         }
+         /* Last byte: the highest bit is set. */
+         buf[i++] = num | 0x80;
+         return i;
+     }
+
+     /* Decodes forwards, starting from buf[0]. */
+     size_t
+     decode(const uint8_t buf[], size_t size_max, uint64_t *num)
+     {
+         if (size_max == 0)
+             return 0;
+         if (size_max > 9)
+             size_max = 9;
+         *num = buf[0] & 0x7F;
+         if (!(buf[0] & 0x80))
+             return 1;
+         size_t i = 1;
+         do {
+             /* Ran out of input before the last byte was seen. */
+             if (i == size_max)
+                 return 0;
+             *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
+         } while (!(buf[i++] & 0x80));
+         return i;
+     }
+
+     /* Decodes backwards, starting from buf[size_max - 1]. */
+     size_t
+     decode_reverse(const uint8_t buf[], size_t size_max,
+             uint64_t *num)
+     {
+         if (size_max == 0)
+             return 0;
+         /* Never look further back than nine bytes. */
+         const size_t end = size_max > 9 ? size_max - 9 : 0;
+         size_t i = size_max - 1;
+         *num = buf[i] & 0x7F;
+         if (!(buf[i] & 0x80))
+             return 1;
+         do {
+             if (i-- == end)
+                 return 0;
+             *num <<= 7;
+             *num |= buf[i] & 0x7F;
+         } while (!(buf[i] & 0x80));
+         return size_max - i;
+     }
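+
+ As a sketch of the padded encodings described earlier in this
+ Section (the number 42 can occupy anywhere from one to nine
+ bytes), the hypothetical helper encode_padded() below writes a
+ value using exactly the requested number of bytes. Neither the
+ helper nor its name is part of this specification; decode()
+ repeats the forward decoder only so that the example is
+ self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the specification: encode num
 * using exactly `size` bytes (1-9). buf must have room for `size`
 * bytes. Returns size, or zero if num does not fit.
 */
size_t
encode_padded(uint8_t *buf, uint64_t num, size_t size)
{
    assert(buf != NULL);
    if (size < 1 || size > 9)
        return 0;
    /* Each byte carries seven bits of data. */
    if (num >= (UINT64_C(1) << (7 * size)))
        return 0;
    if (size == 1) {
        buf[0] = (uint8_t)num;  /* highest bit unset: single byte */
        return 1;
    }
    buf[0] = (uint8_t)((num & 0x7F) | 0x80);  /* first byte */
    num >>= 7;
    for (size_t i = 1; i < size - 1; ++i) {
        buf[i] = (uint8_t)(num & 0x7F);  /* middle bytes */
        num >>= 7;
    }
    buf[size - 1] = (uint8_t)(num | 0x80);  /* last byte */
    return size;
}

/* Forward decoder, same logic as decode() in this Section. */
size_t
decode(const uint8_t *buf, size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;
    if (size_max > 9)
        size_max = 9;
    *num = buf[0] & 0x7F;
    if (!(buf[0] & 0x80))
        return 1;
    size_t i = 1;
    do {
        if (i == size_max)
            return 0;
        *num |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
    } while (!(buf[i++] & 0x80));
    return i;
}
```

+ For example, encode_padded(buf, 42, 3) stores the bytes
+ AA 00 80, which decode() reads back as 42. A decoder comparing
+ this field against a one-byte encoding of 42 must treat the
+ two as identical.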
+
+
+2. Stream
+
+ +========+========+========+
+ | Stream | Stream | Stream | ...
+ +========+========+========+
+
+ A file usually contains only one Stream. However, it is
+ possible to concatenate multiple Streams together with no
+ additional processing. It is up to the implementation to
+ decide if the decoder will continue decoding from the next
+ Stream once the end of the first Stream has been reached.
+
+
+2.1. Stream Types
+
+ There are two types of Streams: Single-Block Streams and
+ Multi-Block Streams. Decoders conforming to this specification
+ must support at least Single-Block Streams. Supporting
+ Multi-Block Streams is optional. If the decoder supports only
+ Single-Block Streams, the documentation of the decoder should
+ mention this fact clearly.
+
+
+2.1.1. Single-Block Stream
+
+ +===============+============+
+ | Stream Header | Data Block |
+ +===============+============+
+
+ As the name says, a Single-Block Stream has exactly one Block.
+ The Block must be a Data Block; Metadata Blocks are not allowed
+ in Single-Block Streams.
+
+
+2.1.2. Multi-Block Stream
+
+ +===============+=======================+
+ | Stream Header | Header Metadata Block |
+ +===============+=======================+
+
+ +============+ +============+=======================+
+ ---> | Data Block | ... | Data Block | Footer Metadata Block |
+ +============+ +============+=======================+
+
+ Notes:
+ - Stream Header is mandatory.
+ - Header Metadata Block is optional.
+ - Each Multi-Block Stream has at least one Data Block. The
+ maximum number of Data Blocks is not limited.
+ - Footer Metadata Block is mandatory.
+
+
+2.2. Stream Header
+
+ +---+---+---+---+---+---+--------------+--+--+--+--+
+ | Header Magic Bytes | Stream Flags | CRC32 |
+ +---+---+---+---+---+---+--------------+--+--+--+--+
+
+
+2.2.1. Header Magic Bytes
+
+ The first six (6) bytes of the Stream are so-called Header
+ Magic Bytes. They can be used to identify the file type.
+
+ Using a C array and ASCII:
+ const uint8_t HEADER_MAGIC[6]
+ = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };
+
+ In plain hexadecimal:
+ FF 4C 5A 4D 41 00
+
+ Notes:
+ - The first byte (0xFF) was chosen so that the files cannot
+ be erroneously detected as being in LZMA_Alone format, in
+ which the first byte is in the range [0x00, 0xE0].
+ - The sixth byte (0x00) was chosen to prevent applications
+ from misdetecting the file as a text file.
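+
+ A minimal sketch of the detection, assuming the first bytes
+ of the file are already in memory. The function name
+ has_header_magic() is hypothetical and used only for
+ illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const uint8_t HEADER_MAGIC[6]
        = { 0xFF, 'L', 'Z', 'M', 'A', 0x00 };

/*
 * Hypothetical helper: returns true if buf begins with the
 * Header Magic Bytes. Fewer than six bytes never match.
 */
bool
has_header_magic(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    return size >= sizeof(HEADER_MAGIC)
            && memcmp(buf, HEADER_MAGIC, sizeof(HEADER_MAGIC)) == 0;
}
```

+ A buffer that does not match is not necessarily in the
+ LZMA_Alone format; it only means that the Stream Header of
+ this format is not present at that position.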
+
+
+2.2.2. Stream Flags
+
+     Bit(s)  Mask  Description
+      0-2    0x07  Type of Check (see Section 3.3.1):
+                       ID      Size      Check name
+                       0x00     0 bytes  None
+                       0x01     4 bytes  CRC32
+                       0x02     4 bytes  (Reserved)
+                       0x03     8 bytes  CRC64
+                       0x04    16 bytes  (Reserved)
+                       0x05    32 bytes  SHA-256
+                       0x06    32 bytes  (Reserved)
+                       0x07    64 bytes  (Reserved)
+       3     0x08  The CRC32 field is present in Block Headers.
+       4     0x10  If unset, this is a Single-Block Stream; if set,
+                   this is a Multi-Block Stream.
+      5-7    0xE0  Reserved for future use; must be zero for now.
+
+ Implementations must support at least the Check IDs 0x00 (None)
+ and 0x01 (CRC32). Supporting other Check IDs is optional. If an
+ unsupported Check is used, the decoder must indicate a warning
+ or error.
+
+ If any reserved bit is set, the decoder must indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Stream Header
+ incorrectly.
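+
+ The Stream Flags byte could be decoded along the following
+ lines. This is only a sketch: the structure and function
+ names are hypothetical, and deciding whether a given Check ID
+ is supported (and whether to warn or fail) is left to the
+ caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure holding the decoded Stream Flags. */
struct stream_flags {
    uint8_t check_id;      /* bits 0-2: type of Check */
    size_t check_size;     /* size of the Check field in bytes */
    bool has_block_crc32;  /* bit 3: CRC32 present in Block Headers */
    bool is_multi_block;   /* bit 4: Multi-Block Stream */
};

/* Check sizes indexed by Check ID, as listed in the table above. */
static const uint8_t CHECK_SIZES[8] = { 0, 4, 4, 8, 16, 32, 32, 64 };

/*
 * Returns false if any reserved bit (mask 0xE0) is set, in which
 * case the decoder must indicate an error.
 */
bool
parse_stream_flags(uint8_t byte, struct stream_flags *out)
{
    assert(out != NULL);
    if (byte & 0xE0)
        return false;
    out->check_id = byte & 0x07;
    out->check_size = CHECK_SIZES[out->check_id];
    out->has_block_crc32 = (byte & 0x08) != 0;
    out->is_multi_block = (byte & 0x10) != 0;
    return true;
}
```

+ Because the size of every Check ID is fixed, including the
+ reserved ones, the sketch can report the size of the Check
+ field even when the Check type itself is not supported.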
+
+
+2.2.3. CRC32
+
+ The CRC32 is calculated from the Stream Flags field. It is
+ stored as an unsigned 32-bit little endian integer. If the
+ calculated value does not match the stored one, the decoder
+ must indicate an error.
+
+ Note that this field is always present; the bit in Stream Flags
+ controls only presence of CRC32 in Block Headers.
+
+
+3. Block
+
+ +==============+=================+==============+
+ | Block Header | Compressed Data | Block Footer |
+ +==============+=================+==============+
+
+ There are two types of Blocks:
+ - Data Blocks hold the actual compressed data.
+ - Metadata Blocks hold the Index, Extra, and a few other
+ non-data fields (see Section 5).
+
+ The type of the Block is indicated by the corresponding bit
+ in the Block Flags field (see Section 3.1.1).
+
+
+3.1. Block Header
+
+ +------+------+=================+===================+
+ | Block Flags | Compressed Size | Uncompressed Size |
+ +------+------+=================+===================+
+
+ +======================+--+--+--+--+================+
+ ---> | List of Filter Flags | CRC32 | Header Padding |
+ +======================+--+--+--+--+================+
+
+
+3.1.1. Block Flags
+
+ The first byte of the Block Flags field is a bit field:
+
+     Bit(s)  Mask  Description
+      0-2    0x07  Number of filters (0-7)
+       3     0x08  Use End of Payload Marker (even if
+                   Uncompressed Size is stored to Block Header).
+       4     0x10  The Compressed Size field is present.
+       5     0x20  The Uncompressed Size field is present.
+       6     0x40  Reserved for future use; must be zero for now.
+       7     0x80  This is a Metadata Block.
+
+ The second byte of the Block Flags field is also a bit field:
+
+     Bit(s)  Mask  Description
+      0-4    0x1F  Size of the Header Padding field (0-31 bytes)
+      5-7    0xE0  Reserved for future use; must be zero for now.
+
+ The decoder must indicate an error if End of Payload Marker
+ is not used and Uncompressed Size is not stored to the Block
+ Header. Because of this, the first byte of Block Flags can
+ never be a nul byte. This is useful when detecting the beginning
+ of the Block after Footer Padding (see Section 3.3.3).
+
+ If any reserved bit is set, the decoder must indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Block Header
+ incorrectly.
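+
+ The rules of this Section can be sketched as follows. The
+ structure and function names are hypothetical; the sketch
+ only decodes and validates the two bytes of Block Flags and
+ does not read any of the fields that follow them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure holding the decoded Block Flags. */
struct block_flags {
    unsigned filter_count;       /* first byte, bits 0-2 */
    bool has_eopm;               /* bit 3: End of Payload Marker */
    bool has_compressed_size;    /* bit 4 */
    bool has_uncompressed_size;  /* bit 5 */
    bool is_metadata;            /* bit 7 */
    unsigned header_padding;     /* second byte, bits 0-4 */
};

/*
 * flags points to the two bytes of the Block Flags field.
 * Returns false when the decoder must indicate an error.
 */
bool
parse_block_flags(const uint8_t *flags, struct block_flags *out)
{
    assert(flags != NULL && out != NULL);
    /* Reserved bits must be zero. */
    if ((flags[0] & 0x40) || (flags[1] & 0xE0))
        return false;
    /* Either End of Payload Marker or Uncompressed Size is needed. */
    if (!(flags[0] & 0x08) && !(flags[0] & 0x20))
        return false;
    out->filter_count = flags[0] & 0x07;
    out->has_eopm = (flags[0] & 0x08) != 0;
    out->has_compressed_size = (flags[0] & 0x10) != 0;
    out->has_uncompressed_size = (flags[0] & 0x20) != 0;
    out->is_metadata = (flags[0] & 0x80) != 0;
    out->header_padding = flags[1] & 0x1F;
    return true;
}
```

+ Note that a first byte of 0x00 is rejected by the second
+ test, which matches the statement that the first byte of
+ Block Flags can never be a nul byte.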
+
+
+3.1.2. Compressed Size
+
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.1).
+
+ This field contains the size of the Compressed Data field.
+ The size is stored using the encoding described in Section 1.2.
+ If the Compressed Size does not match the real size of the
+ Compressed Data field, the decoder must indicate an error.
+
+ Having the Compressed Size field in the Block Header can be
+ useful for multithreaded decoding when seeking is not possible.
+ If the Blocks are small enough, the decoder can read multiple
+ Blocks into its internal buffer, and decode the Blocks in
+ parallel.
+
+ Compressed Size can also be useful when seeking forwards to
+ a specific location in streamed mode: the decoder can quickly
+ skip over irrelevant Blocks, without decoding them.
+
+
+3.1.3. Uncompressed Size
+
+ This field is present only if the appropriate bit is set in
+ the Block Flags field (see Section 3.1.1).
+
+ The Uncompressed Size field contains the size of the Block
+ after uncompressing.
+
+ Storing Uncompressed Size serves several purposes:
+ - The decoder will know when all of the data has been
+ decoded without an explicit End of Payload Marker.
+ - The decoder knows how much memory it needs to allocate
+ for a temporary buffer in multithreaded mode.
+ - Simple error detection: wrong size indicates a broken file.
+ - Sometimes it is useful to know the file size without
+ uncompressing the file.
+
+ It should be noted that the only reliable way to find out the
+ real uncompressed size is to uncompress the Block,
+ because the Block Header and Metadata Block fields may contain
+ (intentionally or unintentionally) invalid information.
+
+ Uncompressed Size is stored using the encoding described in
+ Section 1.2. If the Uncompressed Size does not match the
+ real uncompressed size, the decoder must indicate an error.
+
+
+3.1.4. List of Filter Flags
+
+ +================+================+ +================+
+ | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
+ +================+================+ +================+
+
+ The number of Filter Flags fields is stored in the Block Flags
+ field (see Section 3.1.1). As a special case, if the number of
+ Filter Flags fields is zero, it is equivalent to having the
+ Copy filter as the only filter.
+
+ The format of each Filter Flags field is as follows:
+
+ +------+=============+=============================+
+ | Misc | External ID | External Size of Properties |
+ +------+=============+=============================+
+
+ +===================+
+ ---> | Filter Properties |
+ +===================+
+
+ The list of officially defined Filter IDs and the formats of
+ their Filter Properties are described in Section 4.3.
+
+
+3.1.4.1. Misc
+
+ To save space, the most commonly used Filter IDs and the
+ Size of Filter Properties are encoded in a single byte.
+ Depending on the contents of the Misc field, Filter ID is
+ the value of the Misc or External ID field.
+
+     Value          Filter ID      Size of Filter Properties
+     0x00 - 0x1F    Misc           0 bytes
+     0x20 - 0x3F    Misc           1 byte
+     0x40 - 0x5F    Misc           2 bytes
+     0x60 - 0x7F    Misc           3 bytes
+     0x80 - 0x9F    Misc           4 bytes
+     0xA0 - 0xBF    Misc           5 bytes
+     0xC0 - 0xDF    Misc           6 bytes
+     0xE0 - 0xFE    External ID    0-30 bytes
+     0xFF           External ID    External Size of Properties
+
+ The following code demonstrates parsing the Misc field and,
+ when needed, the External ID and External Size of Properties
+ fields.
+
+     uint64_t id;
+     uint64_t properties_size;
+     uint8_t misc = read_byte();
+
+     if (misc >= 0xE0) {
+         id = read_variable_length_integer();
+
+         if (misc == 0xFF)
+             properties_size = read_variable_length_integer();
+         else
+             properties_size = misc - 0xE0;
+
+     } else {
+         id = misc;
+         properties_size = misc / 0x20;
+     }
+
+
+3.1.4.2. External ID
+
+ This field is present only if the Misc field contains a value
+ that indicates usage of External ID. The External ID is stored
+ using the encoding described in Section 1.2.
+
+
+3.1.4.3. External Size of Properties
+
+ This field is present only if the Misc field contains a value
+ that indicates usage of External Size of Properties. The size
+ of Filter Properties is stored using the encoding described in
+ Section 1.2.
+
+
+3.1.4.4. Filter Properties
+
+ Size of this field depends on the Misc field (Section 3.1.4.1)
+ and, if present, External Size of Properties field (Section
+ 3.1.4.3). The format of this field depends on the selected
+ filter; see Section 4.3 for details.
+
+
+3.1.5. CRC32
+
+ This field is present only if the appropriate bit is set in
+ the Stream Flags field (see Section 2.2.2).
+
+ The CRC32 is calculated over everything in the Block Header
+ field except the Header Padding field and the CRC32 field
+ itself. It is stored as an unsigned 32-bit little endian
+ integer. If the calculated value does not match the stored
+ one, the decoder must indicate an error.
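+
+ A minimal sketch of the comparison follows. This Section does
+ not name the CRC32 polynomial, so the code assumes the common
+ reflected polynomial 0xEDB88320 with initial value and final
+ XOR of 0xFFFFFFFF; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Bitwise CRC32 (assumed polynomial: reflected 0xEDB88320).
static uint32_t crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1)
                crc = (crc >> 1) ^ 0xEDB88320;
            else
                crc >>= 1;
        }
    }

    return crc ^ 0xFFFFFFFF;
}

// Read an unsigned 32-bit little endian integer.
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// covered: the header bytes that the CRC32 is calculated over.
static bool header_crc32_matches(const uint8_t *covered, size_t size,
        const uint8_t stored[4])
{
    return crc32_calc(covered, size) == read_le32(stored);
}
```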
+
+
+3.1.6. Header Padding
+
+ This field contains as many nul bytes as indicated by the value
+ stored in the Header Flags field. If the Header Padding field
+ contains any non-nul bytes, the decoder must indicate an error.
+
+ The intent of the Header Padding field is to allow alignment
+ of Compressed Data. The usefulness of alignment is described
+ in Section 4.2.
+
+
+3.2. Compressed Data
+
+ The format of Compressed Data depends on Block Flags and List
+ of Filter Flags. Excluding the descriptions of the simplest
+ filters in Section 4, the format of the filter-specific encoded
+ data is out of scope of this document.
+
+ Note a special case: if End of Payload Marker (see Section
+ 3.1.1) is not used and Uncompressed Size is zero, the size
+ of the Compressed Data field is always zero.
+
+
+3.3. Block Footer
+
+ +=======+===============+================+
+ | Check | Stream Footer | Footer Padding |
+ +=======+===============+================+
+
+
+3.3.1. Check
+
+ The type and size of the Check field depend on which bits
+ are set in the Stream Flags field (see Section 2.2.2).
+
+ The Check, when used, is calculated from the original
+ uncompressed data. If the calculated Check does not match the
+ stored one, the decoder must indicate an error. If the selected
+ type of Check is not supported by the decoder, it must indicate
+ a warning or error.
+
+
+3.3.2. Stream Footer
+
+ +===================+===============+--------------+
+ | Uncompressed Size | Backward Size | Stream Flags |
+ +===================+===============+--------------+
+
+ +----------+---------+
+ ---> | Footer Magic Bytes |
+ +----------+---------+
+
+ Stream Footer is present only in
+ - Data Block of a Single-Block Stream; and
+ - Footer Metadata Block of a Multi-Block Stream.
+
+ The Stream Footer field is placed inside Block Footer, because
+ no padding is allowed between Check and Stream Footer.
+
+
+3.3.2.1. Uncompressed Size
+
+ This field is present only in the Data Block of a Single-Block
+ Stream if Uncompressed Size is not stored to the Block Header
+ (see Section 3.1.1). Without the Uncompressed Size field in
+ Stream Footer it would not be possible to quickly find out
+ the Uncompressed Size of the Stream in all cases.
+
+ Uncompressed Size is stored using the encoding described in
+ Section 1.2. If the stored value does not match the real
+ uncompressed size of the Single-Block Stream, the decoder must
+ indicate an error.
+
+
+3.3.2.2. Backward Size
+
+ This field contains the total size of the Block Header,
+ Compressed Data, Check, and Uncompressed Size fields. The
+ value is stored using the encoding described in Section 1.2.
+ If the Backward Size does not match the real total size of
+ the appropriate fields, the decoder must indicate an error.
+
+ Implementations reading the Stream backwards should note
+ that the value in this field can never be zero.
+
+
+3.3.2.3. Stream Flags
+
+ This is a copy of the Stream Flags field from the Stream
+ Header. The information stored to Stream Flags is needed
+ when parsing the Stream backwards.
+
+
+3.3.2.4. Footer Magic Bytes
+
+ As the last step of the decoding process, the decoder must
+ verify the existence of Footer Magic Bytes. If they are not
+ found, an error must be indicated.
+
+ Using a C array and ASCII:
+ const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
+
+ In hexadecimal:
+ 59 5A
+
+ The primary reason to have Footer Magic Bytes is to make
+ it easier to detect incomplete files quickly, without
+ uncompressing. If the file does not end with Footer Magic Bytes
+ (excluding Footer Padding described in Section 3.3.3), it
+ cannot be undamaged, unless someone has intentionally appended
+ garbage after the end of the Stream. (Appending garbage at the
+ end of the file does not prevent uncompressing the file, but
+ may give a warning or error depending on the decoder
+ implementation.)
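+
+ The quick check described above might look like the following
+ sketch. It assumes that the whole file, or at least its tail,
+ is available in a memory buffer; the function name is
+ hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool ends_with_footer_magic(const uint8_t *buf, size_t size)
{
    // Skip Footer Padding, which may contain only nul bytes.
    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    // The last two remaining bytes must be 'Y' and 'Z' (59 5A).
    return size >= 2 && buf[size - 2] == 'Y' && buf[size - 1] == 'Z';
}
```

+ A positive result only means that the file is not obviously
+ truncated; it does not replace the verification that the
+ decoder must do as the last step of decoding.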
+
+
+3.3.3. Footer Padding
+
+ In certain situations it is convenient to be able to pad
+ Blocks or Streams to be multiples of, for example, 512 bytes.
+ Footer Padding makes this possible. Note that this is in no
+ way required to enforce alignment in the way described in
+ Section 4.2; the Header Padding field is enough for that.
+
+ When Footer Padding is used, it must contain only nul bytes.
+ Any non-nul byte should be considered as the beginning of
+ a new Block or Stream.
+
+ The possibility of Padding should be taken into account when
+ designing an application that wants to find out information
+ about a Stream by parsing Footer Metadata Block.
+
+ Support for Padding was inspired by a related note in
+ [GNU-tar].
+
+
+4. Filters
+
+ The Block Flags field defines how many filters are used. When
+ more than one filter is used, the filters are chained; that is,
+ the output of one filter is the input of another filter. The
+ following figure illustrates the direction of data flow.
+
+ v Uncompressed Data ^
+ | Filter 0 |
+ Encoder | Filter 1 | Decoder
+ | ... |
+ | Filter n |
+ v Compressed Data ^
+
+ The filters are independent from each other, except that they
+ must cooperate a little to make it possible, in all cases, to
+ detect when all of the data has been decoded. In addition, the
+ filters should cooperate in the encoder to keep the alignment
+ optimal.
+
+
+4.1. Detecting when All Data Has Been Decoded
+
+ There must be a way for the decoder to detect when all of the
+ Compressed Data has been decoded. This is simple when only
+ one filter is used, but a bit more complex when multiple
+ filters are chained.
+
+ This file format supports three methods to detect when all of
+ the data has been decoded:
+ - Uncompressed size
+ - End of Input
+ - End of Payload Marker
+
+ In both encoder and decoder, filters are initialized starting
+ from the first filter in the chain. For each filter, one of
+ these three methods is used.
+
+
+4.1.1. With Uncompressed Size
+
+ This method is the only method supported by all filters.
+ It must be used when uncompressed size is known by the
+ filter-specific encoder or decoder. In practice this means
+ that Uncompressed Size has been stored to the Block Header.
+
+ In case of the first filter in the chain, the uncompressed size
+ given to the filter-specific encoder or decoder equals the
+ Uncompressed Size stored in the Block Header. For the rest of
+ the filters in the chain, uncompressed size is the size of the
+ output data of the previous filter in the chain.
+
+ Note that when Use End of Payload Marker bit is set in Block
+ Flags, Uncompressed Size is considered to be unknown even if
+ it was present in the Block Header. Thus, if End of Payload
+ Marker is used, uncompressed size of all of the filters in
+ the chain is unknown, and can never be used to detect when
+ all of the data has been decoded.
+
+ Once the correct number of bytes has been written out, the
+ filter-specific decoder indicates to its caller that all of
+ the data has been decoded. If the filter-specific decoder
+ detects End of Input or End of Payload Marker before the
+ correct number of bytes is decoded, the decoder must indicate
+ an error.
+
+
+4.1.2. With End of Input
+
+ Most filters will know that all of the data has been decoded
+ when the End of Input data has been reached. Once the filter
+ knows that it has received the input data in its entirety,
+ it finishes its job, and indicates to its caller that all of
+ the data has been decoded. The filter-specific decoder must
+ indicate an error if it detects End of Payload Marker.
+
+ Note that this method can work only when the filter is not
+ the last filter in the chain, because only another filter
+ can indicate the End of Input data. In practice this means
+ that a filter later in the chain must support embedding
+ End of Payload Marker.
+
+ When a filter that cannot embed End of Payload Marker is the
+ last filter in the chain, Subblock filter is appended to the
+ chain as an implicit filter. In the simplest case, this occurs
+ when no filters are specified, and Uncompressed Size is unknown
+ or the End of Payload Marker bit is set in Block Flags.
+
+
+4.1.3. With End of Payload Marker
+
+ End of Payload Marker is a filter-specific bit sequence that
+ indicates the end of data. It is supported by only a few
+ filters. It is used when uncompressed size is unknown, and
+ the filter
+ - doesn't support End of Input; or
+ - is the last filter in the chain.
+
+ End of Payload Marker is embedded at the end of the encoded
+ data by the filter-specific encoder. When the filter-specific
+ decoder detects the embedded End of Payload Marker, the decoder
+ knows that all of the data has been decoded. Then it finishes
+ its job, and indicates to its caller that all of the data has
+ been decoded. If the filter-specific decoder detects End of
+ Input before End of Payload Marker, the decoder must indicate
+ an error.
+
+ If the filter supports both End of Input and End of Payload
+ Marker, the former is used, unless the filter is the last
+ filter in the chain.
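+
+ The rules of Sections 4.1.1 to 4.1.3 can be summarized as a
+ small decision function. This is only a sketch of one reading
+ of the rules; the enumeration and the function name are
+ hypothetical. The caller is assumed to pass size_known as
+ false whenever the End of Payload Marker bit is set in Block
+ Flags.

```c
#include <stdbool.h>

enum end_detection {
    END_BY_UNCOMPRESSED_SIZE,
    END_BY_END_OF_INPUT,
    END_BY_PAYLOAD_MARKER,
    // No method fits; if the filter is the last one in the chain,
    // an implicit Subblock filter would have to be appended.
    END_NOT_POSSIBLE,
};

static enum end_detection choose_end_detection(bool size_known,
        bool is_last, bool supports_end_of_input, bool supports_eopm)
{
    // Uncompressed size must be used whenever it is known.
    if (size_known)
        return END_BY_UNCOMPRESSED_SIZE;

    // End of Input is preferred unless the filter is the last one.
    if (supports_end_of_input && !is_last)
        return END_BY_END_OF_INPUT;

    if (supports_eopm)
        return END_BY_PAYLOAD_MARKER;

    return END_NOT_POSSIBLE;
}
```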
+
+
+4.2. Alignment
+
+ Some filters give better compression ratio or are faster
+ when the input or output data is aligned. For optimal results,
+ the encoder should try to enforce proper alignment when
+ possible. Not enforcing alignment in the encoder is not
+ an error. Thus, the decoder must be able to handle files with
+ suboptimal alignment.
+
+ Alignment of uncompressed input data is usually the job of
+ the application producing the data. For example, to get the
+ best results, an archiver tool should make sure that all
+ PowerPC executable files in the archive stream start at
+ offsets that are multiples of four bytes.
+
+ Some filters, for example LZMA, can be configured to take
+ advantage of specified alignment of input data. Note that
+ taking advantage of aligned input can be beneficial also when
+ a filter is not the first filter in the chain. For example,
+ if you compress PowerPC executables, you may want to use the
+ PowerPC filter and chain that with the LZMA filter. Because not
+ only the input but also the output alignment of the PowerPC
+ filter is four bytes, it is now beneficial to set LZMA settings
+ so that the LZMA encoder can take advantage of its
+ four-byte-aligned input data.
+
+ The output of the last filter in the chain is stored to the
+ Compressed Data field. Aligning Compressed Data appropriately
+ can increase
+ - speed, if the filtered data is handled multiple bytes at
+ a time by the filter-specific encoder and decoder,
+ because accessing aligned data in computer memory is
+ usually faster; and
+ - compression ratio, if the output data is later compressed
+ with an external compression tool.
+
+ Compressed Data in a Stream can be aligned by using the Header
+ Padding field in the Block Header.
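+
+ As a sketch, the amount of Header Padding needed to align
+ Compressed Data could be calculated as follows. The function
+ name is hypothetical, and the code does not check the result
+ against the maximum padding that Header Flags can express.

```c
#include <stdint.h>

// offset: position where Compressed Data would start without
// padding. alignment: wanted alignment in bytes; must be non-zero.
static uint32_t header_padding_needed(uint64_t offset, uint32_t alignment)
{
    return (uint32_t)((alignment - offset % alignment) % alignment);
}
```

+ For example, if Compressed Data would start at offset 13 and
+ four-byte alignment is wanted, three bytes of Header Padding
+ are needed.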
+
+
+4.3. Filters
+
+4.3.1. Copy
+
+ This is a dummy filter that simply copies all data from input
+ to output unmodified.
+
+ Filter ID: 0x00
+ Size of Filter Properties: 0 bytes
+ Changes size of data: No
+
+ Detecting when all of the data has been decoded:
+ Uncompressed size: Yes
+ End of Payload Marker: No
+ End of Input: Yes
+
+ Preferred alignment:
+ Input data: 1 byte
+ Output data: 1 byte
+
+
+4.3.2. Subblock
+
+ The Subblock filter can be used to
+ - embed End of Payload Marker when the otherwise last
+ filter in the chain does not support embedding it; and
+ - apply additional filters in the middle of a Block.
+
+ Filter ID: 0x01
+ Size of Filter Properties: 0 bytes
+ Changes size of data: Yes, unpredictably
+
+ Detecting when all of the data has been decoded:
+ Uncompressed size: Yes
+ End of Payload Marker: Yes
+ End of Input: Yes
+
+ Preferred alignment:
+ Input data: 1 byte
+ Output data: Freely adjustable
+
+
+4.3.2.1. Format of the Encoded Output
+
+ The encoded data from the Subblock filter consist of zero or
+ more Subblocks:
+
+ +==========+==========+
+ | Subblock | Subblock | ...
+ +==========+==========+
+
+ Each Subblock contains two fields:
+
+ +----------------+===============+
+ | Subblock Flags | Subblock Data |
+ +----------------+===============+
+
+ Subblock Flags is a bitfield:
+
+     Bits   Mask   Description
+     0-3    0x0F   The interpretation of these bits depends on
+                   the Subblock Type:
+                     - 0x20     Bits 0-3 for Size
+                     - 0x30     Bits 0-3 for Repeat Count
+                     - Other    These bits must be zero.
+     4-7    0xF0   Subblock Type:
+                     - 0x00: Padding
+                     - 0x10: End of Payload Marker
+                     - 0x20: Data
+                     - 0x30: Repeating Data
+                     - 0x40: Set Subfilter
+                     - 0x50: Unset Subfilter
+                   If some other value is detected, the decoder
+                   must indicate an error.
+
+ The format of the Subblock Data field depends on Subblock Type.
+
+ Subblocks with the Subblock Type 0x00 (Padding) don't have a
+ Subblock Data field. These Subblocks can be useful for fixing
+ alignment. There can be a maximum of 31 consecutive Subblocks
+ with this Subblock Type; if there are more, the decoder must
+ indicate an error.
+
+ A Subblock with the Subblock Type 0x10 (End of Payload Marker)
+ doesn't have a Subblock Data field. The decoder must indicate
+ an error if this Subblock Type is detected when Subfilter is
+ enabled, or when the Subblock filter is not supposed to embed
+ the End of Payload Marker.
+
+ Subblocks with the Subblock Type 0x20 (Data) contain the rest
+ of the Size, which is followed by Size + 1 bytes in the Data
+ field (that is, Data can never be empty):
+
+ +------+------+------+======+
+ | Bits 4-27 for Size | Data |
+ +------+------+------+======+
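+
+ As an illustration, the four header bytes of a Data Subblock
+ (Subblock Flags plus bits 4-27 of Size) could be written and
+ read back as follows. The byte order is assumed to be lowest
+ bits first, matching the decoder structure shown later in this
+ Section; the function names are hypothetical.

```c
#include <stdint.h>

// data_size is the real size of Data: 1 to 2^28 bytes.
static void encode_data_header(uint32_t data_size, uint8_t out[4])
{
    const uint32_t size = data_size - 1;  // stored value is Size

    out[0] = (uint8_t)(0x20 | (size & 0x0F));  // Type 0x20, bits 0-3
    out[1] = (uint8_t)(size >> 4);             // bits 4-11
    out[2] = (uint8_t)(size >> 12);            // bits 12-19
    out[3] = (uint8_t)(size >> 20);            // bits 20-27
}

// Returns the real size of Data (Size + 1).
static uint32_t decode_data_size(const uint8_t in[4])
{
    uint32_t size = in[0] & 0x0F;

    size |= (uint32_t)in[1] << 4;
    size |= (uint32_t)in[2] << 12;
    size |= (uint32_t)in[3] << 20;

    return size + 1;
}
```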
+
+ Subblocks with the Subblock Type 0x30 (Repeating Data) contain
+ the rest of the Repeat Count, the Size of the Data, and finally
+ the actual Data to be repeated:
+
+ +---------+---------+--------+------+======+
+ | Bits 4-27 for Repeat Count | Size | Data |
+ +---------+---------+--------+------+======+
+
+ The size of the Data field is Size + 1. It is repeated Repeat
+ Count + 1 times. That is, the minimum size of Data is one byte;
+ the maximum size of Data is 256 bytes. The minimum number of
+ repeats is one; the maximum number of repeats is 2^28.
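+
+ The expansion of a Repeating Data Subblock can be sketched as
+ below, assuming that no Subfilter is in use and that the
+ output goes to a caller-provided buffer. The function name and
+ the convention of returning zero when the output does not fit
+ are choices made for this example only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// size_field and repeat_field are the stored Size and Repeat Count.
// Returns the number of bytes written, or 0 if out is too small.
static size_t expand_repeating_data(const uint8_t *data,
        uint8_t size_field, uint32_t repeat_field,
        uint8_t *out, size_t out_size)
{
    const size_t data_size = (size_t)size_field + 1;      // 1 to 256
    const uint64_t repeats = (uint64_t)repeat_field + 1;  // 1 to 2^28

    if (repeats > out_size / data_size)
        return 0;

    for (uint64_t i = 0; i < repeats; ++i)
        memcpy(out + i * data_size, data, data_size);

    return (size_t)(repeats * data_size);
}
```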
+
+ If Subfilter is not used, the Data field of Subblock Types 0x20
+ and 0x30 is the output of the decoded Subblock filter. If
+ Subfilter is used, Data is the input of the Subfilter, and the
+ decoded output of the Subfilter is the decoded output of the
+ Subblock filter.
+
+ Subblocks with the Subblock Type 0x40 (Set Subfilter) contain
+ a Filter Flags field in Subblock Data:
+
+ +==============+
+ | Filter Flags |
+ +==============+
+
+ It is an error to set the Subfilter to Filter ID 0x00 (Copy)
+ or 0x01 (Subblock). All the other Filter IDs are allowed.
+ The decoder must indicate an error if this Subblock Type is
+ detected when a Subfilter is already enabled.
+
+ Subblocks with the Subblock Type 0x50 (Unset Subfilter) don't
+ have a Subblock Data field. There must be at least one Subblock
+ with Subblock Type 0x20 or 0x30 between Subblocks with Subblock
+ Type 0x40 and 0x50; if there isn't, the decoder must indicate
+ an error.
+
+ Subblock Types 0x40 and 0x50 are always used as a pair: If the
+ Subfilter has been enabled with Subblock Type 0x40, it
+ must always be disabled later with Subblock Type 0x50.
+ Disabling must be done even if the Subfilter used End of
+ Payload Marker; after the Subfilter has detected End of Payload
+ Marker, the next Subblock that is not Padding must unset the
+ Subfilter.
+
+ When the Subblock filter is used as an implicit filter to embed
+ End of Payload Marker, the Subblock Types 0x40 and 0x50 (Set or
+ Unset Subfilter) must not be used. The decoder must indicate an
+ error if it detects any of these Subblock Types in an implicit
+ Subblock filter.
+
+ The following code illustrates the basic structure of a
+ Subblock decoder.
+
+     uint32_t consecutive_padding = 0;
+     bool got_output_with_subfilter = false;
+
+     while (true) {
+         uint32_t size;
+         uint32_t repeat;
+         uint8_t flags = read_byte();
+
+         if (flags != 0)
+             consecutive_padding = 0;
+
+         switch (flags >> 4) {
+         case 0:
+             // Padding
+             if (flags & 0x0F)
+                 return DATA_ERROR;
+             if (++consecutive_padding == 32)
+                 return DATA_ERROR;
+             break;
+
+         case 1:
+             // End of Payload Marker
+             if (flags & 0x0F)
+                 return DATA_ERROR;
+             if (subfilter_enabled || !allow_eopm)
+                 return DATA_ERROR;
+             break;
+
+         case 2:
+             // Data
+             size = flags & 0x0F;
+             for (size_t i = 4; i < 28; i += 8)
+                 size |= (uint32_t)(read_byte()) << i;
+
+             // If any output is produced, this will
+             // set got_output_with_subfilter to true.
+             copy_data(size);
+             break;
+
+         case 3:
+             // Repeating Data
+             repeat = flags & 0x0F;
+             for (size_t i = 4; i < 28; i += 8)
+                 repeat |= (uint32_t)(read_byte()) << i;
+             size = read_byte();
+
+             // If any output is produced, this will
+             // set got_output_with_subfilter to true.
+             copy_repeating_data(size, repeat);
+             break;
+
+         case 4:
+             // Set Subfilter
+             if (flags & 0x0F)
+                 return DATA_ERROR;
+             if (subfilter_enabled)
+                 return DATA_ERROR;
+             got_output_with_subfilter = false;
+             set_subfilter();
+             break;
+
+         case 5:
+             // Unset Subfilter
+             if (flags & 0x0F)
+                 return DATA_ERROR;
+             if (!subfilter_enabled)
+                 return DATA_ERROR;
+             if (!got_output_with_subfilter)
+                 return DATA_ERROR;
+             unset_subfilter();
+             break;
+
+         default:
+             return DATA_ERROR;
+         }
+     }
+
+
+4.3.3. Delta
+
+ The Delta filter may increase compression ratio when the value
+ of the next byte correlates with the value of an earlier byte
+ at specified distance.
+
+ Filter ID: 0x20
+ Size of Filter Properties: 1 byte
+ Changes size of data: No
+
+ Detecting when all of the data has been decoded:
+ Uncompressed size: Yes
+ End of Payload Marker: No
+ End of Input: Yes
+
+ Preferred alignment:
+ Input data: 1 byte
+ Output data: Same as the original input data
+
+ The Properties byte indicates the delta distance, which can be
+ 1-256 bytes backwards from the current byte: 0x00 indicates
+ distance of 1 byte and 0xFF distance of 256 bytes.
+
+
+4.3.3.1. Format of the Encoded Output
+
+ The code below illustrates both encoding and decoding with
+ the Delta filter.
+
+     // Distance is in the range [1, 256].
+     const unsigned int distance = get_properties_byte() + 1;
+     uint8_t pos = 0;
+     uint8_t delta[256];
+
+     memset(delta, 0, sizeof(delta));
+
+     while (1) {
+         const int byte = read_byte();
+         if (byte == EOF)
+             break;
+
+         uint8_t tmp = delta[(uint8_t)(distance + pos)];
+         if (is_encoder) {
+             tmp = (uint8_t)(byte) - tmp;
+             delta[pos] = (uint8_t)(byte);
+         } else {
+             tmp = (uint8_t)(byte) + tmp;
+             delta[pos] = tmp;
+         }
+
+         write_byte(tmp);
+         --pos;
+     }
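+
+ The same algorithm can be tried out on a memory buffer. The
+ following is a self-contained variant of the code above that
+ works in place; the function name and the buffer interface are
+ assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// distance is in the range [1, 256] (Properties byte + 1).
static void delta_code(uint8_t *buf, size_t size,
        unsigned int distance, bool is_encoder)
{
    uint8_t history[256] = { 0 };
    uint8_t pos = 0;

    for (size_t i = 0; i < size; ++i) {
        // Unfiltered byte that is distance bytes behind.
        const uint8_t prev = history[(uint8_t)(distance + pos)];
        uint8_t out;

        if (is_encoder) {
            out = (uint8_t)(buf[i] - prev);
            history[pos] = buf[i];
        } else {
            out = (uint8_t)(buf[i] + prev);
            history[pos] = out;
        }

        buf[i] = out;
        --pos;  // wraps from 0 to 255
    }
}
```

+ With distance 1, the bytes 10 20 30 40 are encoded as
+ 10 10 10 10, and decoding restores the original bytes.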
+
+
+4.3.4. LZMA
+
+ LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
+ compression algorithm with high compression ratio and fast
+ decompression. LZMA is based on LZ77 and range coding algorithms.
+
+ Filter ID: 0x40
+ Size of Filter Properties: 2 bytes
+ Changes size of data: Yes, unpredictably
+
+ Detecting when all of the data has been decoded:
+ Uncompressed size: Yes
+ End of Payload Marker: Yes
+ End of Input: No
+
+ Preferred alignment:
+ Input data: Adjustable to 1/2/4/8/16 byte(s)
+ Output data: 1 byte
+
+ At the time of writing, there is no other documentation about
+ how LZMA works than the source code in LZMA SDK. Once such
+ documentation gets written, it will probably be published as
+ a separate document, because including the documentation here
+ would lengthen this document considerably.
+
+ The format of the Filter Properties field is as follows:
+
+ +-----------------+------------------+
+ | LZMA Properties | Dictionary Flags |
+ +-----------------+------------------+
+
+
+4.3.4.1. LZMA Properties
+
+ The LZMA Properties bits contain three properties. An
+ abbreviation is given in parentheses, followed by the value
+ range of the property. The field consists of
+
+ 1) the number of literal context bits (lc, [0, 8]);
+ 2) the number of literal position bits (lp, [0, 4]); and
+ 3) the number of position bits (pb, [0, 4]).
+
+ They are encoded using the following formula:
+
+ LZMA Properties = (pb * 5 + lp) * 9 + lc
+
+ The following C code illustrates a straightforward way to
+ decode the properties:
+
+     uint8_t lc, lp, pb;
+     uint8_t prop = get_lzma_properties() & 0xFF;
+     if (prop > (4 * 5 + 4) * 9 + 8)
+         return LZMA_PROPERTIES_ERROR;
+
+     pb = prop / (9 * 5);
+     prop -= pb * 9 * 5;
+     lp = prop / 9;
+     lc = prop - lp * 9;
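+
+ For completeness, here is a sketch of the encoding direction
+ together with a decoder working on plain arguments, so that
+ the two can be checked against each other. The function names
+ and the return value convention are hypothetical.

```c
#include <stdint.h>

// Returns 0 on success, -1 if a value is out of range.
static int lzma_props_encode(uint8_t lc, uint8_t lp, uint8_t pb,
        uint8_t *prop)
{
    if (lc > 8 || lp > 4 || pb > 4)
        return -1;

    *prop = (uint8_t)((pb * 5 + lp) * 9 + lc);
    return 0;
}

// Returns 0 on success, -1 if prop is not a valid value.
static int lzma_props_decode(uint8_t prop,
        uint8_t *lc, uint8_t *lp, uint8_t *pb)
{
    if (prop > (4 * 5 + 4) * 9 + 8)
        return -1;

    *pb = prop / (9 * 5);
    prop -= *pb * 9 * 5;
    *lp = prop / 9;
    *lc = prop - *lp * 9;
    return 0;
}
```

+ For example, lc = 3, lp = 0, and pb = 2 give the value
+ (2 * 5 + 0) * 9 + 3 = 93 (0x5D).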
+
+
+4.3.4.2. Dictionary Flags
+
+ Currently the lowest six bits of the Dictionary Flags field
+ are in use:
+
+ Bits Mask Description
+ 0-5 0x3F Dictionary Size
+ 6-7 0xC0 Reserved for future use; must be zero for now.
+
+ Dictionary Size is encoded with one-bit mantissa and five-bit
+ exponent. To avoid wasting space, one-byte dictionary has its
+ own special value.
+
+     Raw value   Mantissa   Exponent   Dictionary size
+         0           1          0          1 byte
+         1           2          0          2 bytes
+         2           3          0          3 bytes
+         3           2          1          4 bytes
+         4           3          1          6 bytes
+         5           2          2          8 bytes
+         6           3          2         12 bytes
+         7           2          3         16 bytes
+         8           3          3         24 bytes
+         9           2          4         32 bytes
+        ...         ...        ...        ...
+        61           2         30          2 GiB
+        62           3         30          3 GiB
+        63           2         31          4 GiB (*)
+
+ (*) The real maximum size of the dictionary is one byte
+ less than 4 GiB, because the distance of 4 GiB is
+ reserved for End of Payload Marker.
+
+ Instead of having a table in the decoder, the dictionary size
+ can be decoded using the following C code:
+
+     uint64_t dictionary_size;
+     const uint8_t bits = get_dictionary_flags() & 0x3F;
+     if (bits == 0) {
+         dictionary_size = 1;
+     } else {
+         dictionary_size = 2 | ((bits + 1) & 1);
+         dictionary_size = dictionary_size << ((bits - 1) / 2);
+     }
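+
+ An encoder has to go the other way and pick a raw value for
+ the dictionary size it wants. One simple approach, sketched
+ below, is to search for the smallest raw value whose decoded
+ size is large enough. The function names are hypothetical, and
+ a real encoder might prefer a different rounding policy.

```c
#include <stdint.h>

// Same calculation as in the decoding example above.
static uint64_t dict_size_decode(uint8_t bits)
{
    uint64_t dictionary_size;

    if (bits == 0)
        return 1;

    dictionary_size = 2 | ((bits + 1) & 1);
    return dictionary_size << ((bits - 1) / 2);
}

// Smallest raw value whose dictionary size is at least wanted.
// Sizes above the largest representable size map to 63.
static uint8_t dict_size_encode(uint64_t wanted)
{
    for (uint8_t bits = 0; bits < 63; ++bits)
        if (dict_size_decode(bits) >= wanted)
            return bits;

    return 63;
}
```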
+
+
+4.3.5. Branch/Call/Jump Filters for Executables
+
+ These filters convert relative branch, call, and jump
+ instructions to their absolute counterparts in executable
+ files. This conversion increases redundancy and thus
+ compression ratio.
+
+ Size of Filter Properties: 0 or 4 bytes
+ Changes size of data: No
+
+ Detecting when all of the data has been decoded:
+ Uncompressed size: Yes
+ End of Payload Marker: No
+ End of Input: Yes
+
+ Below is the list of filters in this category. The alignment
+ is the same for both input and output data.
+
+     Filter ID   Alignment   Description
+       0x04       1 byte     x86 filter (BCJ)
+       0x05       4 bytes    PowerPC (big endian) filter
+       0x06      16 bytes    IA64 filter
+       0x07       4 bytes    ARM (little endian) filter
+       0x08       2 bytes    ARM Thumb (little endian) filter
+       0x09       4 bytes    SPARC filter
+
+ If the size of Filter Properties is four bytes, the Filter
+ Properties field contains the start offset used for address
+ conversions. It is stored as an unsigned 32-bit little endian
+ integer. If the size of Filter Properties is zero, the start
+ offset is zero.
+
+ Setting the start offset may be useful if an executable has
+ multiple sections, and there are many cross-section calls.
+ Taking advantage of this feature usually requires usage of
+ the Subblock filter.
+
+
+5. Metadata
+
+ Metadata is stored in Metadata Blocks, which can be in the
+ beginning or at the end of a Multi-Block Stream. Because of
+ Blocks, it is possible to compress Metadata in the same way
+ as the actual data is compressed. This Section describes the
+ format of the data stored in Metadata Blocks.
+
+ +----------------+===============================+
+ | Metadata Flags | Size of Header Metadata Block |
+ +----------------+===============================+
+
+ +============+===================+=======+=======+
+ ---> | Total Size | Uncompressed Size | Index | Extra |
+ +============+===================+=======+=======+
+
+ Stream must be parseable backwards. That is, there must be
+ a way to locate the beginning of the Stream by starting from
+ the end of the Stream. Thus, the Footer Metadata Block must
+ contain the Total Size field or the Index field. If the Stream
+ has Header Metadata Block, also the Size of Header Metadata
+ Block field must be present in Footer Metadata Block.
+
+ It must be possible to quickly locate the Blocks in
+ non-streamed mode. Thus, the Index field must be present
+ at least in one Metadata Block.
+
+ If the above conditions are not met, the decoder must indicate
+ an error.
+
+ There should be no additional data after the last field. If
+ there is, the decoder should indicate an error.
+
+
+5.1. Metadata Flags
+
+ This field describes which fields are present in a Metadata
+ Block:
+
+     Bit(s)  Mask  Description
+       0     0x01  Size of Header Metadata Block is present.
+       1     0x02  Total Size is present.
+       2     0x04  Uncompressed Size is present.
+       3     0x08  Index is present.
+      4-6    0x70  Reserved for future use; must be zero for now.
+       7     0x80  Extra is present.
+
+ If any reserved bit is set, the decoder must indicate an error.
+ It is possible that there is a new field present which the
+ decoder is not aware of, and can thus parse the Metadata
+ incorrectly.
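+
+ Parsing the Metadata Flags field, together with the presence
+ rules for Footer Metadata Block given in Section 5, can be
+ sketched as follows. The structure and the function names are
+ hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct metadata_fields {
    bool header_metadata_size;  // Size of Header Metadata Block
    bool total_size;
    bool uncompressed_size;
    bool index;
    bool extra;
};

// Returns false if any reserved bit is set.
static bool parse_metadata_flags(uint8_t flags, struct metadata_fields *f)
{
    if (flags & 0x70)
        return false;

    f->header_metadata_size = (flags & 0x01) != 0;
    f->total_size = (flags & 0x02) != 0;
    f->uncompressed_size = (flags & 0x04) != 0;
    f->index = (flags & 0x08) != 0;
    f->extra = (flags & 0x80) != 0;
    return true;
}

// Checks that a Footer Metadata Block allows backward parsing.
static bool footer_fields_ok(const struct metadata_fields *f,
        bool stream_has_header_metadata)
{
    if (!f->total_size && !f->index)
        return false;

    return !stream_has_header_metadata || f->header_metadata_size;
}
```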
+
+
+5.2. Size of Header Metadata Block
+
+ This field is present only if the appropriate bit is set in
+ the Metadata Flags field (see Section 5.1).
+
+ Size of Header Metadata Block is needed to make it possible to
+ parse the Stream backwards. The size is stored using the
+ encoding described in Section 1.2. The decoder must verify that
+ the value stored in this field is non-zero. In Footer
+ Metadata Block, the decoder must also verify that the stored
+ size matches the real size of Header Metadata Block. In the
+ Header Metadata Block, the value of this field is ignored as
+ long as it is not zero.
+
+
+5.3. Total Size
+
+ This field is present only if the appropriate bit is set in the
+ Metadata Flags field (see Section 5.1).
+
+ This field contains the total size of the Data Blocks in the
+ Stream. Total Size is stored using the encoding described in
+ Section 1.2. If the stored value does not match the real total
+ size of the Data Blocks, the decoder must indicate an error.
+ The value of this field must be non-zero.
+
+ Total Size can be used to quickly locate the beginning or end
+ of the Stream. This can be useful for example when doing
+ random-access reading, and the Index field is not in the
+ Metadata Block currently being read.
+
+ It is useless to have both Total Size and Index in the same
+ Metadata Block, because Total Size can be calculated from the
+ Index field.
+
+
+5.4. Uncompressed Size
+
+ This field is present only if the appropriate bit is set in the
+ Metadata Flags field (see Section 5.1).
+
+ This field contains the total uncompressed size of the Data
+ Blocks in the Stream. Uncompressed Size is stored using the
+ encoding described in Section 1.2. If the stored value does not
+ match the real uncompressed size of the Data Blocks, the
+ decoder must indicate an error.
+
+ It is useless to have both Uncompressed Size and Index in
+ the same Metadata Block, because Uncompressed Size can be
+ calculated from the Index field.
+
+
+5.5. Index
+
+ +=======================+=============+====================+
+ | Number of Data Blocks | Total Sizes | Uncompressed Sizes |
+ +=======================+=============+====================+
+
+ Index serves several purposes. Using it, one can
+ - verify that all Blocks in a Stream have been processed;
+ - find out the Uncompressed Size of a Stream; and
+ - quickly access the beginning of any Block (random access).
+
+
+5.5.1. Number of Data Blocks
+
+ This field contains the number of Data Blocks in the Stream.
+ The value is stored using the encoding described in Section
+ 1.2. If the decoder has decoded all the Data Blocks of the
+ Stream, and then notices that the Number of Data Blocks doesn't
+ match the real number of Data Blocks, the decoder must
+ indicate an error. The value of this field must be non-zero.
+
+
+5.5.2. Total Sizes
+
+ +============+============+
+ | Total Size | Total Size | ...
+ +============+============+
+
+ This field lists the Total Sizes of every Data Block in the
+ Stream. There are as many Total Size fields as indicated by
+ the Number of Data Blocks field.
+
+ Total Size is the size of Block Header, Compressed Data, and
+ Block Footer. It is stored using the encoding described in
+ Section 1.2. If the Total Sizes do not match the real sizes
+ of respective Blocks, the decoder should indicate an error.
+ All the Total Size fields must have a non-zero value.
+
+
+5.5.3. Uncompressed Sizes
+
+ +===================+===================+
+ | Uncompressed Size | Uncompressed Size | ...
+ +===================+===================+
+
+ This field lists the Uncompressed Sizes of every Data Block
+ in the Stream. There are as many Uncompressed Size fields as
+ indicated by the Number of Data Blocks field.
+
+ Uncompressed Sizes are stored using the encoding described
+ in Section 1.2. If the Uncompressed Sizes do not match the
+ real sizes of respective Blocks, the decoder should indicate
+ an error.
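+
+ Random access with the Index can be sketched as a linear scan
+ over the two lists. The example assumes that both lists have
+ already been decoded into arrays and that the sums fit in
+ 64 bits; the function name is hypothetical. The returned
+ offset is relative to the beginning of the first Data Block.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Finds the Data Block that contains the uncompressed offset target.
// Returns false if target is past the end of the uncompressed data.
static bool index_locate(const uint64_t *total_sizes,
        const uint64_t *uncompressed_sizes, size_t count,
        uint64_t target, size_t *block, uint64_t *block_offset)
{
    uint64_t compressed = 0;    // sum of Total Sizes so far
    uint64_t uncompressed = 0;  // sum of Uncompressed Sizes so far

    for (size_t i = 0; i < count; ++i) {
        if (target - uncompressed < uncompressed_sizes[i]) {
            *block = i;
            *block_offset = compressed;
            return true;
        }

        uncompressed += uncompressed_sizes[i];
        compressed += total_sizes[i];
    }

    return false;
}
```

+ A decoder that needs many lookups would probably store the
+ running sums and use binary search instead.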
+
+
+5.6. Extra
+
+ This field is present only if the appropriate bit is set in the
+ Metadata Flags field (see Section 5.1). Note that the bit does
+ not indicate that there is any data in the Extra field; it only
+ indicates that Extra may be non-empty.
+
+ The Extra field contains only information that is not required
+ to properly uncompress the Stream or to do random-access
+ reading. Supporting the Extra field is optional. In case the
+ decoder doesn't support the Extra field, it should silently
+ ignore it.
+
+ Extra consists of zero or more Records:
+
+ +========+========+
+ | Record | Record | ...
+ +========+========+
+
+ Excluding Records with Record ID 0x00, each Record contains
+ three fields:
+
+ +===========+==============+======+
+ | Record ID | Size of Data | Data |
+ +===========+==============+======+
+
+ The Record ID and Size of Data are stored using the encoding
+ described in Section 1.2. Data can be binary or UTF-8
+ [RFC-3629] strings. Non-UTF-8 strings should be avoided.
+ Because the Size of Data is known, there is no need to
+ terminate strings with a nul byte, although doing so should
+ not be considered an error.
+
+ The Record IDs are divided in two categories:
+ - Safe-to-Copy Records may be preserved as is when the
+ Stream is modified in ways that don't change the actual
+ uncompressed data. Examples of such operations include
+ recompressing and adding, modifying, or deleting unrelated
+ Extra Records.
+ - Unsafe-to-Copy Records should be removed (and possibly
+ recreated) when any kind of changes are made to the Stream.
+
+ When the actual uncompressed data is modified, all Records
+ should be removed (and possibly recreated), unless the
+ application knows that the Data stored to the Record(s) is
+ still valid.
+
+ The following subsections describe the standard Record IDs and
+ the format of their Data fields. Safe-to-Copy Records have an
+ odd ID, while Unsafe-to-Copy Records have an even ID.
+
+
+5.6.1. 0x00: Dummy/Padding
+
+ This Record is special, because it doesn't have the Size of
+ Data or Data fields.
+
+ Dummy Records can be used, for example, to fill Metadata Block
+ when a few bytes of extra space has been reserved for it. There
+ can be any number of Dummy Records.
+
+
+5.6.2. 0x01: OpenPGP Signature
+
+ OpenPGP signature is computed from uncompressed data. The
+ signature can be used to verify that the contents of a Stream
+ have been created by a trustworthy source.
+
+ If the decoder supports decoding concatenated Streams, it
+ must indicate an error when verifying OpenPGP signatures if
+ there is more than one Stream.
+
+ OpenPGP format is documented in [RFC-2440].
+
+
+5.6.3. 0x02: Filter Information
+
+ The Filter Information Record contains information about the
+ filters used in the Stream. This field can be used to quickly
+ - display which filters are used in each Block;
+ - check if all the required filters are supported by the
+ current decoder version; and
+ - check how much memory is required to decode each Block.
+
+ The format of the Filter Information field is as follows:
+
+ +=================+=================+
+ | Block 0 Filters | Block 1 Filters | ...
+ +=================+=================+
+
+ There can be at most as many Block Filters fields as
+ there are Data Blocks in the Stream. The format of the Block
+ Filters field is as follows:
+
+ +------------------+======================+============+
+ | Block Properties | List of Filter Flags | Subfilters |
+ +------------------+======================+============+
+
+ Block Properties is a bitfield:
+
+     Bit(s)  Mask  Description
+     0-2     0x07  Number of filters (0-7)
+     3       0x08  End of Payload Marker is used.
+     4       0x10  The Subfilters field is present.
+     5-7     0xE0  Reserved for future use; must be zero for now.
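+
+ As a rough illustration of the table above, a decoder might unpack
+ the Block Properties byte as sketched below. The struct and function
+ names are hypothetical and not part of liblzma; rejecting a byte
+ with reserved bits set is one reading of "must be zero for now".
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for a decoded Block Properties byte. */
struct block_props {
    unsigned filter_count;  /* bits 0-2, mask 0x07 */
    bool has_eopm;          /* bit 3, mask 0x08: End of Payload Marker */
    bool has_subfilters;    /* bit 4, mask 0x10: Subfilters field present */
};

/* Returns false if any reserved bit (mask 0xE0) is set. */
static bool
parse_block_props(uint8_t byte, struct block_props *out)
{
    assert(out != NULL);

    if (byte & 0xE0)
        return false;

    out->filter_count = byte & 0x07;
    out->has_eopm = (byte & 0x08) != 0;
    out->has_subfilters = (byte & 0x10) != 0;
    return true;
}
```
+
+ For example, the byte 0x1B would decode to three filters with both
+ the End of Payload Marker and the Subfilters field flagged.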
+
+ The contents of the List of Filter Flags field must match the
+ List of Filter Flags field in the respective Block Header.
+
+ The Subfilters field may be present only if the List of Filter
+ Flags contains a Filter Flags field for a Subblock filter. The
+ format of the Subfilters field is as follows:
+
+ +======================+=========================+
+ | Number of Subfilters | List of Subfilter Flags |
+ +======================+=========================+
+
+ The value stored in the Number of Subfilters field is stored
+ using the encoding described in Section 1.2. The List of
+ Subfilter Flags field contains as many Filter Flags fields
+ as indicated by the Number of Subfilters field. These Filter
+ Flags fields list some or all of the Subfilters used via the
+ Subblock filter. The order of the listed Subfilters is not
+ significant.
+
+ Decoders supporting this Record should indicate a warning or
+ error if this Record contains Filter Flags that are not
+ actually used by the respective Blocks.
+
+
+5.6.4. 0x03: Comment
+
+ Free-form comment is stored in UTF-8 [RFC-3629] encoding.
+
+ The beginning of a new line should be indicated using the
+ ASCII Line Feed character (0x0A). When the Line Feed character
+ is not the native way to indicate new line in the underlying
+ operating system, the encoder and decoder should convert the
+ newline characters to and from Line Feeds.
+
+
+5.6.5. 0x04: List of Checks
+
+ +=======+=======+
+ | Check | Check | ...
+ +=======+=======+
+
+ There are as many Check fields as there are Blocks in the
+ Stream. The size of the Check fields depends on Stream Flags
+ (see Section 2.2.2).
+
+ Decoders supporting this Record should indicate a warning or
+ error if the Checks don't match the respective Blocks.
+
+
+5.6.6. 0x05: Original Filename
+
+ Original filename is stored in UTF-8 [RFC-3629] encoding.
+
+ The filename must not include any path, only the filename
+ itself. Special care must be taken to prevent directory
+ traversal vulnerabilities.
+
+ When files are moved between different operating systems, it
+ is possible that a filename valid in the source system is not
+ valid in the target system. It is implementation defined how
+ the decoder handles these kinds of situations.
+
+
+5.6.7. 0x07: Modification Time
+
+ Modification time is stored as POSIX time, as an unsigned
+ little endian integer. The number of bits depends on the
+ Size of Data field. Note that the usage of unsigned integer
+ limits the earliest representable time to 1970-01-01T00:00:00.
+
+
+5.6.8. 0x09: High-Resolution Modification Time
+
+ This Record extends the `0x07: Modification Time' Record with
+ subsecond time information. There are two supported formats
+ of this field, which can be distinguished by looking at the
+ Size of Data field.
+
+     Size  Data
+     3     [0; 9,999,999] times 100 nanoseconds
+     4     [0; 999,999,999] nanoseconds
+
+ The value is stored as an unsigned 24-bit or 32-bit little
+ endian integer.
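+
+ A minimal sketch of turning this Record's Data into nanoseconds is
+ shown below. The function name is hypothetical, and treating values
+ outside the listed ranges as errors is an assumption; the text above
+ only gives the valid ranges.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the Data field of the High-Resolution Modification Time
 * Record. Returns 0 on success and -1 if the size is not 3 or 4 or
 * if the value is outside the documented range (an assumed policy).
 */
static int
parse_hires_mtime(const uint8_t *data, size_t size, uint32_t *nanoseconds)
{
    assert(data != NULL && nanoseconds != NULL);

    if (size != 3 && size != 4)
        return -1;

    /* Unsigned little endian integer of 24 or 32 bits. */
    uint32_t value = 0;
    for (size_t i = 0; i < size; ++i)
        value |= (uint32_t)data[i] << (8 * i);

    if (size == 3) {
        /* Units of 100 nanoseconds. */
        if (value > UINT32_C(9999999))
            return -1;
        value *= 100;
    } else if (value > UINT32_C(999999999)) {
        return -1;
    }

    *nanoseconds = value;
    return 0;
}
```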
+
+
+5.6.9. 0x0B: MIME Type
+
+ MIME type of the uncompressed Stream. This can be used to
+ detect the content type. [IANA-MIME]
+
+
+5.6.10. 0x0D: Homepage URL
+
+ This field can be used, for example, when distributing software
+ packages (sources or binaries). The field would indicate the
+ homepage of the program.
+
+ For details on how to encode URLs, see [RFC-1738].
+
+
+6. Custom Filter and Extra Record IDs
+
+ If a developer wants to use custom Filter or Extra Record IDs,
+ he has two choices. The first choice is to contact Lasse Collin
+ and ask him to allocate a range of IDs for the developer.
+
+ The second choice is to generate a 40-bit random integer,
+ which the developer can use as his personal Developer ID.
+ To minimize the risk of collisions, Developer ID has to be
+ a randomly generated integer, not manually selected "hex word".
+ The following command, which works on many free operating
+ systems, can be used to generate Developer ID:
+
+ dd if=/dev/urandom bs=5 count=1 | hexdump
+
+ The developer can then use his Developer ID to create unique
+ (well, hopefully unique) Filter and Extra Record IDs.
+
+     Bits    Mask                   Description
+     0-15    0x0000_0000_0000_FFFF  Filter or Extra Record ID
+     16-55   0x00FF_FFFF_FFFF_0000  Developer ID
+     56-62   0x7F00_0000_0000_0000  Static prefix: 0x7F
+
+ The resulting 63-bit integer will use 9 bytes of space when
+ stored using the encoding described in Section 1.2. To get
+ a shorter ID, see the beginning of this Section for how to
+ request a custom ID range.
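+
+ The bit layout above could be assembled roughly as follows. This is
+ only a sketch: make_custom_id() is a hypothetical helper, not a
+ liblzma function, and the Developer ID used in the note after the
+ code is an arbitrary example value, not an allocated ID.
+
```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a 40-bit Developer ID and a 16-bit local Filter or Extra
 * Record ID with the static 0x7F prefix in bits 56-62.
 */
static uint64_t
make_custom_id(uint64_t developer_id, uint16_t local_id)
{
    /* The Developer ID must fit in 40 bits. */
    assert(developer_id <= UINT64_C(0xFFFFFFFFFF));

    return (UINT64_C(0x7F) << 56)
            | (developer_id << 16)
            | (uint64_t)local_id;
}
```
+
+ With the example Developer ID 0x0123456789 and local ID 0x0001,
+ the result would be 0x7F01_2345_6789_0001. Bit 63 is always zero,
+ so the value fits in 63 bits.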
+
+ Note that Filter and Extra Record IDs are in their own
+ namespaces. That is, you can use the same ID value as Filter ID
+ and Extra Record ID, and the meanings of the IDs do not need
+ to be related to each other.
+
+
+6.1. Reserved Custom Filter ID Ranges
+
+     Range                       Description
+     0x0000_0000 - 0x0000_00DF   IDs fitting into the Misc field
+     0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
+     0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
+
+
+7. Cyclic Redundancy Checks
+
+ There are several incompatible variations to calculate CRC32
+ and CRC64. For simplicity and clarity, complete examples are
+ provided to calculate the checks as they are used in this file
+ format. Implementations may use different code as long as it
+ gives identical results.
+
+ The program below reads data from standard input, calculates
+ the CRC32 and CRC64 values, and prints the calculated values
+ as big endian hexadecimal strings to standard output.
+
+     #include <sys/types.h>
+     #include <inttypes.h>
+     #include <stdio.h>
+
+     uint32_t crc32_table[256];
+     uint64_t crc64_table[256];
+
+     void
+     init(void)
+     {
+         static const uint32_t poly32 = UINT32_C(0xEDB88320);
+         static const uint64_t poly64
+                 = UINT64_C(0xC96C5795D7870F42);
+
+         for (size_t i = 0; i < 256; ++i) {
+             uint32_t crc32 = i;
+             uint64_t crc64 = i;
+
+             for (size_t j = 0; j < 8; ++j) {
+                 if (crc32 & 1)
+                     crc32 = (crc32 >> 1) ^ poly32;
+                 else
+                     crc32 >>= 1;
+
+                 if (crc64 & 1)
+                     crc64 = (crc64 >> 1) ^ poly64;
+                 else
+                     crc64 >>= 1;
+             }
+
+             crc32_table[i] = crc32;
+             crc64_table[i] = crc64;
+         }
+     }
+
+     uint32_t
+     crc32(const uint8_t *buf, size_t size, uint32_t crc)
+     {
+         crc = ~crc;
+         for (size_t i = 0; i < size; ++i)
+             crc = crc32_table[buf[i] ^ (crc & 0xFF)]
+                     ^ (crc >> 8);
+         return ~crc;
+     }
+
+     uint64_t
+     crc64(const uint8_t *buf, size_t size, uint64_t crc)
+     {
+         crc = ~crc;
+         for (size_t i = 0; i < size; ++i)
+             crc = crc64_table[buf[i] ^ (crc & 0xFF)]
+                     ^ (crc >> 8);
+         return ~crc;
+     }
+
+     int
+     main(void)
+     {
+         init();
+
+         uint32_t value32 = 0;  /* pass 0 as the initial CRC */
+         uint64_t value64 = 0;
+         uint64_t total_size = 0;
+         uint8_t buf[8192];
+
+         while (1) {
+             const size_t buf_size = fread(buf, 1, 8192, stdin);
+             if (buf_size == 0)
+                 break;
+
+             total_size += buf_size;
+             value32 = crc32(buf, buf_size, value32);
+             value64 = crc64(buf, buf_size, value64);
+         }
+
+         printf("Bytes: %" PRIu64 "\n", total_size);
+         printf("CRC-32: 0x%08" PRIX32 "\n", value32);
+         printf("CRC-64: 0x%016" PRIX64 "\n", value64);
+
+         return 0;
+     }
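+
+ When checking an independent implementation against the program
+ above, a known-answer test can help. The sketch below is not part of
+ the specification: it recomputes both checks one bit at a time (no
+ tables) using the same polynomials and inversions, so that it can be
+ compared against the commonly published check values for the ASCII
+ string "123456789". The function names are hypothetical.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32 using the reflected polynomial 0xEDB88320. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    uint32_t crc = UINT32_C(0xFFFFFFFF);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1)
                    ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                    : crc >> 1;
    }
    return crc ^ UINT32_C(0xFFFFFFFF);
}

/* Bit-at-a-time CRC64 using the reflected polynomial from above. */
static uint64_t
crc64_bitwise(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    uint64_t crc = UINT64_C(0xFFFFFFFFFFFFFFFF);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc & 1)
                    ? (crc >> 1) ^ UINT64_C(0xC96C5795D7870F42)
                    : crc >> 1;
    }
    return crc ^ UINT64_C(0xFFFFFFFFFFFFFFFF);
}
```
+
+ For the nine bytes "123456789", the expected results are 0xCBF43926
+ for CRC32 and 0x995DC9BBDF1939FA for CRC64.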
+
+
+8. References
+
+8.1. Normative References
+
+ [RFC-1738]
+ Uniform Resource Locators (URL)
+ http://www.ietf.org/rfc/rfc1738.txt
+
+ [RFC-2119]
+ Key words for use in RFCs to Indicate Requirement Levels
+ http://www.ietf.org/rfc/rfc2119.txt
+
+ [RFC-2440]
+ OpenPGP Message Format
+ http://www.ietf.org/rfc/rfc2440.txt
+
+ [RFC-3629]
+ UTF-8, a transformation format of ISO 10646
+ http://www.ietf.org/rfc/rfc3629.txt
+
+ [IANA-MIME]
+ MIME Media Types
+ http://www.iana.org/assignments/media-types/
+
+
+8.2. Informative References
+
+ LZMA SDK - The original LZMA implementation
+ http://7-zip.org/sdk.html
+
+ LZMA Utils - LZMA adapted to POSIX-like systems
+ http://tukaani.org/lzma/
+
+ [RFC-1952]
+ GZIP file format specification version 4.3
+ http://www.ietf.org/rfc/rfc1952.txt
+ - Notation of byte boxes in section `2.1. Overall conventions'
+
+ [GNU-tar]
+ GNU tar 1.16.1 manual
+ http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
+ - Node 9.4.2 `Blocking Factor', paragraph that begins
+ `gzip will complain about trailing garbage'
+ - Note that this URL points to the latest version of the
+ manual, and may some day not contain the note which is in
+ 1.16.1. For the exact version of the manual, download GNU
+ tar 1.16.1: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.16.1.tar.gz
+
diff --git a/doc/history.txt b/doc/history.txt
new file mode 100644
index 00000000..55293062
--- /dev/null
+++ b/doc/history.txt
@@ -0,0 +1,140 @@
+
+LZMA Utils history
+------------------
+
+Tukaani distribution
+
+ In 2005, there was a small group working on Tukaani distribution, which
+ was a Slackware fork. One of the project goals was to fit the distro on
+ a single 700 MiB ISO-9660 image. Using LZMA instead of gzip helped a
+ lot. Roughly speaking, one could fit data that took 1000 MiB in gzipped
+ form into 700 MiB with LZMA. Naturally compression ratio varied across
+ packages, but this was what we got on average.
+
+ Slackware packages have traditionally had .tgz as the filename suffix,
+ which is an abbreviation of .tar.gz. A logical naming for LZMA
+ compressed packages was .tlz, being an abbreviation of .tar.lzma.
+
+ At the end of the year 2007, there's no distribution under the Tukaani
+ project anymore. Development of LZMA Utils still continues. Still,
+ there are .tlz packages around, because at least Vector Linux (a
+ Slackware based distribution) uses LZMA for its packages.
+
+ First versions of the modified pkgtools used the LZMA_Alone tool from
+ Igor Pavlov's LZMA SDK as is. It was fine, because users wouldn't need
+ to interact with LZMA_Alone directly. But people soon wanted to use
+ LZMA for other files too, and the interface of LZMA_Alone wasn't
+ comfortable for those used to gzip and bzip2.
+
+
+First steps of LZMA Utils
+
+ The first version of LZMA Utils (4.22.0) included a shell script called
+ lzmash. It was a wrapper with a gzip-like command line interface. It
+ used the LZMA_Alone tool from LZMA SDK to do all the real work. zgrep,
+ zdiff, and related scripts from gzip were adapted to work with LZMA and
+ were part of the first LZMA Utils release too.
+
+ LZMA Utils 4.22.0 also included lzmadec, which was a small (less than
+ 10 KiB) decoder-only command line tool. It was written on top of the
+ decoder-only C code found in the LZMA SDK. lzmadec was convenient in
+ situations where LZMA_Alone (a few hundred KiB) would be too big.
+
+ lzmash and lzmadec were written by Lasse Collin.
+
+
+Second generation
+
+ The lzmash script was an ugly and not very secure hack. The last
+ version of LZMA Utils to use lzmash was 4.27.1.
+
+ LZMA Utils 4.32.0beta1 introduced a new lzma command line tool written
+ by Ville Koskinen. It was written in C++, and used the encoder and
+ decoder from the C++ LZMA SDK with few modifications. This tool replaced
+ both the lzmash script and the LZMA_Alone command line tool in LZMA
+ Utils.
+
+ Introducing this new tool caused some temporary incompatibilities,
+ because the LZMA_Alone executable was simply named lzma like the new
+ command line tool, but the two had completely different command line
+ interfaces. The file format was still the same.
+
+ Lasse wrote liblzmadec, which was a small decoder-only library based on
+ the C code found in the LZMA SDK. liblzmadec had an API similar to zlib,
+ although there were some significant differences, which made it
+ non-trivial to use it in some applications designed for zlib and
+ libbzip2.
+
+ The lzmadec command line tool was converted to use liblzmadec.
+
+ Alexandre Sauvé helped convert the build system to use GNU Autotools.
+ This made it easier to test for certain less portable features needed
+ by the new command line tool.
+
+ Since the new command line tool never got completely finished (for
+ example, it didn't support the LZMA_OPT environment variable), the intent
+ was to not call 4.32.x stable. Similarly, liblzmadec wasn't polished,
+ but appeared to work well enough, so some people started using it too.
+
+ Because the development of the third generation of LZMA Utils was
+ delayed considerably (roughly two years), the 4.32.x branch had to be
+ kept maintained. It got some bug fixes now and then, and finally it was
+ decided to call it stable, although most of the missing features were
+ never added.
+
+
+File format problems
+
+ The file format used by LZMA_Alone was primitive. It was designed with
+ embedded systems in mind, and thus provided only a minimal set of
+ features. The two biggest problems for non-embedded use were the lack
+ of magic bytes and of an integrity check.
+
+ Igor and Lasse started developing a new file format with some help from
+ Ville Koskinen, Mark Adler and Mikko Pouru. Designing the new format
+ took quite a long time. It was mostly because Lasse was quite slow at
+ getting things done due to personal reasons.
+
+ Near the end of the year 2007 the new format was practically finished.
+ Compared to LZMA_Alone format and the .gz format used by gzip, the new
+ .lzma format is quite complex as a whole. This means that tools having
+ *full* support for the new format would be larger and more complex than
+ the tools supporting only the old LZMA_Alone format.
+
+ For the situations where the full support for the .lzma format wouldn't
+ be required (embedded systems, operating system kernels), the new
+ format has a well-defined subset, which is easy to support with small
+ amount of code. It wouldn't be as small as an implementation using the
+ LZMA_Alone format, but the difference shouldn't be significant.
+
+ The new .lzma format allows dividing the data in multiple independent
+ blocks, which can be compressed and uncompressed independently. This
+ makes multi-threading possible with algorithms that aren't inherently
+ parallel (such as LZMA). There's also a central index of the sizes of
+ the blocks, which makes it possible to do limited random-access reading
+ with granularity of the block size.
+
+ The new .lzma format uses the same filename suffix that was used for
+ LZMA_Alone files. The advantage is that users using the new tools won't
+ notice the change to the new format. The disadvantage is that the old
+ tools won't work with the new files.
+
+
+Third generation
+
+ LZMA Utils 4.42.0alphas drop the rest of the C++ LZMA SDK. The LZMA and
+ other included filters (algorithm implementations) are still directly
+ based on LZMA SDK, but ported to C.
+
+ liblzma is now the core of LZMA Utils. It has a zlib-like API, which
+ doesn't suffer from the problems of the API of liblzmadec. liblzma
+ supports not only LZMA, but several other filters, which together
+ can improve compression ratio even further with certain file types.
+
+ The lzma and lzmadec command line tools have been rewritten. They use
+ liblzma to do the actual compressing or uncompressing.
+
+ The development of LZMA Utils 4.42.x is still in alpha stage. Several
+ features are still missing or don't fully work yet. Documentation is
+ also very minimal.
+
diff --git a/doc/liblzma-advanced.txt b/doc/liblzma-advanced.txt
new file mode 100644
index 00000000..d829a33a
--- /dev/null
+++ b/doc/liblzma-advanced.txt
@@ -0,0 +1,324 @@
+
+Advanced features of liblzma
+----------------------------
+
+0. Introduction
+
+ Most developers need only the basic features of liblzma. These
+ features allow single-threaded encoding and decoding of .lzma files
+ in streamed mode.
+
+ In some cases developers want more. The .lzma file format is
+ designed to allow multi-threaded encoding and decoding and limited
+ random-access reading. These features are possible in non-streamed
+ mode and, to a limited extent, also in streamed mode.
+
+ To take advantage of these features, the application needs a custom
+ .lzma file format handler. liblzma provides a set of tools to ease
+ this task, but it's still quite a bit of work to get a good custom
+ .lzma handler done.
+
+
+1. Where to begin
+
+ Start by reading the .lzma file format specification. Understanding
+ the basics of the .lzma file structure is required to implement a
+ custom .lzma file handler and to understand the rest of this document.
+
+
+2. The basic components
+
+2.1. Stream Header and tail
+
+ Stream Header begins the .lzma Stream and Stream tail ends it. Stream
+ Header is defined in the file format specification, but Stream tail
+ isn't (thus I write "tail" with a lower-case letter). Stream tail is
+ simply the Stream Flags and the Footer Magic Bytes fields together.
+ It was done this way in liblzma, because the Block coders take care
+ of the rest of the stuff in the Stream Footer.
+
+ For now, the size of Stream Header is fixed to 11 bytes. The header
+ <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
+ should use instead of a hardcoded number. Similarly, Stream tail
+ is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
+
+ It is possible that a future version of the .lzma format will have
+ variable-sized Stream Header and tail. As of writing, this seems so
+ unlikely though, that it was considered simplest to just use a
+ constant instead of providing functions to get and store the sizes
+ of the Stream Header and tail.
+
+
+2.x. Stream tail
+
+ For now, the size of Stream tail is fixed to 3 bytes. The header
+ <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
+ should use instead of a hardcoded number.
+
+
+3. Keeping track of size information
+
+ The lzma_info_* functions found in <lzma/info.h> should ease the
+ task of keeping track of sizes of the Blocks and also the Stream
+ as a whole. Using these functions is strongly recommended, because
+ there are surprisingly many situations where an error can occur,
+ and these functions check for possible errors every time some new
+ information becomes available.
+
+ If you find lzma_info_* functions lacking something that you would
+ find useful, please contact the author.
+
+
+3.1. Start offset of the Stream
+
+ If you are storing the .lzma Stream inside another file format, or
+ for some other reason are placing the .lzma Stream somewhere
+ other than the beginning of the file, you should tell the starting
+ offset of the Stream using lzma_info_start_offset_set().
+
+ The start offset of the Stream is used for two distinct purposes.
+ First, knowing the start offset of the Stream allows
+ lzma_info_alignment_get() to correctly calculate the alignment of
+ every Block. This information is given to the Block encoder, which
+ will calculate the size of Header Padding so that Compressed Data
+ is aligned at an optimal offset.
+
+ Another use for start offset of the Stream is in random-access
+ reading. If you set the start offset of the Stream, lzma_info_locate()
+ will be able to calculate the offset relative to the beginning of the
+ file containing the Stream (instead of offset relative to the
+ beginning of the Stream).
+
+
+3.2. Size of Stream Header
+
+ While the size of Stream Header is constant (11 bytes) in the current
+ version of the .lzma file format, this may change in the future.
+
+
+3.3. Size of Header Metadata Block
+
+ This information is needed when doing random-access reading, and
+ to verify the value of this field stored in Footer Metadata Block.
+
+
+3.4. Total Size of the Data Blocks
+
+
+3.5. Uncompressed Size of Data Blocks
+
+
+3.6. Index
+
+
+
+
+x. Alignment
+
+ There are a few slightly different types of alignment issues when
+ working with .lzma files.
+
+ The .lzma format doesn't strictly require any kind of alignment.
+ However, if the encoder carefully optimizes the alignment in all
+ situations, it can improve compression ratio, speed of the encoder
+ and decoder, and slightly help if the files get damaged and need
+ recovery.
+
+ Alignment has the most significant effect on compression ratio FIXME
+
+
+x.1. Compression ratio
+
+ Some filters take advantage of the alignment of the input data.
+ To get the best compression ratio, make sure that you feed these
+ filters correctly aligned data.
+
+ Some filters (e.g. LZMA) don't necessarily mind too much if the
+ input doesn't match the preferred alignment. With these filters
+ the penalty in compression ratio depends on the specific type of
+ data being compressed.
+
+ Other filters (e.g. PowerPC executable filter) won't work at all
+ with data that is improperly aligned. While the data can still
+ be de-filtered back to its original form, the benefit of the
+ filtering (better compression ratio) is completely lost, because
+ these filters expect certain patterns at properly aligned offsets.
+ The compression ratio may even be worse with incorrectly aligned input
+ than without the filter.
+
+
+x.1.1. Inter-filter alignment
+
+ When there are multiple filters chained, checking the alignment can
+ be useful not only with the input of the first filter and output of
+ the last filter, but also between the filters.
+
+ Inter-filter alignment is especially important with the Subblock filter.
+
+
+x.1.2. Further compression with external tools
+
+ This is a relatively rare situation in practice, but still worth
+ understanding.
+
+ Let's say that there are several SPARC executables, which are each
+ filtered to separate .lzma files using only the SPARC filter. If
+ Uncompressed Size is written to the Block Header, the size of Block
+ Header may vary between the .lzma files. If no Padding is used in
+ the Block Header to correct the alignment, the starting offset of
+ the Compressed Data field will be differently aligned in different
+ .lzma files.
+
+ All these .lzma files are archived into a single .tar archive. Due
+ to the nature of the .tar format, every file is aligned inside the
+ archive to an offset that is a multiple of 512 bytes.
+
+ The .tar archive is compressed into a new .lzma file using the LZMA
+ filter with options that prefer input alignment of four bytes. Now
+ if the independent .lzma files don't have the same alignment of
+ the Compressed Data fields, the LZMA filter will be unable to take
+ advantage of the input alignment between the files in the .tar
+ archive, which reduces compression ratio.
+
+ Thus, even if you have only a single Block per file, it can be good for
+ compression ratio to align the Compressed Data to an optimal offset.
+
+
+x.2. Speed
+
+ Most modern computers are faster when multi-byte data is located
+ at aligned offsets in RAM. Proper alignment of the Compressed Data
+ fields can slightly increase the speed of some filters.
+
+
+x.3. Recovery
+
+ Aligning every Block Header to start at an offset with big enough
+ alignment may ease or at least speed up recovery of broken files.
+
+
+y. Typical usage cases
+
+y.x. Parsing the Stream backwards
+
+ You may need to parse the Stream backwards if you need to get
+ information such as the sizes of the Stream, Index, or Extra.
+ The basic procedure to do this follows.
+
+ Locate the end of the Stream. If the Stream is stored as is in a
+ standalone .lzma file, simply seek to the end of the file and start
+ reading backwards using an appropriate buffer size. The file format
+ specification allows an arbitrary amount of Footer Padding (zero or
+ more NUL bytes), which you skip before trying to decode the Stream tail.
+
+ Once you have located the end of the Stream (a non-NUL byte), make
+ sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
+ Stream in a buffer. If there aren't enough bytes left in the file,
+ the file is too small to contain a valid Stream. Decode the Stream
+ tail using lzma_stream_tail_decoder(). Store the offset of the first
+ byte of the Stream tail; you will need it later.
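+
+ Skipping the Footer Padding could look roughly like the sketch below,
+ assuming the last part of the file has already been read into a
+ buffer. skip_footer_padding() is a hypothetical helper, not a liblzma
+ function; a real reader would also have to refill the buffer when
+ the padding is longer than what was read.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the last `size` bytes of a file, return how many of them
 * remain once trailing NUL bytes (Footer Padding) are dropped.
 * A return value of 0 means the buffer held only padding.
 */
static size_t
skip_footer_padding(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    return size;
}
```
+
+ The returned count is the position just past the last byte of the
+ Stream tail within the buffer, so the Stream tail itself would start
+ LZMA_STREAM_TAIL_SIZE bytes before it.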
+
+ You may now want to do some internal verifications, e.g. whether the
+ Check type is supported by the liblzma build you are using.
+
+ Decode the Backward Size field with lzma_vli_reverse_decode(). The
+ field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
+ Backward Size is not zero. Store the offset of the first byte of
+ the Backward Size; you will need it later.
+
+ Now you know the Total Size of the last Block of the Stream. It's the
+ value of Backward Size plus the size of the Backward Size field. Note
+ that you cannot use lzma_vli_size() to calculate the size since there
+ might be padding; you need to use the real observed size of the
+ Backward Size field.
+
+ At this point, the operation continues differently for Single-Block
+ and Multi-Block Streams.
+
+
+y.x.1. Single-Block Stream
+
+ There might be an Uncompressed Size field present in the Stream Footer.
+ You cannot know it for sure unless you have already parsed the Block
+ Header earlier. For security reasons, you probably want to try to
+ decode the Uncompressed Size field, but you must not indicate any
+ error if decoding fails. Later you can give the decoded Uncompressed
+ Size to the Block decoder if Uncompressed Size isn't otherwise known;
+ this prevents it from producing too much output in case of a (possibly
+ intentionally) corrupt file.
+
+ Calculate the start offset of the Stream:
+
+ backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
+
+ backward_offset is the offset of the first byte of the Backward Size
+ field. Remember to check for integer overflows, which can occur with
+ invalid input files.
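+
+ The calculation and its overflow checks might be sketched as follows.
+ Both the function name and STREAM_HEADER_SIZE are hypothetical; the
+ latter is a local stand-in (11 bytes, as stated earlier) for
+ LZMA_STREAM_HEADER_SIZE so that the example compiles without the
+ liblzma headers.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for LZMA_STREAM_HEADER_SIZE from <lzma/stream_flags.h>. */
#define STREAM_HEADER_SIZE 11

/*
 * Compute backward_offset - backward_size - STREAM_HEADER_SIZE
 * without letting the unsigned subtraction wrap around. Returns
 * false if the input values cannot describe a valid Stream.
 */
static bool
stream_start_offset(uint64_t backward_offset, uint64_t backward_size,
        uint64_t *start)
{
    assert(start != NULL);

    /* Backward Size must not be zero. */
    if (backward_size == 0)
        return false;

    if (backward_size > backward_offset)
        return false;

    const uint64_t header_end = backward_offset - backward_size;
    if (header_end < STREAM_HEADER_SIZE)
        return false;

    *start = header_end - STREAM_HEADER_SIZE;
    return true;
}
```
+
+ A false return would be treated the same way as any other sign of
+ a corrupt file.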
+
+ Seek to the beginning of the Stream. Decode the Stream Header using
+ lzma_stream_header_decoder(). Verify that the decoded Stream Flags
+ match the values found in the Stream tail. You can use the
+ lzma_stream_flags_is_equal() macro for this.
+
+ Decode the Block Header. Verify that it isn't a Metadata Block, since
+ Single-Block Streams cannot have Metadata. If Uncompressed Size is
+ present in the Block Header, the value you tried to decode from the
+ Stream Footer must be ignored, since Uncompressed Size wasn't actually
+ present there. If Block Header doesn't have Uncompressed Size, and
+ decoding the Uncompressed Size field from the Stream Footer failed,
+ the file is corrupt.
+
+ If you were only looking for the Uncompressed Size of the Stream,
+ you now have that information, and you can stop processing the Stream.
+
+ To decode the Block, the same instructions apply as described in
+ FIXME. However, because you have some extra known information decoded
+ from the Stream Footer, you should give this information to the Block
+ decoder so that it can verify it while decoding:
+ - If Uncompressed Size is not present in the Block Header, set
+ lzma_options_block.uncompressed_size to the value you decoded
+ from the Stream Footer.
+ - Always set lzma_options_block.total_size to backward_size +
+ size_of_backward_size (you calculated this sum earlier already).
+
+
+y.x.2. Multi-Block Stream
+
+ Calculate the start offset of the Footer Metadata Block:
+
+ backward_offset - backward_size
+
+ backward_offset is the offset of the first byte of the Backward Size
+ field. Remember to check for integer overflows, which can occur with
+ broken input files.
+
+ Decode the Block Header. Verify that it is a Metadata Block. Set
+ lzma_options_block.total_size to backward_size + size_of_backward_size
+ (you calculated this sum earlier already). Then decode the Footer
+ Metadata Block.
+
+ Store the decoded Footer Metadata to lzma_info structure using
+ lzma_info_set_metadata(). Set also the offset of the Backward Size
+ field using lzma_info_size_set(). Then you can get the start offset
+ of the Stream using lzma_info_size_get(). Note that any of these steps
+ may fail so don't omit error checking.
+
+ Seek to the beginning of the Stream. Decode the Stream Header using
+ lzma_stream_header_decoder(). Verify that the decoded Stream Flags
+ match the values found in the Stream tail. You can use the
+ lzma_stream_flags_is_equal() macro for this.
+
+ If you were only looking for the Uncompressed Size of the Stream,
+ it's possible that you already have it now. If Uncompressed Size (or
+ whatever information you were looking for) isn't available yet,
+ continue by decoding also the Header Metadata Block. (If some
+ information is missing, the Header Metadata Block has to be present.)
+
+ Decoding the Data Blocks goes the same way as described in FIXME.
+
+
+y.x.3. Variations
+
+ If you know the offset of the beginning of the Stream, you may want
+ to parse the Stream Header before parsing the Stream tail.
+
diff --git a/doc/liblzma-hacking.txt b/doc/liblzma-hacking.txt
new file mode 100644
index 00000000..64390bcb
--- /dev/null
+++ b/doc/liblzma-hacking.txt
@@ -0,0 +1,112 @@
+
+Hacking liblzma
+---------------
+
+0. Preface
+
+ This document gives some overall information about the internals of
+ liblzma, which should make it easier to start reading and modifying
+ the code.
+
+
+1. Programming language
+
+ liblzma was written in C99. If you use GCC, this means that you need
+ at least GCC 3.x.x. GCC 2 isn't and won't be supported.
+
+ Some GCC-specific extensions are used *conditionally*. They aren't
+ required to build a full-featured library. Don't make the code rely
+ on any non-standard compiler extensions or even C99 features that
+ aren't portable between almost-C99 compatible compilers (for example
+ non-static inlines).
+
+ The public API headers are in C89. This is to avoid frustrating those
+ who maintain programs that are strictly in C89 or C++.
+
+ An assumption about sizeof(size_t) is made. If this assumption is
+ wrong, some porting is probably needed:
+
+ sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t)
+
+
+2. Internal vs. external API
+
+
+
+     Input                               Output
+       v          Application              ^
+       |      liblzma public API           |
+       |         Stream coder              |
+       |         Block coder               |
+       |         Filter coder              |
+       |             ...                   |
+       v         Filter coder              ^
+
+
+     Application
+       `-- liblzma public API
+           `-- Stream coder
+               |-- Stream info handler
+               |-- Stream Header coder
+               |-- Block Header coder
+               |   `-- Filter Flags coder
+               |-- Metadata coder
+               |   `-- Block coder
+               |       `-- Filter 0
+               |           `-- Filter 1
+               |               ...
+               |-- Data Block coder
+               |   `-- Filter 0
+               |       `-- Filter 1
+               |           ...
+               `-- Stream tail coder
+
+
+
+x. Designing new filters
+
+ All filters must be designed so that the decoder cannot consume an
+ arbitrary amount of input without producing any decoded output. Failing
+ to follow this rule makes liblzma vulnerable to DoS attacks if
+ untrusted files are decoded (usually they are untrusted).
+
+ An example should clarify the reason behind this requirement: There
+ are two filters in the chain. The decoder of the first filter produces
+ a huge amount of output (many gigabytes or more) with a few bytes of
+ input, which gets passed to the decoder of the second filter. If the
+ data passed to the second filter is interpreted as something that
+ produces no output (e.g. padding), the filter chain as a whole
+ produces no output and consumes no input for a long period of time.
+
+ The above problem was present in the first versions of the Subblock
+ filter. A tiny .lzma file could have taken several years to decode
+ while it wouldn't produce any output at all. The problem was fixed
+ by adding limits for number of consecutive Padding bytes, and requiring
+ that some decoded output must be produced between Set Subfilter and
+ Unset Subfilter.
+
+
+x. Implementing new filters
+
+ If the filter supports embedding End of Payload Marker, make sure that
+ when your filter detects End of Payload Marker, it checks that
+ - the usage of End of Payload Marker is actually allowed (i.e. End
+ of Input isn't used); and
+ - there is no more input coming from the next filter in the chain.
+
+ The second requirement is slightly tricky. It's possible that the next
+ filter hasn't returned LZMA_STREAM_END yet. It may even need a few
+ bytes more input before it will do so. You need to give it as much
+ input as it needs, and verify that it doesn't produce any output.
+
+ Don't call the next filter in the chain after it has returned
+ LZMA_STREAM_END (except in the encoder if action == LZMA_SYNC_FLUSH).
+ Doing so results in undefined behavior.
+
+ Be pedantic. If the input data isn't exactly valid, reject it.
+
+ At the moment, liblzma isn't modular. You will need to edit several
+ files in src/liblzma/common to include support for a new filter. grep
+ for LZMA_FILTER_LZMA to locate the files needing changes.
+
diff --git a/doc/liblzma-intro.txt b/doc/liblzma-intro.txt
new file mode 100644
index 00000000..9cbd63a9
--- /dev/null
+++ b/doc/liblzma-intro.txt
@@ -0,0 +1,188 @@
+
+Introduction to liblzma
+-----------------------
+
+Writing applications to work with liblzma
+
+ The liblzma API is split into several subheaders to improve
+ readability and maintainability. The subheaders must not be #included
+ directly; simply use `#include <lzma.h>' instead.
+
+ Those who have used zlib should find liblzma's API easy to use.
+ To developers who haven't used zlib before, I recommend learning
+ zlib first, because zlib has excellent documentation.
+
+ While the API is similar to that of zlib, there are some major
+ differences, which are summarized below.
+
+ For basic stream encoding, zlib has three functions (deflateInit(),
+ deflate(), and deflateEnd()). Similarly, there are three functions
+ for stream decoding (inflateInit(), inflate(), and inflateEnd()).
+ liblzma has only a single coding function and a single ending
+ function. Thus, to encode one may use, for example,
+ lzma_stream_encoder_single(), lzma_code(), and lzma_end(). Similarly,
+ for decoding one may use lzma_auto_decoder(), lzma_code(), and
+ lzma_end().
+
+ zlib has deflateReset() and inflateReset() to reset the stream
+ structure without reallocating all the memory. In liblzma, all
+ coder initialization functions are like zlib's reset functions:
+ the first-time initializations are done with the same functions
+ as the reinitializations (resetting).
+
+ To make all this work, liblzma needs to know when lzma_stream
+ doesn't already point to an allocated and initialized coder.
+ This is achieved by initializing the lzma_stream structure with
+ LZMA_STREAM_INIT (static initialization) or LZMA_STREAM_INIT_VAR
+ (for example when a new lzma_stream has been allocated with malloc()).
+ This initialization should be done exactly once per lzma_stream
+ structure to avoid leaking memory. Calling lzma_end() will leave
+ lzma_stream in a state comparable to the state achieved with
+ LZMA_STREAM_INIT and LZMA_STREAM_INIT_VAR.
+
+ An example probably clarifies a lot. With zlib, compression goes
+ roughly like this:
+
+ z_stream strm;
+ deflateInit(&strm, level);
+ deflate(&strm, Z_NO_FLUSH);
+ deflate(&strm, Z_NO_FLUSH);
+ ...
+ deflate(&strm, Z_FINISH);
+ deflateEnd(&strm) or deflateReset(&strm)
+
+ With liblzma, it's slightly different:
+
+ lzma_stream strm = LZMA_STREAM_INIT;
+ lzma_stream_encoder_single(&strm, &options);
+ lzma_code(&strm, LZMA_RUN);
+ lzma_code(&strm, LZMA_RUN);
+ ...
+ lzma_code(&strm, LZMA_FINISH);
+ lzma_end(&strm) or reinitialize for new coding work
+
+ Reinitialization in the last step can be any function that can
+ initialize lzma_stream; it doesn't need to be the same function
+ that was used for the previous initialization. If it is the same
+ function, liblzma will usually be able to re-use most of the
+ existing memory allocations (depends on how much the initialization
+ options change). If you reinitialize with a different function,
+ liblzma will automatically free the memory of the previous coder.
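+
+ The idea that one function serves as both the first-time
+ initialization and the reset can be sketched with a toy structure.
+ Everything here (toy_stream, TOY_STREAM_INIT, toy_coder_init(),
+ toy_end()) is a hypothetical stand-in and not part of liblzma. For
+ simplicity the sketch always frees and reallocates, whereas liblzma
+ is described above as re-using allocations when it can.
+

```c
#include <stdlib.h>

/* Hypothetical miniature of a stream structure. */
typedef struct {
    void *internal; /* NULL means that no coder has been allocated */
} toy_stream;

/* Static initializer, playing the role of LZMA_STREAM_INIT. */
#define TOY_STREAM_INIT { NULL }

/*
 * Initializes or reinitializes strm with a coder that needs state_size
 * bytes of state. Returns 0 on success and -1 if allocation fails.
 */
int toy_coder_init(toy_stream *strm, size_t state_size)
{
    /* Safe only because strm->internal is NULL or a valid pointer. */
    free(strm->internal);
    strm->internal = calloc(1, state_size);
    return strm->internal != NULL ? 0 : -1;
}

/* Frees the coder and returns strm to its freshly initialized state. */
void toy_end(toy_stream *strm)
{
    free(strm->internal);
    strm->internal = NULL;
}
```

+ The sketch also shows why the structure must be initialized exactly
+ once before the first use: without it, the first free() would get
+ an indeterminate pointer, and initializing again later would lose
+ the pointer to the existing allocation.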
+
+
+File formats
+
+ liblzma supports multiple container formats for the compressed data.
+ Different initialization functions initialize the lzma_stream to
+ process different container formats. See the details from the public
+ header files.
+
+ The following functions are the most commonly used:
+
+ - lzma_stream_encoder_single(): Encodes Single-Block Stream; this
+ is the recommended format for most purposes.
+
+ - lzma_alone_encoder(): Useful if you need to encode into the
+ legacy LZMA_Alone format.
+
+ - lzma_auto_decoder(): Decoder that automatically detects the
+ file format; recommended when you decode compressed files on
+ disk, because this way compatibility with the legacy LZMA_Alone
+ format is transparent.
+
+ - lzma_stream_decoder(): Decoder for Single- and Multi-Block
+ Streams; this is good if you want to accept only .lzma Streams.
+
+
+Filters
+
+ liblzma supports multiple filters (algorithm implementations). The new
+ .lzma format supports filter chains of up to seven filters. In the
+ filter chain, the output of one filter is the input of the next filter
+ in the chain. The legacy LZMA_Alone format supports only one filter,
+ and that must always be LZMA.
+
+ General-purpose compression:
+
+ LZMA The main algorithm of liblzma (surprise!)
+
+ Branch/Call/Jump filters for executables:
+
+ x86 This filter is known as BCJ in 7-Zip
+ IA64 IA-64 (Itanium)
+ PowerPC Big endian PowerPC
+ ARM
+ ARM-Thumb
+ SPARC
+
+ Other filters:
+
+ Copy Dummy filter that simply copies all the data
+ from input to output.
+
+ Subblock Multi-purpose filter that can
+ - embed End of Payload Marker if the previous
+ filter in the chain doesn't support it; and
+ - apply Subfilters, which filter only part
+ of the same compressed Block in the Stream.
+
+ Branch/Call/Jump filters never change the size of the data. They
+ should usually be used as a pre-filter for some compression filter
+ like LZMA.
+
+
+Integrity checks
+
+ The .lzma Stream format uses CRC32 as the integrity check for
+ different file format headers. It is possible to omit CRC32 from
+ the Block Headers, but not from Stream Header. This is the reason
+ why CRC32 code cannot be disabled when building liblzma (in addition,
+ the LZMA encoder uses CRC32 for hashing, so that's another reason).
+
+ The integrity check of the actual data is calculated from the
+ uncompressed data. This check can be CRC32, CRC64, or SHA256.
+ It can also be omitted completely, although that usually is not
+ a good thing to do. There are free IDs left, so support for new
+ check algorithms can be added later.
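+
+ For readers unfamiliar with CRC32, here is a minimal bitwise
+ implementation in C. It assumes that the CRC32 meant above is the
+ common IEEE 802.3 variant (the one zlib's crc32() computes); the text
+ above doesn't name the polynomial. The function name crc32_bitwise()
+ is made up for this sketch, and a real implementation would use
+ lookup tables for speed.
+

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 using the reflected IEEE 802.3 polynomial. */
uint32_t crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];

        /* Process the byte one bit at a time, lowest bit first. */
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }

    /* Final inversion. */
    return crc ^ 0xFFFFFFFFu;
}
```

+ The standard check value for this CRC is 0xCBF43926 for the nine
+ ASCII bytes "123456789".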
+
+
+API and ABI stability
+
+ The API and ABI of liblzma aren't stable yet, although no huge
+ changes should happen. One potential place for change is the
+ lzma_options_subblock structure.
+
+ In the 4.42.0alpha phase, the shared library version number won't
+ be updated even if ABI breaks. I don't want to track the ABI changes
+ yet. Just rebuild everything when you upgrade liblzma until we get
+ to the beta stage.
+
+
+Size of the library
+
+ While liblzma isn't huge, it is quite far from the smallest possible
+ LZMA implementation: the full liblzma binary (with support for all
+ filters and other features) is way over 100 KiB, but the plain raw
+ LZMA decoder is only 5-10 KiB.
+
+ To decrease the size of the library, you can omit parts of the library
+ by passing certain options to the `configure' script. Disabling
+ everything but the decoders of the required filters will usually give
+ you a small enough library, but if you need a decoder for example
+ embedded in the operating system kernel, the code from liblzma probably
+ isn't suitable as is.
+
+ If you need a minimal implementation supporting .lzma Streams, you
+ may need to do a partial rewrite. liblzma uses a stateful API like
+ zlib, which increases the size of the library. Using a callback API
+ or an even simpler buffer-to-buffer API would allow a smaller
+ implementation.
+
+ The LZMA SDK contains an LZMA decoder written in ANSI C that is
+ smaller than liblzma, so you may want to take a look at that code.
+ However, it doesn't (at least not yet) support the new .lzma Stream
+ format.
+
+
+Documentation
+
+ There is no documentation yet other than the public headers and this
+ text. Real docs will be written some day, I hope.
+
diff --git a/doc/liblzma-security.txt b/doc/liblzma-security.txt
new file mode 100644
index 00000000..487637ed
--- /dev/null
+++ b/doc/liblzma-security.txt
@@ -0,0 +1,219 @@
+
+Using liblzma securely
+----------------------
+
+0. Introduction
+
+ This document discusses how to use liblzma securely. There are issues
+ that don't apply to zlib or libbzip2, so reading this document is
+ strongly recommended even for those who are very familiar with zlib
+ or libbzip2.
+
+ While making liblzma itself as secure as possible is essential, it's
+ out of scope of this document.
+
+
+1. Memory usage
+
+ The memory usage of liblzma varies a lot.
+
+
+1.1. Problem sources
+
+1.1.1. Block coder
+
+ The memory requirements of the Block encoder depend on the filters
+ used and their settings. The memory requirements of the Block decoder
+ depend on which filters and filter settings the Block was encoded
+ with. Usually the memory requirements of a decoder are equal to or
+ less than the requirements of the encoder with the same settings.
+
+ While the typical memory requirement for decoding a Block ranges from
+ a few hundred kilobytes to tens of megabytes, a maliciously
+ constructed file can require a lot more RAM to decode. With the
+ current filters, the maximum amount is about 7 GiB. If you use
+ multi-threaded decoding, every Block can require this amount of RAM,
+ thus a four-threaded decoder could suddenly try to allocate 28 GiB of
+ RAM.
+
+ If you don't limit the maximum memory usage in any way, and there are
+ no resource limits set on the operating system side, one malicious
+ input file can run the system out of memory, or at least make it swap
+ badly for a long time. This is exceptionally bad on servers, e.g. an
+ email server doing virus scanning on incoming messages.
+
+
+1.1.2. Metadata decoder
+
+ Multi-Block .lzma files contain at least one Metadata Block.
+ Externally the Metadata Blocks are similar to Data Blocks, so all
+ the issues mentioned about memory usage of Data Blocks apply to
+ Metadata Blocks too.
+
+ The uncompressed content of Metadata Blocks contains information about
+ the Stream as a whole, and optionally some Extra Records. The
+ information about the Stream is kept in liblzma's internal data
+ structures in RAM. Extra Records can contain arbitrary data. They are
+ not interpreted by liblzma, but liblzma will provide them to the
+ application in uninterpreted form if the application wishes so.
+
+ Usually the Uncompressed Size of a Metadata Block is small. Even in
+ extreme cases, it shouldn't be much bigger than a few megabytes. Once
+ the Metadata has been parsed into native data structures in liblzma,
+ it usually takes a little more memory than in the encoded form. For
+ all normal files, this is no problem, since the resulting memory usage
+ won't be too much.
+
+ The problem is that a maliciously constructed Metadata Block can
+ contain a huge amount of "information", which liblzma will try to store
+ in its internal data structures. This may cause liblzma to allocate
+ all the available RAM unless some kind of resource usage limits are
+ applied.
+
+ Note that the Extra Records in Metadata are always parsed, but
+ memory is allocated for them only if the application has requested
+ liblzma to provide the Extra Records to the application.
+
+
+1.2. Solutions
+
+ If you need to decode files from untrusted sources (most people do),
+ you must limit the memory usage to avoid denial of service (DoS)
+ conditions caused by malicious input files.
+
+ The first step is to find out how much memory you are allowed to
+ consume at maximum. This may be a hardcoded constant or derived from
+ the available RAM; whatever is appropriate in the application.
+
+ The simplest solution is to use setrlimit() if the kernel supports
+ RLIMIT_AS, which limits the memory usage of the whole process.
+ For more portable and fine-grained limiting, you can use the
+ memory limiter functions found in <lzma/memlimit.h>.
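+
+ A minimal sketch of the setrlimit() approach, assuming a POSIX system
+ that provides RLIMIT_AS. The helper name limit_address_space() is
+ made up for this example. Note that the limit applies to the whole
+ address space of the process, not only to what liblzma allocates.
+

```c
#include <sys/resource.h>

/*
 * Sets the soft RLIMIT_AS of the calling process to max_bytes, clamped
 * to the hard limit. Returns 0 on success and -1 on failure, in which
 * case errno has been set by getrlimit() or setrlimit().
 */
int limit_address_space(rlim_t max_bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    /* An unprivileged process cannot exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && max_bytes > rl.rlim_max)
        max_bytes = rl.rlim_max;

    rl.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_AS, &rl);
}
```

+ After the call, allocations that would take the process over the
+ limit fail (malloc() returns NULL), which the application must then
+ handle as an ordinary out-of-memory error.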
+
+
+1.2.1. Encoder
+
+ lzma_memory_usage() will give you a rough estimate about the memory
+ usage of the given filter chain. To dramatically simplify the internal
+ implementation, this function doesn't take into account all the small
+ helper data structures needed in various places; only the structures
+ with significant memory usage are taken into account. Still, the
+ accuracy of this function should be well within a mebibyte.
+
+ The Subblock filter is a special case. If a Subfilter has been
+ specified, it isn't taken into account when lzma_memory_usage()
+ calculates the memory usage. You need to calculate the memory usage
+ of the Subfilter separately.
+
+ Keeping track of Blocks in a Multi-Block Stream takes a few dozen
+ bytes of RAM per Block (size of the lzma_index structure plus overhead
+ of malloc()). It isn't a good idea to put tens of thousands of Blocks
+ into a Stream unless you have a very good reason to do so (compressed
+ dictionary could be an example of such situation).
+
+ Also keep the number and sizes of Extra Records sane. If you produce
+ the list of Extra Records automatically from some untrusted source,
+ you should not only validate the content of these Records, but also
+ their memory usage.
+
+
+1.2.2. Decoder
+
+ A single-threaded decoder should simply use a memory limiter and
+ indicate an error if it runs out of memory.
+
+ Memory limiting with multi-threaded decoding is tricky. The simple
+ solution is to divide the maximum allowed memory usage by the
+ maximum number of threads, and give each Block decoder its own
+ independent lzma_memory_limitter. The drawback is that if one Block
+ needs notably more RAM than any other Block, the decoder will run out
+ of memory when in reality there would be plenty of free RAM.
+
+ An attractive alternative would be using a shared
+ lzma_memory_limitter. Depending on the application and the expected
+ type of input, this may either be the best solution or a source of
+ hard-to-repeat problems. Consider the following requirements:
+ - You use at most n threads.
+ - x(i) is the decoder memory requirement of Block number i in an
+ expected input Stream.
+ - The memory limiter is set to a higher value than the sum of the
+ n highest values of x(i).
+
+ (If you are better at explaining the above conditions, please
+ contribute your improved version.)
+
+ If the above conditions aren't met, it is possible that the decoding
+ will fail unpredictably. That is, on the same machine using the same
+ settings, the decoding may sometimes succeed and sometimes fail. This
+ is because the threads may sometimes happen to decode the Blocks with
+ the highest memory usage at the same time.
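+
+ The two approaches can be put into code roughly as follows. This is
+ only a sketch of the arithmetic; per_thread_limit() and
+ shared_limit_is_safe() are hypothetical helpers, not liblzma
+ functions, and the values x(i) are assumed to be known to the caller.
+

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simple approach: an equal, independent share for every thread. */
uint64_t per_thread_limit(uint64_t total_limit, unsigned threads)
{
    return threads == 0 ? 0 : total_limit / threads;
}

/* qsort() comparison function that orders uint64_t values descending. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/*
 * Shared approach: returns 1 if limit is higher than the sum of the
 * n highest values in x[0] ... x[count - 1], and 0 otherwise.
 * Sorts x in place.
 */
int shared_limit_is_safe(uint64_t *x, size_t count, size_t n,
                         uint64_t limit)
{
    uint64_t sum = 0;

    qsort(x, count, sizeof(x[0]), cmp_desc);

    for (size_t i = 0; i < n && i < count; ++i) {
        if (x[i] > UINT64_MAX - sum)
            return 0; /* The sum would overflow; no limit is enough. */
        sum += x[i];
    }

    return limit > sum;
}
```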
+
+ Most .lzma files have all the Blocks encoded with identical settings,
+ or at least the memory usage won't vary dramatically. That's why most
+ multi-threaded decoders probably want to use the simple "separate
+ lzma_memory_limitter for each thread" solution, possibly falling back
+ to single-threaded mode in case the per-thread memory limits aren't
+ enough in multi-threaded mode.
+
+FIXME: Memory usage of Stream info.
+
+[
+
+]
+
+
+2. Huge uncompressed output
+
+2.1. Data Blocks
+
+ Decoding a tiny .lzma file can produce a huge amount of uncompressed
+ output. There is an example file of 45 bytes, which decodes to 64 PiB
+ (that's 2^56 bytes). Uncompressing such a file to disk is likely to
+ fill even a large disk array. If the data is written to a pipe, it
+ may not fill the disk, but would still take a very long time to finish.
+
+ To avoid denial of service conditions caused by a huge amount of
+ uncompressed output, applications using liblzma should use some method
+ to limit the amount of output produced. The exact method depends on
+ the application.
+
+ All valid .lzma Streams make it possible to find out the uncompressed
+ size of the Stream without actually uncompressing the data. This
+ information is available in at least one of the Metadata Blocks.
+ Once the uncompressed size is parsed, the decoder can verify that
+ it doesn't exceed certain limits (e.g. available disk space).
+
+ When the uncompressed size is known, the decoder can actively keep
+ track of the amount of output produced so far, and verify that it
+ doesn't exceed the known uncompressed size. If it does, the file is
+ known to be corrupt and an error should be indicated without
+ continuing to decode the rest of the file.
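+
+ Keeping track of the output can be as simple as the following
+ sketch. The output_guard type and output_guard_add() are hypothetical
+ application-side helpers, not part of liblzma. The limit would be
+ set to the known uncompressed size or to an application-specific
+ maximum, whichever is smaller.
+

```c
#include <stdint.h>

typedef struct {
    uint64_t limit;    /* maximum amount of output that is accepted */
    uint64_t produced; /* bytes of output accounted for so far */
} output_guard;

/*
 * Accounts for n more bytes of decoded output. Returns 0 if the total
 * stays within the limit, or -1 if it would exceed the limit, in which
 * case nothing is added and decoding should be aborted.
 */
int output_guard_add(output_guard *guard, uint64_t n)
{
    /* produced never exceeds limit, so the subtraction cannot wrap. */
    if (n > guard->limit - guard->produced)
        return -1;

    guard->produced += n;
    return 0;
}
```

+ The application would call output_guard_add() with the number of
+ bytes the decoder wrote after every decoding call, before passing
+ the data on.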
+
+ Unfortunately, finding the uncompressed size beforehand is often
+ possible only in non-streamed mode, because the needed information
+ could be in the Footer Metadata Block, which (obviously) is at the
+ end of the Stream. In purely streamed mode decoding, one may need to
+ use some rough arbitrary limits to prevent the problems described in
+ the beginning of this section.
+
+
+2.2. Metadata
+
+ Metadata is stored in Metadata Blocks, which are very similar to
+ Data Blocks. Thus, the uncompressed size can be huge just like with
+ Data Blocks. The difference is that the contents of Metadata Blocks
+ aren't given to the application as is, but parsed by liblzma. Still,
+ reading through a huge Metadata Block can take a very long time,
+ effectively creating a denial of service just as piping a decoded
+ Data Block to another process would.
+
+ At first it would seem that using a memory limiter would prevent
+ this issue as a side effect. But it does so only if the application
+ requests liblzma to allocate the Extra Records and provide them to
+ the application. If Extra Records aren't requested, they aren't
+ allocated either. Still, the Extra Records are being read through
+ to validate that the Metadata is in proper format.
+
+ The solution is to limit the Uncompressed Size of a Metadata Block
+ to some relatively large value. This will make liblzma give an
+ error when the given limit is reached.
+
diff --git a/doc/lzma-intro.txt b/doc/lzma-intro.txt
new file mode 100644
index 00000000..bde8a059
--- /dev/null
+++ b/doc/lzma-intro.txt
@@ -0,0 +1,107 @@
+
+Introduction to the lzma command line tool
+------------------------------------------
+
+Overview
+
+ The lzma command line tool is similar to gzip and bzip2, but for
+ compressing and uncompressing .lzma files.
+
+
+Supported file formats
+
+ By default, the tool creates files in the new .lzma format. This can
+ be overridden with the --format=FMT command line option. Use
+ --format=alone to create files in the old LZMA_Alone format.
+
+ By default, the tool uncompresses both the new .lzma format and
+ LZMA_Alone format. This is to make it transparent to switch from
+ the old LZMA_Alone format to the new .lzma format. Since both
+ formats use the same filename suffix, the average user should never
+ notice which format was used.
+
+
+Differences to gzip and bzip2
+
+ Standard input and output
+
+ Both gzip and bzip2 refuse to write compressed data to a terminal and
+ read compressed data from a terminal. With gzip (but not with bzip2),
+ this can be overridden with the `--force' option. lzma follows the
+ behavior of gzip here.
+
+ Usage of LZMA_OPT environment variable
+
+ gzip and bzip2 read GZIP and BZIP2 environment variables at startup.
+ These variables may contain extra command line options.
+
+ gzip and bzip2 allow passing not only options, but also end-of-options
+ indicator (`--') and filenames via the environment variable. No quoting
+ is supported with the filenames.
+
+ Here are examples with gzip. bzip2 behaves identically.
+
+ bash$ echo asdf > 'foo bar'
+ bash$ GZIP='"foo bar"' gzip
+ gzip: "foo: No such file or directory
+ gzip: bar": No such file or directory
+
+ bash$ GZIP=-- gzip --help
+ gzip: --help: No such file or directory
+
+ lzma silently ignores all non-option arguments given via the
+ environment variable LZMA_OPT. Like on the command line, everything
+ after `--' is taken as non-options, and thus ignored in LZMA_OPT.
+
+ bash$ LZMA_OPT='--help' lzma --version # Displays help
+ bash$ LZMA_OPT='-- --help' lzma --version # Displays version
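+
+ The rule above can be sketched in C as follows. This is a simplified
+ illustration, not the tool's actual parser: collect_env_options() is
+ a made-up name, words are split only on spaces and tabs, no quoting
+ is handled, and options that take a separate argument word are not
+ considered.
+

```c
#include <stddef.h>
#include <string.h>

/*
 * Splits env (which is modified in place) on spaces and tabs and
 * stores pointers to the option words in opts. Words that are not
 * options are skipped, and parsing stops at "--". At most max_opts
 * options are stored. Returns the number of options stored.
 */
size_t collect_env_options(char *env, char **opts, size_t max_opts)
{
    size_t count = 0;

    for (char *word = strtok(env, " \t"); word != NULL;
            word = strtok(NULL, " \t")) {
        if (strcmp(word, "--") == 0)
            break; /* Everything after this is a non-option. */

        if (word[0] != '-' || word[1] == '\0')
            continue; /* Silently ignore non-option arguments. */

        if (count < max_opts)
            opts[count++] = word;
    }

    return count;
}
```

+ Given the string `--help' the function stores one option, and given
+ `-- --help' it stores none, which matches the two examples above.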
+
+
+Filter chain presets
+
+ Like in gzip and bzip2, lzma supports numbered presets from 1 to 9
+ where 1 is the fastest and 9 the best compression. 1 and 2 are for
+ fast compressing with small memory usage, 3 to 6 for good compression
+ ratio with medium memory usage, and 7 to 9 for excellent compression
+ ratio with higher memory requirements. The default is 7 if the memory
+ usage limit allows.
+
+ In the future, there will probably be an option like --preset=NAME, which
+ will contain more special presets for specific file types.
+
+ It's also possible that there will be some heuristics to select good
+ filters. For example, the tool could detect when a .tar archive is
+ being compressed, and enable the x86 filter only for those files in the
+ .tar archive that are ELF or PE executables for x86.
+
+
+Specifying custom filter chains
+
+ Custom filter chains are specified by using long options with the name
+ of the filters in correct order. For example, to pass the input data to
+ the x86 filter and the output of that to the LZMA filter, the following
+ command will do:
+
+ lzma --x86 --lzma filename
+
+ Some filters accept options, which are specified as a comma-separated
+ list of key=value pairs:
+
+ lzma --delta=distance=4 --lzma=dict=4Mi,lc=8,lp=2 filename
+
+
+Memory usage control
+
+ By default, the command line tool limits memory usage to 1/3 of the
+ available physical RAM. If no preset or custom filter chain has been
+ given, the default preset will be used. If the memory limit is too
+ low for the default preset, the tool will silently switch to a lower
+ preset.
+
+ When a preset or a custom filter chain has been specified and the
+ memory limit is too low, an error message is displayed and no files
+ are processed.
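+
+ The selection logic described above could look roughly like this in
+ C. The function names are hypothetical, and the per-preset memory
+ usage figures are supplied by the caller because this document
+ doesn't list them. The sketch also assumes that "a lower preset"
+ means the highest preset below the default that fits the limit.
+

```c
#include <stdint.h>

/* Default memory usage limit: one third of the physical RAM. */
uint64_t default_memlimit(uint64_t physical_ram)
{
    return physical_ram / 3;
}

/*
 * Picks a preset for the given memory limit. usage[p - 1] is the
 * memory needed by preset p (1 to 9). If explicit_request is nonzero,
 * the user asked for exactly `wanted', and 0 is returned (an error)
 * if it doesn't fit. Otherwise the highest preset not above `wanted'
 * that fits is returned, or 0 if none fits.
 */
unsigned pick_preset(const uint64_t usage[9], unsigned wanted,
                     int explicit_request, uint64_t limit)
{
    if (wanted < 1 || wanted > 9)
        return 0;

    if (explicit_request)
        return usage[wanted - 1] <= limit ? wanted : 0;

    /* No explicit request: silently fall back to a lower preset. */
    for (unsigned p = wanted; p >= 1; --p)
        if (usage[p - 1] <= limit)
            return p;

    return 0;
}
```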
+
+ If the decoder hits the memory usage limit, an error is displayed and
+ no more files are processed.
+