Advanced features of liblzma
----------------------------

0. Introduction

Most developers need only the basic features of liblzma. These features allow single-threaded encoding and decoding of .lzma files in streamed mode.

In some cases developers want more. The .lzma file format is designed to allow multi-threaded encoding and decoding and limited random-access reading. These features are possible in non-streamed mode and, to a limited extent, also in streamed mode.

To take advantage of these features, the application needs a custom .lzma file format handler. liblzma provides a set of tools to ease this task, but it's still quite a bit of work to get a good custom .lzma handler done.


1. Where to begin

Start by reading the .lzma file format specification. Understanding the basics of the .lzma file structure is required to implement a custom .lzma file handler and to understand the rest of this document.


2. The basic components

2.1. Stream Header and tail

Stream Header begins the .lzma Stream and Stream tail ends it. Stream Header is defined in the file format specification, but Stream tail isn't (thus I write "tail" with a lower-case letter). Stream tail is simply the Stream Flags and the Footer Magic Bytes fields together. It was done this way in liblzma because the Block coders take care of the rest of the Stream Footer.

For now, the size of Stream Header is fixed to 11 bytes. The header <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you should use instead of a hardcoded number. Similarly, Stream tail is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.

It is possible that a future version of the .lzma format will have variable-sized Stream Header and tail. As of writing, this seems so unlikely that it was considered simplest to use constants instead of providing functions to get and store the sizes of the Stream Header and tail.

2.x. Stream tail

For now, the size of Stream tail is fixed to 3 bytes.
The header <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you should use instead of a hardcoded number.


3. Keeping track of size information

The lzma_info_* functions found in <lzma/info.h> should ease the task of keeping track of the sizes of the Blocks and also of the Stream as a whole. Using these functions is strongly recommended, because there are surprisingly many situations where an error can occur, and these functions check for possible errors every time some new information becomes available.

If you find the lzma_info_* functions lacking something that you would find useful, please contact the author.

3.1. Start offset of the Stream

If you are storing the .lzma Stream inside another file format, or for some other reason are placing the .lzma Stream somewhere other than the beginning of the file, you should tell the starting offset of the Stream using lzma_info_start_offset_set().

The start offset of the Stream is used for two distinct purposes. First, knowing the start offset of the Stream allows lzma_info_alignment_get() to correctly calculate the alignment of every Block. This information is given to the Block encoder, which will calculate the size of Header Padding so that Compressed Data is aligned at an optimal offset.

Another use for the start offset of the Stream is in random-access reading. If you set the start offset of the Stream, lzma_info_locate() will be able to calculate the offset relative to the beginning of the file containing the Stream (instead of the offset relative to the beginning of the Stream).

3.2. Size of Stream Header

While the size of Stream Header is constant (11 bytes) in the current version of the .lzma file format, this may change in the future.

3.3. Size of Header Metadata Block

This information is needed when doing random-access reading, and to verify the value of this field stored in Footer Metadata Block.

3.4. Total Size of the Data Blocks

3.5. Uncompressed Size of Data Blocks

3.6. Index


x. Alignment

There are a few slightly different types of alignment issues when working with .lzma files.

The .lzma format doesn't strictly require any kind of alignment. However, if the encoder carefully optimizes the alignment in all situations, it can improve compression ratio, speed of the encoder and decoder, and slightly help if the files get damaged and need recovery.

Alignment has the most significant effect on compression ratio. FIXME

x.1. Compression ratio

Some filters take advantage of the alignment of the input data. To get the best compression ratio, make sure that you feed these filters correctly aligned data.

Some filters (e.g. LZMA) don't necessarily mind too much if the input doesn't match the preferred alignment. With these filters the penalty in compression ratio depends on the specific type of data being compressed.

Other filters (e.g. the PowerPC executable filter) won't work at all with data that is improperly aligned. While the data can still be de-filtered back to its original form, the benefit of the filtering (better compression ratio) is completely lost, because these filters expect certain patterns at properly aligned offsets. The compression ratio may even be worse with incorrectly aligned input than without the filter.

x.1.1. Inter-filter alignment

When there are multiple filters chained, checking the alignment can be useful not only with the input of the first filter and the output of the last filter, but also between the filters.

Inter-filter alignment is especially important with the Subblock filter.

x.1.2. Further compression with external tools

This is a relatively rare situation in practice, but still worth understanding.

Let's say that there are several SPARC executables, which are each filtered to separate .lzma files using only the SPARC filter. If Uncompressed Size is written to the Block Header, the size of Block Header may vary between the .lzma files.
If no Padding is used in the Block Header to correct the alignment, the starting offset of the Compressed Data field will be differently aligned in different .lzma files.

All these .lzma files are archived into a single .tar archive. Due to the nature of the .tar format, every file is aligned inside the archive to an offset that is a multiple of 512 bytes.

The .tar archive is compressed into a new .lzma file using the LZMA filter with options that prefer an input alignment of four bytes. Now if the independent .lzma files don't have the same alignment of the Compressed Data fields, the LZMA filter will be unable to take advantage of the input alignment between the files in the .tar archive, which reduces compression ratio.

Thus, even if you have only a single Block per file, it can be good for compression ratio to align the Compressed Data to an optimal offset.

x.2. Speed

Most modern computers are faster when multi-byte data is located at aligned offsets in RAM. Proper alignment of the Compressed Data fields can slightly increase the speed of some filters.

x.3. Recovery

Aligning every Block Header to start at an offset with big enough alignment may ease or at least speed up recovery of broken files.


y. Typical usage cases

y.x. Parsing the Stream backwards

You may need to parse the Stream backwards if you need to get information such as the sizes of the Stream, Index, or Extra. The basic procedure to do this follows.

Locate the end of the Stream. If the Stream is stored as is in a standalone .lzma file, simply seek to the end of the file and start reading backwards using an appropriate buffer size. The file format specification allows an arbitrary amount of Footer Padding (zero or more NUL bytes), which you skip before trying to decode the Stream tail.

Once you have located the end of the Stream (a non-NUL byte), make sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the Stream in a buffer.
If there aren't enough bytes left in the file, the file is too small to contain a valid Stream. Decode the Stream tail using lzma_stream_tail_decoder(). Store the offset of the first byte of the Stream tail; you will need it later.

You may now want to do some internal verifications, e.g. whether the Check type is supported by the liblzma build you are using.

Decode the Backward Size field with lzma_vli_reverse_decode(). The field is at most LZMA_VLI_BYTES_MAX bytes long. Check that Backward Size is not zero. Store the offset of the first byte of the Backward Size; you will need it later.

Now you know the Total Size of the last Block of the Stream. It's the value of Backward Size plus the size of the Backward Size field. Note that you cannot use lzma_vli_size() to calculate the size since there might be padding; you need to use the real observed size of the Backward Size field.

At this point, the operation continues differently for Single-Block and Multi-Block Streams.

y.x.1. Single-Block Stream

There might be an Uncompressed Size field present in the Stream Footer. You cannot know it for sure unless you have already parsed the Block Header earlier. For security reasons, you probably want to try to decode the Uncompressed Size field, but you must not indicate any error if decoding fails. Later you can give the decoded Uncompressed Size to the Block decoder if Uncompressed Size isn't otherwise known; this prevents it from producing too much output in case of a (possibly intentionally) corrupt file.

Calculate the start offset of the Stream:

    backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE

backward_offset is the offset of the first byte of the Backward Size field. Remember to check for integer overflows, which can occur with invalid input files.

Seek to the beginning of the Stream. Decode the Stream Header using lzma_stream_header_decoder(). Verify that the decoded Stream Flags match the values found in the Stream tail.
You can use the lzma_stream_flags_is_equal() macro for this.

Decode the Block Header. Verify that it isn't a Metadata Block, since Single-Block Streams cannot have Metadata. If Uncompressed Size is present in the Block Header, the value you tried to decode from the Stream Footer must be ignored, since Uncompressed Size wasn't actually present there. If the Block Header doesn't have Uncompressed Size, and decoding the Uncompressed Size field from the Stream Footer failed, the file is corrupt.

If you were only looking for the Uncompressed Size of the Stream, you now have that information, and you can stop processing the Stream.

To decode the Block, the same instructions apply as described in FIXME. However, because you have some extra known information decoded from the Stream Footer, you should give this information to the Block decoder so that it can verify it while decoding:

  - If Uncompressed Size is not present in the Block Header, set lzma_options_block.uncompressed_size to the value you decoded from the Stream Footer.

  - Always set lzma_options_block.total_size to backward_size + size_of_backward_size (you calculated this sum earlier already).

y.x.2. Multi-Block Stream

Calculate the start offset of the Footer Metadata Block:

    backward_offset - backward_size

backward_offset is the offset of the first byte of the Backward Size field. Remember to check for integer overflows, which can occur with broken input files.

Decode the Block Header. Verify that it is a Metadata Block. Set lzma_options_block.total_size to backward_size + size_of_backward_size (you calculated this sum earlier already). Then decode the Footer Metadata Block.

Store the decoded Footer Metadata to the lzma_info structure using lzma_info_set_metadata(). Also set the offset of the Backward Size field using lzma_info_size_set(). Then you can get the start offset of the Stream using lzma_info_size_get(). Note that any of these steps may fail, so don't omit error checking.

Seek to the beginning of the Stream.
Decode the Stream Header using lzma_stream_header_decoder(). Verify that the decoded Stream Flags match the values found in the Stream tail. You can use the lzma_stream_flags_is_equal() macro for this.

If you were only looking for the Uncompressed Size of the Stream, it's possible that you already have it now. If Uncompressed Size (or whatever information you were looking for) isn't available yet, continue by also decoding the Header Metadata Block. (If some information is missing, the Header Metadata Block has to be present.)

Decoding the Data Blocks goes the same way as described in FIXME.

y.x.3. Variations

If you know the offset of the beginning of the Stream, you may want to parse the Stream Header before parsing the Stream tail.
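
The offset arithmetic used when parsing the Stream backwards (skipping Footer Padding, then computing the start offset of a Single-Block Stream with overflow checks) can be sketched roughly as follows. This is only an illustration under stated assumptions: the end of the file is assumed to be in a memory buffer already, the example_* functions and EXAMPLE_* constants are hypothetical names that are not part of liblzma, and the constants merely repeat the sizes given in this document. Real code should use LZMA_STREAM_HEADER_SIZE and LZMA_STREAM_TAIL_SIZE and the liblzma decoders for the fields themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for LZMA_STREAM_HEADER_SIZE and
 * LZMA_STREAM_TAIL_SIZE, using the sizes stated in this document. */
#define EXAMPLE_STREAM_HEADER_SIZE 11
#define EXAMPLE_STREAM_TAIL_SIZE 3

/* Drop trailing NUL bytes (Footer Padding) from the end of buf.
 * Returns the offset one past the last non-NUL byte, or 0 if the
 * buffer contains nothing but NUL bytes. */
static size_t
example_skip_footer_padding(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    while (size > 0 && buf[size - 1] == 0x00)
        --size;

    return size;
}

/* Check that there is room for a complete Stream tail before the
 * located end of the Stream. */
static bool
example_has_stream_tail(size_t stream_end)
{
    return stream_end >= EXAMPLE_STREAM_TAIL_SIZE;
}

/* Compute the start offset of a Single-Block Stream:
 *
 *     backward_offset - backward_size - EXAMPLE_STREAM_HEADER_SIZE
 *
 * Returns false instead of letting the unsigned subtraction wrap
 * around, which could otherwise happen with invalid input files.
 * A Backward Size of zero is rejected too. */
static bool
example_single_block_stream_start(uint64_t backward_offset,
        uint64_t backward_size, uint64_t *stream_start)
{
    if (backward_size == 0 || backward_size > backward_offset)
        return false;

    const uint64_t block_start = backward_offset - backward_size;
    if (block_start < EXAMPLE_STREAM_HEADER_SIZE)
        return false;

    *stream_start = block_start - EXAMPLE_STREAM_HEADER_SIZE;
    return true;
}
```

For example, with backward_offset = 100 and backward_size = 50, the sketch reports a Stream start offset of 39. With backward_offset = 10 and backward_size = 50 it reports failure, because the Block would have to begin before the start of the file.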