Diffstat (limited to 'doc/liblzma-advanced.txt')
-rw-r--r-- | doc/liblzma-advanced.txt | 324 |
1 files changed, 324 insertions, 0 deletions
diff --git a/doc/liblzma-advanced.txt b/doc/liblzma-advanced.txt
new file mode 100644
index 00000000..d829a33a
--- /dev/null
+++ b/doc/liblzma-advanced.txt
@@ -0,0 +1,324 @@
+
+Advanced features of liblzma
+----------------------------
+
+0. Introduction
+
+    Most developers need only the basic features of liblzma. These
+    features allow single-threaded encoding and decoding of .lzma files
+    in streamed mode.
+
+    In some cases developers want more. The .lzma file format is
+    designed to allow multi-threaded encoding and decoding and limited
+    random-access reading. These features are possible in non-streamed
+    mode and, to a limited extent, also in streamed mode.
+
+    To take advantage of these features, the application needs a custom
+    .lzma file format handler. liblzma provides a set of tools to ease
+    this task, but it's still quite a bit of work to get a good custom
+    .lzma handler done.
+
+
+1. Where to begin
+
+    Start by reading the .lzma file format specification. Understanding
+    the basics of the .lzma file structure is required to implement a
+    custom .lzma file handler and to understand the rest of this document.
+
+
+2. The basic components
+
+2.1. Stream Header and tail
+
+    Stream Header begins the .lzma Stream and Stream tail ends it. Stream
+    Header is defined in the file format specification, but Stream tail
+    isn't (thus I write "tail" with a lower-case letter). Stream tail is
+    simply the Stream Flags and the Footer Magic Bytes fields together.
+    It was done this way in liblzma, because the Block coders take care
+    of the rest of the fields in the Stream Footer.
+
+    For now, the size of Stream Header is fixed to 11 bytes. The header
+    <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
+    should use instead of a hardcoded number. Similarly, Stream tail
+    is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
+
+    It is possible that a future version of the .lzma format will have
+    variable-sized Stream Header and tail. As of writing, this seems so
+    unlikely, though, that it was considered simplest to just use a
+    constant instead of providing functions to get and store the sizes
+    of the Stream Header and tail.
+
+
+2.x. Stream tail
+
+    For now, the size of Stream tail is fixed to 3 bytes. The header
+    <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
+    should use instead of a hardcoded number.
+
+
+3. Keeping track of size information
+
+    The lzma_info_* functions found in <lzma/info.h> should ease the
+    task of keeping track of sizes of the Blocks and also the Stream
+    as a whole. Using these functions is strongly recommended, because
+    there are surprisingly many situations where an error can occur,
+    and these functions check for possible errors every time some new
+    information becomes available.
+
+    If you find lzma_info_* functions lacking something that you would
+    find useful, please contact the author.
+
+
+3.1. Start offset of the Stream
+
+    If you are storing the .lzma Stream inside another file format, or
+    for some other reason are placing the .lzma Stream somewhere
+    other than the beginning of the file, you should specify the starting
+    offset of the Stream using lzma_info_start_offset_set().
+
+    The start offset of the Stream is used for two distinct purposes.
+    First, knowing the start offset of the Stream allows
+    lzma_info_alignment_get() to correctly calculate the alignment of
+    every Block. This information is given to the Block encoder, which
+    will calculate the size of Header Padding so that Compressed Data
+    is aligned at an optimal offset.
+
+    Another use for start offset of the Stream is in random-access
+    reading. If you set the start offset of the Stream, lzma_info_locate()
+    will be able to calculate the offset relative to the beginning of the
+    file containing the Stream (instead of offset relative to the
+    beginning of the Stream).
+
+
+3.2. Size of Stream Header
+
+    While the size of Stream Header is constant (11 bytes) in the current
+    version of the .lzma file format, this may change in the future.
+
+
+3.3. Size of Header Metadata Block
+
+    This information is needed when doing random-access reading, and
+    to verify the value of this field stored in Footer Metadata Block.
+
+
+3.4. Total Size of the Data Blocks
+
+
+3.5. Uncompressed Size of Data Blocks
+
+
+3.6. Index
+
+
+
+
+x. Alignment
+
+    There are a few slightly different types of alignment issues when
+    working with .lzma files.
+
+    The .lzma format doesn't strictly require any kind of alignment.
+    However, if the encoder carefully optimizes the alignment in all
+    situations, it can improve compression ratio, speed of the encoder
+    and decoder, and slightly help if the files get damaged and need
+    recovery.
+
+    Alignment has the most significant effect on compression ratio FIXME
+
+
+x.1. Compression ratio
+
+    Some filters take advantage of the alignment of the input data.
+    To get the best compression ratio, make sure that you feed these
+    filters correctly aligned data.
+
+    Some filters (e.g. LZMA) don't necessarily mind too much if the
+    input doesn't match the preferred alignment. With these filters
+    the penalty in compression ratio depends on the specific type of
+    data being compressed.
+
+    Other filters (e.g. PowerPC executable filter) won't work at all
+    with data that is improperly aligned. While the data can still
+    be de-filtered back to its original form, the benefit of the
+    filtering (better compression ratio) is completely lost, because
+    these filters expect certain patterns at properly aligned offsets.
+    The compression ratio may even be worse with incorrectly aligned input
+    than without the filter.
+
+
+x.1.1. Inter-filter alignment
+
+    When there are multiple filters chained, checking the alignment can
+    be useful not only with the input of the first filter and output of
+    the last filter, but also between the filters.
+
+    Inter-filter alignment is important especially with the Subblock filter.
+
+
+x.1.2. Further compression with external tools
+
+    This is a relatively rare situation in practice, but still worth
+    understanding.
+
+    Let's say that there are several SPARC executables, which are each
+    filtered to separate .lzma files using only the SPARC filter. If
+    Uncompressed Size is written to the Block Header, the size of Block
+    Header may vary between the .lzma files. If no Padding is used in
+    the Block Header to correct the alignment, the starting offset of
+    the Compressed Data field will be differently aligned in different
+    .lzma files.
+
+    All these .lzma files are archived into a single .tar archive. Due
+    to the nature of the .tar format, every file is aligned inside the
+    archive to an offset that is a multiple of 512 bytes.
+
+    The .tar archive is compressed into a new .lzma file using the LZMA
+    filter with options that prefer input alignment of four bytes. Now
+    if the independent .lzma files don't have the same alignment of
+    the Compressed Data fields, the LZMA filter will be unable to take
+    advantage of the input alignment between the files in the .tar
+    archive, which reduces compression ratio.
+
+    Thus, even if you have only a single Block per file, it can be good for
+    compression ratio to align the Compressed Data to an optimal offset.
+
+
+x.2. Speed
+
+    Most modern computers are faster when multi-byte data is located
+    at aligned offsets in RAM. Proper alignment of the Compressed Data
+    fields can slightly increase the speed of some filters.
+
+
+x.3. Recovery
+
+    Aligning every Block Header to start at an offset with big enough
+    alignment may ease or at least speed up recovery of broken files.
+
+
+y. Typical usage cases
+
+y.x. Parsing the Stream backwards
+
+    You may need to parse the Stream backwards if you need to get
+    information such as the sizes of the Stream, Index, or Extra.
+    The basic procedure to do this follows.
+
+    Locate the end of the Stream. If the Stream is stored as is in a
+    standalone .lzma file, simply seek to the end of the file and start
+    reading backwards using an appropriate buffer size. The file format
+    specification allows an arbitrary amount of Footer Padding (zero or more
+    NUL bytes), which you skip before trying to decode the Stream tail.
+
+    Once you have located the end of the Stream (a non-NUL byte), make
+    sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
+    Stream in a buffer. If there aren't enough bytes left in the file,
+    the file is too small to contain a valid Stream. Decode the Stream
+    tail using lzma_stream_tail_decoder(). Store the offset of the first
+    byte of the Stream tail; you will need it later.
+
+    You may now want to do some internal verifications, e.g. whether the Check
+    type is supported by the liblzma build you are using.
+
+    Decode the Backward Size field with lzma_vli_reverse_decode(). The
+    field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
+    Backward Size is not zero. Store the offset of the first byte of
+    the Backward Size; you will need it later.
+
+    Now you know the Total Size of the last Block of the Stream. It's the
+    value of Backward Size plus the size of the Backward Size field. Note
+    that you cannot use lzma_vli_size() to calculate the size since there
+    might be padding; you need to use the real observed size of the
+    Backward Size field.
+
+    At this point, the operation continues differently for Single-Block
+    and Multi-Block Streams.
+
+
+y.x.1. Single-Block Stream
+
+    There might be an Uncompressed Size field present in the Stream Footer.
+    You cannot know it for sure unless you have already parsed the Block
+    Header earlier. For security reasons, you probably want to try to
+    decode the Uncompressed Size field, but you must not indicate any
+    error if decoding fails. Later you can give the decoded Uncompressed
+    Size to the Block decoder if Uncompressed Size isn't otherwise known;
+    this prevents it from producing too much output in case of a (possibly
+    intentionally) corrupt file.
+
+    Calculate the start offset of the Stream:
+
+        backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
+
+    backward_offset is the offset of the first byte of the Backward Size
+    field. Remember to check for integer overflows, which can occur with
+    invalid input files.
+
+    Seek to the beginning of the Stream. Decode the Stream Header using
+    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
+    match the values found from Stream tail. You can use the
+    lzma_stream_flags_is_equal() macro for this.
+
+    Decode the Block Header. Verify that it isn't a Metadata Block, since
+    Single-Block Streams cannot have Metadata. If Uncompressed Size is
+    present in the Block Header, the value you tried to decode from the
+    Stream Footer must be ignored, since Uncompressed Size wasn't actually
+    present there. If Block Header doesn't have Uncompressed Size, and
+    decoding the Uncompressed Size field from the Stream Footer failed,
+    the file is corrupt.
+
+    If you were only looking for the Uncompressed Size of the Stream,
+    you now have that information, and you can stop processing the Stream.
+
+    To decode the Block, the same instructions apply as described in
+    FIXME. However, because you have some extra known information decoded
+    from the Stream Footer, you should give this information to the Block
+    decoder so that it can verify it while decoding:
+      - If Uncompressed Size is not present in the Block Header, set
+        lzma_options_block.uncompressed_size to the value you decoded
+        from the Stream Footer.
+      - Always set lzma_options_block.total_size to backward_size +
+        size_of_backward_size (you calculated this sum earlier already).
+
+
+y.x.2. Multi-Block Stream
+
+    Calculate the start offset of the Footer Metadata Block:
+
+        backward_offset - backward_size
+
+    backward_offset is the offset of the first byte of the Backward Size
+    field. Remember to check for integer overflows, which can occur with
+    broken input files.
+
+    Decode the Block Header. Verify that it is a Metadata Block. Set
+    lzma_options_block.total_size to backward_size + size_of_backward_size
+    (you calculated this sum earlier already). Then decode the Footer
+    Metadata Block.
+
+    Store the decoded Footer Metadata in the lzma_info structure using
+    lzma_info_set_metadata(). Also set the offset of the Backward Size
+    field using lzma_info_size_set(). Then you can get the start offset
+    of the Stream using lzma_info_size_get(). Note that any of these steps
+    may fail, so don't omit error checking.
+
+    Seek to the beginning of the Stream. Decode the Stream Header using
+    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
+    match the values found from Stream tail. You can use the
+    lzma_stream_flags_is_equal() macro for this.
+
+    If you were only looking for the Uncompressed Size of the Stream,
+    it's possible that you already have it now. If Uncompressed Size (or
+    whatever information you were looking for) isn't available yet,
+    continue by decoding also the Header Metadata Block. (If some
+    information is missing, the Header Metadata Block has to be present.)
+
+    Decoding the Data Blocks goes the same way as described in FIXME.
+
+
+y.x.3. Variations
+
+    If you know the offset of the beginning of the Stream, you may want
+    to parse the Stream Header before parsing the Stream tail.
+
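
Note on the offset arithmetic in y.x.1 and y.x.2: the document asks for
integer overflow checks but does not show them, so the following is a
minimal sketch of how they might look in C. It is not liblzma code. The
example_* helpers and EXAMPLE_STREAM_HEADER_SIZE are hypothetical names
introduced only for illustration; the value 11 is the Stream Header size
quoted in section 2.1, and real code should use LZMA_STREAM_HEADER_SIZE
from <lzma/stream_flags.h> instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors the 11-byte Stream Header size from section 2.1. */
#define EXAMPLE_STREAM_HEADER_SIZE UINT64_C(11)

/* Hypothetical helper: skip Footer Padding (trailing NUL bytes) in a
 * buffer holding the end of the file. Returns the number of bytes left
 * once the padding is dropped, or 0 if the buffer is all padding. */
static size_t
example_skip_footer_padding(const uint8_t *buf, size_t size)
{
	while (size > 0 && buf[size - 1] == 0x00)
		--size;

	return size;
}

/* Hypothetical helper for the Multi-Block case (y.x.2):
 *     backward_offset - backward_size
 * Returns false instead of wrapping around on invalid input. */
static bool
example_footer_metadata_start(uint64_t backward_offset,
		uint64_t backward_size, uint64_t *block_start)
{
	/* Backward Size must not be zero, and the subtraction must not
	 * go below the beginning of the file. */
	if (backward_size == 0 || backward_size > backward_offset)
		return false;

	*block_start = backward_offset - backward_size;
	return true;
}

/* Hypothetical helper for the Single-Block case (y.x.1):
 *     backward_offset - backward_size - header size */
static bool
example_single_block_stream_start(uint64_t backward_offset,
		uint64_t backward_size, uint64_t *stream_start)
{
	uint64_t block_start;
	if (!example_footer_metadata_start(backward_offset,
			backward_size, &block_start))
		return false;

	/* There must still be room for the Stream Header. */
	if (block_start < EXAMPLE_STREAM_HEADER_SIZE)
		return false;

	*stream_start = block_start - EXAMPLE_STREAM_HEADER_SIZE;
	return true;
}
```

Each subtraction is checked before it is performed, so a corrupt
Backward Size makes the helper return false rather than produce a huge
wrapped-around offset. A caller would treat a false return as a corrupt
file and stop parsing.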