author | Lasse Collin <lasse.collin@tukaani.org> | 2009-05-01 11:28:52 +0300 |
---|---|---|
committer | Lasse Collin <lasse.collin@tukaani.org> | 2009-05-01 11:28:52 +0300 |
commit | be06858d5cf8ba46557395035d821dc332f3f830 (patch) | |
tree | 603491cf2b789dd19afd7f3cc6185873f1a36cb8 /doc/liblzma-advanced.txt | |
parent | Added documentation about the legacy .lzma file format. (diff) | |
download | xz-be06858d5cf8ba46557395035d821dc332f3f830.tar.xz |
Remove docs that are too outdated to be updated
(rewrite will be better).
Diffstat (limited to 'doc/liblzma-advanced.txt')
-rw-r--r-- | doc/liblzma-advanced.txt | 324 |
1 files changed, 0 insertions, 324 deletions
diff --git a/doc/liblzma-advanced.txt b/doc/liblzma-advanced.txt
deleted file mode 100644
index 6e1c9834..00000000
--- a/doc/liblzma-advanced.txt
+++ /dev/null
@@ -1,324 +0,0 @@
-
-Advanced features of liblzma
-----------------------------
-
-0. Introduction
-
-    Most developers need only the basic features of liblzma. These
-    features allow single-threaded encoding and decoding of .lzma files
-    in streamed mode.
-
-    In some cases developers want more. The .lzma file format is
-    designed to allow multi-threaded encoding and decoding and limited
-    random-access reading. These features are possible in non-streamed
-    mode and, to a limited extent, also in streamed mode.
-
-    To take advantage of these features, the application needs a custom
-    .lzma file format handler. liblzma provides a set of tools to ease
-    this task, but it's still quite a bit of work to get a good custom
-    .lzma handler done.
-
-
-1. Where to begin
-
-    Start by reading the .lzma file format specification. Understanding
-    the basics of the .lzma file structure is required to implement a
-    custom .lzma file handler and to understand the rest of this document.
-
-
-2. The basic components
-
-2.1. Stream Header and tail
-
-    Stream Header begins the .lzma Stream and Stream tail ends it. Stream
-    Header is defined in the file format specification, but Stream tail
-    isn't (thus I write "tail" with a lower-case letter). Stream tail is
-    simply the Stream Flags and the Footer Magic Bytes fields together.
-    It was done this way in liblzma, because the Block coders take care
-    of the rest of the stuff in the Stream Footer.
-
-    For now, the size of Stream Header is fixed to 11 bytes. The header
-    <lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
-    should use instead of a hardcoded number. Similarly, Stream tail
-    is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
-
-    It is possible that a future version of the .lzma format will have
-    variable-sized Stream Header and tail. As of writing, this seems so
-    unlikely, though, that it was considered simplest to just use a
-    constant instead of providing functions to get and store the sizes
-    of the Stream Header and tail.
-
-
-2.x. Stream tail
-
-    For now, the size of Stream tail is fixed to 3 bytes. The header
-    <lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
-    should use instead of a hardcoded number.
-
-
-3. Keeping track of size information
-
-    The lzma_info_* functions found in <lzma/info.h> should ease the
-    task of keeping track of sizes of the Blocks and also the Stream
-    as a whole. Using these functions is strongly recommended, because
-    there are surprisingly many situations where an error can occur,
-    and these functions check for possible errors every time some new
-    information becomes available.
-
-    If you find lzma_info_* functions lacking something that you would
-    find useful, please contact the author.
-
-
-3.1. Start offset of the Stream
-
-    If you are storing the .lzma Stream inside another file format, or
-    for some other reason are placing the .lzma Stream somewhere
-    other than the beginning of the file, you should specify the starting
-    offset of the Stream using lzma_info_start_offset_set().
-
-    The start offset of the Stream is used for two distinct purposes.
-    First, knowing the start offset of the Stream allows
-    lzma_info_alignment_get() to correctly calculate the alignment of
-    every Block. This information is given to the Block encoder, which
-    will calculate the size of Header Padding so that Compressed Data
-    is aligned at an optimal offset.
-
-    Another use for start offset of the Stream is in random-access
-    reading. If you set the start offset of the Stream, lzma_info_locate()
-    will be able to calculate the offset relative to the beginning of the
-    file containing the Stream (instead of offset relative to the
-    beginning of the Stream).
-
-
-3.2. Size of Stream Header
-
-    While the size of Stream Header is constant (11 bytes) in the current
-    version of the .lzma file format, this may change in the future.
-
-
-3.3. Size of Header Metadata Block
-
-    This information is needed when doing random-access reading, and
-    to verify the value of this field stored in Footer Metadata Block.
-
-
-3.4. Total Size of the Data Blocks
-
-
-3.5. Uncompressed Size of Data Blocks
-
-
-3.6. Index
-
-
-
-
-x. Alignment
-
-    There are a few slightly different types of alignment issues when
-    working with .lzma files.
-
-    The .lzma format doesn't strictly require any kind of alignment.
-    However, if the encoder carefully optimizes the alignment in all
-    situations, it can improve compression ratio, speed of the encoder
-    and decoder, and slightly help if the files get damaged and need
-    recovery.
-
-    Alignment has the most significant effect on compression ratio FIXME
-
-
-x.1. Compression ratio
-
-    Some filters take advantage of the alignment of the input data.
-    To get the best compression ratio, make sure that you feed these
-    filters correctly aligned data.
-
-    Some filters (e.g. LZMA) don't necessarily mind too much if the
-    input doesn't match the preferred alignment. With these filters
-    the penalty in compression ratio depends on the specific type of
-    data being compressed.
-
-    Other filters (e.g. PowerPC executable filter) won't work at all
-    with data that is improperly aligned. While the data can still
-    be de-filtered back to its original form, the benefit of the
-    filtering (better compression ratio) is completely lost, because
-    these filters expect certain patterns at properly aligned offsets.
-    The compression ratio may even be worse with incorrectly aligned
-    input than without the filter.
-
-
-x.1.1. Inter-filter alignment
-
-    When there are multiple filters chained, checking the alignment can
-    be useful not only with the input of the first filter and output of
-    the last filter, but also between the filters.
-
-    Inter-filter alignment is especially important with the Subblock
-    filter.
-
-
-x.1.2. Further compression with external tools
-
-    This is a relatively rare situation in practice, but still worth
-    understanding.
-
-    Let's say that there are several SPARC executables, which are each
-    filtered to separate .lzma files using only the SPARC filter. If
-    Uncompressed Size is written to the Block Header, the size of Block
-    Header may vary between the .lzma files. If no Padding is used in
-    the Block Header to correct the alignment, the starting offset of
-    the Compressed Data field will be differently aligned in different
-    .lzma files.
-
-    All these .lzma files are archived into a single .tar archive. Due
-    to the nature of the .tar format, every file is aligned inside the
-    archive to an offset that is a multiple of 512 bytes.
-
-    The .tar archive is compressed into a new .lzma file using the LZMA
-    filter with options that prefer input alignment of four bytes. Now
-    if the independent .lzma files don't have the same alignment of
-    the Compressed Data fields, the LZMA filter will be unable to take
-    advantage of the input alignment between the files in the .tar
-    archive, which reduces compression ratio.
-
-    Thus, even if you have only a single Block per file, it can be good
-    for compression ratio to align the Compressed Data to an optimal
-    offset.
-
-
-x.2. Speed
-
-    Most modern computers are faster when multi-byte data is located
-    at aligned offsets in RAM. Proper alignment of the Compressed Data
-    fields can slightly increase the speed of some filters.
-
-
-x.3. Recovery
-
-    Aligning every Block Header to start at an offset with big enough
-    alignment may ease or at least speed up recovery of broken files.
-
-
-y. Typical usage cases
-
-y.x. Parsing the Stream backwards
-
-    You may need to parse the Stream backwards if you need to get
-    information such as the sizes of the Stream, Index, or Extra.
-    The basic procedure to do this follows.
-
-    Locate the end of the Stream. If the Stream is stored as is in a
-    standalone .lzma file, simply seek to the end of the file and start
-    reading backwards using an appropriate buffer size. The file format
-    specification allows an arbitrary amount of Footer Padding (zero or
-    more NUL bytes), which you skip before trying to decode the Stream
-    tail.
-
-    Once you have located the end of the Stream (a non-NUL byte), make
-    sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
-    Stream in a buffer. If there aren't enough bytes left in the file,
-    the file is too small to contain a valid Stream. Decode the Stream
-    tail using lzma_stream_tail_decoder(). Store the offset of the first
-    byte of the Stream tail; you will need it later.
-
-    You may now want to do some internal verifications, e.g. whether the
-    Check type is supported by the liblzma build you are using.
-
-    Decode the Backward Size field with lzma_vli_reverse_decode(). The
-    field is at most LZMA_VLI_BYTES_MAX bytes long. Check that
-    Backward Size is not zero. Store the offset of the first byte of
-    the Backward Size; you will need it later.
-
-    Now you know the Total Size of the last Block of the Stream. It's the
-    value of Backward Size plus the size of the Backward Size field. Note
-    that you cannot use lzma_vli_size() to calculate the size since there
-    might be padding; you need to use the real observed size of the
-    Backward Size field.
-
-    At this point, the operation continues differently for Single-Block
-    and Multi-Block Streams.
-
-
-y.x.1. Single-Block Stream
-
-    There might be an Uncompressed Size field present in the Stream
-    Footer. You cannot know it for sure unless you have already parsed
-    the Block Header earlier. For security reasons, you probably want to
-    try to decode the Uncompressed Size field, but you must not indicate
-    any error if decoding fails. Later you can give the decoded
-    Uncompressed Size to Block decoder if Uncompressed Size isn't
-    otherwise known; this prevents it from producing too much output in
-    case of a (possibly intentionally) corrupt file.
-
-    Calculate the start offset of the Stream:
-
-        backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
-
-    backward_offset is the offset of the first byte of the Backward Size
-    field. Remember to check for integer overflows, which can occur with
-    invalid input files.
-
-    Seek to the beginning of the Stream. Decode the Stream Header using
-    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
-    match the values found in the Stream tail. You can use the
-    lzma_stream_flags_is_equal() macro for this.
-
-    Decode the Block Header. Verify that it isn't a Metadata Block, since
-    Single-Block Streams cannot have Metadata. If Uncompressed Size is
-    present in the Block Header, the value you tried to decode from the
-    Stream Footer must be ignored, since Uncompressed Size wasn't actually
-    present there. If Block Header doesn't have Uncompressed Size, and
-    decoding the Uncompressed Size field from the Stream Footer failed,
-    the file is corrupt.
-
-    If you were only looking for the Uncompressed Size of the Stream,
-    you now have that information, and you can stop processing the Stream.
-
-    To decode the Block, the same instructions apply as described in
-    FIXME. However, because you have some extra known information decoded
-    from the Stream Footer, you should give this information to the Block
-    decoder so that it can verify it while decoding:
-      - If Uncompressed Size is not present in the Block Header, set
-        lzma_options_block.uncompressed_size to the value you decoded
-        from the Stream Footer.
-      - Always set lzma_options_block.total_size to backward_size +
-        size_of_backward_size (you calculated this sum earlier already).
-
-
-y.x.2. Multi-Block Stream
-
-    Calculate the start offset of the Footer Metadata Block:
-
-        backward_offset - backward_size
-
-    backward_offset is the offset of the first byte of the Backward Size
-    field. Remember to check for integer overflows, which can occur with
-    broken input files.
-
-    Decode the Block Header. Verify that it is a Metadata Block. Set
-    lzma_options_block.total_size to backward_size + size_of_backward_size
-    (you calculated this sum earlier already). Then decode the Footer
-    Metadata Block.
-
-    Store the decoded Footer Metadata to lzma_info structure using
-    lzma_info_set_metadata(). Set also the offset of the Backward Size
-    field using lzma_info_size_set(). Then you can get the start offset
-    of the Stream using lzma_info_size_get(). Note that any of these steps
-    may fail, so don't omit error checking.
-
-    Seek to the beginning of the Stream. Decode the Stream Header using
-    lzma_stream_header_decoder(). Verify that the decoded Stream Flags
-    match the values found in the Stream tail. You can use the
-    lzma_stream_flags_is_equal() macro for this.
-
-    If you were only looking for the Uncompressed Size of the Stream,
-    it's possible that you already have it now. If Uncompressed Size (or
-    whatever information you were looking for) isn't available yet,
-    continue by decoding also the Header Metadata Block. (If some
-    information is missing, the Header Metadata Block has to be present.)
-
-    Decoding the Data Blocks goes the same way as described in FIXME.
-
-
-y.x.3. Variations
-
-    If you know the offset of the beginning of the Stream, you may want
-    to parse the Stream Header before parsing the Stream tail.
-
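
Editorial sketch (not part of the removed file): the backward-parsing
arithmetic described in the removed text can be illustrated in plain C
without any liblzma calls. Everything named here is an assumption made for
illustration: SKETCH_STREAM_HEADER_SIZE stands in for
LZMA_STREAM_HEADER_SIZE using the 11-byte value quoted in the text, and
skip_footer_padding() and stream_start_offset() are hypothetical helpers,
not functions the library provided. The sketch only shows skipping Footer
Padding and computing the Single-Block start offset without letting the
unsigned subtraction wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: fixed Stream Header size quoted in the text (11 bytes).
 * Real code would use the constant from the library headers instead. */
#define SKETCH_STREAM_HEADER_SIZE 11u

/* Hypothetical helper: return how many bytes of buf remain once the
 * trailing Footer Padding (zero or more NUL bytes) has been dropped. */
static size_t skip_footer_padding(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    while (size > 0 && buf[size - 1] == 0x00)
        --size;
    return size;
}

/* Hypothetical helper: compute
 *     backward_offset - backward_size - SKETCH_STREAM_HEADER_SIZE
 * and store it in *start. Returns false instead of wrapping around
 * when the input values cannot describe a valid Stream. */
static bool stream_start_offset(uint64_t backward_offset,
        uint64_t backward_size, uint64_t *start)
{
    assert(start != NULL);

    /* The text requires Backward Size to be non-zero. */
    if (backward_size == 0)
        return false;

    /* Check before subtracting; unsigned subtraction would wrap. */
    if (backward_size > backward_offset)
        return false;
    const uint64_t block_start = backward_offset - backward_size;

    if (block_start < SKETCH_STREAM_HEADER_SIZE)
        return false;

    *start = block_start - SKETCH_STREAM_HEADER_SIZE;
    return true;
}
```

For the Multi-Block case described in the text, the same check applies
without the final header-size subtraction, since the formula there is just
backward_offset - backward_size.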