Advanced features of liblzma
----------------------------
0. Introduction
Most developers need only the basic features of liblzma. These
features allow single-threaded encoding and decoding of .lzma files
in streamed mode.
In some cases developers want more. The .lzma file format is
designed to allow multi-threaded encoding and decoding and limited
random-access reading. These features are possible in non-streamed
mode and limitedly also in streamed mode.
To take advange of these features, the application needs a custom
.lzma file format handler. liblzma provides a set of tools to ease
this task, but it's still quite a bit of work to get a good custom
.lzma handler done.
1. Where to begin
Start by reading the .lzma file format specification. Understanding
the basics of the .lzma file structure is required to implement a
custom .lzma file handler and to understand the rest of this document.
2. The basic components
2.1. Stream Header and tail
Stream Header begins the .lzma Stream and Stream tail ends it. Stream
Header is defined in the file format specification, but Stream tail
isn't (thus I write "tail" with a lower-case letter). Stream tail is
simply the Stream Flags and the Footer Magic Bytes fields together.
It was done this way in liblzma, because the Block coders take care
of the rest of the stuff in the Stream Footer.
For now, the size of Stream Header is fixed to 11 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_HEADER_SIZE, which you
should use instead of a hardcoded number. Similarly, Stream tail
is fixed to 3 bytes, and there is a constant LZMA_STREAM_TAIL_SIZE.
It is possible, that a future version of the .lzma format will have
variable-sized Stream Header and tail. As of writing, this seems so
unlikely though, that it was considered simplest to just use a
constant instead of providing a functions to get and store the sizes
of the Stream Header and tail.
2.x. Stream tail
For now, the size of Stream tail is fixed to 3 bytes. The header
<lzma/stream_flags.h> defines LZMA_STREAM_TAIL_SIZE, which you
should use instead of a hardcoded number.
3. Keeping track of size information
The lzma_info_* functions found from <lzma/info.h> should ease the
task of keeping track of sizes of the Blocks and also the Stream
as a whole. Using these functions is strongly recommended, because
there are surprisingly many situations where an error can occur,
and these functions check for possible errors every time some new
information becomes available.
If you find lzma_info_* functions lacking something that you would
find useful, please contact the author.
3.1. Start offset of the Stream
If you are storing the .lzma Stream inside anothe file format, or
for some other reason are placing the .lzma Stream to somewhere
else than to the beginning of the file, you should tell the starting
offset of the Stream using lzma_info_start_offset_set().
The start offset of the Stream is used for two distinct purporses.
First, knowing the start offset of the Stream allows
lzma_info_alignment_get() to correctly calculate the alignment of
every Block. This information is given to the Block encoder, which
will calculate the size of Header Padding so that Compressed Data
is alignment at an optimal offset.
Another use for start offset of the Stream is in random-access
reading. If you set the start offset of the Stream, lzma_info_locate()
will be able to calculate the offset relative to the beginning of the
file containing the Stream (instead of offset relative to the
beginning of the Stream).
3.2. Size of Stream Header
While the size of Stream Header is constant (11 bytes) in the current
version of the .lzma file format, this may change in future.
3.3. Size of Header Metadata Block
This information is needed when doing random-access reading, and
to verify the value of this field stored in Footer Metadata Block.
3.4. Total Size of the Data Blocks
3.5. Uncompressed Size of Data Blocks
3.6. Index
x. Alignment
There are a few slightly different types of alignment issues when
working with .lzma files.
The .lzma format doesn't strictly require any kind of alignment.
However, if the encoder carefully optimizes the alignment in all
situations, it can improve compression ratio, speed of the encoder
and decoder, and slightly help if the files get damaged and need
recovery.
Alignment has the most significant effect compression ratio FIXME
x.1. Compression ratio
Some filters take advantage of the alignment of the input data.
To get the best compression ratio, make sure that you feed these
filters correctly aligned data.
Some filters (e.g. LZMA) don't necessarily mind too much if the
input doesn't match the preferred alignment. With these filters
the penalty in compression ratio depends on the specific type of
data being compressed.
Other filters (e.g. PowerPC executable filter) won't work at all
with data that is improperly aligned. While the data can still
be de-filtered back to its original form, the benefit of the
filtering (better compression ratio) is completely lost, because
these filters expect certain patterns at properly aligned offsets.
The compression ratio may even worse with incorrectly aligned input
than without the filter.
x.1.1. Inter-filter alignment
When there are multiple filters chained, checking the alignment can
be useful not only with the input of the first filter and output of
the last filter, but also between the filters.
Inter-filter alignment important especially with the Subblock filter.
x.1.2. Further compression with external tools
This is relatively rare situation in practice, but still worth
understanding.
Let's say that there are several SPARC executables, which are each
filtered to separate .lzma files using only the SPARC filter. If
Uncompressed Size is written to the Block Header, the size of Block
Header may vary between the .lzma files. If no Padding is used in
the Block Header to correct the alignment, the starting offset of
the Compressed Data field will be differently aligned in different
.lzma files.
All these .lzma files are archived into a single .tar archive. Due
to nature of the .tar format, every file is aligned inside the
archive to an offset that is a multiple of 512 bytes.
The .tar archive is compressed into a new .lzma file using the LZMA
filter with options, that prefer input alignment of four bytes. Now
if the independent .lzma files don't have the same alignment of
the Compressed Data fields, the LZMA filter will be unable to take
advantage of the input alignment between the files in the .tar
archive, which reduces compression ratio.
Thus, even if you have only single Block per file, it can be good for
compression ratio to align the Compressed Data to optimal offset.
x.2. Speed
Most modern computers are faster when multi-byte data is located
at aligned offsets in RAM. Proper alignment of the Compressed Data
fields can slightly increase the speed of some filters.
x.3. Recovery
Aligning every Block Header to start at an offset with big enough
alignment may ease or at least speed up recovery of broken files.
y. Typical usage cases
y.x. Parsing the Stream backwards
You may need to parse the Stream backwards if you need to get
information such as the sizes of the Stream, Index, or Extra.
The basic procedure to do this follows.
Locate the end of the Stream. If the Stream is stored as is in a
standalone .lzma file, simply seek to the end of the file and start
reading backwards using appropriate buffer size. The file format
specification allows arbitrary amount of Footer Padding (zero or more
NUL bytes), which you skip before trying to decode the Stream tail.
Once you have located the end of the Stream (a non-NULL byte), make
sure you have at least the last LZMA_STREAM_TAIL_SIZE bytes of the
Stream in a buffer. If there isn't enough bytes left from the file,
the file is too small to contain a valid Stream. Decode the Stream
tail using lzma_stream_tail_decoder(). Store the offset of the first
byte of the Stream tail; you will need it later.
You may now want to do some internal verifications e.g. if the Check
type is supported by the liblzma build you are using.
Decode the Backward Size field with lzma_vli_reverse_decode(). The
field is at maximum of LZMA_VLI_BYTES_MAX bytes long. Check that
Backward Size is not zero. Store the offset of the first byte of
the Backward Size; you will need it later.
Now you know the Total Size of the last Block of the Stream. It's the
value of Backward Size plus the size of the Backward Size field. Note
that you cannot use lzma_vli_size() to calculate the size since there
might be padding; you need to use the real observed size of the
Backward Size field.
At this point, the operation continues differently for Single-Block
and Multi-Block Streams.
y.x.1. Single-Block Stream
There might be Uncompressed Size field present in the Stream Footer.
You cannot know it for sure unless you have already parsed the Block
Header earlier. For security reasons, you probably want to try to
decode the Uncompressed Size field, but you must not indicate any
error if decoding fails. Later you can give the decoded Uncompressed
Size to Block decoder if Uncopmressed Size isn't otherwise known;
this prevents it from producing too much output in case of (possibly
intentionally) corrupt file.
Calculate the start offset of the Stream:
backward_offset - backward_size - LZMA_STREAM_HEADER_SIZE
backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
invalid input files.
Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found from Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.
Decode the Block Header. Verify that it isn't a Metadata Block, since
Single-Block Streams cannot have Metadata. If Uncompressed Size is
present in the Block Header, the value you tried to decode from the
Stream Footer must be ignored, since Uncompressed Size wasn't actually
present there. If Block Header doesn't have Uncompressed Size, and
decoding the Uncompressed Size field from the Stream Footer failed,
the file is corrupt.
If you were only looking for the Uncompressed Size of the Stream,
you now got that information, and you can stop processing the Stream.
To decode the Block, the same instructions apply as described in
FIXME. However, because you have some extra known information decoded
from the Stream Footer, you should give this information to the Block
decoder so that it can verify it while decoding:
- If Uncompressed Size is not present in the Block Header, set
lzma_options_block.uncompressed_size to the value you decoded
from the Stream Footer.
- Always set lzma_options_block.total_size to backward_size +
size_of_backward_size (you calculated this sum earlier already).
y.x.2. Multi-Block Stream
Calculate the start offset of the Footer Metadata Block:
backward_offset - backward_size
backward_offset is the offset of the first byte of the Backward Size
field. Remember to check for integer overflows, which can occur with
broken input files.
Decode the Block Header. Verify that it is a Metadata Block. Set
lzma_options_block.total_size to backward_size + size_of_backward_size
(you calculated this sum earlier already). Then decode the Footer
Metadata Block.
Store the decoded Footer Metadata to lzma_info structure using
lzma_info_set_metadata(). Set also the offset of the Backward Size
field using lzma_info_size_set(). Then you can get the start offset
of the Stream using lzma_info_size_get(). Note that any of these steps
may fail so don't omit error checking.
Seek to the beginning of the Stream. Decode the Stream Header using
lzma_stream_header_decoder(). Verify that the decoded Stream Flags
match the values found from Stream tail. You can use the
lzma_stream_flags_is_equal() macro for this.
If you were only looking for the Uncompressed Size of the Stream,
it's possible that you already have it now. If Uncompressed Size (or
whatever information you were looking for) isn't available yet,
continue by decoding also the Header Metadata Block. (If some
information is missing, the Header Metadata Block has to be present.)
Decoding the Data Blocks goes the same way as described in FIXME.
y.x.3. Variations
If you know the offset of the beginning of the Stream, you may want
to parse the Stream Header before parsing the Stream tail.