Diffstat (limited to 'doc/liblzma-hacking.txt')
-rw-r--r-- | doc/liblzma-hacking.txt | 112 |
1 files changed, 112 insertions, 0 deletions
diff --git a/doc/liblzma-hacking.txt b/doc/liblzma-hacking.txt
new file mode 100644
index 00000000..64390bcb
--- /dev/null
+++ b/doc/liblzma-hacking.txt
@@ -0,0 +1,112 @@
+
+Hacking liblzma
+---------------
+
+0. Preface
+
+    This document gives some overall information about the internals of
+    liblzma, which should make it easier to start reading and modifying
+    the code.
+
+
+1. Programming language
+
+    liblzma was written in C99. If you use GCC, this means that you need
+    at least GCC 3.x.x. GCC 2 isn't and won't be supported.
+
+    Some GCC-specific extensions are used *conditionally*. They aren't
+    required to build a full-featured library. Don't make the code rely
+    on any non-standard compiler extensions or even C99 features that
+    aren't portable between almost-C99 compatible compilers (for example
+    non-static inlines).
+
+    The public API headers are in C89. This is to avoid frustrating those
+    who maintain programs that are strictly in C89 or C++.
+
+    An assumption about sizeof(size_t) is made. If this assumption is
+    wrong, some porting is probably needed:
+
+        sizeof(uint32_t) <= sizeof(size_t) <= sizeof(uint64_t)
+
+
+2. Internal vs. external API
+
+
+
+        Input                                        Output
+          v                 Application                ^
+          |             liblzma public API             |
+          |                Stream coder                |
+          |                Block coder                 |
+          |                Filter coder                |
+          |                    ...                     |
+          v                Filter coder                ^
+
+
+        Application
+          `-- liblzma public API
+                `-- Stream coder
+                      |-- Stream info handler
+                      |-- Stream Header coder
+                      |-- Block Header coder
+                      |     `-- Filter Flags coder
+                      |-- Metadata coder
+                      |     `-- Block coder
+                      |           `-- Filter 0
+                      |                 `-- Filter 1
+                      |                       ...
+                      |-- Data Block coder
+                      |     `-- Filter 0
+                      |           `-- Filter 1
+                      |                 ...
+                      `-- Stream tail coder
+
+
+
+x. Designing new filters
+
+    All filters must be designed so that the decoder cannot consume an
+    arbitrary amount of input without producing any decoded output. Failing
+    to follow this rule makes liblzma vulnerable to DoS attacks if
+    untrusted files are decoded (usually they are untrusted).
+
+    An example should clarify the reason behind this requirement: There
+    are two filters in the chain. The decoder of the first filter produces
+    a huge amount of output (many gigabytes or more) from a few bytes of
+    input, which gets passed to the decoder of the second filter. If the
+    data passed to the second filter is interpreted as something that
+    produces no output (e.g. padding), the filter chain as a whole
+    produces no output and consumes no input for a long period of time.
+
+    The above problem was present in the first versions of the Subblock
+    filter. A tiny .lzma file could have taken several years to decode
+    while it wouldn't produce any output at all. The problem was fixed
+    by limiting the number of consecutive Padding bytes, and requiring
+    that some decoded output must be produced between Set Subfilter and
+    Unset Subfilter.
+
+
+x. Implementing new filters
+
+    If the filter supports embedding End of Payload Marker, make sure that
+    when your filter detects End of Payload Marker,
+      - the usage of End of Payload Marker is actually allowed (i.e. End
+        of Input isn't used); and
+      - it also checks that there is no more input coming from the next
+        filter in the chain.
+
+    The second requirement is slightly tricky. It's possible that the next
+    filter hasn't returned LZMA_STREAM_END yet. It may even need a few
+    more bytes of input before it will do so. You need to give it as much
+    input as it needs, and verify that it doesn't produce any output.
+
+    Don't call the next filter in the chain after it has returned
+    LZMA_STREAM_END (except in the encoder if action == LZMA_SYNC_FLUSH).
+    Doing so results in undefined behavior.
+
+    Be pedantic. If the input data isn't exactly valid, reject it.
+
+    At the moment, liblzma isn't modular. You will need to edit several
+    files in src/liblzma/common to include support for a new filter. grep
+    for LZMA_FILTER_LZMA to locate the files needing changes.
+
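
Note on the second End of Payload Marker requirement in "Implementing new filters": the check that the next filter has nothing more to deliver could look roughly like the sketch below. It is only an illustration under stated assumptions and is not taken from the liblzma sources. Every toy_* name is hypothetical, the return codes are stand-ins for liblzma's own, and the "next filter" is a made-up decoder that copies bytes through until it sees a 0x00 terminator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for liblzma's return codes. */
typedef enum { TOY_OK, TOY_STREAM_END, TOY_DATA_ERROR } toy_ret;

/* Hypothetical interface of the next filter in the chain. */
typedef struct toy_next toy_next;
struct toy_next {
	toy_ret (*code)(toy_next *next,
			const uint8_t *in, size_t *in_pos, size_t in_size,
			uint8_t *out, size_t *out_pos, size_t out_size);
	bool finished; /* set once code() has returned TOY_STREAM_END */
};

/* Made-up next filter: copies input to output until it sees a 0x00
 * byte, consumes that byte, and then reports the end of its stream. */
static toy_ret
toy_copy_code(toy_next *next,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size)
{
	(void)next;

	while (*in_pos < in_size) {
		if (in[*in_pos] == 0x00) {
			++*in_pos;
			return TOY_STREAM_END;
		}

		if (*out_pos == out_size)
			return TOY_OK; /* output buffer is full */

		out[(*out_pos)++] = in[(*in_pos)++];
	}

	return TOY_OK; /* needs more input */
}

/* Called after our own filter has detected End of Payload Marker.
 * Feeds the remaining input to the next filter and insists that it
 * produces no output at all.
 *
 * Returns TOY_STREAM_END when the next filter has finished cleanly,
 * TOY_OK when it needs more input (call again with more data), and
 * TOY_DATA_ERROR when it tried to produce output. */
static toy_ret
toy_finish_after_eopm(toy_next *next,
		const uint8_t *in, size_t *in_pos, size_t in_size)
{
	/* Never call the next filter again once it has finished. */
	assert(!next->finished);

	uint8_t scratch[1];
	size_t out_pos = 0;

	const toy_ret ret = next->code(next, in, in_pos, in_size,
			scratch, &out_pos, sizeof(scratch));

	/* Be pedantic: a single byte of output makes the input invalid. */
	if (out_pos != 0)
		return TOY_DATA_ERROR;

	if (ret == TOY_STREAM_END)
		next->finished = true;

	return ret;
}
```

A one-byte scratch buffer is enough here because any output at all is an error. The finished flag exists only to catch a second call after the end-of-stream result, which the text above describes as undefined behavior.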