LZ4 Format Description | |
Last revised: 2012-02-12 | |
Author : Y. Collet | |
This is not a formal specification, but intents to provide enough information | |
to anyone willing to produce LZ4-compatible compressed streams. | |
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. | |
There is no entropy encoder backend nor framing layer -- the latter is | |
assumed to be handled by other parts of the system. | |
This document only describes the format, not how the LZ4 compressor nor | |
decompressor actually works. The correctness of the decompressor should not | |
depend on implementation details of the compressor, and vice versa. | |
The most important design principle behind LZ4 is simplicity. | |
It is meant to create an easy to read and maintain source code. | |
It also helps later on for optimisations, compactness, and speed. | |
-- Compressed stream format -- | |
An LZ4 compressed stream is composed of sequences. | |
Schematically, a sequence is a suite of literals, followed by a match copy. | |
Each sequence starts with a token. | |
The token is a one byte value, separated into two 4-bits fields. | |
Therefore each field ranges from 0 to 15. | |
The first field uses the 4 high-bits of the token. | |
It provides the length of literals to follow. | |
A literal is a not-compressed byte. | |
If it is 0, then there is no literal. | |
If it is 15, then we need to add some more bytes to indicate the full length. | |
Each additionnal byte then represent a value from 0 to 255, | |
which is added to the previous value to produce a total length. | |
When the byte value is 255, another byte is output. | |
There can be any number of bytes following the token. There is no "size limit". | |
(Sidenote this is why a not-compressible input stream is expanded by 0.4%). | |
Example 1 : A length of 48 will be represented as : | |
- 15 : value for the 4-bits High field | |
- 33 : (=48-15) remaining length to reach 48 | |
Example 2 : A length of 280 will be represented as : | |
- 15 : value for the 4-bits High field | |
- 255 : following byte is maxed, since 280-15 >= 255 | |
- 10 : (=280 - 15 - 255) ) remaining length to reach 280 | |
Example 3 : A length of 15 will be represented as : | |
- 15 : value for the 4-bits High field | |
- 0 : (=15-15) yes, the zero must be output | |
Following the token and optional length bytes, are the literals themselves. | |
They are exactly as numerous as previously decoded (length of literals). | |
It's possible that there are zero literal. | |
Following the literals is the match copy operation. | |
It starts by the offset. | |
This is a 2 bytes value, in little endian format : | |
the lower byte is the first one in the stream. | |
The offset represents the position of the match to be copied from. | |
1 means "current position - 1 byte". | |
The maximum offset value is 65535, 65536 cannot be coded. | |
Note that 0 is an invalid value, not used. | |
Then we need to extract the match length. | |
For this, we use the second token field, the low 4-bits. | |
Value, obviously, ranges from 0 to 15. | |
However here, 0 means that the copy operation will be minimal. | |
The minimum length of a match, called minmatch, is 4. | |
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. | |
Similar to literal length, on reaching the highest possible value (15), | |
we output additional bytes, one at a time, with values ranging from 0 to 255. | |
They are added to total to provide the final match length. | |
A 255 value means there is another byte to read and add. | |
There is no limit to the number of optional bytes that can be output this way. | |
(This points towards a maximum achievable compression ratio of ~250). | |
With the offset and the matchlength, | |
the decoder can now proceed to copy the data from the already decoded buffer. | |
On decoding the matchlength, we reach the end of the compressed sequence, | |
and therefore start another one. | |
-- Parsing restrictions -- | |
There are specific parsing rules to respect in order to remain compatible | |
with assumptions made by the decoder : | |
1) The last 5 bytes are always literals | |
2) The last match must start at least 12 bytes before end of stream | |
Consequently, a file with less than 13 bytes cannot be compressed. | |
These rules are in place to ensure that the decoder | |
will never read beyond the input buffer, nor write beyond the output buffer. | |
Note that the last sequence is also incomplete, | |
and stops right after literals. | |
-- Additional notes -- | |
There is no assumption nor limits to the way the compressor | |
searches and selects matches within the source stream. | |
It could be a fast scan, a multi-probe, a full search using BST, | |
standard hash chains or MMC, well whatever. | |
Advanced parsing strategies can also be implemented, such as lazy match, | |
or full optimal parsing. | |
All these trade-off offer distinctive speed/memory/compression advantages. | |
Whatever the method used by the compressor, its result will be decodable | |
by any LZ4 decoder if it follows the format described above. | |