| LZ4 Block Format Description |
| ============================ |
| Last revised: 2015-05-07. |
| Author : Yann Collet |
| |
| |
| This specification is intended for developers |
| willing to produce LZ4-compatible compressed data blocks |
| using any programming language. |
| |
| LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. |
| There is no entropy encoder back-end nor framing layer. |
| The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]). |
| This design is assumed to favor simplicity and speed. |
| It helps later on for optimizations, compactness, and features. |
| |
| This document describes only the block format, |
| not how the compressor nor decompressor actually work. |
| The correctness of the decompressor should not depend |
| on implementation details of the compressor, and vice versa. |
| |
| [LZ4 Frame format]: lz4_Frame_format.md |
| |
| |
| |
| Compressed block format |
| ----------------------- |
| An LZ4 compressed block is composed of sequences. |
| A sequence is a suite of literals (not-compressed bytes), |
| followed by a match copy. |
| |
| Each sequence starts with a token. |
| The token is a one byte value, separated into two 4-bits fields. |
| Therefore each field ranges from 0 to 15. |
| |
| |
| The first field uses the 4 high-bits of the token. |
| It provides the length of literals to follow. |
| |
| If the field value is 0, then there is no literal. |
| If it is 15, then we need to add some more bytes to indicate the full length. |
| Each additional byte then represent a value from 0 to 255, |
| which is added to the previous value to produce a total length. |
| When the byte value is 255, another byte is output. |
| There can be any number of bytes following the token. There is no "size limit". |
| (Side note : this is why a not-compressible input block is expanded by 0.4%). |
| |
| Example 1 : A length of 48 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 33 : (=48-15) remaining length to reach 48 |
| |
| Example 2 : A length of 280 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 255 : following byte is maxed, since 280-15 >= 255 |
| - 10 : (=280 - 15 - 255) ) remaining length to reach 280 |
| |
| Example 3 : A length of 15 will be represented as : |
| |
| - 15 : value for the 4-bits High field |
| - 0 : (=15-15) yes, the zero must be output |
| |
| Following the token and optional length bytes, are the literals themselves. |
| They are exactly as numerous as previously decoded (length of literals). |
| It's possible that there are zero literal. |
| |
| |
| Following the literals is the match copy operation. |
| |
| It starts by the offset. |
| This is a 2 bytes value, in little endian format |
| (the 1st byte is the "low" byte, the 2nd one is the "high" byte). |
| |
| The offset represents the position of the match to be copied from. |
| 1 means "current position - 1 byte". |
| The maximum offset value is 65535, 65536 cannot be coded. |
| Note that 0 is an invalid value, not used. |
| |
| Then we need to extract the match length. |
| For this, we use the second token field, the low 4-bits. |
| Value, obviously, ranges from 0 to 15. |
| However here, 0 means that the copy operation will be minimal. |
| The minimum length of a match, called minmatch, is 4. |
| As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. |
| Similar to literal length, on reaching the highest possible value (15), |
| we output additional bytes, one at a time, with values ranging from 0 to 255. |
| They are added to total to provide the final match length. |
| A 255 value means there is another byte to read and add. |
| There is no limit to the number of optional bytes that can be output this way. |
| (This points towards a maximum achievable compression ratio of about 250). |
| |
| Decoding the matchlength reaches the end of current sequence. |
| Next byte will be the start of another sequence. |
| But before moving to next sequence, |
| it's time to use the decoded match position and length. |
| The decoder copies matchlength bytes from match position to current position. |
| |
| In some cases, matchlength is larger than offset. |
| Therefore, match pos + match length > current pos, |
| which means that later bytes to copy are not yet decoded. |
| This is called an "overlap match", and must be handled with special care. |
| The most common case is an offset of 1, |
| meaning the last byte is repeated matchlength times. |
| |
| |
| Parsing restrictions |
| ----------------------- |
| There are specific parsing rules to respect in order to remain compatible |
| with assumptions made by the decoder : |
| |
| 1. The last 5 bytes are always literals. In other words, the last five bytes |
| from the uncompressed input (or all bytes, if the input has less than five |
| bytes) must be encoded as literals on behalf of the last sequence. |
| The last sequence is incomplete, and stops right after the literals. |
| 2. The last match must start at least 12 bytes before end of block. |
| Consequently, a block with less than 13 bytes cannot be compressed. |
| |
| These rules are in place to ensure that the decoder |
| will never read beyond the input buffer, nor write beyond the output buffer. |
| |
| 1. To copy literals from a non-last sequence, an 8-byte copy instruction |
| can always be safely issued (without reading past the input), because |
| the literals are followed by a 2-byte offset, and the last sequence |
| is at least 1+5 bytes long. |
| 2. TODO: explain the benefits of the second restriction. |
| |
| Empty inputs are either unrepresentable or can be represented with a null byte, |
| which can be interpreted as a token without literals and without a match. |
| |
| |
| Additional notes |
| ----------------------- |
| There is no assumption nor limits to the way the compressor |
| searches and selects matches within the source data block. |
| It could be a fast scan, a multi-probe, a full search using BST, |
| standard hash chains or MMC, well whatever. |
| |
| Advanced parsing strategies can also be implemented, such as lazy match, |
| or full optimal parsing. |
| |
| All these trade-off offer distinctive speed/memory/compression advantages. |
| Whatever the method used by the compressor, its result will be decodable |
| by any LZ4 decoder if it follows the format specification described above. |