doc/lz4_Block_format.md - lz4 - Rivoreo Source Code Repositories

 LZ4 Block Format Description
 ============================
 Last revised: 2018-04-25.
 Author : Yann Collet


 This specification is intended for developers
 willing to produce LZ4-compatible compressed data blocks
 using any programming language.

 LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
 There is no entropy encoder back-end nor framing layer.
 The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
 This design is assumed to favor simplicity and speed.
 It helps later on for optimizations, compactness, and features.

 This document describes only the block format,
 not how the compressor nor decompressor actually work.
 The correctness of the decompressor should not depend
 on implementation details of the compressor, and vice versa.

 [LZ4 Frame format]: lz4_Frame_format.md


 Compressed block format
 -----------------------
 An LZ4 compressed block is composed of sequences.
 A sequence is a suite of literals (not-compressed bytes),
 followed by a match copy.

 Each sequence starts with a `token`.
 The `token` is a one byte value, separated into two 4-bits fields.
 Therefore each field ranges from 0 to 15.


 The first field uses the 4 high-bits of the token.
 It provides the length of literals to follow.

 If the field value is 0, then there is no literal.
 If it is 15, then we need to add some more bytes to indicate the full length.
 Each additional byte then represent a value from 0 to 255,
 which is added to the previous value to produce a total length.
 When the byte value is 255, another byte is output.
 There can be any number of bytes following `token`. There is no "size limit".
 (Side note : this is why a not-compressible input block is expanded by 0.4%).

 Example 1 : A literal length of 48 will be represented as :

   - 15 : value for the 4-bits High field
   - 33 : (=48-15) remaining length to reach 48

 Example 2 : A literal length of 280 will be represented as :

   - 15  : value for the 4-bits High field
   - 255 : following byte is maxed, since 280-15 >= 255
   - 10  : (=280 - 15 - 255) ) remaining length to reach 280

 Example 3 : A literal length of 15 will be represented as :

   - 15 : value for the 4-bits High field
   - 0  : (=15-15) yes, the zero must be output

 Following `token` and optional length bytes, are the literals themselves.
 They are exactly as numerous as previously decoded (length of literals).
 It's possible that there are zero literal.


 Following the literals is the match copy operation.

 It starts by the `offset`.
 This is a 2 bytes value, in little endian format
 (the 1st byte is the "low" byte, the 2nd one is the "high" byte).

 The `offset` represents the position of the match to be copied from.
 1 means "current position - 1 byte".
 The maximum `offset` value is 65535, 65536 cannot be coded.
 Note that 0 is an invalid value, not used.

 Then we need to extract the `matchlength`.
 For this, we use the second token field, the low 4-bits.
 Value, obviously, ranges from 0 to 15.
 However here, 0 means that the copy operation will be minimal.
 The minimum length of a match, called `minmatch`, is 4.
 As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
 Similar to literal length, on reaching the highest possible value (15),
 we output additional bytes, one at a time, with values ranging from 0 to 255.
 They are added to total to provide the final match length.
 A 255 value means there is another byte to read and add.
 There is no limit to the number of optional bytes that can be output this way.
 (This points towards a maximum achievable compression ratio of about 250).

 Decoding the `matchlength` reaches the end of current sequence.
 Next byte will be the start of another sequence.
 But before moving to next sequence,
 it's time to use the decoded match position and length.
 The decoder copies `matchlength` bytes from match position to current position.

 In some cases, `matchlength` is larger than `offset`.
 Therefore, `match_pos + matchlength > current_pos`,
 which means that later bytes to copy are not yet decoded.
 This is called an "overlap match", and must be handled with special care.
 A common case is an offset of 1,
 meaning the last byte is repeated `matchlength` times.


 Parsing restrictions
 -----------------------
 There are specific parsing rules to respect in order to remain compatible
 with assumptions made by the decoder :

 1. The last 5 bytes are always literals.  In other words, the last five bytes
    from the uncompressed input (or all bytes, if the input has less than five
    bytes) must be encoded as literals on behalf of the last sequence.
    The last sequence is incomplete, and stops right after the literals.
 2. The last match must start at least 12 bytes before end of block.
    The last match is part of the penultimate sequence,
    since the last sequence stops right after literals.
    Note that, as a consequence, blocks < 13 bytes cannot be compressed.

 These rules are in place to ensure that the decoder
 can speculatively execute copy instructions
 without ever reading nor writing beyond provided I/O buffers.

 1. To copy literals from a non-last sequence, an 8-byte copy instruction
    can always be safely issued (without reading past the input),
    because literals are followed by a 2-byte offset,
    and last sequence is at least 1+5 bytes long.
 2. Similarly, a match operation can speculatively copy up to 12 bytes
    while remaining within output buffer boundaries.

 Empty inputs can be represented with a zero byte,
 interpreted as a token without literals and without a match.


 Additional notes
 -----------------------
 There is no assumption nor limits to the way the compressor
 searches and selects matches within the source data block.
 It could be a fast scan, a multi-probe, a full search using BST,
 standard hash chains or MMC, well whatever.

 Advanced parsing strategies can also be implemented, such as lazy match,
 or full optimal parsing.

 All these trade-off offer distinctive speed/memory/compression advantages.
 Whatever the method used by the compressor, its result will be decodable
 by any LZ4 decoder if it follows the format specification described above.
	LZ4 Block Format Description
	============================
	Last revised: 2018-04-25.
	Author : Yann Collet


	This specification is intended for developers
	willing to produce LZ4-compatible compressed data blocks
	using any programming language.

	LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
	There is no entropy encoder back-end nor framing layer.
	The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
	This design is assumed to favor simplicity and speed.
	It helps later on for optimizations, compactness, and features.

	This document describes only the block format,
	not how the compressor nor decompressor actually work.
	The correctness of the decompressor should not depend
	on implementation details of the compressor, and vice versa.

	[LZ4 Frame format]: lz4_Frame_format.md



	Compressed block format
	-----------------------
	An LZ4 compressed block is composed of sequences.
	A sequence is a suite of literals (not-compressed bytes),
	followed by a match copy.

	Each sequence starts with a `token`.
	The `token` is a one byte value, separated into two 4-bits fields.
	Therefore each field ranges from 0 to 15.


	The first field uses the 4 high-bits of the token.
	It provides the length of literals to follow.

	If the field value is 0, then there is no literal.
	If it is 15, then we need to add some more bytes to indicate the full length.
	Each additional byte then represent a value from 0 to 255,
	which is added to the previous value to produce a total length.
	When the byte value is 255, another byte is output.
	There can be any number of bytes following `token`. There is no "size limit".
	(Side note : this is why a not-compressible input block is expanded by 0.4%).

	Example 1 : A literal length of 48 will be represented as :

	- 15 : value for the 4-bits High field
	- 33 : (=48-15) remaining length to reach 48

	Example 2 : A literal length of 280 will be represented as :

	- 15 : value for the 4-bits High field
	- 255 : following byte is maxed, since 280-15 >= 255
	- 10 : (=280 - 15 - 255) ) remaining length to reach 280

	Example 3 : A literal length of 15 will be represented as :

	- 15 : value for the 4-bits High field
	- 0 : (=15-15) yes, the zero must be output

	Following `token` and optional length bytes, are the literals themselves.
	They are exactly as numerous as previously decoded (length of literals).
	It's possible that there are zero literal.


	Following the literals is the match copy operation.

	It starts by the `offset`.
	This is a 2 bytes value, in little endian format
	(the 1st byte is the "low" byte, the 2nd one is the "high" byte).

	The `offset` represents the position of the match to be copied from.
	1 means "current position - 1 byte".
	The maximum `offset` value is 65535, 65536 cannot be coded.
	Note that 0 is an invalid value, not used.

	Then we need to extract the `matchlength`.
	For this, we use the second token field, the low 4-bits.
	Value, obviously, ranges from 0 to 15.
	However here, 0 means that the copy operation will be minimal.
	The minimum length of a match, called `minmatch`, is 4.
	As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
	Similar to literal length, on reaching the highest possible value (15),
	we output additional bytes, one at a time, with values ranging from 0 to 255.
	They are added to total to provide the final match length.
	A 255 value means there is another byte to read and add.
	There is no limit to the number of optional bytes that can be output this way.
	(This points towards a maximum achievable compression ratio of about 250).

	Decoding the `matchlength` reaches the end of current sequence.
	Next byte will be the start of another sequence.
	But before moving to next sequence,
	it's time to use the decoded match position and length.
	The decoder copies `matchlength` bytes from match position to current position.

	In some cases, `matchlength` is larger than `offset`.
	Therefore, `match_pos + matchlength > current_pos`,
	which means that later bytes to copy are not yet decoded.
	This is called an "overlap match", and must be handled with special care.
	A common case is an offset of 1,
	meaning the last byte is repeated `matchlength` times.


	Parsing restrictions
	-----------------------
	There are specific parsing rules to respect in order to remain compatible
	with assumptions made by the decoder :

	1. The last 5 bytes are always literals. In other words, the last five bytes
	from the uncompressed input (or all bytes, if the input has less than five
	bytes) must be encoded as literals on behalf of the last sequence.
	The last sequence is incomplete, and stops right after the literals.
	2. The last match must start at least 12 bytes before end of block.
	The last match is part of the penultimate sequence,
	since the last sequence stops right after literals.
	Note that, as a consequence, blocks < 13 bytes cannot be compressed.

	These rules are in place to ensure that the decoder
	can speculatively execute copy instructions
	without ever reading nor writing beyond provided I/O buffers.

	1. To copy literals from a non-last sequence, an 8-byte copy instruction
	can always be safely issued (without reading past the input),
	because literals are followed by a 2-byte offset,
	and last sequence is at least 1+5 bytes long.
	2. Similarly, a match operation can speculatively copy up to 12 bytes
	while remaining within output buffer boundaries.

	Empty inputs can be represented with a zero byte,
	interpreted as a token without literals and without a match.


	Additional notes
	-----------------------
	There is no assumption nor limits to the way the compressor
	searches and selects matches within the source data block.
	It could be a fast scan, a multi-probe, a full search using BST,
	standard hash chains or MMC, well whatever.

	Advanced parsing strategies can also be implemented, such as lazy match,
	or full optimal parsing.

	All these trade-off offer distinctive speed/memory/compression advantages.
	Whatever the method used by the compressor, its result will be decodable
	by any LZ4 decoder if it follows the format specification described above.