lz4.c: refactor the decoding routines

I noticed that LZ4_decompress_generic is sometimes instantiated with
an identical set of parameters, or (what's worse) with subtly different
sets of parameters.  For example, LZ4_decompress_fast_withPrefix64k is
instantiated as follows:

    return LZ4_decompress_generic(source, dest, 0, originalSize, endOnOutputSize,
		full, 0, withPrefix64k, (BYTE*)dest - 64 KB, NULL, 64 KB);

while the equivalent withPrefix64k call in LZ4_decompress_usingDict_generic
passes 0 for the last argument instead of 64 KB.  It turns out that there
is no difference in this case: if you change 64 KB to 0 KB in
LZ4_decompress_fast_withPrefix64k, you get the same binary code.

Moreover, because it's been clarified that LZ4_decompress_fast doesn't
check match offsets, it is now obvious that both of these fast/withPrefix64k
instantiations are simply redundant.  Exactly because LZ4_decompress_fast
doesn't check offsets, it serves well with any prefixed dictionary.

There's a difference, though, with LZ4_decompress_safe_withPrefix64k.
It also passes 64 KB as the last argument, and if you change that to 0,
as in LZ4_decompress_usingDict_generic, you get completely different
binary code.  It seems that passing 0 enables offset checking:

    const int checkOffset = ((safeDecode) && (dictSize < (int)(64 KB)));

However, the resulting code seems to run a bit faster.  How come
enabling extra checks can make the code run faster?  Curiouser and
curiouser!  This needs extra study.  Currently I take the view that
the dictSize should be set to non-zero when nothing else will do,
i.e. when passing the external dictionary via dictStart.  Otherwise,
lowPrefix betrays just enough information about the dictionary.
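
To make the role of dictSize concrete, here is a minimal sketch of the
offset check in isolation.  The helper match_offset_rejected is
hypothetical and does not exist in lz4.c; only the checkOffset expression
and the KB macro are taken from the source.  The sketch assumes that the
check compares the match offset against the history available behind the
output pointer.  Since block-format offsets are stored in two bytes, a
declared dictSize of 64 KB already covers every possible offset, which
would explain why the check is compiled out in that case.

```c
#include <stddef.h>

#define KB *(1 << 10)   /* as in lz4.c, so that "64 KB" reads as 64 *(1 << 10) */

/* Hypothetical helper (not in lz4.c): returns 1 when a match starting
 * `offset` bytes before `op` has to be rejected, 0 otherwise. */
static int match_offset_rejected(const unsigned char *lowPrefix,
                                 const unsigned char *op,
                                 size_t offset,
                                 int safeDecode, int dictSize)
{
    /* Same expression as in LZ4_decompress_generic. */
    const int checkOffset = ((safeDecode) && (dictSize < (int)(64 KB)));

    /* Bytes of history a match may reach back into: the output
     * produced so far plus the declared dictionary size. */
    const size_t history = (size_t)(op - lowPrefix) + (size_t)dictSize;

    return checkOffset && (offset > history);
}
```

With safeDecode set and dictSize == 0, an offset that reaches below
lowPrefix is rejected.  With dictSize == 64 KB or with safeDecode == 0,
checkOffset is a constant 0 and the whole test can be folded away.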

    * * *

Anyway, with this change, I instantiate all the necessary cases as
functions with distinctive names, which also take fewer arguments and
are therefore less error-prone.  I also make the functions non-inline.
(The compiler won't inline the functions because they are used more than
once.  Hence I attach LZ4_FORCE_O2_GCC_PPC64LE to the instances while
removing it from the callers.)  The number of instances is now reduced
from 18 (safe+fast+partial+4*continue+4*prefix+4*dict+2*prefix64+forceExtDict)
down to 7 (safe+fast+partial+2*prefix+2*dict).  The size of the code is
not the only issue here.  Separate helper functions are much more
amenable to profile-guided optimization: it is enough to profile only
a few basic functions, while the other less-often used functions, such
as LZ4_decompress_*_continue, will benefit automatically.
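
The pattern can be illustrated with a toy example.  Everything below is
hypothetical (the toy_* names are not part of lz4.c) and only mirrors the
shape of the change: one generic routine that takes its mode as a
constant argument, a few named instances with shorter argument lists,
and a less common entry point that reuses an instance instead of
instantiating the generic routine again.

```c
#include <stddef.h>
#include <string.h>

typedef enum { toy_unchecked = 0, toy_checked = 1 } toy_check_mode;

/* Generic routine: `mode` is a constant at every call site, so the
 * compiler can drop the capacity test in the unchecked instance. */
static inline int toy_copy_generic(char *dst, size_t dstCapacity,
                                   const char *src, size_t n,
                                   toy_check_mode mode)
{
    if (mode == toy_checked && n > dstCapacity) return -1;
    memcpy(dst, src, n);
    return (int)n;
}

/* Named instances: ordinary functions with fewer arguments. */
static int toy_copy_safe(char *dst, size_t dstCapacity,
                         const char *src, size_t n)
{
    return toy_copy_generic(dst, dstCapacity, src, n, toy_checked);
}

static int toy_copy_fast(char *dst, const char *src, size_t n)
{
    return toy_copy_generic(dst, 0, src, n, toy_unchecked);
}

/* A less common entry point calls the instance twice rather than
 * creating two more copies of the generic code. */
static int toy_copy_safe_twice(char *dst, size_t dstCapacity,
                               const char *src, size_t n)
{
    if (toy_copy_safe(dst, dstCapacity, src, n) < 0) return -1;
    if (toy_copy_safe(dst + n, dstCapacity - n, src, n) < 0) return -1;
    return (int)(2 * n);
}
```

Here only toy_copy_safe and toy_copy_fast contain a copy of the generic
body, so those are the only functions that would need profiling.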

This is the list of LZ4_decompress* functions in liblz4.so, sorted by size.
Exported functions are marked with a capital T.

$ nm -S lib/liblz4.so |grep -wi T |grep LZ4_decompress |sort -k2
0000000000016260 0000000000000005 T LZ4_decompress_fast_withPrefix64k
0000000000016dc0 0000000000000025 T LZ4_decompress_fast_usingDict
0000000000016d80 0000000000000040 T LZ4_decompress_safe_usingDict
0000000000016d10 000000000000006b T LZ4_decompress_fast_continue
0000000000016c70 000000000000009f T LZ4_decompress_safe_continue
00000000000156c0 000000000000059c T LZ4_decompress_fast
0000000000014a90 00000000000005fa T LZ4_decompress_safe
0000000000015c60 00000000000005fa T LZ4_decompress_safe_withPrefix64k
0000000000002280 00000000000005fa t LZ4_decompress_safe_withSmallPrefix
0000000000015090 000000000000062f T LZ4_decompress_safe_partial
0000000000002880 00000000000008ea t LZ4_decompress_fast_extDict
0000000000016270 0000000000000993 t LZ4_decompress_safe_forceExtDict
1 file changed
tree: 4f983e141b3c3651580272408d394161ad19f244
  1. .gitattributes
  2. .gitignore
  3. .travis.yml
  4. INSTALL
  5. LICENSE
  6. Makefile
  7. NEWS
  8. README.md
  9. appveyor.yml
  10. circle.yml
  11. contrib/
  12. doc/
  13. examples/
  14. lib/
  15. programs/
  16. tests/
  17. visual/
README.md

LZ4 - Extremely fast compression

LZ4 is a lossless compression algorithm, providing compression speed at 400 MB/s per core, scalable with multi-core CPUs. It features an extremely fast decoder, with speed in multiple GB/s per core, typically reaching RAM speed limits on multi-core systems.

Speed can be tuned dynamically by selecting an “acceleration” factor, which trades compression ratio for more speed. At the other end, a high-compression derivative, LZ4_HC, is also provided, trading CPU time for improved compression ratio. All versions feature the same decompression speed.

The LZ4 library is provided as open-source software under a BSD 2-Clause license.

| Branch | Status |
|--------|--------|
| master | Build Status, Build status, coverity |
| dev | Build Status, Build status |

Branch Policy:

  • The “master” branch is considered stable, at all times.
  • The “dev” branch is the one where all contributions must be merged before being promoted to master.
    • If you plan to propose a patch, please commit into the “dev” branch, or its own feature branch. Direct commits to “master” are not permitted.

Benchmarks

The benchmark uses lzbench, from @inikep, compiled with GCC v6.2.0 on 64-bit Linux. The reference system uses a Core i7-3930K CPU @ 4.5GHz. The benchmark evaluates the compression of the reference Silesia Corpus in single-thread mode.

| Compressor | Ratio | Compression | Decompression |
|------------|-------|-------------|---------------|
| memcpy | 1.000 | 7300 MB/s | 7300 MB/s |
| LZ4 fast 8 (v1.7.3) | 1.799 | 911 MB/s | 3360 MB/s |
| LZ4 default (v1.7.3) | 2.101 | 625 MB/s | 3220 MB/s |
| LZO 2.09 | 2.108 | 620 MB/s | 845 MB/s |
| QuickLZ 1.5.0 | 2.238 | 510 MB/s | 600 MB/s |
| Snappy 1.1.3 | 2.091 | 450 MB/s | 1550 MB/s |
| LZF v3.6 | 2.073 | 365 MB/s | 820 MB/s |
| Zstandard 1.1.1 -1 | 2.876 | 330 MB/s | 930 MB/s |
| Zstandard 1.1.1 -3 | 3.164 | 200 MB/s | 810 MB/s |
| zlib deflate 1.2.8 -1 | 2.730 | 100 MB/s | 370 MB/s |
| LZ4 HC -9 (v1.7.3) | 2.720 | 34 MB/s | 3240 MB/s |
| zlib deflate 1.2.8 -6 | 3.099 | 33 MB/s | 390 MB/s |

LZ4 is also compatible with, and well optimized for, x32 mode, in which it gains an additional 10% in speed.

Installation

make
make install     # this command may require root access

LZ4's Makefile supports standard Makefile conventions, including staged installs, redirection, or command redefinition. It is compatible with parallel builds (-j#).

Documentation

The raw LZ4 block compression format is detailed within lz4_Block_format.

To compress an arbitrarily long file or data stream, multiple blocks are required. Organizing these blocks and providing a common header format to handle their content is the purpose of the Frame format, defined in lz4_Frame_format. Interoperable versions of LZ4 must respect this frame format.
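
As a small illustration of what the common header provides, the sketch below checks whether a buffer starts with the LZ4 frame magic number, 0x184D2204 stored in little-endian byte order. The function name looks_like_lz4_frame is hypothetical and not part of the library; a real reader would go on to parse the rest of the frame header as described in lz4_Frame_format.

```c
#include <stddef.h>
#include <stdint.h>

#define LZ4_FRAME_MAGIC 0x184D2204u   /* frame magic number, little-endian on disk */

/* Hypothetical helper: returns 1 if the buffer begins with the LZ4 frame
 * magic number, 0 otherwise. It does not validate the rest of the header. */
static int looks_like_lz4_frame(const unsigned char *buf, size_t len)
{
    uint32_t magic;

    if (len < 4) return 0;

    /* Assemble the first four bytes as a little-endian 32-bit value. */
    magic = (uint32_t)buf[0]
          | ((uint32_t)buf[1] << 8)
          | ((uint32_t)buf[2] << 16)
          | ((uint32_t)buf[3] << 24);

    return magic == LZ4_FRAME_MAGIC;
}
```

Reading the value byte by byte keeps the check independent of the host's endianness and alignment rules.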

Other source versions

Beyond the C reference source, many contributors have created versions of lz4 in multiple languages (Java, C#, Python, Perl, Ruby, etc.). A list of known source ports is maintained on the LZ4 Homepage.