Welcome to gzip_static’s documentation!
Installation
gzip_static can be installed with pip install gzip_static. There are no dependencies by default.
The following packages can be installed to enhance the functionality of gzip_static.
zopfli adds zopfli compression to gzip_static. Zopfli-compressed files can be decompressed with any gzip-compatible tool, and the compressed size is a few percent smaller than files compressed with gzip level 9. This comes at the cost of a much longer compression time (roughly 125x, depending on the website). It works well for files that rarely change but are always downloaded, such as stylesheets.
xxhash speeds up the checksumming process. This makes gzip-static about 28% faster when running on a website folder with all the gzip files up to date.
isal speeds up the decompression of gzip files during the checksumming process. This makes gzip-static about 66% faster when running on a website folder with all the gzip files up to date. Isal is only available on 64-bit platforms.
Together, xxhash and isal decrease the runtime of checksumming by about 60%, making it roughly 2.5 times faster.
These dependencies are all optional and can be installed separately in the environment or with the optional dependency commands:
- pip install gzip_static[zopfli] installs gzip_static and zopfli.
- pip install gzip_static[performance] installs gzip_static, xxhash and isal.
- pip install gzip_static[full] installs gzip_static, zopfli, xxhash and isal.
Usage
gzip-static usage
usage: gzip-static [-h] [-e EXTENSIONS_FILE] [-l {6,9,11} | --zopfli] [-f]
[--remove-orphans] [-d]
directory
Positional Arguments
- directory
The directory containing the static site
Named Arguments
- -e, --extensions-file
A file with extensions to consider when compressing. Use one line per extension. Check the default for an example.
Default: /home/docs/checkouts/readthedocs.org/user_builds/gzip-static/envs/latest/lib/python3.7/site-packages/gzip_static-0.1.0-py3.7.egg/gzip_static/extensions.txt
- -l, --compression-level
Possible choices: 6, 9, 11
The compression level that will be used for the gzip compression. Use 11 for zopfli compression (if available).
Default: 9
- --zopfli
Use zopfli for the compression. Alias for -l 11 or --compression-level 11.
- -f, --force
Force recompression of all earlier compressed files.
Default: False
- --remove-orphans
Remove gzip files for which the parent file is missing and whose extension is in the extensions file. For example: if page3.html.gz is present but page3.html is not, page3.html.gz will be removed.
Default: False
- -d, --debug
Print debug information to stderr.
Default: False
gzip-static-find-orphans usage
usage: gzip-static-find-orphans [-h] [-e EXTENSIONS_FILE] directory
Positional Arguments
- directory
The directory containing the static site
Named Arguments
- -e, --extensions-file
A file with extensions to consider when compressing. Use one line per extension. Check the default for an example.
Default: /home/docs/checkouts/readthedocs.org/user_builds/gzip-static/envs/latest/lib/python3.7/site-packages/gzip_static-0.1.0-py3.7.egg/gzip_static/extensions.txt
API Documentation: gzip_static
Functions to compress a website’s static files.
- class gzip_static.GzipStaticResult(created: int, updated: int, skipped: int, deleted: int)
A class containing the results for the gzip_static function.
- property created
Alias for field number 0
- property deleted
Alias for field number 3
- property skipped
Alias for field number 2
- property updated
Alias for field number 1
- gzip_static.compress_idempotent(filepath: Union[str, os.PathLike], compresslevel=9, hash_algorithm=<built-in function openssl_sha1>, force: bool = False) → int
Only compress the file if no companion .gz is present that contains the correct contents.
This function ensures the mode, atime and mtime of the gzip file are inherited from the file to be compressed.
- Parameters
filepath – The path to the file.
compresslevel – The compression level. Use 11 for zopfli.
hash_algorithm – The hash_algorithm to check the contents with.
force – Always create a new ‘.gz’ file to overwrite the old one.
- Returns
An integer that stands for the action taken. Matches with the COMPRESSED, RECOMPRESSED and SKIPPED constants in this module.
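The idempotent check above can be sketched with the standard library alone. This is an illustrative reimplementation of the behaviour described, not the library's own code; the constant values are assumptions:

```python
import gzip
import hashlib
import os

# Illustrative action constants; the real module defines COMPRESSED,
# RECOMPRESSED and SKIPPED, but their values are not documented here.
SKIPPED, COMPRESSED, RECOMPRESSED = 0, 1, 2

def compress_idempotent_sketch(filepath: str, compresslevel: int = 9) -> int:
    """Compress filepath to filepath + '.gz' only if needed."""
    gz_path = filepath + ".gz"
    with open(filepath, "rb") as f:
        data = f.read()
    source_hash = hashlib.sha1(data).digest()
    action = COMPRESSED
    if os.path.exists(gz_path):
        with gzip.open(gz_path, "rb") as gz:
            if hashlib.sha1(gz.read()).digest() == source_hash:
                return SKIPPED  # companion .gz is already up to date
        action = RECOMPRESSED
    with open(gz_path, "wb") as out:
        out.write(gzip.compress(data, compresslevel=compresslevel))
    # Inherit mode, atime and mtime from the source file.
    stat = os.stat(filepath)
    os.chmod(gz_path, stat.st_mode)
    os.utime(gz_path, (stat.st_atime, stat.st_mtime))
    return action
```

Calling it twice on an unchanged file performs the compression only once; the second call detects the matching contents and skips.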
- gzip_static.compress_path(filepath: Union[str, os.PathLike], compresslevel: int = 9, block_size: int = 32768) → None
Compress a file’s contents and write them to a ‘.gz’ file.
Similar to gzip -k <filepath>
- Parameters
filepath – The path to the file
compresslevel – The gzip compression level to use. Use 11 for zopfli compression.
block_size – The size of the chunks read from the file at once.
- gzip_static.find_orphaned_files(dir: Union[str, os.PathLike], extensions: Container[str] = frozenset({'.css', '.htm', '.html', '.js', '.json', '.rss', '.svg', '.txt', '.xml', '.xsl'})) → Generator[str, None, None]
Scan a directory recursively for ‘.gz’ files that do not have a parent file with an extension in extensions.
For example, find_orphaned_files(my_dir, {".html"}) will find index.html.gz if index.html is not present. It will not find myhostedarchive.tar.gz, as '.tar' is not in the set of extensions.
- Parameters
dir – The directory to scan.
extensions – Extensions of parent files to include.
- Returns
A generator of filepaths of orphaned ‘.gz’ files.
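The orphan-detection behaviour can be sketched with os.walk; names here are illustrative, not the library's implementation:

```python
import os
from typing import Container, Iterator

def find_orphans_sketch(directory: str, extensions: Container[str]) -> Iterator[str]:
    """Yield '.gz' files whose parent file is missing."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith(".gz"):
                continue
            parent = name[:-3]  # strip the '.gz' suffix
            _, ext = os.path.splitext(parent)
            # Only consider gzips whose parent extension is managed;
            # this is what keeps e.g. archive.tar.gz out of the results.
            if ext in extensions and parent not in files:
                yield os.path.join(root, name)
```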
- gzip_static.find_static_files(dir: Union[str, os.PathLike], extensions: Container[str] = frozenset({'.css', '.htm', '.html', '.js', '.json', '.rss', '.svg', '.txt', '.xml', '.xsl'})) → Generator[str, None, None]
Scan a directory recursively for files that have an extension in the set of extensions.
- Parameters
dir – The directory to scan.
extensions – A set of extensions to scan for.
- Returns
A generator of filepaths that match the extensions.
- gzip_static.get_extension(filename: str)
The filename’s extension, if any.
This includes the leading period. For example: ‘.txt’
- gzip_static.gzip_static(dir: Union[str, os.PathLike], extensions: Container[str] = frozenset({'.css', '.htm', '.html', '.js', '.json', '.rss', '.svg', '.txt', '.xml', '.xsl'}), compresslevel: int = 9, hash_algorithm=<built-in function openssl_sha1>, force: bool = False, remove_orphans: bool = False) → gzip_static.GzipStaticResult
Gzip all static files in a directory and its subdirectories in an idempotent manner.
- Parameters
dir – The directory to recurse through.
extensions – Extensions which are static files.
compresslevel – The compression level that is used when compressing.
hash_algorithm – The hash algorithm is used when checking file contents.
force – Recompress all files regardless of whether their content has changed.
remove_orphans – Remove ‘.gz’ files where the parent static file is no longer present.
- Returns
A tuple with 4 entries. The number of compressed, recompressed, skipped and deleted gzip files.
- gzip_static.hash_file_contents(filepath: Union[str, os.PathLike], hash_algorithm=<built-in function openssl_sha1>, block_size: int = 32768) → bytes
Read contents from a file and return the hash.
- Parameters
filepath – The path to the file. Paths ending in ‘.gz’ will be automatically decompressed.
hash_algorithm – The hash algorithm to use. Must be hashlib-compatible.
block_size – The size of the chunks read from the file at once.
- Returns
A digest of the hash.
- gzip_static.read_extensions_file(filepath: Union[str, os.PathLike]) → Set[str]
Read a file with one extension per line.
- Parameters
filepath – The extensions file
- Returns
a set of extensions.
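Parsing such a file is a one-liner; a minimal sketch (not the library's exact code) that skips blank lines:

```python
def read_extensions_sketch(filepath):
    """Return the set of extensions listed one per line in filepath."""
    with open(filepath, "rt") as f:
        return {line.strip() for line in f if line.strip()}
```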
Technical considerations
Choosing a checksum
Different checksums were considered. MD5 is traditionally used for checksumming, but SHA-1, SHA-256 and SHA-512 also see use as hashing algorithms nowadays. Traditionally, cyclic redundancy checks are performed. These are available in the Python zlib library as the crc32 and adler32 functions. A fast method called xxHash is also available nowadays, with Python bindings available as a package on PyPI.
As highlighted by the xxHash author in an answer on bleepcoder, cyclic redundancy checks have slightly worse collision behaviour than modern hash algorithms. The xxHash homepage has a list of algorithms and their speeds. The SHA-1 hash algorithm is the fastest algorithm available in hashlib.algorithms_guaranteed (this was verified on two different PCs), so it was chosen as the default. The XXH3_128 algorithm is used when xxhash is installed.
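The selection logic described above amounts to preferring xxHash's XXH3-128 when the optional dependency is installed and falling back to SHA-1 otherwise. A minimal sketch:

```python
import hashlib

# Prefer XXH3-128 from the optional xxhash package; fall back to the
# fastest guaranteed hashlib algorithm (SHA-1) when it is not installed.
try:
    import xxhash
    default_hash = xxhash.xxh3_128
except ImportError:
    default_hash = hashlib.sha1

hasher = default_hash()
hasher.update(b"hello world")
digest = hasher.digest()  # 16 bytes for XXH3-128, 20 bytes for SHA-1
```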
Speedy hashing of small gzip files
Speedy hashing of normal files is quite easy: open the file, read it in blocks, feed each block to the hasher and take the digest at the end. Choosing a decent block size speeds it up slightly. (32K is used here; 128K is used by cat, so choosing more than Python's default of 8K is quite common.)
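The plain-file case can be sketched in a few lines (illustrative, not the library's exact implementation):

```python
import hashlib

def hash_file_sketch(filepath, hash_algorithm=hashlib.sha1, block_size=32 * 1024):
    """Hash a file in 32 KiB blocks and return the digest."""
    hasher = hash_algorithm()
    with open(filepath, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hasher.update(block)
    return hasher.digest()
```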
Speedy hashing of gzip files presents a problem. We can simply use Python's gzip.open, which returns a GzipFile, but that is slow. Just like normal open this creates an interface to read the file, but then it gets more complicated: it gets wrapped into a _PaddedFile object, which is then wrapped into a _GzipReader object, which is then wrapped by the GzipFile.
All these layers solve two problems:
- A controlled number of bytes can be read from the compressed file. Since the compression ratio can differ along the file, it is impossible to grab a certain number of bytes and know exactly the size of the output once decompressed. _GzipReader.read has mechanisms built in to always output the desired number of bytes.
- Gzip allows multiple members (each consisting of a header, compressed body and trailer) to be concatenated together. After a member is decompressed, the remaining bytes in the file must be checked for another gzip member.
This functionality creates a lot of overhead. Using Python's zlib.decompress with wbits=31 avoids it, as it can decompress an in-memory block in its entirety. It cannot read multiple members, but since these gzip files are compressed by gzip_static itself we know they contain only one member.
However, this presents another problem: files have to be read into memory entirely. This was solved by using a zlib.decompressobj instead and calling the decompress method on that object, which works with streaming decompression. It is not a problem that we do not know in advance how many bytes each call returns; this is typically 3-6 times the number of input bytes. At best gzip can compress at ratios of ~1000x (tested with an all-zeroes binary, an all-ones binary, and a repetition of a single character). So if the input block size is 8K, we can expect at most 8M bytes to be read into memory. This is acceptable, and this way even large static files of several hundreds of MB can be checksummed in a streaming fashion.
The great advantage of this method is that most gzip files will be smaller than 8K, so only one decompress call is needed. This is almost as fast as in-memory decompression with zlib.decompress while allowing streaming.
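The streaming approach can be sketched as follows. This is an illustration of the technique, not the library's own code; wbits=31 tells zlib to parse the gzip header and trailer, and decompressobj keeps everything streaming:

```python
import gzip
import hashlib
import io
import zlib

def hash_gzip_stream_sketch(fileobj, block_size=8 * 1024):
    """Checksum a single-member gzip stream without gzip.GzipFile."""
    # wbits=31 = gzip format: the header and trailer are handled by zlib.
    decompressor = zlib.decompressobj(wbits=31)
    hasher = hashlib.sha1()
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        hasher.update(decompressor.decompress(block))
    return hasher.digest()

data = b"static content " * 1000
digest = hash_gzip_stream_sketch(io.BytesIO(gzip.compress(data)))
```

python-isal's isal_zlib.decompressobj can be dropped in for zlib.decompressobj when an even faster decompression is wanted.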
For example, running on docs.python.org compresses 6374 static files with a combined size of 481 MB. The resulting gzip sizes are as follows:
- 8K or below (one decompress call): 3516
- 8K-16K (two decompress calls): 1560
- 16K-24K (three decompress calls): 565
- 24K-32K (four decompress calls): 308
- 32K-64K (eight or fewer decompress calls): 356
- larger than 64K: 69
In total 6305 (99%!) of the gzip files are smaller than 64K and can be decompressed with eight or fewer calls. Since the gzip.GzipFile overhead weighs in heavily at these small file sizes, using zlib.decompressobj creates a notable speed improvement, reducing decompression time by about 30% for the docs.python.org website. The speedup can be even greater when using python-isal: its isal_zlib.decompressobj reduces the decompression time by more than 50%.
No brotli support
Brotli is an excellent compression algorithm, and most browsers support it. There are several reasons why it is not supported by gzip_static:
- The ngx_brotli module is not provided as a package by Debian, Ubuntu or CentOS.
- Supporting two formats simultaneously makes the code more complex.
This project was made to work with nginx's gzip plugin to host my websites. The gzip plugin is built into even the simplest nginx package on Debian (nginx-light). Getting brotli to work, however, is much more effort: it needs to be compiled, and compiled with exactly the right instructions. Brotli has been around since 2013 and has tremendous advantages, but ngx_brotli has not been packaged in Debian for 8 years. The last release of Debian (bullseye) had 11294 new packages, yet ngx_brotli is nowhere on the horizon.
Once a properly working ngx_brotli module is packaged in Debian, I am happy to add brotli support!
Changelog
version 0.1.0
Publish documentation on readthedocs.
Make sure the gzip files inherit file attributes from the parent file.
Add functionality to remove orphaned gzip files.
Speed up the checksumming process with isal and xxhash.
Add zopfli support.
Create functions to compress a website’s static assets idempotently.