API

The API is defined by the header file include/api.hpp. To use the API, add the include/ directory into the include path and link your binary against libraries bcobjs divsufsort.

The API supports only compressing a file bit-optimally. Moreover, the implementation code hasn’t been thoroughly tested, and an inconsistent use of namespaces implies that some implementation types may be dumped in the local namespace, why may lead to name clashes.

We plan to improve the API in subsequent releases.

Encoding format

Some background on the encoding format is needed in order to understand and use the API.

bc-zip is a member of the LZ77 compressor family (in particular, the LZSS variant). As such, a file is compressed as a sequence of text fragments (phrases), which can be represented either as:

Compression is achieved when the majority of text is represented through copy phrases.

Copy phrases are represented as couples <distance, length> and compressed with an integer encoder, which translates integers to succinct bit sequences. bc-zip offers a variety of integer-encoder functions: look at the usage section for more informations about the matter.

Literal phrases are represented as triples <next, L, X>, where:

The next field is needed order to disambiguate between copy- and literal-phrases (which is an effective strategy as the first phrase in the compressed stream must always be a literal phrase).

A sequence of (copy-/literal-)phrases, properly compressed with an integer encoder, representing a text T is denoted as a parsing of T.

For example, a valid parsing of bananaX is <1, 3, ban> <2, 3> <0, 1, X>; the “1” in <1, 3, ban> is the “next-literal” field (because only <2,3> divides <1, 3, ban> from the next literal phrase, <0, 1, X>).

A full compressed file is thus composed of three sections:

A raw compressed buffer is composed of the HEADER section only.

Functions overview

Functions compress and decompress operates with full compressed files (that is, HEADER + BODY), while compress_buffer and decompress_buffer operates directly with raw compressed buffers.

The latter two are useful in a number of situations, such as implementing parallel compressors which “glue” together parsings of entire blocks of T. However, some care is required to use those functions: read carefully the documentation before using them to avoid hard-to-spot bugs.

If compress_buffer is used to obtain a compressed representation of T, the appropriate HEADER and TAIL must be added in order to obtain a full compressed file. Use function create_header to create the HEADER. To get back informations contained therein, use extract_header. To know how much space must be reserved for TAIL, use function safe_buffer_size.

Since literal phrases need to know how many copy phrases there are between the next literal phrases (the next field), carelessly “gluing” together different parsings may result in an inconsistent parsing.

Function fix_parsing is needed in those situations.

It takes as argument, among others, the “glued” parsing (or, in general, a parsing which may have ill-defined nextliteral fields) and a list of unsigned integers which represent the correct sequence of nextliteral fields.

For example, if parsing <42, 3, ban> <2, 3> <5, 1, X> and nextliteral list [1, 0] is given as argument, it returns the parsing <1, 3, ban> <2, 3> <0, 1, X>.

On the other hand, if list [42, 84] is given as argument, it returns parsing <42, 3, ban> <2, 3> <84, 1, X>, which is incorrect.

Thus, it is crucial to properly compute the list of nextliteral fields to the function in order to obtain a correct parsing.

Headers

compress

Compress a text into a full compressed file (HEADER + BODY + TAIL)

std::unique_ptr<byte[]>
compress(
	const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);

Arguments:

encoder_name The integer encoder name (for a list, invoke bc-zip with command list)
uncompressed Pointer to data to be compressed
length Length of data to be compressed
compressed_length If not null, will contain the length of compressed output (in bytes)

Returns:

Pointer to compressed output

decompress

Uncompress a full compressed file (HEADER + BODY + TAIL)

std::unique_ptr<byte[]> decompress(byte *compressed, size_t *decompressed_size = nullptr);

Arguments:

compressed Start of compressed buffer
decompressed_size When not null, will contain the size of the decompressed output (in bytes)

Returns:

Pointer to decompressed output

compress_buffer

Compress a text into a raw compressed buffer (BODY)

std::unique_ptr<byte[]>
compress_buffer(
	const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);

Arguments

encoder_name The integer encoder name (for a list, invoke bc-zip with command list)
uncompressed Pointer to data to be compressed
length Length of data to be compressed
compressed_length If not null, will contain the length of compressed output (in bytes)

Returns:

Pointer to compressed output

decompress_buffer

Uncompress a raw compressed buffer (BODY)

void decompress_buffer(const char *encoder_name, byte *compressed, byte *output, size_t uncompressed_size);

Arguments:

encoder_name Encoder used to compress the data
compressed Pointer to compressed data (just the BODY)
output Pointer to start of decompressed data. Sufficient memory should have been allocated prior to the invocation of this function. Invoke safe_buffer_size to know exactly how much memory to allocate for this buffer.
uncompressed_size Size of uncompressed size (in bytes)

safe_buffer_size

Gets the size of BODY + TAIL given the encoder name and the BODY length. It is named that way because the TAIL must be added in order to safely decompress a raw compressed buffer.

size_t safe_buffer_size(const char *encoder_name, size_t compressed_length);

Arguments:

encoder_name Name of encoder used in compression
compressed_length Parsing length, in bytes

Returns:

Length of containing buffer, in bytes

create_header

Creates an header (to be appended in front of a raw buffer to get a full buffer)

std::unique_ptr<byte[]> create_header(
	const char *encoder_name, std::uint32_t file_size, size_t *header_size
);

Arguments:

encoder_name Encoder name
file_size Length of uncompressed file (in bytes)
header_size Length of returned header, in bytes

Returns:

Pointer to allocated header

extract_header

Extract the header from a full buffer.

byte *extract_header(byte *parsing, char **encoder_name, std::uint32_t *file_size);

Arguments

parsing Pointer to parsing
encoder_name After the invocation, pointer to encoder used to compress the buffer
file_size After the invocation, pointer to uncompressed length (in bytes)

Returns:

Pointer to parsing

fix_parsing

Fix a parsing composed by “gluing” together different edges by properly fixing nextliteral values.

template <typename lits_it>
void fix_parsing(
	std::string encoder, 
	byte *parsing, size_t parsing_length, size_t uncomp_len, 
	byte *output,
	lits_it next_literals
)

Arguments:

encoder Encoder used to obtain the parsing
parsing The parsing to be fixed
parsing_length Parsing length
output The new, fixed parsing. Memory must be allocated in advance