API

The API is defined by the header file include/api.hpp. To use the API, add the include/ directory into the include path and link your binary against libraries bcobjs divsufsort.

The API supports only compressing a file bit-optimally. Moreover, the implementation code hasn’t been thoroughly tested, and an inconsistent use of namespaces implies that some implementation types may be dumped in the local namespace, why may lead to name clashes.

We plan to improve the API in subsequent releases.

Encoding format

Some background on the encoding format is needed in order to understand and use the API.

bc-zip is a member of the LZ77 compressor family (in particular, the LZSS variant). As such, a file is compressed as a sequence of text fragments (phrases), which can be represented either as:

in plaintext form (literal phrases), or
copies to previous occurrences of the same string (copy phrase).

Compression is achieved when the majority of text is represented through copy phrases.

Copy phrases are represented as couples <distance, length> and compressed with an integer encoder, which translates integers to succinct bit sequences. bc-zip offers a variety of integer-encoder functions: look at the usage section for more informations about the matter.

Literal phrases are represented as triples <next, L, X>, where:

X: string in plaintext;
L: length of X (in binary format, its length depends on the integer encoder in use);
next: a 32-bit integer number equal to the number of subsequent copy-phrases in the compressed stream occurring before the next literal-phrase.

The next field is needed order to disambiguate between copy- and literal-phrases (which is an effective strategy as the first phrase in the compressed stream must always be a literal phrase).

A sequence of (copy-/literal-)phrases, properly compressed with an integer encoder, representing a text T is denoted as a parsing of T.

For example, a valid parsing of bananaX is <1, 3, ban> <2, 3> <0, 1, X>; the “1” in <1, 3, ban> is the “next-literal” field (because only <2,3> divides <1, 3, ban> from the next literal phrase, <0, 1, X>).

A full compressed file is thus composed of three sections:

HEADER: contains the original file length and the name of the integer encoder used for compressing the phrases;
BODY: contains the parsing of the original file;
TAIL: some extra space needed to decompress the file efficiently.

A raw compressed buffer is composed of the HEADER section only.

Functions overview

Functions compress and decompress operates with full compressed files (that is, HEADER + BODY), while compress_buffer and decompress_buffer operates directly with raw compressed buffers.

The latter two are useful in a number of situations, such as implementing parallel compressors which “glue” together parsings of entire blocks of T. However, some care is required to use those functions: read carefully the documentation before using them to avoid hard-to-spot bugs.

If compress_buffer is used to obtain a compressed representation of T, the appropriate HEADER and TAIL must be added in order to obtain a full compressed file. Use function create_header to create the HEADER. To get back informations contained therein, use extract_header. To know how much space must be reserved for TAIL, use function safe_buffer_size.

Since literal phrases need to know how many copy phrases there are between the next literal phrases (the next field), carelessly “gluing” together different parsings may result in an inconsistent parsing.

Function fix_parsing is needed in those situations.

It takes as argument, among others, the “glued” parsing (or, in general, a parsing which may have ill-defined nextliteral fields) and a list of unsigned integers which represent the correct sequence of nextliteral fields.

For example, if parsing <42, 3, ban> <2, 3> <5, 1, X> and nextliteral list [1, 0] is given as argument, it returns the parsing <1, 3, ban> <2, 3> <0, 1, X>.

On the other hand, if list [42, 84] is given as argument, it returns parsing <42, 3, ban> <2, 3> <84, 1, X>, which is incorrect.

Thus, it is crucial to properly compute the list of nextliteral fields to the function in order to obtain a correct parsing.

Headers

`compress`

Compress a text into a full compressed file (HEADER + BODY + TAIL)

std::unique_ptr<byte[]>
compress(
	const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);

Arguments:

`encoder_name`	The integer encoder name (for a list, invoke `bc-zip` with command `list`)
`uncompressed`	Pointer to data to be compressed
`length`	Length of data to be compressed
`compressed_length`	If not null, will contain the length of compressed output (in bytes)

Returns:

Pointer to compressed output

`decompress`

Uncompress a full compressed file (HEADER + BODY + TAIL)

std::unique_ptr<byte[]> decompress(byte *compressed, size_t *decompressed_size = nullptr);

Arguments:

`compressed`	Start of compressed buffer
`decompressed_size`	When not null, will contain the size of the decompressed output (in bytes)

Returns:

Pointer to decompressed output

`compress_buffer`

Compress a text into a raw compressed buffer (BODY)

std::unique_ptr<byte[]>
compress_buffer(
	const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);

Arguments

`encoder_name`	The integer encoder name (for a list, invoke `bc-zip` with command `list`)
`uncompressed`	Pointer to data to be compressed
`length`	Length of data to be compressed
`compressed_length`	If not null, will contain the length of compressed output (in bytes)

Returns:

Pointer to compressed output

`decompress_buffer`

Uncompress a raw compressed buffer (BODY)

void decompress_buffer(const char *encoder_name, byte *compressed, byte *output, size_t uncompressed_size);

Arguments:

`encoder_name`	Encoder used to compress the data
`compressed`	Pointer to compressed data (just the `BODY`)
`output`	Pointer to start of decompressed data. Sufficient memory should have been allocated prior to the invocation of this function. Invoke `safe_buffer_size` to know exactly how much memory to allocate for this buffer.
`uncompressed_size`	Size of uncompressed size (in bytes)

`safe_buffer_size`

Gets the size of BODY + TAIL given the encoder name and the BODY length. It is named that way because the TAIL must be added in order to safely decompress a raw compressed buffer.

size_t safe_buffer_size(const char *encoder_name, size_t compressed_length);

Arguments:

`encoder_name`	Name of encoder used in compression
`compressed_length`	Parsing length, in bytes

Returns:

Length of containing buffer, in bytes

`create_header`

Creates an header (to be appended in front of a raw buffer to get a full buffer)

std::unique_ptr<byte[]> create_header(
	const char *encoder_name, std::uint32_t file_size, size_t *header_size
);

Arguments:

`encoder_name`	Encoder name
`file_size`	Length of uncompressed file (in bytes)
`header_size`	Length of returned header, in bytes

Returns:

Pointer to allocated header

`extract_header`

Extract the header from a full buffer.

byte *extract_header(byte *parsing, char **encoder_name, std::uint32_t *file_size);

Arguments

`parsing`	Pointer to parsing
`encoder_name`	After the invocation, pointer to encoder used to compress the buffer
`file_size`	After the invocation, pointer to uncompressed length (in bytes)

Returns:

Pointer to parsing

`fix_parsing`

Fix a parsing composed by “gluing” together different edges by properly fixing nextliteral values.

template <typename lits_it>
void fix_parsing(
	std::string encoder, 
	byte *parsing, size_t parsing_length, size_t uncomp_len, 
	byte *output,
	lits_it next_literals
)

Arguments:

`encoder`	Encoder used to obtain the parsing
`parsing`	The parsing to be fixed
`parsing_length`	Parsing length
`output`	The new, fixed parsing. Memory must be allocated in advance