The API is defined by the header file include/api.hpp
.
To use the API, add the include/
directory into the include path and link your binary against libraries bcobjs
divsufsort
.
The API supports only compressing a file bit-optimally. Moreover, the implementation code hasn’t been thoroughly tested, and an inconsistent use of namespaces implies that some implementation types may be dumped in the local namespace, why may lead to name clashes.
We plan to improve the API in subsequent releases.
Some background on the encoding format is needed in order to understand and use the API.
bc-zip
is a member of the LZ77 compressor family (in particular, the LZSS variant).
As such, a file is compressed as a sequence of text fragments (phrases), which can be represented either as:
Compression is achieved when the majority of text is represented through copy phrases.
Copy phrases are represented as couples <distance, length>
and compressed with an integer encoder, which translates integers to succinct bit sequences.
bc-zip
offers a variety of integer-encoder functions: look at the usage section for more informations about the matter.
Literal phrases are represented as triples <next, L, X>
, where:
X
: string in plaintext;L
: length of X
(in binary format, its length depends on the integer encoder in use);next
: a 32-bit integer number equal to the number of subsequent copy-phrases in the compressed stream occurring before the next literal-phrase.The next
field is needed order to disambiguate between copy- and literal-phrases (which is an effective strategy as the first phrase in the compressed stream must always be a literal phrase).
A sequence of (copy-/literal-)phrases, properly compressed with an integer encoder, representing a text T
is denoted as a parsing of T
.
For example, a valid parsing of bananaX
is <1, 3, ban> <2, 3> <0, 1, X>
; the “1” in <1, 3, ban>
is the “next-literal” field (because only <2,3>
divides <1, 3, ban>
from the next literal phrase, <0, 1, X>
).
A full compressed file is thus composed of three sections:
HEADER
: contains the original file length and the name of the integer encoder used for compressing the phrases;BODY
: contains the parsing of the original file;TAIL
: some extra space needed to decompress the file efficiently.A raw compressed buffer is composed of the HEADER
section only.
Functions compress
and decompress
operates with full compressed files (that is, HEADER
+ BODY
), while compress_buffer
and decompress_buffer
operates directly with raw compressed buffers.
The latter two are useful in a number of situations, such as implementing parallel compressors which “glue” together parsings of entire blocks of T
.
However, some care is required to use those functions: read carefully the documentation before using them to avoid hard-to-spot bugs.
If compress_buffer
is used to obtain a compressed representation of T
, the appropriate HEADER
and TAIL
must be added in order to obtain a full compressed file.
Use function create_header
to create the HEADER
. To get back informations contained therein, use extract_header
. To know how much space must be reserved for TAIL
, use function safe_buffer_size
.
Since literal phrases need to know how many copy phrases there are between the next literal phrases (the next
field), carelessly “gluing” together different parsings may result in an inconsistent parsing.
Function fix_parsing
is needed in those situations.
It takes as argument, among others, the “glued” parsing (or, in general, a parsing which may have ill-defined nextliteral fields) and a list of unsigned integers which represent the correct sequence of nextliteral fields.
For example, if parsing <42, 3, ban> <2, 3> <5, 1, X>
and nextliteral list [1, 0]
is given as argument, it returns the parsing <1, 3, ban> <2, 3> <0, 1, X>
.
On the other hand, if list [42, 84]
is given as argument, it returns parsing <42, 3, ban> <2, 3> <84, 1, X>
, which is incorrect.
Thus, it is crucial to properly compute the list of nextliteral fields to the function in order to obtain a correct parsing.
compress
Compress a text into a full compressed file (HEADER
+ BODY
+ TAIL
)
std::unique_ptr<byte[]>
compress(
const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);
Arguments:
encoder_name |
The integer encoder name (for a list, invoke bc-zip with command list ) |
uncompressed |
Pointer to data to be compressed |
length |
Length of data to be compressed |
compressed_length |
If not null, will contain the length of compressed output (in bytes) |
Returns:
Pointer to compressed output
decompress
Uncompress a full compressed file (HEADER
+ BODY
+ TAIL
)
std::unique_ptr<byte[]> decompress(byte *compressed, size_t *decompressed_size = nullptr);
Arguments:
compressed |
Start of compressed buffer |
decompressed_size |
When not null, will contain the size of the decompressed output (in bytes) |
Returns:
Pointer to decompressed output
compress_buffer
Compress a text into a raw compressed buffer (BODY
)
std::unique_ptr<byte[]>
compress_buffer(
const char *encoder_name, byte *uncompressed, size_t length, size_t *compressed_length = nullptr
);
Arguments
encoder_name |
The integer encoder name (for a list, invoke bc-zip with command list ) |
uncompressed |
Pointer to data to be compressed |
length |
Length of data to be compressed |
compressed_length |
If not null, will contain the length of compressed output (in bytes) |
Returns:
Pointer to compressed output
decompress_buffer
Uncompress a raw compressed buffer (BODY
)
void decompress_buffer(const char *encoder_name, byte *compressed, byte *output, size_t uncompressed_size);
Arguments:
encoder_name |
Encoder used to compress the data |
compressed |
Pointer to compressed data (just the BODY ) |
output |
Pointer to start of decompressed data. Sufficient memory should have been allocated prior to the invocation of this function. Invoke safe_buffer_size to know exactly how much memory to allocate for this buffer. |
uncompressed_size |
Size of uncompressed size (in bytes) |
safe_buffer_size
Gets the size of BODY
+ TAIL
given the encoder name and the BODY
length.
It is named that way because the TAIL
must be added in order to safely decompress a raw compressed buffer.
size_t safe_buffer_size(const char *encoder_name, size_t compressed_length);
Arguments:
encoder_name |
Name of encoder used in compression |
compressed_length |
Parsing length, in bytes |
Returns:
Length of containing buffer, in bytes
create_header
Creates an header (to be appended in front of a raw buffer to get a full buffer)
std::unique_ptr<byte[]> create_header(
const char *encoder_name, std::uint32_t file_size, size_t *header_size
);
Arguments:
encoder_name |
Encoder name |
file_size |
Length of uncompressed file (in bytes) |
header_size |
Length of returned header, in bytes |
Returns:
Pointer to allocated header
extract_header
Extract the header from a full buffer.
byte *extract_header(byte *parsing, char **encoder_name, std::uint32_t *file_size);
Arguments
parsing |
Pointer to parsing |
encoder_name |
After the invocation, pointer to encoder used to compress the buffer |
file_size |
After the invocation, pointer to uncompressed length (in bytes) |
Returns:
Pointer to parsing
fix_parsing
Fix a parsing composed by “gluing” together different edges by properly fixing nextliteral values.
template <typename lits_it>
void fix_parsing(
std::string encoder,
byte *parsing, size_t parsing_length, size_t uncomp_len,
byte *output,
lits_it next_literals
)
Arguments:
encoder |
Encoder used to obtain the parsing |
parsing |
The parsing to be fixed |
parsing_length |
Parsing length |
output |
The new, fixed parsing. Memory must be allocated in advance |