Bzip2 in Rust: porting the randomization table

- bzip2, rust

Here is a straightforward port of some easy code.

randtable.c has a lookup table with seemingly-random numbers. This table is used by the following macros in bzlib_private.h:

extern Int32 BZ2_rNums[512];

#define BZ_RAND_DECLS                          \
   Int32 rNToGo;                               \
   Int32 rTPos                                 \

#define BZ_RAND_INIT_MASK                      \
   s->rNToGo = 0;                              \
   s->rTPos  = 0                               \

#define BZ_RAND_MASK ((s->rNToGo == 1) ? 1 : 0)

#define BZ_RAND_UPD_MASK                       \
   if (s->rNToGo == 0) {                       \
      s->rNToGo = BZ2_rNums[s->rTPos];         \
      s->rTPos++;                              \
      if (s->rTPos == 512) s->rTPos = 0;       \
   }                                           \
   s->rNToGo--;

Here, BZ_RAND_DECLS is used to declare two fields, rNToGo and rTPos, into two structs (1, 2). Both are similar to this:

typedef struct {
   ...
   Bool     blockRandomised;
   BZ_RAND_DECLS
   ...
} DState;

Then, the code that needs to initialize those fields calls BZ_RAND_INIT_MASK, which expands into code to set the two fields to zero.

At several points in the code, BZ_RAND_UPD_MASK gets called, which expands into code that updates the randomization state, or something like that, and uses BZ_RAND_MASK to get a useful value out of the randomization state.

I have no idea yet what the state is about, but let's port it directly.

Give things a name

It's interesting to see that no code except for those macros uses the fields rNToGo and rTPos, which are declared via BZ_RAND_DECLS. So, let's make up a type with a name for that. Since I have no better name for it, I shall call it just RandState. I added that type definition in the C code, and replaced the macro-which-creates-struct-fields with a RandState-typed field:

-#define BZ_RAND_DECLS                          \
-   Int32 rNToGo;                               \
-   Int32 rTPos                                 \
+typedef struct {
+   Int32 rNToGo;
+   Int32 rTPos;
+} RandState;

...

-      BZ_RAND_DECLS;
+      RandState rand;

Since the fields now live inside a sub-struct, I changed the other macros to use s->rand.rNToGo instead of s->rNToGo, and similarly for the other field.

Turn macros into functions

Now, three commits (1, 2, 3) to turn the macros BZ_RAND_INIT_MASK, BZ_RAND_MASK, and BZ_RAND_UPD_MASK into functions.

And now that the functions live in the same C source file as the lookup table they reference, the table can be made static const to avoid having it as read/write unshared data in the linked binary.

Premature optimization concern: doesn't de-inlining those macros cause performance problems? At first, we will get the added overhead from a function call. When the whole code is ported to Rust, the Rust compiler will probably be able to figure out that those tiny functions can be inlined (or we can #[inline] them by hand if we have proof, or if we have more hubris than faith in LLVM).

Port functions and table to Rust

The functions are so tiny, and the table so cut-and-pasteable, that it's easy to port them to Rust in a single shot:

#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_init() -> RandState {
    RandState {
        rNToGo: 0,
        rTPos: 0,
    }
}

#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_mask(r: &RandState) -> i32 {
    if r.rNToGo == 1 {
        1
    } else {
        0
    }
}

#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_update_mask(r: &mut RandState) {
    if r.rNToGo == 0 {
        r.rNToGo = RAND_TABLE[r.rTPos as usize];
        r.rTPos += 1;
        if r.rTPos == 512 {
            r.rTPos = 0;
        }
    }
    r.rNToGo -= 1;
}

Also, we define the RandState type as a Rust struct with a C-compatible representation, so it will have the same layout in memory as the C struct. This is what allows us to have a RandState in the C struct, while in reality the C code doesn't access it directly; it is just used as a struct field.

// Keep this in sync with bzlib_private.h:
#[repr(C)]
pub struct RandState {
    rNToGo: i32,
    rTPos: i32,
}

See the commit for the corresponding extern declarations in bzlib_private.h. With those functions and the table ported to Rust, we can remove randtable.c. Yay!

A few cleanups

After moving to another house one throws away useless boxes; we have to do some cleanup in the Rust code after the initial port, too.

Rust prefers snake_case fields rather than camelCase ones, and I agree. I renamed the fields to n_to_go and table_pos.

Then, I discovered that the EState struct doesn't actually use the fields for the randomization state. I just removed them.

Exegesis

What is that randomization state all about?

And why does DState (the struct used during decompression) need the randomization state, but EState (used during compression) doesn't need it?

I found this interesting comment:

      /*-- 
         Now a single bit indicating (non-)randomisation. 
         As of version 0.9.5, we use a better sorting algorithm
         which makes randomisation unnecessary.  So always set
         the randomised bit to 'no'.  Of course, the decoder
         still needs to be able to handle randomised blocks
         so as to maintain backwards compatibility with
         older versions of bzip2.
      --*/
      bsW(s,1,0);

Okay! So compression no longer uses randomization, but decompression has to support files which were compressed with randomization. Here, bsW(s,1,0) always writes a 0 bit to the file.

However, the decompression code actually reads the blockRandomised bit from the file so that it can see whether it is dealing with an old-format file:

GET_BITS(BZ_X_RANDBIT, s->blockRandomised, 1);

Later in the code, this s->blockRandomised field gets consulted; if the bit is on, the code calls BZ2_rand_update_mask() and friends as appropriate. If one is using files compressed with Bzip2 0.9.5 or later, those randomization functions are not even called.

Talk about preserving compatibility with the past.

Explanation, or building my headcanon

Bzip2's compression starts by running a Burrows-Wheeler Transform on a block of data to compress, which is a wonderful algorithm that I'm trying to fully understand. Part of the BWT involves sorting all the string rotations of the block in question.

Per the comment I cited, really old versions of bzip2 used a randomization helper to make sorting perform well in extreme cases, but not-so-old versions fixed this.

This explains why the decompression struct DState has a blockRandomised bit, but the compression struct EState doesn't need one. The fields that the original macro was pasting into EState were just a vestige from 1999, which is when Bzip2 0.9.5 was released.