Federico's Blog

  1. Bzip2 in Rust - Basic infrastructure and CRC32 computation

    - bzip2, rust

    I have started a little experiment in porting bits of the widely-used bzip2/bzlib to Rust. I hope this can serve to refresh bzip2, which had its last release in 2010 and has been nominally unmaintained for years.

    I hope to make several posts detailing how this port is done. In this post, I'll talk about setting up a Rust infrastructure for bzip2 and my experiments in replacing the C code that does CRC32 computations.

    Super-quick summary of how librsvg was ported to Rust

    • Add the necessary autotools infrastructure to build a Rust sub-library that gets linked into the main public library.

    • Port bit by bit to Rust. Add unit tests as appropriate. Refactor endlessly.

    • MAINTAIN THE PUBLIC API/ABI AT ALL COSTS so callers don't notice that the library is being rewritten under their feet.

    I have no idea how bzip2 works internally, but I do know how to maintain ABIs, so let's get started.

    Bzip2's source tree

    As a very small project that just builds a library and a couple of executables, bzip2 was structured with all the source files directly under a toplevel directory.

    The only tests in there are three reference files that get compressed, then uncompressed, and then compared to the original ones.

    As the rustification proceeds, I'll move the files around to better places. The scheme from librsvg worked well in this respect, so I'll probably be copying many of the techniques and organization from there.

    Deciding what to port first

    I looked a bit at the bzip2 sources, and the code to do CRC32 computations seemed isolated enough from the rest of the code to port easily.

    The CRC32 code was arranged like this. First, a lookup table in crc32table.c:

    UInt32 BZ2_crc32Table[256] = {
       0x00000000L, 0x04c11db7L, 0x09823b6eL, 0x0d4326d9L,
       0x130476dcL, 0x17c56b6bL, 0x1a864db2L, 0x1e475005L,
       ...
    };
    

    And then, three macros in bzlib_private.h which make up all the CRC32 code in the library:

    extern UInt32 BZ2_crc32Table[256];
    
    #define BZ_INITIALISE_CRC(crcVar)              \
    {                                              \
       crcVar = 0xffffffffL;                       \
    }
    
    #define BZ_FINALISE_CRC(crcVar)                \
    {                                              \
       crcVar = ~(crcVar);                         \
    }
    
    #define BZ_UPDATE_CRC(crcVar,cha)              \
    {                                              \
       crcVar = (crcVar << 8) ^                    \
                BZ2_crc32Table[(crcVar >> 24) ^    \
                               ((UChar)cha)];      \
    }
    

    Initially I wanted to just remove this code and replace it with one of the existing Rust crates to do CRC32 computations, but first I needed to know which variant of CRC32 this is.

    Preparing the CRC32 port so it will not break

    I needed to set up tests for the CRC32 code, so that the replacement code would compute exactly the same values as the original.

    I started with a test that computes the CRC32 values of several strings, using a small crc32_buffer() helper built from the macros above, so that I could capture the results and make them part of the test:

    static const UChar buf1[] = "";
    static const UChar buf2[] = " ";
    static const UChar buf3[] = "hello world";
    static const UChar buf4[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, ";
    
    int
    main (void)
    {
        printf ("buf1: %x\n", crc32_buffer(buf1, strlen(buf1)));
        printf ("buf2: %x\n", crc32_buffer(buf2, strlen(buf2)));
        printf ("buf3: %x\n", crc32_buffer(buf3, strlen(buf3)));
        printf ("buf4: %x\n", crc32_buffer(buf4, strlen(buf4)));
        // ...
    }
    

    This computes the CRC32 values of some strings using the original algorithm, and prints the results. Then I could cut&paste those results and turn the printfs into asserts — and that gives me a test.

    int
    main (void)
    {
        assert (crc32_buffer (buf1, strlen (buf1)) == 0x00000000);
        assert (crc32_buffer (buf2, strlen (buf2)) == 0x29d4f6ab);
        assert (crc32_buffer (buf3, strlen (buf3)) == 0x44f71378);
        assert (crc32_buffer (buf4, strlen (buf4)) == 0xd31de6c9);
        // ...
    }
    

    Setting up a Rust infrastructure for bzip2

    Two things made this reasonably easy: bzip2's source tree is very small, and librsvg already has a working Autotools+Rust setup that I could copy more or less directly.

    I.e. "copy and paste from somewhere that I know works well". Wonderful!

    This is the commit that adds a Rust infrastructure for bzip2. It does the following:

    1. Creates a Cargo workspace (a Cargo.toml in the toplevel) with a single member, a bzlib_rust directory where the Rustified parts of the code will live.
    2. Creates bzlib_rust/Cargo.toml and bzlib_rust/src for the Rust sources. These will generate a staticlib, libbzlib_rust.a, which can be linked into the main libbz2.la.
    3. Puts in automake hooks so that make clean, make check, etc. all do what you expect for the Rust part.

    As a side benefit, librsvg's Autotools+Rust infrastructure already handled things like cross-compilation correctly, so I have high hopes that this will be good enough for bzip2.

    Can I use a Rust crate for CRC32?

    There are many Rust crates to do CRC computations. I was hoping especially to be able to use crc32fast, which is SIMD-accelerated.

    I wrote a Rust version of the "CRC me a buffer" test from above to see if crc32fast produced the same values as the C code, and of course it didn't. Eventually, after asking on Mastodon, Kepstin figured out what variant of CRC32 is being used in the original code.

    It turns out that this is directly doable in Rust with the git version of the crc crate, which lets one configure the CRC32 polynomial and the mode of computation. There are many variants of CRC32, and I wasn't fully aware of them: bzip2's variant uses the same IEEE polynomial as the common CRC32, but processes bits in "normal" (non-reflected) order, starts from 0xFFFFFFFF, and inverts the result at the end, which is exactly what the C macros above do.

    The magic incantation is this:

    let mut digest = crc32::Digest::new_custom(crc32::IEEE, !0u32, !0u32, crc::CalcType::Normal);
    

    With that, the Rust test produces the same values as the C code. Yay!
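
    As a sketch, and assuming the git version of the crc crate keeps the 1.x Hasher32 trait with its write() and sum32() methods, the test boils down to something like this, with the expected value taken from the C test above:

    use crc::{crc32, CalcType, Hasher32};
    
    #[test]
    fn crc32_matches_the_c_values() {
        let mut digest = crc32::Digest::new_custom(crc32::IEEE, !0u32, !0u32, CalcType::Normal);
        digest.write(b"hello world");
        assert_eq!(digest.sum32(), 0x44f71378);
    }
    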

    But it can't be that easy

    Bzlib stores its internal state in the EState struct, defined in bzlib_private.h.

    That struct stores several running CRC32 computations, and the state for each one of those is a single UInt32 value. However, I cannot just replace those struct fields with something that comes from Rust, since the C code does not know the size of a crc32::Digest from Rust.

    The normal way to do this (say, like in librsvg) would be to turn UInt32 some_crc into void *some_crc and heap-allocate that on the Rust side, with whatever size it needs.
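
    The usual shape of that pattern is something like the following sketch; the names here are made up for illustration and are not part of bzlib's API:

    pub struct Crc32State {
        // In practice this would hold a crc32::Digest or similar; the C side
        // never sees the layout, only an opaque pointer.
        crc: u32,
    }
    
    #[no_mangle]
    pub extern "C" fn crc32_state_new() -> *mut Crc32State {
        Box::into_raw(Box::new(Crc32State { crc: !0u32 }))
    }
    
    #[no_mangle]
    pub unsafe extern "C" fn crc32_state_free(state: *mut Crc32State) {
        if !state.is_null() {
            drop(Box::from_raw(state));
        }
    }
    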

    However!

    It turns out that bzlib lets the caller supply a custom allocator, instead of always using plain malloc()/free() directly.

    Rust lets one define a global, custom allocator. However, bzlib's concept of a custom allocator includes a bit of context:

    typedef struct {
        // ...
    
        void *(*bzalloc)(void *opaque, int n, int m);
        void (*bzfree)(void *opaque, void *ptr);
        void *opaque;
    } bz_stream;
    

    The caller sets up bzalloc/bzfree callbacks and an optional opaque context for the allocator. However, Rust's GlobalAlloc is set up at compilation time, and we can't pass that context in a good, thread-safe fashion to it.
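
    For illustration, this is roughly what a global allocator looks like in Rust; note that it is a static chosen at compile time, and that alloc()/dealloc() have nowhere to receive bzlib's opaque context:

    use std::alloc::{GlobalAlloc, Layout, System};
    
    struct ForwardingAllocator;
    
    unsafe impl GlobalAlloc for ForwardingAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            // No per-call context here, unlike bzalloc(opaque, n, m).
            System.alloc(layout)
        }
    
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            System.dealloc(ptr, layout)
        }
    }
    
    #[global_allocator]
    static GLOBAL: ForwardingAllocator = ForwardingAllocator;
    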

    Who uses the bzlib custom allocator, anyway?

    If one sets bzalloc/bzfree to NULL, bzlib will use the system's plain malloc()/free() by default. Most software does this.

    I am looking in Debian's codesearch for where bzalloc gets set, hoping that I can figure out if that software really needs a custom allocator, or if they are just dressing up malloc() with logging code or similar (ImageMagick seems to do this; Python seems to have a genuine concern about the Global Interpreter Lock). Debian's codesearch is a fantastic tool!

    The first rustified code

    I cut&pasted the CRC32 lookup table and fixed it up for Rust's syntax, and also ported the CRC32 computation functions. I gave them the same names as the original C ones, and exported them, e.g.

    const TABLE: [u32; 256] = [
       0x00000000, 0x04c11db7, 0x09823b6e, 0x0d4326d9,
       ...
    ];
    
    #[no_mangle]
    pub unsafe extern "C" fn BZ2_update_crc(crc_var: &mut u32, cha: u8) {
        *crc_var = (*crc_var << 8) ^ TABLE[((*crc_var >> 24) ^ u32::from(cha)) as usize];
    }
    

    This is a straight port of the C code. Rust is very strict about integer sizes, and arrays can only be indexed with a usize, not any random integer — hence the explicit conversions.
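
    For example, a buffer-level helper equivalent to the C test's crc32_buffer() (just for illustration; it is not part of bzlib's API) is the three steps composed together, and it reproduces the values captured earlier:

    fn crc32_buffer(buf: &[u8]) -> u32 {
        let mut crc: u32 = 0xffffffff; // BZ_INITIALISE_CRC
    
        for &byte in buf {
            // BZ_UPDATE_CRC
            crc = (crc << 8) ^ TABLE[((crc >> 24) ^ u32::from(byte)) as usize];
        }
    
        !crc // BZ_FINALISE_CRC
    }
    
    #[test]
    fn matches_the_captured_values() {
        assert_eq!(crc32_buffer(b""), 0x00000000);
        assert_eq!(crc32_buffer(b"hello world"), 0x44f71378);
    }
    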

    And with this, and after fixing the linkage, the tests pass!

    First pass at rustifying CRC32: done.

    But that does one byte at a time

    Indeed; the original C code to do CRC32 only handled one byte at a time. If I replace this with a SIMD-enabled Rust crate, it will want to process whole buffers at once. I hope the code in bzlib can be refactored to do that. We'll see!

    How to use an existing Rust crate for this

    I just found out that one does not in fact need to use a complete crc32::Digest to do equivalent computations; one can call crc32::update() by hand and maintain a single u32 state, just like the original UInt32 from the C code.

    So, I may not need to mess around with a custom allocator just yet. Stay tuned.

    In the meantime, I've filed a bug against crc32fast to make it possible to use a custom polynomial and order and still get the benefits of SIMD.

  2. Containing mutability in GObjects

    - gnome, librsvg, refactoring, rust

    Traditionally, GObject implementations in C are mutable: you instantiate a GObject and then change its state via method calls. Sometimes this is expected and desired; a GtkCheckButton widget certainly can change its internal state from pressed to not pressed, for example.

    Other times, objects are mutable while they are being "assembled" or "configured", and only yield a final, immutable result later. This is the case for RsvgHandle from librsvg.

    Please bear with me while I write about the history of the RsvgHandle API and why it ended up with different ways of doing the same thing.

    The traditional RsvgHandle API

    The final purpose of an RsvgHandle is to represent an SVG document loaded in memory. Once it is loaded, the SVG document does not change, as librsvg does not support animation or creating/removing SVG elements; it is a static renderer.

    However, before an RsvgHandle achieves its immutable state, it has to be loaded first. Loading can be done in two ways:

    • The historical/deprecated way, using the rsvg_handle_write() and rsvg_handle_close() APIs. Plenty of code in GNOME used this write/close idiom before GLib got a good abstraction for streams; you can see another example in GdkPixbufLoader. The idea is that applications do this:
    file = open a file...;
    handle = rsvg_handle_new ();
    
    while (file has more data) {
       rsvg_handle_write(handle, a bit of data);
    }
    
    rsvg_handle_close (handle);
    
    // now the handle is fully loaded and immutable
    
    rsvg_handle_render (handle, ...);
    
    • The streaming way, using the rsvg_handle_read_stream_sync() API that appeared once GIO provided a good abstraction for streams. Applications do this instead:
    file = g_file_new_for_path ("/foo/bar.svg");
    stream = g_file_read (file, ...);
    handle = rsvg_handle_new ();
    
    rsvg_handle_read_stream_sync (handle, stream, ...);
    
    // now the handle is fully loaded and immutable
    
    rsvg_handle_render (handle, ...);
    

    A bit of history

    Let's consider a few of RsvgHandle's functions.

    Constructors:

    • rsvg_handle_new()
    • rsvg_handle_new_with_flags()

    Configure the handle for loading:

    • rsvg_handle_set_base_uri()
    • rsvg_handle_set_base_gfile()

    Deprecated loading API:

    • rsvg_handle_write()
    • rsvg_handle_close()

    Streaming API:

    • rsvg_handle_read_stream_sync()

    When librsvg first acquired the concept of an RsvgHandle, it just had rsvg_handle_new() with no arguments. About 9 years later, it got rsvg_handle_new_with_flags() to allow more options, but it took another 2 years to actually add some usable flags — the first one was to configure the parsing limits in the underlying calls to libxml2.

    About 3 years after RsvgHandle appeared, it got rsvg_handle_set_base_uri() to configure the "base URI" against which relative references in the SVG document get resolved. For example, if you are reading /foo/bar.svg and it contains an element like <image xlink:href="smiley.png"/>, then librsvg needs to be able to produce the path /foo/smiley.png, and that is done relative to the base URI. (The base URI is implicit when reading from a specific SVG file, but it needs to be provided when reading from an arbitrary stream that may not even come from a file.)

    Initially RsvgHandle had the write/close APIs, and 8 years later it got the streaming functions once GIO appeared. Eventually the streaming API would be the preferred one, instead of just being a convenience for those brave new apps that started using GIO.

    A summary of librsvg's API may be something like:

    • librsvg gets written initially; it doesn't even have an RsvgHandle, and just provides a single function which takes a FILE * and renders it to a GdkPixbuf.

    • That gets replaced with RsvgHandle, its single rsvg_handle_new() constructor, and the write/close API to feed it data progressively.

    • GIO appears, we get the first widespread streaming APIs in GNOME, and RsvgHandle gets the ability to read from streams.

    • RsvgHandle gets rsvg_handle_new_with_flags() because now apps may want to configure extra stuff for libxml2.

    • When Cairo appears and librsvg is ported to it, RsvgHandle gets an extra flag so that SVGs rendered to PDF can embed image data efficiently.

    It's a convoluted history, but git log -- rsvg.h makes it accessible.

    Where is the mutability?

    An RsvgHandle gets created, with flags or without. It's empty, and doesn't know if it will be given data with the write/close API or with the streaming API. Also, someone may call set_base_uri() on it. So, the handle must remain mutable while it is being populated with data. After that, it can say, "no more changes, I'm done".

    In C, this doesn't even have a name. Everything is mutable by default all the time. This monster was the private data of RsvgHandle before it got ported to Rust:

    struct RsvgHandlePrivate {
        // set during construction
        RsvgHandleFlags flags;
    
        // GObject-ism
        gboolean is_disposed;
    
        // Extra crap for a deprecated API
        RsvgSizeFunc size_func;
        gpointer user_data;
        GDestroyNotify user_data_destroy;
    
        // Data only used while parsing an SVG
        RsvgHandleState state;
        RsvgDefs *defs;
        guint nest_level;
        RsvgNode *currentnode;
        RsvgNode *treebase;
        GHashTable *css_props;
        RsvgSaxHandler *handler;
        int handler_nest;
        GHashTable *entities;
        xmlParserCtxtPtr ctxt;
        GError **error;
        GCancellable *cancellable;
        GInputStream *compressed_input_stream;
    
        // Data only used while rendering
        double dpi_x;
        double dpi_y;
    
        // The famous base URI, set before loading
        gchar *base_uri;
        GFile *base_gfile;
    
        // Some internal stuff
        gboolean in_loop;
        gboolean is_testing;
    };
    

    "Single responsibility principle"? This is a horror show. That RsvgHandlePrivate struct has all of these:

    • Data only settable during construction (flags)
    • Data set after construction, but which may only be set before loading (base URI)
    • Highly mutable data used only during the loading stage: state machines, XML parsers, a stack of XML elements, CSS properties...
    • The DPI (dots per inch) values only used during rendering.
    • Assorted fields used at various stages of the handle's life.

    It took a lot of refactoring to get the code to a point where it was clear that an RsvgHandle in fact has distinct stages during its lifetime, and that some of that data should only live during a particular stage. Before, everything seemed a jumble of fields, used at various unclear points in the code (for the struct listing above, I've grouped related fields together — they were somewhat shuffled in the original code!).

    What would a better separation look like?

    In the master branch, now librsvg has this:

    /// Contains all the interior mutability for a RsvgHandle to be called
    /// from the C API.
    pub struct CHandle {
        dpi: Cell<Dpi>,
        load_flags: Cell<LoadFlags>,
    
        base_url: RefCell<Option<Url>>,
        // needed because the C api returns *const char
        base_url_cstring: RefCell<Option<CString>>,
    
        size_callback: RefCell<SizeCallback>,
        is_testing: Cell<bool>,
        load_state: RefCell<LoadState>,
    }
    

    Internally, that CHandle struct is now the private data of the public RsvgHandle object. Note that all of CHandle's fields are a Cell<> or RefCell<>: in Rust terms, this means that those fields allow for "interior mutability" in the CHandle struct: they can be modified after initialization.
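
    As a tiny illustration of what interior mutability buys here (the Config type below is made up for the example; it is not librsvg code):

    use std::cell::Cell;
    
    struct Config {
        dpi: Cell<f64>,
    }
    
    // Takes &Config, not &mut Config: the struct itself is shared immutably,
    // but the value inside the Cell can still be replaced.
    fn set_dpi(config: &Config, dpi: f64) {
        config.dpi.set(dpi);
    }
    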

    The last field's cell, load_state, contains this type:

    enum LoadState {
        Start,
    
        // Being loaded using the legacy write()/close() API
        Loading { buffer: Vec<u8> },
    
        // Fully loaded, with a Handle to an SVG document
        ClosedOk { handle: Handle },
    
        ClosedError,
    }
    

    A CHandle starts in the Start state, where it doesn't know if it will be loaded with a stream, or with the legacy write/close API.

    If the caller uses the write/close API, the handle moves to the Loading state, which has a buffer where it accumulates the data being fed to it.

    But if the caller uses the stream API, the handle tries to parse an SVG document from the stream, and it moves either to the ClosedOk state, or to the ClosedError state if there is a parse error.

    Correspondingly, when using the write/close API, when the caller finally calls rsvg_handle_close(), the handle creates a stream for the buffer, parses it, and also gets either into the ClosedOk or ClosedError state.
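
    A simplified sketch of the write() side of those transitions (illustrative only, not librsvg's actual code; error handling is omitted):

    impl CHandle {
        fn write(&self, data: &[u8]) {
            let mut state = self.load_state.borrow_mut();
    
            match *state {
                // First write: move from Start to Loading with a fresh buffer.
                LoadState::Start => {
                    *state = LoadState::Loading { buffer: data.to_vec() };
                }
    
                // Subsequent writes: append to the accumulated buffer.
                LoadState::Loading { ref mut buffer } => {
                    buffer.extend_from_slice(data);
                }
    
                // Writing after close is a caller error.
                _ => panic!("handle must not be written to after it is closed"),
            }
        }
    }
    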

    If you look at the variant ClosedOk { handle: Handle }, it contains a fully loaded Handle inside, which right now is just a wrapper around a reference-counted Svg object:

    pub struct Handle {
        svg: Rc<Svg>,
    }
    

    The reason why LoadState::ClosedOk does not contain an Rc<Svg> directly, and instead wraps it with a Handle, is that this is just the first pass at refactoring. Also, Handle contains some API-level logic which I'm not completely sure makes sense as a lower-level Svg object. We'll see.

    Couldn't you move more of CHandle's fields into LoadState?

    Sort of, kind of, but the public API still lets one do things like call rsvg_handle_get_base_uri() after the handle is fully loaded, even though its result will be of little value. So, the fields that hold the base_uri information are kept in the longer-lived CHandle, not in the individual variants of LoadState.

    How does this look from the Rust API?

    CHandle implements the public C API of librsvg. Internally, Handle implements the basic "load from stream", "get the geometry of an SVG element", and "render to a Cairo context" functionality.

    This basic functionality gets exported in a cleaner way through the Rust API, discussed previously. There is no interior mutability in there at all; that API uses a builder pattern to gradually configure an SVG loader, which returns a fully loaded SvgHandle, out of which you can create a CairoRenderer.

    In fact, it may be possible to refactor all of this a bit and implement CHandle directly in terms of the new Rust API: in effect, use CHandle as the "holding space" while the SVG loader gets configured, and later turned into a fully loaded SvgHandle internally.

    Conclusion

    The C version of RsvgHandle's private structure used to have a bunch of fields. Without knowing the code, it was hard to know that they belonged in groups, and that each group corresponded roughly to a stage in the handle's lifetime.

    It took plenty of refactoring to get the fields split up cleanly in librsvg's internals. The process of refactoring RsvgHandle's fields, and ensuring that the various states of a handle are consistent, in fact exposed a few bugs where the state was not being checked appropriately. The public C API remains the same as always, but has better internal checks now.

    GObject APIs tend to allow for a lot of mutability via methods that change the internal state of objects. For RsvgHandle, it was possible to change this into a single CHandle that maintains the mutable data in a contained fashion, and later translates it internally into an immutable Handle that represents a fully-loaded SVG document. This scheme ties in well with the new Rust API for librsvg, which keeps everything immutable after creation.

  3. A Rust API for librsvg

    - gnome, librsvg, rust

    After the librsvg team finished the rustification of librsvg's main library, I wanted to start porting the high-level test suite to Rust. This is mainly to be able to run tests in parallel, which cargo test does automatically in order to reduce test times. However, this meant that librsvg needed a Rust API that would exercise the same code paths as the C entry points.

    At the same time, I wanted the Rust API to make it impossible to misuse the library. From the viewpoint of the C API, an RsvgHandle has different stages:

    • Just initialized
    • Loading
    • Loaded, or in an error state after a failed load
    • Ready to render

    To ensure consistency, the public API checks that you cannot render an RsvgHandle that is not completely loaded yet, or one that resulted in a loading error. But wouldn't it be nice if it were impossible to call the API functions in the wrong order?

    This is exactly what the Rust API does. There is a Loader, to which you give a filename or a stream, and it will return a fully-loaded SvgHandle or an error. Then, you can only create a CairoRenderer if you have an SvgHandle.

    For historical reasons, the C API in librsvg is not perfectly consistent. For example, some functions that can fail return a proper GError, but others just return a gboolean with no further explanation of what went wrong. In contrast, every function in the Rust API that can fail returns a Result, and the error case carries a meaningful error value. In the Rust API there is no "wrong order" in which the various functions and methods can be called; it follows the whole "make invalid states unrepresentable" idea.
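
    In rough pseudo-usage, the flow looks like this; the method names are approximations, since the API was still being shaped at the time:

    // Illustrative only; exact names and signatures were still in flux.
    let handle = Loader::new()
        .read_path("example.svg")?;                   // Result<SvgHandle, LoadingError>
    
    let renderer = CairoRenderer::new(&handle);       // only possible with a loaded SvgHandle
    renderer.render_document(&cairo_context, &viewport)?;
    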

    To implement the Rust API, I had to do some refactoring of the internals that hook to the public entry points. This made me realize that librsvg could be a lot easier to use. The C API has always forced you to call it in this fashion:

    1. Ask the SVG for its dimensions, or how big it is.
    2. Based on that, scale your Cairo context to the size you actually want.
    3. Render the SVG to that context's current transformation matrix.

    But, to start with, (1) gives you inadequate information, because rsvg_handle_get_dimensions() returns a structure with int fields for the width and height. The API is similar to gdk-pixbuf's, in that it always wants to think in whole pixels; however, an SVG is not necessarily integer-sized.

    Then, (2) forces you to calculate some geometry in almost all cases, as most apps want to render SVG content scaled proportionally to a certain size. This is not hard to do, but it's an inconvenience.

    SVG dimensions

    Let's look at (1) again. The question, "how big is the SVG" is a bit meaningless when we consider that SVGs can be scaled to any size; that's the whole point of them!

    When you ask RsvgHandle how big it is, in reality it should look at you and whisper in your ear, "how big do you want it to be?".

    And that's the thing. The HTML/CSS/SVG model is that one embeds content into viewports of a given size. The software is responsible for scaling the content to fit into that viewport.

    In the end, what we want is a rendering function that takes a Cairo context and a Rectangle for a viewport, and that's it. The function should take care of fitting the SVG's contents within that viewport.

    There is now an open bug about exactly this sort of API. In the end, programs should just have to load their SVG handle, and directly ask it to render at whatever size they need, instead of doing the size computations by hand.

    When will this be available?

    I'm in the middle of a rather large refactor to make this viewport concept really work. So far this involves:

    • Defining APIs that take a viewport.

    • Refactoring all the geometry computation to support the semantics of the C API, plus the new with_viewport semantics.

    • Fixing the code that kept track of an internal offset for all temporary images.

    • Refactoring all the code that mucks around with the Cairo context's affine transformation matrix, which is a big mutable mess.

    • Tests, examples, documentation.

    I want to make the Rust API available for the 2.46 release, which is hopefully not too far off. It should be ready for the next GNOME release. In the meantime, you can check out the open bugs for the 2.46.0 milestone. Help is appreciated; the deadline for the first 3.33 tarballs is approximately one month from now!

  4. Rust build scripts vs. Meson

    - meson, rust

    One of the pain points in trying to make the Meson build system work with Rust and Cargo is Cargo's use of build scripts, i.e. the build.rs that many Rust programs use for doing things before the main build. This post is about my exploration of what build.rs does.

    Thanks to Nirbheek Chauhan for his comments and additions to a draft of this article!

    TL;DR: build.rs is pretty ad-hoc and somewhat primitive, when compared to Meson's very nice, high-level patterns for build-time things.

    I have the intuition that giving names to the things that are usually done in build.rs scripts, and creating abstractions for them, can make it easier later to implement those abstractions in terms of Meson. Maybe we can eliminate build.rs in most cases? Maybe Cargo can acquire higher-level concepts that plug well to Meson?

    (That is... I think we can refactor our way out of this mess.)

    What does build.rs do?

    The first paragraph in the documentation for Cargo build scripts tells us this:

    Some packages need to compile third-party non-Rust code, for example C libraries. Other packages need to link to C libraries which can either be located on the system or possibly need to be built from source. Others still need facilities for functionality such as code generation before building (think parser generators).

    That is,

    • Compiling third-party non-Rust code. For example, maybe there is a C sub-library that the Rust crate needs.

    • Link to C libraries... located on the system... or built from source. For example, in gtk-rs, the sys crates link to libgtk-3.so, libcairo.so, etc. and need to find a way to locate those libraries with pkg-config.

    • Code generation. In the C world this could be generating a parser with yacc; in the Rust world there are many utilities to generate code that is later used in your actual program.

    In the next sections I'll look briefly at each of these cases, but in a different order.

    Code generation

    Here is an example, in how librsvg generates code for a couple of things that get autogenerated before compiling the main library:

    • A perfect hash function (PHF) of attributes and CSS property names.
    • A pair of lookup tables for SRGB linearization and un-linearization.

    For example, this is main() in build.rs:

    fn main() {
        generate_phf_of_svg_attributes();
        generate_srgb_tables();
    }
    

    And these are the first few lines of the first function:

    fn generate_phf_of_svg_attributes() {
        let path = Path::new(&env::var("OUT_DIR").unwrap()).join("attributes-codegen.rs");
        let mut file = BufWriter::new(File::create(&path).unwrap());
    
        writeln!(&mut file, "#[repr(C)]").unwrap();
    
        // ... etc
    }
    

    Generate a path like $OUT_DIR/attributes-codegen.rs, create a file with that name, a BufWriter for the file, and start outputting code to it.

    Similarly, the second function:

    fn generate_srgb_tables() {
        let linearize_table = compute_table(linearize);
        let unlinearize_table = compute_table(unlinearize);
    
        let path = Path::new(&env::var("OUT_DIR").unwrap()).join("srgb-codegen.rs");
        let mut file = BufWriter::new(File::create(&path).unwrap());
    
        // ...
    
        print_table(&mut file, "LINEARIZE", &linearize_table);
        print_table(&mut file, "UNLINEARIZE", &unlinearize_table);
    }
    

    Compute two lookup tables, create a file named $OUT_DIR/srgb-codegen.rs, and write the lookup tables to the file.

    Later in the actual librsvg code, the generated files get included into the source code using the include! macro. For example, here is where attributes-codegen.rs gets included:

    // attributes.rs
    
    extern crate phf;  // crate for perfect hash function
    
    // the generated file has the declaration for enum Attribute
    include!(concat!(env!("OUT_DIR"), "/attributes-codegen.rs"));
    

    One thing to note here is that the generated source files (attributes-codegen.rs, srgb-codegen.rs) get put in $OUT_DIR, a directory that Cargo creates for the compilation artifacts. The files do not get put into the original source directories with the rest of the library's code; the idea is to keep the source directories pristine.

    At least in those terms, Meson and Cargo agree that source directories should be kept clean of autogenerated files.

    The Code Generation section of Cargo's documentation agrees:

    In general, build scripts should not modify any files outside of OUT_DIR. It may seem fine on the first blush, but it does cause problems when you use such crate as a dependency, because there's an implicit invariant that sources in .cargo/registry should be immutable. cargo won't allow such scripts when packaging.

    Now, some things to note here:

    • Both the build.rs program and the actual library sources look at the $OUT_DIR environment variable for the location of the generated sources.

    • The Cargo docs say that if the code generator needs input files, it can look for them relative to its current directory, which will be the toplevel of your source package (i.e. the directory with your toplevel Cargo.toml).

    Meson hates this scheme of things. In particular, Meson is very systematic about where it finds input files and sources, and where things like code generators are allowed to place their output.

    The way Meson communicates these paths to code generators is via command-line arguments to "custom targets". Here is an example that is easier to read than the documentation:

    gen = find_program('generator.py')
    
    outputs = custom_target('generated',
      output : ['foo.h', 'foo.c'],
      command : [gen, '@OUTDIR@'],
      ...
    )
    

    This defines a target named 'generated', which will use the generator.py program to output two files, foo.h and foo.c. That Python program will get called with @OUTDIR@ as a command-line argument; in effect, Meson will run /full/path/to/generator.py with @OUTDIR@ replaced by the actual output directory, without any magic passed through environment variables.

    If this looks similar to what Cargo does above with build.rs, it's because it is similar. It's just that Meson gives a name to the concept of generating code at build time (Meson's name for this is a custom target), and provides a mechanism to say which program is the generator, which files it is expected to generate, and how to call the program with appropriate arguments to put files in the right place.

    In contrast, Cargo assumes that all of that information can be inferred from an environment variable.

    In addition, if the custom target takes other files as input (say, so it can call yacc my-grammar.y), the custom_target() command can take an input: argument. This way, Meson can add a dependency on those input files, so that the appropriate things will be rebuilt if the input files change.

    Now, Cargo could very well provide a small utility crate that build scripts could use to figure out all that information. Meson would tell Cargo to use its scheme of things, and pass it down to build scripts via that utility crate. I.e. to have

    // build.rs
    
    extern crate cargo_high_level;
    
    let output = Path::new(cargo_high_level::get_output_path()).join("codegen.rs");
    //                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this, instead of:
    
    let output = Path::new(&env::var("OUT_DIR").unwrap()).join("codegen.rs");
    
    // let the build system know about generated dependencies
    cargo_high_level::add_output(output);
    

    A similar mechanism could be used for the way Meson likes to pass command-line arguments to the programs that deal with custom targets.

    Linking to C libraries on the system

    Some Rust crates need to link to lower-level C libraries that actually do the work. For example, in gtk-rs, there are high-level binding crates called gtk, gdk, cairo, etc. These use low-level crates called gtk-sys, gdk-sys, cairo-sys. Those -sys crates are just direct wrappers on top of the C functions of the respective system libraries: gtk-sys makes almost every function in libgtk-3.so available as a Rust-callable function.
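
    In essence, a -sys crate is a pile of raw extern declarations like the following (only one function shown), which the high-level crate then wraps in safe Rust:

    use std::os::raw::{c_char, c_int};
    
    extern "C" {
        // Raw declaration of the C function from libgtk-3.so; the safe wrapper
        // lives in the high-level gtk crate.
        pub fn gtk_init(argc: *mut c_int, argv: *mut *mut *mut c_char);
    }
    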

    System libraries sometimes live in a well-known part of the filesystem (/usr/lib64, for example); other times, like on Windows and macOS, they could be anywhere. To find that location plus other related metadata (include paths for C header files, library version), many system libraries use pkg-config. At the simplest level, one can run pkg-config on the command line, or from build scripts, to query some things about libraries. For example:

    # what's the system's installed version of GTK?
    $ pkg-config --modversion gtk+-3.0
    3.24.4
    
    # what compiler flags would a C compiler need for GTK?
    $ pkg-config --cflags gtk+-3.0
    -pthread -I/usr/include/gtk-3.0 -I/usr/include/at-spi2-atk/2.0 -I/usr/include/at-spi-2.0 -I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include -I/usr/include/gtk-3.0 -I/usr/include/gio-unix-2.0/ -I/usr/include/libxkbcommon -I/usr/include/wayland -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/harfbuzz -I/usr/include/pango-1.0 -I/usr/include/fribidi -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng16 -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/libmount -I/usr/include/blkid -I/usr/include/uuid -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include
    
    # and which libraries?
    $ pkg-config --libs gtk+-3.0
    -lgtk-3 -lgdk-3 -lpangocairo-1.0 -lpango-1.0 -latk-1.0 -lcairo-gobject -lcairo -lgdk_pixbuf-2.0 -lgio-2.0 -lgobject-2.0 -lglib-2.0
    

    There is a pkg-config crate which build.rs scripts can use to run those queries and communicate the results to Cargo. The example in the crate's documentation asks pkg-config for the foo package, with version at least 1.2.3:

    extern crate pkg_config;
    
    fn main() {
        pkg_config::Config::new().atleast_version("1.2.3").probe("foo").unwrap();
    }
    

    And the documentation says,

    After running pkg-config all appropriate Cargo metadata will be printed on stdout if the search was successful.

    Wait, what?

    Indeed, printing specially-formatted stuff on stdout is how build.rs scripts communicate their findings back to Cargo. To quote Cargo's docs on build scripts (the following refers to the stdout of build.rs):

    Any line that starts with cargo: is interpreted directly by Cargo. This line must be of the form cargo:key=value, like the examples below:

    # specially recognized by Cargo
    cargo:rustc-link-lib=static=foo
    cargo:rustc-link-search=native=/path/to/foo
    cargo:rustc-cfg=foo
    cargo:rustc-env=FOO=bar
    # arbitrary user-defined metadata
    cargo:root=/path/to/foo
    cargo:libdir=/path/to/foo/lib
    cargo:include=/path/to/foo/include
    

    One can use the stdout of a build.rs program to add additional command-line options for rustc, or set environment variables for it, or add library paths, or specific libraries.
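
    For example, a build script that wants rustc to link against a native library just prints the corresponding keys (the paths below are made up):

    // build.rs
    fn main() {
        // Tell Cargo where to look for native libraries, and which one to link.
        println!("cargo:rustc-link-search=native=/usr/lib64");
        println!("cargo:rustc-link-lib=bz2");
    }
    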

    Meson hates this scheme of things. I suppose it would prefer to do the pkg-config calls itself, and then pass that information down to Cargo, you guessed it, via command-line options or something well-defined like that. Again, the example cargo_high_level crate I proposed above could be used to communicate this information from Meson to Cargo scripts. Meson also doesn't like this because it would prefer to know about pkg-config-based libraries in a declarative fashion, without having to run a random script like build.rs.

    Building C code from Rust

    Finally, some Rust crates build a bit of C code and then link that into the compiled Rust code. I have no experience with that, but the respective build scripts generally use the cc crate to call a C compiler and pass options to it conveniently. I suppose Meson would prefer to do this instead, or at least to have a high-level way of passing down information to Cargo.
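
    A typical build script of that kind looks roughly like this (the file name is hypothetical), using the cc crate to compile the C code into a static archive that Cargo then links in:

    // build.rs
    fn main() {
        cc::Build::new()
            .file("src/shim.c")   // hypothetical C source shipped with the crate
            .compile("shim");     // produces libshim.a and emits the cargo: lines to link it
    }
    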

    In effect, Meson has to be in charge of picking the C compiler. Having the thing-to-be-built pick the compiler on its own has caused big problems in the past: GObject-Introspection made the same mistake years ago when it decided to use distutils to detect the C compiler; gtk-doc did as well. When those tools are used, we still run into problems with cross-compilation, and on systems that have more than one C compiler.

    Snarky comments about the Unix philosophy

    If part of the Unix philosophy is that shit can be glued together with environment variables and stringly-typed stdout... it's a pretty bad philosophy. All the cases above boil down to having a well-defined, more or less strongly-typed way to pass information between programs, instead of shaking the proverbial tree of the filesystem and the environment and seeing if something usable falls down.

    Would we really have to modify all build.rs scripts for this?

    Probably. Why not? Meson already has a lot of very well-structured knowledge of how to deal with multi-platform compilation and installation. Re-creating this knowledge in ad-hoc ways in build.rs is not very pleasant or maintainable.

    Related work

  5. Who wrote librsvg?

    - gnome, librsvg

    Authors by lines of code, each year:

    [Chart: librsvg authors by lines of code, per year]

    Authors by percentage of lines of code, each year:

    [Chart: librsvg authors by percentage of lines of code, per year]

    Which lines of code remain each year?

    [Chart: lines of code that remain each year]

    The shitty thing about a gradual rewrite is that a few people end up "owning" all the lines of source code. Hopefully this post is a little acknowledgment of the people that made librsvg possible.

    The charts are made with the incredible tool git-of-theseus — thanks to @norwin@mastodon.art for digging it up! Its README also points to a Hercules plotter with awesome graphs. You know, for if you needed something to keep your computer busy during the weekend.

  6. Librsvg's GObject boilerplate is in Rust now

    - gnome, librsvg, rust

    The other day I wrote about how most of librsvg's library code is in Rust now.

    Today I finished porting the GObject boilerplate for the main RsvgHandle object into Rust. This means that the C code no longer calls things like g_type_register_static(), nor implements rsvg_handle_class_init() and such; all those are in Rust now. How is this done?

    The life-changing magic of glib::subclass

    Sebastian Dröge has been working for many months on refining utilities to make it possible to subclass GObjects in Rust, with little or no unsafe code. This subclass module is now part of glib-rs, the Rust bindings to GLib.

    Librsvg now uses the subclassing functionality in glib-rs, which takes care of some things automatically:

    • Registering your GObject types at runtime.
    • Creating safe traits on which you can implement class_init, instance_init, set_property, get_property, and all the usual GObject paraphernalia.

    Check this out:

    use glib::subclass::prelude::*;
    
    impl ObjectSubclass for Handle {
        const NAME: &'static str = "RsvgHandle";
    
        type ParentType = glib::Object;
    
        type Instance = RsvgHandle;
        type Class = RsvgHandleClass;
    
        glib_object_subclass!();
    
        fn class_init(klass: &mut RsvgHandleClass) {
            klass.install_properties(&PROPERTIES);
        }
    
        fn new() -> Self {
            Handle::new()
        }
    }
    

    In the impl line, Handle is librsvg's internals object — what used to be RsvgHandlePrivate in the C code.

    The following lines say this:

    • const NAME: &'static str = "RsvgHandle"; - the name of the type, for GType's perusal.

    • type ParentType = glib::Object; - Parent class.

    • type Instance, type Class - Structs with #[repr(C)], equivalent to GObject's class and instance structs.

    • glib_object_subclass!(); - All the boilerplate happens here automatically.

    • fn class_init - Should be familiar to anyone who implements GObjects!

    And then, a couple of the property declarations:

    static PROPERTIES: [subclass::Property; 11] = [
        subclass::Property("flags", |name| {
            ParamSpec::flags(
                name,
                "Flags",
                "Loading flags",
                HandleFlags::static_type(),
                0,
                ParamFlags::READWRITE | ParamFlags::CONSTRUCT_ONLY,
            )
        }),
        subclass::Property("dpi-x", |name| {
            ParamSpec::double(
                name,
                "Horizontal DPI",
                "Horizontal resolution in dots per inch",
                0.0,
                f64::MAX,
                0.0,
                ParamFlags::READWRITE | ParamFlags::CONSTRUCT,
            )
        }),
        // ... etcetera
    ];
    

    This is quite similar to the way C code usually registers properties for new GObject subclasses.

    The moment at which a new GObject subclass gets registered against the GType system is in the foo_get_type() call. This is the C code in librsvg for that:

    extern GType rsvg_handle_rust_get_type (void);
    
    GType
    rsvg_handle_get_type (void)
    {
        return rsvg_handle_rust_get_type ();
    }
    

    And the Rust function that actually implements this:

    #[no_mangle]
    pub unsafe extern "C" fn rsvg_handle_rust_get_type() -> glib_sys::GType {
        Handle::get_type().to_glib()
    }
    

    Here, Handle::get_type() gets implemented automatically by Sebastian's subclass traits. It gets things like the type name and the parent class from the impl ObjectSubclass for Handle we saw above, and calls g_type_register_static() internally.

    I can confirm now that implementing GObjects in Rust in this way, and exposing them to C, really works and is actually quite pleasant to do. You can look at librsvg's Rust code for GObject here.

    Further work

    There is some auto-generated C code to register librsvg's error enum and a flags type against GType; I'll move those to Rust over the next few days.

    Then, I think I'll try to actually remove all of the library's entry points from the C code and implement them in Rust. Right now each C function is really just a single call to a Rust function, so this should be trivial-ish to do.

    I'm waiting for a glib-rs release, the first one that will have the glib::subclass code in it, before merging all of the above into librsvg's master branch.

    A new Rust API for librsvg?

    Finally, this got me thinking about what to do about the Rust bindings to librsvg itself. The rsvg crate uses the gtk-rs machinery to generate the binding: it reads the GObject Introspection data from Rsvg.gir and generates a Rust binding for it.

    However, the resulting API is mostly identical to the C API. There is an rsvg::Handle with the same methods as the ones from C's RsvgHandle... and that API is not particularly Rusty.

    At some point I had an unfinished branch to merge rsvg-rs into librsvg. The intention was that librsvg's build procedure would first build librsvg.so itself, then generate Rsvg.gir as usual, and then generate rsvg-rs from that. But I got tired of fucking with Autotools, and didn't finish integrating the projects.

    Rsvg-rs is an okay Rust API for using librsvg. It still works perfectly well from the standalone crate. However, now that all the functionality of librsvg is in Rust, I would like to take this opportunity to experiment with a better API for loading and rendering SVGs from Rust. This may make it more clear how to refactor the toplevel of the library. Maybe the librsvg project can provide its own Rust crate for public consumption, in addition to the usual librsvg.so and Rsvg.gir which need to remain with a stable API and ABI.

  7. Librsvg is almost rustified now

    - gnome, librsvg, rust

    Since a few days ago, librsvg's library implementation is almost 100% Rust code. Paolo Borelli's and Carlos Martín Nieto's latest commits made it possible.

    What does "almost 100% Rust code" mean here?

    • The C code no longer has struct fields that refer to the library's real work. The only field in RsvgHandlePrivate is an opaque pointer to a Rust-side structure. All the rest of the library's data lives in Rust structs.

    • The public API is implemented in C, but its functions are just stubs that immediately call into Rust. For example:

    gboolean
    rsvg_handle_render_cairo_sub (RsvgHandle * handle, cairo_t * cr, const char *id)
    {
        g_return_val_if_fail (RSVG_IS_HANDLE (handle), FALSE);
        g_return_val_if_fail (cr != NULL, FALSE);
    
        return rsvg_handle_rust_render_cairo_sub (handle, cr, id);
    }
    
    • The GObject boilerplate and supporting code is still in C: rsvg_handle_class_init and set_property and friends.

    • All the high-level tests are still done in C.

    • The gdk-pixbuf loader for SVG files is done in C.

    Someone posted a chart on Reddit about the rustification of librsvg, comparing lines of code in each language vs. time.

    Rustifying the remaining C code

    There is only a handful of very small functions from the public API still implemented in C, and I am converting them one by one to Rust. These are just helper functions built on top of other public API that does the real work.

    Converting the gdk-pixbuf loader to Rust seems to amount to writing a little glue code for the loadable module; the actual loading is just a couple of calls to librsvg's API.

    Rsvg-rs in rsvg?

    Converting the tests to Rust... ideally this would use the rsvg-rs bindings, which are what I already use for rsvg-bench, a benchmarking program for librsvg.

    I have an unfinished branch to merge the rsvg-rs repository into librsvg's own repository. This is because...

    1. Librsvg builds its library, librsvg.so
    2. Gobject-introspection runs on librsvg.so and the source code, and produces librsvg.gir
    3. Rsvg-rs's build system calls gir on librsvg.gir to generate the Rust binding's code.

    As you can imagine, doing all of this with Autotools is... rather convoluted. It gives me a lot of anxiety to think that there is also an unfinished branch to port the build system to Meson, where probably doing the .so→.gir→rs chain would be easier, but who knows. Help in this area is much appreciated!

    An alternative?

    Rustified tests could, of course, call the C API of librsvg by hand, in unsafe code. This may not be idiomatic, but sounds like it could be done relatively quickly.
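
    Such a test would look roughly like this, with hand-written extern declarations (and the handle leaked for brevity):

    use std::os::raw::c_void;
    
    extern "C" {
        fn rsvg_handle_new() -> *mut c_void;
    }
    
    #[test]
    fn can_create_a_handle() {
        unsafe {
            let handle = rsvg_handle_new();
            assert!(!handle.is_null());
        }
    }
    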

    Future work

    There are two options to get rid of all the C code in the library, and just leave C header files for public consumption:

    1. Do the GObject implementation in Rust, using Sebastian Dröge's work from GStreamer to do this easily.

    2. Work on making gnome-class powerful enough to implement the librsvg API directly, and in an ABI-compatible fashion to what there is right now.

    The second case will probably build upon the first one, since one of my plans for gnome-class is to make it generate code that uses Sebastian's, instead of generating all the GObject boilerplate by hand.

  8. In support of Coraline Ada Ehmke

    - code-of-conduct

    Last night, the linux.org DNS was hijacked and redirected to a page that doxed Coraline. She is doing extremely valuable work with the Contributor Covenant code of conduct, which many free software projects have adopted already.

    Coraline has been working for years in making free software, and computer technology circles in general, a welcome place for underrepresented groups.

    I hope Coraline stays safe and strong. You can support her directly on Patreon.

  9. My GUADEC 2018 presentation

    - gnome, librsvg, rust, talks

    I just realized that I forgot to publish my presentation from this year's GUADEC. Sorry, here it is!

    Patterns of refactoring C to Rust - link to PDF

    You can also get the ODP file for the presentation. This is released under a CC-BY-SA license.

    This is the video of the presentation.

    Update Dec/06: Keen readers spotted an incorrect use of opaque pointers; I've updated the example code in the presentation to match Jordan's fix with the recommended usage. That merge request has an interesting conversation on FFI esoterica, too.

  10. Refactoring allowed URLs in librsvg

    - gnome, librsvg, rust

    While in the middle of converting librsvg's code that processes XML from C to Rust, I went into a digression that has to do with the way librsvg decides which files are allowed to be referenced from within an SVG.

    Resource references in SVG

    SVG files can reference other files, i.e. they are not self-contained. For example, there can be an element like <image xlink:href="foo.png">, or one can request that a sub-element of another SVG be included with <use xlink:href="secondary.svg#foo">. Finally, there is the xi:include mechanism to include chunks of text or XML into another XML file.

    Since librsvg is sometimes used to render untrusted files that come from the internet, it needs to be careful not to let those files reference any random resource on the filesystem. We don't want something like <text><xi:include href="/etc/passwd" parse="text"/></text>, or anything equally nefarious that would exfiltrate an arbitrary file into the rendered output.

    We also want to catch malicious SVGs that try to "phone home" by referencing a network resource like <image xlink:href="http://evil.com/pingback.jpg">.

    So, librsvg is careful to have a single place where it can load secondary resources, and first it validates the resource's URL to see if it is allowed.

    The actual validation rules are not very important for this discussion; they are something like "no absolute URLs allowed" (so you can't request /etc/passwd), and "only siblings or (grand)children of siblings allowed" (so foo.svg can request bar.svg and subdir/bar.svg, but not ../../bar.svg).

    The code

    There was a central function rsvg_io_acquire_stream() which took a URL as a string. The code assumed that the URL had been validated first with a function called allow_load(url). While the C code's structure more or less guaranteed that every place which acquired a stream went through allow_load() first, nothing enforced it; in Rust, the code can be structured so that acquiring a disallowed URL is simply impossible.

    Before:

    pub fn allow_load(url: &str) -> bool;
    
    pub fn acquire_stream(url: &str, ...) -> Result<gio::InputStream, glib::Error>;
    
    pub fn rsvg_acquire_stream(url: &str, ...) -> Result<gio::InputStream, LoadingError> {
        if allow_load(url) {
            Ok(acquire_stream(url, ...)?)
        } else {
            Err(LoadingError::NotAllowed)
        }
    }
    

    The refactored code now has an AllowedUrl type that encapsulates a URL, plus the promise that it has gone through these steps:

    • The URL has been run through a URL well-formedness parser.
    • The resource is allowed to be loaded following librsvg's rules.

    pub struct AllowedUrl(Url);  // from the Url parsing crate
    
    impl AllowedUrl {
        pub fn from_href(href: &str) -> Result<AllowedUrl, ...> {
            let parsed = Url::parse(href)?; // may return LoadingError::InvalidUrl
    
            if allow_load(parsed) {
                Ok(AllowedUrl(parsed))
            } else {
                Err(LoadingError::NotAllowed)
            }
        }
    }
    
    // new prototype
    pub fn acquire_stream(url: &AllowedUrl, ...) -> Result<gio::InputStream, glib::Error>;
    

    This forces callers to validate the URLs as soon as possible, right after they get them from the SVG file. Now it is not possible to request a stream unless the URL has been validated first.
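
    A caller now looks something like this (a hypothetical helper; the extra arguments that the real acquire_stream() takes are omitted):

    fn acquire_image(href: &str) -> Result<gio::InputStream, LoadingError> {
        let url = AllowedUrl::from_href(href)?;   // validation happens exactly once, here
        Ok(acquire_stream(&url)?)                 // cannot be called with an unvalidated string
    }
    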

    Plain URIs vs. fragment identifiers

    Some of the elements in SVG that reference other data require full files:

    <image xlink:href="foo.png" ...>      <!-- no fragments allowed -->
    

    And some others, that reference particular elements in secondary SVGs, require a fragment ID:

    <use xlink:href="icons.svg#app_name" ...>   <!-- fragment id required -->
    

    And finally, the feImage element, used to paste an image as part of a filter effects pipeline, allows either:

    <!-- will use that image -->
    <feImage xlink:href="foo.png" ...>
    
    <!-- will render just this element from an SVG and use it as an image -->
    <feImage xlink:href="foo.svg#element">
    

    So, I introduced a general Href parser:

    pub enum Href {
        PlainUri(String),
        WithFragment(Fragment),
    }
    
    /// Optional URI, mandatory fragment id
    pub struct Fragment(Option<String>, String);
    

    The parts of the code that absolutely require a fragment id now take a Fragment. Parts which require a PlainUri can unwrap that case.
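
    A sketch of such a parser (simplified; librsvg's real one does more validation, such as rejecting an empty fragment id):

    pub fn parse_href(href: &str) -> Href {
        match href.find('#') {
            None => Href::PlainUri(href.to_string()),
    
            Some(pos) => {
                let uri = &href[..pos];
                let fragment = &href[pos + 1..];
    
                let uri = if uri.is_empty() { None } else { Some(uri.to_string()) };
                Href::WithFragment(Fragment(uri, fragment.to_string()))
            }
        }
    }
    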

    The next step is making those structs contain an AllowedUrl directly, instead of just strings, so that for callers, obtaining a fully validated name is a one-step operation.

    In general, the code is moving towards a scheme where all file I/O is done at loading time. Right now, some of those external references get resolved at rendering time, which is somewhat awkward (for example, at rendering time the caller has no chance to use a GCancellable to cancel loading). This refactoring to do early validation is leaving the code in a very nice state.
