Federico's Blog

  1. Correctness in Rust: building strings

    - rust

    Rust tries to follow the "make illegal states unrepresentable" mantra in several ways. In this post I'll show several things related to the process of building strings, from bytes in memory, or from a file, or from char * things passed from C.

    Strings in Rust

    The easiest way to build a string is to do it directly at compile time:

    let my_string = "Hello, world!";
    

In Rust, strings are UTF-8. Here, the compiler checks that our string literal is valid UTF-8. If we try to be sneaky and insert an invalid character...

    let my_string = "Hello \xf0";
    

    We get a compiler error:

    error: this form of character escape may only be used with characters in the range [\x00-\x7f]
     --> foo.rs:2:30
      |
    2 |     let my_string = "Hello \xf0";
      |                              ^^
    

    Rust strings know their length, unlike C strings. They can contain a nul character in the middle, because they don't need a nul terminator at the end.

    let my_string = "Hello \x00 zero";
    println!("{}", my_string);
    

    The output is what you expect:

    $ ./foo | hexdump -C
    00000000  48 65 6c 6c 6f 20 00 20  7a 65 72 6f 0a           |Hello . zero.|
    0000000d                    ^ note the nul char here
    $
    

    So, to summarize, in Rust:

    • Strings are encoded in UTF-8
    • Strings know their length
    • Strings can have nul chars in the middle

    This is a bit different from C:

    • Strings don't exist!

    Okay, just kidding. In C:

    • A lot of software has standardized on UTF-8.
    • Strings don't know their length - a char * is a raw pointer to the beginning of the string.
    • Strings conventionally have a nul terminator, that is, a zero byte that marks the end of the string. Therefore, you can't have nul characters in the middle of strings.

    Building a string from bytes

    Let's say you have an array of bytes and want to make a string from them. Rust won't let you just cast the array, like C would. First you need to do UTF-8 validation. For example:

 1  fn convert_and_print(bytes: Vec<u8>) {
 2      let result = String::from_utf8(bytes);
 3      match result {
 4          Ok(string) => println!("{}", string),
 5          Err(e) => println!("{:?}", e)
 6      }
 7  }
 8  
 9  fn main() {
10      convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
11      convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
12  }
    

    In lines 10 and 11, we call convert_and_print() with different arrays of bytes; the first one is valid UTF-8, and the second one isn't.

    Line 2 calls String::from_utf8(), which returns a Result, i.e. something with a success value or an error. In lines 3-5 we unpack this Result. If it's Ok, we print the converted string, which has been validated for UTF-8. Otherwise, we print the debug representation of the error.

    The program prints the following:

    $ ~/foo
    Hello
    FromUtf8Error { bytes: [72, 101, 240, 108, 108, 111], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }
    

Here, in the error case, the Utf8Error tells us that the bytes are valid UTF-8 up to index 2; that is the first problematic index. We also get extra information that lets the program know whether the problematic sequence was incomplete because it got truncated at the end of the byte array, or whether it is a complete, invalid sequence in the middle.
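You can pick those details out of the error programmatically, via utf8_error() on the FromUtf8Error. Here is a little sketch (describe_error is my own helper name, not anything from std):

```rust
// A sketch of inspecting the error details programmatically; the byte
// array is the same one as in the example above.
fn describe_error(bytes: Vec<u8>) -> Option<(usize, Option<usize>)> {
    match String::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => {
            let utf8_error = e.utf8_error();
            Some((utf8_error.valid_up_to(), utf8_error.error_len()))
        }
    }
}

fn main() {
    // 0xf0 starts a four-byte sequence, but 0x6c cannot continue it, so
    // the invalid sequence is complete and one byte long.
    let details = describe_error(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
    println!("{:?}", details); // Some((2, Some(1)))
}
```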

    And for a "just make this printable, pls" API? We can use String::from_utf8_lossy(), which replaces invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER:

    fn convert_and_print(bytes: Vec<u8>) {
        let string = String::from_utf8_lossy(&bytes);
        println!("{}", string);
    }
    
    fn main() {
        convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
        convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
    }
    

    This prints the following:

    $ ~/foo
    Hello
    He�llo
    

    Reading from files into strings

    Now, let's assume you want to read chunks of a file and put them into strings. Let's go from the low-level parts up to the high level "read a line" API.

    Single bytes and single UTF-8 characters

    When you open a File, you get an object that implements the Read trait. In addition to the usual "read me some bytes" method, it can also give you back an iterator over bytes, or an iterator over UTF-8 characters.

    The Read.bytes() method gives you back a Bytes iterator, whose next() method returns Result<u8, io::Error>. When you ask the iterator for its next item, that Result means you'll get a byte out of it successfully, or an I/O error.
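As a quick sketch of the bytes() iterator in action, here a Cursor over an in-memory vector stands in for a file (count_bytes is my own helper, not part of std):

```rust
use std::io::{Cursor, Read};

// Count the bytes coming out of any Read implementation.
fn count_bytes<R: Read>(reader: R) -> std::io::Result<usize> {
    let mut count = 0;
    for byte in reader.bytes() {
        let _byte: u8 = byte?; // each item is a Result<u8, io::Error>
        count += 1;
    }
    Ok(count)
}

fn main() {
    let reader = Cursor::new(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
    println!("{}", count_bytes(reader).unwrap()); // 5
}
```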

In contrast, the Read.chars() method gives you back a Chars iterator, whose next() method returns Result<char, CharsError> instead of using io::Error. This extended CharsError has a NotUtf8 case, which you get back when next() tries to read the next UTF-8 sequence from the file and the file has invalid data; it also has a case for normal I/O errors.

    Reading lines

    While you could build a UTF-8 string one character at a time, there are more efficient ways to do it.

    You can create a BufReader, a buffered reader, out of anything that implements the Read trait. BufReader has a convenient read_line() method, to which you pass a mutable String and it returns a Result<usize, io::Error> with either the number of bytes read, or an error.

    That method is declared in the BufRead trait, which BufReader implements. Why the separation? Because other concrete structs also implement BufRead, such as Cursor — a nice wrapper that lets you use a vector of bytes like an I/O Read or Write implementation, similar to GMemoryInputStream.

    If you prefer an iterator rather than the read_line() function, BufRead also gives you a lines() method, which gives you back a Lines iterator.

In both cases, whether you use the read_line() method or the Lines iterator, the error you get back can be of kind ErrorKind::InvalidData, which indicates that there was an invalid UTF-8 sequence in the line to be read. It can also be a normal I/O error, of course.
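Here is a small sketch of the lines() iterator over an in-memory "file" (collect_lines is my own helper name):

```rust
use std::io::{BufRead, BufReader, Cursor};

// Collect the lines from anything that implements Read; a Cursor over an
// in-memory buffer stands in for a file.
fn collect_lines(bytes: Vec<u8>) -> Vec<String> {
    let reader = BufReader::new(Cursor::new(bytes));
    reader
        .lines()
        .map(|line| line.expect("I/O error or invalid UTF-8"))
        .collect()
}

fn main() {
    println!("{:?}", collect_lines(b"first\nsecond\nthird".to_vec()));

    // An invalid UTF-8 byte in a line makes that item an Err instead:
    let mut lines = BufReader::new(Cursor::new(vec![0xf0u8, 0x0a])).lines();
    println!("{:?}", lines.next());
}
```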

    Summary so far

    There is no way to build a String, or a &str slice, from invalid UTF-8 data. All the methods that let you turn bytes into string-like things perform validation, and return a Result to let you know if your bytes validated correctly.

    The exceptions are in the unsafe methods, like String::from_utf8_unchecked(). You should really only use them if you are absolutely sure that your bytes were validated as UTF-8 beforehand.
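To make that concrete, here is a sketch of the only situation where the unchecked variant is justified: the bytes were just validated one line earlier (prevalidated_to_string is my own name for the helper):

```rust
// Sketch: the unsafe conversion is fine only because we validated the
// same bytes ourselves immediately beforehand.
fn prevalidated_to_string(bytes: Vec<u8>) -> Option<String> {
    if std::str::from_utf8(&bytes).is_ok() {
        // Safe: the validation just happened above.
        Some(unsafe { String::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}

fn main() {
    println!("{:?}", prevalidated_to_string(vec![0x48, 0x69])); // Some("Hi")
    println!("{:?}", prevalidated_to_string(vec![0xf0]));       // None
}
```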

    There is no way to bring in data from a file (or anything file-like, that implements the Read trait) and turn it into a String without going through functions that do UTF-8 validation. There is not an unsafe "read a line" API without validation — you would have to build one yourself, but the I/O hit is probably going to be slower than validating data in memory, anyway, so you may as well validate.

    C strings and Rust

For unfortunate historical reasons, C flings around char * to mean different things. In the context of GLib, it can mean

    • A valid, nul-terminated UTF-8 sequence of bytes (a "normal string")
    • A nul-terminated file path, which has no meaningful encoding
    • A nul-terminated sequence of bytes, not validated as UTF-8.

    What a particular char * means depends on which API you got it from.

    Bringing a string from C to Rust

    From Rust's viewpoint, getting a raw char * from C (a "*const c_char" in Rust parlance) means that it gets a pointer to a buffer of unknown length.

    Now, that may not be entirely accurate:

    • You may indeed only have a pointer to a buffer of unknown length
    • You may have a pointer to a buffer, and also know its length (i.e. the offset at which the nul terminator is)

    The Rust standard library provides a CStr object, which means, "I have a pointer to an array of bytes, and I know its length, and I know the last byte is a nul".

    CStr provides an unsafe from_ptr() constructor which takes a raw pointer, and walks the memory to which it points until it finds a nul byte. You must give it a valid pointer, and you had better guarantee that there is a nul terminator, or CStr will walk until the end of your process' address space looking for one.

    Alternatively, if you know the length of your byte array, and you know that it has a nul byte at the end, you can call CStr::from_bytes_with_nul(). You pass it a &[u8] slice; the function will check that a) the last byte in that slice is indeed a nul, and b) there are no nul bytes in the middle.

    The unsafe version of this last function is unsafe CStr::from_bytes_with_nul_unchecked(): it also takes an &[u8] slice, but you must guarantee that the last byte is a nul and that there are no nul bytes in the middle.
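A quick sketch of what the checked constructor accepts and rejects:

```rust
use std::ffi::CStr;

fn main() {
    // Last byte is nul, no interior nuls: Ok.
    assert!(CStr::from_bytes_with_nul(b"hello\0").is_ok());

    // No nul terminator at all: Err.
    assert!(CStr::from_bytes_with_nul(b"hello").is_err());

    // Nul byte in the middle: Err.
    assert!(CStr::from_bytes_with_nul(b"hel\0lo\0").is_err());

    println!("all checks passed");
}
```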

I really like that the Rust documentation tells you when functions are not "instantaneous" and must instead walk arrays, e.g. to do validation or to look for the nul terminator as above.

    Turning a CStr into a string-like

    Now, the above indicates that a CStr is a nul-terminated array of bytes. We have no idea what the bytes inside look like; we just know that they don't contain any other nul bytes.

    There is a CStr::to_str() method, which returns a Result<&str, Utf8Error>. It performs UTF-8 validation on the array of bytes. If the array is valid, the function just returns a slice of the validated bytes minus the nul terminator (i.e. just what you expect for a Rust string slice). Otherwise, it returns an Utf8Error with the details like we discussed before.

    There is also CStr::to_string_lossy() which does the replacement of invalid UTF-8 sequences like we discussed before.
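Both conversions in one sketch, with a valid and an invalid CStr:

```rust
use std::ffi::CStr;

fn main() {
    // Valid UTF-8 inside the CStr: to_str() hands back a &str without
    // the nul terminator.
    let good = CStr::from_bytes_with_nul(b"Hello\0").unwrap();
    assert_eq!(good.to_str().unwrap(), "Hello");

    // The 0xf0 byte makes this invalid UTF-8: to_str() fails, and the
    // lossy conversion substitutes U+FFFD.
    let bad = CStr::from_bytes_with_nul(b"He\xf0llo\0").unwrap();
    assert!(bad.to_str().is_err());
    assert_eq!(bad.to_string_lossy(), "He\u{fffd}llo");

    println!("done");
}
```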

    Conclusion

    Strings in Rust are UTF-8 encoded, they know their length, and they can have nul bytes in the middle.

    To build a string from raw bytes, you must go through functions that do UTF-8 validation and tell you if it failed. There are unsafe functions that let you skip validation, but then of course you are on your own.

    The low-level functions which read data from files operate on bytes. On top of those, there are convenience functions to read validated UTF-8 characters, lines, etc. All of these tell you when there was invalid UTF-8 or an I/O error.

    Rust lets you wrap a raw char * that you got from C into something that can later be validated and turned into a string. Anything that manipulates a raw pointer is unsafe; this includes the "wrap me this pointer into a C string abstraction" API, and the "build me an array of bytes from this raw pointer" API. Later, you can validate those as UTF-8 and build real Rust strings — or know if the validation failed.

    Rust builds these little "corridors" through the API so that illegal states are unrepresentable.

  2. GUADEC 2017 presentation

    - gnome, guadec, librsvg, rust, talks

    During GUADEC this year I gave a presentation called Replacing C library code with Rust: what I learned with librsvg. This is the PDF file; be sure to scroll past the full-page presentation pages until you reach the speaker's notes, especially for the code sections!

    Replacing C library code with Rust - link to PDF

    You can also get the ODP file for the presentation. This is released under a CC-BY-SA license.

    For the presentation, my daughter Luciana made some drawings of Ferris, the Rust mascot, also released under the same license:

    Ferris says hi Ferris busy at work Ferris makes a mess Ferris presents her work

  3. Surviving a rust-cssparser API break

    - gnome, librsvg, rust

    Yesterday I looked into updating librsvg's Rust dependencies. There have been some API breaks (!!!) in the unstable libraries that it uses since the last time I locked them. This post is about an interesting case of API breakage.

    rust-cssparser is the crate that Servo uses for parsing CSS. Well, more like tokenizing CSS: you give it a string, it gives you back tokens, and you are supposed to compose CSS selector information or other CSS values from the tokens.

    Librsvg uses rust-cssparser now for most of the micro-languages in SVG's attribute values, instead of its old, fragile C parsers. I hope to be able to use it in conjunction with Servo's rust-selectors crate to fully parse CSS data and replace libcroco.

    A few months ago, rust-cssparser's API looked more or less like the following. This is the old representation of a Token:

    pub enum Token<'a> {
        // an identifier
        Ident(Cow<'a, str>),
    
        // a plain number
        Number(NumericValue),
    
        // a percentage value normalized to [0.0, 1.0]
        Percentage(PercentageValue),
    
        WhiteSpace(&'a str),
        Comma,
    
        ...
    }
    

    That is, a Token can be an Identifier with a string name, or a Number, a Percentage, whitespace, a comma, and many others.

    On top of that is the old API for a Parser, which you construct with a string and then it gives you back tokens:

    impl<'i> Parser<'i> {
    pub fn new(input: &'i str) -> Parser<'i> { ... }
    
        pub fn next(&mut self) -> Result<Token<'i>, ()> { ... }
    
        ...
    }
    

This means the following. You create the parser out of a string slice with new(). Calling next() then gives you a Result: either a Token on success, or an empty error value. The parser uses a lifetime 'i tied to the string from which it is constructed: Tokens that contain identifiers, for example, can return sub-string slices that come from the original string, so the parser has to be marked with a lifetime to ensure it does not outlive its underlying string.

A few commits later, rust-cssparser got changed to return detailed error values, so that instead of () you get a BasicParseError with sub-cases like UnexpectedToken or EndOfInput.

After the changes to the error values for results, I didn't pay much attention to rust-cssparser for a while. Yesterday I wanted to update librsvg to use the newest rust-cssparser, and had some interesting problems.

    First, Parser::new() was changed from taking just a &str slice to taking a ParserInput struct. This is an implementation detail which lets the parser cache the last token it saw. Not a big deal:

    // instead of constructing a parser like
    let mut parser = Parser::new (my_string);
    
    // you now construct it like
    let mut input = ParserInput::new (my_string);
    let mut parser = Parser::new (&mut input);
    

    I am not completely sure why this is exposed to the public API, since Rust won't allow you to have two mutable references to a ParserInput, and the only consumer of a (mutable) ParserInput is the Parser, anyway.

    However, the parser.next() function changed:

    // old version
    pub fn next(&mut self) -> Result<Token<'i>, ()> { ... }
    
    // new version
    pub fn next(&mut self) -> Result<&Token<'i>, BasicParseError<'i>> {... }
    // note this bad boy here -------^
    

    The successful Result from next() is now a reference to a Token, not a plain Token value which you now own. The parser is giving you a borrowed reference to its internally-cached token.

    My parsing functions for the old API looked similar to the following. This is a function that parses a string into an angle; it can look like "45deg" or "1.5rad", for example.

 1  pub fn parse_angle_degrees (s: &str) -> Result <f64, ParseError> {
 2      let mut parser = Parser::new (s);
 3  
 4      let token = parser.next ()
 5          .map_err (|_| ParseError::new ("expected angle"))?;
 6  
 7      match token {
 8          Token::Number (NumericValue { value, .. }) => Ok (value as f64),
 9  
10          Token::Dimension (NumericValue { value, .. }, unit) => {
11              let value = value as f64;
12  
13              match unit.as_ref () {
14                  "deg"  => Ok (value),
15                  "grad" => Ok (value * 360.0 / 400.0),
16                  "rad"  => Ok (value * 180.0 / PI),
17                  _      => Err (ParseError::new ("expected angle"))
18              }
19          },
20  
21          _ => Err (ParseError::new ("expected angle"))
22      }.and_then (|r|
23                  parser.expect_exhausted ()
24                  .map (|_| r)
25                  .map_err (|_| ParseError::new ("expected angle")))
26  }
    

    This is a bit ugly, but it was the first version that passed the tests. Lines 4 and 5 mean, "get the first token or return an error". Line 17 means, "anything except deg, grad, or rad for the units causes the match expression to generate an error". Finally, I was feeling very proud of using and_then() in line 22, with parser.expect_exhausted(), to ensure that the parser would not find any more tokens after the angle/units.

    However, in the new version of rust-cssparser, Parser.next() gives back a Result with a &Token success value — a reference to a token —, while the old version returned a plain Token. No problem, I thought, I'm just going to de-reference the value in the match and be done with it:

        let token = parser.next ()
            .map_err (|_| ParseError::new ("expected angle"))?;
    
        match *token {
        //    ^ dereference here...
            Token::Number { value, .. } => value as f64,
    
            Token::Dimension { value, ref unit, .. } => {
        //                            ^ avoid moving the unit value
    

    The compiler complained elsewhere. The whole function now looked like this:

 1  pub fn parse_angle_degrees (s: &str) -> Result <f64, ParseError> {
 2      let mut parser = Parser::new (s);
 3  
 4      let token = parser.next ()
 5          .map_err (|_| ParseError::new ("expected angle"))?;
 6  
 7      match token {
 8          // ...
 9      }.and_then (|r|
10                  parser.expect_exhausted ()
11                  .map (|_| r)
12                  .map_err (|_| ParseError::new ("expected angle")))
13  }
    

    But in line 4, token is now a reference to something that lives inside parser, and parser is therefore borrowed mutably. The compiler didn't like that line 10 (the call to parser.expect_exhausted()) was trying to borrow parser mutably again.

    I played a bit with creating a temporary scope around the assignment to token so that it would only borrow parser mutably inside that scope. Things ended up like this, without the call to and_then() after the match:

 1  pub fn angle_degrees (s: &str) -> Result <f64, ParseError> {
 2      let mut input = ParserInput::new (s);
 3      let mut parser = Parser::new (&mut input);
 4  
 5      let angle = {
 6          let token = parser.next ()
 7              .map_err (|_| ParseError::new ("expected angle"))?;
 8  
 9          match *token {
10              Token::Number { value, .. } => value as f64,
11  
12              Token::Dimension { value, ref unit, .. } => {
13                  let value = value as f64;
14  
15                  match unit.as_ref () {
16                      "deg"  => value,
17                      "grad" => value * 360.0 / 400.0,
18                      "rad"  => value * 180.0 / PI,
19                      _      => return Err (ParseError::new ("expected 'deg' | 'grad' | 'rad'"))
20                  }
21              },
22  
23              _ => return Err (ParseError::new ("expected angle"))
24          }
25      };
26  
27      parser.expect_exhausted ().map_err (|_| ParseError::new ("expected angle"))?;
28  
29      Ok (angle)
30  }
    

    Lines 5 through 25 are basically

        let angle = {
            // parse out the angle; return if error
        };
    

    And after that is done, I test for parser.expect_exhausted(). There is no chaining of results with helper functions; instead it's just going through each token linearly.
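For contrast, the same logic can be sketched without rust-cssparser at all; this is my own simplified stand-in (the ParseError type and angle_degrees here are hypothetical, not librsvg's actual code), but it follows the same number-then-unit structure:

```rust
// Dependency-free sketch of parsing "45deg" / "200grad" / "1.5rad".
// ParseError is a stand-in for librsvg's error type.
#[derive(Debug, PartialEq)]
struct ParseError(&'static str);

fn angle_degrees(s: &str) -> Result<f64, ParseError> {
    // Split the string at the first alphabetic character: number, then unit.
    let split = s.find(|c: char| c.is_alphabetic()).unwrap_or(s.len());
    let (number, unit) = s.split_at(split);

    let value: f64 = number.parse().map_err(|_| ParseError("expected angle"))?;

    match unit {
        "" | "deg" => Ok(value),
        "grad"     => Ok(value * 360.0 / 400.0),
        "rad"      => Ok(value * 180.0 / std::f64::consts::PI),
        _          => Err(ParseError("expected 'deg' | 'grad' | 'rad'")),
    }
}

fn main() {
    println!("{:?}", angle_degrees("45deg"));   // Ok(45.0)
    println!("{:?}", angle_degrees("200grad")); // Ok(180.0)
    println!("{:?}", angle_degrees("45furlongs"));
}
```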

    The API break was annoying to deal with, but fortunately the calling code ended up cleaner, and I didn't have to change anything in the tests. I hope rust-cssparser can stabilize its API for consumers that are not Servo.

  4. Legacy Systems as Old Cities

    Translations: es - gnome, recompiler, urbanism

    I just realized that I only tweeted about this a couple of months ago, but never blogged about it. Shame on me!

    I wrote an article, Legacy Systems as Old Cities, for The Recompiler magazine. Is GNOME, now at 20 years old, legacy software? Is it different from mainframe software because "everyone" can change it? Does long-lived software have the same patterns of change as cities and physical artifacts? Can we learn from the building trades and urbanism for maintaining software in the long term? Could we turn legacy software into a good legacy?

    You can read the article here.

    Also, let me take this opportunity to recommend The Recompiler magazine. It is the most enjoyable technical publication I read. Their podcast is also excellent!

    Update 2017/06/10 - Spanish version of the article, Los Sistemas Heredados como Ciudades Viejas

  5. Setting Alt-Tab behavior in gnome-shell

    - gnome, gnome-shell

    After updating my distro a few months ago, I somehow lost my tweaks to the Alt-Tab behavior in gnome-shell.

    The default is to have Alt-Tab switch you between applications in the current workspace. One can use Alt-backtick (or whatever key you have above Tab) to switch between windows in the current application.

    I prefer a Windows-like setup, where Alt-Tab switches between windows in the current workspace, regardless of the application to which they belong.

    Many moons ago there was a gnome-shell extension to change this behavior, but these days (GNOME 3.24) it can be done without extensions. It is a bit convoluted.

    With the GUI

    If you are using X instead of Wayland, this works:

    1. Unset the Switch applications command. To do this, run gnome-control-center, go to Keyboard, and find the Switch applications command. Click on it, and hit Backspace in the dialog that prompts you for the keyboard shortcut. Click on the Set button.

    2. Set the Switch windows command. While still in the Keyboard settings, find the Switch windows command. Click on it, and hit Alt-Tab. Click Set.

    That should be all you need, unless you are in Wayland. In that case, you need to do it on the command line.

    With the command line, or in Wayland

    The kind people on #gnome-hackers tell me that as of GNOME 3.24, changing Alt-Tab doesn't work on Wayland as in (2) above, because the compositor captures the Alt-Tab key when you type it inside the dialog that prompts you for a keyboard shortcut. In that case, you have to change the configuration keys directly instead of using the GUI:

    gsettings set org.gnome.desktop.wm.keybindings switch-applications "[]"
    gsettings set org.gnome.desktop.wm.keybindings switch-applications-backward "[]"
    gsettings set org.gnome.desktop.wm.keybindings switch-windows "['<Alt>Tab', '<Super>Tab']"
    gsettings set org.gnome.desktop.wm.keybindings switch-windows-backward  "['<Alt><Shift>Tab', '<Super><Shift>Tab']"
    

Of course, the above works in X, too.

    Changing windows across all workspaces

    If you'd like to switch between windows in all workspaces, rather than in the current workspace, find the org.gnome.shell.window-switcher current-workspace-only GSettings key and change it. You can do this in dconf-editor, or on the command line with

gsettings set org.gnome.shell.window-switcher current-workspace-only false
    
  6. Exploring Rust's standard library: system calls and errors

    - rust

    In this post I'll show you the code path that Rust takes inside its standard library when you open a file. I wanted to learn how Rust handles system calls and errno, and all the little subtleties of the POSIX API. This is what I learned!

    The C side of things

    When you open a file, or create a socket, or do anything else that returns an object that can be accessed like a file, you get a file descriptor in the form of an int.

/* All of these return an int with a file descriptor, or
 * -1 in case of error.
 */
    int open(const char *pathname, int flags, ...);
    int socket(int domain, int type, int protocol);
    

    You get a nonnegative integer in case of success, or -1 in case of an error. If there's an error, you look at errno, which gives you an integer error code.

    int fd;
    
    retry_open:
    fd = open ("/foo/bar/baz.txt", 0);
    if (fd == -1) {
        if (errno == ENOENT) {
            /* File doesn't exist */
    } else if (errno == ...) {
            ...
        } else if (errno == EINTR) {
            goto retry_open; /* interrupted system call; let's retry */
        }
    }
    

Many system calls can return EINTR, which means "interrupted system call": something interrupted the kernel while it was doing your system call, and it returned control to userspace with the syscall unfinished. For example, your process may have received a Unix signal (e.g. you suspended it with Ctrl-Z on a terminal, or you resized the terminal and your process got a SIGWINCH). Most of the time EINTR simply means that you must retry the operation: if you Ctrl-Z a program to suspend it and then fg to continue it, and the program was in the middle of open()ing a file, you would expect it to continue at that exact point and to actually open the file. Software that doesn't check for EINTR can fail in very subtle ways!

    Once you have an open file descriptor, you can read from it:

    ssize_t
    read_five_bytes (int fd, void *buf)
    {
        ssize_t result;
    
        retry:
        result = read (fd, buf, 5);
        if (result == -1) {
            if (errno == EINTR) {
                goto retry;
            } else {
            return -1; /* the caller should check errno */
            }
        } else {
            return result; /* success */
        }
    }
    

... and one has to remember that if read() returns 0, it means we were at end-of-file; if it returns fewer bytes than requested, it does not necessarily mean the data is over, just that this was what could be read right now; and if this is a nonblocking socket and errno is EWOULDBLOCK or EAGAIN, then one must decide whether to retry the operation right away, or wait and try again later.

    There is a lot of buggy software written in C that tries to use the POSIX API directly, and gets these subtleties wrong. Most programs written in high-level languages use the I/O facilities provided by their language, which hopefully make things easier.

    I/O in Rust

    Rust makes error handling convenient and safe. If you decide to ignore an error, the code looks like it is ignoring the error (e.g. you can grep for unwrap() and find lazy code). The code actually looks better if it doesn't ignore the error and properly propagates it upstream (e.g. you can use the ? shortcut to propagate errors to the calling function).
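A small sketch of the propagating style with the ? shortcut (read_whole_file is my own helper name; the path is just an example):

```rust
use std::fs::File;
use std::io::{self, Read};

// Propagating errors with `?`: any failure in open() or read_to_string()
// makes the function return early with the io::Error.
fn read_whole_file(path: &str) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents)
}

fn main() {
    // This path presumably does not exist, so we land in the Err arm.
    match read_whole_file("/no/such/file.txt") {
        Ok(contents) => println!("{}", contents),
        Err(e) => println!("error: {}", e),
    }
}
```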

I keep recommending this article on error models to people; it discusses POSIX-like error codes vs. exceptions vs. more modern approaches like Haskell's and Rust's - definitely worth studying over a few days (also, see Miguel's valiant effort to move C# away from exceptions for I/O errors).

    So, what happens when one opens a file in Rust, from the toplevel API down to the system calls? Let's go down the rabbit hole.

    You can open a file like this:

    use std::fs::File;
    
    fn main () {
        let f = File::open ("foo.txt");
        ...
    }
    

This does not give you a raw file descriptor; it gives you an io::Result<File>, that is, a Result<fs::File, io::Error>, which you must pick apart to see if you actually got back a File that you can operate on, or an error.
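Picking it apart can look like this (the file name is just an example, assumed not to exist):

```rust
use std::fs::File;
use std::io::ErrorKind;

fn main() {
    match File::open("no-such-file.txt") {
        Ok(_file) => println!("opened it"),
        // The io::Error carries a portable ErrorKind we can match on.
        Err(ref e) if e.kind() == ErrorKind::NotFound => println!("not found"),
        Err(e) => println!("some other error: {}", e),
    }
}
```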

    Let's look at the implementation of File::open() and File::create().

    impl File {
        pub fn open<P: AsRef<Path>>(path: P) -> io::Result<File> {
            OpenOptions::new().read(true).open(path.as_ref())
        }
    
        pub fn create<P: AsRef<Path>>(path: P) -> io::Result<File> {
            OpenOptions::new().write(true).create(true).truncate(true).open(path.as_ref())
        }
        ...
    }
    

Here, OpenOptions is an auxiliary struct that implements a "builder" pattern. Instead of passing bit flags for the various O_CREAT/O_APPEND/etc. flags from the open(2) system call, one builds a struct with the desired options, and finally calls .open() on it.
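For example, a rough builder equivalent of open(2) with O_WRONLY | O_CREAT | O_APPEND (the file name is just an example):

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() {
    // Build up the options one at a time instead of OR-ing bit flags.
    let mut path = std::env::temp_dir();
    path.push("openoptions-demo.txt");

    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open(&path)
        .expect("could not open file");

    file.write_all(b"appended line\n").expect("could not write");
}
```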

    So, let's look at the implementation of OpenOptions.open():

        pub fn open<P: AsRef<Path>>(&self, path: P) -> io::Result<File> {
            self._open(path.as_ref())
        }
    
        fn _open(&self, path: &Path) -> io::Result<File> {
            let inner = fs_imp::File::open(path, &self.0)?;
            Ok(File { inner: inner })
        }
    

    See that fs_imp::File::open()? That's what we want: it's the platform-specific wrapper for opening files. Let's look at its implementation for Unix:

        pub fn open(path: &Path, opts: &OpenOptions) -> io::Result<File> {
            let path = cstr(path)?;
            File::open_c(&path, opts)
        }
    

    The first line, let path = cstr(path)? tries to convert a Path into a nul-terminated C string. The second line calls the following:

        pub fn open_c(path: &CStr, opts: &OpenOptions) -> io::Result<File> {
            let flags = libc::O_CLOEXEC |
                        opts.get_access_mode()? |
                        opts.get_creation_mode()? |
                        (opts.custom_flags as c_int & !libc::O_ACCMODE);
            let fd = cvt_r(|| unsafe {
                open64(path.as_ptr(), flags, opts.mode as c_int)
            })?;
            let fd = FileDesc::new(fd);
    
            ...
    
            Ok(File(fd))
        }
    

    Here, let flags = ... converts the OpenOptions we had in the beginning to an int with bit flags.

    Then, it does let fd = cvt_r (LAMBDA), and that lambda function calls the actual open64() from libc (a Rust wrapper for the system's libc): it returns a file descriptor, or -1 on error. Why is this done in a lambda? Let's look at cvt_r():

    pub fn cvt_r<T, F>(mut f: F) -> io::Result<T>
        where T: IsMinusOne,
              F: FnMut() -> T
    {
        loop {
            match cvt(f()) {
                Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
                other => return other,
            }
        }
    }
    

    Okay! Here f is the lambda that calls open64(); cvt_r() calls it in a loop and translates the POSIX-like result into something friendly to Rust. This loop is where it handles EINTR, which gets translated into ErrorKind::Interrupted. I suppose cvt_r() stands for convert_retry()? Let's look at the implementation of cvt(), which fetches the error code:

    pub fn cvt<T: IsMinusOne>(t: T) -> io::Result<T> {
        if t.is_minus_one() {
            Err(io::Error::last_os_error())
        } else {
            Ok(t)
        }
    }
    

    (The IsMinusOne shenanigans are just a Rust-ism to help convert multiple integer types without a lot of as casts.)

    The above means, if the POSIX-like result was -1, return an Err() from the last error returned by the operating system. That should surely be errno internally, correct? Let's look at the implementation for io::Error::last_os_error():

        pub fn last_os_error() -> Error {
            Error::from_raw_os_error(sys::os::errno() as i32)
        }
    

    We don't need to look at Error::from_raw_os_error(); it's just a conversion function from an errno value into a Rust enum value. However, let's look at sys::os::errno():

    pub fn errno() -> i32 {
        unsafe {
            (*errno_location()) as i32
        }
    }
    

    Here, errno_location() is an extern function defined in GNU libc (or whatever C library your Unix uses). It returns a pointer to the actual int which is the errno thread-local variable. Since non-C code can't use libc's global variables directly, there needs to be a way to get their addresses via function calls - that's what errno_location() is for.
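
    We can do the same dance by hand. This sketch is Linux/glibc-specific: glibc exports the accessor as __errno_location(), while other C libraries use different names (e.g. __error() on macOS, _errno() on MSVC), which is why the standard library hides the name behind a per-platform definition.

    ```rust
    use std::os::raw::c_int;

    extern "C" {
        // glibc's accessor for the thread-local errno (Linux-specific name).
        fn __errno_location() -> *mut c_int;
        // Plain close(2), used here only to make a syscall fail on purpose.
        fn close(fd: c_int) -> c_int;
    }

    fn errno() -> i32 {
        unsafe { *__errno_location() as i32 }
    }

    fn main() {
        // Closing an invalid file descriptor fails and sets errno to EBADF.
        let ret = unsafe { close(-1) };
        assert_eq!(ret, -1);
        assert_eq!(errno(), 9); // EBADF == 9 on Linux
    }
    ```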

    And on Windows?

    Remember the internal File.open()? This is what it looks like on Windows:

        pub fn open(path: &Path, opts: &OpenOptions) -> io::Result<File> {
            let path = to_u16s(path)?;
            let handle = unsafe {
                c::CreateFileW(path.as_ptr(),
                               opts.get_access_mode()?,
                               opts.share_mode,
                               opts.security_attributes as *mut _,
                               opts.get_creation_mode()?,
                               opts.get_flags_and_attributes(),
                               ptr::null_mut())
            };
            if handle == c::INVALID_HANDLE_VALUE {
                Err(Error::last_os_error())
            } else {
                Ok(File { handle: Handle::new(handle) })
            }
        }
    

    CreateFileW() is the Windows API function to open files. The conversion of error codes inside Error::last_os_error() happens analogously - it calls GetLastError() from the Windows API and converts it.
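
    The to_u16s() in the first line hints at the Windows flavor of the string problem: the W-suffixed API functions take UTF-16 strings with a nul terminator, so interior nuls must be rejected. A hedged, simplified sketch (operating on &str instead of &Path, with a plain error type):

    ```rust
    // Hypothetical simplified version of to_u16s(): encode as UTF-16,
    // reject interior nuls, append the nul terminator CreateFileW expects.
    fn to_u16s(s: &str) -> Result<Vec<u16>, &'static str> {
        let v: Vec<u16> = s.encode_utf16().collect();
        if v.contains(&0) {
            return Err("strings passed to the Windows API cannot contain NULs");
        }
        let mut v = v;
        v.push(0); // nul terminator
        Ok(v)
    }

    fn main() {
        let w = to_u16s("C:\\tmp").unwrap();
        assert_eq!(w, vec![67, 58, 92, 116, 109, 112, 0]);
        assert!(to_u16s("bad\0path").is_err());
    }
    ```

    Notice the symmetry with the Unix side: there the invariant is "valid UTF-8, errno checked after the call"; here it is "nul-terminated UTF-16, GetLastError() checked after the call".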

    Can we not call C libraries?

    The Rust/Unix code above depends on the system's libc for open() and errno, which are entirely C constructs; libc is what actually performs the system calls. There are efforts to make the Rust standard library bypass libc and issue syscalls directly.

    As an example, you can look at the Rust standard library for Redox. Redox is a new operating system kernel entirely written in Rust. Fun times!

    Update: If you want to see what a C-less libstd would look like, take a look at steed, an effort to reimplement Rust's libstd without C dependencies.

    Conclusion

    Rust is very meticulous about error handling, but it succeeds in making it pleasant to read. I/O functions give you back an io::Result<>, which you piece apart to see if it succeeded or got an error.

    Internally, and for each platform it supports, the Rust standard library translates errno from libc into an io::ErrorKind Rust enum. The standard library also automatically handles Unix-isms like retrying operations on EINTR.

    I've been enjoying reading the Rust standard library code: it has taught me many Rust-isms, and it's nice to see how the hairy/historical libc constructs are translated into clean Rust idioms. I hope this little trip down the rabbit hole for the open(2) system call lets you look in other interesting places, too.

  7. Moving to a new blog engine

    - meta

    In 2003 I wrote an Emacs script to write my blog and produce an RSS feed. Back then, I seemed to write multiple short blog entries in a day rather than longer articles (doing Mastodon before it was cool?). But my blogging patterns have changed. I've been wanting to add some more features to the script: moving to a page-per-post model, support for draft articles, tags, and syntax highlighting for code excerpts...

    This is a wheel that I do not find worth reinventing these days. After asking on Mastodon about static site generators (thanks to everyone who replied!), I've decided to give Pelican a try. I've reached the age where "obvious, beautiful documentation" is high on my list of things to look for when shopping for tools, and Pelican's docs are nice from the start.

    The old blog is still available in the old location.

    If you find broken links, or stuff that doesn't work correctly here, please mail me!