Correctness in Rust: building strings

- rust

Rust tries to follow the "make illegal states unrepresentable" mantra in several ways. In this post I'll show several things related to the process of building strings, from bytes in memory, or from a file, or from char * things passed from C.

Strings in Rust

The easiest way to build a string is to do it directly at compile time:

let my_string = "Hello, world!";

In Rust, strings are UTF-8. Here, the compiler checks our string literal is valid UTF-8. If we try to be sneaky and insert an invalid character...

let my_string = "Hello \xf0";

We get a compiler error:

error: this form of character escape may only be used with characters in the range [\x00-\x7f]
 --> foo.rs:2:30
  |
2 |     let my_string = "Hello \xf0";
  |                              ^^

Rust strings know their length, unlike C strings. They can contain a nul character in the middle, because they don't need a nul terminator at the end.

let my_string = "Hello \x00 zero";
println!("{}", my_string);

The output is what you expect:

$ ./foo | hexdump -C
00000000  48 65 6c 6c 6f 20 00 20  7a 65 72 6f 0a           |Hello . zero.|
0000000d                    ^ note the nul char here
$

So, to summarize, in Rust:

  • Strings are encoded in UTF-8
  • Strings know their length
  • Strings can have nul chars in the middle

This is a bit different from C:

  • Strings don't exist!

Okay, just kidding. In C:

  • A lot of software has standardized on UTF-8.
  • Strings don't know their length - a char * is a raw pointer to the beginning of the string.
  • Strings conventionally have a nul terminator, that is, a zero byte that marks the end of the string. Therefore, you can't have nul characters in the middle of strings.

Building a string from bytes

Let's say you have an array of bytes and want to make a string from them. Rust won't let you just cast the array, like C would. First you need to do UTF-8 validation. For example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
fn convert_and_print(bytes: Vec<u8>) {
    let result = String::from_utf8(bytes);
    match result {
        Ok(string) => println!("{}", string),
        Err(e) => println!("{:?}", e)
    }
}

fn main() {
    convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
    convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
}

In lines 10 and 11, we call convert_and_print() with different arrays of bytes; the first one is valid UTF-8, and the second one isn't.

Line 2 calls String::from_utf8(), which returns a Result, i.e. something with a success value or an error. In lines 3-5 we unpack this Result. If it's Ok, we print the converted string, which has been validated for UTF-8. Otherwise, we print the debug representation of the error.

The program prints the following:

$ ~/foo
Hello
FromUtf8Error { bytes: [72, 101, 240, 108, 108, 111], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }

Here, in the error case, the Utf8Error tells us that the bytes are UTF-8 and are valid_up_to index 2; that is the first problematic index. We also get some extra information which lets the program know if the problematic sequence was incomplete and truncated at the end of the byte array, or if it's complete and in the middle.

And for a "just make this printable, pls" API? We can use String::from_utf8_lossy(), which replaces invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER:

fn convert_and_print(bytes: Vec<u8>) {
    let string = String::from_utf8_lossy(&bytes);
    println!("{}", string);
}

fn main() {
    convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
    convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
}

This prints the following:

$ ~/foo
Hello
He�llo

Reading from files into strings

Now, let's assume you want to read chunks of a file and put them into strings. Let's go from the low-level parts up to the high level "read a line" API.

Single bytes and single UTF-8 characters

When you open a File, you get an object that implements the Read trait. In addition to the usual "read me some bytes" method, it can also give you back an iterator over bytes, or an iterator over UTF-8 characters.

The Read.bytes() method gives you back a Bytes iterator, whose next() method returns Result<u8, io::Error>. When you ask the iterator for its next item, that Result means you'll get a byte out of it successfully, or an I/O error.

In contrast, the Read.chars() method gives you back a Chars iterator, and its next() method returns Result<char, CharsError>, not io::Error. This extended CharsError has a NotUtf8 case, which you get back when next() tries to read the next UTF-8 sequence from the file and the file has invalid data. CharsError also has a case for normal I/O errors.

Reading lines

While you could build a UTF-8 string one character at a time, there are more efficient ways to do it.

You can create a BufReader, a buffered reader, out of anything that implements the Read trait. BufReader has a convenient read_line() method, to which you pass a mutable String and it returns a Result<usize, io::Error> with either the number of bytes read, or an error.

That method is declared in the BufRead trait, which BufReader implements. Why the separation? Because other concrete structs also implement BufRead, such as Cursor — a nice wrapper that lets you use a vector of bytes like an I/O Read or Write implementation, similar to GMemoryInputStream.

If you prefer an iterator rather than the read_line() function, BufRead also gives you a lines() method, which gives you back a Lines iterator.

In both cases — the read_line() method or the Lines iterator, the error that you can get back can be of ErrorKind::InvalidData, which indicates that there was an invalid UTF-8 sequence in the line to be read. It can also be a normal I/O error, of course.

Summary so far

There is no way to build a String, or a &str slice, from invalid UTF-8 data. All the methods that let you turn bytes into string-like things perform validation, and return a Result to let you know if your bytes validated correctly.

The exceptions are in the unsafe methods, like String::from_utf8_unchecked(). You should really only use them if you are absolutely sure that your bytes were validated as UTF-8 beforehand.

There is no way to bring in data from a file (or anything file-like, that implements the Read trait) and turn it into a String without going through functions that do UTF-8 validation. There is not an unsafe "read a line" API without validation — you would have to build one yourself, but the I/O hit is probably going to be slower than validating data in memory, anyway, so you may as well validate.

C strings and Rust

For unfortunate historical reasons, C flings around char * to mean different things. In the context of Glib, it can mean

  • A valid, nul-terminated UTF-8 sequence of bytes (a "normal string")
  • A nul-terminated file path, which has no meaningful encoding
  • A nul-terminated sequence of bytes, not validated as UTF-8.

What a particular char * means depends on which API you got it from.

Bringing a string from C to Rust

From Rust's viewpoint, getting a raw char * from C (a "*const c_char" in Rust parlance) means that it gets a pointer to a buffer of unknown length.

Now, that may not be entirely accurate:

  • You may indeed only have a pointer to a buffer of unknown length
  • You may have a pointer to a buffer, and also know its length (i.e. the offset at which the nul terminator is)

The Rust standard library provides a CStr object, which means, "I have a pointer to an array of bytes, and I know its length, and I know the last byte is a nul".

CStr provides an unsafe from_ptr() constructor which takes a raw pointer, and walks the memory to which it points until it finds a nul byte. You must give it a valid pointer, and you had better guarantee that there is a nul terminator, or CStr will walk until the end of your process' address space looking for one.

Alternatively, if you know the length of your byte array, and you know that it has a nul byte at the end, you can call CStr::from_bytes_with_nul(). You pass it a &[u8] slice; the function will check that a) the last byte in that slice is indeed a nul, and b) there are no nul bytes in the middle.

The unsafe version of this last function is unsafe CStr::from_bytes_with_nul_unchecked(): it also takes an &[u8] slice, but you must guarantee that the last byte is a nul and that there are no nul bytes in the middle.

I really like that the Rust documentation tells you when functions are not "instantaneous" and must instead walks arrays, like to do validation or to look for the nul terminator above.

Turning a CStr into a string-like

Now, the above indicates that a CStr is a nul-terminated array of bytes. We have no idea what the bytes inside look like; we just know that they don't contain any other nul bytes.

There is a CStr::to_str() method, which returns a Result<&str, Utf8Error>. It performs UTF-8 validation on the array of bytes. If the array is valid, the function just returns a slice of the validated bytes minus the nul terminator (i.e. just what you expect for a Rust string slice). Otherwise, it returns an Utf8Error with the details like we discussed before.

There is also CStr::to_string_lossy() which does the replacement of invalid UTF-8 sequences like we discussed before.

Conclusion

Strings in Rust are UTF-8 encoded, they know their length, and they can have nul bytes in the middle.

To build a string from raw bytes, you must go through functions that do UTF-8 validation and tell you if it failed. There are unsafe functions that let you skip validation, but then of course you are on your own.

The low-level functions which read data from files operate on bytes. On top of those, there are convenience functions to read validated UTF-8 characters, lines, etc. All of these tell you when there was invalid UTF-8 or an I/O error.

Rust lets you wrap a raw char * that you got from C into something that can later be validated and turned into a string. Anything that manipulates a raw pointer is unsafe; this includes the "wrap me this pointer into a C string abstraction" API, and the "build me an array of bytes from this raw pointer" API. Later, you can validate those as UTF-8 and build real Rust strings — or know if the validation failed.

Rust builds these little "corridors" through the API so that illegal states are unrepresentable.