A few days ago, I read a discussion on a YouTube video. The essence of the discussion was that memory layout and memory size matter.

So, an unsigned[1] 8-bit integer (u8 in Rust) takes up less memory than a u128 and should therefore be faster. However, someone else added a caveat: the memory layout can differ based on the architecture. Another point made was that 32-bit and 64-bit integers are probably faster, since modern CPUs are heavily optimized for those widths. By that reasoning, a u8 is probably slower than a u32.
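To make the size half of that claim concrete, here is a minimal sketch using only the standard library. The wrapper structs `Id8` and `Id128` and the helper `bytes_for` are hypothetical names invented for this illustration; they only mirror the shape of a struct with a single integer field. Alignment is deliberately left out because it can vary by target and compiler version.

```rust
use std::mem::size_of;

// Hypothetical single-field structs, named only for this illustration.
#[allow(dead_code)]
struct Id8 {
    item: u8,
}

#[allow(dead_code)]
struct Id128 {
    item: u128,
}

/// Bytes needed to store `count` values of type `T` back to back
/// (ignores allocator overhead and spare Vec capacity).
fn bytes_for<T>(count: usize) -> usize {
    count * size_of::<T>()
}

fn main() {
    println!("u8:    {} byte(s)", size_of::<u8>());
    println!("u32:   {} byte(s)", size_of::<u32>());
    println!("u64:   {} byte(s)", size_of::<u64>());
    println!("u128:  {} byte(s)", size_of::<u128>());
    println!("Id8:   {} byte(s)", size_of::<Id8>());
    println!("Id128: {} byte(s)", size_of::<Id128>());
    println!("1M Id8:   {} bytes", bytes_for::<Id8>(1_000_000));
    println!("1M Id128: {} bytes", bytes_for::<Id128>(1_000_000));
}
```

This only shows how much memory each type occupies; it says nothing about speed, which is what the rest of this post tries to measure.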

On the face of it, the argument made sense. However, two things made me wonder:

  1. Is it true?
  2. Are the differences so vast that I should consider using u8, u32, or u64 for IDs of my structs instead of u128?

Like a good Bayesian thinker, I opted to look into some evidence.

So, I’ve written a program:

#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
struct SomeThing {
    item: u128, /*and this one*/
}

fn make255() -> Vec<SomeThing> {
    (0..255u128/* <- changing this number*/)
        .map(|n| SomeThing { item: n })
        .collect()
}

// 65_535 * 255 =  16_711_425 // numbers
// 16_711_425 * 2 = 33_422_850  // Since we are reformatting those numbers once again when converting them back from strings
// 33_422_850 * 100 = 3_342_285_000 // For 100 times
fn main() {
    let material: Vec<_> = (0..u16::MAX/*65535*/).flat_map(|_| make255()).collect();
    let stringed = serde_json::to_string(&material).expect("could not serialize");
    let _unstringed =
        serde_json::from_str::<Vec<SomeThing>>(&stringed).expect("could not deserialize");
}

I created four versions of the above code, one each for u8, u32, u64, and u128.

I built them using cargo build --release.

Each run of the program builds about 16.7 million structs, each containing a number taken from the range 0..255[2], serializes all of them into one JSON string, and then deserializes that string back into SomeThing structs. Counting both directions, that is around 33 million structs per run.

As a backend developer, this covers one of the more important concerns of mine, which is serializing and deserializing[3].

I used hyperfine to run each binary 100 times. This means that for each type, we processed about 3.3 billion structs.
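The counts can be double-checked with a few lines of arithmetic. This is just a sketch that restates the comments above main; the constant and function names are made up for the illustration.

```rust
// Names below are hypothetical, chosen only for this check.
const OUTER_ITERATIONS: u64 = u16::MAX as u64; // 0..u16::MAX yields 65_535 iterations
const STRUCTS_PER_CALL: u64 = 255; // 0..255 yields 255 values

/// Structs built before serialization.
fn structs_built() -> u64 {
    OUTER_ITERATIONS * STRUCTS_PER_CALL
}

/// Structs handled in one run: built once, then rebuilt by deserialization.
fn structs_per_run() -> u64 {
    structs_built() * 2
}

/// Structs handled over `runs` hyperfine runs.
fn structs_over_runs(runs: u64) -> u64 {
    structs_per_run() * runs
}

fn main() {
    println!("built per run:   {}", structs_built());
    println!("handled per run: {}", structs_per_run());
    println!("over 100 runs:   {}", structs_over_runs(100));
}
```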

Here are the results for the u8 version:

 hyperfine  --runs 100 ./u8version  --export-json 8.json
Benchmark 1: ./u8version
  Time (mean ± σ):     921.9 ms ±  72.6 ms    [User: 854.1 ms, System: 65.4 ms]
  Range (min … max):   801.2 ms … 1082.6 ms    100 runs

And here are the results for the u32 version:

 ❯ hyperfine  --runs 100 ./u32version  --export-json 32.json
Benchmark 1: ./u32version
  Time (mean ± σ):     957.6 ms ±  70.0 ms    [User: 860.1 ms, System: 95.1 ms]
  Range (min … max):   876.6 ms … 1134.1 ms    100 runs

And here are the results for the u64 version:

 ❯ hyperfine  --runs 100 ./u64version  --export-json 64.json
Benchmark 1: ./u64version
  Time (mean ± σ):     969.8 ms ±  21.9 ms    [User: 836.9 ms, System: 130.3 ms]
  Range (min … max):   890.5 ms … 1038.4 ms    100 runs

And finally, here are the results for the u128 version:

 ❯ hyperfine  --runs 100 ./u128version  --export-json 128.json
Benchmark 1: ./u128version
  Time (mean ± σ):      1.421 s ±  0.022 s    [User: 1.210 s, System: 0.206 s]
  Range (min … max):    1.371 s …  1.473 s    100 runs

So, compared to the u8 version, the u32 version took only about 4% more time, the u64 version only about 5% more, and the u128 version around 54% more time.
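These percentages come straight from the hyperfine means. Here is a small sketch for recomputing them; `percent_slower` is a helper name invented for this example, and the numbers are the mean times reported above, in milliseconds.

```rust
/// How much longer `other_ms` is than `baseline_ms`, in percent.
fn percent_slower(baseline_ms: f64, other_ms: f64) -> f64 {
    (other_ms / baseline_ms - 1.0) * 100.0
}

fn main() {
    // Mean wall-clock times from the hyperfine output, in milliseconds.
    let u8_ms = 921.9;
    let others = [("u32", 957.6), ("u64", 969.8), ("u128", 1421.0)];

    for (name, ms) in others {
        println!("{name}: {:.1}% slower than u8", percent_slower(u8_ms, ms));
    }
}
```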

The differences between u8, u32, and u64 are so small (each is within one standard deviation of the u8 runs) that I cannot help but agree that these types are extremely well optimized. u32 and u64 are probably not faster than u8, but it is fair to say that when you are concerned with performance, u64 is about as solid a choice as u8.

u128 is not as performant as the other types. But at this point, we should consider the absolute difference in time. Between u8 and u128, the difference works out to only about 15 nanoseconds per struct (roughly half a second spread over 33 million structs). To put it into perspective, you would need to use u8 instead of u128 for 4 billion instances to save one minute of computation time on my computer.
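The per-struct figure is just the gap between the two mean run times divided by the number of structs handled per run. A quick sketch of that arithmetic, with helper names made up for the illustration:

```rust
/// Difference per struct in nanoseconds, given two mean run times in
/// milliseconds and the number of structs handled per run.
fn per_struct_diff_ns(slow_ms: f64, fast_ms: f64, structs: f64) -> f64 {
    (slow_ms - fast_ms) * 1_000_000.0 / structs
}

/// How many structs it takes for a per-struct saving of `diff_ns`
/// to add up to `seconds` of wall-clock time.
fn structs_to_save(seconds: f64, diff_ns: f64) -> f64 {
    seconds * 1_000_000_000.0 / diff_ns
}

fn main() {
    // u128 mean (1421 ms) vs u8 mean (921.9 ms), 33_422_850 structs per run.
    let diff_ns = per_struct_diff_ns(1421.0, 921.9, 33_422_850.0);
    println!("difference per struct: {diff_ns:.1} ns");
    println!(
        "structs needed to save one minute: {:.2e}",
        structs_to_save(60.0, diff_ns)
    );
}
```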

In a back-end environment, such a small difference seems negligible, which makes u128, with its enormous headroom before integer overflow, seem like a good deal.

Update 3 hours later, same day

Joshua Barretto suggested that I use black_box to guard against possible compiler optimizations. I ran the benchmarks again, and the runs took significantly longer, roughly 60 to 67% more time, but the proportions stayed roughly the same. Specifically, u8 took 1.498 seconds, u32 took 1.565 seconds, u64 took 1.615 seconds, and u128 took 2.289 seconds.

Here is the modified code:


use std::hint::black_box;


#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
struct SomeThing {
    item: u128, 
}

fn make255() -> Vec<SomeThing> {
    (black_box(0..255u128))
        .map(|n| SomeThing { item: n })
        .collect()
}

fn main() {
    black_box(|| {
        let material: Vec<_> = (0..u16::MAX)
            .flat_map(|_| black_box(make255()))
            .collect();
        let stringed = black_box(serde_json::to_string(&material)).expect("could not serialize");
        let _unstringed = black_box(serde_json::from_str::<Vec<SomeThing>>(&stringed))
            .expect("could not deserialize");
    })()
}
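For readers who have not used black_box before, here is a smaller, std-only sketch of the same idea. The workload `sum_to` is a hypothetical stand-in, not part of the benchmark above. Note that the standard library documents black_box as a best-effort hint, so this is not a guarantee against every optimization.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical workload: sum of 0..n, widened to u128.
fn sum_to(n: u64) -> u128 {
    (0..n).map(u128::from).sum()
}

fn main() {
    let start = Instant::now();

    // black_box on the input discourages the compiler from treating `n`
    // as a known constant and precomputing the result at compile time.
    // black_box on the output discourages it from dropping the
    // computation because the result looks unused.
    let total = black_box(sum_to(black_box(1_000_000)));

    println!("total = {total}, took {:?}", start.elapsed());
}
```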
  

  1. Unsigned means that it cannot contain anything less than 0.

  2. That’s because u8 is limited to 255.

  3. This may mean that a confounding factor is tinting our data: the libraries involved in serializing and deserializing may be the source of the difference. However, since I am looking for applicable scenarios, this works well in my favor.