Hey, have you ever run into this situation: your high-performance server has plenty of CPU headroom, memory looks fine, but it’s still slow as molasses?
I was pulling my hair out over this exact problem a while back. Turns out, the bottleneck was in I/O operations. If you want to build a truly high-performance server, async IO is something you just can’t avoid. This reminds me of a real-life scenario…
Traditional I/O is Like Waiting at the Bank

Picture this: you’re at the bank to get something done. What does traditional blocking I/O feel like?
You grab a ticket, sit down and wait. When your number’s called, you walk up to the counter. Finish one thing, go back and sit down, wait for the next call. The whole time, you’re just sitting there doing nothing. That’s synchronous blocking I/O for you - one request has to finish before the next one can start.
Then the bank introduced a “reservation + queue” system - that’s like epoll. Now you can keep an eye on multiple tasks at once and handle whichever one’s ready. Way better than before, but there’s still a problem - every time you need something done, you still have to physically walk up to the counter (syscall).
Enter io_uring: The VIP Lane

So what exactly is io_uring? Think of it this way - it’s like the bank suddenly opening a VIP express lane just for you.
Here’s the deal: it sets up two conveyor belts between you and the counter.
One’s called the Submission Queue (SQ) - you write down what you need on a slip and toss it on the belt, and whoosh, it slides right over to the counter. The other is the Completion Queue (CQ) - finished slips automatically come sliding back from the other side.
Here’s the kicker - you don’t have to walk anywhere! Just throw your requests on the belt, go about your business, and the results come back on their own. The folks at the counter (the kernel) just keep grinding away, and you don’t need to babysit them.
That’s the essence of io_uring: using shared memory ring buffers to drastically reduce context switches between user space and kernel space. This async IO model is the key technology for building high-performance servers.
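If you like seeing ideas as code, here’s a toy version of a ring: a fixed-size buffer plus head and tail counters, so producer and consumer can work independently without signaling each other. This is purely an illustration of the concept - the real SQ and CQ live in memory that’s mmap’d and shared with the kernel:
// A toy ring: fixed-size slots plus head/tail counters. Illustration only -
// the real SQ/CQ are kernel-shared memory, not a Vec.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // consumer side (the counter clerk)
    tail: usize, // producer side (you)
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Self {
        Ring { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0 }
    }

    fn push(&mut self, item: T) -> Result<(), T> {
        if self.tail - self.head == self.slots.len() {
            return Err(item); // belt is full
        }
        let idx = self.tail % self.slots.len();
        self.slots[idx] = Some(item);
        self.tail += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // belt is empty
        }
        let idx = self.head % self.slots.len();
        self.head += 1;
        self.slots[idx].take()
    }
}

fn main() {
    let mut sq = Ring::new(4);
    sq.push("read test.txt").unwrap();
    sq.push("write socket").unwrap();
    while let Some(req) = sq.pop() {
        println!("kernel picks up: {}", req); // no syscall needed to hand over
    }
}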
The Technical Bits, in Plain English
Alright, let’s dig a little deeper, but I promise not to make your head explode.
The traditional approach looks like this:
Your program -> syscall -> kernel processes -> returns result -> your program continues
Every single trip means “passing through a checkpoint” - a context switch between user space and kernel space - and that’s not cheap. It’s like going through security, registration, and queuing every time you go to the bank, even if you just want to withdraw 50 bucks.
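To see the cost in code, here’s the traditional path in plain Rust - every read() below is its own syscall, and the thread just sits there blocked until each one finishes:
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut file = File::open("test.txt")?;
    let mut buf = [0u8; 1024];
    loop {
        // Each call crosses into the kernel and blocks the thread until
        // data is ready: one trip to the bank counter per chunk.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        println!("read {} bytes (one syscall)", n);
    }
    Ok(())
}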
The io_uring way:
Your program  <---- shared memory ---->  Kernel
      |                                     |
 submit request                       fetch request
      |                                     |
  fetch result                        submit result
See that? With shared memory, many operations don’t need syscalls at all. The kernel and your program each do their own thing, passing data through two ring buffers. That’s the magic of async IO - seriously efficient.
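And if you want to push “no syscalls” even further, io_uring has an SQPOLL mode: a kernel thread polls your submission queue for you, so even submit() usually becomes a no-op. A minimal sketch with the io-uring crate - note that SQPOLL needs a reasonably recent kernel, and on older ones extra privileges:
use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    // Ask the kernel to spawn a polling thread that watches our SQ;
    // it goes idle after 2000 ms without submissions.
    let ring = IoUring::builder()
        .setup_sqpoll(2000)
        .build(256)?;
    println!("SQPOLL ring created, submissions won't need syscalls");
    drop(ring);
    Ok(())
}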
Playing with io_uring in Rust
When it comes to Rust and io_uring, there are a few solid libraries worth mentioning.
The Raw Approach: io-uring crate
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // Create a ring that can hold 256 requests
    let mut ring = IoUring::new(256)?;
    let file = File::open("test.txt")?;
    let fd = types::Fd(file.as_raw_fd());

    // Prepare a read operation
    let mut buf = vec![0u8; 1024];
    let read_op = opcode::Read::new(fd, buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(0x42); // Tag this operation

    // Push the operation to the submission queue
    unsafe {
        ring.submission()
            .push(&read_op)
            .expect("submission queue full");
    }

    // Submit and wait for completion
    ring.submit_and_wait(1)?;

    // Fetch the result from the completion queue
    let cqe = ring.completion().next().expect("no completion");
    println!("Read complete, return value: {}", cqe.result());
    Ok(())
}
This is the lowest-level approach - like driving stick shift. Maximum control, but also the most work.
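The upside of all that manual work is control - batching, for instance. Here’s a sketch that queues three reads and moves them all through with a single syscall (same test.txt as above; exact builder signatures can vary a bit between io-uring crate versions):
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(256)?;
    let file = File::open("test.txt")?;
    let fd = types::Fd(file.as_raw_fd());

    // Three 512-byte reads at different offsets, one buffer each
    let mut bufs = vec![vec![0u8; 512]; 3];
    for (i, buf) in bufs.iter_mut().enumerate() {
        let op = opcode::Read::new(fd, buf.as_mut_ptr(), buf.len() as _)
            .offset((i * 512) as u64)
            .build()
            .user_data(i as u64); // tag so we can tell completions apart
        unsafe { ring.submission().push(&op).expect("queue full") };
    }

    // One syscall submits all three and waits for all three
    ring.submit_and_wait(3)?;
    for cqe in ring.completion() {
        println!("request {} finished: {}", cqe.user_data(), cqe.result());
    }
    Ok(())
}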
The Elegant Way: tokio-uring
use tokio_uring::fs::File;

fn main() -> std::io::Result<()> {
    // tokio-uring has no #[main] attribute macro - tokio_uring::start()
    // is the entry point that drives the uring-backed runtime
    tokio_uring::start(async {
        let file = File::open("test.txt").await?;
        let buf = vec![0u8; 1024];
        // the buffer is moved in and handed back together with the result
        let (result, _buf) = file.read_at(buf, 0).await;
        let bytes_read = result?;
        println!("Read {} bytes", bytes_read);
        Ok(())
    })
}
Much cleaner, right? tokio-uring wraps io_uring in the familiar async/await style we all know and love - you kick things off with tokio_uring::start (there’s no #[main] attribute macro), and inside it everything reads like regular async code. The one twist is that buffers get passed by value and handed back, which we’ll dig into later.
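The same owned-buffer dance works for writes too. Here’s a small sketch - scratch.txt is just a made-up file name - that writes a few bytes and reads them back:
use tokio_uring::fs::File;

fn main() -> std::io::Result<()> {
    tokio_uring::start(async {
        // Write: the Vec is moved in, the result and the Vec come back out
        let file = File::create("scratch.txt").await?;
        let (res, _buf) = file.write_at(b"hello io_uring".to_vec(), 0).await;
        res?;
        file.sync_all().await?;

        // Read it back with the same pattern
        let file = File::open("scratch.txt").await?;
        let (res, buf) = file.read_at(vec![0u8; 64], 0).await;
        let n = res?;
        println!("read back: {}", String::from_utf8_lossy(&buf[..n]));
        Ok(())
    })
}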
For the Performance Obsessed: glommio
use glommio::prelude::*;

fn main() {
    LocalExecutorBuilder::default()
        .spawn(|| async {
            let file = glommio::io::DmaFile::open("test.txt")
                .await
                .expect("failed to open file");
            let buffer = file.read_at(0, 1024).await.expect("read failed");
            println!("Read {} bytes", buffer.len());
        })
        .expect("failed to start executor")
        .join()
        .unwrap();
}
glommio uses a “thread-per-core” architecture - one thread per CPU core, combined with io_uring. It’s an absolute performance beast. Perfect for latency-critical scenarios like databases or message queues.
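What does thread-per-core actually look like? Here’s a rough sketch - the core count of 4 is arbitrary - that pins one executor to each core with Placement::Fixed, so every shard runs its own io_uring with no cross-core locking on the hot path:
use glommio::prelude::*;

fn main() {
    // One executor per core: each has its own ring and its own tasks
    let handles: Vec<_> = (0..4)
        .map(|cpu| {
            LocalExecutorBuilder::new(Placement::Fixed(cpu))
                .name(&format!("shard-{}", cpu))
                .spawn(move || async move {
                    println!("shard {} up, handling its own slice of the work", cpu);
                })
                .expect("failed to start executor")
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}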
Hands-On: Building a Simple Echo Server
Talk is cheap, let’s write an echo server with tokio-uring:
use tokio_uring::net::{TcpListener, TcpStream};

async fn handle_client(stream: TcpStream) {
    let mut buf = vec![0u8; 1024];
    loop {
        // the buffer is moved into read() and handed back alongside the result
        let (result, b) = stream.read(buf).await;
        buf = b;
        match result {
            Ok(0) => break, // connection closed
            Ok(n) => {
                // echo back exactly the bytes we just received
                let (result, _) = stream.write_all(buf[..n].to_vec()).await;
                if result.is_err() {
                    break;
                }
            }
            Err(_) => break,
        }
    }
}

fn main() -> std::io::Result<()> {
    // no #[main] macro here either - tokio_uring::start() drives the runtime
    tokio_uring::start(async {
        let listener = TcpListener::bind("127.0.0.1:8080".parse().unwrap())?;
        println!("Server started, listening on port 8080");
        loop {
            let (stream, addr) = listener.accept().await?;
            println!("New connection from: {}", addr);
            tokio_uring::spawn(handle_client(stream));
        }
    })
}
This server pushes all of its network I/O through io_uring - fire it up and poke it with nc 127.0.0.1 8080, and whatever you type comes straight back. In high-concurrency scenarios, that reduced per-operation syscall traffic is exactly where it can pull ahead of traditional epoll solutions.
When Should You Use io_uring?
Let me be straight with you - io_uring is powerful, but it’s not a silver bullet. It depends on your use case.
These scenarios are a perfect fit:
Building a server handling tens or hundreds of thousands of connections? Go for it. Heavy file I/O? Go for it. Need ultra-low latency, like a trading system? Definitely go for it. Just make sure your machine is running Linux kernel 5.1 or higher, otherwise it’s a no-go.
These scenarios? Maybe skip it:
Your code needs to run on Windows or Mac? Forget it - io_uring is Linux’s baby, other systems don’t have it. Your bottleneck isn’t even in I/O? Then why bother - it won’t solve your problem and you’ll just waste time. You’re the only one on the team who understands this stuff? Maintenance is going to be a nightmare.
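One practical note on that kernel requirement: you can detect io_uring at runtime and fall back to an epoll path when it’s missing. On a pre-5.1 kernel the io_uring_setup syscall fails outright, so with the io-uring crate a minimal sketch of the check looks like this:
use io_uring::IoUring;

// On kernels without io_uring (pre-5.1), the setup syscall fails with
// ENOSYS, so constructing a tiny ring doubles as a feature check.
fn io_uring_available() -> bool {
    IoUring::new(2).is_ok()
}

fn main() {
    if io_uring_available() {
        println!("io_uring is available, taking the VIP lane");
    } else {
        println!("no io_uring here, falling back to epoll");
    }
}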
Lessons from the Trenches
I’ve been using io_uring for a while now, and I’ve hit quite a few snags. Here are some tips:
1. Buffer Ownership Issues
io_uring is truly asynchronous, which means once you hand over a buffer, you can’t touch that memory until the operation completes. Rust’s ownership system is a huge help here, but you’ll need to get used to this “lend the buffer out and get it back” pattern.
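To make the hazard concrete, here’s a sketch with the raw io-uring crate - the borrow checker can’t see across the syscall boundary, so keeping the buffer alive until the completion arrives is on you:
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    let file = File::open("test.txt")?;
    let fd = types::Fd(file.as_raw_fd());

    // The SQE only stores a raw pointer into `buf`, so `buf` must stay
    // alive (and unmoved) from push() until the CQE arrives.
    let mut buf = vec![0u8; 1024];
    let read_op = opcode::Read::new(fd, buf.as_mut_ptr(), buf.len() as _).build();
    unsafe { ring.submission().push(&read_op).expect("queue full") };

    // Dropping or reallocating `buf` at this point would be a use-after-free
    // from kernel space - owned-buffer APIs like tokio-uring's read_at
    // exist precisely to rule this out.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("no completion");
    println!("read {} bytes", cqe.result());
    Ok(())
}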
2. Kernel Version Matters
Different kernel versions support different io_uring features. For example, network operations weren’t really solid until 5.6, and 5.7 brought more optimizations. Always check compatibility before upgrading.
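Rather than matching on version numbers, you can also ask the kernel directly which opcodes it supports. A sketch using the io-uring crate’s Probe (note that probe registration itself needs a reasonably recent kernel):
use io_uring::{opcode, IoUring, Probe};

fn main() -> std::io::Result<()> {
    let ring = IoUring::new(2)?;

    // Ask the kernel which opcodes it actually supports
    let mut probe = Probe::new();
    ring.submitter().register_probe(&mut probe)?;

    // e.g. Accept landed in 5.5, Send/Recv in 5.6
    println!("Accept supported: {}", probe.is_supported(opcode::Accept::CODE));
    println!("Send supported:   {}", probe.is_supported(opcode::Send::CODE));
    Ok(())
}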
3. Debugging is Tricky
Since operations are submitted asynchronously, stack traces might not be very intuitive when things go wrong. Add plenty of logging and make good use of the user_data field to track requests.
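For instance, here’s a minimal sketch of that tracking pattern - keep a table of in-flight requests keyed by user_data, and resolve entries as completions come back (same test.txt stand-in as before):
use io_uring::{opcode, types, IoUring};
use std::collections::HashMap;
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    let file = File::open("test.txt")?;
    let fd = types::Fd(file.as_raw_fd());
    let mut buf = vec![0u8; 1024];

    // Table of in-flight requests, keyed by the user_data tag
    let mut in_flight: HashMap<u64, String> = HashMap::new();
    in_flight.insert(1, "read test.txt @ offset 0".to_string());

    let read_op = opcode::Read::new(fd, buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(1);
    unsafe { ring.submission().push(&read_op).expect("queue full") };
    ring.submit_and_wait(1)?;

    for cqe in ring.completion() {
        // user_data comes back untouched, so it tells us which request this is
        let desc = in_flight.remove(&cqe.user_data()).unwrap_or_default();
        println!("'{}' finished with result {}", desc, cqe.result());
    }
    Ok(())
}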
Wrapping Up
io_uring represents the future of Linux async IO. Its shared memory ring buffer design drastically reduces syscall overhead, letting us squeeze every last drop of performance from the hardware. For developers looking to build high-performance servers, this is essential knowledge.
In the Rust ecosystem, you’ve got options at every abstraction level - from the low-level io-uring crate to higher-level tokio-uring and glommio. Rust’s memory safety and zero-cost abstractions, combined with io_uring’s async IO capabilities, make for a golden partnership in building high-performance servers.
Next time your server hits an I/O bottleneck, give io_uring a shot. It might just be the key you’ve been looking for.
If you found this article helpful, feel free to like and bookmark it. Got questions? Drop a comment below and I’ll do my best to respond.
Follow me for more hands-on Rust high-performance programming tips. See you in the next one!
