On a 64-vCPU EPYC, my key-value store does 9.9 million pipelined GETs per second — 3.7× more than Dragonfly, the fastest multi-threaded Redis alternative on the market. Dragonfly is built by a seasoned team and backed by years of work. So how does a single person’s side project beat it on this benchmark? The short answer is specialization — and the long answer is the rest of this post.

Why I built it

Last December, I was itching to build something from scratch. I had no idea what it would be, just that it had to be different. After some research, I decided to make a KV store. I’d used Redis before, so I wasn’t a stranger to in-memory caches — but using something is quite different from building it. I didn’t want to follow any tutorial, since I felt it might cause me to fixate on the design decisions from the content I was following. So I decided to do my own research. I did settle on the RESP protocol early, for Redis client compatibility.

Now the big question: how to make it different? Well, since I’m a C++ developer, why not try making it faster than the competition? That’s not so easy either — the most popular KVs (Redis, memcached, Dragonfly) are written in C or C++, so language alone doesn’t give me any performance boost. There’s one thing I can do to beat industry leaders: make it more performant through specialization. Popular products, even though developed over years or even decades by a group of very smart people, have to support a wide variety of features, and these come at a cost.

The single-threaded foundation

My initial plan was to make a single-threaded version. Why? Because of relative simplicity. I didn’t want multithreading to prevent me from creating a solid foundation. And that’s exactly what I did with RapidKV — a single-threaded store that already beat Redis by 4.2×. (It’s still up on GitHub if you want the single-threaded story in isolation, but everything that matters carries forward into what came next.)

The core: storage engine

The cache decisions here are exactly what let the sharded design scale later

It was essential that I create a fast storage engine, since that would be the backbone of the system. Modern systems rely heavily on CPU caches for their speed, so if I could choose data structures that take full advantage of these, I could see a performance improvement. The core of this storage engine is a hash table, which stores the actual KV entries.

Most implementations of hash tables (like C++’s std::unordered_map) use separate chaining to handle collisions. While this has its advantages, it also comes at the cost of poor cache locality. Since each entry is its own separate node, each pointer jump risks a cache miss. So if you need to probe, say, 4 times for an entry, the CPU will read 4 separate 64-byte cache lines that can be scattered across memory. This leads to poor performance. This is what Redis uses. The other way to handle collisions is open addressing, where everything is laid out sequentially, and CPUs love sequential data because of the excellent cache usage. This does come with its own disadvantages, but if you want speed, it’s the right choice. Dragonfly, by comparison, uses a hybrid called DashTable: a flat directory of pointers to fixed-size segments, where each segment is itself a small open-addressing hash table. When a segment fills, it splits. So they created an excellent data structure that’s faster than the Redis dict while also using less memory. I, on the other hand, went all-in on an open-addressing implementation, since that should still give the best performance — but comes at the cost of higher memory usage.

The open-addressing technique I went ahead with is called Robin Hood hashing (my FlatMap). After my research, this seemed to serve my purpose best. Here’s how it works — on a collision, we linearly probe and start checking consecutive slots. Either we find an empty slot, or in the case of an occupied slot we check if the new key is farther away from its home than the current key occupying the slot. If it is, we “rob” the slot from that key and displace it forward.

if (newEntryMetadata.psl > m_table.metadata[keyPos].psl)
{
    std::swap(key, m_table.keys[keyPos]);
    std::swap(value, m_table.values[keyPos]);
    std::swap(newEntryMetadata, m_table.metadata[keyPos]);
}

And we keep doing this until we reach an empty slot.

What Robin Hood buys you isn’t a lower average probe length — that’s fixed by load factor, and it’s the same for any open-addressing scheme. What it crushes is the variance and the worst case. Across normal operation (my table doubles at 0.80 load, so the effective load factor hovers in the high 0.6s), the average displacement from home was ~1 slot. The number that matters more: my max probe length stayed in the low 20s across millions of keys, where naive linear probing routinely produces outliers in the hundreds at the same fill. That’s what keeps p99 latency flat.

Because of the Structure-of-Arrays (SoA) design, the metadata is packed into just 2 bytes per entry. This allows up to 32 entries to reside within a single 64-byte cache line. Consequently, worst-case probes require fetching at most two cache lines to scan the metadata, filtering out 99% of unnecessary key comparisons before the heavier key/value arrays are ever accessed.

Resizing is handled incrementally rather than stop-the-world, to avoid the latency spikes a bulk rehash would cause.

Now let’s come to what the FlatMap stores in its keys and values. While my initial plan was to support many more data types, I focused on getting two types right — integers and strings — rather than many types half-done. Integers are stored as-is, but the big question was: can I optimize the strings somehow? There are three ways I could think of: make the size of the object as small as possible for better cache performance and move-ability (remember, we need to move objects around in Robin Hood); some level of inlining of data, called Small String Optimization (SSO), to reduce pointer jumps and therefore improve cache performance; and lastly a custom allocator that doesn’t fragment the heap for every new string allocation. C++’s default std::string isn’t bad, but object size varies between implementations from 24–32 bytes, and so does the SSO length. For full control I needed to write something from scratch. For comparison, both Redis and Dragonfly employ some form of SSO.

This is what I decided on: a custom 16-byte string called CompactString that inlines data for size ≤ 15 and falls back to a custom slab allocator for larger sizes. Since the size is exactly 16 bytes and it’s aligned to 16 bytes, 4 of these can fit in a single cache line. That means in the best case, for small strings in our table, a single read has 4 strings in contiguous memory — no pointer jump needed. And even if the size is larger and heap allocation is required, the slab allocator can allocate and deallocate memory in O(1) time, making for an extremely fast string type. Constructing heap-backed strings (via the slab allocator) is 3.6× faster than std::string, moves are 5.8× faster — the move win is what pays off during Robin Hood’s swaps and access is 1.75× faster. (Small inline strings construct at roughly parity, ~7% faster — there’s no allocator to beat there, the win for those is the cache density).

So what happens when we combine this new CompactString with our FlatMap? We get 2.5× faster insertions than std::unordered_map with std::string, 2× faster lookups, and 2.6× faster deletions.

These micro-benchmarks are part of the repo.

This wraps up the core storage engine. There’s still a lot more to talk about here, but I’ll save that for a future post. So stay tuned for a deep dive into the storage engine, where I’ll talk about the SoA layout, incremental resizing, CompactString, the eviction policy, and the memory management in more detail.

Now let’s see the overall single-threaded architecture and get a better understanding of the other components.

Single-threaded architecture

Let’s start with Asio. This is the part of the system responsible for managing TCP connections and IO; it fires when data is read and when data has been written to the network buffer. This is managed by the io_context event loop. One thread here is doing all the work — whenever a connection receives a request, the event loop wakes up the thread. It then does the reading, parsing, executing, and writing. Reading and writing operations are both async. Once the writing completes, we start all over again. All the connections reside in the server class.

A crucial part of this system is IO. A system handicapped by IO cannot be scaled properly, no matter how many cores you throw at it. When you’re doing millions of operations per second, it’s necessary to reduce as many copies and allocations in the pipeline as you can, since these dirty the CPU caches and use up memory bandwidth — other than, of course, just wasting CPU cycles. And that’s exactly what I did: the input data that’s read into the network buffer is used as-is to create commands using string_view, and passed directly to the storage layer to be stored in the hash tables. Similarly, when writing output, it’s written directly to the network buffer; any formatting required (like integer-to-string) is done using preallocated stack buffers. So the protocol (RESP) parsing is zero-copy, and protocol writing is zero-allocation.

Both the input and output buffers use a custom linear buffer with compaction, with the output buffer actually consisting of two buffers — one used to write to the network while the other collects responses. This means writes never stall. (When I moved onto the multithreaded architecture, there was an extra copy step that needed to be added — but I’ll explain that later.)

A micro-optimization in the IO path is an integer-packed command dispatcher. Command dispatch avoids string comparison entirely: each command name (up to 8 bytes) is folded to uppercase and packed into a single uint64_t, so matching an incoming command against the table is a handful of integer comparisons instead of repeated string compares on the hot path.

This results in an IO path that scales incredibly well even on a single thread with pipelining, leaving Redis far behind as the pipelining depth goes up. (Pipelining here means sending N requests at a time.)

At pipelining = 500, this version achieves a whopping 8 million ops/sec, compared to Redis’s 1.9 million — and that’s while having 1/5th the p99 latency of Redis (tested on a Ryzen 4800H). Now, I know a pipeline depth of 500 is very high and at the extreme end of realistic values, but it’s meant to show the maximum throughput you could get out of this system.

Some design choices made early on, with the multithreaded future of this project in mind, include the thread-local slab allocator, no global shared resource/mutex, and a self-contained connection object.

Hitting the wall

Hitting high throughput with pipelining has its use cases, but without pipelining you’re still severely limited on throughput compared to multithreaded counterparts. Those extra cores on the machine are sitting idle. To use them, we have to somehow divide the incoming work across multiple threads. One basic solution is to create N worker threads that access the hash table protected by a lock. While this may perform better than the single-threaded version, it’ll scale very poorly — since all the threads are accessing the same data structure, there’s going to be heavy lock contention.

A better approach is fine-grained locking, something memcached uses. Essentially, instead of one global lock, you create many small specific locks that each protect separate segments of the hash table. This minimizes lock contention and can result in good throughput scaling — but only up to a point. As the thread count increases, you start running into the same problem again: a higher thread count means higher chances of multiple threads hitting the same locked segment, and so we have the same issue. It goes from near-linear scaling in the beginning to flatlining at higher thread counts.

So what’s the answer? It’s called a shared-nothing design, and it’s also what Dragonfly uses, since it allows for maximum throughput. Let’s get into the details.

The shared-nothing design

Shared-nothing multithreaded architecture

In simple terms, we take the one hash table we had and split it into N tables, where N is the number of threads. Each of these N tables, enclosed in an independent entity called a shard, owns a different part of the key space. The distribution is decided using a hash function, with uniform keys resulting in uniform distribution across the N shards. These shards run their own separate event loop in their own separate thread. The shards are cache-line aligned to prevent false sharing, and each shard’s thread is pinned to a dedicated CPU core for optimal cache locality.

But now the question is: who takes care of the IO — the main thread, or separate threads? It’s the shards themselves. Whenever a new connection is made, it’s assigned to one of the shards in round-robin fashion.

So whenever a connection gets a request, we calculate the hash of the key using rapidhash (one of the fastest non-cryptographic hashes), then calculate the fast modulo using libdivide. This gives us the shard ID where the key resides.

// hash the key, then a fast modulo (via libdivide) gives the owning shard
uint64_t hash     = rapidhash_withSeed(request.arguments[0].data(),
                                       request.arguments[0].size(),
                                       m_routingHashSeed);
uint64_t quotient = hash / m_fastModDivisor;
targetShardId     = hash - (quotient * m_shardPool.size());

if (targetShardId == m_shardId)
    m_dispatcher.dispatch(request, m_database, response);          // local: inline fast path
else
    m_shardPool[targetShardId]->ExecuteRemote(std::move(request), /* ... */);  // remote: post to owner

Now there are two possibilities: either the key it requests is in the same shard, or in a different shard. If it’s the same shard, the command is executed inline in a fast path. If it’s in a different shard, we post this request to the target shard. Once the target shard executes the command asynchronously, it notifies the local shard, and we write to the network buffer.

Now the problem: what happens with pipelined requests? Since these requests can be spread across the shards, they can complete in a random order. Since a client expects pipelined responses back in the same order it sent them, we need to enforce strict ordering of the responses. This is done using request indexes — each request’s response is written by the local or remote shard into its own buffer, and each shard posts its completion to the local shard. Once all responses are accounted for, they’re copied to the network buffer in their respective order. And because of this ordering requirement, there’s an extra copy in the multithreaded architecture.

So how are cross-shard requests handled internally? When we want to access data in a remote shard, we post the request to the target shard. The target shard stores this request in its own task queue (managed by Asio’s io_context), and all the requests coming from different shards are queued up here and executed one by one. The hash table itself doesn’t have any form of locking mechanism on it. Similarly, when a target shard has executed a caller’s request, it posts the completion notification to the caller shard, which again means a completion callback is inserted into the caller’s queue.

So while there’s no locking mechanism on the hash table itself, the task queue still needs to be protected from corruption by multiple threads. Asio’s io_context internally uses locks to ensure this thread safety.

There’s one additional challenge with cross-shard ops. Each cross-shard op requires a small heap allocation to store the callback — and here’s the nasty part: the allocation and deallocation happen on different threads. Shard A allocates the callback, shard B frees it. Then B allocates the completion callback, and A frees it. This is the textbook worst case for a thread-local allocator. My slab allocator assumes alloc and free happen on the same thread — it has no safe answer for a cross-thread free. So I couldn’t use it on this path. That leaves two additional heap allocations per cross-shard op (and as shard count increases, so do the cross-shard ops) routed through the generic allocator. This meant losing a lot of performance. That’s exactly why I chose to use mimalloc: it’s designed so that freeing memory on a different thread than it was allocated on is cheap. Using it gave me roughly a 20% boost in throughput.

So what does all this achieve? A system that keeps on scaling even at very high thread counts. Let’s see the results.

Benchmarks

Test environment

ComponentDetails
CPUAMD EPYC 9554P (64C / 128T)
Server cores64 threads pinned to cores 0–63
Benchmark cores64 threads pinned to cores 64–127 via taskset -c 64-127
Benchmark toolmemtier_benchmark
Value size256 bytes
Key range1–10,000,000

Server configurations

ServerLaunch command
Redisredis-server --save "" --appendonly no
Dragonflydragonfly --dbfilename "" --snapshot_cron "" --cache_mode=true --version_check=false --proactor_threads=64
VortexKV./VortexKV VortexKV.config (64 shards)

All three servers ran on the same bare-metal machine. Redis had all persistence disabled; Dragonfly ran in cache mode with snapshots disabled. Servers were restarted between runs for a clean state.

Without pipelining

SET throughput (no pipeline) — 64 threads × 10 connections · ratio 1:0 (SET only) · 256B values · 100 seconds

MetricRedisDragonflyVortexKVvs Redisvs Dragonfly
Ops/sec66,6312,489,8252,575,16938.6×+3.4%
Avg latency9.508 ms0.254 ms0.248 ms−97.4%−2.4%
p99 latency11.967 ms0.351 ms0.343 ms−97.1%−2.3%

GET throughput (no pipeline, pre-populated) — 64 threads × 10 connections · ratio 0:1 (GET only) · 256B values · 100 seconds · 10M keys pre-populated

MetricRedisDragonflyVortexKVvs Redisvs Dragonfly
Ops/sec70,7802,552,8892,541,42835.9×−0.4%
Avg latency9.131 ms0.253 ms0.249 ms−97.3%−1.6%
p99 latency9.407 ms0.351 ms0.343 ms−96.4%−2.3%

Takeaway: Without pipelining, VortexKV and Dragonfly are effectively neck-and-neck at ~2.5M ops/sec — both ~36× faster than single-threaded Redis.

With pipelining

Pipelined SET (pipeline = 30) — 64 threads × 10 connections · pipeline = 30 · ratio 1:0 · 256B values · 200K requests/client

MetricRedis †DragonflyVortexKVvs Redisvs Dragonfly
Ops/sec567,6427,417,39611,880,29720.9×+60.2%
Avg latency8.298 ms2.358 ms1.769 ms−78.7%−25.0%
p99 latency15.743 ms4.047 ms2.319 ms−85.3%−42.7%

Pipelined GET (pipeline = 30, pre-populated) — 64 threads × 10 connections · pipeline = 30 · ratio 0:1 · 256B values · 100 seconds · 10M keys pre-populated

MetricRedis †DragonflyVortexKVvs Redisvs Dragonfly
Ops/sec558,3082,688,8739,889,34617.7×3.68×
Avg latency8.592 ms7.140 ms1.918 ms−77.7%−73.1%
p99 latency17.023 ms7.807 ms2.655 ms−84.4%−66.0%

† For Redis, clients were run at an optimal 16-thread × 10-connection config for the pipelined tests. Redis executes commands on a single thread, so adding more threads doesn’t improve its execution throughput, unlike VortexKV and Dragonfly.

Takeaway: Under pipelining, VortexKV pulls clearly ahead. Pipelined SET is 60% faster than Dragonfly; pipelined GET is 3.7× faster — the largest gap in the entire suite.

An anomaly worth chasing: why were writes outpacing reads?

One interesting thing to note here is the difference in SET vs GET numbers. Both VortexKV and Dragonfly achieve higher SET ops, which feels like an anomaly — it should be the exact opposite, especially for Dragonfly. I thought something was wrong, so I ran the benchmarks again, not just on this EPYC server but on two other machines as well: a C3D GCP instance and my own Ryzen 4800H machine. The trend was similar.

Knowing my own architecture, I had a suspicion that might be true for Dragonfly as well — and that was the extra copy required in the response pipeline I mentioned earlier. For any SET request, the incoming data is passed directly using string_view to the target shard, where it’s copied once into the hash table entry, and we respond with a small +OK\r\n (5 bytes). That response data needs to be passed to the caller’s shard, for which we first copy into a temporary buffer, then copy to the network to preserve ordering. For a GET response, we again send the data directly, but of course there’s no copying into the hash table entry — what’s different is that the response data is now much bigger, equal to the value size (256 bytes in our benchmarks).

Maybe Dragonfly was experiencing something similar, but worse. So I simply changed the data size in the benchmarks from 256 bytes to 8 bytes, and — boom — now Dragonfly’s GETs were as fast as its SETs. To make sure, I read some of Dragonfly’s code (string_family.cc , function ReadString) to see what was going on: for a GET response, Dragonfly first creates a temporary string on the target shard (so a heap alloc plus a copy), then this string object is passed to the caller, where it’s read again (from the target’s memory, which is more costly for multi-CCD CPUs) and copied to the output buffer. If we reduce the data size, this string is now inlined — so no heap alloc, and the copy is significantly cheaper — resulting in better numbers.

Scaling

Shard scaling (1:1 SET:GET, no pipeline) — 256B values · 25 seconds · client threads scaled proportionally to server shards

ShardsDragonfly (ops/sec)VortexKV (ops/sec)ΔDragonfly p99VortexKV p99
183,66473,510−12.1%0.287 ms0.295 ms
4311,311287,612−7.6%0.367 ms0.415 ms
161,064,231959,061−9.9%0.415 ms0.447 ms
321,736,6631,572,802−9.4%0.559 ms0.575 ms
642,403,3682,551,297+6.2%0.351 ms0.351 ms

Shard scaling: VortexKV vs Dragonfly

At 1–32 shards, Dragonfly leads by 8–12%. I don’t have a precise breakdown of where that comes from, but it’s likely from a more mature implementation. The 1-shard count is an especially interesting case, since this configuration makes both systems act like a single-threaded cache (no cross-shard ops). This lead might be coming from their helio I/O framework, plus years of profiling-driven micro-optimization on a mature codebase.

At 64 shards the relationship inverts. Dragonfly wraps every command that touches keys — even single-key ops — in a Transaction object, the machinery that gives them atomic multi-key operations via their VLL (Very Light Locking) intent-lock system. That per-command cost is a fair price for a feature I skipped, and skipping it is likely part of why my lighter path scales better at full saturation.

Note: This scaling test is unpipelined on purpose, to isolate how cleanly each architecture scales with core count rather than to measure peak throughput. The pipelined numbers earlier are the throughput ceiling; this is the scaling shape. The +6.2% here isn’t the same claim as the 3.7× there — one is “scales better at saturation,” the other is “higher peak under load.”

Limitations

VortexKV is specialized, and not complete. A few honest categories of what’s missing and why:

Multi-key DEL/EXISTS. These existed in the single-threaded version — since everything was in a single table, they were easy. As soon as we move to the sharded architecture, a single multi-key request can span multiple shards. This makes these ops non-trivial: every multi-key operation now needs to account for results from multiple shards and maintain atomicity. The implementation is not only genuinely hard, it also introduces extra work that would hurt throughput — counterintuitive to what I’m trying to build. So I gave up a feature I’d already had working.

Persistence, replication, AUTH, and TLS aren’t here either — but those are production-completeness features, and VortexKV is a study of the hot path, not a production-ready Redis replacement. They were out of scope by design.

Data types beyond strings and integers were left out because I refuse to ship a type that isn’t optimized. I initially did have plans to support more types — you can actually see the remnants of this in the hash table’s value type, which uses a std::variant (for just string and int, a custom union would be better). Take lists, for example: I could have added the feature backed internally by a std::vector; it was trivial enough. But shipping a slow data structure in a project that’s about data-structure performance would contradict the entire point.

Future work

The next big thing on the roadmap is a more efficient cross-shard communication mechanism. The current one, using ASIO post, uses lock-based task queues and extra heap allocations. That is a bottleneck that gets worse as shard count increases. I plan to implement custom lock-free queues for the cross-shard task posting specifically, rather than routing it through Asio’s queue (the network I/O would stay on Asio). I haven’t settled on the exact design yet, but per shard it’ll be either one MPSC queue or N−1 SPSC queues. The MPSC design is simpler but has more contention; the SPSC design uses more memory (quadratic with shard count) but should be faster. I’ll be testing both. For reference, Dragonfly also uses lock-free queues for cross-shard communication — so this is partly about closing the gap with their implementation.

This should improve cross-shard performance, since it avoids the lock contention on the critical path — and depending on the queue design, can remove the per-op heap allocation too. It should also help scaling at higher shard counts, where cross-shard ops are both more frequent and more expensive.

Try it yourself

And that’s how a single-person side project ends up at 9.9M pipelined GETs/sec — not by out-engineering a seasoned team, but by refusing to pay for features I didn’t need and optimizing wherever I could. VortexKV, the full source, and reproducible benchmarks are all on GitHub: github.com/shashank2602/VortexKV.

I’m currently open to C++ roles. The best way to reach me is by email at shashankjoshi2602@gmail.com or on LinkedIn.