The same server in five languages, and where they secretly disagree

Marton Trencseni - Fri 12 June 2026 - Programming

Introduction

Over the course of this series I ended up writing the same toy async message queue server five times: in Python, JavaScript, C++, Rust and Go. Each one is under 500 lines of code, each speaks the exact same line-based JSON protocol over TCP, and — this is the part I was proud of — each one passes the exact same suite of 34 unit tests. The whole point of the test suite was to enforce wire-compatibility across languages.

So here is the uncomfortable question I eventually asked myself: does passing the same tests actually mean the five servers behave the same? Or does it just mean they agree on the 34 things I happened to write tests for?

I went looking, at the byte level, and the answer is clearly the latter. Below are at least six ways the five implementations disagree, and every single one of them passes the suite. The code for all five is on GitHub.

How I looked: a byte-level probe

Before the findings, the method, because the method is the whole trick. The unit test suite parses every server response with json.loads before comparing it. That is the natural thing to do, but it also throws away information — most importantly, the order and exact spelling of the bytes on the wire. So I wrote a small separate harness that compares raw bytes and never parses.

The core of it is a receive helper that just accumulates whatever the server sends until it goes quiet:

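import socket

def recv_raw(sock, t=0.4):          # collect bytes until the server idles
    sock.settimeout(t)              # each recv waits at most t seconds
    out = b""
    try:
        while True:
            b = sock.recv(65536)
            if not b:               # empty read: the server closed the connection
                break
            out += b
    except socket.timeout:          # quiet for t seconds: assume the reply is complete
        pass
    return out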

The driver starts each server in turn on the same port, runs a fixed set of probes against it, then kills it:

for name, cmd in SERVERS.items():
    p = subprocess.Popen(cmd + [str(PORT), str(CACHE)], ...)
    time.sleep(1.2)                 # let it bind
    r = probe(name)                 # the same probes for every server
    p.terminate(); p.wait()
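The fixed time.sleep(1.2) is the simplest way to wait for a server to bind. A slightly more robust alternative is to poll the port until a connect succeeds. The helper below is a sketch of that idea, not what the harness actually does; the name wait_for_port and its parameters are my own, and it assumes the server tolerates a client that connects and immediately hangs up.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", deadline=5.0, interval=0.05):
    """Return True once a TCP connect to host:port succeeds, False if the deadline passes."""
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        try:
            # A successful connect means something is listening; close right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the server is not up yet, so retry shortly.
            time.sleep(interval)
    return False
```

In the driver this would stand in for the sleep, with a failed wait treated as "the server never started" rather than as a probe result.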

Each probe is a few lines: open a socket, send some commands, and keep the raw reply. The whole thing leans on fixed sleeps rather than proper message framing, which is good enough for localhost and stable across runs, but is the one place I would not call this rigorous. The signals turned out to be unambiguous anyway.
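For comparison, here is roughly what framing-aware receiving could look like: instead of waiting for silence, read until a known number of \r\n-terminated lines has arrived. This is a sketch under the assumption that every reply line ends in \r\n and that the probe knows how many lines to expect; recv_lines is a hypothetical helper, not part of the harness.

```python
import socket

def recv_lines(sock, n, timeout=2.0):
    """Read raw bytes until at least n CRLF-terminated lines have arrived.

    Returns early on EOF or if a single recv waits longer than `timeout`.
    The result is never parsed, so byte-level differences survive.
    """
    sock.settimeout(timeout)          # applies to each recv, not to the whole call
    buf = b""
    try:
        while buf.count(b"\r\n") < n:
            chunk = sock.recv(65536)
            if not chunk:             # peer closed the connection
                break
            buf += chunk
    except socket.timeout:
        pass                          # return whatever arrived before the stall
    return buf
```

The trade-off is that a probe now has to declare how many lines it expects, which is exactly the kind of per-server knowledge the idle-timeout approach avoids. Note that the result may hold more than n lines if extra ones arrived in the same chunk.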

The disagreements

Here is the summary. Every row was observed directly from the wire, against all five running servers.

| # | Probe                         | Python                 | JavaScript   | C++          | Rust         | Go           |
|---|-------------------------------|------------------------|--------------|--------------|--------------|--------------|
| 1 | JSON key order                | insertion              | insertion    | insertion    | alphabetical | alphabetical |
| 2 | unsubscribe, never subscribed | Internal exception     | success      | success      | success      | success      |
| 3 | quit                          | stray parse json error | silent close | silent close | silent close | silent close |
| 4 | bare empty line               | ignored                | disconnects  | ignored      | ignored      | ignored      |
| 5 | invalid UTF-8                 | UTF-8 error            | parse json   | parse json   | UTF-8 error  | UTF-8 error  |
| 6 | fractional last_seen          | malformed              | accepted     | malformed    | malformed    | malformed    |

Three of these are worth showing in detail; the rest I'll describe.

Divergence 1: the keys come out in a different order

The probe subscribes, sends one message, and keeps the delivered line:

line(c, '{"command":"subscribe","topic":"t"}'); recv_raw(c)
line(c, '{"command":"send","topic":"t","msg":"hi","delivery":"all"}')
msgline = next(l for l in recv_raw(c).split(b"\r\n") if b"index" in l)

What actually comes back, byte for byte:

python : {"command": "send", "topic": "t", "msg": "hi", "delivery": "all", "index": 0}
js     : {"command":"send","topic":"t","msg":"hi","delivery":"all","index":0}
cpp    : {"command":"send","topic":"t","msg":"hi","delivery":"all","index":0}
rust   : {"command":"send","delivery":"all","index":0,"msg":"hi","topic":"t"}
go     : {"command":"send","delivery":"all","index":0,"msg":"hi","topic":"t"}

Two things jump out. First, Rust and Go sort the keys alphabetically, while the others preserve the order the fields were inserted. Nobody decided this — it falls out of library defaults. Go's encoding/json sorts map keys when it marshals a map[string]interface{}. Rust's serde_json stores JSON objects in a BTreeMap unless you turn on the preserve_order feature, which the Cargo.toml here does not. Python's json.dumps, JavaScript's JSON.stringify and Boost.JSON all keep insertion order.

Second, look at Python: it is the only one that puts spaces after the colons and commas. That is also a library default — json.dumps uses ", " and ": " as separators unless you ask for compact output. So Python is wire-distinguishable from the other four before you even get to key order.

None of this is caught by the tests, because the tests json.loads every response and compare Python dicts, which are order- and whitespace-insensitive.
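Both halves of that claim are easy to reproduce with nothing but the standard library. The snippet below takes the Python and Rust lines exactly as captured above, shows that they are equal after json.loads and different as bytes, and then uses json.dumps options to regenerate the other servers' wire forms from the same message. It only illustrates the library defaults; it is not code from the suite or from any of the servers.

```python
import json

# The delivered line exactly as captured from two of the servers.
python_wire = b'{"command": "send", "topic": "t", "msg": "hi", "delivery": "all", "index": 0}'
rust_wire = b'{"command":"send","delivery":"all","index":0,"msg":"hi","topic":"t"}'

# What the test suite compares: parsed dicts, blind to order and whitespace.
same_after_parsing = json.loads(python_wire) == json.loads(rust_wire)

# What the byte-level probe compares: the raw bytes.
same_on_the_wire = python_wire == rust_wire

msg = json.loads(python_wire)
default_form = json.dumps(msg)                                # Python's wire form
compact_form = json.dumps(msg, separators=(",", ":"))         # JavaScript / C++ form
sorted_form = json.dumps(msg, separators=(",", ":"), sort_keys=True)  # Rust / Go form

print(same_after_parsing, same_on_the_wire)
print(default_form)
print(compact_form)
print(sorted_form)
```

A suite that wanted to pin the wire format down would have to compare something like these strings directly, and then decide which of the three forms is the canonical one.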

Divergence 2: unsubscribing from something you never subscribed to

The probe is one line — unsubscribe a topic this connection never subscribed to — and the replies split cleanly:

python          : {"success": false, "reason": "Internal exception"}
js/cpp/rust/go  : {"success": true}

Python is the odd one out, and the reason is a single method choice. Its handler does:

def handle_unsubscribe(cmd, writer):
    topics[cmd["topic"]].remove(writer)          # set.remove → KeyError if absent
    topics_reverse[writer].remove(cmd["topic"])
    send_success(writer)

set.remove raises KeyError when the element is not present, the exception propagates up to the client loop's try/except, and the client gets Internal exception. The other four reach for the forgiving operation — JavaScript's Set.delete, C++'s set::erase, Rust's HashMap::remove, Go's delete — all of which are no-ops on a missing key, so they happily report success. Had Python used discard instead of remove, all five would agree. The tests never unsubscribe without first subscribing, so this never surfaces.
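The difference between the two set methods fits in a few lines. The sketch below is a stripped-down stand-in for the handler, not the server's code: it assumes topics and topics_reverse are defaultdict(set) (with a plain dict, the lookup itself would raise KeyError before remove was ever reached), and it returns the reply instead of writing it to a socket.

```python
from collections import defaultdict

# Assumed shapes: topic -> set of writers, and writer -> set of topics.
topics = defaultdict(set)
topics_reverse = defaultdict(set)

def unsubscribe_strict(topic, writer):
    """Mirrors the remove-based handler: raises KeyError if never subscribed."""
    topics[topic].remove(writer)
    topics_reverse[writer].remove(topic)
    return {"success": True}

def unsubscribe_lenient(topic, writer):
    """discard-based variant: a missing element is a silent no-op."""
    topics[topic].discard(writer)
    topics_reverse[writer].discard(topic)
    return {"success": True}

def reply_for(handler, topic, writer):
    """Stand-in for the client loop's catch-all try/except."""
    try:
        return handler(topic, writer)
    except Exception:
        return {"success": False, "reason": "Internal exception"}

print(reply_for(unsubscribe_strict, "t", "client-1"))
print(reply_for(unsubscribe_lenient, "t", "client-1"))
```

Whether the lenient reply is the right one is a protocol decision; the point is only that the choice was made by a method name rather than by anyone deciding it.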

Divergence 5: invalid UTF-8, same symptom, two different causes

The probe sends three invalid bytes followed by the \r\n line terminator:

c.sendall(b"\xff\xfe\xfa\r\n")

Three of the servers correctly call this a UTF-8 problem, two call it a JSON problem:

python/rust/go : {... "reason": "Could not decode input as UTF-8"}
js             : {"success":false,"reason":"Could not parse json"}
cpp            : {"success":false,"reason":"Could not parse json"}

The interesting part is that JavaScript and C++ arrive at the same wrong answer for completely different reasons. In JavaScript, the readline interface has already decoded the incoming bytes into a string by the time my code sees them, substituting the Unicode replacement character U+FFFD for the bad bytes — so the explicit UTF-8 check never has anything to reject, and the mangled string simply fails to parse as JSON. In C++, the check is right there in the code:

boost::locale::conv::utf_to_utf<char32_t>(str.c_str(), str.c_str() + str.size());

but utf_to_utf defaults to the skip error policy: it silently drops invalid bytes instead of throwing, so valid_utf8() returns true and the leftover again fails as JSON. Two languages, two library behaviours, one identical bug.
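Python's bytes.decode happens to expose all three policies through its errors argument, which makes the mechanism easy to see in one place. This is an analogy in Python, not the JavaScript or C++ code: "replace" stands in for readline's U+FFFD substitution, "ignore" for Boost.Locale's skip policy, and reason_for is a hypothetical stand-in for the decode-then-parse step of a handler.

```python
import json

BAD = b"\xff\xfe\xfa"   # the probe's payload, without the line terminator

def decode_strict(data):
    return data.decode("utf-8")                    # raises UnicodeDecodeError

def decode_replace(data):
    return data.decode("utf-8", errors="replace")  # bad bytes become U+FFFD

def decode_skip(data):
    return data.decode("utf-8", errors="ignore")   # bad bytes are silently dropped

def reason_for(decoder, data):
    """Which error a decode-then-parse handler would report for this input."""
    try:
        text = decoder(data)
    except UnicodeDecodeError:
        return "Could not decode input as UTF-8"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "Could not parse json"
    return None

for decoder in (decode_strict, decode_replace, decode_skip):
    print(decoder.__name__, "->", reason_for(decoder, BAD))
```

Only the strict policy ever reaches the UTF-8 branch. With the other two, the decoder has already repaired or discarded the evidence, so the JSON parser is the first thing that gets a chance to complain.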

And this one is special, because the test suite does have a test for invalid UTF-8 — it just doesn't actually assert anything:

response = send_and_receive_many(client, random_bytes, allow_trailing_bytes=True)
for r in response: r == {'success': False, 'reason': 'Could not decode input as UTF-8'}

That r == {...} is a bare expression. There is no assert. The comparison is computed and thrown away. So the one divergence I thought to test for is precisely the one the test silently waves through.
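To make the difference concrete, here are the two shapes side by side, reduced to plain functions over a list of already-parsed responses. This is a sketch: check_silent and check_asserting are names I made up, and whether every response should match (rather than just one of them, given allow_trailing_bytes=True) is an assumption about what the test meant to say.

```python
EXPECTED = {"success": False, "reason": "Could not decode input as UTF-8"}

def check_silent(responses):
    """The shape of the existing test: compares, then discards the result."""
    for r in responses:
        r == EXPECTED                      # bare expression, never fails

def check_asserting(responses):
    """The shape it presumably intended: every response must match."""
    # Guard against an empty list, where the loop below would pass vacuously.
    assert responses, "expected at least one response"
    for r in responses:
        assert r == EXPECTED, f"unexpected response: {r!r}"
```

Run against the JavaScript or C++ reply, check_silent returns normally and check_asserting raises AssertionError, which is the failure the suite never reported.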

The other three, briefly

quit (divergence 3). Four servers treat quit as a special command and close the connection silently. Python does not special-case it early enough: the string falls through to json.loads("quit"), which fails, so the client gets a stray {"reason": "Could not parse json"} and then the connection closes. A control word leaks a protocol error on its way out.

Empty line (divergence 4). Send a bare \r\n. Four servers ignore it and stay connected. JavaScript treats an empty line as end-of-input — if (!line) { rl.close(); } — and hangs up. An empty line is a no-op everywhere except Node, where it is a disconnect.

Fractional last_seen (divergence 6). Send last_seen: 2.5. Four servers reject it as malformed, because they require the value to be an integer. JavaScript accepts it, because typeof 2.5 === "number" passes its type check and parseInt then quietly truncates it to 2. JavaScript's single numeric type leaks straight through the validator.
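In all three cases four servers agree, so the majority behaviour can be written down as a small reference front end. The sketch below is my reading of that majority, expressed in Python; it is not taken from any of the five servers, classify_line and valid_last_seen are hypothetical names, and it assumes quit is matched as the exact bytes quit.

```python
import json

def classify_line(raw):
    """Decide what to do with one received line, before any command handling.

    Returns a (kind, payload) pair where kind is "ignore", "close",
    "error" or "command".
    """
    line = raw.rstrip(b"\r\n")
    if not line:
        return ("ignore", None)            # empty line: stay connected, say nothing
    if line == b"quit":
        return ("close", None)             # control word: close without a reply
    try:
        text = line.decode("utf-8")        # strict decoding, raises on bad bytes
    except UnicodeDecodeError:
        return ("error", "Could not decode input as UTF-8")
    try:
        return ("command", json.loads(text))
    except json.JSONDecodeError:
        return ("error", "Could not parse json")

def valid_last_seen(value):
    """Accept integers only; 2.5 parses to a float and is rejected."""
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)
```

The order of the checks is the whole point: the empty-line and quit checks have to run before the JSON parser gets the line, and the integer check has to look at the parsed type rather than at "is it a number".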

The differences I did not test on the wire

To be honest about scope: the six above are all things I observed directly from a client. There are further divergences I am confident about from reading the code but did not probe, and I want to label them as such.

The most important is the concurrency model. Python (asyncio), JavaScript (Node), and C++ (a single Boost.Asio io_context run on one thread) are all single-threaded event loops, so they need no locking around the shared topic state. Rust (multi-threaded Tokio) wraps everything in Arc<Mutex<…>>, and Go (goroutines) guards it with a sync.Mutex. Same data structure, but two of the five need a lock and three do not, purely because of the runtime.
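For readers who have only seen the event-loop versions, this is roughly what the locked shape looks like when translated into Python threads. It is an illustration of the pattern, not a port of the Rust or Go code; TopicRegistry and its methods are hypothetical. The read-modify-write in next_index is the part that needs the lock once handlers run on more than one thread.

```python
import threading
from collections import defaultdict

class TopicRegistry:
    """Shared topic state guarded by one lock, in the style of a sync.Mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = defaultdict(set)   # topic -> set of client ids
        self._next_index = defaultdict(int)    # topic -> next message index

    def subscribe(self, topic, client_id):
        with self._lock:
            self._subscribers[topic].add(client_id)

    def unsubscribe(self, topic, client_id):
        with self._lock:
            self._subscribers[topic].discard(client_id)

    def next_index(self, topic):
        # Read-modify-write: without the lock, two threads could both read
        # the same value and hand out a duplicate index.
        with self._lock:
            index = self._next_index[topic]
            self._next_index[topic] = index + 1
            return index

    def subscribers(self, topic):
        with self._lock:
            # Return a copy so callers can iterate without holding the lock.
            return set(self._subscribers[topic])
```

In a single-threaded asyncio server the same methods would work with the lock removed, because nothing else can run between the read and the write unless the handler awaits in between.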

Related, and more consequential: delivery. The C++ server writes to each subscriber with a synchronous, blocking asio::write on its single thread, which means one slow or stuck subscriber can stall the entire event loop for everyone. That is a head-of-line blocking problem the other implementations avoid in different ways (buffered fire-and-forget in Python and JS, an unbounded channel in Rust, a hand-rolled unbounded queue in Go). Finally, C++ stores its per-topic message index in a plain 32-bit int, so it would overflow after 2^31 - 1 (about 2.1 billion) messages where the others would not. I did not generate two billion messages to confirm that one.
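The queue-per-subscriber idea is worth a small sketch, because it is what turns a stuck subscriber from everyone's problem into that subscriber's problem. The code below is an asyncio analogue of the unbounded-channel approach, not code from any of the servers; Subscriber, publish and demo are hypothetical, and an asyncio.Event stands in for a peer that has stopped reading.

```python
import asyncio

class Subscriber:
    """One unbounded queue per subscriber, drained by that subscriber's own task."""

    def __init__(self, gate=None):
        self.queue = asyncio.Queue()   # unbounded, so put_nowait never blocks
        self.gate = gate               # optional Event simulating a stuck peer
        self.delivered = []

    async def drain(self):
        while True:
            msg = await self.queue.get()
            if self.gate is not None:
                await self.gate.wait()         # stand-in for a blocked socket write
            self.delivered.append(msg)
            self.queue.task_done()

def publish(subscribers, msg):
    # Enqueue only. No subscriber, however slow, can stall the publisher.
    for s in subscribers:
        s.queue.put_nowait(msg)

async def demo():
    gate = asyncio.Event()
    fast, stuck = Subscriber(), Subscriber(gate)
    tasks = [asyncio.create_task(s.drain()) for s in (fast, stuck)]
    for i in range(3):
        publish([stuck, fast], i)              # the stuck subscriber is served first
    await fast.queue.join()                    # fast finishes while stuck is blocked
    snapshot = (list(fast.delivered), list(stuck.delivered))
    gate.set()                                 # the stuck peer starts reading again
    await stuck.queue.join()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return snapshot, list(stuck.delivered)
```

Calling asyncio.run(demo()) shows the fast subscriber receiving all three messages while the stuck one has received none, and the stuck one catching up in order once it is released. The cost is the word "unbounded": a subscriber that never comes back keeps its queue growing until something else cleans it up.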

Conclusion

Every divergence I found falls into one of two buckets. Either it is a language or library default that leaks through — key ordering, JSON whitespace, UTF-8 error policy, JavaScript's one number type — or it is a small implementation choice that nobody thought to align, like Python's remove versus discard, or where each server puts its quit check. None of them are deep algorithmic disagreements. They are all the kind of thing you would never notice until you put the byte stream under a microscope.

The larger lesson is about what a test suite actually is. Mine encoded the behaviours I thought to assert, and the five servers converged on exactly those behaviours and no others. Wire-compatibility, it turns out, is not the same as behavioural-compatibility; it is behavioural-compatibility on the inputs you tested, which is a much smaller and more fragile claim. "The same program, written five times" is a comforting story, but it stops being true the moment you look closely — and a shared green test suite is very good at stopping you from looking closely.