The Wrong Boundary

It's pretty easy to find people on IRC who've seen new programmers blame everything but themselves when something goes wrong: console I/O is broken, the library is wrong, the database is wrong, the pointers are wrong, the JVM has a bug that's somehow survived for twenty years. Meanwhile, the more experienced developers sigh, point out the flawed strategy the code uses, and note that the fix is easy once the approach is corrected. One of the hallmarks of being a junior developer is looking everywhere but the mirror.

But experts make a similar mistake. They tend to be better about boundary conditions than junior developers, yet they suffer the same underlying problem: by prematurely defining where the problems can be, we miss where the problems are.

Is it in the room with us right now?

Junior programmers mess up paradigms because they think they're more experienced than they actually are. They read something from an input stream, for example, and think "well, I read it" - without realizing the stream still has more to read, corrupting their future reads unless they clear out everything they don't want. Most experienced programmers remember this one as a classic error.
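In Java, the canonical version of this mistake is mixing `Scanner.nextInt()` and `Scanner.nextLine()`. Here's a minimal sketch of the bug and its fix (the input string stands in for console input):

```java
import java.util.Scanner;

public class LeftoverNewline {
    public static void main(String[] args) {
        // Simulated console input: a number, then a line of text.
        Scanner in = new Scanner("42\nhello world\n");

        int n = in.nextInt();        // reads "42" but leaves "\n" in the stream
        String line = in.nextLine(); // consumes only the leftover "\n"
        System.out.println("line = [" + line + "]"); // prints "line = []"

        // The fix: clear the rest of the current line before reading the next one.
        Scanner in2 = new Scanner("42\nhello world\n");
        int n2 = in2.nextInt();
        in2.nextLine();                // discard the trailing newline
        String line2 = in2.nextLine(); // now reads "hello world"
        System.out.println("line2 = [" + line2 + "]");
    }
}
```

The stream wasn't broken; the read just didn't consume everything the programmer assumed it did.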

They end up blaming the system for not working the way they think it should. To some degree, perhaps they're right: their assumptions have merit. But someone has to make the system work the way the programmer expects, and there are often concrete reasons it doesn't already.

Those junior programmers are struggling with reality and understanding; they trust themselves too much and the system too little.

ByteCode.News is a connected system; this website is connected to IRC, Slack, and Discord as well as the web. In the IRC adapter, I just spent three days worrying over the speed of queue processing: the adapter was sending responses back every two seconds or so, which is impossibly slow when the internal message queue processes each message in around 4ms. I tried adaptive delays, I tried input throttle measurements, I measured throughput to see if the production system just had something configured incorrectly.

It was all wasted effort. It was me - a classic "ID10T" error. My IRC client was throttling outgoing messages, so I was sending at a rate of one every two seconds without realizing it: the client showed my messages to me instantly, and timestamped them when I queued them, not when they were actually sent.
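The instrumentation that would have caught this is to timestamp a message twice: once when it's queued, once when it actually goes out, and compare. A minimal sketch, with a hypothetical send queue and a `Thread.sleep` standing in for the client's anti-flood delay:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SendTiming {
    record Outgoing(String text, long queuedAtNanos) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Outgoing> queue = new LinkedBlockingQueue<>();

        // Producer: queue three messages instantly.
        for (int i = 0; i < 3; i++) {
            queue.put(new Outgoing("msg " + i, System.nanoTime()));
        }

        // Consumer: simulates a client that throttles sends (100 ms apart here).
        while (!queue.isEmpty()) {
            Outgoing msg = queue.take();
            Thread.sleep(100); // stand-in for the client's anti-flood throttle
            long lagMs = (System.nanoTime() - msg.queuedAtNanos()) / 1_000_000;
            System.out.println(msg.text() + " queued->sent lag: " + lagMs + " ms");
        }
    }
}
```

The lag grows with each message, because the queue drains slower than it fills - exactly the signature a queued-vs-sent comparison exposes and a queued-only timestamp hides.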

I observed poor throughput, and because I trusted the system, I assumed my code was wrong and wasted time trying to find the error so I could fix it. I trusted the wrong thing, just like a junior programmer: I assumed my tests were wrong, my results were wrong, that I was wrong - and I was, but the error was in where I'd located the problem.

A single IRC chat client configuration cleared everything up.

The setting, if you're interested, is from weechat: irc.server.libera.anti_flood.

What the measurement wasn't saying

The error observation wasn't incorrect: the time nevet took to respond to my instructions really was slow. It was my scoping that was wrong, because I didn't know what (or whom) to ask. The data was there - nevet has logs that actually did record the incoming time - I just never thought to scan the logs in detail for the input, only the output.

It's a subtler version of the junior programmer mistake. "The JVM is broken" is at least falsifiable - you can point the junior dev at the test kit, or a body of work on the subject they're struggling with. But "my adaptive sender isn't working well enough yet" is a hypothesis that appears to engage with the evidence, while actually just reasserting the original attribution. I was measuring inside a boundary I'd drawn poorly, and the measurements couldn't tell me that, because they couldn't see outside it.

It's "show me where in this circle the problem exists," while the problem is the eraser on the pencil drawing the circle.

Questions Shape Their Answers

Programmer Barbie says "problem definition is hard!", and she's not wrong. Problem definition is hard, because every one of us carries assumptions from our experience and our environment. We ask "what's happening?" and get the view from 10,000 feet - from a specific observer, maybe ourselves - and that view can fail to account for... almost anything. If you're not used to thinking about how many packets a record takes on the wire, you're not going to realize that 1,501 bytes can take twice as long as 1,500 bytes to transmit.

This is related to the MTU - the maximum transmission unit, the largest packet Ethernet will carry. Once your payload exceeds the MTU, you need another packet. The standard Ethernet MTU is 1500 bytes, so if you send 1501 bytes in a single network transaction, you have to send two packets instead of one, for that extra byte.
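The arithmetic is just ceiling division; a quick sketch (simplified - in practice IP and TCP headers also count against the 1500, so the usable payload per frame is smaller):

```java
public class PacketCount {
    // Standard Ethernet MTU: 1500 bytes per frame (header overhead ignored here).
    static final int MTU = 1500;

    // Ceiling division: how many frames a payload of the given size needs.
    static int framesNeeded(int payloadBytes) {
        return (payloadBytes + MTU - 1) / MTU;
    }

    public static void main(String[] args) {
        System.out.println(framesNeeded(1500)); // 1
        System.out.println(framesNeeded(1501)); // 2 - one extra byte, one extra frame
        System.out.println(framesNeeded(3000)); // 2
    }
}
```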

Once you know, it's easy to fix: stay within the 1500 bytes unless you have no other choice! Look for what you can optimize; it's usually there. (This is why JSON usually wins over XML in wire performance: it's often smaller... and why protobuf wins over JSON. It's all a tradeoff among verifiability, readability, and memory consumption.)

The hard part of all of this is recognition, and it's never going to be free. I was doing the problem analysis with AI agents, trying to find that external perspective that would clear it all up, and not one of them questioned the problem's scope, because I declared the problem and the LLMs assumed I knew more about what I was talking about than I did.

Once I explicitly widened the scope and asked the right question, the fix was 30 seconds away.

Locate first

The discipline this points to is simple to state and easy to skip: locate the problem before you characterize it. Really locate it. The problem isn't "my code is slow" - where, specifically, is time being spent? Have you measured? Do the measurements meet your expectations in each environment? Why?

That's a story from the past, too: a test server and a production server were both running a scenario in the same amount of time, despite the test server being vastly constrained compared to the production system. The problem was database colocation: the test server ran DB2 on the same machine as WebSphere, while production ran WebSphere in Connecticut with the database in the Midwest. The time was being lost to network traffic, quite literally. Yet the problem statement was "why are we paying for a killer prod system when the test system runs just as well?"

The stories we come in with affect everything we do, and they affect us for our entire lives. The junior dev's story is that the environment is hostile; the senior engineer's story is that their own code is the most likely culprit, because that's the story each has lived. Both stories feel like reasonable priors. Both can cause you to instrument around the actual problem rather than toward it.

Your observation tells you something is wrong. It doesn't tell you what. Treating those as the same thing is where the time goes.
