The engineering lead on the IRS Tax Withholding Estimator published "XML is a cheap DSL," arguing that XML was the right choice for their declarative tax logic engine - that it beat JSON, YAML, s-expressions, and Prolog on the "cheap universal tooling" axis - swimming directly against the "XML is great, can we all just use something else" stream.
It's been said - by yours truly - that XML lands firmly in the "tolerate it when you must" category: enterprise integration, legacy systems, SOAP endpoints you didn't design and can't escape. When it's forced upon you, really. And that stance may still be the right call for most greenfield work.
The article on XML being a cheap DSL is worth reading. It's also worth naming what they actually built, because the piece doesn't quite say it directly.
What the Fact Dictionary Is
The IRS Fact Graph encodes US tax logic as a dependency graph of named facts. Some facts are writable - they're user inputs. Some are derived - they're calculations that depend on other facts. The engine traverses the graph, determines which questions need to be asked given prior answers, and produces a final tax obligation. Auditability comes for free: because the structure is declarative rather than imperative, you can ask how any value was computed and trace the dependency chain all the way through the system. It's observable.
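That writable/derived split is easy to sketch. The following is a minimal toy, not the IRS's actual engine (which is its own codebase); the fact names, the flat 12% rate, and the deduction figure are all invented for illustration. The point is the shape: derived facts declare their dependencies, the engine resolves them in whatever order the graph dictates, and the dependency chain falls out as an audit trail.

```python
# Toy fact graph: writable facts are user inputs, derived facts are
# computed from other facts. Names and numbers are illustrative only.

class FactGraph:
    def __init__(self):
        self.writable = {}   # fact name -> value supplied by the user
        self.derived = {}    # fact name -> (dependency names, compute fn)

    def write(self, name, value):
        self.writable[name] = value

    def derive(self, name, deps, fn):
        self.derived[name] = (deps, fn)

    def get(self, name, trace=None):
        """Resolve a fact, recording the dependency chain for auditability."""
        if trace is not None:
            trace.append(name)
        if name in self.writable:
            return self.writable[name]
        deps, fn = self.derived[name]
        return fn(*(self.get(d, trace) for d in deps))

g = FactGraph()
g.write("wages", 50_000)
g.write("deduction", 14_600)
g.derive("taxableIncome", ["wages", "deduction"], lambda w, d: max(w - d, 0))
g.derive("taxOwed", ["taxableIncome"], lambda t: round(t * 0.12))

trace = []
owed = g.get("taxOwed", trace)
print(owed)    # 4248
print(trace)   # ['taxOwed', 'taxableIncome', 'wages', 'deduction']
```

Note there is no imperative "first compute income, then apply the rate" anywhere: asking for `taxOwed` pulls the rest of the graph in, and the trace is the answer to "how was this value computed."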
That is RDF. Not "inspired by" RDF, not "similar to" RDF — that is the Resource Description Framework's core design: named resources, typed relationships, a reasoning engine that can traverse the graph and draw inferences without being explicitly told the execution order. They could have used RDF directly. They did not. They probably made the right choice.
Why RDF Didn't Win Here
RDF isn't a bad idea, at all. It's solving a genuine and hard problem: how do you build a machine-readable knowledge graph that can reason about facts without the context of existing as a biological entity that observes the world directly? Computers can't look out the window. They need everything made explicit. They can't feel the rain and think "oh, water is wet" like we can; to determine water is wet is to have a definition of water, of wetness, of the quality and density of rain, possibly relative humidity, maybe even vision. (Can you tell it's raining when it's pitch dark? Of course you can, but why? What if you're under a waterfall in a cave?) RDF provides an ontological model that lets a machine reason from stated axioms — and for the semantic web vision of globally federated knowledge, the machinery it brings is genuinely necessary.
The machinery is also enormous. RDF's XML serialization is notoriously unreadable, and most of that pain is namespaces. The namespace system exists to prevent term collision across independently maintained ontologies — so that tax:totalOwed from the IRS doesn't collide with tax:totalOwed from some other domain using the same vocabulary. (Taxes owed to a vendor aren't the same as taxes owed to a government... or are they? Which government? When?)
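The collision the namespace system prevents is easy to demonstrate. In the sketch below - URIs and element names invented for illustration - two vocabularies both define `totalOwed`, and the namespace URI is the only thing keeping them distinct:

```python
# Two vocabularies, one local name: the namespace URI disambiguates.
# URIs and element names are made up for illustration.
import xml.etree.ElementTree as ET

doc = """
<statement xmlns:irs="https://irs.example/tax#"
           xmlns:vendor="https://vendor.example/billing#">
  <irs:totalOwed>1200</irs:totalOwed>
  <vendor:totalOwed>85</vendor:totalOwed>
</statement>
"""
root = ET.fromstring(doc)

# ElementTree expands each tag to {namespace-uri}localname,
# so the two totalOwed elements never collide.
for el in root:
    print(el.tag, el.text)
# {https://irs.example/tax#}totalOwed 1200
# {https://vendor.example/billing#}totalOwed 85
```

That's the whole value proposition - and the whole cost. Every element in every document drags its URI around forever, whether or not a second vocabulary ever shows up.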
For a globally federated knowledge graph, that's essential. For a closed-domain application where you own all the terms, it's complexity with no return. The IRS Fact Dictionary doesn't need xmlns:fact="https://irs.gov/factgraph/2024#" on every element because it is never going to federate with DBpedia. You're unlikely to need full SPARQL queries against the knowledge graph, either. The structure is simple enough that schema validation is more than sufficient - "does this document have this structure and no other" - as opposed to ontological reasoning - "do these facts construct a valid tax calculation, and if so, what's the result? Go ahead, I'll wait."
The semantic web people solved a real problem with tooling about ten times more complex than a closed-domain application requires, because they were optimizing for the universal case. The IRS team got equivalent reasoning power by writing clean XML with a vocabulary they defined themselves. And they probably did it in about one-eighth the time that a simple ontology would have required.
The Validation Point
The piece doesn't emphasize schema validation - the Fact Dictionary is discussed mostly on readability and tooling grounds - but the capability is sitting there. XML's validation ecosystem (XSD, RelaxNG, Schematron) means the structure of every fact, the permitted child elements, the required attributes, the type constraints on values - all of it can be enforced at parse time, before any reasoning happens. In a system that encodes federal tax law, that's not academic. A malformed fact that slips through to the reasoning engine is worse than one that fails loudly at the schema boundary.
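The schema-boundary idea fits in a few lines. A real system would express this declaratively in XSD or RelaxNG rather than hand-rolling it, and the `Fact`/`Writable`/`Derived` element names here are assumed for illustration - but the shape of the guarantee is the same: "this structure and no other," enforced before the reasoning engine sees anything.

```python
# Hand-rolled stand-in for a schema check: a Fact must carry a path
# attribute and exactly one Writable or Derived child. Element names
# are illustrative, not taken from the real Fact Dictionary schema.
import xml.etree.ElementTree as ET

def validate_fact(xml_text):
    fact = ET.fromstring(xml_text)
    if fact.tag != "Fact" or "path" not in fact.attrib:
        raise ValueError("not a Fact with a path attribute")
    children = [c.tag for c in fact]
    if children not in (["Writable"], ["Derived"]):
        raise ValueError(f"expected one Writable or Derived child, got {children}")
    return fact

validate_fact('<Fact path="/wages"><Writable/></Fact>')   # passes quietly

try:
    validate_fact('<Fact path="/wages"><Writable/><Derived/></Fact>')
except ValueError as e:
    print(e)   # fails loudly at the boundary, not inside the engine
```

The difference in production is that XSD gives you this for the entire document, from a declaration, with off-the-shelf validators - you don't write the checks, you write the grammar.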
This is one of XML's genuinely underappreciated strengths: validation isn't bolted on, it's native to the ecosystem. JSON Schema exists and works, but it's an afterthought relative to the document model. XSD is a first-class citizen - experienced practitioners learn to automatically distrust documents that lack an explicit schema reference, because such documents are trivial to pollute.
What This Adds to the Format Conversation
The table in the serialization-formats piece puts XML in the "enterprise integration" row — tolerate it when the system on the other end requires it. That's still right for most decisions.
This is the exception that clarifies the rule: XML earns a chosen role when you need a declarative DSL over a domain with enough complexity to justify a reasoning engine, enough longevity to care about tooling universality, and enough auditability requirements to want introspection built into the structure rather than bolted on afterward.
The IRS had all three. Most applications don't. But it's useful to know what "XML is actually the right answer" looks like, so you recognize it when it shows up.
RDF is still the king of the hill for an actual knowledge graph - if you need to define relationships and vocabularies concretely, it's the tool to reach for. But it's a specialist's instrument: most of us don't even know what an ontology actually is. Sometimes all you want is a hammer, and here, XML is the hammer.