JSON, TOON, YAML, and more: considering data serialization formats

In a number of applications I work with, I use JSON everywhere, as a sort of lingua franca. I typically do this with Jackson, and with Spring Boot 4's primary use of Jackson 3, it hit me that I might want to consider migrating from Jackson 2 to Jackson 3.

I didn't need to do it. Jackson 2 was working, and I wasn't seeing huge problems in how Jackson 2 was using system resources. But reducing possible technical debt is good when you can do it, so I had to give it some thought before the runway ran out and the migration stopped being optional.

But then I realized... when I said I use JSON everywhere, I really mean it. API responses, message passing, jsonb columns, maybe even NoSQL with Mongo, internal payloads when object structures aren't the right answer... JSON is the water my communications swim in, for the most part, unless I have no choice but to use something else.

JSON is not alone, though. If I look at the root problems I am trying to solve, I need to consider my actual options, not just the choices I've found myself making because they're easy.

That leads me to, frankly, a lot of options: TOON, YAML, Parquet, hoary old XML, and even some newer alternatives like XferLang.

What Each Format Actually Is

Before comparing tradeoffs, it's worth being precise about what each format is for — because several of these are solving genuinely different problems, and conflating them leads to bad decisions. Bad decisions are, as it says on the tin, bad decisions and we'd like to avoid them if we can.

JSON (JavaScript Object Notation) is a text-based, human-readable serialization format built on two primitives: objects (key-value maps) and arrays. Its dominance comes from near-universal tooling support, native browser handling, and near zero-friction interoperability across language boundaries. The cost is verbosity: field names repeat in every object, numbers are stored as character sequences, and there is no native type for dates, binary data, or typed numerics. JSON is verbose by design, and that design has proven to be the right call for general interoperability. It's been extended in various ways to allow oddities like "comments" and "schema definitions," but support for these features tends to require, well, specific support.

YAML (YAML Ain't Markup Language) is technically a superset of JSON — any valid JSON is valid YAML — but in practice it occupies a different niche. YAML uses indentation-based structure and minimizes punctuation, making it significantly more readable for human-authored content. This is why it dominates in configuration files: Kubernetes manifests, GitHub Actions workflows, Ansible playbooks, Spring application configs. YAML is not a great choice for machine-generated programmatic serialization, however. Whitespace significance makes it a problem at scale, its type inference rules have surprising edges, and the parse overhead is nontrivial compared to JSON. It's probably best kept in a configuration-shaped box.
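
The superset relationship is easy to demonstrate. A YAML parser accepts both spellings of the hypothetical config below (the keys are invented for illustration, vaguely Spring-flavored):

```yaml
# Idiomatic YAML: indentation-based structure, minimal punctuation, comments allowed.
server:
  port: 8080
features:
  - search
  - comments

# The JSON spelling of the same document is also valid YAML:
# {"server": {"port": 8080}, "features": ["search", "comments"]}
```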

TOON (Token-Oriented Object Notation) is a compact, human-readable encoding of the JSON data model, designed specifically for LLM input. It combines YAML-style indentation for nested objects with a CSV-style tabular layout for uniform arrays. The key innovation is the array header: instead of repeating field names in every record, TOON declares them once with an explicit row count and field list, then streams values line-by-line:

factoids[3]{subject,predicate,accesscount,value}:
  json,description,12,JavaScript Object Notation
  json,url,5,https://json.org/
  toon,url,7,https://github.com/toon-format/toon

The same data in JSON repeats "subject", "predicate", "accesscount", and "value" three times each. TOON's benchmarks show roughly 40% token reduction versus pretty-printed JSON for uniform tabular data, and around 36% versus compact JSON. Crucially, TOON is still text and still human-readable — it is not a binary format. It ends up having a lot of the qualities of CSV without having the inflexibility of CSV.
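
To make the repetition concrete, here is a rough sketch comparing the TOON sample above against the same three records as compact JSON. This counts characters, not tokenizer tokens, so it only illustrates the shape of the savings, not the benchmark percentages:

```java
// Illustrative only: the data is the article's factoid example, hand-encoded
// both ways. Character counts are a crude stand-in for LLM token counts.
public class ToonSizeDemo {
    static final String COMPACT_JSON =
        "{\"factoids\":[" +
        "{\"subject\":\"json\",\"predicate\":\"description\",\"accesscount\":12,\"value\":\"JavaScript Object Notation\"}," +
        "{\"subject\":\"json\",\"predicate\":\"url\",\"accesscount\":5,\"value\":\"https://json.org/\"}," +
        "{\"subject\":\"toon\",\"predicate\":\"url\",\"accesscount\":7,\"value\":\"https://github.com/toon-format/toon\"}]}";

    static final String TOON =
        "factoids[3]{subject,predicate,accesscount,value}:\n" +
        "  json,description,12,JavaScript Object Notation\n" +
        "  json,url,5,https://json.org/\n" +
        "  toon,url,7,https://github.com/toon-format/toon";

    public static void main(String[] args) {
        // The field names appear once in TOON's header, three times each in JSON.
        System.out.printf("compact JSON: %d chars, TOON: %d chars%n",
                COMPACT_JSON.length(), TOON.length());
    }
}
```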

XML (Extensible Markup Language) is the elder statesman of this group. Tag-based, hierarchical, verbose to a degree that inspired an entire generation of programmers to invent JSON as a reaction. XML is not, however, dead. It has an ecosystem that nothing else in this list matches: XSD for schema validation, XSLT for transformation, XPath and XQuery for querying, SOAP for web services, and decades of enterprise tooling. Maven POMs are famously XML (and XML's verbosity is one of the drivers for Gradle's adoption, if truth be told, although it's not the only driver). Spring's legacy configuration is XML. Office OpenXML is, well, XML. If you're interfacing with enterprise systems, financial services infrastructure, or any domain that was built in the 2000s, XML is probably there waiting for you. Looming. Waiting to talk for what seems like hours.

Parquet is a columnar, binary storage format from the Apache ecosystem, optimized for analytics over large datasets. It is not a general-purpose application serialization format — comparing it to JSON or TOON for REST API payloads is a category error. Where it belongs is in data pipelines: reading a subset of columns from millions of rows, compressing time-series data, feeding analytics into Spark, Flink, DuckDB, or any modern data warehouse. Parquet files carry their schema with them, enforce types natively, and support predicate pushdown — the ability for a query engine to skip entire row groups without deserializing them. For anything that looks like a data lake, Parquet is the default answer. Parquet was an easy discard when considering data formats, because it's really designed for bulk data, and most of my applications work with relatively transient data or limited sets of data in a given process.

An important constraint worth naming explicitly: Parquet is a file format. It owns the container. When bulk data needs to live as a section inside a larger document — say, a flight operations payload that also carries metadata, routing information, and status fields — Parquet doesn't compose. You can't embed a Parquet table inside a JSON document and have either side be happy about it. We ran into exactly this on an airline data transfer project: there was a bulk section that looked like a Parquet candidate on paper, but it was embedded in a much larger document, the MVP timeline was real, and JSON won on simplicity without much argument. The columnar efficiency of Parquet only materializes when Parquet owns the whole artifact.

XferLang is the newest and narrowest entry here. It is a typed, delimiter-driven data format designed for serialization, data transfer, and configuration, with a design philosophy that asks: what if JSON had grown up with a stronger type system and a programmable parser? XferLang features explicit types (no type inference heuristics), interpolated strings, no escape sequences (you lengthen the delimiter instead of inserting backslashes), named bindings for dynamic value insertion, and extensible processing instructions that can carry metadata, perform conditional logic, or compose elements from external sources at parse time. The primary implementation is a .NET library. XferLang is more interesting as a possibility than as a candidate so far, much like Parquet or YAML but in a different space: its .NET heritage and limited implementations mean you'd be writing an implementation yourself rather than consuming an existing one. Worth knowing about to consider directions, but not ready for use in a JVM-centric application.

Strengths and Weaknesses

JSON

Strengths:

  • Universal tooling support across every language and framework - provided you limit yourself to standard JSON, although alternative specifications like JSON5 are popular enough to enjoy broad (if not universal) support.
  • Human-readable and debuggable without any additional tooling.
  • Native to the browser tier; nearly zero-friction cross-service interoperability for other tiers.
  • Schema-optional: you can emit JSON to a consumer and negotiate the contract later. (This is "good.")
  • Deep tooling ecosystem: JSON Schema, jq, JSONPath, jsonb in PostgreSQL, every document store.
  • Composable: bulk data, metadata, status fields, and nested structures all coexist naturally in a single document.

Weaknesses:

  • Verbose: field names repeat per record, numbers are character strings.
  • No native type system for dates, binary data, or typed numerics. Binary data is, um, possible but requires encoding.
  • Per-record overhead compounds badly for large uniform datasets.
  • Pretty-printed JSON is extravagantly wasteful at scale; compact JSON sacrifices readability.

YAML

Strengths:

  • Highly readable for human-authored content; less punctuation noise than JSON.
  • First-class choice for configuration files across the modern DevOps ecosystem.
  • Supports comments — a feature JSON stubbornly refuses to have (although, again, extensions for JSON exist, they're not part of the standard).
  • Jackson has a jackson-dataformat-yaml module, so JVM integration is available.

Weaknesses:

  • Whitespace significance is a persistent source of subtle bugs; indentation errors are silent in ways that JSON's bracket mismatches are not. YAML looks like a format beloved by Python programmers.
  • Type inference has surprising edges: yes, no, on, off, bare numbers — all have implicit type coercions that vary between YAML versions.
  • Not suitable as a machine-generated wire format; parse overhead is significant at volume.
  • The YAML spec is famously complex; implementations diverge in edge cases.

TOON

Strengths:

  • Roughly 40% fewer tokens than pretty-printed JSON for uniform arrays; ~36% versus compact JSON.
  • Retains full human readability — no decoder required to inspect a TOON file.
  • Lossless round-trip to and from JSON; same data model, different encoding.
  • Explicit structural metadata ([N] lengths, {fields} headers) helps LLMs parse and validate data reliably, with higher comprehension accuracy than JSON in benchmarks.
  • Growing multi-language ecosystem: official implementations in Java (JToon), Python, Go, Rust, .NET, Kotlin, and others.

Weaknesses:

  • Purpose-built for LLM input. That is both its strength and its principal constraint, although it's not limited to LLMs.
  • No general ecosystem for TOON-serialized REST APIs, TOON-native databases, or TOON as a service-to-service wire format.
  • Gains evaporate for non-uniform or deeply nested data. For semi-uniform structures (~50% tabular eligibility), TOON can be less efficient than compact JSON.
  • No Jackson dataformat module exists. JToon operates independently of the Jackson ecosystem; bridging the two requires manual coordination. This is huge.
  • Spec is at v3.0 and still evolving, though the format is described as stable.

XML

Strengths:

  • Unmatched enterprise ecosystem: XSD, XSLT, XPath, SOAP, XML namespaces, digital signatures.
  • Schema validation is a first-class feature; you can enforce structure strictly at parse time. This is a strength that many programmers ignore at their peril, and they keep ignoring it, and even when it bites them, they keep ignoring it, because XSD isn't trivial to write and it's really not trivial to write well.
  • Excellent for document-centric data with mixed content (text interspersed with markup).
  • Jackson has jackson-dataformat-xml; Spring's @RequestMapping can negotiate XML natively.
  • Not going away, ever: tooling, financial services, healthcare (HL7/FHIR), and legacy enterprise systems depend on it. When the cockroaches take over after mankind expires, they're going to have to learn XML to some degree.

Weaknesses:

  • Extremely verbose. Tags repeat on every open and close; attributes add another syntax layer.
  • No universal agreement on how to represent arrays; different serializers make different choices. Heck, different schema authors make different choices.
  • Parsing is heavyweight compared to JSON; the streaming vs. DOM trade-off adds architectural complexity. The bifurcated parsing mechanisms aren't just maddening, they can be madness itself, which is part of why libraries like Jackson and XStream make XML endurable.
  • Nobody reaches for XML voluntarily for greenfield service design anymore. Ever.

Parquet

Strengths:

  • Columnar storage: read only the columns you need from millions of rows, skipping everything else at the I/O layer.
  • Excellent compression: similar values are physically adjacent, making dictionary encoding and run-length encoding highly effective.
  • Self-describing: Parquet files carry their schema; consumers don't need out-of-band schema negotiation.
  • First-class support across the entire analytics ecosystem: Spark, Flink, DuckDB, Pandas, Arrow, Hive, Redshift, BigQuery.
  • Predicate pushdown: query engines skip entire row groups that don't satisfy filter conditions before deserialization.

Weaknesses:

  • Binary and not human-readable; debugging requires tooling. This is huge.
  • Wrong tool for transactional, record-at-a-time serialization. Reading a single row means decompressing column chunks.
  • Fixed overhead per file that doesn't amortize at small data volumes.
  • No place in a REST API or a message queue payload. At all.
  • Not composable: Parquet owns the file or it doesn't work. You cannot embed Parquet data as a section inside a larger document, at least not without considerable pain that offsets anything you'd gain from it.

XferLang

Strengths:

  • Explicit type system eliminates heuristic type inference; no "is this a string or a number?" ambiguity.
  • Delimiter-lengthening eliminates escape sequences entirely; content that would collide with a delimiter simply uses a longer one.
  • Programmable processing instructions enable metadata, conditional logic, and dynamic value composition at parse time without modifying the core grammar.
  • Clean, whitespace-tolerant syntax that is readable without ceremony.
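
The delimiter-lengthening idea is worth a sketch. To be clear, this is not XferLang's actual syntax — it's a toy illustration of the principle: rather than escaping a collision, keep lengthening the delimiter until it no longer appears in the content.

```java
// Toy illustration of delimiter lengthening, NOT real XferLang syntax:
// instead of inserting backslash escapes, grow the delimiter until the
// content can't collide with it.
public class DelimiterDemo {
    static String wrap(String content) {
        String delim = "\"";
        while (content.contains(delim)) {
            delim = delim + "\""; // lengthen instead of escaping
        }
        return delim + content + delim;
    }

    public static void main(String[] args) {
        System.out.println(wrap("plain text"));       // shortest delimiter works
        System.out.println(wrap("she said \"hi\"")); // delimiter grows to ""
    }
}
```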

Weaknesses:

  • Primary implementation is .NET; JVM support is not yet established.
  • Minimal adoption and ecosystem; this is an interesting design, not a production-standard interchange format.
  • The processing instruction system is powerful but adds conceptual surface area that most serialization use cases don't need.
  • Not something you'd deploy as a service boundary format today without accepting significant integration risk.

Jackson Support

This is where the JVM ecosystem lens matters most.

JSON: Jackson is the de facto JSON library for the JVM. ObjectMapper, the streaming API, databind — all of it. With Jackson 3's immutable ObjectMapper (and the preferred JsonMapper), you get thread-safe shared instances without defensive copying, which eliminates an entire class of "just to be safe" bugs that accumulate in long-lived codebases.

YAML: Jackson has a jackson-dataformat-yaml module that works through the same databind infrastructure. You can register YAML as an additional MediaType in Spring and get content negotiation essentially for free. YAML is not a good choice for service-to-service payloads, but if you're also serving human-readable configuration or export endpoints, the module makes it easy to add.

TOON: No Jackson integration exists. JToon is a standalone implementation that operates outside the Jackson ecosystem. Bridging the two means serializing with Jackson to an intermediate object model and then encoding with JToon — workable, but not the MediaType registration and automatic content negotiation you'd get with a proper dataformat module. Worth watching; the ecosystem is moving quickly.

XML: Jackson has jackson-dataformat-xml, and Spring MVC integrates with it naturally. This is probably the most mature non-JSON Jackson dataformat module, and if you need to expose XML endpoints alongside JSON, it is the path of least resistance. The ability to be well-specified actually serves it very well here: you can say "this is what a message looks like," and know from the document if it's actually a valid message or not... if the schema is written well.

Parquet: Not a Jackson concern. Parquet is handled through Apache Parquet for Java (Hadoop ecosystem) or Arrow's Java bindings if you want a lighter path. DuckDB's JDBC driver will also read and write Parquet directly if you want to query it from a JVM application without the full Hadoop stack. Tossed out as a consideration.

XferLang: .NET-primary. No JVM library exists at the time of writing. An implementation could be spun up, but it would mean tracking a specification that's still under development; chasing the spec while also implementing a Jackson dataformat module on top of it makes this a non-starter.

Size Differentials and Data Scale

The honest answer here is that for most object serialization — the kind you do a thousand times a minute in a typical service — format choice barely registers. A bot message payload, a REST response for a single entity, a Mongo document representing a user session: these are small, mostly heterogeneous objects. JSON wins on tooling inertia alone, and the size difference between JSON and any alternative is noise.

Where the calculation starts to shift is at the edges: objects that are either very large, very numerous, or both.

For small objects — a user record, a bot message, a handful of fields — JSON is the correct default. The verbosity is negligible, every tool in your stack already speaks it, and no alternative saves you enough to justify the integration cost. This covers the overwhelming majority of what bytecode.news and most other application services actually serialize. You'd check how large the compressed objects were - watching for network packet size boundaries - but unless you're talking significant numbers, you just won't care.
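
Compression is also why raw verbosity matters less on the wire than it looks on the page: repeated field names in a uniform JSON array are exactly the kind of redundancy gzip eats for breakfast. A rough sketch with synthetic records (field names echo the factoid example from earlier):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Sketch: uniform JSON with repeated field names compresses extremely well,
// so compressed transfer size often shrugs off JSON's verbosity. The records
// here are synthetic, generated purely for the demonstration.
public class GzipDemo {
    static String buildJson(int records) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < records; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"subject\":\"s").append(i)
              .append("\",\"predicate\":\"p\",\"accesscount\":").append(i).append('}');
        }
        return sb.append(']').toString();
    }

    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = buildJson(500).getBytes();
        System.out.printf("raw: %d bytes, gzipped: %d bytes%n",
                raw.length, gzip(raw).length);
    }
}
```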

For moderate uniform datasets — a few hundred to a few thousand records with identical structure, particularly when those records are being fed to an LLM — TOON becomes genuinely interesting. The TOON benchmarks show roughly 60% token reduction versus pretty-printed JSON and 35–36% versus compact JSON for 100-row sample employee records. If you're assembling RAG context, building a prompt that includes search results or database rows, or paying per-token to a model provider, that reduction can be real money and real latency. For service-to-service communication without a token budget, JSON's tooling advantage still wins, and YAML is a non-starter for machine-generated payloads regardless of its token profile.

For large non-uniform datasets — complex nested structures, deeply heterogeneous records — TOON's gains actually reverse. Its tabular format only helps when objects are uniform; for semi-uniform structures (~50% tabular eligibility), TOON can use more tokens than compact JSON. Know your data before reaching for it. (Actually, know your data, period. Assumptions make donkeys of me and thee, or... I don't know, there's a saying. Assumptions are bad.)
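
"Know your data" can be made measurable. Here's a rough proxy for tabular eligibility — the fraction of records sharing the most common field set. The metric is my own invention, not anything from the TOON spec, but it's the kind of check worth running before reaching for a tabular encoding:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A hypothetical uniformity check: what fraction of records share the most
// common field set? This is a homegrown stand-in for "tabular eligibility,"
// not an official TOON metric.
public class UniformityDemo {
    static double tabularEligibility(List<? extends Map<String, ?>> records) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Map<String, ?> r : records) {
            counts.merge(r.keySet(), 1, Integer::sum); // tally each distinct shape
        }
        int best = counts.values().stream().max(Integer::compare).orElse(0);
        return records.isEmpty() ? 0.0 : (double) best / records.size();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> recs = List.of(
                Map.of("id", 1, "name", "a"),
                Map.of("id", 2, "name", "b"),
                Map.of("id", 3, "name", "c", "extra", true));
        // Two of the three records share the {id, name} shape.
        System.out.println(tabularEligibility(recs));
    }
}
```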

For bulk analytical data — millions of rows, time-series exports, standalone data pipeline outputs — neither JSON nor TOON nor YAML is the right tool. This is Parquet's territory, and it's not even close. A 100-million-row dataset in pretty-printed JSON is an unusable artifact. The same data in Parquet is queryable column-by-column without loading the full dataset into memory, compresses aggressively, and is natively understood by every analytics tool worth mentioning. The fixed overhead per Parquet file is real but amortizes quickly at volume.

The qualifier standalone matters. On a project with a significant data transfer requirement, we had a section of bulk operational data that looked like a Parquet candidate — uniform structure, meaningful record counts. We entertained it for about five minutes. The bulk section was embedded inside a much larger document that also carried other types of data, including extensible datasets, and the MVP timeline was real. Parquet doesn't embed. You can't drop a Parquet table into a JSON field and transmit it as part of a larger payload in any practical sense. JSON won on composability and simplicity, and the decision took less time than writing this paragraph.

The lesson: Parquet's advantages only materialize when it owns the whole artifact. The moment bulk data is a section of something larger rather than the thing itself, you're back to JSON.

A rough heuristic, collapsing all of this into something actionable:

Scale                   | Use Case                                   | Format
------------------------|--------------------------------------------|------------------
Small objects           | API responses, bot messages                | JSON
Human-authored config   | Configuration files                        | YAML
Medium uniform arrays   | LLM prompts, context windows               | TOON
Medium uniform arrays   | Service-to-service, storage                | JSON (with jsonb)
Large batch / analytics | Standalone pipelines, warehouses, exports  | Parquet
Enterprise integration  | SOAP, HL7, legacy systems                  | XML

The table is a starting point, not a decision tree. The real question is always: who consumes this data, and what do they need from it? Token efficiency only matters if tokens cost something. Columnar access only matters if your read patterns are columnar. Human readability only matters if humans are actually reading it. And Parquet only helps when it owns the document.

Worth noting: in bytecode.news, most message surfaces are JSON - but most messages are actually Java objects. They're not serialized at all. This is important, because spending a lot of time fretting over choices that don't actually impact your architecture is time wasted. I wanted to think about it mostly because I really like to think about things like this.

The JSONB Angle

Worth addressing directly for the PostgreSQL users in the room: if you're using jsonb columns, you're already getting binary storage internally. Postgres parses the JSON at write time and stores a decomposed binary representation that supports GIN indexing and native JSON operators. The format you put in is JSON text; the format Postgres stores is already optimized. What jsonb doesn't change is the application-side wire cost — you're still transmitting text JSON to and from the database, and you're still doing it via Jackson or whatever serialization layer sits between your objects and the wire.

TOON doesn't slot into this picture cleanly, and I say this as someone who had exactly the thought "wait, should I be storing TOON in Postgres instead?" The answer is: not unless you're willing to lose all JSON operator support. (I am not.) Storing TOON in a bytea column gives you a blob with no query-level semantics. There's no toonb, no TOON-aware operators, no TOON GIN index. If you want structured storage with query support, jsonb is already doing roughly what you'd want and more - the binary optimization happens transparently. What you get from JSON on the application side is what you get.

When to Reach for Each

Stay with JSON when:

  • Your consumers are heterogeneous: browsers, mobile clients, third-party integrators. This is most use cases, frankly.
  • You need the full tooling surface: validation, schema enforcement, debugging, logging, although "validation" and "enforcement" don't always mean what people familiar with XML think they mean. (Trust me on this, or don't, but... you should trust me on this.)
  • Object sizes are small and record counts are modest — meaning: most of the time.
  • You want Jackson doing the heavy lifting, including the immutability improvements in Jackson 3.
  • Bulk data is a section of a larger document rather than a standalone artifact.

Reach for YAML when:

  • You are writing configuration files that humans will read, edit, and commit to source control.
  • You need comments in your serialized data and that requirement is non-negotiable.
  • You want readable export formats and Jackson's YAML module is already in your dependency tree.
  • Emphatically not for machine-generated service payloads.

Consider TOON when:

  • You're constructing LLM prompts that include structured data, especially uniform arrays.
  • You're paying per-token costs and context window real estate matters — RAG pipelines, large context assembly, batch LLM jobs.
  • Your data is genuinely uniform: 100% tabular eligibility is the sweet spot, and gains fall off sharply below ~70%.
  • You can accept a library that isn't yet Jackson-native and handle the bridging yourself.

Tolerate XML when:

  • You're interfacing with systems that require it and you have no leverage over that choice: SOAP endpoints, HL7 feeds, financial messaging, Maven POMs.
  • You need schema validation as a load-bearing feature and XSD is acceptable overhead. (In other words: when it really counts, you reach for XML. Seriously.)
  • You're in a domain where the ecosystem is XML-native and the cost of fighting it exceeds the cost of using it.
  • Otherwise: do not reach for it voluntarily.

Reach for Parquet when:

  • You are building data pipelines, analytics exports, or anything that feeds a warehouse or a lake — and the data is the whole document, not a section of one.
  • Data volume is large enough that columnar access patterns and compression pay off — and that threshold might be lower than you think.
  • You need typed schemas enforced at the storage layer, not just at the application layer.
  • Your consumers are analytics tools — Spark, DuckDB, Pandas, BI platforms — not application services.

Watch XferLang when:

  • You're interested in where the serialization design space is heading, particularly around typed formats.
  • You're in a .NET-primary environment and want stronger typing guarantees than JSON provides.
  • Do not deploy it as a JVM service boundary format today; there's nothing to deploy.

The Honest Conclusion

I was not about to rip out JSON from a working system on the strength of a sideways thought during a Jackson migration. I considered it for about the time it took to blink, but... for bytecode.news pipeline payloads, nevet bot messages, and IRC log records, JSON is the right call. The objects are small, the record counts are modest, the primary cost is latency, and the format overhead is noise.

But the thought experiment landed somewhere useful anyway, because it forced the question: am I using JSON everywhere because it's right everywhere, or because it's easy everywhere? Those aren't the same thing.

TOON is solving a genuine problem in a domain that's growing fast. If you're assembling LLM context from database queries, or building RAG pipelines that need to fit large structured datasets into a token budget, TOON deserves a look. JToon means the JVM isn't left out — it just isn't Jackson-native yet, and that's the friction point for a Spring-based application. Watch for a Jackson dataformat module; if and when that appears, the integration cost drops considerably. And watch Spring AI, as well; Spring AI leverages toolchain usage automatically, and Spring may consider the problem worth solving itself.

YAML deserves the credit and space it gets - it's more or less the preferred way to write configuration these days. (I'm ignoring my use of application.properties as much as I can - recent projects use YAML.) It's awesome for configuration, where its readability advantage over JSON is real and its weaknesses (whitespace sensitivity, type coercion edges) don't apply.

And Parquet: if you're producing JSON files for standalone analytics consumption and nobody has pushed back yet, they're probably quietly suffering. The short version is: don't emit JSON to your data warehouse. But equally: don't reach for Parquet when your bulk data is a passenger inside a larger document rather than the whole vehicle.

The decision to use a format everywhere should be deliberate, not default. Knowing what JSON is not optimized for is as useful as knowing what it is.
