Handling Tokenization and Structured Inputs in LLMs

Ciprian Ciprian · 17 min read

Large Language Models don’t “see” JSON, CSV, XML, or Markdown the way we do.

At the API layer, you send something that looks nicely structured. Inside the model, it all gets flattened into the same thing: a sequence of token IDs. No trees, no rows and columns, no JSON objects. Just tokens.

And yet, when you actually benchmark different input formats on the same data, you get:

  • Massive differences in token usage (30–60% swings)
  • Non-trivial differences in accuracy (10–20 percentage points)
  • Big impacts on latency and cost at production scale

This post is about that input side only:

  • How the tokenization pipeline actually works
  • How different formats (JSON, CSV, Markdown, XML, code) get turned into tokens
  • How tokenization strategies differ across model families
  • Why format choice changes behavior even though the pipeline is “uniform”
  • Practical ways to optimize inputs without touching model weights

Output schemas, JSON mode, function calling, etc. I’ll leave for a separate post on output management.


1. The tokenization pipeline: from text to vectors

Let’s start with the boring bit that secretly controls everything.

1.1 What really happens when you call the API

When you send data to an LLM via OpenAI, Anthropic, etc:

  1. Transport layer: Your request is JSON (the HTTP payload), but that’s just transport. The model doesn’t see this outer JSON structure as “data”. The server extracts the relevant text fields out of your messages and passes those strings to the tokenizer.

  2. Tokenization: The tokenizer converts UTF-8 text into a sequence of integer ids: ["Alice", " is", " 30", " years", " old", "."] → [1234, 532, 981, …]

  3. Embedding lookup: Each token id indexes an embedding matrix: token_id -> vector in ℝ^d. GPT-2 had 50,257 tokens × 768 dims; modern frontier models use vocabularies of 100k+ tokens and embedding dimensions in the thousands.

  4. Add position: Transformers don’t know token order by default, so we inject positional information (sinusoidal encodings in the original Transformer, RoPE in modern models like LLaMA, GPT-NeoX, PaLM, etc.).

  5. Transformer stack: Those position-aware vectors go through dozens to hundreds of layers of attention + feed-forward. At this point the model has some internal representation of the content.

  6. Prediction: The model predicts the next token id, then the next, and so on.

Notice what’s missing: there is no JSON parser, no CSV reader, no HTML DOM, no SQL engine in this flow. All the structure we care about is flattened at step 2.
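The first two steps can be sketched end-to-end in a few lines. Everything here is illustrative: the vocabulary is tiny and hand-picked, the "embeddings" are fake 4-dim vectors, and real tokenizers use learned merge tables rather than greedy longest-match. The shape of the flow is the point.

```python
# Toy sketch of steps 2-3: text -> token ids -> vectors.
# Vocabulary and embeddings are made up for illustration.
vocab = {"Alice": 0, " is": 1, " 30": 2, " years": 3, " old": 4, ".": 5}
embeddings = {tid: [0.01 * tid] * 4 for tid in vocab.values()}  # fake 4-dim vectors

def tokenize(text: str, vocab: dict) -> list[int]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"untokenizable text: {text!r}")
    return ids

ids = tokenize("Alice is 30 years old.", vocab)
vectors = [embeddings[i] for i in ids]  # step 3: embedding lookup
print(ids)  # [0, 1, 2, 3, 4, 5]
```

From here on, the model only ever sees `ids` and the vectors they index; the original string is gone.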


1.2 Tokenizer flavors (and why you should care)

The core tokenizer algorithm varies by model family, but the pattern is the same: compress text into a limited vocabulary so sequences aren’t absurdly long.

Very condensed overview:

  • Byte Pair Encoding (BPE) – used in GPT-style models

    • Start from characters/bytes
    • Iteratively merge the most frequent pairs until you hit a target vocab size
    • Frequent words → 1 token, rare/long words → multiple tokens
    • Simple and effective, but can fragment some languages heavily
  • WordPiece – used in BERT, some earlier Google models

    • Similar to BPE, but merges chosen with a likelihood objective
    • Uses ## for word continuations ("play", "##ing")
    • Loses exact spacing, bad if you need perfect round-tripping
  • SentencePiece (BPE / Unigram) – used in LLaMA 1/2, Gemini, T5, Mistral

    • Works directly on raw Unicode, including spaces
    • Often uses a special "▁" character to mark word boundaries: "▁Hello▁World"
    • Unigram mode starts with a huge vocab, prunes down by likelihood
    • Good multilingual coverage, decent compression
  • Byte-level BPE – GPT-2+ style

    • Alphabet is raw bytes (0–255)
    • No “unknown” tokens; any Unicode sequence is representable
    • Some tokens correspond to partial UTF-8 characters; decoding must be done over full sequences

Recent tokenizer evolutions (e.g. LLaMA 3’s tiktoken-based BPE with a larger vocab) mostly chase better compression and more balanced multilingual behavior: fewer tokens for the same text, especially outside English.
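The merge loop at the heart of BPE fits in a few lines. This is a toy version on a tiny word list; real tokenizers train on bytes over enormous corpora and store the merges as a ranked table, but the "find the most frequent adjacent pair, fuse it, repeat" logic is the same.

```python
from collections import Counter

def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Each word starts as a sequence of single characters.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge to every sequence.
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
# [('l', 'o'), ('lo', 'w')]
```

After enough merges, frequent words collapse into single tokens while rare words stay fragmented, which is exactly the frequent-word/rare-word asymmetry described above.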


ELI5: Tokenization in one sentence

No matter what format you send, the model ultimately processes it as one flat sequence of small pieces (tokens), not as the structure you see.


2. Structured formats get no special treatment

This is the part that trips people up: JSON, CSV, YAML, XML, Markdown, code – they all go through the same tokenizer.

The model is not “parsing JSON”. It is just seeing tokens.

2.1 JSON

Input:

{"name": "Alice", "age": 30}

At the tokenizer level this becomes something like:

  • {
  • "name"
  • ": "
  • "Alice"
  • ", "
  • "age"
  • ": "
  • 30
  • }

The exact split depends on the vocab (some patterns get merged), but conceptually:

  • Braces, quotes, colons, commas = normal tokens
  • Keys and values = normal tokens

The model does not build:

{"name": "Alice", "age": 30}

as an internal dict. It just processes the token sequence. Any “JSON understanding” emerges because it has seen millions of similar sequences and learned that:

  • Keys are strings before :
  • Values follow :
  • Braces should match

But that’s pattern learning, not an explicit JSON engine.
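You can make the fragmentation visible with a crude pre-tokenizer. The regex below is a stand-in, not any real tokenizer's rules (tiktoken's pre-tokenization is far more elaborate), but it shows the ratio that matters: for this tiny object, most of the pieces are punctuation and whitespace, not data.

```python
import re

# Crude stand-in for a tokenizer's regex pre-tokenization step.
# Real pre-tokenizers (e.g. tiktoken's) use much more elaborate patterns.
PRETOK = re.compile(r'\s+|\w+|[^\w\s]')

def rough_pieces(text: str) -> list[str]:
    return PRETOK.findall(text)

print(rough_pieces('{"name": "Alice", "age": 30}'))
# ['{', '"', 'name', '"', ':', ' ', '"', 'Alice', '"', ',',
#  ' ', '"', 'age', '"', ':', ' ', '30', '}']
```

Eighteen pieces, of which only four (name, Alice, age, 30) carry the actual data. The rest is structural overhead the model has to wade through.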

2.2 YAML

YAML is just different punctuation for similar semantics:

name: Alice
age: 30

Tokenization sees:

  • "name"
  • ":"
  • " Alice"
  • newline
  • "age"
  • ":"
  • " 30"

Indentation is just spaces. Lists (- item) are just a dash and some text.

If the model seems to “understand YAML”, it’s because it’s learned typical patterns like key: value and indentation, not because it’s running a YAML parser.

2.3 Markdown

Markdown shows up all over training corpora:

## Title

- Item 1
- Item 2

The tokenizer just sees:

  • ##
  • Title
  • newlines
  • -
  • Item
  • 1

The model has seen ## at the start of a line enough times to infer “this is probably a heading”, and - at the start of a line often means “bullet list item”.

Again: emergent understanding from repeated patterns, not a dedicated Markdown renderer.

2.4 HTML / XML

Same story:

<title>Hello</title>

Tokens might be:

  • <title>
  • Hello
  • </title>

or the tag split into smaller units.

The model learns that <title> wraps short strings, that tags usually close, etc. No DOM, just well-learned text patterns.

2.5 CSV and tabular data

CSV:

name,age,role
Alice,30,Engineer
Bob,28,Designer

Tokenization:

  • "name", ",", "age", ",", "role", newline
  • "Alice", ",", "30", ",", "Engineer", newline
  • "Bob", ",", "28", ",", "Designer", newline

No rows, no columns, no schema – just sequences with commas and newlines. The model can infer that the first line is headers and subsequent lines are rows, but that’s learned behavior, not a first-class data structure.


ELI5: “Does the model know this is JSON?”

Not in the way your code does.

Your parser builds a clean object with fields and values. The model just sees a bunch of characters it has seen before.

It acts like it knows JSON because it has memorised that:

  • braces go together
  • key: value is a thing
  • strings look like "this"

…but all of that lives inside neural weights, not an actual JSON tree.


3. Tokenization strategies across model families (from the input’s point of view)

Different model families tokenize your input differently. That matters because:

  • It changes how many tokens you pay for
  • It changes how fragmented certain entities (numbers, IDs, non-English words) become
  • It subtly affects how easy certain patterns are to learn

Here’s a fly-over from the input side.

3.1 GPT / tiktoken

OpenAI’s tiktoken is the reference point for speed and practicality:

  • Implemented in Rust, very fast
  • Uses byte-level BPE with regex pre-tokenization
  • Has dedicated special tokens like <|im_start|>, <|im_end|>, <|endoftext|> for control flow

Newer encodings (like the one used for GPT-4o) increase vocab size (e.g. 200k tokens) to reduce tokens per word, especially for non-English languages. Bigger vocab = fewer, more specific tokens = shorter sequences for the same content.
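If you have tiktoken installed, you can inspect this yourself. A quick sketch (token counts vary by encoding, and the first call fetches the encoding data, so this degrades gracefully if the library or network is unavailable):

```python
# Requires `pip install tiktoken`; falls back gracefully otherwise.
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
    ids = enc.encode('{"name": "Alice", "age": 30}')
    print(len(ids), "tokens")
except Exception:
    ids = None
    print("tiktoken unavailable")
```

Run the same string through different encodings (`cl100k_base`, `o200k_base`) and you'll see the newer, larger vocabularies produce fewer tokens for the same content.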

3.2 LLaMA and friends

LLaMA 1/2 used SentencePiece with 32k vocab, with "▁" indicating a word boundary. LLaMA 3 moved to a tiktoken-style BPE with a larger vocab to improve compression and multilingual handling.

From your data’s perspective:

  • Same string will tokenize to a different number of tokens across LLaMA 2 vs LLaMA 3
  • White-space and punctuation can behave slightly differently
  • Non-English and code usually get more efficient in newer tokenizers

3.3 Gemini / PaLM

Google models (PaLM, Gemini) tend to use very large SentencePiece Unigram tokenizers (e.g. 200k+ vocab):

  • Massive coverage across scripts
  • Better parity between English and non-English token counts
  • Potential for multiple valid segmentations; decoding picks the most likely one (Viterbi search)

If you care about non-English inputs, these tokenizers often give you shorter sequences than older GPT-style ones.

3.4 BERT / WordPiece

BERT is older but still influences how people think:

  • WordPiece with ~30k vocab, ## continuation tokens
  • Handles general language fine
  • Loses exact whitespace, so you can’t round-trip inputs perfectly – not ideal for code, configs, or anything where spacing matters

This is one reason modern generative models moved away from WordPiece: exact reproduction of input text matters a lot more now.

3.5 Mistral

Mistral models use SentencePiece with byte-fallback BPE and 32k vocabulary:

  • Byte-fallback guarantees every character is encodable (no UNKs)
  • Multiple tokenizer versions add control tokens for tool/function use

For you, this means: similar behavior to other SentencePiece-based models, with slightly different compression and control token semantics.


4. Special cases: whitespace, punctuation, numbers, Unicode

These are the sharp edges where inputs get weird.

4.1 Whitespace

  • tiktoken-style (GPT, LLaMA 3): essentially lossless

    • Keeps every space and newline
    • Often merges a leading space with the following word (" hello" as a single token)
    • Perfect for code, exact formatting, diff-based workflows
  • SentencePiece (LLaMA 2, Gemini, T5, Mistral): uses

    • "▁Hello▁World" internally
    • Decodes back to "Hello World"
    • Multiple spaces tend to collapse – usually OK for natural language, problematic if spacing is semantic
  • WordPiece (BERT): lossy

    • Can’t reconstruct original whitespace
    • Fine for classification, bad for anything that needs exact layout

4.2 Punctuation and braces

  • Commas, periods, colons, semicolons, question marks: usually single tokens
  • Brackets and braces: single tokens in most cases, but common character runs like {" or ": " might be merged into one token
  • Tags like <title> might be one token or several, depending on frequency

Practically: JSON and XML carry a lot of structural punctuation. Every one of those characters is a token or part of a token. Over a large dataset, that adds up.

4.3 Numbers

Tokenization of numbers is surprisingly bad almost everywhere:

  • "1234" often becomes two tokens (e.g. "12", "34")
  • "3.14159" becomes multiple tokens ("3", ".", "14", "159", …)
  • Large integers and precise decimals explode into many tokens

This matters for:

  • Cost: storing or streaming big numeric tables burns tokens fast
  • Reasoning: if a single logical number spreads across multiple tokens, arithmetic and comparisons are harder to learn

Some tokenizers special-case common years or round numbers, but it’s still not great.
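One simplified model of how recent GPT-era tokenizers handle digit runs: split them left-to-right into groups of at most three digits. The real splits also depend on the surrounding context and the specific encoding, but this sketch gives a feel for why long numbers multiply your token count:

```python
def digit_chunks(number: str) -> list[str]:
    """Approximate GPT-era BPE digit splitting: groups of up to 3 digits."""
    return [number[i:i + 3] for i in range(0, len(number), 3)]

print(digit_chunks("1234"))        # ['123', '4']
print(digit_chunks("3141592653"))  # ['314', '159', '265', '3']
```

A 10-digit ID costs you four tokens under this scheme; a table with thousands of such IDs pays that price on every single cell.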

4.4 Unicode and multilingual text

  • Byte-level BPE (GPT-2 style) can encode any Unicode string without unknowns, because everything decomposes into bytes
  • SentencePiece has explicit controls (character_coverage) to decide which characters get dedicated tokens vs fall back to sub-character encoding
  • Large vocabularies (Gemini, PaLM, some LLaMA 3 configs) aim for fairer token counts across languages, so your input cost isn’t 2–3x in non-English.

5. Why input format still changes performance

So far we’ve said: everything is turned into tokens, and there is no special parsing per format.

So why do benchmarks show:

  • Different formats = different accuracy
  • Different formats = huge differences in token count, cost, and latency

The answer is: different formats induce different token sequences, and some sequences are simply easier (or harder) for models to work with.

5.1 Syntax overhead vs data density

Rough mental model:

  • CSV / TSV

    • Minimal syntax: commas, tabs, newlines
    • Overhead: maybe 5–10% on top of raw data
    • Keys appear once (header row), values dominate
  • JSON

    • Braces, brackets, quotes, colons, commas everywhere
    • Overhead: easily 40–60% for simple tabular data
    • Keys repeated on every row ("employee_name" 100 times in 100 rows)
  • XML / HTML

    • Even more verbose: open and close tags
    • Large overhead but strong structural hints (nesting, tag types)
  • Markdown key–value / tables

    • Somewhere in between: human-readable, some delimiters, less punctuation than JSON/XML

Empirical studies comparing these formats on the same data find:

  • CSV/TSV are usually most token-efficient

  • JSON/XML are usually most token-expensive

  • Accuracy is task-dependent:

    • Sometimes JSON/HTML win, because explicit structure helps models navigate
    • Sometimes CSV wins, especially when data is clean and the task is simple lookups
    • Markdown often balances readability and structure, and does surprisingly well

The key: you are trading overhead tokens for explicit structure, and whether the model actually benefits from that extra structure depends on the task.
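You can get a feel for the overhead without any tokenizer at all, just by serializing the same records both ways and comparing sizes (character counts are only a rough proxy for token counts, and the data here is made up, but the key-repetition effect is the same one you pay for in tokens):

```python
import csv, io, json

# 100 hypothetical records with identical fields.
rows = [{"name": f"user{i}", "age": 20 + i % 40, "role": "Engineer"}
        for i in range(100)]

json_text = json.dumps(rows)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age", "role"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

print(len(json_text), "chars as JSON vs", len(csv_text), "chars as CSV")
print(f"{1 - len(csv_text) / len(json_text):.0%} smaller as CSV")
```

The JSON version repeats "name", "age", and "role" (plus quotes, braces, and colons) on every record; the CSV version states them once in the header.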

5.2 Models are getting more format-robust

Older models (GPT-3.5, small LLaMA 2 variants):

  • Showed big swings in accuracy (up to ~40 percentage points) depending on whether the same task was wrapped in plain text, JSON, YAML, HTML, or Markdown.

Newer frontier models:

  • Reduce that spread a lot – you still see differences, but they’re smaller
  • Training on enormous mixed-format corpora seems to give them decent “format translation” ability
  • They learn that {"name": "Alice"}, name: Alice, name,Alice, and "name = Alice" are semantically the same

But none of that changes tokenization:

  • Token counts don’t shrink when the model gets smarter
  • JSON is still JSON: same braces, same colon tokens, same key repetition
  • CSV is still much cheaper to represent flat rows

So accuracy is getting less sensitive to format. Efficiency is not.


ELI5: Why do two formats with the same info behave differently?

Consider the same information in two shapes:

  • A clean, well-structured table or key–value list
  • A long narrative paragraph describing the same thing

Depending on the task, the model leans on different parts of its training.

If the task is creative writing, narrative-style input is closer to what it has seen in books and articles. Feed it a wall of JSON or XML and it first has to mentally ignore or work around all the “overhead” tokens like colons, braces, tags, and quotes to get to the actual story.

For structured data tasks or code understanding, it’s the opposite. Clear delimiters, consistent columns, or obvious block boundaries make it easier for the model to lock onto the relevant bits. The same content wrapped in a noisy, chatty paragraph forces it to do more work to reconstruct structure before it can reason about it.

Same information, different presentation. One shape lines up with how the model learned to solve the current task, the other fights it.


6. Input-side optimization strategies

Since this post is “input edition”, let’s keep all the suggestions strictly on the ingestion side. Output schemas, JSON mode, function calling, etc. go into the next post.

6.1 Choose the right format for the data you actually have

Flat tabular data (metrics, logs, DB exports):

  • Prefer CSV/TSV or a columnar-style format over full JSON.
  • You often get 30–60% token savings, with no loss in expressive power for simple lookups, aggregations, and “find the row where…” tasks.
  • Make sure headers are clear – that’s effectively your schema.

Semi-structured data (API responses, nested objects):

  • Regular JSON is still fine if you truly need nested structure.

  • If the majority of your data is actually tabular (e.g. 100 records with identical fields), consider hybrid formats:

    • One header/field declaration
    • Then row-wise values (CSV/TSV style)
    • Possibly wrapped or annotated so the model knows what’s what

This is similar to the idea behind “AI-native” formats like TOON: declare field names once and stream values to reduce repetition. It’s clever from a token-efficiency perspective, but models haven’t actually been trained on this kind of layout. That means you don’t automatically get the accuracy benefits you’d expect from a format that’s optimised for compression.

In practice, you’ll usually get more reliable behavior from formats the model has seen at scale: Markdown tables, key–value lists, or even plain JSON, even if they cost more tokens. Familiar structure often beats theoretical efficiency when it comes to actual model performance.
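A sketch of the hybrid idea: declare the field names once, then stream values row-wise. The exact layout below (the "fields:" prefix, the " | " delimiter) is made up for illustration; pick delimiters your target model handles well and keep the transform reversible.

```python
def to_hybrid(records: list[dict]) -> str:
    """Flatten records with identical keys into one header line + value rows."""
    fields = list(records[0])
    lines = ["fields: " + ", ".join(fields)]
    for rec in records:
        lines.append(" | ".join(str(rec[f]) for f in fields))
    return "\n".join(lines)

records = [
    {"name": "Alice", "age": 30, "role": "Engineer"},
    {"name": "Bob", "age": 28, "role": "Designer"},
]
print(to_hybrid(records))
# fields: name, age, role
# Alice | 30 | Engineer
# Bob | 28 | Designer
```

Every key now appears exactly once, no matter how many records follow.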

Human-facing workflows:

  • Don’t optimize the format you show humans. Optimize what you send to the model.

  • It’s perfectly valid to:

    • Keep pretty JSON or Markdown in your product
    • Transform to a more token-efficient form before sending to the model
    • Transform the model’s result back into something readable

6.2 Separate human formatting from model formatting

Several papers have shown that models handle fully unformatted code almost as well as nicely formatted code, and that removing whitespace and indentation cuts token counts by ~25% or more in some languages.

  • A simple pattern to implement in your app could be:
    • Humans see and edit nicely formatted code
    • System strips it down for the model (removes non-semantic whitespace)
    • System re-formats for humans afterwards

You can apply the same trick to large Markdown reports, HTML, or XML:

  • Strip redundant whitespace
  • Remove decorative markup that doesn’t change meaning
  • Compress repetitive boilerplate

As long as your transformation is lossless for semantics, you’re giving the model the same information in fewer tokens. Just remember to first optimize for accuracy and then for token cost.
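A minimal version of that lossless-for-semantics cleanup, safe for Markdown-ish prose: strip trailing spaces and collapse runs of blank lines. (For real code, use a language-aware minifier instead, so you don't break string literals or indentation-sensitive syntax like Python.)

```python
import re

def squeeze(text: str) -> str:
    """Remove non-semantic whitespace: trailing spaces and runs of blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    out = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return out.strip() + "\n"

doc = "# Report   \n\n\n\nAll systems nominal.   \n\n\nDone.\n"
print(squeeze(doc))
# # Report
#
# All systems nominal.
#
# Done.
```

The meaning is untouched; only tokens the model never needed are gone.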

6.3 Use compression thoughtfully

Prompt compression techniques (LLMLingua and friends) can bring significant improvements:

  • A small helper model or algorithm drops low-value tokens before sending to the big model
  • Good compression schemes can remove 50–75% of tokens with minimal or even positive impact on accuracy, because they strip noise and redundancy

From an input perspective:

  • Long documents often contain intros, disclaimers, irrelevant sections

  • Compression passes can keep:

    • Headings
    • Entities
    • Numerical facts
    • Critical context
  • And discard:

    • Repeated boilerplate
    • Over-explaining
    • Low-signal sentences

Do this before tokenization for the main model, and you directly reduce cost and context pressure.
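To make the shape of such a pass concrete, here is a deliberately crude heuristic filter: keep headings, lines with numbers, and lines that look like they name a person, drop obvious boilerplate. Real systems like LLMLingua use a small helper model rather than regexes; this toy exists only to show where a compression pass sits in the flow.

```python
import re

# Toy boilerplate detector; a real system would use a helper model.
BOILERPLATE = re.compile(r"(?i)^(disclaimer|all rights reserved|this email)")

def keep_line(line: str) -> bool:
    """Toy heuristic: keep headings, numerical facts, and named entities."""
    if BOILERPLATE.search(line):
        return False
    return (
        line.startswith("#")             # headings
        or bool(re.search(r"\d", line))  # numerical facts
        or bool(re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", line))  # crude names
    )

doc = [
    "# Q3 summary",
    "Disclaimer: for internal use only.",
    "Revenue grew 14% quarter over quarter.",
    "We are pleased to share this update with you.",
    "Alice Johnson will lead the new team.",
]
print([line for line in doc if keep_line(line)])
```

Three of five lines survive, and they carry essentially all of the facts.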

6.4 Be tokenizer-aware in your design

If you’re designing a system that will send a lot of structured data to models:

  • Inspect how your target tokenizer splits your real inputs:

    • How many tokens per row / per object?
    • How are numbers split?
    • Do field names blow up token counts?
  • Prefer shorter, meaningful field names when possible:

    • "emp_name" vs "employee_full_legal_name" × 1,000 rows is a huge difference
    • Models don’t need fully spelled out snake_case novels for each key
  • Normalize and pre-process:

    • Convert dates to consistent forms
    • Round numbers to sensible precision
    • Deduplicate or reference repeating blocks rather than inlining them everywhere

Everything you do to reduce redundancy before tokenization pays off every time you call the model.
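The field-name trick is a one-liner per record. The mapping below is hypothetical; the important part is keeping it reversible so you can translate the model's answers back to your real schema.

```python
import json

# Hypothetical mapping: long human-facing keys -> short model-facing keys.
SHORT = {"employee_full_legal_name": "name", "current_job_role": "role"}

def shorten_keys(record: dict) -> dict:
    """Rename keys via SHORT; unknown keys pass through unchanged."""
    return {SHORT.get(k, k): v for k, v in record.items()}

record = {"employee_full_legal_name": "Alice Johnson",
          "current_job_role": "Engineer"}
before = json.dumps(record)
after = json.dumps(shorten_keys(record))
print(len(before), "->", len(after), "chars")
```

Multiply that per-record saving by every row and every API call, and it stops being a rounding error.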


7. Closing thoughts

If you ignore how inputs are tokenized, you end up with:

  • More tokens than needed
  • Higher costs and latency
  • Sometimes worse accuracy, just because the model is wading through noise

If you treat input format as a first-class design decision, you can:

  • Cut token usage by 30–60% for the same data
  • Keep or even improve accuracy by choosing layouts that are easier for the model to “read”
  • Build a better foundation for any output-side tricks you want later (structured outputs, JSON mode, tools, etc.)

The model is a function from token sequence in to token sequence out. You don’t control the function, but you do control the sequence.

The next post will be about output management:

  • JSON mode, function calling, and when to trust them
  • Separating reasoning from formatting
  • Schema validation, repair loops, and tool calls