Tokenization for LLM Inputs: How AI Reads What It Doesn't See

The architectural choices that make pre-LLM tokenisation work in production — deterministic vs randomised, format preservation, mapping storage, and the questions teams have to settle before deployment.

AI Architecture · ~9 min read · Updated May 2026
TL;DR

Pre-LLM tokenisation is the substitution step that lets an enterprise document cross the boundary to an external model without exposing sensitive content. It works by giving the LLM referential integrity without semantic disclosure — placeholders that the model can reason about consistently across a document, without recovering the underlying identity. Production-grade implementations have to settle a set of architectural questions explicitly: deterministic vs randomised (cross-document linkage vs re-identification surface), format-preserving vs marker-style (output quality vs simplicity), where the mapping lives (exclusive enterprise control is non-negotiable), entity resolution across mentions, and what not to tokenise. Tokenisation alone is sufficient for most workflows. For high-cardinality data, long time series, or defence-in-depth postures, statistical protections (differential privacy, k-anonymity) layer on top. The definition of what counts as sensitive is the layer where most teams under-invest at the start — and where most of the long-term operational cost lives.

1. What Tokenisation Actually Does in This Context

The transformation step that sits between an enterprise document and an external LLM is, conceptually, simple: identify the elements that can't cross the boundary, replace them with placeholders that preserve their structural role, send the result to the model. In practice, the simplicity hides a set of architectural decisions that determine whether the approach works at production scale or breaks under load.

This article walks through those decisions. It is not a tutorial on a specific library or product. It is the set of choices any team adopting pre-LLM tokenisation has to make explicit, with the trade-offs each choice carries.

The term tokenisation is used here in the data-protection sense — replacing sensitive values with non-sensitive placeholders that can be mapped back — not in the NLP sense, where it means splitting text into subword units for model input. The two share a word and almost nothing else.

A related note on terminology: in CUBIG's architecture, tokenisation is the core substitution mechanism inside a broader encapsulation layer that also includes detection, format preservation, and optional statistical protections. This article focuses on the tokenisation mechanism specifically — the design decisions are largely the same whether the implementation calls itself tokenisation, encapsulation, or any of the other names used in the field.

When an enterprise document is prepared for an external LLM, the goal is that the model sees a version of the document that retains everything it needs for the task and removes everything the boundary was set up to keep in. Tokenisation is the mechanism that achieves the second half: identifying sensitive elements and substituting placeholders.

A useful frame: the LLM doesn't need to know that the customer is named Marlene Schmidt. It needs to know that there is a customer, that the customer is referenced in three different places in the document, and that the references all point to the same entity. A token like CUST-7F2A carries the same information — a referenceable entity that appears in multiple places consistently — without carrying the identity.

That is the core property tokenisation provides: referential integrity without semantic disclosure. The model can reason about "the customer" across a document because the token threads through the document consistently. The model cannot recover the identity because the token doesn't encode it.
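A minimal sketch of that property, assuming the sensitive values have already been detected and are passed in as a list. The function name `tokenise`, the `CUST` prefix, and the four-character suffix (chosen only to mirror the CUST-7F2A example) are illustrative, not part of any particular product.

```python
import re
import secrets


def tokenise(text: str, sensitive_values: list[str], prefix: str = "CUST") -> tuple[str, dict[str, str]]:
    """Replace each sensitive value with one consistent placeholder.

    Returns the tokenised text and a token -> value mapping. The mapping
    is the part that is meant to stay inside the enterprise boundary.
    """
    value_to_token: dict[str, str] = {}
    used: set[str] = set()
    for value in sensitive_values:
        if value in value_to_token:
            continue
        # Draw random suffixes until one is unused, so two values never share a token.
        while True:
            token = f"{prefix}-{secrets.token_hex(2).upper()}"
            if token not in used:
                break
        used.add(token)
        value_to_token[value] = token

    if not value_to_token:
        return text, {}

    # Longest values first, so a longer value is never partially replaced by a shorter one.
    ordered = sorted(value_to_token, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    tokenised = pattern.sub(lambda m: value_to_token[m.group(0)], text)
    mapping = {token: value for value, token in value_to_token.items()}
    return tokenised, mapping
```

Every occurrence of the same value receives the same token, which is what lets the model follow the entity through the document, and the token itself is random, so it carries nothing about the identity.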

Everything else in this article is variations on how that property is implemented, and what additional properties layer on top of it.

SOURCE DOCUMENT
Header: Customer Marlene Schmidt called regarding account dropouts.
Agent note: “Mr Schmidt reports the issue started last Tuesday.”
Resolution: “Marlene confirmed service restored after firmware roll-back.”

WHAT THE LLM SEES (after entity resolution and tokenisation)
Header: Customer CUST-7F2A called regarding account dropouts.
Agent note: “CUST-7F2A reports the issue started last Tuesday.”
Resolution: “CUST-7F2A confirmed service restored after firmware roll-back.”

TOKEN ↔ VALUE MAPPING (enterprise only; never leaves the enterprise boundary)
CUST-7F2A → Marlene Schmidt (also “Mr Schmidt”, “Marlene”)
Figure 1 · Three different mentions of the same customer all resolve to one consistent token. The LLM can reason about “the customer” throughout the document; the mapping back to the real identity stays inside the enterprise.

2. Deterministic vs Randomised Tokenisation

The first architectural decision is whether a given sensitive value always produces the same token, or whether it produces a different token each time.

2.1 Deterministic Tokenisation

Deterministic tokenisation means Marlene Schmidt always becomes CUST-7F2A, in every document, in every workflow. The token is a function of the value (and usually a secret key).

The benefit is consistency across documents. If two tickets reference the same customer, the LLM sees the same token in both, and analytics that depend on cross-document linkage continue to work. For workflows that aggregate or compare across documents — fraud detection patterns, customer history summaries, cohort analysis — deterministic is usually the only viable choice.

The cost is that determinism creates a re-identification surface. An attacker who observes enough tokenised documents and has side information about which customers appear where can correlate tokens to identities. The risk grows with volume: the more tokenised outputs the same entity appears in over time, the more material an attacker has to correlate.
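One common way to make the token "a function of the value and a secret key" is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function name `deterministic_token`, the normalisation step, and the 12-character truncation are assumptions made for illustration.

```python
import hashlib
import hmac


def deterministic_token(value: str, key: bytes, prefix: str = "CUST", hex_chars: int = 12) -> str:
    """Derive a stable token from a value and a secret key."""
    # Collapse whitespace and case so trivially different spellings map to one token.
    normalised = " ".join(value.split()).casefold()
    digest = hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    # The truncated digest is not reversible; the token -> value mapping still has to be stored.
    return f"{prefix}-{digest[:hex_chars].upper()}"
```

The same value and key always yield the same token, in any document. Without the key, an observer cannot recompute tokens for candidate names, although the correlation risk described above remains.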

2.2 Randomised Tokenisation

Randomised tokenisation draws a fresh token each time a value is tokenised, so the same value receives unrelated tokens in different documents. Marlene Schmidt might become CUST-7F2A in one document and CUST-3B91 in another.

The benefit is that no cross-document linkage is exposed. Each tokenised document is a closed system.

The cost is that cross-document analytics break. The LLM can't tell that two tokens refer to the same customer, because at the structural level they don't. For workflows that don't need cross-document linkage — summarising a single document, extracting clauses from a single contract — randomisation is fine. For workflows that do, randomisation forces the linkage to be reconstructed after the LLM responds, which adds complexity.
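A minimal sketch of randomised tokenisation, assuming the token stays stable inside one document and is redrawn for every new document (which is how the "closed system" description reads). The class name `RandomisedTokeniser` is hypothetical.

```python
import secrets


class RandomisedTokeniser:
    """Issues random tokens. Create one instance per document."""

    def __init__(self, prefix: str = "CUST") -> None:
        self.prefix = prefix
        self.mapping: dict[str, str] = {}     # token -> value, kept inside the enterprise
        self._by_value: dict[str, str] = {}   # value -> token, for reuse within this document

    def token_for(self, value: str) -> str:
        if value not in self._by_value:
            # 8 random bytes: unrelated to the value and to any other document's tokens.
            token = f"{self.prefix}-{secrets.token_hex(8).upper()}"
            self._by_value[value] = token
            self.mapping[token] = value
        return self._by_value[value]
```

Because nothing links the tokens of two instances, any cross-document join has to be rebuilt afterwards from the stored mappings.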

2.3 The Hybrid Pattern Most Production Deployments Use

Most production deployments end up with a hybrid: deterministic within a workflow scope (so a multi-turn conversation about a customer stays coherent), randomised across workflow scopes (so analytics from one workflow can't be cross-referenced with another). The boundary of the scope is itself a design decision — by session, by user, by document, by tenant — and is one of the things a team has to settle before the architecture goes live.

| Choice | Benefit | Cost | Best fit |
| --- | --- | --- | --- |
| Deterministic | Cross-document linkage preserved; analytics work across workflows | Creates re-identification surface over high-volume workflows | Fraud detection, customer history, cohort analysis |
| Randomised | Each tokenised document is a closed system; no cross-document linkage exposed | Cross-document analytics break; linkage must be reconstructed post-LLM | Single-document summarisation, single-contract extraction |
| Hybrid (deterministic within scope, randomised across) | Coherent within a session/user/tenant boundary; isolated across | The scope boundary itself becomes a design decision | Most production deployments |
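One way to get the hybrid behaviour without storing random draws is to derive a separate key for each scope and tokenise deterministically under that key. This is a sketch under that assumption; `scope_key`, `scoped_token`, and the scope identifiers are illustrative names, and the scope could equally be a session, user, document, or tenant.

```python
import hashlib
import hmac


def scope_key(master_key: bytes, scope_id: str) -> bytes:
    """Derive an independent key for one workflow scope from a master key."""
    return hmac.new(master_key, b"scope:" + scope_id.encode("utf-8"), hashlib.sha256).digest()


def scoped_token(value: str, master_key: bytes, scope_id: str, prefix: str = "CUST") -> str:
    """Same value + same scope -> same token; a different scope gives an unrelated token."""
    key = scope_key(master_key, scope_id)
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12].upper()}"
```

Strictly speaking the tokens are deterministic per scope rather than random, but without the master key the tokens from two scopes cannot be linked, which is the isolation the hybrid pattern is after.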

3. Format-Preserving Tokenisation — Why Placeholder Strings Aren't Enough

A naive implementation replaces sensitive values with generic placeholders: [CUSTOMER], [ACCOUNT_NUMBER], [DATE]. The LLM sees a document littered with these markers and tries to reason about it.

This works poorly in practice for a specific reason: the LLM's reasoning is shaped by the surface form of the input. A document that reads "Customer Marlene Schmidt called on 2026-03-15 about account 4471-9028" is, to the model, a coherent operational record. The same document with placeholders — "Customer [CUSTOMER] called on [DATE] about account [ACCOUNT_NUMBER]" — reads as a template or a redaction notice. Models are sensitive to that signal, and their outputs degrade accordingly: summaries become more abstract, extraction becomes less precise, and the model occasionally lapses into commentary about the redaction itself.

Format-preserving tokenisation generates tokens that look like the values they replace. A name becomes a plausible-looking name token: Lyra Vesper. A date becomes a real date in a plausible range. An account number becomes a number of the same length, in the same format, that isn't a real account number.

The document the LLM sees reads as a coherent operational document with anonymous-but-realistic stand-ins. The model's outputs come back at the quality the model can actually produce, rather than degraded by the perception that it's being asked to reason about a template.

Format preservation has its own design choices: how plausible to make the tokens, whether to draw from a fixed pool of fake names or generate them on the fly, how to handle dates and numerics where the value itself has analytical meaning (a date in 2019 vs a date in 2024 may matter to the analysis even if the exact date is sensitive). The general rule is that the token has to preserve whatever analytical property the original value carried, no more and no less.
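As an illustration of "same length, same format" and of keeping an analytical property (here, the year of a date), the sketch below derives replacement digits and dates from a keyed hash. The helper names `mask_digits` and `mask_date_keep_year` are hypothetical. This is not a standardised format-preserving encryption scheme, the output is not reversible (the mapping is still required), and a real deployment would also need to check that a generated number does not collide with a live account.

```python
import hashlib
import hmac
from datetime import date, timedelta


def _keyed_bytes(key: bytes, label: str, value: str) -> bytes:
    return hmac.new(key, f"{label}:{value}".encode("utf-8"), hashlib.sha256).digest()


def mask_digits(value: str, key: bytes) -> str:
    """Replace every ASCII digit with a keyed pseudo-random digit.

    Length, separators, and letters are left untouched. The byte-mod-10
    mapping is slightly biased, which is acceptable for a sketch.
    """
    stream = _keyed_bytes(key, "digits", value)
    out = []
    i = 0
    for ch in value:
        if "0" <= ch <= "9":
            out.append(str(stream[i % len(stream)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def mask_date_keep_year(d: date, key: bytes) -> date:
    """Move a date to a keyed pseudo-random day within the same calendar year."""
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    seed = int.from_bytes(_keyed_bytes(key, "date", d.isoformat())[:4], "big")
    return start + timedelta(days=seed % days_in_year)
```

Keeping the year is one possible reading of "no more and no less"; a workflow that depends on intervals between dates would instead need a consistent shift applied to every date in the document.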

4. Where the Mapping Lives

Tokenisation only protects the data if the mapping — the table that connects tokens to original values — stays inside the enterprise environment. This is the part of the architecture that most consistently determines whether the approach actually delivers its protection.

Three properties of the mapping have to hold:

  1. It stays in the enterprise's exclusive control. The mapping is, in effect, the key that re-identifies the data. If it leaves the environment, the protection collapses to whatever protection the new location provides. For workflows where data must stay in the EU region or other defined boundaries, the mapping has to live within that same boundary — colocated with the source systems, not with the AI endpoint.
  2. It is integrity-protected. Tampering with the mapping changes what gets reconstructed when the LLM's response comes back. An attacker who can modify the mapping can substitute identities in the output. Standard practice is to apply integrity checks to the mapping itself — signed entries, audit logs of access — so that any tampering is detectable.
  3. It is access-controlled separately from the LLM workflow. The team that operates the LLM integration doesn't need read access to the mapping. The reconstruction step pulls from the mapping programmatically; it doesn't require humans to see the original values. Separating those two access paths means the mapping can be governed under stricter controls than the LLM workflow itself.

Storage technology is secondary to these properties. The mapping can live in a dedicated database, a key-value store, an encrypted file, or a hardware-backed vault — the architectural choice depends on volume, latency requirements, and existing infrastructure. What matters is that the three properties above are non-negotiable design constraints, not configurable options.
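The integrity and access-logging properties can be sketched with an in-memory store that signs each entry and verifies the signature on every read. The class `SignedMapping` is hypothetical, the dictionary stands in for whatever storage is actually used, and the signing key is assumed to be held separately from the store itself.

```python
import hashlib
import hmac
import json


class SignedMapping:
    """Token -> value store with per-entry HMAC signatures and an access log."""

    def __init__(self, signing_key: bytes) -> None:
        self._key = signing_key
        self._entries: dict[str, tuple[str, str]] = {}   # token -> (value, signature)
        self.access_log: list[tuple[str, str]] = []      # (operation, token)

    def _sign(self, token: str, value: str) -> str:
        payload = json.dumps([token, value]).encode("utf-8")
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def put(self, token: str, value: str) -> None:
        self._entries[token] = (value, self._sign(token, value))
        self.access_log.append(("put", token))

    def get(self, token: str) -> str:
        value, signature = self._entries[token]
        self.access_log.append(("get", token))
        # Constant-time comparison; a mismatch means the entry was altered after signing.
        if not hmac.compare_digest(signature, self._sign(token, value)):
            raise ValueError(f"integrity check failed for {token}")
        return value
```

The reconstruction step would call `get` programmatically, so operators of the LLM integration never need direct read access to the underlying entries.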

5. Token Consistency — Same Entity, Same Token

A subtler design problem: ensuring that the same entity gets the same token, consistently, even when the entity is referenced in different ways across a document.

A service ticket might mention "the customer," then "Mr Schmidt," then "Marlene," then "the subscriber" — all referring to the same person. A naive tokeniser sees four different mentions and produces four different tokens, breaking the LLM's ability to track that these all refer to one entity. The summary that comes back may treat them as four people.

Resolving this requires entity resolution before tokenisation: identifying which mentions in a document refer to the same underlying entity, and ensuring they all map to the same token. This is a non-trivial problem in general — entity resolution is a research field of its own — but in practice it's tractable because enterprise documents have structural cues (a customer ID in the header tying together free-text mentions, formal naming conventions in operational logs, schema-defined relationships in structured records).
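The sketch below covers only the final step: once the surface forms that refer to one entity are known (for example, from a customer ID in the ticket header), every form is replaced with that entity's token. Deciding which mentions co-refer is the hard part and is assumed to have happened already; `resolve_and_tokenise` and the `cust-001` identifier are illustrative.

```python
import re


def resolve_and_tokenise(text: str, aliases: dict[str, list[str]], tokens: dict[str, str]) -> str:
    """Replace every known surface form of an entity with that entity's token.

    aliases: entity id -> surface forms seen in the document
    tokens:  entity id -> token assigned to that entity
    """
    surface_to_token = {
        form: tokens[entity_id]
        for entity_id, forms in aliases.items()
        for form in forms
    }
    if not surface_to_token:
        return text
    # Longest forms first so "Marlene Schmidt" is matched before "Marlene".
    ordered = sorted(surface_to_token, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b")
    return pattern.sub(lambda m: surface_to_token[m.group(0)], text)
```

Generic references such as "the customer" carry no identity and can stay as they are; only the identifying forms need to converge on one token.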

The other half of consistency is across documents within a workflow scope. If two tickets reference the same customer, and the workflow needs to treat them as related, the tokenisation has to produce the same token for the customer in both. This is where the deterministic-vs-randomised choice from earlier interacts: deterministic-within-scope is what enables the LLM to see "the same customer appears in three tickets" without learning who the customer is.

A well-designed tokenisation layer handles both kinds of consistency — within-document and within-scope — as part of the transformation, not as an afterthought. Teams that retrofit consistency onto a per-mention tokeniser usually find the workflow degrades in ways that look like model quality problems but are actually data preparation problems.

6. Additional Protection Layers — When Tokenisation Alone Isn't Enough

Tokenisation handles the substitution problem. For most workflows, well-implemented tokenisation with the mapping under the enterprise's exclusive control is sufficient. For some workflows, an additional layer of protection is worth adding on top.

The case for additional protection arises when the residual risk is not in the tokens themselves but in the patterns the tokens form. A tokenised document may contain enough structural information — frequencies, co-occurrences, sequences, ratios — that a sophisticated correlator could re-identify entities even without the raw values. The risk is particularly relevant for high-cardinality data, long time series, and workflows where many tokenised outputs accumulate over time.

The standard responses are differential privacy, k-anonymity, and similar statistical protections applied to the tokenised data. Differential privacy adds calibrated noise to outputs or aggregates; k-anonymity generalises or suppresses values until every record shares its quasi-identifiers with at least k-1 others. Each limits how much an attacker can learn from the tokenised output, at the cost of some analytical precision. Whether the trade-off is worth it depends on the threat model and the workflow's tolerance for noise.
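As a small illustration of the k-anonymity side, the check below groups tokenised records by their quasi-identifiers (the non-tokenised fields that could still single someone out) and reports whether every group has at least k members. The field names `region` and `plan` are invented for the example.

```python
from collections import Counter


def smallest_group(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the rarest combination of quasi-identifier values (0 if no records)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def is_k_anonymous(records: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    """True when every quasi-identifier combination is shared by at least k records."""
    return smallest_group(records, quasi_identifiers) >= k
```

A failing check points at the combinations that need generalising or suppressing before the data crosses the boundary; the check itself changes nothing.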

For most enterprise AI workflows this layer is optional. For workflows where the data is highly sensitive, the volume is high, or the data posture demands defence in depth, it is worth the complexity. The decision is best made workflow by workflow, not as a global setting.

7. What Not to Tokenise

A final design question that often gets answered by accident: what not to tokenise.

Tokenising the wrong things degrades the AI's output without improving protection. A tokeniser that replaces every proper noun produces unreadable documents. A tokeniser that replaces every numeric field destroys analytical signal. The temptation is to be aggressive — "tokenise everything that could conceivably be sensitive" — but the cost shows up immediately in output quality.

The disciplined approach is to define sensitivity explicitly, in the enterprise's own terms, and tokenise only those elements. Generic PII categories are a starting point, not a complete list. Internal project codes, customer-segment identifiers, sector-specific references — whatever the enterprise's data posture treats as protected — go on the list. Everything else stays.

The list has to be versioned, because what counts as sensitive changes over time. It also has to be auditable, because an audit review of the workflow will want to know what was tokenised, when, under which definition. The definition layer is where most of the long-term operational cost of this architecture lives, and where most teams under-invest at the start.
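A versioned, auditable definition can be as simple as a policy object with a version label plus a record written each time a document is tokenised. The sketch below is one possible shape; the class names, category names, and version string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SensitivityPolicy:
    version: str
    categories: frozenset[str]   # what this enterprise treats as protected


@dataclass(frozen=True)
class AuditRecord:
    document_id: str
    policy_version: str
    tokenised_categories: tuple[str, ...]
    timestamp: str


def audit_tokenisation(
    document_id: str,
    detected: set[str],
    policy: SensitivityPolicy,
    now: Optional[datetime] = None,
) -> AuditRecord:
    """Record which detected categories were tokenised, and under which policy version."""
    now = now or datetime.now(timezone.utc)
    # Only categories on the policy's list are tokenised; everything else stays.
    applied = tuple(sorted(c for c in detected if c in policy.categories))
    return AuditRecord(document_id, policy.version, applied, now.isoformat())
```

With records like these, a later review can answer what was tokenised, when, and under which definition without re-running the pipeline.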

8. The Next Step in the Workflow

Tokenisation prepares the document for the external model. The model processes the tokenised document and returns a tokenised response. The response, on its own, is not yet useful to the workflow — the tokens have to be mapped back to original values inside the enterprise environment before the output reaches the user.
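At its simplest, the mapping-back step is a lookup against the stored mapping, run inside the enterprise environment. The sketch below assumes the model returned the tokens unchanged, which a production system cannot take for granted; `detokenise` is an illustrative name.

```python
import re


def detokenise(response: str, mapping: dict[str, str]) -> str:
    """Replace every known token in the model's response with its original value."""
    if not mapping:
        return response
    # Longest tokens first so one token is never replaced inside a longer one.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(token) for token in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], response)
```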

That reconstruction step is the subject of the next article in this series. For the broader pattern this article is part of, see the pillar overview on running external LLMs on sensitive enterprise data. For why masking and redaction don't substitute for tokenisation in operational workflows, see the article on why AI workflows stall on operational data.

Have a deployment question?

Bring your industry, your regulatory profile, and your data. We respond within one business day.

Email : contact@cubig.ai

CUBIG LTD (United Kingdom)

Company Number: NI735459
Address: 21 Arthur Street, Belfast, Antrim, United Kingdom, BT1 4GA


CUBIG CORP (Republic of Korea)

Business Registration Number : 133-81-45679

E-Commerce Registration : 2023-Seoul-Seocho-2822

Address: 4F, NAVER 1784, 95, Jeongjail-ro, Bundang-gu, Seongnam-si, Gyeonggi-do, Republic of Korea

©️ 2026 CUBIG Corp. All rights Reserved.
