Cartography 1 – April 2026, 5 Major LLMs Measured

Mapping Drift, Entropy and Waste in Generative Intelligences.

SUMMARY

Five anonymized generative services were measured on the same corpus (TruthfulQA), during the same time window, using the same protocol. For each service, two runtime dimensions were calculated: the observable entropy of its generation (how much the internal dynamics of the service reveal what is happening while it responds) and the reliability of alerts that a monitoring layer can produce on it in real time.

This publication is the first wave of a continuous cartography. It uses stable anonymous identifiers (P-001 to P-005); real names are not disclosed in this version. The data and code are public.

WHAT THE MAP SAYS, WHAT IT DOES NOT SAY

Before reading the map, it is useful to distinguish what it measures and what it does not.

WHAT THE MAP SAYS

  • How much a service reveals its internal dynamics while generating text: its observable entropy.
  • How reliably a monitoring layer can raise alerts in real time: when it says “this response is drifting,” what percentage of the time is it correct?

Both properties are measured, not estimated, using a single protocol and public source code.

WHAT THE MAP DOES NOT SAY

  • It does not rank services by general quality, intelligence, usefulness, or commercial performance.
  • It does not measure the factual accuracy of the answers produced; that is a distinct dimension. A highly observable service may be less accurate. A silent service may be very accurate.
  • It does not compare architectures or publishers: services are anonymized by default through a sealed random permutation.

The Map

Five anonymized services, TruthfulQA corpus, N = 3,905 measurements.

[Figure: map of the five measured services. Legend: three observability zones (silent, blurry, observable); each point is a measured service, colored by observed entropy.]

How to Read the Map

Three simple readings, no technical prerequisites required.

Three Readings

  • Horizontal axis – Observable Entropy. The further right, the more the service “talks about itself” while responding. On the left, its internal dynamics remain silent. On the right, they become fully observable.
  • Vertical axis – Alert Reliability. The higher up, the more accurate the alerts produced by a monitoring layer: when the layer says “warning, this is drifting,” it is correct a high percentage of the time.
  • Point Size – Share of Detected Errors. A larger point means more errors are caught in real time. A smaller point means more errors pass through.
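The vertical axis and the point size can be read as two standard detection metrics. The sketch below assumes (as an interpretation, not as the published code) that an alert counts as "correct" when the flagged response was also judged incorrect. Under that assumption, alert reliability is a precision and the share of detected errors is a recall. The function name and the record layout are hypothetical.

```python
from typing import Iterable, Tuple


def alert_metrics(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (alert_reliability, share_of_detected_errors).

    Each record is (flagged, is_error). Both fields are hypothetical:
    `flagged` means the monitoring layer raised an alert, `is_error`
    means the response was judged incorrect.
    """
    true_alerts = false_alerts = missed_errors = 0
    for flagged, is_error in records:
        if flagged and is_error:
            true_alerts += 1
        elif flagged:
            false_alerts += 1
        elif is_error:
            missed_errors += 1

    alerts = true_alerts + false_alerts
    errors = true_alerts + missed_errors
    # Guard against empty denominators (no alerts, or no errors at all).
    reliability = true_alerts / alerts if alerts else 0.0
    detected_share = true_alerts / errors if errors else 0.0
    return reliability, detected_share


# Toy data: 4 alerts (3 justified), 5 errors in total (3 caught).
sample = (
    [(True, True)] * 3
    + [(True, False)]
    + [(False, True)] * 2
    + [(False, False)] * 4
)
reliability, detected_share = alert_metrics(sample)
print(f"alert reliability: {reliability:.2f}, detected errors: {detected_share:.2f}")
```

A service can score high on one metric and low on the other, which is why the map shows them as separate visual channels.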

PEDAGOGICAL ANALOGY

Imagine two cars. The first has a rich dashboard: oil gauge, temperature, tire pressure. The second has nothing. Both can drive equally well, but if a problem arises, only the first will warn you before a breakdown.

The map measures the richness of the dashboard, not the performance of the engine. A service on the right allows you to see its drifts coming. A service on the left may work very well, but if it drifts, it does so in silence.

ControlTower™ Rating Scale

Seven grades, consistent thermal gradient: from deep blue (stable) to red (critical).

Grade  Semantics                                                    Action
AAA    Perfect coherence. Very stable dynamics, minimal entropy.    Allow
AA     Stable. Some controlled fluctuations.                        Allow
A      Nominal. Expected behavior of a mature service.              Allow
BBB    Vigilance. Occasional drift signals, to be logged.           Allow + log
BB     Warning. More frequent drifts, active monitoring required.   Flag
B      Unstable. Fluctuating dynamics, capture required.            Flag
CCC    Critical. Strong drift, runtime intervention essential.      Flag / Block

The numerical thresholds (composite ≥ 0.981 for AAA, < 0.834 for CCC, etc.) are published in the document ControlTower™ Methodology v1.0 on GitHub.
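Once a grade is known, the Action column can be applied mechanically. The following is a minimal sketch, assuming a plain lookup table; the names ACTIONS, extreme_grade and action_for are illustrative and are not taken from the ControlTower™ code. Only the two thresholds quoted above are encoded. The intermediate cut-offs are left to the Methodology document.

```python
from typing import Optional

# Grade -> action, copied from the rating scale above.
ACTIONS = {
    "AAA": "Allow",
    "AA": "Allow",
    "A": "Allow",
    "BBB": "Allow + log",
    "BB": "Flag",
    "B": "Flag",
    "CCC": "Flag / Block",
}

AAA_MIN = 0.981  # composite >= 0.981 -> AAA
CCC_MAX = 0.834  # composite <  0.834 -> CCC


def extreme_grade(composite: float) -> Optional[str]:
    """Return "AAA" or "CCC" when the composite falls in one of the two
    bands quoted in the text, else None (other thresholds are not given here)."""
    if composite >= AAA_MIN:
        return "AAA"
    if composite < CCC_MAX:
        return "CCC"
    return None


def action_for(grade: str) -> str:
    """Look up the runtime action attached to a grade."""
    return ACTIONS[grade]


print(extreme_grade(0.99), action_for("BBB"))
```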

The Five Services Measured in April 2026 (V01) – Ratings

Ratings are calculated deterministically from the two raw dimensions (G-Score, FLAG rate).

Service  Observations  G-Score  FLAG Rate  Composite  Rating  Tier
P-002    780           0.9120    3.72%     0.9374     A       Investment grade
P-001    780           0.9091    7.69%     0.9161     BBB     Investment grade
P-004    781           0.9077    8.96%     0.9090     BBB     Investment grade
P-003    782           0.8998   14.19%     0.8789     BB      Speculative grade
P-005    782           0.8886   21.48%     0.8369     B       Speculative grade
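The published figures are consistent, to within rounding, with a composite equal to the mean of the G-Score and the share of unflagged responses, that is (G + (1 - FLAG)) / 2. This is an inference from the table, not the documented formula, which is defined in ControlTower™ Methodology v1.0. A short check, with a hypothetical function name:

```python
# (service, G-Score, FLAG rate as a fraction, published composite)
ROWS = [
    ("P-002", 0.9120, 0.0372, 0.9374),
    ("P-001", 0.9091, 0.0769, 0.9161),
    ("P-004", 0.9077, 0.0896, 0.9090),
    ("P-003", 0.8998, 0.1419, 0.8789),
    ("P-005", 0.8886, 0.2148, 0.8369),
]


def inferred_composite(g_score: float, flag_rate: float) -> float:
    """Assumed formula (inferred, not documented): mean of G-Score and (1 - FLAG rate)."""
    return (g_score + (1.0 - flag_rate)) / 2.0


for service, g_score, flag_rate, published in ROWS:
    estimate = inferred_composite(g_score, flag_rate)
    # Inputs are rounded to 4 decimals, so allow a small tolerance.
    assert abs(estimate - published) < 1.5e-4, service
    print(f"{service}: inferred {estimate:.5f} vs published {published:.4f}")
```

If the assumption holds, a service can raise its composite either by improving its G-Score or by lowering its FLAG rate, with equal weight on both.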

Outlook – The Dynamic Dimension

A single rating is a snapshot. The outlook describes in which direction the service is evolving over a sliding observation window.

  • Stable – Composite does not drift significantly.
  • Positive – Composite improves: fewer FLAGs, more stability.
  • Negative – Composite degrades: reinforced monitoring recommended.
  • Under review – Insufficient history or excessive variance.

For Wave 01, the outlook remains n/a for all five services: the measurement was synchronous, not a continuous runtime flow. The outlook will be populated once measurements run on continuous production traffic, which is the standard operating mode of the cartography.
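One way to picture how an outlook could be derived from a sliding window is sketched below. The numeric parameters (MIN_POINTS, MAX_STDEV, DRIFT_EPS) and the comparison between the older and newer half of the window are hypothetical placeholders. The actual rule is not specified in this publication.

```python
from statistics import mean, pstdev
from typing import Sequence

# Hypothetical parameters, for illustration only.
MIN_POINTS = 6      # minimum history before an outlook is issued
MAX_STDEV = 0.05    # above this spread, the window is considered too noisy
DRIFT_EPS = 0.005   # change in mean composite treated as significant


def outlook(composites: Sequence[float]) -> str:
    """Classify a window of composite scores (oldest first)."""
    if len(composites) < MIN_POINTS or pstdev(composites) > MAX_STDEV:
        return "Under review"
    half = len(composites) // 2
    # Compare the newer half of the window with the older half.
    delta = mean(composites[half:]) - mean(composites[:half])
    if delta > DRIFT_EPS:
        return "Positive"
    if delta < -DRIFT_EPS:
        return "Negative"
    return "Stable"


print(outlook([0.90, 0.90, 0.90, 0.92, 0.92, 0.92]))
```

With a single synchronous measurement per service, as in Wave 01, such a function would return "Under review" for lack of history.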

Method and Anonymization

Same Protocol for All

Each service was queried with the same TruthfulQA corpus and the same API call parameters, within the same time window. Measurements (entropy, reliability, detection rate) were calculated using a frozen version of the algorithm.

Responses were judged correct or incorrect by an independent third-party LLM (LLM-as-judge). Raw data, code, and protocol are published under CC-BY 4.0 / MIT. Any determined third party can reproduce the measurement identically.

P-001 to P-005, Sealed Permutation

The five identifiers do not reflect any order, whether alphabetical, chronological, or by prominence. A sealed random permutation is kept internally. No identifying metadata (size, origin, precise date) is published.
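How the permutation is sealed is not described here. One common way to make such a seal checkable later is a hash commitment: publish only a digest of the mapping, and reveal the mapping and a random nonce if a service asks to be named. The sketch below illustrates that general technique under this assumption; the function names and the placeholder service names are hypothetical.

```python
import hashlib
import json
import secrets
from typing import Dict, List, Tuple


def _digest(mapping: Dict[str, str], nonce: str) -> str:
    """SHA-256 over a canonical JSON form of the mapping plus the nonce."""
    payload = json.dumps(mapping, sort_keys=True) + nonce
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def seal_permutation(names: List[str]) -> Tuple[Dict[str, str], str, str]:
    """Assign P-001..P-00N to `names` in random order.

    Returns (mapping, nonce, commitment). Only the commitment would be
    published; the mapping and nonce stay internal until a reveal.
    """
    shuffled = list(names)
    secrets.SystemRandom().shuffle(shuffled)  # OS-level randomness
    mapping = {f"P-{i:03d}": name for i, name in enumerate(shuffled, start=1)}
    # Five services give only 120 orderings, so a random nonce is needed
    # to stop anyone from brute-forcing the commitment.
    nonce = secrets.token_hex(16)
    return mapping, nonce, _digest(mapping, nonce)


def verify(mapping: Dict[str, str], nonce: str, commitment: str) -> bool:
    """Check a revealed mapping against the published commitment."""
    return secrets.compare_digest(_digest(mapping, nonce), commitment)


# Placeholder names only; no real service is implied.
names = ["service-a", "service-b", "service-c", "service-d", "service-e"]
mapping, nonce, commitment = seal_permutation(names)
print(commitment, verify(mapping, nonce, commitment))
```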

Any measured service may request to be publicly named, or to be removed from the public cartography, at any time. The procedure is documented in the CONTEST.md file in the repository.

What Wave 01 Teaches Us

A runtime monitoring layer enables each service to gain operational stability.

Of the five services measured, none reached the AA rating in raw observation. Runtime governance (monitoring, capture, real-time alerting) shifts the composite upward by filtering drift events, so the achievable rating increases. The service becomes more predictable for the operator and more usable for the end user.

The observability of an AI and its factual accuracy are two distinct properties.

The more a service reveals its internal dynamics (high observable entropy), the less it tends, in this first wave, to be factually accurate on the TruthfulQA corpus, and vice versa. These two qualities deserve to be measured and worked on separately. This is the distinction that the cartography makes visible.

The map is partial, dated, and intended to grow.

Five services today, around twenty planned for Wave 02 (Q2/Q3 2026), with a target of forty-five services by the end of 2026. The conclusions of this first wave are observations to be confirmed on a broader panel: they raise questions rather than close them.

Open-access research repository -> zenodo.org/records/19762753
