Cartography 1 – April 2026, 5 Major LLMs Measured

Mapping Drift, Entropy and Waste in Generative Intelligences.

Important Note:

This publication [V1 (data/v1-2026-04-26)] is preserved as an exploratory release. It was used to build and validate the initial anonymization and measurement pipeline.
The next publication [V2 (data/v2-2026-05-17)] constitutes the first official reference baseline of the NeoMundi Runtime Cartography, generated through the reproducible anonymization pipeline.
Provider pseudonyms introduced in V2 are stable for future releases, unless explicitly documented. Direct provider-by-provider comparisons should therefore start from V2.

See GitHub for more information -> https://github.com/neomundi-io/llm-cartography

SUMMARY

Five anonymized generative services were measured on the same corpus (TruthfulQA), during the same time window, using the same protocol. For each service, two runtime dimensions were calculated: the observable entropy of its generation (how much the internal dynamics of the service reveal what is happening while it responds) and the reliability of alerts that a monitoring layer can produce on it in real time.

This publication is the first wave of a continuous cartography. It uses stable anonymous identifiers (P-001 to P-005); real names are not disclosed in this version. The data and code are public.

WHAT THE MAP SAYS, WHAT IT DOES NOT SAY

Before reading the map, it is useful to distinguish what it measures and what it does not.

WHAT THE MAP SAYS

How much a service reveals its internal dynamics while generating text: its observable entropy.
How reliable the real-time alerts of a monitoring layer are: when it says “this response is drifting,” what percentage of the time is it correct?

These two properties are measured, not estimated. A single protocol is applied to all services, and the source code is public.

WHAT THE MAP DOES NOT SAY

It does not rank services by general quality, intelligence, usefulness, or commercial performance.
It does not measure the factual accuracy of the produced answers, which is a distinct dimension. A highly observable service may be less accurate. A silent service may be very accurate.
It does not compare architectures or publishers: services are anonymized by default through a sealed random permutation.

The Map

Five anonymized services, TruthfulQA corpus, N = 3,905 measurements.

[Figure: map of the five measured services. Legend: three observability zones (silent, blurry, observable); each point is a measured service, with color indicating observed entropy.]

How to Read the Map

Three simple readings, no technical prerequisites required.

Three Readings

Horizontal axis – Observable Entropy. The further right, the more the service “talks about itself” while responding. On the left, its internal dynamics remain silent. On the right, they become fully observable.
Vertical axis – Alert Reliability. The higher up, the more accurate the alerts produced by a monitoring layer: when the layer says “attention, this is drifting,” it is correct a high percentage of the time.
Point Size – Share of Detected Errors. A larger point catches more errors in real time. A smaller point lets more pass through.
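
The vertical axis and the point size can be made concrete with a small sketch. It assumes, as a reading of the definitions above rather than the published algorithm, that alert reliability corresponds to the precision of the alerts and that the share of detected errors corresponds to their recall, with the judge's correct/incorrect verdict as ground truth. The function name `alert_metrics` and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def alert_metrics(flagged: Sequence[bool], incorrect: Sequence[bool]) -> Tuple[float, float]:
    """Return (alert reliability, share of detected errors).

    flagged[i]:   True if the monitoring layer raised an alert on response i
    incorrect[i]: True if the judge marked response i as incorrect
    Assumption: reliability = precision, detected share = recall.
    """
    if len(flagged) != len(incorrect):
        raise ValueError("inputs must have the same length")
    true_alerts = sum(f and e for f, e in zip(flagged, incorrect))
    total_alerts = sum(flagged)
    total_errors = sum(incorrect)
    # Return 0.0 when a ratio is undefined (no alerts, or no errors).
    reliability = true_alerts / total_alerts if total_alerts else 0.0
    detected = true_alerts / total_errors if total_errors else 0.0
    return reliability, detected


# Hypothetical toy data: 4 alerts, 3 of them on incorrect answers; 5 errors overall.
flagged = [True, True, True, True, False, False, False, False]
incorrect = [True, True, True, False, True, True, False, False]
reliability, detected = alert_metrics(flagged, incorrect)
print(f"alert reliability: {reliability:.2f}")  # 0.75
print(f"detected errors:   {detected:.2f}")     # 0.60
```

In this toy case the layer is right three times out of four when it alerts, yet it still lets two of the five errors pass: a point placed high on the map, but of moderate size.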

PEDAGOGICAL ANALOGY

Imagine two cars. The first has a rich dashboard: oil gauge, temperature, tire pressure. The second has nothing. Both can drive equally well, but if a problem arises, only the first will warn you before breakdown.

The map measures the richness of the dashboard, not the performance of the engine. A service on the right allows you to see its drifts coming. A service on the left may work very well, but if it drifts, it does so in silence.

ControlTower™ Rating Scale

Seven grades, consistent thermal gradient: from deep blue (stable) to red (critical).

Grade	Semantics	Action
AAA	Perfect coherence. Very stable dynamics, minimal entropy.	Allow
AA	Stable. Some controlled fluctuations.	Allow
A	Nominal. Expected behavior of a mature service.	Allow
BBB	Vigilance. Occasional drift signals, to be logged.	Allow + log
BB	Warning. More frequent drifts, active monitoring required.	Flag
B	Unstable. Fluctuating dynamics, capture required.	Flag
CCC	Critical. Strong drift, runtime intervention essential.	Flag / Block

The numerical thresholds (composite ≥ 0.981 for AAA, < 0.834 for CCC, etc.) are published in the document ControlTower™ Methodology v1.0 on GitHub.

The Five Services Measured in This April 2026 V01 – Ratings

Ratings are calculated deterministically from the two raw dimensions (G-Score, FLAG rate).

Service	Observations	G-Score	FLAG Rate	Composite	Rating	Tier
P-002	780	0.9120	3.72 %	0.9374	A	Investment grade
P-001	780	0.9091	7.69 %	0.9161	BBB	Investment grade
P-004	781	0.9077	8.96 %	0.9090	BBB	Investment grade
P-003	782	0.8998	14.19 %	0.8789	BB	Speculative grade
P-005	782	0.8886	21.48 %	0.8369	B	Speculative grade
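
The composite column above is numerically consistent with an equal-weight average of the G-Score and the share of unflagged responses, that is composite = (G + (1 − FLAG)) / 2. This is an inference from the five published rows, not a statement of the official formula, which is defined in the ControlTower™ Methodology document. A minimal check, using the hypothetical helper `composite_guess`:

```python
# Published Wave 01 rows: service -> (G-Score, FLAG rate as a fraction, published composite)
ROWS = {
    "P-002": (0.9120, 0.0372, 0.9374),
    "P-001": (0.9091, 0.0769, 0.9161),
    "P-004": (0.9077, 0.0896, 0.9090),
    "P-003": (0.8998, 0.1419, 0.8789),
    "P-005": (0.8886, 0.2148, 0.8369),
}


def composite_guess(g_score: float, flag_rate: float) -> float:
    """Equal-weight average of G-Score and (1 - FLAG rate). Inferred, not official."""
    return (g_score + (1.0 - flag_rate)) / 2.0


for service, (g, flag, published) in ROWS.items():
    guess = composite_guess(g, flag)
    # Inputs are published rounded to 4 decimals, so allow a rounding-sized tolerance.
    assert abs(guess - published) < 1e-4, service
    print(f"{service}: guess={guess:.5f} published={published:.4f}")
```

If the inference holds, a service's composite drops by half a point for every additional percentage point of flagged responses, which matches the ordering of the table: the ranking follows the FLAG rate far more than the G-Score, whose range across the five services is narrow.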

Outlook – The Dynamic Dimension

A single rating is a snapshot. The outlook describes in which direction the service is evolving over a sliding observation window.

→ Stable. Composite does not drift significantly.
↑ Positive. Composite improves: fewer FLAGs, more stability.
↓ Negative. Composite degrades: reinforced monitoring recommended.
Under review. Insufficient history or excessive variance.

For Wave 01, the outlook remains n/a for all five services, because the measurement was synchronous rather than a continuous runtime flow. The outlook will be populated as soon as measurements run on continuous production traffic, which is the standard operating mode of the cartography.
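
As an illustration of how an outlook could be derived from a sliding window of composite values, the sketch below compares the older half of the window with the recent half. It is not the published method: the function name `outlook` and all three thresholds (`min_points`, `max_spread`, `tolerance`) are hypothetical placeholders chosen for the example.

```python
from statistics import mean, pstdev
from typing import Sequence


def outlook(composites: Sequence[float],
            min_points: int = 6,         # hypothetical minimum history
            max_spread: float = 0.02,    # hypothetical guard on the standard deviation
            tolerance: float = 0.005) -> str:  # hypothetical "significant drift" threshold
    """Classify the trend of a composite series over one observation window."""
    if len(composites) < max(min_points, 2):
        return "under review"            # insufficient history
    half = len(composites) // 2
    older, recent = composites[:half], composites[-half:]
    if max(pstdev(older), pstdev(recent)) > max_spread:
        return "under review"            # excessive variance
    delta = mean(recent) - mean(older)
    if delta > tolerance:
        return "positive"
    if delta < -tolerance:
        return "negative"
    return "stable"


print(outlook([0.90, 0.91, 0.90]))                          # under review
print(outlook([0.900, 0.901, 0.899, 0.920, 0.921, 0.919]))  # positive
print(outlook([0.920, 0.921, 0.919, 0.900, 0.901, 0.899]))  # negative
print(outlook([0.90] * 6))                                  # stable
```

Measuring the spread within each half, rather than over the whole window, keeps a clean upward or downward shift from being mistaken for noise.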

Method and Anonymization

Same Protocol for All

Each service was subjected to the same TruthfulQA corpus, with the same API call parameters, within the same time window. Measurements (entropy, reliability, detection rate) were calculated using a frozen version of the algorithm.

Responses were judged correct or incorrect by an independent third-party LLM (LLM-as-judge). Raw data, code, and protocol are published under CC-BY 4.0 / MIT. Any determined third party can reproduce the measurement identically.

P-001 to P-005, Sealed Permutation

The five identifiers do not reflect any order, whether alphabetical, chronological, or based on notoriety. A sealed random permutation is kept internally. No identifying metadata (size, origin, precise date) is published.
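
The publication does not describe how the permutation is drawn or sealed. One common way to do both, shown here purely as an assumption, is to shuffle the provider list with an operating-system random source and to keep a salted SHA-256 commitment of the resulting mapping. The helper names `assign_pseudonyms` and `seal` and the placeholder provider names are hypothetical.

```python
import hashlib
import json
import secrets


def assign_pseudonyms(providers):
    """Shuffle providers with an OS-level random source and label them P-001, P-002, ..."""
    shuffled = list(providers)
    secrets.SystemRandom().shuffle(shuffled)
    return {f"P-{i:03d}": name for i, name in enumerate(shuffled, start=1)}


def seal(mapping, salt: bytes) -> str:
    """Salted SHA-256 commitment to the mapping; it can be shown without revealing the mapping."""
    payload = json.dumps(mapping, sort_keys=True).encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


providers = ["provider-a", "provider-b", "provider-c", "provider-d", "provider-e"]  # placeholders
mapping = assign_pseudonyms(providers)  # kept internally
salt = secrets.token_bytes(16)          # kept internally; blocks guessing by enumeration
commitment = seal(mapping, salt)        # could be disclosed as the "seal"
print(sorted(mapping))                  # ['P-001', 'P-002', 'P-003', 'P-004', 'P-005']
print(len(commitment))                  # 64
```

With such a commitment, the mapping and salt can later be disclosed to show that the assignment was fixed before the results were known.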

Any measured service may request to be publicly named, or to be removed from the public cartography, at any time. The procedure is documented in the CONTEST.md file in the repository.

What Wave 01 Teaches Us

A runtime monitoring layer enables each service to gain operational stability.

Of the five services measured, none reached the AA rating in raw observation. Runtime governance (monitoring, capture, real-time alerting) shifts the composite upward by filtering drift events, so the achievable rating increases. The service becomes more predictable for the operator and more usable for the end user.

The observability of an AI and its factual accuracy are two distinct properties.

The more a service reveals its internal dynamics (high observable entropy), the less it tends, in this first wave, to be factually accurate on the TruthfulQA corpus, and vice versa. These two qualities deserve to be measured and worked on separately. This is the distinction that the cartography makes visible.

The map is partial, dated, and intended to grow.

Five services today, around twenty planned for Wave 02 (Q2/Q3 2026), with a target of forty-five services by the end of 2026. The conclusions of this first wave are observations to be confirmed on a broader panel: they raise questions rather than close them.

Open-access research repository -> zenodo.org/records/19762753
