Mapping Drift, Entropy and Waste in Generative Intelligences.
Important Note:
This publication [V1 (data/v1-2026-04-26)] is preserved as an exploratory release. It was used to build and validate the initial anonymization and measurement pipeline.
The next publication [V2 (data/v2-2026-05-17)] constitutes the first official reference baseline of the NeoMundi Runtime Cartography, generated through the reproducible anonymization pipeline.
Provider pseudonyms introduced in V2 are stable for future releases, unless explicitly documented. Direct provider-by-provider comparisons should therefore start from V2.
See GitHub for more information -> https://github.com/neomundi-io/llm-cartography
SUMMARY
Five anonymized generative services were measured on the same corpus (TruthfulQA), during the same time window, using the same protocol. For each service, two runtime dimensions were calculated: the observable entropy of its generation (how much the internal dynamics of the service reveal what is happening while it responds) and the reliability of alerts that a monitoring layer can produce on it in real time.
This publication is the first wave of a continuous cartography. It uses stable anonymous identifiers (P-001 to P-005); real names are not disclosed in this version. The data and code are public.
WHAT THE MAP SAYS, WHAT IT DOES NOT SAY
Before reading the map, it is useful to distinguish what it measures and what it does not.
WHAT THE MAP SAYS
- How much a service reveals its internal dynamics while generating text, its observable entropy.
- How reliably a monitoring layer can produce alerts in real time: when it says “this response is drifting,” what percentage of the time it is correct.
These two properties are measured, not estimated, using a single protocol and public source code.
WHAT THE MAP DOES NOT SAY
- It does not rank services by general quality, intelligence, usefulness, or commercial performance.
- It does not measure the factual accuracy of the produced answers, which is a distinct dimension. A highly observable service may be less accurate. A silent service may be very accurate.
- It does not compare architectures or publishers: services are anonymized by default, under a sealed random permutation.
The Map
Five anonymized services, TruthfulQA corpus, N = 3,905 measurements.
How to Read the Map
Three simple readings, no technical prerequisites required.
Three Readings
- Horizontal axis – Observable Entropy. The further right, the more the service “talks about itself” while responding. On the left, its internal dynamics remain silent. On the right, they become fully observable.
- Vertical axis – Alert Reliability. The higher up, the more accurate the alerts produced by a monitoring layer: when the layer says “attention, this is drifting,” it is correct a high percentage of the time.
- Point Size – Share of Detected Errors. A larger point catches more errors in real time. A smaller point lets more pass through.
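The vertical axis and the point size correspond to what is commonly called the precision and the recall of the alerts. A minimal sketch, assuming each measurement can be reduced to two booleans: `flagged` (the monitoring layer raised an alert) and `is_error` (the judge marked the response incorrect). The field names and the sample records are hypothetical, and the published pipeline may define these quantities differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measurement:
    flagged: bool   # hypothetical field: the monitoring layer raised an alert
    is_error: bool  # hypothetical field: the judge marked the response incorrect


def alert_reliability(ms: Sequence[Measurement]) -> Optional[float]:
    """Share of alerts raised on responses judged incorrect (precision)."""
    alerts = [m for m in ms if m.flagged]
    if not alerts:
        return None  # undefined when no alert was raised
    return sum(m.is_error for m in alerts) / len(alerts)


def detected_error_share(ms: Sequence[Measurement]) -> Optional[float]:
    """Share of incorrect responses that triggered an alert (recall)."""
    errors = [m for m in ms if m.is_error]
    if not errors:
        return None  # undefined when no response was judged incorrect
    return sum(m.flagged for m in errors) / len(errors)


# Illustrative records only, not taken from the published data.
sample = [
    Measurement(flagged=True, is_error=True),
    Measurement(flagged=True, is_error=False),
    Measurement(flagged=True, is_error=True),
    Measurement(flagged=False, is_error=True),
    Measurement(flagged=False, is_error=True),
    Measurement(flagged=False, is_error=False),
]

print(alert_reliability(sample))     # 2 of 3 alerts were on errors
print(detected_error_share(sample))  # 2 of 4 errors were flagged
```

Keeping the two functions separate mirrors the map: a service can sit high on the vertical axis (few false alerts) while still being drawn as a small point (many errors missed).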
PEDAGOGICAL ANALOGY
Imagine two cars. The first has a rich dashboard: oil gauge, temperature, tire pressure. The second has nothing. Both can drive equally well, but if a problem arises, only the first will warn you before breakdown.
The map measures the richness of the dashboard, not the performance of the engine. A service on the right allows you to see its drifts coming. A service on the left may work very well, but if it drifts, it does so in silence.
ControlTower™ Rating Scale
Seven grades, consistent thermal gradient: from deep blue (stable) to red (critical).
The numerical thresholds (composite ≥ 0.981 for AAA, < 0.834 for CCC, etc.) are published in the document ControlTower™ Methodology v1.0 on GitHub.
The Five Services Measured in This April 2026 Release (V1) – Ratings
Ratings calculated deterministically from the two raw dimensions (G-Score, FLAG rate).
| Service | Observations | G-Score | FLAG Rate | Composite | Rating | Tier |
|---|---|---|---|---|---|---|
| P-002 | 780 | 0.9120 | 3.72% | 0.9374 | A | Investment grade |
| P-001 | 780 | 0.9091 | 7.69% | 0.9161 | BBB | Investment grade |
| P-004 | 781 | 0.9077 | 8.96% | 0.9090 | BBB | Investment grade |
| P-003 | 782 | 0.8998 | 14.19% | 0.8789 | BB | Speculative grade |
| P-005 | 782 | 0.8886 | 21.48% | 0.8369 | B | Speculative grade |
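The table can be sanity-checked with a few lines of arithmetic. The sketch below copies the five rows, confirms that the observation counts add up to the stated N = 3,905, and tests one hypothesis about the composite: that it is the simple average of the G-Score and (1 − FLAG rate). That formula is an inference from the published figures, not the documented method; the authoritative definition is the one in the ControlTower™ Methodology document.

```python
# Rows copied from the ratings table:
# (service, observations, G-Score, FLAG rate as a fraction, composite)
ROWS = [
    ("P-002", 780, 0.9120, 0.0372, 0.9374),
    ("P-001", 780, 0.9091, 0.0769, 0.9161),
    ("P-004", 781, 0.9077, 0.0896, 0.9090),
    ("P-003", 782, 0.8998, 0.1419, 0.8789),
    ("P-005", 782, 0.8886, 0.2148, 0.8369),
]


def simple_average(g_score: float, flag_rate: float) -> float:
    """Assumed formula (inferred, not documented): mean of G-Score and 1 - FLAG rate."""
    return (g_score + (1.0 - flag_rate)) / 2.0


# Total number of measurements across the five services.
total = sum(observations for _, observations, _, _, _ in ROWS)

# Largest gap between the assumed formula and the published composite.
max_gap = max(
    abs(simple_average(g, flag) - composite)
    for _, _, g, flag, composite in ROWS
)

print(total)  # 3905
print(f"largest gap vs. published composite: {max_gap:.5f}")
```

Under this assumption the gap stays within the rounding of the four-decimal figures, which is consistent with, but does not prove, an equal weighting of the two dimensions.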
Outlook – The Dynamic Dimension
A single rating is a snapshot. The outlook describes in which direction the service is evolving over a sliding observation window.
In Wave 01, the outlook remains n/a for all five services: the measurement was synchronous, not a continuous runtime flow. The outlook will be populated as soon as measurements run on continuous production traffic, which is the standard operating mode of the cartography.
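One way an outlook could be derived once continuous data exists is to compare the mean composite over the most recent window with the mean over the window just before it. The sketch below is purely illustrative: the window size, the tolerance, and the labels "positive", "stable", and "negative" are hypothetical choices, since the publication does not specify how the outlook is computed.

```python
from statistics import mean
from typing import Sequence


def outlook(composites: Sequence[float], window: int = 50,
            tolerance: float = 0.005) -> str:
    """Compare the latest window of composite scores with the preceding one.

    All parameters and labels are hypothetical, chosen for illustration only.
    """
    if len(composites) < 2 * window:
        return "n/a"  # not enough history, as in a synchronous one-off wave
    recent = mean(composites[-window:])
    previous = mean(composites[-2 * window:-window])
    delta = recent - previous
    if delta > tolerance:
        return "positive"
    if delta < -tolerance:
        return "negative"
    return "stable"


print(outlook([0.90] * 10))                # n/a
print(outlook([0.90] * 50 + [0.92] * 50))  # positive
print(outlook([0.92] * 50 + [0.90] * 50))  # negative
```

Returning "n/a" when the history is shorter than two windows matches the situation described above, where a single synchronous measurement gives no direction.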
Method and Anonymization
Same Protocol for All
Each service was subjected to the same TruthfulQA corpus, with the same API call parameters, within the same time window. Measurements (entropy, reliability, detection rate) were calculated using a frozen version of the algorithm.
Responses were judged correct or incorrect by an independent third-party LLM (LLM-as-judge). Raw data, code, and protocol are published under CC-BY 4.0 / MIT. Any determined third party can reproduce the measurement identically.
P-001 to P-005, Sealed Permutation
The five identifiers do not reflect any order, whether alphabetical, chronological, or based on notoriety. A sealed random permutation is kept internally. No identifying metadata (size, origin, precise date) is published.
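The publication does not describe how the permutation is generated or sealed. As a hedged illustration of one common approach, the sketch below shuffles a list of placeholder service names with a cryptographically secure random source, assigns the identifiers P-001 to P-005, and computes a salted SHA-256 digest that could be published as a commitment while the mapping itself stays private. The placeholder names, the function names, and the commitment scheme are all assumptions.

```python
import hashlib
import json
import secrets
from typing import Dict, Sequence


def assign_pseudonyms(real_names: Sequence[str]) -> Dict[str, str]:
    """Map P-001, P-002, ... to the names in a securely shuffled order."""
    shuffled = list(real_names)
    secrets.SystemRandom().shuffle(shuffled)  # OS-level randomness
    return {f"P-{i:03d}": name for i, name in enumerate(shuffled, start=1)}


def seal(mapping: Dict[str, str], salt: bytes) -> str:
    """Salted SHA-256 digest of the mapping (hypothetical commitment scheme)."""
    payload = json.dumps(mapping, sort_keys=True).encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


# Placeholder names only; the real service names are not disclosed.
names = ["service-a", "service-b", "service-c", "service-d", "service-e"]

mapping = assign_pseudonyms(names)  # kept internally
salt = secrets.token_bytes(16)      # kept internally, prevents guessing attacks
commitment = seal(mapping, salt)    # the only value that would be published

print(sorted(mapping))  # ['P-001', 'P-002', 'P-003', 'P-004', 'P-005']
```

With such a scheme, revealing the mapping and the salt later lets anyone recompute the digest and confirm that the assignment was fixed in advance. The salt matters here because five services allow only 120 possible orderings, which would otherwise be trivial to enumerate.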
Any measured service may request to be publicly named, or to be removed from the public cartography, at any time. The procedure is documented in the CONTEST.md file in the repository.
What Wave 01 Teaches Us
A runtime monitoring layer enables each service to gain operational stability.
Of the five services measured, none reached the AA rating in raw observation. Runtime governance (monitoring, capture, real-time alerting) shifts the composite upward by filtering drift events. The achievable rating increases. The service becomes more predictable for the operator and more usable for the end user.
The observability of an AI and its factual accuracy are two distinct properties.
The more a service reveals its internal dynamics (high observable entropy), the less it tends, in this first wave, to be factually accurate on the TruthfulQA corpus, and vice versa. These two qualities deserve to be measured and worked on separately. This is the distinction that the cartography makes visible.
The map is partial, dated, and intended to grow.
Five services today, around twenty planned for Wave 02 (Q2/Q3 2026), with a target of forty-five services by the end of 2026. The conclusions of this first wave are observations to be confirmed on a broader panel: they raise questions; they do not close them.
Open-access research repository -> zenodo.org/records/19762753
