NeoMundi Recherche publishes a public, sanitized version of an independent methodological review of its real-time approach to measuring the stability of generative AI. The review is based on a May 2026 comparative cohort of eight anonymized providers. It examines what a real-time signal can and cannot capture, documents the current limits of the method, and outlines the improvements now being tested. This document is not a certification, an official validation, or a guarantee of performance. It is shared as a contribution to open and auditable AI governance.

Key points
- Real-time measurement tracks the stability of a model’s generation dynamics, without requiring full content disclosure.
- Stability is not the same as truth: a model can be stable and still wrong.
- An anonymized case (P-003) produced large factual errors fluidly, without raising any anomaly.
- Current limits include reduced sensitivity on highly stable models and single-judge evaluation bias.
- Proposed improvements: non-linear scoring, multi-judge validation, and statistical drift detection.
Why real-time measurement matters
Most AI evaluation happens after the fact, on finished outputs. Real-time measurement is different: it observes a model while it is generating. For governance and auditability, that timing matters. Model behavior can shift between versions, or drift quietly after a silent provider update. A signal captured during generation gives an early, continuous view of that behavior.
The real-time stability layer can also provide this view without relying on full content disclosure. This is useful in settings where outputs are confidential, regulated, or sensitive, and where a governance layer needs a signal that does not depend on exposing every generated word.
What the real-time signal measures
The real-time signal describes the dynamics of generation rather than the meaning of the text. In practice it produces three things: a continuous stability score, a measure of how much that score varies, and a FLAG rate for detected anomalies.

Depending on the module, this stability layer can be complemented by other signals, for example coherence checks, an evaluation judge, or embedding-based comparisons. The stability layer itself, however, is deliberately lightweight.
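As a rough illustration of the three outputs described above, the sketch below summarizes one run. It assumes that a run yields a list of per-step stability scores in [0, 1] and one boolean anomaly flag per step. The function name `summarize_run`, the input format, and the sample values are hypothetical; this summary does not disclose how the actual signal is computed.

```python
from statistics import mean, pstdev

def summarize_run(scores, flags):
    """Summarize one run: mean stability, its variation, and the FLAG rate.

    Assumption: `scores` holds per-step stability scores and `flags`
    holds one boolean anomaly flag per step (hypothetical input format).
    """
    if not scores or len(scores) != len(flags):
        raise ValueError("scores and flags must be non-empty and equal in length")
    return {
        "stability": mean(scores),             # continuous stability score
        "variation": pstdev(scores),           # how much the score varies
        "flag_rate": sum(flags) / len(flags),  # share of steps flagged
    }

# Illustrative values only, not data from the review.
scores = [0.90, 0.80, 0.85, 0.95]
flags = [False, True, False, False]
summary = summarize_run(scores, flags)
print(summary)
```

Nothing in this summary looks at the generated text itself, which is why it cannot say whether the content is correct.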
Stability is not truth
A smooth, regular generation process tells us that the model is behaving consistently. It does not tell us that the content is correct. This distinction is the heart of the review.
The key methodological lesson
KEY TAKEAWAY
A model can be stable and wrong.
One model in the cohort, referenced only as P-003, illustrates this clearly. It recorded the worst accuracy of the group: around 24% correct answers, which means an error rate above 75%.
Yet its generation signals looked near-perfect. Its stability score was the highest in the cohort, its variation was effectively zero, and it raised no anomalies at all. In other words, the model produced large factual errors in a fluid, regular, and statistically stable way. Mathematical regularity during generation is not a guarantee of truthfulness.
Limits identified

The review is candid about where the method, in its current form, falls short:
- Reduced sensitivity on highly stable models. For such models, recall remains limited, and many factual errors may not be flagged by the stability signal alone.
- Single-judge evaluation bias. Relying on one evaluation judge introduces bias, and without a second rater there is no agreement measure, so the reliability of the evaluation itself cannot be quantified.
- Compensation effects in a linear composite score. When stability and anomalies are simply averaged, strong stability can mask poor behavior.
- Rigid drift thresholds. A single fixed threshold ignores each model’s natural variance.
- Difficulty detecting slow drift. Gradual, step-by-step degradation can pass unnoticed from one measurement window to the next.
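The last two limits can be illustrated with a small sketch. It assumes that each measurement window is reduced to a mean stability score and that drift is checked by comparing each window with the previous one against a fixed threshold. The cumulative-sum (CUSUM) check shown next to it is a generic textbook form, not the review's implementation, and the names `step_check` and `cusum_low` as well as all parameter values are hypothetical.

```python
def step_check(window_means, threshold):
    """Return indices of windows whose mean moved more than `threshold`
    away from the previous window (fixed, window-to-window check)."""
    return [
        i for i in range(1, len(window_means))
        if abs(window_means[i] - window_means[i - 1]) > threshold
    ]

def cusum_low(window_means, target, slack, limit):
    """One-sided CUSUM for downward drift.

    Accumulates shortfalls below `target` (minus a `slack` allowance) and
    returns the first window index where the sum exceeds `limit`, or None.
    """
    s = 0.0
    for i, x in enumerate(window_means):
        s = max(0.0, s + (target - x) - slack)
        if s > limit:
            return i
    return None

# Hypothetical series: the score slides by 0.01 per window, from 0.90 to 0.80.
window_means = [0.90 - 0.01 * i for i in range(11)]

step_alarms = step_check(window_means, threshold=0.05)
cusum_alarm = cusum_low(window_means, target=0.90, slack=0.005, limit=0.10)
print(step_alarms, cusum_alarm)
```

In this toy series no single step exceeds the fixed threshold, while the accumulated shortfall does cross its limit. The values of `slack` and `limit` would have to be tuned to each model's natural variance.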
Proposed improvements
The review proposes concrete, testable directions rather than fixed conclusions:
- A non-linear composite score, so that strong stability can no longer offset poor behavior.
- A multiplicative or exponential penalty applied to the anomaly rate.
- Multi-judge validation, replacing the single evaluator.
- Cohen’s Kappa, or Fleiss’ Kappa for more than two judges, to measure inter-rater agreement.
- Welch’s t-test for variance-aware detection of behavioral change.
- CUSUM monitoring to capture slow, cumulative drift.
- Integration of semantic accuracy into offline benchmark grading.
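To make the compensation effect and the penalty idea concrete, here is a minimal sketch comparing a plain average with an exponentially penalized score. The exact formulas under test are not given in this summary, so both functions, the equal weights, and the penalty strength `k` are assumptions chosen only for illustration.

```python
import math

def linear_composite(stability, anomaly_rate):
    """Plain average of stability and (1 - anomaly_rate).
    Hypothetical equal weights; high stability can offset anomalies."""
    return 0.5 * stability + 0.5 * (1.0 - anomaly_rate)

def penalized_composite(stability, anomaly_rate, k=5.0):
    """Stability multiplied by an exponential penalty on the anomaly rate.
    `k` is a hypothetical penalty strength, not a value from the review."""
    return stability * math.exp(-k * anomaly_rate)

# Illustrative case: very stable generation, but 40% of steps flagged.
stability, anomaly_rate = 0.98, 0.40
linear = linear_composite(stability, anomaly_rate)
penalized = penalized_composite(stability, anomaly_rate)
print(round(linear, 3), round(penalized, 3))
```

With these inputs the linear average stays near 0.79, while the penalized score falls below 0.2. Neither form helps in the stable-and-wrong case, where the anomaly rate is itself zero, which is why semantic accuracy has to come from a separate check.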
Public research roadmap
The next steps are empirical and open:
- Benchmark the candidate formulas on historical runs.
- Move from single-judge to multi-judge evaluation.
- Report inter-rater agreement alongside results.
- Improve drift detection with variance-aware methods.
- Ensure that stability never replaces truthfulness.
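Reporting inter-rater agreement could look like the following sketch, which computes Cohen's Kappa for two judges from the standard definition (observed agreement corrected for agreement expected by chance). The judge labels are invented for illustration, and the function name `cohens_kappa` is hypothetical. With more than two judges, Fleiss' Kappa would be the counterpart.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label;
        # returning 1.0 here is a convention, not part of the definition.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Invented gradings of eight answers by two hypothetical judges.
judge_a = ["correct", "correct", "wrong", "wrong",
           "correct", "wrong", "correct", "wrong"]
judge_b = ["correct", "wrong", "wrong", "wrong",
           "correct", "wrong", "correct", "correct"]

kappa = cohens_kappa(judge_a, judge_b)
print(kappa)
```

Here the judges agree on six of eight items, but half of that agreement is expected by chance, so the Kappa is lower than the raw agreement rate.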
Download the full review
The full sanitized methodological review is available as a PDF, with the complete reasoning, formulas, and limitations.
Disclaimer
This article summarizes an independent methodological contribution by Abdelkrim Halimi, Independent Data Scientist Contributor. It is not a certification, an official validation, a guarantee of performance, or a commercial endorsement of NeoMundi. Provider identities, proprietary datasets, API keys, infrastructure details, and commercially sensitive benchmark data are not disclosed.
Author of the report

Abdelkrim Halimi
NeoMundi Research Contributor · Data Scientist / Computer Vision
Qualifications: Data Scientist specialized in OCR, computer vision, predictive maintenance, and applied data analysis.
Mission: Independent methodological auditor and data analyst for the V3 cartography.
Objective: Provide an external, rigorous perspective on score distributions, signal consistency, methodological limits, and potential biases in NeoMundi’s runtime measurements.
Profile: [LinkedIn]
