Public methodological review of NeoMundi’s May 2026 Comparative Cohort

NeoMundi Recherche publishes a public, sanitized version of an independent methodological review of its real-time approach to measuring the stability of generative AI. The review is based on a May 2026 comparative cohort of eight anonymized providers. It examines what a real-time signal can and cannot capture, documents the current limits of the method, and outlines the improvements now being tested. This document is not a certification, an official validation, or a guarantee of performance. It is shared as a contribution to open and auditable AI governance.

Key points

Real-time measurement tracks the stability of a model’s generation dynamics, without requiring full content disclosure.
Stability is not the same as truth: a model can be stable and still wrong.
An anonymized case (P-003) produced large factual errors fluidly, without raising any anomaly.
Current limits include reduced sensitivity on highly stable models and single-judge evaluation bias.
Proposed improvements: non-linear scoring, multi-judge validation, and statistical drift detection.

Why real-time measurement matters

Most AI evaluation happens after the fact, on finished outputs. Real-time measurement is different: it observes a model while it is generating. For governance and auditability, that timing matters. Model behavior can shift between versions, or drift quietly after a silent provider update. A signal captured during generation gives an early, continuous view of that behavior.

It can also do so without relying on full content disclosure in the real-time stability layer. This is useful in settings where outputs are confidential, regulated, or sensitive, and where a governance layer needs a signal that does not depend on exposing every generated word.

What the real-time signal measures

The Real-time signal describes the dynamics of generation rather than the meaning of the text. In practice it produces three things: a continuous stability score, a measure of how much that score varies, and a FLAG rate for detected anomalies.

Depending on the module, this stability layer can be complemented by other signals, for example coherence checks, an evaluation judge, or embedding-based comparisons. The stability layer itself, however, is deliberately lightweight.

Stability is not truth

A smooth, regular generation process tells us that the model is behaving consistently. It does not tell us that the content is correct. This distinction is the heart of the review.

The key methodological lesson

KEY TAKEAWAY
A model can be stable and wrong.

One model in the cohort, referenced only as P-003, illustrates this clearly. It recorded the worst accuracy of the group, around 24% correct answers, an error rate above 75%.

Yet its generation signals looked near-perfect. Its stability score was the highest in the cohort, its variation was effectively zero, and it raised no anomalies at all. In other words, the model produced large factual errors in a fluid, regular, and statistically stable way. Mathematical regularity during generation is not a guarantee of truthfulness.

Limits identified

The review is candid about where the method, in its current form, falls short:

Reduced sensitivity on highly stable models. On highly stable models, recall remains limited, and many factual errors may not be flagged by the stability signal alone.
Single-judge evaluation bias. Relying on one evaluation judge introduces bias, and the reliability of the evaluation itself cannot be measured.
Compensation effects in a linear composite score. When stability and anomalies are simply averaged, strong stability can mask poor behavior.
Rigid drift thresholds. A single fixed threshold ignores each model’s natural variance.
Difficulty detecting slow drift. Gradual, step-by-step degradation can pass unnoticed from one measurement window to the next.

Proposed improvements

The review proposes concrete, testable directions rather than fixed conclusions:

A non-linear composite score, so generation quality can no longer offset poor behavior.
A multiplicative or exponential penalty applied to the anomaly rate.
Multi-judge validation, replacing the single evaluator.
Cohen’s Kappa, or Fleiss’ Kappa for more than two judges, to measure inter-rater agreement.
Welch’s t-test for variance-aware detection of behavioral change.
CUSUM monitoring to capture slow, cumulative drift.
Integration of semantic accuracy into offline benchmark grading.

Public research roadmap

The next steps are empirical and open:

Benchmark the candidate formulas on historical runs.
Move from single-judge to multi-judge evaluation.
Report inter-rater agreement alongside results.
Improve drift detection with variance-aware methods.
Ensure that stability never replaces truthfulness.

Download the full review

The full sanitized methodological review is available as a PDF, with the complete reasoning, formulas, and limitations.

Download the full report in PDF

Disclaimer

This article summarizes an independent methodological contribution by Abdelkrim Halimi, Independent Data Scientist Contributor. It is not a certification, an official validation, a guarantee of performance, or a commercial endorsement of NeoMundi. Provider identities, proprietary datasets, API keys, infrastructure details, and commercially sensitive benchmark data are not disclosed.

Author of the report

Abdelkrim Halimi
NeoMundi Research Contributor · Data Analyst & Methodological Reviewer
Data Science, Computer Vision & Independent Critical Analysis

Abdelkrim Halimi is a data scientist specialized in OCR, computer vision, predictive maintenance, and applied data analysis.

Within the Observatory, he contributes as a methodological reviewer, providing independent critical analysis of measurement campaigns, score distributions, signal consistency, and the limitations of observation protocols.

His contribution aims to strengthen the scientific robustness of NeoMundi’s cartographies and reports by bringing an external, rigorous, and documented perspective on observed results, potential biases, interpretation assumptions, and methodological points of vigilance.

He notably contributed to an independent methodological review of the May 2026 comparative cohort, in a spirit of scientific caution, transparency, and continuous improvement of analysis methods.

Mission : Independent critical review of results, identification of methodological limitations, analysis of potential biases, and contribution to the scientific robustness of the Observatory’s reports.

Profile : [LinkedIn]