Supplier Benchmark

Charles Dana — Monce SAS — 2026
10 Snake SAT models for the purchasing decision

1. The Dana Theorem — SAT construction

Any indicator function over a finite discrete domain can be encoded as a SAT instance in polynomial time. The construction is direct: for each non-member of the target class, build a clause that excludes it using feature-value differences. The conjunction of all such clauses is a CNF formula that accepts exactly the target class.
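A minimal sketch of this exclusion-clause construction, assuming samples are tuples of discrete feature values (the names `build_cnf` and `accepts` are illustrative, not Snake's API):

```python
# Illustrative sketch: encode membership in the target class as a CNF.
# For each non-member, emit one clause that is violated only by that exact
# feature combination; the conjunction then accepts only the target class.

def build_cnf(non_members):
    # A literal (i, v) reads "feature i differs from value v".
    return [[(i, v) for i, v in enumerate(nm)] for nm in non_members]

def accepts(cnf, sample):
    # A clause is satisfied when the sample differs on at least one literal.
    return all(any(sample[i] != v for i, v in clause) for clause in cnf)

# Target class: {(1, 0, 1)}; the two other observed samples are non-members.
cnf = build_cnf([(0, 0, 1), (1, 1, 0)])
```

Construction is one clause per non-member, so it stays polynomial in the data size, as the theorem requires.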

f(x) = Σ_{p ⊆ {1,...,n}} a_p · Π_{i ∈ p} x_i

Snake implements this constructively: oppose() finds literals that separate classes, minimize_clause() removes redundant literals, and the loop terminates when all non-members are covered. No backtracking, no exponential search. Total complexity: O(L × n × b × m) where L = layers, n = samples, b = bucket size, m = features.

Key insight: Snake doesn't solve SAT. It builds structured formulas where the data is the assignment. The NP-hardness of SAT applies to finding satisfying assignments for arbitrary formulas — Snake never does that. It constructs formulas directly from data in polynomial time, then evaluates them (also polynomial) at inference.

2. 10 models, one decision

The benchmark is a 10-dimensional classification problem. A monolithic model would need to predict compound labels like "Optimal with outlier prix and rising trend and high risk" — a combinatorial explosion across 3 × 2 × 2 × 2 × 4 × 3 × 3 × 3 × 3 × 4 = 31,104 possible class combinations.

Instead, we decompose into 10 independent Snake models:

Model                      Classes                                       AUROC    Accuracy
supplier_score             Optimal / Acceptable / Outlier                0.9715   85.6%
outlier_prix               Normal / Outlier                              0.9517   95.7%
outlier_delai              Normal / Outlier                              0.9385   94.0%
outlier_moq                Normal / Outlier                              0.9412   94.8%
tendance_prix              Stable / Hausse / Baisse / Volatile           0.9841   96.7%
fiabilite                  High / Medium / Low                           0.9634   97.3%
conditions_rating          Favorable / Standard / Defavorable            0.9731   96.3%
competitivite              Competitive / Average / Uncompetitive         0.9333   95.7%
risque_approvisionnement   Low / Medium / High                           0.9552   96.0%
recommandation             Maintenir / Developper / Negocier / Exclure   0.9734   94.5%
Average                                                                  0.9585   94.7%

Each model has 2-4 classes. Each is easy to train (3000 samples, 15 layers, <15s). The 10 predictions compose into a complete supplier profile: the buyer sees one table with 10 independent assessments, each with its own probability and audit trail.
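The composition step can be sketched as a loop over the 10 models (model names from the table above; `StubModel.predict` stands in for Snake inference, whose real signature is an assumption):

```python
# Hypothetical composition of 10 independent per-axis predictions into a
# single supplier profile; StubModel stands in for a trained Snake model.
MODEL_NAMES = [
    "supplier_score", "outlier_prix", "outlier_delai", "outlier_moq",
    "tendance_prix", "fiabilite", "conditions_rating",
    "competitivite", "risque_approvisionnement", "recommandation",
]

class StubModel:
    def __init__(self, label, proba):
        self._out = (label, proba)

    def predict(self, features):
        return self._out

def supplier_profile(models, features):
    # One row per model: predicted class plus its own probability.
    return {name: dict(zip(("classe", "proba"), models[name].predict(features)))
            for name in MODEL_NAMES}

models = {name: StubModel("Normal", 0.95) for name in MODEL_NAMES}
profile = supplier_profile(models, {"prix": 42.5, "delai": 15})
```

Each entry keeps its own probability, so the buyer-facing table needs no joint distribution over the 31,104 compound labels.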

3. The scoring formula

score = 0.35×prix + 0.30×delai + 0.20×fiabilite + 0.10×MOQ + 0.05×conditions

This formula was used to generate training labels for the supplier_score model. But Snake doesn't know about these weights. It learns the classification boundary from data, building SAT clauses that approximate the weighted sum without being told the weights.
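The label-generation formula translates directly to code (weights as stated above; the example axis values are invented):

```python
# Composite score used to generate supplier_score training labels.
# Weights come from the formula in the text; axis values are made up.
WEIGHTS = {"prix": 0.35, "delai": 0.30, "fiabilite": 0.20,
           "moq": 0.10, "conditions": 0.05}

def composite_score(axes):
    return sum(w * axes[k] for k, w in WEIGHTS.items())

axes = {"prix": 0.9, "delai": 0.8, "fiabilite": 0.7,
        "moq": 0.6, "conditions": 0.5}
score = composite_score(axes)  # weighted sum of the 5 axes
```

Snake only ever sees the resulting labels, never `WEIGHTS`.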

The resulting clauses are interpretable. A Snake audit trail reads like: "if score_prix > 0.8 and score_delai > 0.6 → Optimal." The weights emerge from clause structure, not from a parameter vector. This is the explainability payoff of SAT-based classification: the model's decision is a boolean formula you can read.

Why 5 axes, not 1?

A procurement manager doesn't think in composite scores. They think: "this supplier is cheap but slow." The spider chart (radar) renders 5 independent axes — and behind each axis, a dedicated Snake model validates the assessment. The composite score is for ranking. The individual axes are for understanding.

4. The V1 → V2 data fix

The clearest example of iterative improvement in the suite comes from the benchmark classifier: three models had broken minority classes in V1:

Model            Problem (V1)                    Fix (V2)               Result
outlier_prix     6% Outlier class → 0% recall    Rebalanced to 40%      93.9% recall
recommandation   2% Exclure class → 0% recall    Oversampled to 20%     92.5% recall
risque_appro.    Fuzzy features → 84% accuracy   Sharper risk formula   96.0% accuracy

The lesson: Snake is only as good as the data distribution. No model architecture change was needed — just better data. This is a general principle for all Monce Suite models: when a minority class has 0% recall, the first thing to check is class balance, not hyperparameters.

The fix was mechanical: generate more samples of the minority class with deliberate feature separation. outlier_prix went from 96 Outlier samples out of 2000 (5%) to 1200 out of 3000 (40%). The model immediately learned the boundary. Snake doesn't need massive data — it needs representative data.
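The rebalancing step can be sketched as plain oversampling (a hedged illustration: the real V2 fix generated fresh samples with deliberate feature separation rather than duplicating rows, and `rebalance` is an invented name):

```python
import random

# Oversample a minority class until it reaches a target share of the set.
# Illustrative only: the actual fix generated new, well-separated samples.
def rebalance(X, y, minority, target_share, seed=0):
    rng = random.Random(seed)
    idx = [i for i, label in enumerate(y) if label == minority]
    X2, y2 = list(X), list(y)
    while y2.count(minority) / len(y2) < target_share:
        i = rng.choice(idx)          # duplicate a random minority sample
        X2.append(X[i])
        y2.append(y[i])
    return X2, y2

X = [[i] for i in range(20)]
y = ["Outlier"] + ["Normal"] * 19          # 5% minority, as in V1
X2, y2 = rebalance(X, y, "Outlier", 0.40)  # 40% target, as in V2
```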

5. Multilingual reference harmonisation

The benchmark service handles multiple suppliers, each with their own product codes and potentially their own language:

GlassCorp:   "VS-FEU442"       → "Feuillete 44.2 LowE One"
VitroSupply: "VTS-4420-LE"     → "Feuillete 44.2 Low-E"
EuroVerre:   "EV-FEU-44.2-LE"  → "Laminated 44.2 Low-E coated"

These are the same product. The harmonisation pipeline:

  1. Claude Haiku extraction: normalises descriptions to standard French terminology. "Laminated" (EN) → "Feuillete" (FR). "Low-E coated" → "LowE". Abbreviations resolved: "Feuil." → "Feuillete".
  2. Internal reference matching: maps all three supplier codes to internal ref 66019.
  3. Confidence scoring: each match gets a confidence note (0.87-0.91 in the example).

Cross-language article matching is solved by the LLM layer. Snake works on the normalised refs — it never sees the raw multilingual descriptions. The LLM handles ambiguity, Snake handles classification. Clean separation of concerns.
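Steps 2 and 3 then reduce to a lookup from (supplier, code) to an internal reference plus a confidence note; a minimal sketch using the example data (the catalogue structure and the per-supplier assignment of the 0.87-0.91 confidences are assumptions):

```python
# Hypothetical catalogue mapping normalised supplier refs to internal
# ref 66019; confidences are from the example range, assigned arbitrarily.
CATALOGUE = {
    ("GlassCorp", "VS-FEU442"): ("66019", 0.91),
    ("VitroSupply", "VTS-4420-LE"): ("66019", 0.89),
    ("EuroVerre", "EV-FEU-44.2-LE"): ("66019", 0.87),
}

def match_internal_ref(supplier, code):
    # Unknown refs fall through with no match and zero confidence.
    internal, confidence = CATALOGUE.get((supplier, code), (None, 0.0))
    return {"internal_ref": internal, "confidence": confidence}
```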

6. Architecture

POST /classify
  → extraction.py: Claude Haiku harmonises N supplier refs
    → articles_harmonises (normalised, matched to internal refs)
      → classification.py: 10 Snake models per offer
        → supplier_score, outlier_prix, outlier_delai, outlier_moq
        → tendance_prix, fiabilite, conditions_rating
        → competitivite, risque_approvisionnement, recommandation
          → routes.py: assembles benchmark response
            → ranking + outliers + spider + xai audit

All 10 Snake models load at startup from JSON. Inference is ~50ms per model per offer. The bottleneck is Claude extraction (~1.5-2s). Total latency: ~2.4s.
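Those figures add up as a simple back-of-envelope budget (using ~50 ms per model and ~1.9 s for extraction, the midpoint of the stated range):

```python
# Back-of-envelope latency budget for one offer, matching the text:
# 10 Snake models at ~50 ms each, plus ~1.9 s for Claude extraction.
N_MODELS = 10
INFERENCE_MS = 50
EXTRACTION_MS = 1900

snake_ms = N_MODELS * INFERENCE_MS   # 500 ms of Snake inference
total_ms = snake_ms + EXTRACTION_MS  # 2400 ms, i.e. the ~2.4 s total
```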

7. anthropic=False: Snake without an LLM

/comprendre defaults to anthropic: false. In this mode, no LLM is called — the text is parsed by regex, each product line becomes its own article (no cross-supplier matching), and the 10 Snake models run on the raw extracted features.
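A hedged sketch of such a regex extractor (the line format and the pattern are assumptions, not the production parser):

```python
import re

# Assumed offer-line format: "<ref> <price> EUR <days>j MOQ <qty>".
LINE_RE = re.compile(
    r"(?P<ref>\S+)\s+(?P<prix>\d+(?:\.\d+)?)\s*EUR\s+"
    r"(?P<delai>\d+)\s*j\s+MOQ\s*(?P<moq>\d+)"
)

def parse_offer_line(line):
    # Returns None when the line does not match the expected shape.
    m = LINE_RE.search(line)
    if not m:
        return None
    return {
        "ref": m.group("ref"),
        "prix": float(m.group("prix")),
        "delai_jours": int(m.group("delai")),
        "moq": int(m.group("moq")),
    }

offer = parse_offer_line("VS-FEU442 42.50 EUR 15j MOQ 100")
```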

Design principle: the intelligence lives in the 10 Snake models, not in the LLM. Claude Haiku is a quality enhancer (matching, normalisation), not a requirement. An EC2 with no API key still runs a full benchmark pipeline — the Snake models score, detect outliers, classify risk, and recommend actions. The quality drops from 0.89 to 0.45, but the structure and the decisions are intact.

POST /comprendre (anthropic=false, default)
  → regex parser: extracts prices, delays, MOQ from text
    → passthrough harmonise: 1 line = 1 article (no matching)
      → 10 Snake models: score, outlier, trend, risk, recommendation
        → full benchmark response, mode: "regex", quality: 0.45

POST /comprendre (anthropic=true)
  → Claude Haiku: structures text into JSON offres
    → Claude Haiku: harmonises refs across suppliers
      → 10 Snake models: same pipeline
        → full benchmark response, mode: "haiku", quality: 0.89

The anthropic flag is explicit — the caller decides when to spend the ~$0.001 Haiku call. This is intentional: procurement tools run on internal networks where LLM calls may be restricted. Snake-only mode respects that constraint while still delivering actionable scoring.
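A minimal illustration of that caller-side choice (only the `anthropic` flag is documented above; the other payload field names are assumptions):

```python
# The caller opts in to the ~$0.001 Haiku call via the explicit anthropic
# flag; field names other than "anthropic" are assumptions.
def build_comprendre_payload(text, anthropic=False):
    return {"texte": text, "anthropic": anthropic}

offline = build_comprendre_payload("GlassCorp VS-FEU442 42.50 EUR 15j MOQ 100")
with_llm = build_comprendre_payload("GlassCorp VS-FEU442 42.50 EUR 15j MOQ 100",
                                    anthropic=True)
```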

See /genesis for the complete /comprendre specification.