Real results from production data — 154 documents, 765 Q&A pairs, three task types. ECW vs. Base Model head-to-head.
| Capability | Awarity (ECW) | RAG | Native LLM |
|---|---|---|---|
| Handles unlimited dataset size | Yes | Partial | No — ~1M token cap |
| Zero retrieval lossiness | Yes | No — lossy by design | Yes |
| Runs fully on-prem / offline | Yes | Partial | No |
| No embedding infrastructure | Yes | No — requires vector DB | Yes |
| Model-agnostic | Yes — any LLM | Partial | No — locked to provider |
| Reads every document every time | Yes | No — retrieval sampling | No — token limit |
| Plugs into existing workflows | Yes — CLI + API | Partial | Partial |
Cost savings are consistent across all three task types (needle-in-haystack, multi-document synthesis, contradiction detection) and all three model families. The more expensive the frontier model, the larger the ECW advantage.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f'}, 'primaryColor': '#0fa37f'}}}%%
xychart-beta
    title "Cost Savings: ECW vs. Base Model (%)"
    x-axis ["5.4 Needle", "5.4 Synth", "5.4 Contr.", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Savings (%)" 0 --> 100
    bar [92, 91, 91, 68, 68, 89, 88, 90]
```
All tests used the same 154-document due diligence catalog (731K tokens, 285 chunks). gpt-5.4 savings reflect OpenAI's 2× surcharge for prompts over 272K tokens, a threshold the Base Model always crosses with this catalog but that ECW's infer phase never reaches.
22 test cases, 285 chunks. ECW: $6.91. BaseModel: $80.42. Quality was identical — both near-perfect.
The infer phase processes a small curated context regardless of catalog size. As map models get cheaper, ECW's total cost approaches this floor.
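The headline numbers above reduce to simple arithmetic. A minimal sketch, using only the totals reported for this benchmark (22 test cases over the 285-chunk catalog):

```python
# Per-run savings computed from the reported benchmark totals.
ecw_total = 6.91       # reported ECW total ($)
base_total = 80.42     # reported Base Model total ($)
cases = 22

savings_pct = (1 - ecw_total / base_total) * 100
per_case_ecw = ecw_total / cases
per_case_base = base_total / cases

print(f"savings: {savings_pct:.1f}%")                      # savings: 91.4%
print(f"per case: ${per_case_ecw:.2f} vs ${per_case_base:.2f}")
```

The per-case figures line up with the synthesis cost comparison later in this section.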
ECW matched or exceeded Base Model accuracy on every task type. Bars show ECW scores; the line shows Base Model scores. Higher is better.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Answer Correctness: ECW (bar) vs. Base Model (line)"
    x-axis ["5.4 Needle", "5.4 Synth", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Score (%)" 75 --> 105
    bar [94, 89.9, 93.8, 90.1, 98.6, 91.8, 100]
    line [90.7, 90.9, 92.4, 83.8, 95.7, 93.1, 94.3]
```
Scores are answer correctness × 100. ECW with Claude Opus scored 100 on contradiction detection — all 14 cases correct. Base Model missed 4.
On multi-document synthesis — the hardest category — gpt-4.1 with ECW scored 90.1, nearly matching gpt-5.4 BaseModel at 90.9. Without ECW, gpt-4.1 scored only 83.8.
The gap isn't model intelligence — it's attention dilution. When you stuff 731K tokens into a single prompt, even a capable model struggles to locate the relevant facts. ECW gives every chunk focused attention in the map phase, then synthesizes only the relevant material.
Cost comparison: gpt-5.4 BaseModel costs ~$3.67 per synthesis case. gpt-4.1 with ECW costs ~$0.31 — 92% less, for equivalent quality.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Multi-Document Synthesis Score"
    x-axis ["gpt-5.4 BaseModel", "gpt-4.1 + ECW", "gpt-4.1 BaseModel"]
    y-axis "Score (%)" 80 --> 95
    bar [90.9, 90.1, 83.8]
```
gpt-4.1 + ECW nearly matches the flagship model at a fraction of the cost.
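The map-then-infer flow described above can be sketched in a few lines. This is a hedged illustration, not the actual ECW implementation (which this document does not show); `map_fn` and `infer_fn` are hypothetical stand-ins for the cheap map-model call and the frontier infer-model call:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def ecw_answer(chunks: list[str], question: str,
               map_fn: Callable[[str, str], str],
               infer_fn: Callable[[str, str], str],
               concurrency: int = 4) -> str:
    """Two-phase sketch: every chunk gets focused attention in the map
    phase; the infer phase sees only the curated, relevant extracts."""
    # Map phase: read each chunk independently, in parallel.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        notes = list(pool.map(lambda c: map_fn(c, question), chunks))
    # Keep only chunks the map model flagged as relevant.
    relevant = [n for n in notes if n.strip() and n.strip() != "NONE"]
    # Infer phase: a small curated context, regardless of catalog size.
    return infer_fn("\n\n".join(relevant), question)
```

The key property is in the last line: the infer call's input grows with the number of *relevant* extracts, not with the size of the catalog.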
Every Base Model approach is bounded by its context window. Claude Opus (200K tokens) couldn't even attempt the full 285-chunk catalog and had to be limited to 60 chunks. ECW processed the full catalog for every model with no truncation.
| Model | Context Window | Base Model Max Corpus | ECW Max Corpus |
|---|---|---|---|
| Claude Opus | 200K tokens | ~155K tokens | Unlimited |
| gpt-4.1 | 512K tokens | ~480K tokens | Unlimited |
| gpt-5.4 | 1M tokens | ~950K tokens | Unlimited |
| Corpus size | BaseModel | ECW (4.1-mini map) | ECW (open-source map) |
|---|---|---|---|
| 731K tokens | $7.34 | $0.64 | ~$0.04 |
| 8M tokens | Impossible | ~$7 | ~$0.40 |
| 100M tokens | Impossible | ~$88 | ~$4 |
With an open-source map model, the only ECW cost is the single infer call. A 100M token catalog — ~75,000 pages — reasoned over for under $4.
ECW has two cost components: a map phase (cheap model, scales linearly) and a fixed infer phase (~$0.04 regardless of catalog size). The break-even is remarkably low.
| Infer model | Break-even point | Equivalent size |
|---|---|---|
| gpt-4.1 ($2/$8 per M) | ~22K tokens | ~16 pages |
| gpt-5.4 ($5/$22.50 per M) | ~4.3K tokens | ~3 pages |
| Claude Opus ($5/$25 per M) | ~3.5K tokens | ~3 pages |
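The break-even follows from setting the two cost curves equal: ECW costs roughly `map_rate × N + infer_fixed`, the Base Model roughly `base_rate × N`. A minimal sketch, assuming an illustrative $0.04 fixed infer cost and input-token rates only (real break-evens also involve output-token pricing, which is why the table's figures differ slightly):

```python
def break_even_tokens(base_rate_per_m: float,
                      map_rate_per_m: float,
                      infer_fixed: float) -> float:
    """Corpus size N where base_rate*N equals map_rate*N + infer_fixed."""
    per_token = (base_rate_per_m - map_rate_per_m) / 1_000_000
    return infer_fixed / per_token

# gpt-4.1 input at $2/M vs. a gpt-4.1-mini map at $0.40/M:
print(round(break_even_tokens(2.0, 0.40, 0.04)))  # 25000
```

About 25K tokens under these assumptions, the same order as the ~22K figure in the table; pricier frontier models push the break-even lower still.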
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Cost vs. Corpus Size — ECW (green) vs. gpt-4.1 BaseModel (amber)"
    x-axis ["50K", "100K", "250K", "500K", "731K", "1M"]
    y-axis "Cost ($)" 0 --> 3
    line [0.08, 0.12, 0.24, 0.44, 0.62, 0.83]
    line [0.13, 0.26, 0.66, 1.32, 1.93, 2.64]
```
ECW (green) starts slightly above BaseModel at tiny corpus sizes due to the fixed infer cost, then diverges sharply. Past ~22K tokens, ECW is cheaper on every query.
Corpus: The UrbanWind due diligence catalog — 154 real documents (employment agreements, board resolutions, balance sheets, financial statements), 731,195 tokens across 285 pre-chunked segments. Production data from the Avalanche ingestion pipeline, not synthetic filler.
Test cases: An LLM generated 3–5 Q&A pairs per document, producing ~765 question-answer pairs across three categories: needle-in-haystack lookup, multi-document synthesis, and contradiction detection.
Scoring: Fully deterministic — no LLM-as-judge. Answer Correctness = weighted keyword presence (40%), forbidden keyword absence (20%), pattern match (20%), non-empty check (20%). Both runners received identical inputs.
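The scoring rule above is simple enough to sketch directly. This is a hedged reconstruction: the function name, parameters, and exact matching rules are assumptions, but the 40/20/20/20 weighting follows the description:

```python
import re

def answer_correctness(answer: str,
                       keywords: list[str],
                       forbidden: list[str],
                       pattern: str) -> float:
    """Deterministic score in [0, 100]; no LLM-as-judge involved."""
    text = answer.lower()
    # 40%: weighted keyword presence
    kw = sum(k.lower() in text for k in keywords) / len(keywords) if keywords else 1.0
    # 20%: forbidden keyword absence
    fb = 0.0 if any(f.lower() in text for f in forbidden) else 1.0
    # 20%: regex pattern match
    pm = 1.0 if re.search(pattern, answer, re.IGNORECASE) else 0.0
    # 20%: non-empty check
    ne = 1.0 if answer.strip() else 0.0
    return 100 * (0.4 * kw + 0.2 * fb + 0.2 * pm + 0.2 * ne)
```

Because every component is a deterministic string check, both runners can be scored on identical inputs with no judge variance.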
Models tested: gpt-5.4 and gpt-4.1 ran against the full 285-chunk catalog; Claude Opus ran at 60 chunks (limited by its 200K context window). ECW map model: gpt-4.1-mini ($0.40/$1.60 per M tokens). Concurrency: 4 parallel map calls.
Full test data and raw results available on request — hello@awarity.ai
Want to benchmark Awarity on your catalog? Get in touch and we'll set up a private evaluation.
hello@awarity.ai