Performance · March 2026

Benchmark

Real results from production data — 154 documents, 765 Q&A pairs, three task types. ECW vs. Base Model head-to-head.

68–92%
Cost reduction
vs. sending the full catalog to a frontier model. The advantage grows as frontier-model pricing rises.
1.000
Perfect score
ECW with Claude Opus achieved a perfect score on all 14 contradiction detection cases.
Unlimited
Catalog size
Base Model hit its context ceiling. ECW processed every chunk — no truncation.
Comparison

Awarity vs. the alternatives

| Capability | Awarity (ECW) | RAG | Native LLM |
|---|---|---|---|
| Handles unlimited dataset size | Yes | Partial | No (~1M token cap) |
| Zero retrieval lossiness | Yes | No (lossy by design) | Yes |
| Runs fully on-prem / offline | Yes | Partial | No |
| No embedding infrastructure | Yes | No (requires vector DB) | Yes |
| Model-agnostic | Yes (any LLM) | Partial | No (locked to provider) |
| Reads every document every time | Yes | No (retrieval sampling) | No (token limit) |
| Plugs into existing workflows | Yes (CLI + API) | Partial | Partial |
Cost

ECW is 68–92% cheaper — across every model tested

Cost savings are consistent across all three task types (needle-in-haystack, multi-document synthesis, contradiction detection) and all three model families. The more expensive the frontier model, the larger the ECW advantage.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f'}, 'primaryColor': '#0fa37f'}}}%%
xychart-beta
    title "Cost Savings: ECW vs. Base Model (%)"
    x-axis ["5.4 Needle", "5.4 Synth", "5.4 Contr.", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Savings (%)" 0 --> 100
    bar [92, 91, 91, 68, 68, 89, 88, 90]
```

All tests used the same 154-document due diligence catalog (731K tokens, 285 chunks). gpt-5.4 savings reflect OpenAI's 2× surcharge for prompts over 272K tokens — a threshold ECW's infer phase never crosses.

92%

gpt-5.4 savings

ECW keeps the infer context under 272K tokens, avoiding OpenAI's 2× surcharge, which Base Model always triggers with large catalogs.

91%

Contradiction detection

22 test cases, 285 chunks. ECW: $6.91. Base Model: $80.42. Quality was identical; both were near-perfect.

$0.04

Minimum infer cost

The infer phase processes a small curated context regardless of catalog size. As map models get cheaper, ECW's total cost approaches this floor.
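The surcharge and cost-floor mechanics above can be sketched as a simple pricing function. This is an input-token-only simplification: the 272K threshold and 2× multiplier come from this section, the $5/M gpt-5.4 input rate comes from the break-even table below, and output tokens are ignored.

```python
# Illustrative sketch of the long-prompt surcharge: prompts over 272K
# tokens are billed at 2x. Input tokens only; output cost is ignored.

SURCHARGE_THRESHOLD = 272_000   # tokens; ECW's infer context stays below this
SURCHARGE_MULTIPLIER = 2.0

def prompt_cost(tokens, rate_per_m=5.00):
    """Input cost in dollars for a single prompt at `rate_per_m` $/M tokens."""
    cost = tokens / 1e6 * rate_per_m
    if tokens > SURCHARGE_THRESHOLD:
        cost *= SURCHARGE_MULTIPLIER
    return cost
```

Under these assumptions, sending the full 731,195-token catalog in one prompt costs about $7.31 in input tokens alone, while a 100K-token ECW infer context costs about $0.50 and never crosses the surcharge threshold.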

Accuracy

Equal or better quality at a fraction of the cost

ECW matched or exceeded Base Model accuracy on every task type. Bars show ECW scores; the line shows Base Model scores. Higher is better.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Answer Correctness: ECW (bar) vs. Base Model (line)"
    x-axis ["5.4 Needle", "5.4 Synth", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Score (%)" 75 --> 105
    bar [94, 89.9, 93.8, 90.1, 98.6, 91.8, 100]
    line [90.7, 90.9, 92.4, 83.8, 95.7, 93.1, 94.3]
```

Scores are answer correctness × 100. ECW with Claude Opus scored 100 on contradiction detection — all 14 cases correct. Base Model missed 4.

Quality equalization

A cheaper model with ECW outperforms a pricier model without it

On multi-document synthesis — the hardest category — gpt-4.1 with ECW scored 90.1, nearly matching gpt-5.4 Base Model at 90.9. Without ECW, gpt-4.1 scored only 83.8.

The gap isn't model intelligence — it's attention dilution. When you stuff 731K tokens into a single prompt, even a capable model struggles to locate the relevant facts. ECW gives every chunk focused attention in the map phase, then synthesizes only the relevant material.
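The map-then-infer flow described above can be sketched in a few lines. This is a minimal illustration, not the actual Awarity API: `map_model` and `infer_model` stand in for any LLM call.

```python
def map_phase(chunks, question, map_model):
    """Map: a cheap model gives every chunk focused attention and keeps
    only material relevant to the question."""
    notes = []
    for chunk in chunks:
        extract = map_model(
            f"Question: {question}\n\nChunk:\n{chunk}\n\n"
            "Return only facts relevant to the question, or nothing."
        )
        if extract.strip():          # drop chunks with nothing relevant
            notes.append(extract)
    return notes

def infer_phase(notes, question, infer_model):
    """Infer: one call over the small curated context, so prompt size is
    bounded no matter how large the catalog is."""
    context = "\n\n".join(notes)
    return infer_model(f"Context:\n{context}\n\nQuestion: {question}")

def ecw_answer(chunks, question, map_model, infer_model):
    return infer_phase(map_phase(chunks, question, map_model),
                       question, infer_model)
```

Because the infer prompt contains only the map phase's extracts, the final model reasons over focused material instead of 731K tokens of mostly irrelevant text.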

Cost comparison: gpt-5.4 Base Model costs ~$3.67 per synthesis case. gpt-4.1 with ECW costs ~$0.31 — 92% less, for equivalent quality.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Multi-Document Synthesis Score"
    x-axis ["gpt-5.4 Base Model", "gpt-4.1 + ECW", "gpt-4.1 Base Model"]
    y-axis "Score (%)" 80 --> 95
    bar [90.9, 90.1, 83.8]
```

gpt-4.1 + ECW nearly matches the flagship model at a fraction of the cost.

Scale

Base Model hits a wall. ECW doesn't.

Every Base Model approach is bounded by its context window. Claude Opus (200K tokens) couldn't even attempt the full 285-chunk catalog and had to be limited to 60 chunks. ECW processed the full catalog for every model with no truncation.

Context window limits

| Model | Context window | Base Model max | ECW max |
|---|---|---|---|
| Claude Opus | 200K tokens | ~155K tokens | Unlimited |
| gpt-4.1 | 512K tokens | ~480K tokens | Unlimited |
| gpt-5.4 | 1M tokens | ~950K tokens | Unlimited |

Cost at scale (gpt-5.4 infer)

| Corpus size | Base Model | ECW (4.1-mini map) | ECW (open-source map) |
|---|---|---|---|
| 731K tokens | $7.34 | $0.64 | ~$0.04 |
| 8M tokens | Impossible | ~$7 | ~$0.40 |
| 100M tokens | Impossible | ~$88 | ~$4 |

With an open-source map model, the only ECW cost is the single infer call. A 100M token catalog — ~75,000 pages — reasoned over for under $4.

Break-even

ECW pays for itself after just a few pages

ECW has two cost components: a map phase (cheap model, scales linearly) and a fixed infer phase (~$0.04 regardless of catalog size). The break-even is remarkably low.

| Infer model | Break-even point | Equivalent size |
|---|---|---|
| gpt-4.1 ($2/$8 per M) | ~22K tokens | ~16 pages |
| gpt-5.4 ($5/$22.50 per M) | ~4.3K tokens | ~3 pages |
| Claude Opus ($5/$25 per M) | ~3.5K tokens | ~3 pages |

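The two-part cost model behind these figures can be approximated in a few lines. This is a simplified input-token-only sketch using the $0.40/M gpt-4.1-mini map rate from the methodology; it reproduces the shape of the break-even, not the table's exact figures.

```python
MAP_RATE_PER_TOKEN = 0.40 / 1e6    # gpt-4.1-mini input, $/token (map phase)
INFER_FLOOR = 0.04                 # fixed infer cost, $/query

def ecw_cost(corpus_tokens):
    # Map cost scales linearly with corpus size; infer cost is fixed.
    return MAP_RATE_PER_TOKEN * corpus_tokens + INFER_FLOOR

def base_cost(corpus_tokens, rate_per_token=2.00 / 1e6):
    # Base Model re-reads the whole corpus on every query (gpt-4.1 input rate).
    return rate_per_token * corpus_tokens

def break_even_tokens(rate_per_token=2.00 / 1e6):
    # Solve rate * T = MAP_RATE * T + INFER_FLOOR for T.
    return INFER_FLOOR / (rate_per_token - MAP_RATE_PER_TOKEN)
```

Under these assumptions the gpt-4.1 break-even lands at 25,000 tokens, the same order of magnitude as the ~22K figure in the table.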
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Cost vs. Corpus Size — ECW (green) vs. gpt-4.1 Base Model (amber)"
    x-axis ["50K", "100K", "250K", "500K", "731K", "1M"]
    y-axis "Cost ($)" 0 --> 3
    line [0.08, 0.12, 0.24, 0.44, 0.62, 0.83]
    line [0.13, 0.26, 0.66, 1.32, 1.93, 2.64]
```

ECW (green) starts slightly above Base Model at tiny corpus sizes due to the fixed infer cost, then diverges sharply. Past ~22K tokens, ECW is cheaper on every query.

Methodology

How we tested

Corpus: The UrbanWind due diligence catalog — 154 real documents (employment agreements, board resolutions, balance sheets, financial statements), 731,195 tokens across 285 pre-chunked segments. Production data from the Avalanche ingestion pipeline, not synthetic filler.

Test cases: An LLM generated 3–5 Q&A pairs per document, producing ~765 question-answer pairs across three categories: needle-in-haystack, multi-document synthesis, and contradiction detection.

Scoring: Fully deterministic — no LLM-as-judge. Answer Correctness = weighted keyword presence (40%), forbidden keyword absence (20%), pattern match (20%), non-empty check (20%). Both runners received identical inputs.
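A minimal sketch of such a deterministic scorer follows. The weights come from the formula above; the specific matching rules (case-insensitive substring checks, a single regex pattern) are assumptions, not the production implementation.

```python
import re

def answer_correctness(answer, required, forbidden, pattern):
    a = answer.lower()
    # 40%: fraction of required keywords present in the answer
    kw = sum(k.lower() in a for k in required) / max(len(required), 1)
    # 20%: no forbidden keyword appears
    forb = 0.0 if any(k.lower() in a for k in forbidden) else 1.0
    # 20%: expected regex pattern matches
    pat = 1.0 if re.search(pattern, answer, re.IGNORECASE) else 0.0
    # 20%: answer is non-empty
    filled = 1.0 if answer.strip() else 0.0
    return 0.40 * kw + 0.20 * forb + 0.20 * pat + 0.20 * filled
```

Because every component is a deterministic string check, both runners can be scored on identical inputs with no LLM-as-judge variance.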

Models tested: gpt-5.4 and gpt-4.1 at full 285-chunk catalog; Claude Opus at 60 chunks (limited by its 200K context window). ECW map model: gpt-4.1-mini ($0.40/$1.60 per M tokens). Concurrency: 4 parallel map calls.
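The 4-way map concurrency can be reproduced with a standard thread pool. This is a generic sketch: `map_one` stands in for a single map-phase model call and is not the actual runner API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_map_concurrent(chunks, map_one, concurrency=4):
    # Issue up to `concurrency` map calls in parallel; pool.map preserves
    # the original chunk order in the returned results.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(map_one, chunks))
```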

Full test data and raw results available on request — hello@awarity.ai

Run it against your own data

Want to benchmark Awarity on your catalog? Get in touch and we'll set up a private evaluation.

hello@awarity.ai