Performance · March 2026

Benchmark

Real results from production data — 154 documents, 765 Q&A pairs, three task types. ECW vs. Base Model head-to-head.

68–92%
Cost reduction
vs. sending the full catalog to a frontier model. The advantage grows as frontier-model pricing rises.
1.000
Perfect score
ECW with Claude Opus achieved a perfect score on all 14 contradiction detection cases.
Unlimited
Catalog size
Base Model hit its context ceiling. ECW processed every chunk — no truncation.
Comparison

Awarity vs. the alternatives

| Capability | Awarity (ECW) | RAG | Native LLM |
|---|---|---|---|
| Handles unlimited dataset size | Yes | Partial | No (~1M token cap) |
| Zero retrieval lossiness | Yes | No (lossy by design) | Yes |
| Runs fully on-prem / offline | Yes | Partial | No |
| No embedding infrastructure | Yes | No (requires vector DB) | Yes |
| Model-agnostic | Yes (any LLM) | Partial | No (locked to provider) |
| Reads every document every time | Yes | No (retrieval sampling) | No (token limit) |
| Plugs into existing workflows | Yes (CLI + API) | Partial | Partial |
Cost

ECW is 68–92% cheaper — across every model tested

Cost savings are consistent across all three task types (needle-in-haystack, multi-document synthesis, contradiction detection) and all three model families. The more expensive the frontier model, the larger the ECW advantage.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f'}, 'primaryColor': '#0fa37f'}}}%%
xychart-beta
    title "Cost Savings: ECW vs. Base Model (%)"
    x-axis ["5.4 Needle", "5.4 Synth", "5.4 Contr.", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Savings (%)" 0 --> 100
    bar [92, 91, 91, 68, 68, 89, 88, 90]
```

All tests used the same 154-document due diligence catalog (731K tokens, 285 chunks). gpt-5.4 savings reflect OpenAI's 2× surcharge for prompts over 272K tokens — a threshold ECW's infer phase never crosses.

92%

gpt-5.4 savings

ECW keeps the infer context under 272K tokens, avoiding OpenAI's 2× surcharge, which Base Model always triggers with large catalogs.

91%

Contradiction detection

22 test cases, 285 chunks. ECW: $6.91. Base Model: $80.42. Quality was identical; both were near-perfect.

$0.04

Minimum infer cost

The infer phase processes a small curated context regardless of catalog size. As map models get cheaper, ECW's total cost approaches this floor.
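The surcharge and cost-floor mechanics above can be sketched as a simple pricing function. This is an input-token-only simplification: the 272K threshold and 2× multiplier come from this section, the $5/M gpt-5.4 input rate comes from the break-even table below, and output tokens are ignored.

```python
# Illustrative sketch of the long-prompt surcharge: prompts over 272K
# tokens are billed at 2x. Input tokens only; output cost is ignored.

SURCHARGE_THRESHOLD = 272_000   # tokens; ECW's infer context stays below this
SURCHARGE_MULTIPLIER = 2.0

def prompt_cost(tokens, rate_per_m=5.00):
    """Input cost in dollars for a single prompt at `rate_per_m` $/M tokens."""
    cost = tokens / 1e6 * rate_per_m
    if tokens > SURCHARGE_THRESHOLD:
        cost *= SURCHARGE_MULTIPLIER
    return cost
```

Under these assumptions, sending the full 731,195-token catalog in one prompt costs about $7.31 in input tokens alone, while a 100K-token ECW infer context costs about $0.50 and never crosses the surcharge threshold.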

Accuracy

Equal or better quality at a fraction of the cost

ECW matched or exceeded Base Model accuracy on every task type. Bars show ECW scores; the line shows Base Model scores. Higher is better.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Answer Correctness: ECW (bar) vs. Base Model (line)"
    x-axis ["5.4 Needle", "5.4 Synth", "4.1 Needle", "4.1 Synth", "Opus Needle", "Opus Synth", "Opus Contr."]
    y-axis "Score (%)" 75 --> 105
    bar [94, 89.9, 93.8, 90.1, 98.6, 91.8, 100]
    line [90.7, 90.9, 92.4, 83.8, 95.7, 93.1, 94.3]
```

Scores are answer correctness × 100. ECW with Claude Opus scored 100 on contradiction detection — all 14 cases correct. Base Model missed 4.

Quality equalization

A cheaper model with ECW outperforms a pricier model without it

On multi-document synthesis — the hardest category — gpt-4.1 with ECW scored 90.1, nearly matching gpt-5.4 Base Model at 90.9. Without ECW, gpt-4.1 scored only 83.8.

The gap isn't model intelligence — it's attention dilution. When you stuff 731K tokens into a single prompt, even a capable model struggles to locate the relevant facts. ECW gives every chunk focused attention in the map phase, then synthesizes only the relevant material.
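The map-then-infer flow described above can be sketched in a few lines. This is a minimal illustration, not the actual Awarity API: `map_model` and `infer_model` stand in for any LLM call.

```python
def map_phase(chunks, question, map_model):
    """Map: a cheap model gives every chunk focused attention and keeps
    only material relevant to the question."""
    notes = []
    for chunk in chunks:
        extract = map_model(
            f"Question: {question}\n\nChunk:\n{chunk}\n\n"
            "Return only facts relevant to the question, or nothing."
        )
        if extract.strip():          # drop chunks with nothing relevant
            notes.append(extract)
    return notes

def infer_phase(notes, question, infer_model):
    """Infer: one call over the small curated context, so prompt size is
    bounded no matter how large the catalog is."""
    context = "\n\n".join(notes)
    return infer_model(f"Context:\n{context}\n\nQuestion: {question}")

def ecw_answer(chunks, question, map_model, infer_model):
    return infer_phase(map_phase(chunks, question, map_model),
                       question, infer_model)
```

Because the infer prompt contains only the map phase's extracts, the final model reasons over focused material instead of 731K tokens of mostly irrelevant text.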

Cost comparison: gpt-5.4 Base Model costs ~$3.67 per synthesis case. gpt-4.1 with ECW costs ~$0.31 — 92% less, for equivalent quality.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Multi-Document Synthesis Score"
    x-axis ["gpt-5.4 Base Model", "gpt-4.1 + ECW", "gpt-4.1 Base Model"]
    y-axis "Score (%)" 80 --> 95
    bar [90.9, 90.1, 83.8]
```

gpt-4.1 + ECW nearly matches the flagship model at a fraction of the cost.

Scale

Base Model hits a wall. ECW doesn't.

Every Base Model approach is bounded by its context window. Claude Opus (200K tokens) couldn't even attempt the full 285-chunk catalog and had to be limited to 60 chunks. ECW processed the full catalog for every model with no truncation.

Context window limits

| Model | Context window | Base Model max | ECW max |
|---|---|---|---|
| Claude Opus | 200K tokens | ~155K tokens | Unlimited |
| gpt-4.1 | 512K tokens | ~480K tokens | Unlimited |
| gpt-5.4 | 1M tokens | ~950K tokens | Unlimited |

Cost at scale (gpt-5.4 infer)

| Corpus size | Base Model | ECW (4.1-mini map) | ECW (open-source map) |
|---|---|---|---|
| 731K tokens | $7.34 | $0.64 | ~$0.04 |
| 8M tokens | Impossible | ~$7 | ~$0.40 |
| 100M tokens | Impossible | ~$88 | ~$4 |

With an open-source map model, the only ECW cost is the single infer call. A 100M token catalog — ~75,000 pages — reasoned over for under $4.

Break-even

ECW pays for itself after just a few pages

ECW has two cost components: a map phase (cheap model, scales linearly) and a fixed infer phase (~$0.04 regardless of catalog size). The break-even is remarkably low.

| Infer model | Break-even point | Equivalent size |
|---|---|---|
| gpt-4.1 ($2/$8 per M) | ~22K tokens | ~16 pages |
| gpt-5.4 ($5/$22.50 per M) | ~4.3K tokens | ~3 pages |
| Claude Opus ($5/$25 per M) | ~3.5K tokens | ~3 pages |

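The two-part cost model behind these figures can be approximated in a few lines. This is a simplified input-token-only sketch using the $0.40/M gpt-4.1-mini map rate from the methodology; it reproduces the shape of the break-even, not the table's exact figures.

```python
MAP_RATE_PER_TOKEN = 0.40 / 1e6    # gpt-4.1-mini input, $/token (map phase)
INFER_FLOOR = 0.04                 # fixed infer cost, $/query

def ecw_cost(corpus_tokens):
    # Map cost scales linearly with corpus size; infer cost is fixed.
    return MAP_RATE_PER_TOKEN * corpus_tokens + INFER_FLOOR

def base_cost(corpus_tokens, rate_per_token=2.00 / 1e6):
    # Base Model re-reads the whole corpus on every query (gpt-4.1 input rate).
    return rate_per_token * corpus_tokens

def break_even_tokens(rate_per_token=2.00 / 1e6):
    # Solve rate * T = MAP_RATE * T + INFER_FLOOR for T.
    return INFER_FLOOR / (rate_per_token - MAP_RATE_PER_TOKEN)
```

Under these assumptions the gpt-4.1 break-even lands at 25,000 tokens, the same order of magnitude as the ~22K figure in the table.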
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'plotColorPalette': '#0fa37f,#f59e0b'}}}}%%
xychart-beta
    title "Cost vs. Corpus Size — ECW (green) vs. gpt-4.1 Base Model (amber)"
    x-axis ["50K", "100K", "250K", "500K", "731K", "1M"]
    y-axis "Cost ($)" 0 --> 3
    line [0.08, 0.12, 0.24, 0.44, 0.62, 0.83]
    line [0.13, 0.26, 0.66, 1.32, 1.93, 2.64]
```

ECW (green) starts slightly above Base Model at tiny corpus sizes due to the fixed infer cost, then diverges sharply. Past ~22K tokens, ECW is cheaper on every query.

Methodology

How we tested

Corpus: The UrbanWind due diligence catalog — 154 real documents (employment agreements, board resolutions, balance sheets, financial statements), 731,195 tokens across 285 pre-chunked segments. Production data from the Avalanche ingestion pipeline, not synthetic filler.

Test cases: An LLM generated 3–5 Q&A pairs per document, producing ~765 question-answer pairs across three categories: needle-in-haystack, multi-document synthesis, and contradiction detection.

Scoring: Fully deterministic — no LLM-as-judge. Answer Correctness = weighted keyword presence (40%), forbidden keyword absence (20%), pattern match (20%), non-empty check (20%). Both runners received identical inputs.
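A minimal sketch of such a deterministic scorer follows. The weights come from the formula above; the specific matching rules (case-insensitive substring checks, a single regex pattern) are assumptions, not the production implementation.

```python
import re

def answer_correctness(answer, required, forbidden, pattern):
    a = answer.lower()
    # 40%: fraction of required keywords present in the answer
    kw = sum(k.lower() in a for k in required) / max(len(required), 1)
    # 20%: no forbidden keyword appears
    forb = 0.0 if any(k.lower() in a for k in forbidden) else 1.0
    # 20%: expected regex pattern matches
    pat = 1.0 if re.search(pattern, answer, re.IGNORECASE) else 0.0
    # 20%: answer is non-empty
    filled = 1.0 if answer.strip() else 0.0
    return 0.40 * kw + 0.20 * forb + 0.20 * pat + 0.20 * filled
```

Because every component is a deterministic string check, both runners can be scored on identical inputs with no LLM-as-judge variance.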

Models tested: gpt-5.4 and gpt-4.1 at full 285-chunk catalog; Claude Opus at 60 chunks (limited by its 200K context window). ECW map model: gpt-4.1-mini ($0.40/$1.60 per M tokens). Concurrency: 4 parallel map calls.
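The 4-way map concurrency can be reproduced with a standard thread pool. This is a generic sketch: `map_one` stands in for a single map-phase model call and is not the actual runner API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_map_concurrent(chunks, map_one, concurrency=4):
    # Issue up to `concurrency` map calls in parallel; pool.map preserves
    # the original chunk order in the returned results.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(map_one, chunks))
```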

Full test data and raw results available on request — hello@awarity.ai

Run it against your own data

Want to benchmark Awarity on your catalog? Get in touch and we'll set up a private evaluation.

hello@awarity.ai