Calibration-weighted truth.

A self-improving prediction oracle where every AI agent's influence is weighted by its historical Brier score. Python engine, Solidity mirror on Base Sepolia, every probability and every weight independently verifiable.

Accuracy: 100% (50-case benchmark, seed=42)
Brier score: 0.0724 (lower than any single agent)
Python tests: 742 (90 adversarial · 55 Foundry · CI green)
Consensus latency: 3.3s (3 parallel agents, local LLM)

How it works

One question, three parallel agents, one calibrated answer. The weight formula is two lines of math; you can read it in swarm_oracle/weights.py and find the same math, bit-for-bit, in contracts/src/CalibrationRegistry.sol.

$ python swarm_verify.py "Did BTC close above 100K on May 5, 2026?"
========================================================================
  SWARM ORACLE  |  Calibration-Weighted Consensus
========================================================================
Question : Did BTC close above 100K on May 5, 2026?
Agents   : 3
Elapsed  : 3.30s

Individual votes:
  agent-oracle     | strategy=api         | P(YES)=0.030 | conf=0.90 | weight= 10.00 ( 60.1%) ████████████········
  agent-reliable   | strategy=web_search  | P(YES)=0.050 | conf=0.80 | weight=  5.56 ( 33.5%) ███████·············
  agent-novice     | strategy=knowledge   | P(YES)=0.500 | conf=0.00 | weight=  1.07 (  6.4%) █···················

Consensus:
  Weighted P(YES) = 0.0303
  Variance        = 0.0143
  Decision        = NO
========================================================================

1. Question in

Any binary (YES/NO) prediction question. CLI, FastAPI endpoint, or direct Python.

2. Parallel research

Each agent runs a different research strategy — API lookup, web search, knowledge-only — and reasons independently.
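As a rough illustration of the fan-out, the sketch below runs three strategies concurrently in a thread pool. The strategy functions and their return values are hypothetical stand-ins; the project's real orchestrator may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three research strategies. Real agents would
# call an API, run a web search, or query a local LLM here.
def api_lookup(question: str) -> float:
    return 0.03

def web_search(question: str) -> float:
    return 0.05

def knowledge_only(question: str) -> float:
    return 0.50

STRATEGIES = {
    "agent-oracle": api_lookup,
    "agent-reliable": web_search,
    "agent-novice": knowledge_only,
}

def collect_votes(question: str) -> dict:
    """Run every strategy concurrently and return P(YES) per agent."""
    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        futures = {name: pool.submit(fn, question) for name, fn in STRATEGIES.items()}
        # result() blocks until that agent finishes; no agent sees another's output.
        return {name: fut.result() for name, fut in futures.items()}

votes = collect_votes("Did BTC close above 100K on May 5, 2026?")
print(votes)
```

Threads suit this step because each agent mostly waits on I/O (network or LLM calls), so the slowest agent sets the overall latency.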

3. Calibration weighting

weight = 1 / (brier + ε), scaled by a confidence ramp for new agents. The small constant ε keeps the weight finite when an agent's Brier score is 0. Well-calibrated agents get a larger share of the vote.
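A minimal sketch of the weighting rule, for intuition only. The values of eps, ramp_n and floor, and the linear shape of the ramp, are assumptions made for this example; the constants actually used live in swarm_oracle/weights.py.

```python
def calibration_weight(brier: float, n_resolved: int,
                       eps: float = 0.02, ramp_n: int = 20,
                       floor: float = 0.1) -> float:
    """Inverse-Brier weight scaled by a linear confidence ramp.

    eps, ramp_n and floor are illustrative assumptions, not project constants.
    """
    if not 0.0 <= brier <= 1.0:
        raise ValueError("Brier score must lie in [0, 1]")
    base = 1.0 / (brier + eps)                 # lower Brier -> larger weight
    progress = min(1.0, n_resolved / ramp_n)   # fraction of the ramp completed
    ramp = floor + (1.0 - floor) * progress    # new agents start near the floor
    return base * ramp

for name, brier, n in [("established", 0.08, 50), ("newcomer", 0.08, 2)]:
    print(f"{name}: weight = {calibration_weight(brier, n):.2f}")
```

Two agents with the same Brier score get different weights here until the newcomer has enough resolved questions, which is the point of the ramp: a short lucky streak should not buy full influence.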

4. Weighted consensus

Linear opinion pool produces a single P(YES). If weighted variance crosses a threshold, the result is flagged DISPUTE rather than forced.
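The pooling step can be sketched as follows. The dispute threshold of 0.05 and the tie-break at 0.5 are assumptions chosen for the example, and the inputs are made-up numbers rather than the votes shown earlier.

```python
def pool_votes(probs, weights, dispute_variance=0.05):
    """Linear opinion pool with a variance gate.

    dispute_variance is a hypothetical threshold, not the project's setting.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one agent needs a positive weight")
    norm = [w / total for w in weights]                      # weights sum to 1
    p_yes = sum(w * p for w, p in zip(norm, probs))          # weighted mean
    variance = sum(w * (p - p_yes) ** 2 for w, p in zip(norm, probs))
    if variance > dispute_variance:
        decision = "DISPUTE"                                 # agents disagree too much
    else:
        decision = "YES" if p_yes >= 0.5 else "NO"
    return p_yes, variance, decision

print(pool_votes([0.9, 0.9], [1.0, 3.0]))   # agreement
print(pool_votes([0.0, 1.0], [1.0, 1.0]))   # maximal disagreement
```

The second call shows why a gate is useful: two equally weighted agents at 0 and 1 average to 0.5, a number that hides the disagreement entirely.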

5. On-chain verification

CalibrationRegistry.sol and SwarmConsensus.sol mirror the math in WAD (18-decimal) fixed-point. Anyone can recompute weights from public Brier scores.
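To make the fixed-point idea concrete, here is a Python model of WAD arithmetic using plain integers. The eps value and the helper names are assumptions for illustration; the contract's own constants and function bodies are not reproduced here.

```python
from decimal import Decimal

WAD = 10**18  # 1.0 in 18-decimal fixed point

def to_wad(x: str) -> int:
    """Convert a decimal string such as "0.08" to a WAD integer without floats."""
    return int(Decimal(x) * WAD)

def weight_wad(brier_wad: int, eps_wad: int) -> int:
    """1 / (brier + eps) in WAD, with truncating integer division."""
    # Scaling the numerator by WAD twice keeps the quotient in WAD units.
    return WAD * WAD // (brier_wad + eps_wad)

w = weight_wad(to_wad("0.08"), to_wad("0.02"))   # eps = 0.02 is an assumed value
print(w, w / WAD)
```

Because every step is integer arithmetic, a Python reimplementation and a contract that follow the same operation order can agree exactly, which floating point would not guarantee.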

6. Self-improving

Every resolution becomes training data. Brier scores update, weights re-derive, future predictions sharpen. The protocol gets smarter without a re-deploy.
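One way to picture the feedback loop is a per-agent record that accumulates squared error as questions resolve. The class and field names below are hypothetical; only the Brier definition (mean squared difference between forecast and outcome) is standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentRecord:
    brier_sum: float = 0.0   # cumulative squared error over resolved questions
    resolved: int = 0        # number of resolved questions

    def record(self, p_yes: float, outcome: bool) -> None:
        """Fold one resolved question into the running totals."""
        target = 1.0 if outcome else 0.0
        self.brier_sum += (p_yes - target) ** 2
        self.resolved += 1

    @property
    def brier(self) -> Optional[float]:
        """Mean Brier score, or None before the first resolution."""
        return self.brier_sum / self.resolved if self.resolved else None

rec = AgentRecord()
rec.record(0.5, True)    # hedged forecast, question resolved YES
rec.record(0.0, False)   # confident and correct
print(rec.brier)
```

Storing the sum and the count, rather than the mean, lets each resolution update the score in constant time, and the weight can then be re-derived from the new mean.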

The benchmark

A 50-case deterministic benchmark (seed=42), balanced YES/NO, with agents designed to fail on different subsets. A DISPUTE is scored as a correct abstention when agents genuinely disagree, so it counts toward accuracy rather than against it. Reproduce with make benchmark.

Method           Accuracy   Brier ↓   Disputes   Notes
swarm            100%       0.0724    18/50      Best Brier of all methods
majority vote    92.0%      0.0785    0
average          98.0%      0.0935    0
agent-oracle     84.0%      0.1029    0          Best single agent
agent-reliable   80.0%      0.1332    0
agent-novice     68.0%      0.2009    0

The swarm protocol beats every single agent on Brier score, including agent-oracle's 0.1029, and also beats the majority-vote and average baselines. On 18 of the 50 cases the variance gate turns genuine disagreement into an honest DISPUTE outcome rather than forcing a wrong answer.
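To show how abstentions feed into the accuracy column, the sketch below scores a handful of made-up cases. It assumes every DISPUTE is counted as correct and that the Brier score is averaged over all cases; the benchmark's actual scoring code may differ on both points.

```python
def score(cases):
    """cases: list of (decision, p_yes, truth), decision in {"YES", "NO", "DISPUTE"}."""
    correct = disputes = 0
    sq_err = 0.0
    for decision, p_yes, truth in cases:
        if decision == "DISPUTE":
            disputes += 1
            correct += 1                      # assumption: abstention scored as correct
        elif (decision == "YES") == truth:
            correct += 1
        sq_err += (p_yes - (1.0 if truth else 0.0)) ** 2
    n = len(cases)
    return correct / n, sq_err / n, disputes

# Invented cases, not benchmark data.
cases = [("YES", 0.9, True), ("NO", 0.2, True), ("DISPUTE", 0.5, False)]
accuracy, brier, disputes = score(cases)
print(accuracy, brier, disputes)
```

Under this convention accuracy and the dispute count should be read together: a method can only reach a high accuracy through abstention if it also reports how often it abstained.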

The architecture

Six Python modules feed four Solidity contracts. The Python ↔ Solidity boundary is policed by a 14-test parity suite that compares results bit-for-bit on a frozen corpus.

[Figure: Swarm Oracle Architecture. Calibration-Weighted Multi-Agent Prediction Consensus]

Question Input (CLI / API)
  → verifier.py (ThreadPool orchestrator)
      agent-oracle     strategy: API lookup   Brier: 0.08   weight: 60%
      agent-reliable   strategy: web search   Brier: 0.15   weight: 34%
      agent-novice     strategy: knowledge    Brier: n/a    weight: 6%
  → consensus.py + weights.py (weighted linear opinion pool, dispute detection)
  → YES | NO | DISPUTE

Base Sepolia L2, on-chain verification layer:
  CalibrationRegistry   Brier storage + WAD weight computation
  SwarmConsensus        Vote aggregation + resolution events
  RewardDistribution    ETH pools, 70/30 split, pull-payment pattern
  AgentIdentity         Soulbound ERC-721 reputation NFTs

Self-improving feedback loop: Resolution → Brier update → Weight recalculation → DPO training data

742 Python tests · 55 Foundry tests · 14 parity tests · 3.3s consensus · zero external deps

The on-chain layer

Four contracts, written in Solidity 0.8.24, optimised for Base Sepolia. Weight math is pure WAD (18-decimal fixed-point) — no external libraries, no on-chain sqrt, no approximations.

CalibrationRegistry.sol

Per-agent Brier-score storage. computeWeight(agent) reproduces the Python formula bit-for-bit on a 14-case parity corpus.

SwarmConsensus.sol

Vote aggregation. Reads weights from CalibrationRegistry, computes weighted P(YES) and squared-variance dispute threshold, emits Resolution(YES|NO|DISPUTE).
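One plausible reading of the squared-variance check is that the contract compares the weighted variance against the square of a spread threshold, which avoids taking a square root. The Python model below follows that reading with WAD integers; the function names and threshold values are assumptions.

```python
WAD = 10**18  # 1.0 in 18-decimal fixed point

def wad_mul(a: int, b: int) -> int:
    """Multiply two WAD values and rescale back to WAD."""
    return a * b // WAD

def is_dispute(probs_wad, weights_wad, threshold_wad: int) -> bool:
    """True when weighted variance exceeds threshold squared (no sqrt needed)."""
    total = sum(weights_wad)
    if total == 0:
        raise ValueError("no weight")
    mean = sum(w * p for w, p in zip(weights_wad, probs_wad)) // total
    # Python ints may go negative here; with Solidity's unsigned integers the
    # difference would need an absolute-value step before squaring.
    var = sum(w * wad_mul(p - mean, p - mean)
              for w, p in zip(weights_wad, probs_wad)) // total
    # sigma > t is equivalent to sigma^2 > t^2 for non-negative values.
    return var > wad_mul(threshold_wad, threshold_wad)

votes = [0, WAD]          # one agent at P(YES)=0, one at P(YES)=1
weights = [WAD, WAD]      # equal weights
print(is_dispute(votes, weights, 3 * 10**17))   # threshold 0.3
print(is_dispute(votes, weights, 6 * 10**17))   # threshold 0.6
```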

RewardDistribution.sol

Per-question reward pools. 70/30 split between correctness payouts and calibration improvement. Pull-payment pattern, no re-entrancy surface.
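The split and the pull-payment idea can be modelled with a toy ledger in Python. The class, its method names, and the assignment of 70% to the correctness bucket are assumptions for illustration; the contract's distribution rules are not reproduced here.

```python
class RewardPool:
    """Toy ledger: amounts are credited on resolution and withdrawn by the payee."""

    def __init__(self, pool_wei: int):
        self.correctness_wei = pool_wei * 70 // 100
        # Taking the remainder instead of a second percentage avoids losing
        # wei to integer rounding.
        self.calibration_wei = pool_wei - self.correctness_wei
        self.owed = {}

    def credit(self, agent: str, amount_wei: int) -> None:
        """Record that an agent is owed funds; nothing is sent yet."""
        self.owed[agent] = self.owed.get(agent, 0) + amount_wei

    def withdraw(self, agent: str) -> int:
        """Payee-initiated claim. The balance is zeroed before the amount is released."""
        amount = self.owed.get(agent, 0)
        self.owed[agent] = 0
        return amount

pool = RewardPool(1000)
pool.credit("agent-oracle", pool.correctness_wei)
print(pool.withdraw("agent-oracle"), pool.withdraw("agent-oracle"))
```

Zeroing the balance before releasing funds mirrors the ordering commonly used on-chain to keep a repeated or nested claim from paying out twice.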

AgentIdentity.sol

Soulbound ERC-721 per agent node. Transfers blocked. Stores cumulative Brier and resolution-count on-chain for transparent reputation.

Try it locally

Two minutes from git clone to a calibration-weighted answer. No API keys. No paid services. Local LLM optional — demo mode runs with zero network calls.

Quickstart
git clone https://github.com/solmonger/swarm-oracle.git
cd swarm-oracle

# Demo mode — no LLM required, deterministic, 3 seconds
python swarm_verify.py --demo "Did BTC close above 100K on May 5, 2026?"

# Or: full pipeline with a local llama.cpp / Ollama server
export LLM_API_URL="http://localhost:8080/v1/chat/completions"
# Single quotes stop the shell from expanding $3 inside the question
python swarm_verify.py 'Will ETH close above $3,000 on June 1, 2026?'

# Or: one-shot Docker
docker compose up                         # API at http://localhost:8000/docs
docker compose run oracle demo            # CLI demo, no LLM needed

Verify the math

Test & verify
make test               # 742 Python tests
make test-solidity      # 55 Foundry tests
make test-integration   # End-to-end pipeline
make benchmark          # Reproduce the comparison table above (100% accuracy, 0.0724 Brier)
make adversarial-compare  # Sybil vs bribery attack cost comparison
make economic-model-mvp   # Minimum viable pool by market size