AI Model Cards & Evaluation
Models, prompt versions, measured evaluation metrics, and known limitations for every AI enrichment — the glass-box behind PortfolIQ's AI-enriched data.
PortfolIQ enriches financial data with AI. Unlike most "AI-enriched data" products, which are black boxes, we publish which model and prompt produced each value, how reliable it is, and how that reliability is measured. This is our Angle A (explainable, historised AI) made verifiable.
The machine-readable version of everything below is served at
GET /v1/ai/methodology (public, no API key) so you can automate vendor
due-diligence against us.
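A minimal sketch of how an automated due-diligence check might pull and inspect the report. Only the path `/v1/ai/methodology` comes from this page. The base URL passed to `fetch_methodology` is a placeholder you must supply, and the response shape used by `metrics_not_passing` (a `metrics` list with `id` and `status` fields) is an assumption for illustration.

```python
import json
import urllib.request

METHODOLOGY_PATH = "/v1/ai/methodology"  # from this page: public, no API key


def fetch_methodology(base_url: str, timeout: float = 10.0) -> dict:
    """GET the methodology report and decode the JSON body."""
    url = base_url.rstrip("/") + METHODOLOGY_PATH
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def metrics_not_passing(report: dict) -> list[str]:
    """Return the ids of metrics whose status is not "pass".

    Assumes a hypothetical shape: {"metrics": [{"id": ..., "status": ...}]}.
    """
    return [m["id"] for m in report.get("metrics", []) if m.get("status") != "pass"]
```

Keeping the network call separate from the check lets the check run against a saved copy of the report as well.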
Boundary of responsibility. PortfolIQ supplies factual and contextual data. It does not issue a halal verdict, an investment recommendation, a price target, or a valuation. Any compliance screening (e.g. AAOIFI) is performed by downstream consumers, not by PortfolIQ.
Not financial advice. Not a fatwa. Methodology disclosed. Signal only.
1. Models in use
| Role | Model | Share | Used for |
|---|---|---|---|
| Default | Claude Haiku 4.5 | ~85% | news summaries, sentiment, asset descriptions |
| Escalation | Claude Sonnet 4.6 | ~15% | factual context / event extraction (high-risk, downstream-feeding) |
Provider: Anthropic. Hosting region: EU. Model IDs are pinned; any upgrade follows the Model Upgrade Policy (explicit human decision, 7-day shadow run, non-regression gate — never a silent swap).
2. Prompt & methodology versions
Every AI payload carries model_id, prompt_version, and methodology_version
in its ai_metadata block. Prompt versions are tracked per asset_kind in the
prompt registry, with rotation (deprecated_at / replaced_by). Two analyses
produced under different prompt versions are therefore directly comparable over
time — the temporal moat applies to how the data was produced, not just what.
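To make the rotation fields concrete, the sketch below models a registry entry and follows `replaced_by` links to the prompt version that is currently live. The `PromptEntry` class and the `resolve_current` helper are hypothetical, as is keying the registry by version string. Only the field names `asset_kind`, `prompt_version`, `deprecated_at`, and `replaced_by` appear on this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PromptEntry:
    asset_kind: str
    prompt_version: str
    deprecated_at: Optional[datetime] = None  # set when the version is rotated out
    replaced_by: Optional[str] = None         # prompt_version of the successor


def resolve_current(registry: dict[str, PromptEntry], version: str) -> PromptEntry:
    """Follow replaced_by links until reaching an entry that is still live."""
    seen: set[str] = set()
    entry = registry[version]
    while entry.deprecated_at is not None and entry.replaced_by is not None:
        if entry.prompt_version in seen:
            raise ValueError("cycle in replaced_by chain")
        seen.add(entry.prompt_version)
        entry = registry[entry.replaced_by]
    return entry
```

A historical payload keeps the `prompt_version` it was produced with. Resolution of this kind is only needed when deciding which prompt to use for a new call.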
3. Evaluation metrics
Measured on a fixed annotated golden set, on a quarterly cadence and before any prompt-version promotion (non-regression gate). The current published figures:
| # | Metric | What it measures | Value | Target | Status |
|---|---|---|---|---|---|
| M1 | Hallucination rate (news / context) | atomic claims not supported by the source | pending first campaign | ≤ 2% / ≤ 1% | pending |
| M2 | Consistency (sentiment) | label stability on re-run (N=3) | pending first campaign | ≥ 90% | pending |
| M3 | Coverage | valid non-empty output ratio | 100% (seed) | ≥ 98% | pass |
| M4 | Recency (news) | median source age at enrichment | 1h (seed) | ≤ 48h | pass |
| M5 | Appropriate abstention | correct refusal when source is insufficient | 100% (seed) | ≥ 95% | pass |
| AMF | Forbidden-key guard | absence of verdict/fair-value/valuation keys | 0 breaches | 0 | pass |
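As an illustration of how two of the rows above could be computed, the sketch below derives a hallucination rate from per-claim judgements and a consistency score from repeated runs. The function names and input shapes are assumptions, since the evaluation harness itself is not published here. Consistency counts an item as stable only when every re-run yields the same label, which is one plausible reading of "label stability on re-run (N=3)".

```python
def hallucination_rate(claim_supported: list[bool]) -> float:
    """M1 sketch: share of atomic claims judged NOT supported by the source."""
    if not claim_supported:
        raise ValueError("no claims to score")
    return claim_supported.count(False) / len(claim_supported)


def consistency(runs_per_item: list[list[str]]) -> float:
    """M2 sketch: share of items whose label is identical across all re-runs."""
    if not runs_per_item:
        raise ValueError("no items to score")
    stable = sum(1 for labels in runs_per_item if len(set(labels)) == 1)
    return stable / len(runs_per_item)


def passes(value: float, target: float, higher_is_better: bool) -> bool:
    """Compare a measured value with its target from the table."""
    return value >= target if higher_is_better else value <= target
```

For example, `passes(consistency(runs), 0.90, higher_is_better=True)` would express the M2 gate, and `passes(rate, 0.02, higher_is_better=False)` the M1 gate for news.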
Honesty note. The current report has status: "seed". Deterministic metrics (M3/M4/M5/AMF) are measured on a 15-item seed set; the LLM-as-judge metrics (M1/M2) and the full 120-item campaign with human spot-check have not yet been run. We publish this state explicitly rather than imply measured factuality we do not yet have. The first full campaign will replace these figures and flip the status to measured.
Abstention is a feature. A system that says "insufficient signal" when it lacks corroborated sources is more trustworthy than one that always answers. M5 measures that this guard works; below a confidence threshold, a card shows "insufficient signal" instead of an invented narrative.
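The guard described above can be sketched as a simple threshold check, together with the ratio that M5 reports. The threshold value is not published on this page, so it is a required parameter here. `render_card` and `abstention_rate` are hypothetical names.

```python
INSUFFICIENT_SIGNAL = "insufficient signal"


def render_card(narrative: str, confidence: float, threshold: float) -> str:
    """Return the narrative only when confidence clears the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < threshold:
        return INSUFFICIENT_SIGNAL  # abstain instead of inventing a narrative
    return narrative


def abstention_rate(outputs_on_insufficient_sources: list[str]) -> float:
    """M5 sketch: share of insufficient-source test items where the system abstained."""
    if not outputs_on_insufficient_sources:
        raise ValueError("no items to score")
    abstained = sum(1 for o in outputs_on_insufficient_sources if o == INSUFFICIENT_SIGNAL)
    return abstained / len(outputs_on_insufficient_sources)
```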
4. Per-enrichment model cards
news_summary (Haiku 4.5 — high factual risk)
- Grounding: the source article only; all figures come from the source, never from model memory. source_url is mandatory.
- Known limitations: may omit context present elsewhere; summarises a single article, not the full news landscape.
- What is NOT guaranteed: completeness, neutrality of the underlying outlet, or that the event described is material.
- Refresh: perishable — re-generated on new news or after an elapsed window.
sentiment_score (Haiku 4.5 — medium factual risk)
- Grounding: the set of recent headlines/sources for the asset.
- Known limitations: a bounded 3-class directional signal (Positive / Neutral / Negative). confidence reflects inter-source coherence, not certainty about price direction.
- What is NOT guaranteed: a price prediction or trading signal; there is none.
- Refresh: perishable — on new sources or elapsed window.
fundamental_summary / asset descriptions (Haiku 4.5 — medium factual risk)
- Grounding: DB-injected fundamentals (sector, country, market cap, name).
- Known limitations: describes only the injected fields; will not invent a founding date, founder, or narrative not in the data.
- Refresh: stable — quarterly or on a material data change.
event_extraction / factual context (Sonnet 4.6 — high factual risk)
- Grounding: official and reliable-secondary sources, weighted by source tier.
- Known limitations: extracts only what the source states; abstains on speculation or single anonymous sources. Because it feeds downstream consumers, it is held to the stricter ≤ 1% hallucination target and undergoes a human spot-check.
- Refresh: on new qualifying source.
5. Confidence score semantics
The confidence exposed in every payload is a deterministic Python score in
[0, 1], not a Claude self-report. The formula depends on analysis_type
(source coverage × source agreement × recency for news; inter-source label
coherence × sample size for sentiment; source-tier weighting for events). It
expresses confidence in the enrichment, never in a market outcome.
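The factors named above can be combined as in the sketch below. It is an illustration, not the production formula. It assumes the news factors arrive already normalised to [0, 1], that label coherence is the share of the majority label, that the sample-size factor grows linearly up to a hypothetical `saturation` count, and that event confidence is the mean of per-source tier weights.

```python
from collections import Counter


def _clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def news_confidence(coverage: float, agreement: float, recency: float) -> float:
    """Product of three factors, each assumed pre-normalised to [0, 1]."""
    return _clamp01(coverage) * _clamp01(agreement) * _clamp01(recency)


def sentiment_confidence(labels: list[str], saturation: int = 10) -> float:
    """Majority-label share times a sample-size factor.

    saturation is a hypothetical source count at which the size factor reaches 1.
    """
    if not labels:
        return 0.0
    majority = Counter(labels).most_common(1)[0][1]
    coherence = majority / len(labels)
    size_factor = min(len(labels) / saturation, 1.0)
    return coherence * size_factor


def event_confidence(tier_weights: list[float]) -> float:
    """Mean of per-source tier weights, each assumed to lie in [0, 1]."""
    if not tier_weights:
        return 0.0
    return _clamp01(sum(tier_weights) / len(tier_weights))
```

Because every input is an observable property of the sources, the same inputs always yield the same score, which is the sense in which the score is deterministic rather than self-reported by the model.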
6. Refresh cadence
- Stable (descriptions, fundamentals): quarterly or on a material data change.
- Perishable (news, sentiment): regenerated only on a trigger (new news, price movement, or an elapsed window). While a valid cached analysis exists, no fresh model call is made.
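The cadence rules above reduce to one decision: serve the cached analysis or call the model again. The sketch below is one way to express it. The `needs_refresh` name, the single `trigger_fired` flag, and the idea of passing the elapsed window as a `timedelta` are assumptions; the actual window lengths are not stated on this page.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(
    cached_at: Optional[datetime],
    now: datetime,
    max_age: timedelta,
    trigger_fired: bool,
) -> bool:
    """Decide whether to call the model again or serve the cached analysis.

    trigger_fired stands for any trigger listed above (new news, price
    movement, material data change); max_age is the elapsed window.
    """
    if cached_at is None:
        return True  # nothing cached yet
    if trigger_fired:
        return True  # the cached analysis is no longer valid
    return now - cached_at >= max_age
```

A stable enrichment would pass a long window (roughly a quarter) and a perishable one a much shorter window; in both cases a still-valid cache short-circuits the model call.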
7. Disclaimer
All AI outputs carry, programmatically:
Not financial advice. Not a fatwa. Methodology disclosed. Signal only.
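One way to attach the disclaimer programmatically, combined with the forbidden-key guard from the evaluation table, is sketched below. The `disclaimer` field name and the exact forbidden key spellings are assumptions; this page only states that verdict, fair-value, and valuation keys must be absent.

```python
DISCLAIMER = "Not financial advice. Not a fatwa. Methodology disclosed. Signal only."

# Hypothetical spellings; the page names the concepts, not the exact keys.
FORBIDDEN_KEYS = frozenset({"verdict", "fair_value", "valuation"})


def find_forbidden_keys(payload, path: str = "") -> list[str]:
    """Recursively collect the paths of forbidden keys in nested dicts and lists."""
    hits: list[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else str(key)
            if key in FORBIDDEN_KEYS:
                hits.append(here)
            hits.extend(find_forbidden_keys(value, here))
    elif isinstance(payload, list):
        for i, item in enumerate(payload):
            hits.extend(find_forbidden_keys(item, f"{path}[{i}]"))
    return hits


def finalize(payload: dict) -> dict:
    """Reject forbidden keys, then return a copy that carries the disclaimer."""
    breaches = find_forbidden_keys(payload)
    if breaches:
        raise ValueError(f"forbidden keys present: {breaches}")
    return {**payload, "disclaimer": DISCLAIMER}
```

Running the guard at the last step before serialisation means a breach fails loudly instead of reaching a consumer.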