AI Model Cards & Evaluation

Models, prompt versions, measured evaluation metrics, and known limitations for every AI enrichment — the glass-box behind PortfolIQ's AI-enriched data.

PortfolIQ enriches financial data with AI. Unlike most "AI-enriched data" products, which are black boxes, we publish which model and prompt produced each value, how reliable it is, and how that reliability is measured. This is our Angle A (explainable, historised AI) made verifiable.

The machine-readable version of everything below is served at GET /v1/ai/methodology (public, no API key) so you can automate vendor due-diligence against us.
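As a minimal sketch of how a due-diligence script might pull that document: the path `/v1/ai/methodology` comes from the text above, while the base URL `https://api.example.com` and the helper names are placeholders, and the response is assumed (not confirmed) to be JSON.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical base URL: the documentation only gives the path.
BASE_URL = "https://api.example.com"


def methodology_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the public methodology endpoint."""
    return base_url.rstrip("/") + "/v1/ai/methodology"


def fetch_methodology(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """GET the methodology document. No API key header is sent,
    since the endpoint is described as public."""
    request = Request(methodology_url(base_url), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the body is a JSON object.
        return json.load(response)
```

Calling `fetch_methodology()` against the real host would return the parsed report. The response schema is not specified here, so a script should inspect the keys it receives before relying on them.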

Boundary of responsibility. PortfolIQ supplies factual and contextual data. It does not issue an investment recommendation, a price target, or a valuation. Interpretation is performed by downstream consumers, not by PortfolIQ.

Not financial advice. Methodology disclosed. Signal only.

1. Models in use

Role	Model	Share	Used for
Default	Claude Haiku 4.5	~85%	news summaries, sentiment, asset descriptions
Escalation	Claude Sonnet 4.6	~15%	factual context / event extraction (high-risk, downstream-feeding)

Provider: Anthropic. Hosting region: EU. Model IDs are pinned; any upgrade follows the Model Upgrade Policy (explicit human decision, 7-day shadow run, non-regression gate — never a silent swap).

2. Prompt & methodology versions

Every AI payload carries model_id, prompt_version, and methodology_version in its ai_metadata block. Prompt versions are tracked per asset_kind in the prompt registry, with rotation (deprecated_at / replaced_by). Because each analysis records the versions that produced it, two analyses made under different prompt versions can be told apart and compared over time with that difference in view. The temporal moat applies to how the data was produced, not just what.
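One way a consumer might use these fields is to group stored analyses by the versions that produced them. The sketch below assumes each payload is a dict with an `ai_metadata` block holding the three documented fields. The sample values (`"model-a"`, `"p1"`, `"m1"`) and the `value` key are invented for illustration.

```python
from collections import defaultdict


def group_by_versions(payloads):
    """Group payloads by (model_id, prompt_version, methodology_version)."""
    groups = defaultdict(list)
    for payload in payloads:
        meta = payload["ai_metadata"]
        key = (meta["model_id"], meta["prompt_version"], meta["methodology_version"])
        groups[key].append(payload)
    return dict(groups)


# Hypothetical payloads: only the ai_metadata field names come from the docs.
SAMPLE = [
    {"value": "a", "ai_metadata": {"model_id": "model-a", "prompt_version": "p1", "methodology_version": "m1"}},
    {"value": "b", "ai_metadata": {"model_id": "model-a", "prompt_version": "p1", "methodology_version": "m1"}},
    {"value": "c", "ai_metadata": {"model_id": "model-a", "prompt_version": "p2", "methodology_version": "m1"}},
]

GROUPS = group_by_versions(SAMPLE)
```

Here `GROUPS` has two keys, one per prompt version, so a time series can be split at the point where the prompt changed.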

3. Evaluation metrics

Measured on a fixed annotated golden set, on a quarterly cadence and before any prompt-version promotion (non-regression gate). The current published figures:

#	Metric	What it measures	Value	Target	Status
M1	Hallucination rate (news / context)	atomic claims not supported by the source	pending first campaign	≤ 2% (news) / ≤ 1% (context)	pending
M2	Consistency (sentiment)	label stability on re-run (N=3)	pending first campaign	≥ 90%	pending
M3	Coverage	valid non-empty output ratio	100% (seed)	≥ 98%	pass
M4	Recency (news)	median source age at enrichment	1h (seed)	≤ 48h	pass
M5	Appropriate abstention	correct refusal when source is insufficient	100% (seed)	≥ 95%	pass
AMF	Forbidden-key guard	absence of verdict/fair-value/valuation keys	0 breaches	0	pass

Honesty note. The current report has status "seed". Deterministic metrics (M3/M4/M5/AMF) are measured on a 15-item seed set; the LLM-as-judge metrics (M1/M2) and the full 120-item campaign with human spot-check have not yet been run. We publish this state explicitly rather than imply measured factuality we do not yet have. The first full campaign replaces these figures and flips the status to measured.
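The Status column in the table can be reproduced with a simple comparison of each value against its target. This is an illustrative sketch, not the actual gate. The `Metric` class and function names are hypothetical, `None` stands for "pending first campaign", and splitting M1 into a 2% news target and a 1% context target is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str
    value: Optional[float]  # None means not yet measured
    target: float
    higher_is_better: bool


def metric_status(metric: Metric) -> str:
    """Return "pending", "pass", or "fail" for one metric."""
    if metric.value is None:
        return "pending"
    if metric.higher_is_better:
        ok = metric.value >= metric.target
    else:
        ok = metric.value <= metric.target
    return "pass" if ok else "fail"


def failing_metrics(metrics) -> list:
    """Names of metrics that miss their target (pending ones are not failures)."""
    return [m.name for m in metrics if metric_status(m) == "fail"]


# Values and targets transcribed from the published seed table.
SEED_REPORT = [
    Metric("M1 news", None, 2.0, False),     # hallucination rate, %
    Metric("M1 context", None, 1.0, False),  # hallucination rate, %
    Metric("M2", None, 90.0, True),          # consistency, %
    Metric("M3", 100.0, 98.0, True),         # coverage, %
    Metric("M4", 1.0, 48.0, False),          # median source age, hours
    Metric("M5", 100.0, 95.0, True),         # appropriate abstention, %
    Metric("AMF", 0.0, 0.0, False),          # forbidden-key breaches
]

STATUSES = {m.name: metric_status(m) for m in SEED_REPORT}
```

Whether a pending metric should block a prompt-version promotion is not stated in this document. The sketch reports it separately and leaves that decision to the caller.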

Abstention is a feature. A system that says "insufficient signal" when it lacks corroborated sources is more trustworthy than one that always answers. M5 measures whether this guard works: below a confidence threshold, a card shows "insufficient signal" instead of an invented narrative.
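The guard described above can be pictured as a single threshold check. The threshold value is not published here, so `0.5` below is a placeholder, and the function name is hypothetical.

```python
INSUFFICIENT_SIGNAL = "insufficient signal"

# Hypothetical threshold: the documentation does not state the real value.
CONFIDENCE_THRESHOLD = 0.5


def card_text(narrative: str, confidence: float,
              threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the narrative only when confidence reaches the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < threshold:
        return INSUFFICIENT_SIGNAL
    return narrative
```

With these placeholder values, `card_text("Summary of the article.", 0.3)` yields the abstention text, while the same call with `0.8` returns the narrative unchanged.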

4. Per-enrichment model cards

`news_summary` (Haiku 4.5 — high factual risk)

Grounding: the source article only; all figures come from the source, never from model memory. source_url is mandatory.
Known limitations: may omit context present elsewhere; summarises a single article, not the full news landscape.
What is NOT guaranteed: completeness, neutrality of the underlying outlet, or that the event described is material.
Refresh: perishable — re-generated on new news or after an elapsed window.

`sentiment_score` (Haiku 4.5 — medium factual risk)

Grounding: the set of recent headlines/sources for the asset.
Known limitations: a bounded 3-class directional signal (Positive / Neutral / Negative). confidence reflects inter-source coherence, not certainty about price direction.
What is NOT guaranteed: a price prediction or trading signal — there is none.
Refresh: perishable — on new sources or elapsed window.

`fundamental_summary` / asset descriptions (Haiku 4.5 — medium factual risk)

Grounding: DB-injected fundamentals (sector, country, market cap, name).
Known limitations: describes only the injected fields; will not invent a founding date, founder, or narrative not in the data.
Refresh: stable — quarterly or on a material data change.

`event_extraction` / factual context (Sonnet 4.6 — high factual risk)

Grounding: official and reliable-secondary sources, weighted by source tier.
Known limitations: extracts only what the source states; abstains on speculation or single anonymous sources. Feeds downstream consumers, so it is held to the stricter ≤ 1% hallucination target and undergoes human spot-check.
Refresh: on new qualifying source.

5. Confidence score semantics

The confidence exposed in every payload is a deterministic Python score in [0, 1], not a Claude self-report. The formula depends on analysis_type (source coverage × source agreement × recency for news; inter-source label coherence × sample size for sentiment; source-tier weighting for events). It expresses confidence in the enrichment, never in a market outcome.
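To make the product structure concrete, here is a rough sketch of the news and sentiment formulas. Only the factor names come from the text above. How each factor is computed is an assumption: coherence is taken as the share of the majority label, and the sample-size factor saturates at a hypothetical `full_sample` of 10 sources. The event formula is omitted because the source-tier weights are not published.

```python
from collections import Counter


def _unit(x: float) -> float:
    """Clamp a factor to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def news_confidence(coverage: float, agreement: float, recency: float) -> float:
    """Source coverage x source agreement x recency, each assumed in [0, 1]."""
    return _unit(coverage) * _unit(agreement) * _unit(recency)


def sentiment_confidence(labels, full_sample: int = 10) -> float:
    """Inter-source label coherence x sample-size factor.

    Assumptions: coherence is the share of sources carrying the majority
    label; the size factor grows linearly and saturates at full_sample.
    """
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    coherence = majority_count / len(labels)
    size_factor = min(1.0, len(labels) / full_sample)
    return coherence * size_factor
```

Because every factor lies in [0, 1], the product also stays in [0, 1], and any single weak factor pulls the score down. That property is consistent with a score that reflects confidence in the enrichment rather than in a market outcome.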

6. Refresh cadence

Stable (descriptions, fundamentals): quarterly or on a material data change.
Perishable (news, sentiment): on a trigger (new news, price movement, or an elapsed window). A fresh model call is never made while a valid cached analysis exists.
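The cache-first rule can be sketched as a small decision function. The window lengths are assumptions: 24 hours for perishable data is a placeholder, and "quarterly" is approximated as 90 days. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class CachedAnalysis:
    generated_at: datetime
    kind: str  # "stable" or "perishable"


# Hypothetical windows: the documentation gives no exact durations.
PERISHABLE_WINDOW = timedelta(hours=24)
STABLE_WINDOW = timedelta(days=90)  # rough stand-in for "quarterly"


def needs_refresh(cached: Optional[CachedAnalysis], now: datetime,
                  trigger_fired: bool = False) -> bool:
    """Decide whether a new model call is warranted.

    trigger_fired covers new news or a price movement (perishable) and a
    material data change (stable). Without a trigger, a cached analysis
    stays valid until its window elapses.
    """
    if cached is None:
        return True
    window = PERISHABLE_WINDOW if cached.kind == "perishable" else STABLE_WINDOW
    return trigger_fired or (now - cached.generated_at) >= window


NOW = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
FRESH_NEWS = CachedAnalysis(NOW - timedelta(hours=2), "perishable")
OLD_NEWS = CachedAnalysis(NOW - timedelta(hours=30), "perishable")
DESCRIPTION = CachedAnalysis(NOW - timedelta(days=30), "stable")
```

Under these placeholder windows, `FRESH_NEWS` is served from cache unless a trigger fires, `OLD_NEWS` is regenerated, and `DESCRIPTION` is kept until a material data change or the quarterly window.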

7. Disclaimer

All AI outputs carry, programmatically:

Not financial advice. Methodology disclosed. Signal only.
