AI Model Cards & Evaluation

Models, prompt versions, measured evaluation metrics, and known limitations for every AI enrichment — the glass-box behind PortfolIQ's AI-enriched data.

PortfolIQ enriches financial data with AI. Unlike most "AI-enriched data" products, which are black boxes, we publish which model and prompt produced each value, how reliable it is, and how that reliability is measured. This is our Angle A (explainable, historised AI) made verifiable.

The machine-readable version of everything below is served at GET /v1/ai/methodology (public, no API key) so you can automate vendor due-diligence against us.
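As an illustration of automating a due-diligence check against this endpoint, here is a minimal Python sketch. Only the path GET /v1/ai/methodology and the absence of an API key come from this page. The base URL is a placeholder, and the response shape (a `metrics` list whose items have `id` and `status`) is an assumption made for the example, so check the real payload before relying on it.

```python
import json
import urllib.request

# Hypothetical base URL: substitute the real API host.
BASE_URL = "https://api.example.com"


def fetch_methodology(base_url: str = BASE_URL) -> dict:
    """GET /v1/ai/methodology. The endpoint is public, so no API key is sent."""
    url = f"{base_url}/v1/ai/methodology"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def metrics_not_passing(methodology: dict) -> list[str]:
    """Return the IDs of metrics whose status is anything other than "pass".

    Assumes a payload shaped like {"metrics": [{"id": ..., "status": ...}]};
    the real schema may differ.
    """
    return [
        metric["id"]
        for metric in methodology.get("metrics", [])
        if metric.get("status") != "pass"
    ]


# Offline example using two rows from the metrics table in section 3.
sample = {
    "metrics": [
        {"id": "M1", "status": "pending"},
        {"id": "M3", "status": "pass"},
    ]
}
print(metrics_not_passing(sample))
```

In a real check, `fetch_methodology()` would supply the dictionary instead of the hard-coded `sample`.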

Boundary of responsibility. PortfolIQ supplies factual and contextual data. It does not issue a halal verdict, an investment recommendation, a price target, or a valuation. Any compliance screening (e.g. AAOIFI) is performed by downstream consumers, not by PortfolIQ.

Not financial advice. Not a fatwa. Methodology disclosed. Signal only.


1. Models in use

| Role | Model | Share | Used for |
|---|---|---|---|
| Default | Claude Haiku 4.5 | ~85% | news summaries, sentiment, asset descriptions |
| Escalation | Claude Sonnet 4.6 | ~15% | factual context / event extraction (high-risk, downstream-feeding) |

Provider: Anthropic. Hosting region: EU. Model IDs are pinned; any upgrade follows the Model Upgrade Policy (explicit human decision, 7-day shadow run, non-regression gate — never a silent swap).


2. Prompt & methodology versions

Every AI payload carries model_id, prompt_version, and methodology_version in its ai_metadata block. Prompt versions are tracked per asset_kind in the prompt registry, with rotation recorded through deprecated_at and replaced_by. Because every analysis records the versions that produced it, two analyses produced under different prompt versions can be told apart and compared on a known basis over time: the temporal moat applies to how the data was produced, not just what.
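The rotation described above can be sketched as a small lookup that follows replaced_by links. The field names asset_kind, deprecated_at, and replaced_by come from this page; the `PromptVersion` class, the `resolve_active` helper, the in-memory dictionary, and the sample values ("equity", "v1", "v2") are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    asset_kind: str
    version: str
    deprecated_at: Optional[str] = None  # ISO date; None while the version is active
    replaced_by: Optional[str] = None    # version that superseded this one


Registry = dict[tuple[str, str], PromptVersion]


def resolve_active(registry: Registry, asset_kind: str, version: str) -> PromptVersion:
    """Follow replaced_by links until a non-deprecated version is reached."""
    seen: set[str] = set()
    entry = registry[(asset_kind, version)]
    while entry.deprecated_at is not None and entry.replaced_by is not None:
        if entry.version in seen:
            raise ValueError("cycle in replaced_by chain")
        seen.add(entry.version)
        entry = registry[(asset_kind, entry.replaced_by)]
    return entry


# Hypothetical registry content, for illustration only.
registry: Registry = {
    ("equity", "v1"): PromptVersion("equity", "v1", "2025-01-01", "v2"),
    ("equity", "v2"): PromptVersion("equity", "v2"),
}
print(resolve_active(registry, "equity", "v1").version)
```

An analysis stored with prompt_version "v1" keeps that value; the lookup only tells a reader which version has since replaced it.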


3. Evaluation metrics

Metrics are measured on a fixed, annotated golden set, on a quarterly cadence and before any prompt-version promotion (non-regression gate). The currently published figures are:

| # | Metric | What it measures | Value | Target | Status |
|---|---|---|---|---|---|
| M1 | Hallucination rate (news / context) | atomic claims not supported by the source | pending first campaign | ≤ 2% / ≤ 1% | pending |
| M2 | Consistency (sentiment) | label stability on re-run (N=3) | pending first campaign | ≥ 90% | pending |
| M3 | Coverage | valid non-empty output ratio | 100% (seed) | ≥ 98% | pass |
| M4 | Recency (news) | median source age at enrichment | 1h (seed) | ≤ 48h | pass |
| M5 | Appropriate abstention | correct refusal when source is insufficient | 100% (seed) | ≥ 95% | pass |
| AMF | Forbidden-key guard | absence of verdict/fair-value/valuation keys | 0 breaches | 0 | pass |
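The forbidden-key guard (AMF row) lends itself to a short illustration: a recursive scan of a payload for keys that must never appear. This is a sketch only. The exact key spellings in `FORBIDDEN_KEYS` and the helper name `find_forbidden_keys` are assumptions, not the production guard.

```python
# Assumed spellings of the forbidden keys; the real list may differ.
FORBIDDEN_KEYS = {"verdict", "fair_value", "valuation"}


def find_forbidden_keys(payload, path: str = "") -> list[str]:
    """Return the path of every forbidden key found anywhere in the payload."""
    breaches: list[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else str(key)
            if str(key).lower() in FORBIDDEN_KEYS:
                breaches.append(here)
            # Keep scanning below the key so nested breaches are reported too.
            breaches.extend(find_forbidden_keys(value, here))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            breaches.extend(find_forbidden_keys(item, f"{path}[{index}]"))
    return breaches


clean = {"summary": "text", "ai_metadata": {"model_id": "m"}}
dirty = {"items": [{"fair_value": 1}]}
print(find_forbidden_keys(clean))
print(find_forbidden_keys(dirty))
```

A result of zero breaches corresponds to the "0 breaches" value in the table.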

Honesty note. The current report has status "seed". The deterministic metrics (M3, M4, M5, AMF) are measured on a 15-item seed set; the LLM-as-judge metrics (M1, M2) and the full 120-item campaign with human spot-check have not yet been run. We publish this state explicitly rather than imply measured factuality we do not yet have. The first full campaign will replace these figures and flip the status to "measured".
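To make the relationship between Value, Target, and Status concrete, here is a hedged sketch of a threshold check. The targets are copied from the table in this section (percentages for M2, M3, M5; hours for M4). The `metric_status` helper and the "fail" outcome are assumptions: this page only shows "pending" and "pass".

```python
import operator
from typing import Optional

# Targets from the metrics table: (comparison, threshold).
# M2, M3, M5 are percentages; M4 is a median source age in hours.
TARGETS = {
    "M2": (operator.ge, 90.0),
    "M3": (operator.ge, 98.0),
    "M4": (operator.le, 48.0),
    "M5": (operator.ge, 95.0),
}


def metric_status(metric_id: str, value: Optional[float]) -> str:
    """Return "pending" when unmeasured, else "pass" or "fail" against the target."""
    if value is None:
        return "pending"
    compare, target = TARGETS[metric_id]
    return "pass" if compare(value, target) else "fail"


print(metric_status("M3", 100.0))  # seed value from the table
print(metric_status("M4", 1.0))    # seed value from the table
print(metric_status("M2", None))   # not yet measured
```

A non-regression gate could then block a prompt-version promotion whenever any metric returns "fail", though the actual gate logic is not described here.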

Abstention is a feature. A system that says "insufficient signal" when it lacks corroborated sources is more trustworthy than one that always answers. M5 measures that this guard works; below a confidence threshold, a card shows "insufficient signal" instead of an invented narrative.
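The abstention guard can be illustrated in a few lines. The "insufficient signal" text comes from this page; the threshold value of 0.5 and the `render_card` helper are hypothetical, since the actual threshold is not published here.

```python
INSUFFICIENT_SIGNAL = "insufficient signal"
CONFIDENCE_THRESHOLD = 0.5  # hypothetical value; the real threshold is not stated


def render_card(narrative: str, confidence: float,
                threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Show the narrative only when confidence meets the threshold."""
    if confidence < threshold or not narrative.strip():
        return INSUFFICIENT_SIGNAL
    return narrative


print(render_card("Company X reported quarterly results.", 0.8))
print(render_card("Company X reported quarterly results.", 0.2))
```

The second call abstains even though a narrative exists, which is the behaviour M5 is meant to verify.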


4. Per-enrichment model cards

news_summary (Haiku 4.5 — high factual risk)

  • Grounding: the source article only; all figures come from the source, never from model memory. source_url is mandatory.
  • Known limitations: may omit context present elsewhere; summarises a single article, not the full news landscape.
  • What is NOT guaranteed: completeness, neutrality of the underlying outlet, or that the event described is material.
  • Refresh: perishable — re-generated on new news or after an elapsed window.

sentiment_score (Haiku 4.5 — medium factual risk)

  • Grounding: the set of recent headlines/sources for the asset.
  • Known limitations: a bounded 3-class directional signal (Positive / Neutral / Negative). confidence reflects inter-source coherence, not certainty about price direction.
  • What is NOT guaranteed: a price prediction or trading signal — there is none.
  • Refresh: perishable — on new sources or elapsed window.

fundamental_summary / asset descriptions (Haiku 4.5 — medium factual risk)

  • Grounding: DB-injected fundamentals (sector, country, market cap, name).
  • Known limitations: describes only the injected fields; will not invent a founding date, founder, or narrative not in the data.
  • Refresh: stable — quarterly or on a material data change.

event_extraction / factual context (Sonnet 4.6 — high factual risk)

  • Grounding: official and reliable-secondary sources, weighted by source tier.
  • Known limitations: extracts only what the source states; abstains on speculation or single anonymous sources. Feeds downstream consumers, so it is held to the stricter ≤ 1% hallucination target and undergoes human spot-check.
  • Refresh: on new qualifying source.

5. Confidence score semantics

The confidence exposed in every payload is a deterministic Python score in [0, 1], not a Claude self-report. The formula depends on analysis_type (source coverage × source agreement × recency for news; inter-source label coherence × sample size for sentiment; source-tier weighting for events). It expresses confidence in the enrichment, never in a market outcome.
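As a rough illustration of a deterministic score in [0, 1], the sketch below multiplies the factors named above for news and sentiment. It assumes each factor is already normalised to [0, 1] and that sample size saturates at a hypothetical `full_sample` of 10. How the factors are computed in production is not specified here, and the event formula is omitted because its tier weights are not given.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def news_confidence(coverage: float, agreement: float, recency: float) -> float:
    """Source coverage x source agreement x recency, each assumed to be in [0, 1]."""
    return clamp01(coverage) * clamp01(agreement) * clamp01(recency)


def sentiment_confidence(coherence: float, sample_size: int,
                         full_sample: int = 10) -> float:
    """Inter-source label coherence x a sample-size factor.

    full_sample is a hypothetical saturation point: at or above it,
    the sample-size factor is 1.0.
    """
    size_factor = clamp01(sample_size / full_sample)
    return clamp01(coherence) * size_factor


print(news_confidence(1.0, 0.5, 0.5))
print(sentiment_confidence(0.8, 5))
```

Because the score is a plain function of its inputs, the same sources always yield the same confidence, unlike a model self-report.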


6. Refresh cadence

  • Stable (descriptions, fundamentals): quarterly or on a material data change.
  • Perishable (news, sentiment): regenerated only on a trigger (new news, price movement, or an elapsed window). A valid cached analysis is never replaced by a fresh model call without such a trigger.
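The two cadences above can be sketched as a single refresh decision. The trigger names follow the list; the durations (91 days standing in for "quarterly", a 24-hour perishable window) and the `CachedAnalysis` and `needs_refresh` names are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STABLE_MAX_AGE = timedelta(days=91)      # "quarterly", approximated
PERISHABLE_WINDOW = timedelta(hours=24)  # hypothetical elapsed window


@dataclass
class CachedAnalysis:
    kind: str  # "stable" or "perishable"
    generated_at: datetime


def needs_refresh(cached: Optional[CachedAnalysis], now: datetime, *,
                  data_changed: bool = False, new_news: bool = False,
                  price_moved: bool = False) -> bool:
    """Return True only when a trigger fires; otherwise the cache is reused."""
    if cached is None:
        return True  # nothing cached yet, so a model call is required
    age = now - cached.generated_at
    if cached.kind == "stable":
        return data_changed or age >= STABLE_MAX_AGE
    return new_news or price_moved or age >= PERISHABLE_WINDOW


now = datetime(2025, 6, 1, 12, 0)
fresh_news = CachedAnalysis("perishable", now - timedelta(hours=2))
print(needs_refresh(fresh_news, now))
print(needs_refresh(fresh_news, now, new_news=True))
```

With no trigger, the two-hour-old analysis is served from cache and no model call is made.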

7. Disclaimer

All AI outputs carry, programmatically:

Not financial advice. Not a fatwa. Methodology disclosed. Signal only.