AI in Insurance: Engineering the Underwriting and Claims Stack for 2026

April 26, 2026


Introduction

Insurance is the oldest risk-pricing industry on earth. It is also one of the slowest to digitise — for two decades, "digital transformation" in insurance largely meant turning paper into PDFs, faxes into email attachments, and filing cabinets into servers. The work itself stayed the same.

That is no longer true. In 2026, AI in insurance is no longer a deck topic or a corporate hackathon project — it is operational infrastructure. Underwriting cycles that ran in days now resolve in minutes. Claims that used to require human eyes for routine review are paid out automatically. And the regulators — FINMA, BaFin, the EU Commission, Colorado, New York — have moved from observing this shift to actively governing it.

This post is a technical walk-through of how a modern insurance AI stack actually fits together. We will cover the architecture, walk through code for the four highest-value workloads (submission triage, risk scoring, fraud detection, and policy Q&A), and look at the regulatory machinery that any production deployment in Europe or Switzerland has to satisfy. The goal is not abstract advocacy — it is to show what a defensible, auditable, performant insurance AI system looks like at the line-of-code level.

I work in Zurich, which is a useful vantage point for this topic. Switzerland hosts Zurich Insurance, Swiss Re, Swiss Life, Helvetia, Baloise, and the European footprints of AXA and Allianz. The country also runs a principle-based AI regime under FINMA Guidance 08/2024 that sits next to the rules-based EU AI Act. Most insurers I talk to in the DACH region are dealing with exactly this overlap — and the engineering decisions follow from it.


The 2026 Reality: Numbers That Matter

Before getting into architecture, it is worth grounding the discussion in what is actually happening at scale. The figures below are drawn from carrier earnings calls, McKinsey, hyperexponential, FactSet, Insurity, and the LexisNexis 2025 U.S. Auto Insurance Trends Report — all 2025 / 2026 sources.

  • Quote-to-bind cycles: Hiscox publicly reported a 99.4% reduction in quote cycle time on London Market specialty lines, compressing turnaround from three days to roughly three minutes. Benchmarks across commercial P&C carriers more broadly show reductions of 60–99%.
  • Loss ratios: Carriers using agentic underwriting platforms report 3–5 percentage-point loss-ratio improvements — material at scale, given that a one-point combined-ratio shift on a $5B book is $50M of underwriting profit.
  • Claims throughput: Travelers cited that over half of claims now qualify for straight-through processing — paid without human interaction. AIG reported 4× submission throughput with a 20% improvement in bind rate.
  • Operating leverage: Travelers consolidated claims operations from four centres to two and cut staffing 30%. McKinsey estimates that underwriters spend 30–40% of their time on administrative rekeying — a fraction that is being reclaimed by intake automation.
  • Adoption pace: 76% of US insurers had integrated generative AI into operations by the end of 2024. FINMA's own survey of 400+ Swiss financial-services firms shows roughly half are already using AI or actively developing applications, with 90%+ of adopters using generative AI specifically.
  • Customer acceptance: Insurity's 2026 AI in Insurance Report found that 39% of US consumers now actively support insurers using AI — nearly double the 20% figure from 2025. Resistance is collapsing on the routine end of the spectrum (quoting, status checks) but persists on autonomous decisions (claim approvals, policy cancellations).

The chart below captures the cycle-time compression that is driving most of this:

[Chart: Quote-to-bind compression across lines of business in 2026]

The compression is not uniform — it is most dramatic in lines where intake was historically dominated by unstructured submissions (PDFs, broker emails, ACORD forms). Personal auto was already partly automated, so the gain is smaller in absolute percentage terms but still real.

The straight-through processing trajectory is even more telling. The gap between top-quartile carriers and laggards has widened sharply over the past three years:

[Chart: Straight-through claims processing rates, 2022–2026]

This is the divergence that matters strategically. Carriers in the top quartile are not just faster — they are accumulating more loss data per unit of expense, which feeds back into better risk selection, which lowers loss ratios, which funds further AI investment. The flywheel is real and it is widening the gap between leaders and the rest of the market.


Where AI Actually Moves the Unit Economics

It helps to be specific about where AI changes the P&L of an insurer, because the answer is narrower than the marketing material suggests.

Underwriting

The underwriting bottleneck has always been intake. Brokers send submissions as PDFs, emails, spreadsheets, and ACORD forms in inconsistent formats. McKinsey's number — 30–40% of underwriter time on rekeying — has held remarkably steady for years. The job that LLMs do well here is reading: a thousand-page litigation file or a twenty-year medical history can be parsed and summarised in the time it takes to refill a coffee cup. Combined with structured risk-scoring models, this turns the underwriter from a data-entry role into an exception-handling role.

The McKinsey figure that matters is that only about 40% of submissions actually get underwritten in many commercial lines — the rest expire or get declined by default because no one had time to read them. AI does not change the pricing models that insurers spent centuries refining. It executes those models on submissions that previously never got priced at all. This is the single most underrated source of value in the entire AI-in-insurance conversation.

Claims

Claims is where AI most directly hits the customer experience. The FNOL (First Notice of Loss) call, the photo upload, the damage assessment, the coverage check, the payment — each of those steps used to involve at least one human handoff. In 2026, for routine personal lines, the entire chain runs end-to-end without a human for the simple cases, with a human-in-the-loop pattern for anything ambiguous. The economic effect is two-sided: lower LAE (loss adjustment expense) per claim and higher NPS, because customers genuinely prefer to upload three photos and get paid in twenty minutes rather than wait three weeks for an adjuster.

Fraud

Fraud detection is the workload where AI has been quietly transformative for the longest. Traditional fraud detection used static rules — claim amount thresholds, repeat claimant flags, suspicious provider lists. Modern stacks combine anomaly detection (isolation forests, autoencoders) with graph-based features that capture network patterns: the same garage appearing across unrelated claims, the same phone number behind multiple identities, sudden bursts of FNOL volume from a specific postcode. The fraud-detection lift is consistently above 30% across published benchmarks, and unlike underwriting it has been a clear positive-ROI workload since well before the GenAI wave.

Customer Experience and Distribution

The least technically interesting category, but operationally significant. AI-assisted customer service handles the long tail of policy questions, certificate-of-insurance requests, and coverage-detail look-ups. The interesting design choice here is the trust gradient — Insurity's 2026 survey shows 46% of consumers are comfortable with AI generating a quote, but only 22% are comfortable with AI filing a claim on their behalf. The implication for product design is clear: keep humans visibly in the loop for the high-trust steps.

The aggregate P&L footprint of these four workloads, deployed at scale, looks roughly like this:

[Chart: Illustrative percentage-point P&L impact of scaled AI deployment for a mid-size P&C carrier]

These are illustrative numbers for a mid-size P&C carrier — actuals vary by line of business and starting point — but the directional pattern is consistent across published benchmarks. The combined-ratio improvement of around ten points is what makes this a board-level conversation rather than an IT one.


Architecture: The Modern AI Insurance Stack

A production insurance AI stack in 2026 has a recognisable shape. The diagram below shows the layers and the data flow:

[Diagram: Reference architecture for a modern AI insurance stack]

Three things about this architecture are worth emphasising:

  • The governance layer is not optional. Under FINMA, the EU AI Act, and Solvency II, every decision-influencing model needs SHAP-style explanation, a versioned registry entry, drift monitoring, and an audit trail. If you build the decision layer first and bolt governance on later, you will rebuild it.
  • The extraction layer is where LLMs earn their cost. Risk scoring and pricing have used ML for decades — gradient boosting and GLMs do not need GenAI. What was new with LLMs is the ability to turn a messy PDF submission into structured features cheaply and accurately. That is where the marginal economics come from.
  • The action layer mediates trust. Auto-bind and auto-pay are the high-leverage capabilities, but they are also the ones consumers and regulators are most cautious about. Most production deployments today gate auto-action behind confidence thresholds and route the rest to human queues.
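Most production teams implement that gate as an explicit routing function in front of every auto-action. A minimal sketch of the pattern, assuming calibrated confidences — the thresholds, action names, and queue interface below are illustrative assumptions, not any particular platform's API:

"""
Confidence-gated action routing: act automatically only above a
calibrated-confidence threshold; queue everything else for a human.
All names and thresholds are illustrative.
"""
from dataclasses import dataclass
from queue import Queue

AUTO_ACTION_THRESHOLDS = {
    "auto_bind": 0.97,   # bind a quote without underwriter sign-off
    "auto_pay":  0.95,   # pay a claim straight through
}


@dataclass
class Decision:
    action: str          # e.g. "auto_bind", "auto_pay"
    confidence: float    # calibrated probability that the action is correct
    payload: dict


def execute(decision: Decision) -> str:
    """Placeholder for the real action handler (bind, pay, endorse, ...)."""
    return f"executed:{decision.action}"


def route(decision: Decision, review_queue: Queue) -> str:
    """Auto-act above the threshold; otherwise route to a human queue."""
    threshold = AUTO_ACTION_THRESHOLDS.get(decision.action)
    if threshold is None or decision.confidence < threshold:
        review_queue.put(decision)    # human-in-the-loop path
        return "queued_for_review"
    return execute(decision)          # auto-action path

The important property is the default: an unknown action or a missing threshold falls through to the human queue, never to auto-execution.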

The next sections walk through the core code patterns that sit inside this architecture.


Submission Triage with Multimodal Extraction

The first place AI enters the underwriting workflow is intake. A typical commercial submission arrives as a broker email with one or more PDF attachments — a SOV (statement of values), a loss run, an ACORD form. The job is to turn that mess into structured fields the rating engine can consume.

The pattern below uses Anthropic's Claude API with structured-output prompting. The same pattern works with any frontier model that accepts PDF input.

"""
Submission triage: broker email + PDF -> structured submission record.
"""
import base64
import json
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()

EXTRACTION_SCHEMA = {
    "named_insured": "Legal entity name on the submission",
    "naics_code":    "6-digit NAICS industry classification",
    "effective_date": "Policy effective date in ISO 8601",
    "expiration_date": "Policy expiration date in ISO 8601",
    "tiv":           "Total insured value in USD as a number",
    "locations": [
        {
            "address":    "Full street address",
            "occupancy":  "Building occupancy / use",
            "construction": "ISO construction class",
            "year_built": "Year the building was constructed",
            "sprinklered": "Boolean for sprinkler protection",
            "tiv_at_location": "TIV in USD at this location",
        }
    ],
    "loss_history": [
        {"date": "ISO 8601", "amount_paid": "USD", "cause_of_loss": "string"}
    ],
}


def extract_submission(pdf_path: Path) -> dict:
    """Parse a broker submission PDF into a structured record."""
    pdf_b64 = base64.standard_b64encode(pdf_path.read_bytes()).decode()

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        system=(
            "You are an underwriting assistant. Extract submission fields "
            "exactly as the schema specifies. If a field is not present in "
            "the document, return null for that field — do NOT infer. "
            "Return ONLY valid JSON, no commentary."
        ),
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Extract the submission into this schema:\n"
                            f"{json.dumps(EXTRACTION_SCHEMA, indent=2)}"
                        ),
                    },
                ],
            }
        ],
    )

    return json.loads(message.content[0].text)


def triage(submission: dict, appetite: dict) -> dict:
    """Apply rule-based appetite check before invoking the risk model."""
    reasons = []
    if submission["naics_code"] not in appetite["allowed_naics"]:
        reasons.append(f"NAICS {submission['naics_code']} out of appetite")
    if submission["tiv"] > appetite["max_tiv"]:
        reasons.append(f"TIV {submission['tiv']:,} exceeds {appetite['max_tiv']:,}")
    if any(loc["construction"] in appetite["excluded_construction"]
           for loc in submission.get("locations", [])):
        reasons.append("Excluded construction class present")

    return {
        "in_appetite": len(reasons) == 0,
        "decline_reasons": reasons,
        "submission": submission,
    }

A few things to call out:

  • The extraction step is deterministic, not creative. The system prompt tells the model to return null for missing fields rather than guess. This is non-negotiable in regulated workflows — hallucinated TIV numbers will end up in priced bound policies.
  • Rule-based appetite check runs first. Most submissions can be declined or accepted on simple rules before any ML scoring happens. Running the expensive model on every submission is a cost-engineering mistake people make in their first production deployment.
  • The schema lives in code, not in the prompt. The schema is checked into version control, reviewed by underwriting and legal, and changed through the same pull-request flow as anything else. This is what an auditable extraction pipeline looks like.

In production, this layer typically runs in roughly 3–8 seconds per submission and costs a few cents per call. At broker submission volumes — tens of thousands per month per major carrier — that is a rounding error against the underwriter time it replaces.
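One guardrail worth adding between extraction and rating is a validation pass over the returned record, in the same spirit as the no-guessing rule above. A minimal sketch, assuming the schema shown earlier — the 5% location-TIV tolerance and the review-flag convention are illustrative choices, not fixed rules:

"""
Post-extraction validation: flag records for human review before rating.
Tolerances and flag conventions are illustrative.
"""
REQUIRED_FIELDS = ["named_insured", "naics_code", "effective_date", "tiv"]


def validate_extraction(record: dict) -> dict:
    """Attach a review flag rather than raising, so triage can continue."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing:{field}")

    tiv = record.get("tiv")
    if tiv is not None and not isinstance(tiv, (int, float)):
        issues.append("tiv_not_numeric")

    # Cross-check: per-location TIVs should roughly sum to the total TIV
    loc_tiv = sum(
        loc.get("tiv_at_location") or 0
        for loc in record.get("locations") or []
    )
    if isinstance(tiv, (int, float)) and tiv and loc_tiv:
        if abs(loc_tiv - tiv) / tiv > 0.05:
            issues.append("tiv_location_mismatch")

    return {**record, "needs_human_review": bool(issues), "issues": issues}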


Risk Scoring with Gradient Boosting

Once you have a structured submission, the risk score is a classical ML problem. Gradient-boosted trees — XGBoost, LightGBM, CatBoost — remain the production-standard choice. They handle mixed numerical and categorical features, deal with missingness gracefully, and produce calibrated probability outputs that pricing engines can consume directly.

A clean, defensible training pipeline looks like this:

"""
Risk scoring model for commercial property submissions.
Target: probability of a claim above $25k in the first policy year.
"""
import numpy as np
import pandas as pd
import lightgbm as lgb
import mlflow
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, brier_score_loss, average_precision_score
)
from sklearn.isotonic import IsotonicRegression

# Feature contract: checked in with the model
NUMERIC_FEATURES = [
    "tiv", "year_built", "stories", "sq_ft",
    "distance_to_coast_km", "distance_to_fire_station_km",
    "prior_claim_count_3y", "prior_claim_paid_3y",
]
CATEGORICAL_FEATURES = [
    "naics_2digit", "construction_class", "occupancy_class",
    "sprinklered", "alarm_central_station",
]
TARGET = "claim_above_25k_year1"


def train_risk_model(df: pd.DataFrame, run_name: str) -> dict:
    """Train, calibrate, and log a risk scoring model."""
    X = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES].copy()
    for col in CATEGORICAL_FEATURES:
        X[col] = X[col].astype("category")
    y = df[TARGET].astype(int)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(y))
    models = []

    params = dict(
        objective="binary",
        learning_rate=0.03,
        num_leaves=63,
        min_data_in_leaf=200,         # reduces overfitting on rare segments
        feature_fraction=0.85,
        bagging_fraction=0.85,
        bagging_freq=5,
        lambda_l2=1.0,
        verbose=-1,
    )

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)

        for fold, (tr, va) in enumerate(cv.split(X, y)):
            model = lgb.LGBMClassifier(**params, n_estimators=2000)
            model.fit(
                X.iloc[tr], y.iloc[tr],
                eval_set=[(X.iloc[va], y.iloc[va])],
                categorical_feature=CATEGORICAL_FEATURES,
                callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)],
            )
            oof[va] = model.predict_proba(X.iloc[va])[:, 1]
            models.append(model)

        # Isotonic calibration on the out-of-fold predictions so the
        # probabilities are usable for pricing; fitting the calibrator on
        # in-fold data would leak and overstate calibration quality
        calibrator = IsotonicRegression(out_of_bounds="clip")
        calibrator.fit(oof, y)

        metrics = {
            "auc":  roc_auc_score(y, oof),
            "ap":   average_precision_score(y, oof),
            "brier": brier_score_loss(y, oof),
        }
        mlflow.log_metrics(metrics)
        mlflow.lightgbm.log_model(models[0], "model")

    return {"models": models, "calibrator": calibrator, "metrics": metrics}

A few points come up repeatedly when actuarial teams audit this kind of pipeline:

  • Calibration matters more than raw AUC. A pricing engine consumes the predicted probability as a number, not a ranking. An uncalibrated model with AUC of 0.82 can produce systematically biased prices; a calibrated model with AUC of 0.78 can produce loss-cost-aligned prices. Brier score and reliability diagrams are the right metrics to watch; a reliability-table sketch follows this list.
  • Five-fold cross-validation, not a single holdout. This is partly to get out-of-fold predictions for calibration, partly because insurance datasets often have temporal and geographic clustering that a single random split can hide.
  • Class imbalance is the default. Only a small fraction of policies generate the target event in a given year. min_data_in_leaf and class weights matter; is_unbalance flags or focal loss variants are common.
  • Feature contract is part of the artefact. The list of features is checked in with the model and treated as part of its public interface — this is what BaFin and FINMA examiners actually want to see when they ask for "model documentation."
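A reliability table built from the out-of-fold predictions is the cheapest way to check the calibration point above, and it is usually the first artefact a validator asks to see. A minimal sketch:

"""
Reliability table: predicted vs observed event rate per score decile.
A calibrated model shows mean_predicted ~ observed_rate in every bucket.
"""
import numpy as np
import pandas as pd


def reliability_table(y_true: np.ndarray, y_prob: np.ndarray,
                      buckets: int = 10) -> pd.DataFrame:
    df = pd.DataFrame({"y": y_true, "p": y_prob})
    df["bucket"] = pd.qcut(df["p"], q=buckets, duplicates="drop")
    out = df.groupby("bucket", observed=True).agg(
        n=("y", "size"),
        mean_predicted=("p", "mean"),
        observed_rate=("y", "mean"),
    ).reset_index()
    # A gap of more than a point or two of probability in a well-populated
    # bucket is a calibration problem regardless of what the AUC says
    out["gap"] = out["mean_predicted"] - out["observed_rate"]
    return out

Against the pipeline above, reliability_table(y.values, oof) gives exactly the per-decile view that reliability diagrams plot.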

Explainability with SHAP for Regulatory Compliance

Under the EU AI Act and FINMA Guidance 08/2024, an insurance carrier deploying a model in underwriting or claims must be able to explain individual decisions. The current production-standard tool for this is SHAP (SHapley Additive exPlanations), which decomposes a single prediction into per-feature contributions that sum to the model's output.

"""
Per-decision explanations for an underwriting model.
Produces both a regulator-grade decomposition and a customer-facing summary.
"""
import shap
import numpy as np


def explain_decision(model, calibrator, x_row, feature_names: list,
                     base_rate: float, top_k: int = 5) -> dict:
    """Return SHAP-based explanation for a single submission."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(x_row)
    if isinstance(shap_values, list):       # binary -> take positive class
        shap_values = shap_values[1]

    # Per-feature contributions, sorted by absolute impact
    contributions = sorted(
        zip(feature_names, x_row.values.flatten(), shap_values.flatten()),
        key=lambda t: abs(t[2]),
        reverse=True,
    )

    # Raw model probability, then mapped through the isotonic calibrator
    # that was fitted on out-of-fold predictions at training time
    score_raw = float(model.predict_proba(x_row)[0, 1])
    score_calibrated = float(calibrator.predict(np.array([score_raw]))[0])

    drivers = []
    for name, value, shap_val in contributions[:top_k]:
        drivers.append({
            "feature": name,
            "value": value,
            "shap": float(shap_val),
            "direction": "increases risk" if shap_val > 0 else "decreases risk",
        })

    # Customer-facing template: pre-approved by legal and compliance
    top_pos = next((d for d in drivers if d["shap"] > 0), None)
    top_neg = next((d for d in drivers if d["shap"] < 0), None)

    summary = (
        f"Your risk score of {score_calibrated:.1%} compares to a portfolio "
        f"average of {base_rate:.1%}. "
    )
    if top_pos:
        summary += (
            f"The largest factor increasing your score is "
            f"{top_pos['feature']} ({top_pos['value']}). "
        )
    if top_neg:
        summary += (
            f"The largest factor decreasing your score is "
            f"{top_neg['feature']} ({top_neg['value']})."
        )

    return {
        "score_raw":        score_raw,
        "score_calibrated": score_calibrated,
        "base_rate":        base_rate,
        "drivers":          drivers,
        "customer_summary": summary,
    }

The piece of this that is easy to underestimate is the customer-facing summary template. Under GDPR Article 22 and the equivalent provisions in the revised Swiss FADP, a data subject who is materially affected by an automated decision has a right to a meaningful explanation. A raw SHAP plot is not a meaningful explanation to a small-business owner — but a templated natural-language sentence built from the top SHAP contributors is. This is also exactly what FINMA examiners ask for in supervisory reviews: "Show me the explanation a customer would actually receive."


Fraud Detection with Isolation Forests and Graph Signals

Fraud has been an ML problem in insurance for at least two decades, and the underlying tooling is mature. Isolation forests remain a strong baseline for unsupervised anomaly detection on structured claim features, and the modern twist is to add graph-derived features that capture network effects.

"""
Hybrid fraud scoring: anomaly detection on tabular features + graph signals.
"""
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.ensemble import IsolationForest


def build_claim_graph(claims: pd.DataFrame) -> nx.Graph:
    """Build a bipartite-ish graph linking claimants to shared entities."""
    G = nx.Graph()
    for _, row in claims.iterrows():
        claim_id = f"claim:{row.claim_id}"
        G.add_node(claim_id, node_type="claim")

        for entity_field in ("claimant_phone", "garage_id",
                             "attorney_id", "provider_id", "vehicle_vin"):
            entity = row.get(entity_field)
            if pd.notna(entity):
                node = f"{entity_field}:{entity}"
                G.add_node(node, node_type=entity_field)
                G.add_edge(claim_id, node)
    return G


def graph_features(G: nx.Graph, claim_ids: list) -> pd.DataFrame:
    """Extract per-claim graph features."""
    rows = []
    for cid in claim_ids:
        node = f"claim:{cid}"
        if node not in G:
            rows.append({"claim_id": cid, "shared_entity_count": 0,
                         "max_entity_degree": 0, "component_size": 1})
            continue

        neighbors = list(G.neighbors(node))
        max_degree = max((G.degree(n) for n in neighbors), default=0)
        component = nx.node_connected_component(G, node)

        rows.append({
            "claim_id": cid,
            "shared_entity_count": len(neighbors),
            # high if a phone/garage/attorney is linked to many other claims
            "max_entity_degree": max_degree,
            "component_size": len(component),
        })
    return pd.DataFrame(rows)


def score_fraud(claims: pd.DataFrame) -> pd.DataFrame:
    """Combine tabular anomaly score with graph-derived signals."""
    tabular_features = [
        "claim_amount", "days_to_report", "prior_claim_count",
        "policy_age_days", "loss_to_premium_ratio",
    ]

    iso = IsolationForest(
        n_estimators=300,
        contamination=0.02,        # expect ~2% truly anomalous
        random_state=42,
    )
    iso.fit(claims[tabular_features])

    # Higher = more anomalous
    claims["anomaly_score"] = -iso.score_samples(claims[tabular_features])

    G = build_claim_graph(claims)
    g_feat = graph_features(G, claims["claim_id"].tolist())
    claims = claims.merge(g_feat, on="claim_id", how="left")

    # Combined score: anomaly score boosted by network exposure
    claims["fraud_score"] = (
        0.6 * claims["anomaly_score"] +
        0.25 * np.log1p(claims["max_entity_degree"]) +
        0.15 * np.log1p(claims["component_size"])
    )
    claims["fraud_score"] = (
        (claims["fraud_score"] - claims["fraud_score"].min()) /
        (claims["fraud_score"].max() - claims["fraud_score"].min())
    )
    return claims

Two design choices in this code matter operationally:

  • The graph is updated incrementally. In production you do not rebuild it from scratch on every claim — you append nodes and edges as new claims arrive and run network-feature recomputation in a streaming job. The static rebuild here is for clarity.
  • The output is a score, not a decision. Fraud models do not directly deny claims — they route them to a Special Investigations Unit (SIU) queue. The threshold above which a claim hits the queue is set against SIU capacity, not against any abstract "right" cutoff.
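Setting that threshold from capacity is a one-liner once the scores exist. A sketch, with illustrative volumes:

"""
SIU referral threshold derived from queue capacity, not an abstract cutoff.
Volumes below are illustrative.
"""
import numpy as np


def siu_threshold(fraud_scores: np.ndarray,
                  daily_claim_volume: int,
                  siu_daily_capacity: int) -> float:
    """Pick the cutoff so expected referrals match what the SIU can work."""
    referral_rate = min(siu_daily_capacity / daily_claim_volume, 1.0)
    # Refer the top `referral_rate` fraction of claims by fraud score
    return float(np.quantile(fraud_scores, 1.0 - referral_rate))


# Example: 4,000 claims/day and capacity for 80 referrals -> refer the top 2%
# cutoff = siu_threshold(scored["fraud_score"].values, 4_000, 80)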

Across published carrier benchmarks the lift from this kind of hybrid approach is consistently above 30% versus rules-only fraud detection — and the false-positive rate, which is what actually drives SIU productivity, drops materially.


RAG for Policy and Coverage Questions

The fastest-growing GenAI use case inside insurers in 2025–2026 has been retrieval-augmented generation over policy documents and internal underwriting guidelines. The reason is structural: policies are dense, layered, and full of exceptions that nobody can hold in working memory. An adjuster trying to answer "is mould covered under this homeowner's policy if it follows from a covered water-damage event?" used to have to read the policy. Now they can ask.

The minimum-viable pattern looks like this:

"""
RAG over policy documents. The hard part is not the LLM call -
it's the chunking, the citation discipline, and the refusal behaviour.
"""
from anthropic import Anthropic
import chromadb
from chromadb.utils import embedding_functions

client = Anthropic()
chroma = chromadb.PersistentClient(path="/var/data/policy-rag")

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

policies = chroma.get_or_create_collection(
    name="policies", embedding_function=embedder
)


def index_policy(policy_id: str, sections: list[dict]) -> None:
    """Index policy sections with rich metadata for citation-ready retrieval."""
    policies.add(
        ids=[f"{policy_id}:{s['section_id']}" for s in sections],
        documents=[s["text"] for s in sections],
        metadatas=[
            {
                "policy_id": policy_id,
                "section_id": s["section_id"],
                "section_title": s["title"],
                "form_number":  s["form_number"],
                "page": s["page"],
            }
            for s in sections
        ],
    )


def answer_coverage_question(policy_id: str, question: str, k: int = 6) -> dict:
    """Answer a coverage question grounded in retrieved policy sections."""
    results = policies.query(
        query_texts=[question],
        n_results=k,
        where={"policy_id": policy_id},
    )

    context_blocks = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_blocks.append(
            f"[Section {meta['section_id']} · {meta['section_title']} · "
            f"Form {meta['form_number']} · p.{meta['page']}]\n{doc}"
        )
    context = "\n\n---\n\n".join(context_blocks)

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=(
            "You are an insurance coverage analyst. Answer ONLY using the "
            "policy sections provided. Cite the section ID for every factual "
            "claim, in the format [Section X.Y]. If the policy sections do "
            "not contain the answer, reply: 'The provided policy sections do "
            "not address this question. Refer the inquiry to a licensed "
            "claims adjuster.' Do not invent coverage that is not stated."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Policy sections:\n\n{context}\n\n"
                f"Question: {question}"
            ),
        }],
    )

    return {
        "question": question,
        "answer":   response.content[0].text,
        "sources":  results["metadatas"][0],
    }

The thing to internalise is that the LLM call is the easy part. The hard parts are:

  • Chunking that respects the document. A homeowner's policy has named perils, exclusions, conditions, endorsements, and definitions — and the meaning of any single sentence often depends on a definition four sections away. Chunking by paragraph alone destroys this. Production systems use document-structure-aware chunking that preserves cross-references; a sketch follows this list.
  • Citation discipline. The system prompt above mandates section-level citations and explicit refusal when the retrieved context is silent. This is what turns a chatbot into something a compliance officer will sign off on. If you make the model answer anyway, you have just built a liability machine.
  • Refusal as a feature. The "refer to a licensed adjuster" branch is critical. Insurance coverage is a regulated activity; an AI system that confidently answers "yes, that's covered" when it should not is worse than one that says "I don't know."
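A sketch of the structure-aware part, on the assumption that the policy has already been split into sections and that defined terms appear in quotes — both simplifications; real forms need a proper parser:

"""
Structure-aware chunking: one chunk per policy section, with the
definitions of any defined terms the section uses attached inline.
Assumes sections are pre-split and defined terms appear in quotes.
"""
import re


def chunk_policy(sections: list[dict],
                 definitions: dict[str, str]) -> list[dict]:
    """sections: [{'section_id', 'title', 'form_number', 'page', 'text'}];
    definitions: defined term -> definition text."""
    chunks = []
    for s in sections:
        # Policy forms conventionally quote defined terms, e.g. "water damage";
        # attach each referenced definition so the chunk is self-contained
        used = [t for t in definitions
                if re.search(rf'"{re.escape(t)}"', s["text"], re.IGNORECASE)]
        defs_block = "\n".join(
            f'Definition "{t}": {definitions[t]}' for t in used
        )
        chunks.append({
            **{k: s[k] for k in ("section_id", "title", "form_number", "page")},
            "text": s["text"] + (("\n\n" + defs_block) if defs_block else ""),
        })
    return chunks

The output plugs directly into index_policy above.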

Drift Monitoring in Production

A model trained in 2024 on 2022 data is a different model in 2026, even if the weights have not changed. Distributions shift, claimants behave differently, weather patterns evolve, and the world the model was trained on stops existing. FINMA Guidance 08/2024 is explicit about this: continuous monitoring is part of governance, not a nice-to-have.

The two metrics that do most of the work are PSI (Population Stability Index) for feature drift and KS distance for output drift.

"""
Drift monitoring for production scoring models.
Run nightly against the previous day's scoring traffic vs the training set.
"""
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               buckets: int = 10) -> float:
    """PSI between two distributions. <0.1 stable, 0.1-0.25 minor, >0.25 alert."""
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf

    expected_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pct   = np.histogram(actual,   breakpoints)[0] / len(actual)

    # Smooth zero buckets to avoid log(0)
    expected_pct = np.where(expected_pct == 0, 1e-6, expected_pct)
    actual_pct   = np.where(actual_pct   == 0, 1e-6, actual_pct)

    return float(np.sum((actual_pct - expected_pct) *
                        np.log(actual_pct / expected_pct)))


def drift_report(training_df: pd.DataFrame, prod_df: pd.DataFrame,
                 feature_cols: list, score_col: str = "score") -> pd.DataFrame:
    """Per-feature PSI plus output KS test."""
    rows = []
    for col in feature_cols:
        if pd.api.types.is_numeric_dtype(training_df[col]):
            psi = population_stability_index(
                training_df[col].dropna().values,
                prod_df[col].dropna().values,
            )
        else:
            train_dist = training_df[col].value_counts(normalize=True)
            prod_dist  = prod_df[col].value_counts(normalize=True)
            common = train_dist.index.union(prod_dist.index)
            train_dist = train_dist.reindex(common, fill_value=1e-6)
            prod_dist  = prod_dist.reindex(common,  fill_value=1e-6)
            psi = float(np.sum((prod_dist - train_dist) *
                               np.log(prod_dist / train_dist)))

        rows.append({
            "feature": col,
            "psi":     psi,
            "status":  "stable"  if psi < 0.10
                       else "minor"   if psi < 0.25
                       else "alert",
        })

    # Output drift on the score itself
    ks_stat, ks_p = ks_2samp(training_df[score_col], prod_df[score_col])
    rows.append({
        "feature": "[output score]",
        "psi":     float(ks_stat),     # KS stat instead of PSI for the score
        "status":  "stable" if ks_p > 0.05 else "alert",
    })

    return pd.DataFrame(rows).sort_values("psi", ascending=False)

In practice you wire this into an Airflow / Prefect job that runs nightly, writes the report to your data warehouse, and pages the model owner if any feature crosses the alert threshold for two consecutive days. The two-day rule matters — single-day spikes are usually data-pipeline issues, not real drift.
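The two-day rule itself is a few lines on top of the persisted reports. A sketch, assuming each nightly drift_report is stored with a run_date column — the storage layout is an assumption:

"""
Two-consecutive-day alert rule over persisted nightly drift reports.
Assumes reports carry columns [run_date, feature, psi, status].
"""
import pandas as pd


def features_to_page(reports: pd.DataFrame) -> list[str]:
    """Return features whose status was 'alert' on both of the last two runs."""
    last_two = sorted(reports["run_date"].unique())[-2:]
    if len(last_two) < 2:
        return []                      # not enough history yet
    alerts_by_day = [
        set(reports.loc[(reports["run_date"] == day) &
                        (reports["status"] == "alert"), "feature"])
        for day in last_two
    ]
    # Single-day spikes are usually pipeline issues; page on the intersection
    return sorted(alerts_by_day[0] & alerts_by_day[1])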


The Swiss and European Regulatory Layer

This is the section most engineering posts about insurance AI skip, and it is the one that determines whether your system actually goes to production in this region.

FINMA Guidance 08/2024

Switzerland does not have an AI-specific law in force in 2026, but FINMA — the financial market supervisor that oversees Swiss insurers — issued binding guidance in December 2024 that sets out exactly what it expects. The guidance is principle-based rather than prescriptive, but the principles are consequential:

  • Centralised AI inventory. Every AI application a regulated insurer runs has to be in a single inventory, classified by materiality and risk likelihood. "We have a model somewhere" is not an acceptable answer. (A sketch of an inventory entry follows this list.)
  • Named ownership. Every model in a regulated decision must have a named owner accountable for performance, documentation, and remediation. Anonymous ownership is a recurring finding in FINMA reviews.
  • Independent validation. High-impact models — anything driving underwriting, pricing, claims settlement, or reserving — need validation by a function independent of the model's developers, with the same rigour applied to actuarial models.
  • Continuous monitoring. Drift monitoring, performance monitoring, and bias monitoring run continuously, not at annual review.
  • Explainability of outcomes. Customers and regulators must be able to receive a meaningful explanation of decisions that materially affect them.

The recent FINMA survey of 400+ Swiss financial-services firms reveals the gap that is currently driving supervisory attention: about 50% of institutions are using AI, 90%+ of those use generative AI, but only about 50% have an explicit AI strategy, and most have governance structures that focus on data protection or cybersecurity rather than on algorithmic risks like explainability and bias. That gap is what FINMA examiners are probing in 2026 reviews.

EU AI Act: High-Risk Classification for Insurance

The EU AI Act entered into force on 1 August 2024. Most of its substantive obligations apply from 2 August 2026. For insurance, the relevant categorisation is direct and unambiguous: AI systems used to "evaluate the creditworthiness or establish the credit score of natural persons" and AI systems used "to assess and price risk in relation to life and health insurance" are listed in Annex III as high-risk. By August 2026, that classification carries the following obligations:

  • Risk management system: A continuous, documented risk management process across the AI system lifecycle.
  • Data governance: Training, validation, and testing datasets must be relevant, sufficiently representative, and free of errors as far as possible.
  • Technical documentation: The "model card" requirements are formalised — capabilities, limitations, intended use, training data summaries.
  • Record-keeping / automatic logging: Every decision must be logged for traceability. (A sketch of such a log record follows this list.)
  • Transparency: Users (i.e. underwriters, adjusters) must be able to interpret the system's output and use it appropriately.
  • Human oversight: Effective oversight must be designed into the system, including the ability to intervene or override.
  • Accuracy, robustness, cybersecurity: Tested and documented.
  • Conformity assessment: Either a self-assessment or, in some cases, a notified-body assessment before the system is placed on the market.

The provider/deployer distinction matters operationally. If a Swiss insurer takes a vendor model and substantially retrains it on its own portfolio, it likely crosses the line from "deployer" to "provider" — and the heavier provider obligations (Articles 9–19) attach. This is becoming a major architectural decision in 2026 — many carriers are deliberately keeping their use of vendor models within the deployer envelope to avoid the provider compliance burden.

The AI literacy requirement (Article 4) has been in force since February 2025. Underwriters, claims adjusters, compliance officers, and senior managers need to understand how their AI systems work — capabilities, limitations, and risks. There is no formal exam, but regulators expect evidence of training.

Solvency II Model Risk Management

For European insurers, the Solvency II governance framework — specifically Article 41 on the System of Governance and the ORSA (Own Risk and Solvency Assessment) process — interacts with the AI Act in a way that surprises many engineering teams the first time they encounter it. The TL;DR for the engineering side:

  • AI model risk has to be in the ORSA. Concentration risk on a single vendor model, data dependency risk, and model drift risk are explicit topics.
  • You must be able to reconstruct any historical decision. Model versioning (MLflow or equivalent) and dataset versioning are not best practice — they are an effective regulatory requirement, because BaFin and FINMA may ask for retrospective analysis months after the fact.
  • Independent validation is mandatory. The model owner cannot also be the model validator.

GDPR / FADP Article 22

GDPR Article 22 — and the equivalent provision in the revised Swiss FADP (in force since September 2023) — gives data subjects the right not to be subject to decisions based solely on automated processing that produce legal effects or similarly significant effects, with limited exceptions. For insurance this means:

  • A pure auto-decline must be reviewable by a human on request. The architecture has to support an "appeal to a human" path.
  • The data subject has a right to a meaningful explanation. This is the customer-facing summary in the SHAP code above.
  • Sensitive categories of data: Health, biometric, and similar special categories require an explicit legal basis for processing.

The single most common architectural mistake I see in DACH insurers is treating these regulatory requirements as a final-stage checklist rather than as architectural constraints. By the time a model is in UAT, retrofitting versioning, drift monitoring, SHAP explanations, and a human-appeal path is expensive. Designing the governance layer first — the right side of the architecture diagram — is cheaper and faster to get to production.


Operating Model: What Changes Beyond the Code

The Grant Thornton 2026 AI Impact Survey finds a number that is easy to dismiss but should not be: only 7% of insurance leaders believe their workforce is fully ready to adopt AI, and 39% of insurance respondents say frontline employees need the most support to adopt AI-enabled ways of working. AI is the easy part; rewiring how 5,000 underwriters and adjusters do their jobs is the hard part.

The shifts that matter operationally:

  • Underwriters become exception handlers. Their day stops being "read the submission and decide" and becomes "review the model's recommendation, focus on the cases it flagged as ambiguous, and own the relationship-led complex risks." This is a different job. Comp structures, performance metrics, and quality assurance processes all need to follow.
  • Claims adjusters supervise an automated pipeline. Straight-through processing handles the routine; the adjuster's role is the long tail of complex, contested, or coverage-ambiguous claims, plus quality review on a sampled fraction of the auto-paid cohort. The adjuster who previously closed twenty simple claims a day now closes five hard ones and audits two hundred. That requires a different skill set — closer to a senior loss-adjustment specialist than to a generalist desk adjuster — and most carriers do not have enough of those people.
  • Actuarial pricing becomes a layered exercise. Traditional GLM rate filings still anchor the regulated rate plan, but an ML overlay re-prices at the segment level inside the filed envelope. This creates a working relationship between actuaries and ML engineers that simply did not exist before. The carriers that have made this work treat the two functions as joint owners of the pricing system rather than as a handoff.
  • Model risk management becomes a first-class function. What used to be an annual actuarial validation process becomes a continuous, instrumented, alerted discipline. The MRM team now sits between data science and compliance, runs the model registry, owns the validation playbook, and is the regulator's first point of contact. In most Swiss insurers I have seen, this function did not exist as a distinct unit two years ago.
  • The "AI Center of Excellence" pattern is dying. Centralised AI teams that built models in a sandbox and tossed them over the wall to business units produced a long tail of unowned models that never made it to production. The pattern that is replacing it is federated: ML engineers embedded in lines of business, with a thin central team owning shared infrastructure (the model registry, the feature store, the governance tooling). Centralisation of capability, not centralisation of delivery.

The pattern across all of these is the same — the bottleneck moves from the algorithm to the organisation. The technology is mostly off-the-shelf in 2026; the differentiator is whether the carrier has the change-management muscle to retrain its workforce, restructure its incentive system, and rewrite its operating procedures fast enough to capture the value the technology makes available. The boards that I see asking the right questions are the ones asking about the operating model, not the model architecture.


What's Next: Agentic AI for Insurance

The next wave — already visible in 2026 pilots — is agentic AI: systems that do not just answer a question or score a row, but execute multi-step workflows by calling tools, querying internal systems, and producing artefacts that other systems consume. In insurance terms, this looks like:

  • Agentic submission triage that does not just extract fields but actively reaches back to the broker for missing information, pulls third-party data (SAYHA, Verisk perils, MVR), runs the appetite check, drafts the quote letter, and routes the final package to an underwriter for sign-off.
  • Agentic FNOL that, given a customer's first-notice call, schedules the inspection, dispatches the rental car, queues the photo upload prompt, opens the reserve, and pings the right team if any single step fails.
  • Agentic SIU investigations that pull police reports from public records, request the medical narrative from the provider, run the network-graph query against the carrier's own claim history, and produce a structured investigation pack that a human investigator can confirm or contest in minutes rather than hours.
  • Agentic policy servicing that handles certificate-of-insurance issuance, mid-term endorsements, and audit-premium adjustments end-to-end, with the policyholder confirming each step.

The architectural implications are not subtle. Agentic systems write more than they read, which makes the safety properties of the system fundamentally different from those of a chatbot:

  • Tool sandboxing has to be real. An agent that can execute a policy endorsement is an agent that can mis-execute a policy endorsement. The set of writeable tools available to an agent has to be enumerated, rate-limited, and observability-instrumented in ways that pure read-side RAG does not need.
  • Write actions need staged authorisation. The pattern that is emerging is "agent proposes, human ratifies for material actions." Reserves above a threshold, endorsements that change premium materially, and any cancellation or rescission action stay human-approved even when the surrounding workflow is automated. (A sketch of this gate follows the list.)
  • Memory and state become regulatory artefacts. Under the EU AI Act's logging requirement, the agent's full reasoning trace — the tools it called, the data it retrieved, the intermediate outputs — has to be reconstructable for any decision the regulator might revisit. This is heavier than logging a single model prediction.
  • Hallucination tolerance is much lower than in chat. A chatbot that occasionally says something slightly wrong is annoying. An agent that occasionally pays a $50,000 claim to the wrong account is a different problem entirely. The engineering bar for grounding, refusal, and verification is materially higher.
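A sketch of the "agent proposes, human ratifies" gate referenced above; the action names, thresholds, and audit-record shape are illustrative assumptions:

"""
Write gate for agentic workflows: material actions need a human to ratify.
Action names, thresholds, and the audit shape are illustrative.
"""
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Actions that always need a human, regardless of amount
ALWAYS_HUMAN = {"cancel_policy", "rescind_policy"}
# Amount thresholds above which a proposal needs ratification
MATERIALITY = {"set_reserve": 25_000, "endorse_premium_change": 5_000}


@dataclass
class Proposal:
    action: str
    amount: float
    reasoning_trace: list[str]      # tool calls and intermediate outputs,
                                    # persisted to satisfy AI Act logging
    status: str = "proposed"
    audit: list = field(default_factory=list)


def gate(proposal: Proposal) -> Proposal:
    """Auto-approve immaterial writes; route material ones to a human."""
    material = (
        proposal.action in ALWAYS_HUMAN
        # Unknown actions default to material: fail closed, not open
        or proposal.amount >= MATERIALITY.get(proposal.action, 0.0)
    )
    proposal.status = "awaiting_human" if material else "auto_approved"
    proposal.audit.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "decision": proposal.status,
        "trace_length": len(proposal.reasoning_trace),
    })
    return proposal

Anything the gate does not recognise goes to a human, which is the property supervisors probe first.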

The honest read on agentic insurance AI in 2026 is that the marketing is significantly ahead of the production reality. The capabilities are real, the pilots are working, but the carriers that are scaling agents into bound-policy or paid-claim workflows are doing so behind heavy guardrails — confidence thresholds, write gates, and human-in-the-loop checkpoints at every material action. The carriers that try to skip those steps will provide the cautionary tales that shape the next round of supervisory guidance.


Conclusion

The interesting story about AI in insurance in 2026 is not the headline numbers — though the headline numbers are large. It is the structural shift in what an insurance company is. The pricing models are not new. The risk theory is not new. What is new is that the friction between a customer's intent to be insured and a bound, priced, paid policy has collapsed by an order of magnitude — and the carriers that are capturing that compression are the ones treating governance, drift, calibration, and explainability as load-bearing engineering, not as compliance overhead.

For an engineer building these systems, the work is unglamorous in the best way. The interesting code is the schema-locked extraction prompt, the calibrated probability, the SHAP explanation rendered into a sentence a small-business owner can read, the PSI threshold that pages someone at 03:00, the refusal branch that says "refer to a licensed adjuster." None of those will go on a marketing slide. All of them are what stands between a model that ships and a model that gets pulled back six months later.

For a Swiss or European carrier, the regulatory layer is not an obstacle — it is a forcing function for good engineering. Principle-based regimes like FINMA's reward thoughtful design; rules-based regimes like the EU AI Act punish the absence of it. Either way, the same versioned registry, drift monitor, audit log, and explanation surface are required. Building them once, properly, into the architecture is faster than retrofitting them later under regulatory pressure.

The carriers that win the next decade will be the ones whose model risk management function, actuarial function, and software engineering function speak the same language. The ones that don't will keep producing impressive prototypes that never make it through validation. The technology stopped being the bottleneck a year ago. The discipline of operating it is the new bottleneck — and that is a much more interesting problem to work on.