Benchmarking RAG Over Cyber Threat Intelligence: Where Retrieval Breaks

CTIConnect, a KDD 2026 benchmark of 1,860 QA pairs across five CTI feeds, shows that retrieval quality, not model choice, determines copilot accuracy across ten LLMs.

CTIConnect, a benchmark accepted to KDD 2026 and updated June 3, 2026, tests retrieval-augmented LLMs against five heterogeneous cyber-threat-intelligence feeds (CVE, CWE, CAPEC, MITRE ATT&CK, and unstructured vendor reports) and arrives at an inconvenient finding for the SOC-copilot market: across ten state-of-the-art LLMs, model choice was not the differentiator. Retrieval quality was.

What CTIConnect Tests

CTIConnect constructs 1,860 expert-verified question-answer pairs drawn from five source types that threat analysts routinely work with: CVE descriptions, CWE definitions, CAPEC attack patterns, MITRE ATT&CK structured data, and free-form vendor threat reports. The QA pairs span nine distinct tasks organized into three categories: Entity Linking, Multi-Document Synthesis, and Entity Attribution.

That taxonomy matters. Each category tests a structurally different retrieval challenge. Entity Linking requires recognizing that a description in one source refers to the same identifier in another. Multi-Document Synthesis requires pulling consistent signal from documents that disagree on format, vocabulary, and granularity. Entity Attribution requires tracing a technique or behavior to a specific actor across sources written for different audiences with different controlled vocabularies.

The heterogeneity is the point. CTI data in production doesn’t arrive through a single normalized API. It arrives as a STIX bundle from one vendor, a CVE JSON from NVD, a PDF advisory from a government CERT, and a blog post from a threat researcher who names the same group four different ways.

The Retrieval Bottleneck

Across all ten models tested, performance gaps between tasks did not track the models’ rankings on general-capability benchmarks. According to the CTIConnect paper, the performance bottleneck shifts between retrieval infrastructure and evidence utilization depending on the task category. The upshot: the binding constraint isn’t which LLM you’re calling but whether the right documents reached the context window in the first place.
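One way to make the retrieval-versus-utilization distinction concrete is to bucket each wrong answer by whether the gold evidence was retrieved at all. The sketch below assumes a hypothetical per-question record (`EvalRecord`) with two boolean flags; it is not CTIConnect's evaluation harness, and the task names and sample data are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One benchmark question (hypothetical schema, not CTIConnect's)."""
    task: str                 # e.g. "entity_linking"
    gold_doc_retrieved: bool  # did the gold evidence reach the context window?
    answer_correct: bool      # did the model answer correctly?


def attribute_failures(records):
    """Count outcomes per task, separating retrieval misses from utilization misses."""
    counts = {}
    for r in records:
        bucket = counts.setdefault(r.task, Counter())
        if r.answer_correct:
            bucket["correct"] += 1
        elif not r.gold_doc_retrieved:
            bucket["retrieval_miss"] += 1    # evidence never arrived
        else:
            bucket["utilization_miss"] += 1  # evidence arrived, answer still wrong
    return counts


# Illustrative records only; these are not benchmark results.
records = [
    EvalRecord("entity_linking", True, True),
    EvalRecord("entity_linking", False, False),
    EvalRecord("entity_linking", False, False),
    EvalRecord("entity_attribution", True, False),
    EvalRecord("entity_attribution", True, True),
]

for task, bucket in attribute_failures(records).items():
    print(task, dict(bucket))
```

If most errors for a task land in `retrieval_miss`, swapping the LLM is unlikely to help; if they land in `utilization_miss`, the retrieval layer is doing its job and the model or prompt is the place to look.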

This holds across temporal splits spanning 2008 to 2025, suggesting the bottleneck is a structural property of the data rather than something that ages out as newer model versions appear.

Swapping one frontier model for another does not fix the underlying retrieval problem. It shifts blame for the same failure to a different vendor.

Where Vector Retrieval Fails

A companion paper on graph-augmented retrieval for industrial knowledge graphs identifies five query classes that are structurally unreachable for vector retrieval alone. No amount of model scaling would reach them without changing the retrieval architecture, because the required information cannot be recovered from embedding proximity.

The authors frame this as the “operator vocabulary thesis”: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner equipped with nine typed traversal primitives achieved F1 = 0.632, compared to 0.472 for bespoke handlers, with the difference coming entirely from retrieval structure rather than from the underlying model.

This maps directly onto CTI. The MITRE ATT&CK knowledge graph has typed relationships: a technique belongs to a tactic, an adversary uses a technique, software implements a technique. Treating that graph as a flat document corpus and embedding it loses the edges. A query like “which APT groups have used this technique against financial sector targets since 2023” requires graph traversal primitives, not nearest-neighbor lookup.
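A minimal sketch of why that query needs typed edges follows. The `Edge` record, the `sector` and `year` attributes, and every identifier in `EDGES` are hypothetical placeholders chosen for illustration; the real ATT&CK data model does not necessarily carry sector or year on its relationships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str     # typed relationship, e.g. "uses"
    target: str
    sector: str = ""  # hypothetical edge attribute: targeted sector
    year: int = 0     # hypothetical edge attribute: year of reported activity


# Toy data with placeholder names; not real ATT&CK content.
EDGES = [
    Edge("group-A", "uses", "technique-X", sector="financial", year=2024),
    Edge("group-B", "uses", "technique-X", sector="energy", year=2024),
    Edge("group-C", "uses", "technique-X", sector="financial", year=2021),
    Edge("group-D", "uses", "technique-Y", sector="financial", year=2025),
    Edge("tool-1", "implements", "technique-X"),
]


def groups_using(technique, sector, since_year, edges=EDGES):
    """Follow typed 'uses' edges into a technique, filtering on edge attributes."""
    return sorted(
        e.source
        for e in edges
        if e.relation == "uses"
        and e.target == technique
        and e.sector == sector
        and e.year >= since_year
    )


print(groups_using("technique-X", sector="financial", since_year=2023))
# prints ['group-A']
```

Every condition in the filter is an exact constraint on an edge. A nearest-neighbor lookup over embedded text could surface all five records as "similar" without any way to enforce the relation type, the sector, or the date cutoff.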

What This Means for SOC Copilot Vendors

The current market pitch for AI-powered threat intelligence tools follows a predictable pattern: name a frontier model, cite a capability benchmark, and let the customer infer that better reasoning produces better threat analysis. CTIConnect’s findings cut directly against that logic.

If retrieval quality is the binding constraint, then two products backed by the same frontier model can produce widely different results depending on how they ingest, normalize, and index CTI feeds. A vendor with a clean STIX normalization pipeline and typed index structures has a durable advantage over one that chunked the same data and ran cosine similarity over embeddings. The LLM becomes a tie-breaker at best.

The cost this creates is largely invisible in vendor comparisons. Data normalization across heterogeneous formats is engineering work, not API work. Normalizing CVE JSON, STIX 2.1, ATT&CK Navigator layers, and free-form vendor prose into a consistent, queryable structure takes months and domain expertise. Vendors who haven’t done it tend not to advertise that fact; buyers discover it during integration.
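To give a sense of what that engineering work involves, here is a deliberately small sketch of the adapter pattern: one function per feed, all mapping into a shared record. The `IntelRecord` shape, the adapter names, and the input payloads are assumptions for illustration; the payloads are simplified stand-ins loosely modeled on CVE-style JSON and STIX-style objects, not the full schemas.

```python
from dataclasses import dataclass, field


@dataclass
class IntelRecord:
    """Shared target shape (hypothetical) that every feed is mapped into."""
    source: str
    kind: str
    identifier: str
    text: str
    aliases: list = field(default_factory=list)


def from_cve_like(doc):
    """Adapter for a simplified CVE-style JSON document."""
    english = [d["value"] for d in doc.get("descriptions", []) if d.get("lang") == "en"]
    return IntelRecord("cve", "vulnerability", doc["id"], english[0] if english else "")


def from_stix_like(obj):
    """Adapter for a simplified STIX-style object such as an intrusion set."""
    # Lowercase and strip aliases so trivially different spellings collapse.
    aliases = sorted({a.strip().lower() for a in obj.get("aliases", [])})
    return IntelRecord("stix", obj["type"], obj["id"], obj.get("description", ""), aliases)


ADAPTERS = {"cve": from_cve_like, "stix": from_stix_like}


def normalize(feed, payload):
    """Dispatch a raw payload to the adapter registered for its feed."""
    if feed not in ADAPTERS:
        raise ValueError(f"no adapter for feed {feed!r}")
    return ADAPTERS[feed](payload)


cve = normalize("cve", {
    "id": "CVE-0000-00000",  # placeholder identifier
    "descriptions": [{"lang": "en", "value": "Example buffer overflow."}],
})
actor = normalize("stix", {
    "type": "intrusion-set",
    "id": "intrusion-set--example",  # placeholder, not a valid STIX id
    "name": "Example Group",
    "aliases": ["Example Group", "EXAMPLE GROUP ", "ExGroup"],
})
print(cve.identifier, "|", cve.text)
print(actor.aliases)
```

The sketch handles two feeds and one trivial alias rule. A production pipeline needs an adapter per format and version, plus entity resolution that goes well beyond case folding, which is where the months go.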

Domain-Specific Retrieval Beats Generic Pipelines

According to CTIConnect, domain-specific retrieval strategies outperform both retrieve-then-rerank and IRCoT on the benchmark’s tasks. Both are strong general-purpose pipelines with demonstrated performance on heterogeneous QA; beating them with domain-specific strategies indicates the CTI retrieval problem has structural properties that generic improvements don’t address.

IRCoT (Interleaved Retrieval with Chain-of-Thought) interleaves retrieval steps with reasoning steps and is designed precisely for multi-hop questions across disparate documents. That it underperforms domain-specific strategies on CTI tasks is a stronger result than beating a naive baseline. It indicates the cross-source semantic gap in CTI data is severe enough that general multi-hop strategies don’t close it.
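For readers unfamiliar with the pattern, the loop below is a simplified sketch in the spirit of interleaved retrieval and reasoning, not the published IRCoT implementation. The retriever is a toy token-overlap ranker, `stub_reasoner` is a scripted stand-in for an LLM call, and the corpus uses placeholder identifiers; all three are assumptions made so the example runs on its own.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))


def retrieve(query, corpus, k=1, exclude=()):
    """Toy lexical retriever: rank documents by token overlap with the query."""
    q = tokenize(query)
    scored = [
        (len(q & tokenize(text)), doc_id)
        for doc_id, text in corpus.items()
        if doc_id not in exclude
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def interleaved_retrieval(question, corpus, reason_step, max_steps=4):
    """Alternate reasoning and retrieval until the reasoner emits an answer."""
    evidence = retrieve(question, corpus)
    thoughts = []
    for _ in range(max_steps):
        thought = reason_step(question, [corpus[d] for d in evidence], thoughts)
        thoughts.append(thought)
        if thought.lower().startswith("answer:"):
            break
        # The latest reasoning sentence becomes the next retrieval query.
        evidence += retrieve(thought, corpus, exclude=evidence)
    return thoughts, evidence


CORPUS = {  # placeholder documents, not real CTI content
    "doc-vuln": "vuln-1 is an instance of weakness-7",
    "doc-weak": "weakness-7 is exploited by pattern-3",
    "doc-other": "unrelated advisory about phishing kits",
}


def stub_reasoner(question, evidence_texts, thoughts):
    """Stand-in for an LLM call: emits one scripted reasoning sentence per step."""
    joined = " ".join(evidence_texts)
    if "pattern-3" in joined:
        return "Answer: pattern-3"
    if "weakness-7" in joined:
        return "vuln-1 maps to weakness-7, so look for what exploits weakness-7"
    return "Answer: unknown"


thoughts, evidence = interleaved_retrieval(
    "Which pattern exploits vuln-1?", CORPUS, stub_reasoner
)
print(thoughts[-1], evidence)
```

The second hop only succeeds because the intermediate thought happens to share the token `weakness-7` with the next document. When two CTI sources describe the same thing in different vocabularies, that bridge is missing, which is one plausible reading of why a general multi-hop loop falls short on this data.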

The practical consequence: building a trustworthy CTI copilot requires task-specific retrieval design, not a single improved pipeline. That’s more engineering surface area than most vendor pitches acknowledge.

The Broader Pattern

The finding fits a pattern visible across several recent domain-specific LLM evaluations. FrontierOR, a May 2026 benchmark evaluating seven LLMs on large-scale optimization problems, found the strongest one-shot model outperforms Gurobi in only 31% of cases, not because the models reason incorrectly about problem structure, but because domain-efficient algorithms require structural knowledge that general reasoning doesn’t reconstruct on demand.

The common thread: general-purpose LLMs perform well on general-purpose language tasks. The further a problem type departs from language into structured domain knowledge (graph traversal, operations research, multi-schema entity resolution), the more the binding constraint shifts from model capability to infrastructure.

For CTI specifically, that infrastructure is the data pipeline. Threat intelligence arriving through five different formats, written for five different audiences, with overlapping but inconsistent controlled vocabularies, does not become queryable by embedding it and calling a frontier model. CTIConnect now has 1,860 expert-verified examples to support that claim. SOC teams evaluating copilot vendors should probably ask to see the data normalization architecture before the demo.

Frequently Asked Questions

Do these retrieval findings apply to real-time threat feeds, or only curated benchmark data?

CTIConnect’s temporal splits span 2008 through 2025, which simulates cumulative knowledge but not freshness. A production SOC ingests intelligence continuously, and the benchmark’s static QA pairs cannot capture whether a retrieval system degrades as new STIX bundles arrive with novel vocabulary or technique identifiers that lack historical neighbors in the embedding space. The cross-source semantic gap would likely sharpen under real-time ingestion because the normalization pipeline must handle new schema variants as they appear, not just the five canonical formats the benchmark covers.

How do products like Microsoft Security Copilot handle the CTI retrieval layer?

Public documentation for Microsoft Security Copilot and CrowdStrike Charlotte AI emphasizes model capability and natural-language query interfaces but does not describe the underlying CTI normalization architecture in detail. CTIConnect’s results suggest two products backed by the same frontier model could diverge sharply in accuracy depending on how each ingests and indexes feeds such as STIX 2.1 bundles, ATT&CK Navigator layers, and unstructured CERT advisories. Without published retrieval-quality benchmarks against heterogeneous CTI sources, vendor accuracy claims are difficult for buyers to verify independently.

What does a domain-specific CTI retrieval strategy look like in practice?

The graph-augmented retrieval study’s operator vocabulary thesis offers a concrete template: instead of embedding documents and running nearest-neighbor lookup, the system exposes typed traversal primitives (such as follow-edge, filter-by-type, and expand-neighborhood) that an LLM query planner selects and sequences. For CTI, this maps to chains like traversing from a CVE to its CWE parent, then to CAPEC attack patterns exploiting that weakness category, then to ATT&CK techniques that realize those patterns. Building these typed indexes requires parsing each feed into a shared graph schema before any query is issued, which is months of domain-engineering work that does not shrink by choosing a different LLM.
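The chain described above can be sketched with three small primitives and a plan that sequences them. Everything here is illustrative: the node ids are placeholders, the relation names (`instance_of`, `exploits`, `realizes`) are hypothetical, the function signatures are not the study's actual operator set, and the hard-coded `PLAN` stands in for what an LLM query planner would emit.

```python
NODES = {  # node id -> type; placeholder ids, not real catalog entries
    "cve-example": "cve",
    "cwe-example": "cwe",
    "capec-example": "capec",
    "technique-example": "attack_technique",
    "report-example": "report",
}

EDGES = [  # (source, relation, target); relation names are hypothetical
    ("cve-example", "instance_of", "cwe-example"),
    ("capec-example", "exploits", "cwe-example"),
    ("technique-example", "realizes", "capec-example"),
    ("report-example", "mentions", "cve-example"),
]


def follow_edge(frontier, relation, reverse=False):
    """Move from each frontier node along edges of one relation type."""
    if reverse:
        return {s for s, r, t in EDGES if r == relation and t in frontier}
    return {t for s, r, t in EDGES if r == relation and s in frontier}


def filter_by_type(frontier, node_type):
    """Keep only frontier nodes of the requested type."""
    return {n for n in frontier if NODES.get(n) == node_type}


def expand_neighborhood(frontier):
    """Return the frontier plus every node one edge away, in either direction."""
    out = set(frontier)
    for s, _, t in EDGES:
        if s in frontier:
            out.add(t)
        if t in frontier:
            out.add(s)
    return out


# A plan such as a query planner might emit: (primitive, arguments) steps.
PLAN = [
    (follow_edge, {"relation": "instance_of"}),               # CVE -> CWE
    (follow_edge, {"relation": "exploits", "reverse": True}),  # CWE -> CAPEC
    (follow_edge, {"relation": "realizes", "reverse": True}),  # CAPEC -> technique
    (filter_by_type, {"node_type": "attack_technique"}),
]


def run_plan(start, plan):
    """Apply each primitive in order, threading the frontier through."""
    frontier = {start}
    for primitive, kwargs in plan:
        frontier = primitive(frontier, **kwargs)
    return frontier


print(run_plan("cve-example", PLAN))
# prints {'technique-example'}
```

The planner's job reduces to choosing and ordering steps. The primitives themselves are only as good as the graph underneath them, which is why the parsing and schema work has to come first.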

What SOC tasks fall outside CTIConnect’s nine task types?

The benchmark covers Entity Linking, Multi-Document Synthesis, and Entity Attribution, but does not test indicator prioritization (deciding which IOCs deserve immediate investigation), temporal trend analysis (identifying whether a technique’s prevalence is rising or declining across reporting periods), or alert triage under ambiguous evidence. These are common in SOC workflows and may introduce retrieval challenges distinct from the cross-source semantic gap, such as confidence calibration across sources of varying reliability and freshness.

If retrieval quality were solved, would the CTI copilot problem be solved?

Probably not. FrontierOR’s optimization benchmark found the strongest one-shot LLM beats Gurobi in only 31% of cases, with the failure mode mirroring CTIConnect: models reason correctly about problem structure but cannot reconstruct domain-efficient algorithms on demand. For CTI, even with perfect retrieval, a SOC copilot would still need structured algorithmic support for tasks like attack-path enumeration and evidence correlation across time, which move beyond language understanding into operational graph search where the operator vocabulary thesis applies a second time.

Sources

  1. CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence (primary source; accessed 2026-06-07)
  2. Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs (primary source; accessed 2026-06-07)
  3. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization (analysis; accessed 2026-06-07)