Does Information-Theoretic Example Selection Beat kNN for In-Context Learning?

Nearly every few-shot prompt and RAG pipeline picks demonstrations the same way: embed the query, grab the k closest examples by cosine similarity, stuff them into the prompt. KITE (arXiv:2509.15676) argues this is leaving accuracy on the table. By replacing nearest-neighbor retrieval with a kernelized information-theoretic selection criterion, the method trades proximity for coverage and non-redundancy. The answer to the title question is a qualified yes: on classification benchmarks, KITE reports significant improvements over kNN-based selection, though the authors do not publish exact numerical deltas in the abstract.

Why your few-shot prompts pick the wrong examples

The standard pipeline is simple: encode every candidate example into an embedding space, encode the query, sort by cosine distance, take the top k. This is kNN applied to prompt construction, and it works well enough that most practitioners treat it as solved. The assumption baked in is that the examples most similar to the query are the most useful for the model to see.

KITE’s framing rejects that assumption. The method treats example selection as a query-specific optimization problem: given a specific input, choose a subset from the example bank that minimizes prediction error for that particular query, rather than selecting examples that generalize across all queries. Two examples that are both close to the query in embedding space likely carry redundant information, wasting context-window slots that a more diverse selection could use.

The redundancy problem compounds with the curse of dimensionality. As embedding dimensionality grows, distances between points become more uniform, meaning the distinction between the “nearest” and “tenth-nearest” neighbor carries less signal than it does in lower dimensions. The ranking kNN returns is still a ranking, but the differences it ranks on become noise. KATE, the named baseline in the KITE paper, exemplifies this: it selects by embedding similarity and suffers from both poor generalization and insufficient diversity in high-dimensional spaces.

How KITE works: submodular optimization with a kernel trick

KITE builds its selector in three layers.

First, it models the LLM as a linear predictor over input embeddings and derives a surrogate objective that approximates the prediction error for a given query-example set. This surrogate turns out to be approximately submodular, meaning a greedy algorithm that adds one example at a time comes with a provable approximation guarantee. In practice, the team uses a greedy forward selection: start with an empty set, score each remaining candidate, add the best one, repeat until the budget is full. The submodularity property ensures this greedy pass captures at least a constant fraction of the optimal set’s value.

Second, KITE applies the kernel trick to operate in high-dimensional feature spaces without explicitly computing the mapping. This lets the method capture nonlinear relationships between examples and queries that a raw cosine-similarity comparison would miss.

Third, the method adds a diversity regularizer based on optimal design principles, explicitly penalizing redundancy in the selected set. Where kNN returns a cluster of near-duplicates, KITE’s regularizer pushes the selection toward examples that cover different regions of the feature space relevant to the query.

Benchmark results: where KITE wins and where the evidence is thin

The authors report “significant improvements over standard retrieval methods” across a suite of classification tasks, according to the KITE abstract. The v2 revision, submitted June 2, 2026, likely contains updated experiments, but without access to the full PDF tables, specific numerical deltas are not available for this article. This gap limits independent verification: “significant” is the authors’ characterization, not an independently verified result.

The comparison scope also warrants scrutiny. KATE is the primary named baseline, but the abstract does not mention other recent demonstration-selection methods such as vote-k or coverage-based selectors. Without those comparisons, it is unclear whether KITE’s gains come from the information-theoretic criterion specifically or from the diversity regularizer that any competent coverage-based method could incorporate.

Latency vs. accuracy: is the compute tradeoff worth it?

KNN retrieval is cheap: one embedding comparison per candidate, sorted by distance, done. KITE’s greedy submodular optimization requires scoring each candidate against the current selection at every iteration, and the kernel trick adds per-candidate computation that raw cosine similarity avoids. For a small example bank and a low k, this overhead is manageable. For a production RAG pipeline pulling from a vector store with millions of documents, the cost becomes material.

The paper does not report inference-time latency measurements, so the accuracy-compute tradeoff cannot be quantified here. What can be said: KITE is most likely to be practical in settings where the example bank is small (hundreds to low thousands of candidates), k is modest (3-10 demonstrations), and the inference budget can absorb a greedy optimization pass per query. For large-scale retrieval where latency is already a bottleneck, the method may require pre-computation or caching strategies that the paper does not address.

What teams should do next

For practitioners currently using cosine-similarity top-k for few-shot demonstration selection, KITE identifies a real inefficiency. The practical migration path depends on the setting:

Label-scarce classification tasks are where KITE’s evidence is strongest. If your pipeline uses few-shot prompting for text classification, sentiment analysis, or topic labeling, testing a coverage-aware selector against your current top-k baseline is a low-risk experiment. The diversity regularizer alone, without the full kernelized criterion, may capture most of the gain.
Open-ended generation and RAG remain untested. Teams should not assume KITE’s classification results transfer. The right experiment here is to benchmark a coverage-based selector on your specific generation task and measure whether output quality changes, using whatever evaluation metric your pipeline already trusts.
Latency-sensitive serving requires profiling the greedy selection overhead against your latency budget before committing. If the bank is large, consider pre-clustering candidates and running the selection within a relevant cluster rather than over the full set.

The broader takeaway is architectural: the retrieval step in few-shot prompting is not a solved problem, and treating cosine-similarity top-k as the default is a convenience, not an optimal solution. KITE’s contribution is making that tradeoff explicit and providing a principled alternative with theoretical grounding. Whether that alternative is worth adopting depends on the task, the compute budget, and whether the accuracy gains the authors report survive independent replication.

Frequently Asked Questions

Does KITE work for open-ended generation tasks like summarization or translation?

The paper evaluates only classification, where accuracy is binary and the loss function is well defined. For generative tasks, demonstration selection may interact with output length, style, and factual grounding in ways the information-theoretic criterion does not capture. The submodular approximation guarantee assumes a differentiable loss; substituting a generative quality metric such as BLEU, ROUGE, or an LLM-as-judge score would require re-deriving the surrogate objective entirely, which the authors have not attempted.

How does KITE differ from vote-k or other coverage-based demonstration selectors?

Vote-k selects a fixed representative subset by estimating which examples would receive the most agreement from a model’s predictions across the pool. KITE optimizes per-query: it picks a different set for each input. The paper does not benchmark against vote-k, so it remains unclear whether KITE’s reported gains over KATE come from the kernelized information-theoretic criterion specifically, or from the per-query optimization strategy that vote-k could also adopt with a different scoring function.

What computational overhead does the greedy submodular pass add compared to a single kNN sort?

kNN selection runs in O(n log n) for sorting n candidates by distance. KITE’s greedy forward pass scores every remaining candidate against the growing selection set at each of k iterations, giving roughly O(k times n) kernel evaluations per query. For k=5 and n=1,000 candidates, that is approximately 5,000 kernel computations versus 1,000 dot products for kNN. The gap widens with both bank size and kernel complexity, since each kernel evaluation is itself more expensive than a single cosine comparison.

How much does the linear-model assumption weaken KITE’s theoretical guarantee?

The provable approximation guarantee applies to the submodular surrogate objective, not to the actual LLM’s prediction error. Transformer attention patterns over in-context examples are known to be highly nonlinear, particularly for tasks requiring multi-step reasoning. If the linear approximation degrades substantially, the greedy selector optimizes the wrong objective, and the constant-factor guarantee no longer bounds real performance. The paper does not report sensitivity experiments testing how much the surrogate diverges from actual model behavior across different architectures or task types.

What would deploying KITE in a production RAG pipeline with a large vector store require?

Two capabilities the paper does not address: online kernel caching across queries to amortize per-query computation, and a budget-aware early-stop criterion for the greedy pass when marginal gain drops below a threshold. Without caching, a bank of 100,000 documents with k=10 demonstrations requires roughly one million kernel evaluations per query, which is impractical at serving latency targets. The submodular guarantee degrades gracefully, so an early-stop variant could trade a small accuracy loss for a large latency reduction, but no such variant has been benchmarked.