The measurement of AI adoption has a structural flaw: until recently, the only organizations capable of running rigorous studies of how AI affects real labor tasks were the frontier labs whose products stand to benefit from the findings. A preprint posted to arXiv in late May 2026 proposes a fix, building an open-source index that lets any researcher replicate those adoption studies using publicly available user-LLM chat logs and occupational task data from O*NET. The paper also tests AI capability on actual job tasks rather than benchmark suites. Both results have implications for anyone trying to evaluate the closed-API-versus-open-weight question on evidence rather than vendor claims.
Why AI adoption research has been a closed shop
The asymmetry is not subtle. When a frontier lab publishes a study showing that its product affects some percentage of work tasks, that study runs on proprietary data: API call logs, user session records, internal evaluations. Independent researchers have no equivalent access. There is no external audit trail.
The result is a literature that can’t be verified from the outside. Adoption figures cited in policy documents, procurement reviews, and earnings calls derive from data owned by parties with an obvious interest in a particular interpretation of that data. Hugging Face download counts offer a partial alternative, but they measure developer interest, not whether a model is being used for production labor tasks by non-developer workers in finance or healthcare.
arXiv:2606.26118 is a deliberate attempt to break this monopoly. The authors develop what they call the Open Source Economic Index of AI Adoption and Capability, and they open-source both the code and the dataset, positioning it explicitly as independent replication infrastructure.
How the index works: chat logs meet occupational data
The index pairs two data sources that are individually available but haven’t been combined this way before. The first is publicly available user-LLM chat data: real interaction logs between users and language models released for research purposes. The second is O*NET, an occupational database that maps specific work activities to specific job titles with enough resolution to ask whether a given AI interaction was performing a task that appears in a particular occupation’s description.
Combining them allows the researchers to do something the proprietary studies do internally: connect AI usage patterns to economic activity. A chat log showing a user asking a model to analyze a financial statement maps, via O*NET, to the task profiles of financial analysts and accountants. Run enough of those matches across a large corpus of chats, and you get an adoption estimate that isn’t filtered through a vendor’s communications team.
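The matching-and-counting step can be sketched in a few lines. This is a minimal illustration, not the preprint's method: the abstract does not say how chats are matched to tasks, so plain word overlap (Jaccard similarity) stands in for whatever classifier the authors use. The task statements are paraphrased placeholders rather than O*NET text, and `match_occupation`, `adoption_shares`, and the `min_overlap` threshold are hypothetical names and values chosen for the example.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical task statements in the style of O*NET (illustrative wording only).
TASKS = [
    ("Financial Analysts", "analyze financial statements and prepare investment reports"),
    ("Software Developers", "write and debug computer code for software applications"),
    ("Writers and Authors", "draft scripts stories and other written content"),
]


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real text-similarity model."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_occupation(message: str, min_overlap: float = 0.2) -> str | None:
    """Return the occupation whose task statement best overlaps the message."""
    msg = tokens(message)
    best_occ, best_score = None, 0.0
    for occupation, statement in TASKS:
        task = tokens(statement)
        score = len(msg & task) / len(msg | task)  # Jaccard similarity
        if score > best_score:
            best_occ, best_score = occupation, score
    # Below the (assumed) threshold, treat the chat as not work-related.
    return best_occ if best_score >= min_overlap else None


def adoption_shares(messages: list[str]) -> dict[str, float]:
    """Share of matched chats attributed to each occupation."""
    matched = [occ for occ in map(match_occupation, messages) if occ is not None]
    counts = Counter(matched)
    total = sum(counts.values())
    return {occ: n / total for occ, n in counts.items()} if total else {}


chats = [
    "Please analyze these financial statements and prepare a report",
    "Help me debug this computer code",
    "Write code for my software application",
    "What is the weather today",
]
shares = adoption_shares(chats)
print(shares)
```

On this toy corpus the unmatched weather question drops out, and the remaining three chats split between two occupations. A real pipeline would need a far stronger matcher, but the shape of the computation, classify each chat and then aggregate by occupation, is the same.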
The methodology is designed as an independent replication of the frontier-lab approach using public inputs, not a refutation. The paper’s frame is: here is how you produce an adoption study without proprietary data. The replicability is the contribution.
Which sectors show the highest adoption?
According to the preprint, the occupations with the highest AI adoption rates cluster in three sectors: finance, computer science, and the arts. Finance and computer science fit the expected pattern; both have task structures that map to LLM strengths: code generation, document analysis, data summarization, pattern recognition across large tables. The arts result is less obvious from the outside, though text generation workflows, scripting, and AI-assisted content production would all register as AI-task interactions under the O*NET mapping.
The paper does not publish specific adoption-share percentages in the abstract, and the abstract is what’s currently indexed. The preprint confirms which sectors lead; it does not say by how much, and it does not name which models appear in the underlying chat data. Anyone citing this paper should hold that distinction carefully: “finance leads in adoption” is what the paper reports; “AI handles X% of finance tasks” is not.
What happens when AI tries to do actual job tasks?
The capability evaluation is where the paper produces concrete, testable results. According to the preprint, the authors build benchmark scenarios from O*NET task descriptions, deliver them through Model Context Protocol (MCP) servers, and run Kimi-k2.5 through a harness built on the OpenAI Agents SDK. The test spans 9 occupations that appear frequently in the adoption index. Kimi-k2.5 is an open-weight model from Moonshot AI; its selection for the capability test does not mean it appears in the adoption data, and the two results should not be conflated.
The finding, quoted directly from the abstract: “AI correctly executes high-level workflows but often errs in the granular details (such as specific tool calls used).”
That distinction matters for deployment risk assessment. A model that can navigate a multi-step workflow but selects wrong tool calls at individual steps creates a different failure profile than a model that can’t parse the task at all. The high-level planning works; the execution precision doesn’t. For a financial analyst reviewing a report, the consequence of a wrong tool call might be a flagged output. For a code deployment pipeline or a compliance filing, it’s a different category of problem.
The MCP-based evaluation design is also worth examining. By routing tasks through MCP servers, the capability test is measuring agentic behavior: tool selection, multi-step sequencing, context retention across a workflow. That’s closer to how models actually get used in production than single-turn benchmark tasks.
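One way to make the workflow-versus-detail distinction measurable is to score an agent trace at two levels. The sketch below is an assumption about how such scoring could look, not the preprint's harness: the `Step` type, the stage names, and the tool names are all hypothetical, and the trace is invented to show a run that follows the right stages but picks one wrong tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    stage: str  # high-level workflow stage, e.g. "retrieve"
    tool: str   # concrete tool the agent called at that stage


def workflow_correct(expected: list[Step], actual: list[Step]) -> bool:
    """True when the agent visits the expected stages in the expected order."""
    return [s.stage for s in expected] == [s.stage for s in actual]


def tool_call_accuracy(expected: list[Step], actual: list[Step]) -> float:
    """Fraction of reference steps whose aligned tool call matches exactly."""
    if not expected:
        return 0.0
    hits = sum(e.tool == a.tool for e, a in zip(expected, actual))
    return hits / len(expected)


# Hypothetical reference workflow for a financial-analysis scenario.
reference = [
    Step("retrieve", "fetch_ledger"),
    Step("analyze", "compute_ratios"),
    Step("report", "write_summary"),
]
# Invented agent trace: right stages, wrong tool at the analysis step.
trace = [
    Step("retrieve", "fetch_ledger"),
    Step("analyze", "run_sql"),
    Step("report", "write_summary"),
]

stage_ok = workflow_correct(reference, trace)
tool_acc = tool_call_accuracy(reference, trace)
print(stage_ok, round(tool_acc, 2))
```

Reporting the two scores separately keeps a run like this one from being counted as either a clean pass or a total failure, which is the distinction the abstract's finding depends on.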
What does this mean for the closed-vs-open procurement decision?
The procurement argument here is narrower than it might appear. This paper does not demonstrate that open-weight models outperform closed APIs on occupational tasks, and it doesn’t try to. The argument is structural: independent adoption measurement is now technically feasible, and any organization making vendor decisions on the basis of self-reported capability metrics from frontier labs should want independent data to exist.
That pressure becomes concrete if follow-on applications of this methodology show open-weight models carrying substantial occupational load in the sectors where organizations are writing checks for closed-API access. DeepSeek-V3, a 671-billion-parameter mixture-of-experts model available under permissive licensing on Hugging Face, was reported by its developers to match or rival proprietary systems on key benchmarks, which is itself a self-reported claim. An independent adoption index provides the third-party signal that no lab’s self-reporting, open-weight or closed, can supply.
The current paper doesn’t provide that signal specifically about DeepSeek, Qwen, or any named model: the adoption data doesn’t identify which models generated the chat logs. But the methodological framework is now open-sourced and reproducible. As researchers apply it to more granular data, procurement analysts will have something other than download counts to argue from. The question of whether organizations in finance and computer science are doing real work with open-weight Chinese models, rather than OpenAI or Anthropic APIs, will eventually be answerable with data that isn’t controlled by any of those vendors.
Where the index falls short
Several gaps would need to close before this index carries weight in regulatory or procurement contexts.
The chat log data has an unaddressed selection bias problem. Publicly released user-LLM interaction data is not a random sample of all AI usage. It skews toward interactions that users, researchers, or labs have chosen to release, which likely overrepresents certain use cases, geographies, and model families. The abstract doesn’t discuss representativeness, and that absence matters if adoption figures from the index get treated as population estimates rather than samples from a volunteer dataset.
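A small worked example shows why the sample-versus-population distinction matters. Post-stratification, a standard survey adjustment, reweights each group's observed rate by that group's share of the target population. Nothing in the abstract indicates the authors apply it. The strata, counts, and population shares below are invented purely for illustration, and the function names are hypothetical.

```python
from __future__ import annotations


def naive_rate(sample: dict[str, tuple[int, int]]) -> float:
    """Pooled rate that treats the released chats as if they were representative."""
    hits = sum(h for h, _ in sample.values())
    total = sum(n for _, n in sample.values())
    return hits / total


def poststratified_rate(
    sample: dict[str, tuple[int, int]],
    population_share: dict[str, float],
) -> float:
    """Weight each stratum's observed rate by its share of the target population."""
    return sum(population_share[s] * (h / n) for s, (h, n) in sample.items())


# Invented numbers: (task-matched chats, total chats) per user group.
sample = {"developers": (80, 100), "non_developers": (5, 25)}
# Invented population shares; developers are overrepresented in the sample.
population = {"developers": 0.2, "non_developers": 0.8}

pooled = naive_rate(sample)
adjusted = poststratified_rate(sample, population)
print(round(pooled, 2), round(adjusted, 2))
```

With these made-up inputs the pooled rate is more than double the reweighted one, simply because the overrepresented group has the higher rate. The adjustment only works when credible population shares exist, which is the information a volunteer dataset does not supply.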
The capability evaluation is a proof of concept, not a comprehensive map. Nine occupations and one model are enough to demonstrate that the methodology works; they aren’t enough to claim that the tool-call failure pattern generalizes. That finding might be specific to Kimi-k2.5’s training distribution, to the particular MCP server configuration, or to the way the O*NET task descriptions were phrased for the scenarios. A multi-model comparison across a broader occupation set would be needed to treat it as a general result.
O*NET itself encodes a particular view of labor tasks, calibrated to formal U.S. employment categories. If open-weight model usage is concentrated in markets or informal labor arrangements that don’t map cleanly to those categories, the index undercounts it by construction.
None of these limitations invalidate the contribution. A replicable, open-source methodology for occupational AI adoption measurement is worth having even with imperfect first-run data. The value of open-sourcing the code and dataset is precisely that calibration problems can be addressed iteratively, by anyone, rather than staying locked inside a proprietary research pipeline that only the largest labs can afford to run.
Why the methodology is the news
The sector-level adoption findings and the single-model capability result are preliminary; the methodology is the durable contribution. The inputs for an occupational AI adoption study are now public: the chat log datasets, the O*NET task mappings, the MCP-based capability harness, and the code that connects them. Anyone can run this on updated data. Anyone can apply it to model-identified corpora if those become available.
That shifts the equilibrium in a small but real way. Frontier labs publishing adoption studies now have a methodological counterpart. Independent researchers, regulators, and enterprise buyers can, in principle, produce a competing estimate using the same category of evidence the labs use internally.
The gap between benchmark-score capability and deployed-task performance has been asserted for years. Whether the index’s adoption numbers will diverge from vendor self-reports when applied to larger, more representative datasets is the question that matters. The code and dataset are available. Someone is going to find out.
Frequently Asked Questions
How does this index differ from Hugging Face download counts as an adoption signal?
Download counts record a pull event, not an inference. A single enterprise environment pulls a model once but may run millions of production inferences; a developer pulling ten models may run none in production. The index attempts to connect actual task interactions to O*NET occupational categories, which download metrics cannot do regardless of volume, because download events carry no information about which job tasks were being performed.
What public chat datasets could researchers use to apply this methodology to model-identified data?
The Chatbot Arena conversation datasets released by LMSYS retain model identifiers across multi-model comparisons, and WildChat (released by AI2) contains approximately one million real user conversations with model names logged. Neither is a random sample of production AI usage, but both allow the O*NET task-mapping step to run against named models rather than anonymous logs. The paper’s open-sourced code would presumably need only modest adaptation to accept either corpus, though the abstract does not confirm what input format it expects.
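With model names attached, the same occupational tally can be split per model. The sketch below assumes each record is a dictionary with `model` and `text` keys. Those field names are hypothetical and would need to be mapped from whatever schema a given corpus actually uses. The keyword classifier is a toy stand-in for a real O*NET task matcher.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Optional


def shares_by_model(
    records: list[dict[str, str]],
    classify: Callable[[str], Optional[str]],
) -> dict[str, dict[str, float]]:
    """Per-model share of classified conversations falling in each occupation."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for rec in records:
        occupation = classify(rec["text"])
        if occupation is not None:  # skip chats that match no occupation
            counts[rec["model"]][occupation] += 1
    return {
        model: {occ: n / sum(by_occ.values()) for occ, n in by_occ.items()}
        for model, by_occ in counts.items()
    }


def toy_classifier(text: str) -> Optional[str]:
    """Keyword rules standing in for an O*NET task matcher (illustrative only)."""
    lowered = text.lower()
    if "code" in lowered:
        return "Software Developers"
    if "balance sheet" in lowered:
        return "Financial Analysts"
    return None


# Invented records with placeholder model names.
records = [
    {"model": "model-a", "text": "Debug my Python code"},
    {"model": "model-a", "text": "Summarize this balance sheet"},
    {"model": "model-b", "text": "Refactor this code"},
    {"model": "model-b", "text": "Hello there"},
]
per_model = shares_by_model(records, toy_classifier)
print(per_model)
```

Normalizing within each model keeps a corpus dominated by one model from swamping the comparison, although it does nothing about the sampling caveat noted above.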
Could the tool-call errors in the capability test reflect the MCP server configuration rather than model ability?
Plausibly yes. Error rates in tool selection tend to rise as a server exposes more candidate tools, because the model must distinguish among more options at each decision step. If the 9-occupation scenarios used dense tool surfaces, the failure pattern could reflect context-management load rather than occupational reasoning ability. The abstract does not report tool counts per scenario, so the two causes cannot be separated from the abstract alone.
Where does Kimi-k2.5 sit relative to other open-weight Chinese models on reasoning benchmarks?
At the time of the preprint’s submission, DeepSeek-R1 had matched OpenAI’s o1 on several reasoning benchmarks while Kimi-k2.5 trailed on comparable evals, placing Kimi in roughly the second tier of competitive open-weight Chinese models rather than at the top. If DeepSeek-R1 or Qwen had been selected for the capability test instead, the tool-call failure pattern might look different, which is one reason the paper’s single-model result should not be read as a ceiling for the open-weight field.