Function calling (also known as tool calling) provides a powerful mechanism for large language models to interface with external systems, databases, and APIs. When implemented correctly, it extends LLM capabilities beyond their training data, enabling dynamic data retrieval, action execution, and complex workflow automation. However, production implementations frequently encounter reliability issues ranging from hallucinated parameters to schema violations that can crash applications or trigger unintended operations.

What is Function Calling?

Function calling is a multi-step interaction pattern between an LLM and external systems where the model generates structured API calls based on user prompts. According to OpenAI’s documentation, the pattern involves five high-level steps: providing tool definitions to the model, receiving tool call requests, executing the function logic on the application side, returning results to the model, and receiving the final response[1].

A function is defined by its JSON schema, which specifies the function name, description, and parameters. When the model determines that a function should be called, it responds with a JSON object containing the arguments for the function rather than executing the function itself[2]. This architectural separation ensures that the application remains in control of actual execution, providing a critical security boundary.
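As an illustration, a hypothetical get_current_weather tool could be defined as a Python dict in OpenAI’s function format (Anthropic’s format nests the same JSON schema under an input_schema key instead of parameters):

```python
# Hypothetical weather tool definition. The schema itself is standard
# JSON Schema; only the wrapper keys differ between providers.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit to return.",
                },
            },
            "required": ["location", "format"],
        },
    },
}
```

The enum constraint and required list are the levers discussed later under schema design: they narrow what the model can legally generate.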

Anthropic describes two categories of tools: client tools that execute on your systems (requiring your implementation), and server tools that execute on Anthropic’s servers (like web search)[3]. This distinction matters for security planning—client tools can access internal systems but require careful input validation, while server tools operate within Anthropic’s sandboxed environment.

How Does Function Calling Work?

The technical implementation follows a conversational pattern that maintains state across multiple API calls. When a user sends a prompt that might require external data, the application sends both the user message and available tool definitions to the LLM.

The model then assesses whether any tools can help with the query. If so, it constructs a properly formatted tool use request with a stop_reason of tool_use[3]. The application extracts the tool name and input, executes the actual function code, and returns results in a new user message containing a tool_result content block.
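The round trip can be sketched on plain dicts shaped like Anthropic’s Messages API responses. The run_tool dispatcher and the response literal below are illustrative stand-ins, not a real client:

```python
def run_tool(name, tool_input):
    # Dispatch to your real implementations here; this stub is hypothetical.
    if name == "get_current_weather":
        return f"22 degrees C and sunny in {tool_input['location']}"
    raise ValueError(f"Unknown tool: {name}")

def build_tool_results(response):
    """Turn a model response containing tool_use blocks into the
    tool_result user message that continues the conversation."""
    results = []
    for block in response["content"]:
        if block["type"] == "tool_use":
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": run_tool(block["name"], block["input"]),
            })
    return {"role": "user", "content": results}

# A response whose stop_reason signals that the model wants a tool call:
response = {
    "stop_reason": "tool_use",
    "content": [{
        "type": "tool_use",
        "id": "toolu_123",
        "name": "get_current_weather",
        "input": {"location": "Paris", "format": "celsius"},
    }],
}
if response["stop_reason"] == "tool_use":
    follow_up = build_tool_results(response)  # append to messages, call API again
```

In production the follow_up message is appended to the conversation history and sent back to the API, which then produces the final natural-language answer.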

Consider this practical example from OpenAI’s cookbook: a weather function might take location and format parameters. When a user asks “What’s the weather in Paris?”, the model generates arguments like {"location": "Paris", "format": "celsius"} rather than making up a temperature value[4]. The application executes the actual API call, receives real data, and returns it to the model for synthesis into a natural language response.

Why Does Function Calling Matter?

The significance of function calling extends beyond simple data retrieval. According to Anthropic’s research with dozens of teams building LLM agents across industries, the most successful implementations use simple, composable patterns rather than complex frameworks[2].

Function calling enables three critical capabilities:

Data freshness: LLMs have knowledge cutoff dates. Function calling allows access to real-time information—stock prices, weather, calendar availability, and database records.

Action execution: Beyond reading data, functions can trigger actions like sending emails, creating calendar events, processing refunds, or updating database records.

Workflow orchestration: Complex multi-step processes can be broken down into discrete functions that the LLM orchestrates dynamically based on context.

Common Failure Modes and Reliability Patterns

Production function calling implementations face several predictable failure modes that require defensive engineering.

Schema Violations and Type Errors

Without structured outputs, LLMs can generate malformed JSON responses or invalid tool inputs. Anthropic’s documentation identifies specific issues: parsing errors from invalid JSON syntax, missing required fields, inconsistent data types, and schema violations requiring error handling and retries[5].

The solution is constrained decoding through structured outputs. Anthropic’s structured outputs feature guarantees schema-compliant responses by enforcing valid JSON syntax, type-safe fields, and required field presence at the API level[5]. OpenAI offers similar capabilities through their structured outputs mode, which ensures responses conform exactly to supplied JSON schemas[6].

Hallucinated Parameters

LLMs may invent parameter values when user input is ambiguous. The OpenAI cookbook demonstrates this with a system prompt instruction: “Don’t make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous”[4]. When asked “What’s the weather like today?” without location context, the model correctly asks for the city and temperature unit preference rather than guessing.

Error Handling Patterns

Microsoft’s Azure documentation outlines a three-step error handling approach: call the API with functions and user input, use the model’s response to call your API or function, then call the API again including the function response[7]. However, this basic pattern needs enhancement for production reliability.

Retry with exponential backoff: Network failures and transient errors require automatic retry mechanisms. The OpenAI cookbook implements this using the tenacity library with @retry(wait=wait_random_exponential(multiplier=1, max=40), stop=stop_after_attempt(3)) decorators[4].
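The cookbook relies on tenacity, but the same policy can be sketched with only the standard library. This dependency-free approximation mirrors the quoted parameters (randomized wait growing up to a 40-second cap, three attempts):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base=1.0, cap=40.0):
    """Retry fn() with randomized exponential backoff.

    Approximates tenacity's wait_random_exponential(multiplier=1, max=40)
    combined with stop_after_attempt(3).
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Sleep a random interval in [0, min(cap, base * 2**attempt)].
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Randomizing the wait (rather than sleeping exactly 1s, 2s, 4s, …) avoids synchronized retry storms when many requests fail at once.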

Graceful degradation: When functions fail, the application should provide informative error messages back to the model rather than crashing. This allows the LLM to explain the limitation to users or attempt alternative approaches.
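A minimal sketch of this pattern, assuming Anthropic-style tool_result blocks (which accept an is_error flag); the helper name is illustrative:

```python
def safe_tool_result(tool_use_id, fn, **kwargs):
    """Execute a tool and return a tool_result block either way, so the
    model can explain a failure or try another approach instead of the
    application crashing."""
    try:
        content = fn(**kwargs)
        is_error = False
    except Exception as exc:
        # Informative for the model, but no stack traces or internals.
        content = f"Tool execution failed: {exc}"
        is_error = True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": str(content),
        "is_error": is_error,
    }
```

The key design choice is that errors travel back through the same channel as successes, keeping the conversation loop intact.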

Validation layers: Implement server-side validation of all parameters before executing functions. Never trust LLM-generated inputs to be safe or correctly typed.
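A bare-bones validation layer might check LLM-generated arguments against the tool’s schema before anything executes. This hand-rolled sketch covers required fields, unexpected keys, string types, and enums; production code would more likely use a full validator such as the jsonschema library:

```python
def validate_args(args, schema):
    """Check LLM-generated arguments against a JSON-schema-like dict.
    Returns a list of error strings; an empty list means the args passed."""
    errors = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected parameter: {name}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors
```

Returning a list of errors (instead of raising on the first one) lets the application report every problem back to the model in a single tool_result.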

Schema Design Best Practices

Effective function schemas balance expressiveness with reliability. Based on patterns from OpenAI, Anthropic, and LangChain documentation, here are proven approaches:

Function Naming Conventions

Use descriptive, action-oriented names that clearly indicate what the function does. Prefer get_current_weather over weather or fetch_data. The name appears in the model’s context and influences its selection decisions.

Parameter Design

| Approach | Benefits | Trade-offs |
| --- | --- | --- |
| Required parameters | Explicit data requirements | Fails if information is missing |
| Optional with defaults | Graceful degradation | May produce suboptimal results |
| Enum constraints | Prevents invalid values | Limited flexibility |
| Nested objects | Complex data structures | Harder for models to generate correctly |

Based on the function calling patterns observed across implementations, required parameters with clear descriptions yield the most reliable results[1][3].

Description Quality

Parameter descriptions should include:

  • What the parameter represents
  • Expected format with examples
  • Constraints or valid ranges
  • How to infer the value from context

Example from Anthropic’s documentation: "description": "The city and state, e.g. San Francisco, CA"[3]. This pattern—value description followed by concrete example—helps models generate correctly formatted inputs.

Parallel vs. Sequential Function Calls

Modern LLMs support parallel function calling, allowing multiple independent function calls in a single response. This reduces latency for operations that don’t depend on each other—like fetching weather for multiple cities simultaneously[7].

However, not all operations can be parallelized. Sequential calling is required when:

  • One function’s output is another’s input
  • Operations have side effects that must complete in order
  • Rate limits or resource constraints require throttling
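When the model does return several independent calls in one response, the application can execute them concurrently. A sketch using the standard library’s thread pool (the tuple shape and dispatch mapping are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_calls, dispatch):
    """Run independent tool calls concurrently.

    tool_calls: list of (call_id, name, args) tuples extracted from the
    model response; dispatch: maps a tool name to a callable.
    Returns {call_id: result} for building the tool-result messages.
    """
    def run(call):
        call_id, name, args = call
        return call_id, dispatch[name](**args)

    # Threads suit I/O-bound tool calls (HTTP requests, DB queries).
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(run, tool_calls))
```

For the sequential cases listed above, a plain loop that feeds each result into the next call is the right tool; only genuinely independent calls belong in the pool.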

Comparison: Native APIs vs. Framework Abstractions

Developers face a choice between using LLM APIs directly or adopting frameworks like LangChain that abstract the function calling pattern.

| Factor | Native API | Framework (LangChain) |
| --- | --- | --- |
| Control | Full visibility into prompts and responses | Higher-level abstractions may obscure details |
| Debugging | Direct access to all request/response data | Tracing tools like LangSmith provide visibility |
| Flexibility | Implement any custom logic | Constrained to framework patterns |
| Learning curve | Requires understanding API specifics | Faster initial development |
| Portability | Provider-specific code | Standardized interface across providers |

Anthropic’s guidance is clear: “We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error”[2].

The Model Context Protocol (MCP) Standard

The Model Context Protocol (MCP) represents an emerging standard for connecting AI applications to external systems. Described as “like a USB-C port for AI applications,” MCP provides a standardized way to connect AI applications to data sources, tools, and workflows[8].

MCP reduces development complexity by providing:

  • Standardized tool definitions that work across compatible applications
  • Growing ecosystem of third-party integrations
  • Simplified client implementation patterns

For organizations building multiple AI applications, MCP offers a path to reusable tool definitions that work across Claude, ChatGPT, and other compatible systems.

Production Checklist

Before deploying function calling to production, verify:

  • All function parameters have comprehensive descriptions with examples
  • Structured outputs or strict mode is enabled for schema validation
  • Input validation layer exists between LLM output and function execution
  • Error handling includes retries with exponential backoff
  • Rate limiting is implemented for external API calls
  • Logging captures full request/response chains for debugging
  • Security review completed for all accessible functions
  • Fallback behavior defined for function failures
  • Testing includes edge cases and malformed inputs
  • Monitoring alerts on error rates and latency spikes

Frequently Asked Questions

Q: What’s the difference between function calling and tool use? A: These terms refer to the same capability. OpenAI and most of the industry use “function calling,” while Anthropic uses “tool use.” Both describe the pattern where LLMs generate structured API calls based on provided schemas. The underlying mechanism is identical—only the terminology differs.

Q: How do I prevent LLMs from hallucinating function parameters? A: Use three strategies: (1) Write detailed parameter descriptions with examples, (2) Enable structured outputs to enforce schema compliance, (3) Include system prompts instructing the model to ask for clarification rather than guess when information is ambiguous. Server-side validation provides a final safety net.

Q: Should I use LangChain or call LLM APIs directly? A: Start with direct API calls to understand the underlying patterns. Anthropic recommends this approach for most teams. Frameworks like LangChain add value when you need standardized interfaces across multiple providers or built-in tracing capabilities. Understand the abstractions before adopting them.

Q: How do I handle function call failures gracefully? A: Implement a three-layer approach: schema validation to catch malformed inputs before execution, try-catch blocks around function execution with exponential backoff for retries, and informative error messages returned to the LLM so it can explain issues to users. Never expose internal error details to end users.

Q: Can I use function calling with local LLMs? A: Yes, but support varies by model. Llama 3.1, Mistral, and other modern open-source models support function calling through various formats. Ollama and LM Studio provide interfaces for using function calling with local models. Verify your specific model’s capabilities, as implementation details differ from cloud APIs.


Function calling represents one of the most powerful capabilities in modern LLMs—when implemented correctly. The difference between a prototype and production system often comes down to defensive engineering: comprehensive schemas, robust error handling, and validation at every boundary. As the ecosystem matures with standards like MCP and improved structured outputs, the reliability gap between experimental demos and production systems continues to narrow.

Footnotes

  1. OpenAI. “Function Calling Guide.” https://platform.openai.com/docs/guides/function-calling

  2. Anthropic. “Building Effective Agents.” https://docs.anthropic.com/en/docs/build-with-claude/agent-patterns

  3. Anthropic. “Tool Use Overview.” https://docs.anthropic.com/en/docs/build-with-claude/tool-use

  4. OpenAI Cookbook. “Function Calling with the Chat Completions API.” https://github.com/openai/openai-cookbook

  5. Anthropic. “Structured Outputs.” https://docs.anthropic.com/en/docs/build-with-claude/structured-outputs

  6. OpenAI. “Structured Outputs Guide.” https://platform.openai.com/docs/guides/structured-outputs

  7. Microsoft Azure. “Function Calling with Azure OpenAI Service.” https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/function-calling

  8. Model Context Protocol. “Introduction to MCP.” https://modelcontextprotocol.io
