
Building AI Agents for Enterprise: A Practical Guide

Learn the key architectural decisions, deployment patterns, and pitfalls to avoid when building AI agents for enterprise workflows, with code examples and real metrics.

Kevin Okinedo · 2025-12-15


AI agents are no longer experimental - they are production systems handling real business workflows. At Obaro Labs, we have deployed AI agents across healthcare, finance, legal, and education. We have seen what works, what fails, and what separates a compelling demo from a system that reliably handles thousands of interactions per day. This guide shares the practical lessons from those deployments.

Architecture: The Controller-Tool Pattern

The most important decision when building an enterprise AI agent is not which LLM to use - it is how you design the agent's interaction with your existing systems. We recommend a controller-tool architecture where a central controller (the LLM) orchestrates a set of well-defined tools, each with explicit permissions, input schemas, and output formats.

The architecture follows this flow:

  1. User input arrives through your application interface
  2. The controller (LLM with system prompt) analyzes the request
  3. The controller selects and calls tools - structured functions with defined schemas
  4. Tool results are returned to the controller
  5. The controller reasons about the results and either calls more tools or generates a final response
  6. The response passes through output validation before reaching the user

This pattern is superior to giving the LLM free-form access to your systems because every tool call is structured, logged, and validated. There are no ambiguous natural language instructions being interpreted differently each time.

// Enterprise Agent: Controller-Tool Architecture
import { z } from "zod";

// Define tools with strict schemas
interface AgentTool<TInput, TOutput> {
  name: string;
  description: string;
  inputSchema: z.ZodSchema<TInput>;
  outputSchema: z.ZodSchema<TOutput>;
  permissions: string[];
  execute: (input: TInput) => Promise<TOutput>;
}

// Example: Customer lookup tool
const customerLookupTool: AgentTool<
  { customerId: string },
  { name: string; plan: string; accountStatus: string }
> = {
  name: "lookup_customer",
  description: "Look up customer information by ID",
  inputSchema: z.object({
    customerId: z.string().regex(/^CUS-\d{6}$/),
  }),
  outputSchema: z.object({
    name: z.string(),
    plan: z.string(),
    accountStatus: z.string(),
  }),
  permissions: ["customer:read"],
  execute: async (input) => {
    const customer = await db.customers.findById(input.customerId);
    if (!customer) throw new ToolError("Customer not found");
    return {
      name: customer.name,
      plan: customer.plan,
      accountStatus: customer.status,
    };
  },
};

// Agent controller with permission checking
class EnterpriseAgent {
  private tools: Map<string, AgentTool<any, any>>;
  private userPermissions: string[];
  private llm: LLMClient;         // wrapper around the provider SDK
  private logger: ToolCallLogger; // structured logging sink
  private systemPrompt: string;
  private userId: string;
  private traceId: string;

  async processRequest(
    userMessage: string,
    conversationHistory: Message[]
  ): Promise<AgentResponse> {
    // Filter tools to only those the user has permission for
    const availableTools = Array.from(this.tools.values())
      .filter((tool) =>
        tool.permissions.every((p) =>
          this.userPermissions.includes(p)
        )
      );

    const response = await this.llm.chat({
      messages: [
        { role: "system", content: this.systemPrompt },
        ...conversationHistory,
        { role: "user", content: userMessage },
      ],
      tools: availableTools.map((t) => ({
        name: t.name,
        description: t.description,
        parameters: zodToJsonSchema(t.inputSchema),
      })),
    });

    // Validate and execute tool calls
    for (const toolCall of response.toolCalls) {
      const tool = this.tools.get(toolCall.name);
      if (!tool) continue;

      // Validate input against schema
      const validatedInput = tool.inputSchema.parse(
        toolCall.arguments
      );

      // Execute with timeout and error handling
      const startTime = Date.now();
      const result = await withTimeout(
        tool.execute(validatedInput),
        5000 // 5 second timeout per tool call
      );

      // Log for observability
      await this.logger.logToolCall({
        tool: toolCall.name,
        input: validatedInput,
        output: result,
        latencyMs: Date.now() - startTime,
        userId: this.userId,
        traceId: this.traceId,
      });
    }

    return this.generateFinalResponse(response);
  }
}

Tool-Use Patterns: Lessons from Production

After deploying agents across dozens of enterprise environments, we have identified several tool-use patterns that consistently work well.

Pattern 1: Read-Confirm-Write. For any tool that modifies data, implement a two-step process. The agent first reads the current state (showing the user what will change), confirms the action with the user, and only then executes the write. This prevents accidental modifications and gives users confidence.
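A minimal sketch of the Read-Confirm-Write gate. The `ConfirmationGate` class, its field names, and the change-ID scheme are illustrative assumptions, not our production code:

```typescript
// Read-Confirm-Write: stage the change, show the user what will happen,
// and execute only on explicit confirmation. Names are illustrative.
type PendingChange = {
  id: string;
  description: string; // human-readable diff shown before confirming
  execute: () => Promise<void>;
};

class ConfirmationGate {
  private pending = new Map<string, PendingChange>();

  // Steps 1-2: read current state and stage the write, returning a
  // description for the user to approve.
  stage(change: PendingChange): string {
    this.pending.set(change.id, change);
    return `Confirm to proceed: ${change.description}`;
  }

  // Step 3: execute only after the user confirms the exact change ID.
  // Unknown or already-executed IDs are rejected, which also prevents
  // double execution.
  async confirm(id: string): Promise<boolean> {
    const change = this.pending.get(id);
    if (!change) return false;
    this.pending.delete(id);
    await change.execute();
    return true;
  }
}
```

Because a change is deleted from the pending map before execution, a repeated confirmation is a no-op rather than a duplicate write.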

Pattern 2: Scoped Tool Sets. Do not give every agent access to every tool. Scope tools based on the user's role, the conversation context, and the task type. A customer support agent does not need database admin tools. A reporting agent does not need write access. Fewer tools also mean fewer irrelevant tool calls and better performance.

Pattern 3: Structured Error Returns. When a tool fails, return a structured error message that helps the agent reason about what went wrong and what to try next. Do not just throw exceptions - return objects with error codes, human-readable messages, and suggested next actions.

LLM Selection for Agent Use Cases

Not all LLMs are equally suited for agent tasks. Based on our production experience, here is how the major models compare for enterprise agent workloads:

  Capability                     GPT-4o       Claude 3.5 Sonnet   GPT-4o-mini    Claude 3 Haiku
  Tool calling reliability       95%+         93%+                88%            85%
  Complex reasoning              Excellent    Excellent           Good           Fair
  Instruction following          Excellent    Excellent           Good           Good
  Latency (median)               1.2s         1.0s                0.4s           0.3s
  Cost per 1M tokens (in/out)    $2.50/$10    $3/$15              $0.15/$0.60    $0.25/$1.25
  Context window                 128K         200K                128K           200K

Our recommendation: Use GPT-4o or Claude 3.5 Sonnet for the controller (the reasoning engine that decides which tools to call). Use GPT-4o-mini or Claude 3 Haiku for tool-adjacent tasks like formatting results, generating summaries, or simple classifications. This hybrid approach typically reduces costs by 50-60% while maintaining high-quality reasoning.
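The hybrid approach boils down to a routing function. The task taxonomy below is an illustrative assumption; in practice the routing signal comes from where you are in the agent pipeline:

```typescript
// Hybrid model routing: the heavyweight model handles tool selection and
// reasoning, the lightweight model handles tool-adjacent tasks.
// The task categories here are illustrative.
type AgentTask =
  | "tool_selection"
  | "complex_reasoning"
  | "summarize"
  | "format"
  | "classify";

function selectModel(task: AgentTask): string {
  switch (task) {
    case "tool_selection":
    case "complex_reasoning":
      return "gpt-4o"; // or "claude-3-5-sonnet"
    case "summarize":
    case "format":
    case "classify":
      return "gpt-4o-mini"; // or "claude-3-haiku"
  }
}
```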

Observability: Seeing Inside the Agent

Enterprise AI agents must be observable. When something goes wrong - and it will - you need to be able to trace exactly what happened: what the user said, what the agent decided, which tools it called, what the tools returned, and how the agent formulated its response.

The observability stack we deploy for every agent:

  1. Trace IDs: Every user interaction gets a unique trace ID that propagates through all tool calls and LLM invocations
  2. Structured logging: Every decision point is logged with the trace ID, timestamp, input, output, and latency
  3. Quality scoring: A sample of interactions is automatically scored by an LLM judge for correctness, helpfulness, and safety
  4. Dashboard metrics: Real-time dashboards showing task completion rate, average tool calls per task, latency percentiles, error rate, and cost per interaction
  5. Alerting: Automated alerts when quality scores drop, error rates spike, or costs exceed thresholds
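Items 1 and 2 above can be sketched together: a per-interaction logger that stamps the same trace ID onto every structured entry, so one query reconstructs the full decision path. Field names are illustrative assumptions:

```typescript
// Structured logging with trace-ID propagation: every entry within one
// user interaction carries the same traceId.
import { randomUUID } from "node:crypto";

type LogEntry = {
  traceId: string;
  timestamp: string; // ISO 8601
  event: string;     // e.g. "tool_call", "llm_invocation"
  latencyMs: number;
  detail: Record<string, unknown>;
};

class TraceLogger {
  // One logger instance per user interaction; the traceId is fixed at
  // creation and propagates to every entry.
  readonly traceId = randomUUID();
  readonly entries: LogEntry[] = [];

  log(event: string, latencyMs: number, detail: Record<string, unknown>): void {
    this.entries.push({
      traceId: this.traceId,
      timestamp: new Date().toISOString(),
      event,
      latencyMs,
      detail,
    });
  }
}
```

In production the entries would go to a log sink rather than an in-memory array, but the invariant is the same: no decision point is logged without its trace ID.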

Real metrics from a production deployment (enterprise customer support agent, 3-month average):

  • Task completion rate: 87% (remaining 13% escalated to human agents)
  • Average tool calls per task: 2.3
  • Median response latency: 3.1 seconds
  • p95 response latency: 8.4 seconds
  • Average cost per interaction: $0.08
  • Customer satisfaction score: 4.2/5.0 (compared to 3.8/5.0 for human agents on similar tasks)
  • Error rate: 2.1% (tool failures, timeouts, malformed responses)

Deployment Patterns

We have deployed enterprise agents using three patterns, each suited to different requirements:

Pattern 1: API-First Deployment. The agent runs as a stateless API service behind a load balancer. Conversation state is stored in Redis or a database. This is the most scalable pattern and our default recommendation.

Pattern 2: WebSocket-Based Streaming. For applications where real-time streaming is important (chat interfaces), we deploy the agent behind a WebSocket gateway that streams partial responses as they are generated. This improves perceived latency significantly.

Pattern 3: Queue-Based Async Processing. For batch workloads (processing documents, generating reports), the agent consumes tasks from a message queue (SQS, RabbitMQ) and writes results to a database. This pattern handles traffic spikes gracefully and allows for retry logic.
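A stripped-down sketch of the queue consumer with retry. An in-memory array stands in for SQS/RabbitMQ and a `Map` stands in for the results database; the retry limit and dead-letter handling are illustrative assumptions:

```typescript
// Queue-based async processing with simple retry: failed tasks are
// requeued until a max attempt count, then marked failed (dead-lettered).
type Task = { id: string; payload: string; attempts: number };

async function consume(
  queue: Task[],
  handler: (payload: string) => Promise<string>,
  results: Map<string, string>,
  maxAttempts = 3
): Promise<void> {
  while (queue.length > 0) {
    const task = queue.shift()!;
    try {
      results.set(task.id, await handler(task.payload));
    } catch {
      task.attempts += 1;
      if (task.attempts < maxAttempts) {
        queue.push(task); // requeue for retry
      } else {
        results.set(task.id, "FAILED"); // dead-letter after max attempts
      }
    }
  }
}
```

With a real broker, the requeue and dead-letter steps map onto visibility timeouts and dead-letter queues rather than hand-rolled logic.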

Common Pitfalls (Expanded)

After deploying agents across many enterprise environments, these are the pitfalls that catch teams most often:

1. Over-relying on prompt engineering. Prompts are necessary but fragile. A well-crafted prompt will break when it encounters inputs the author did not anticipate. Build structured tool interfaces with validated schemas instead of relying on the LLM to interpret natural language instructions correctly every time.

2. Ignoring latency. Enterprise users expect sub-second responses for simple queries. If your agent takes 8 seconds to answer "what is my account balance?" users will abandon it. Profile your agent pipeline and optimize the critical path. Use caching, parallel tool calls, and model routing.
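Two of those optimizations fit in a few lines: a read-through cache, and `Promise.all` for independent tool calls. The cache-key scheme and the `fetchBalance`/`fetchPlan` functions are hypothetical:

```typescript
// Latency optimizations: cache repeated reads, and run independent
// tool calls concurrently instead of sequentially.
const cache = new Map<string, unknown>();

async function cached<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (cache.has(key)) return cache.get(key) as T;
  const value = await fn();
  cache.set(key, value);
  return value;
}

async function fetchContext(
  customerId: string,
  fetchBalance: (id: string) => Promise<number>,
  fetchPlan: (id: string) => Promise<string>
): Promise<{ balance: number; plan: string }> {
  // The two lookups do not depend on each other, so they run in parallel;
  // repeated lookups within the TTL window hit the cache.
  const [balance, plan] = await Promise.all([
    cached(`balance:${customerId}`, () => fetchBalance(customerId)),
    cached(`plan:${customerId}`, () => fetchPlan(customerId)),
  ]);
  return { balance, plan };
}
```

In production the cache needs an eviction policy and a TTL; an unbounded `Map` is only for illustration.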

3. Skipping human-in-the-loop for high-stakes actions. For any action that has significant consequences - financial transactions, medical recommendations, legal filings, account modifications - always include a human confirmation step. The cost of a human review is trivial compared to the cost of an incorrect automated action.

4. Not planning for conversation management. Enterprise conversations are long and complex. Users ask follow-up questions, change topics, and reference earlier parts of the conversation. You need a strategy for managing conversation history, including summarization for long conversations that exceed the context window.
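One common strategy is to collapse the oldest turns into a summary once the history exceeds a token budget. The 4-characters-per-token estimate and the injected `summarize` function (a cheap LLM call in practice) are simplifying assumptions:

```typescript
// Context-window management: when history exceeds the budget, older turns
// are replaced by a single summary message; recent turns stay verbatim.
type Msg = { role: "user" | "assistant" | "system"; content: string };

// Rough token estimate (~4 chars per token); use a real tokenizer in production.
const estimateTokens = (msgs: Msg[]): number =>
  Math.ceil(msgs.reduce((n, m) => n + m.content.length, 0) / 4);

function compactHistory(
  history: Msg[],
  budgetTokens: number,
  summarize: (msgs: Msg[]) => string, // in production, a cheap LLM call
  keepRecent = 4
): Msg[] {
  if (estimateTokens(history) <= budgetTokens || history.length <= keepRecent) {
    return history; // fits as-is, nothing to compact
  }
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  return [
    {
      role: "system",
      content: `Summary of earlier conversation: ${summarize(older)}`,
    },
    ...recent,
  ];
}
```

Keeping the most recent turns verbatim matters: users' follow-up questions usually reference the last few exchanges, and a summary loses the exact wording they expect the agent to remember.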

5. Insufficient testing with adversarial inputs. Your users will try things you did not expect. They will attempt prompt injection, ask questions outside the agent's scope, provide malformed input, and test the boundaries of the system. Build adversarial test suites and run them before every deployment.

6. Deploying without a rollback plan. When you update prompts, tools, or model versions, things can break in unexpected ways. Always deploy with the ability to instantly roll back to the previous version. We use feature flags for prompt versions and canary deployments for code changes.

Conclusion

Enterprise AI agents are powerful when built correctly. The key is treating them as software systems - with proper architecture, testing, monitoring, and iteration. The controller-tool pattern provides the right balance of flexibility and control. Invest heavily in observability, test with adversarial inputs, and always have a human escalation path.

At Obaro Labs, we have open-source reference implementations for the patterns described in this post. If you are building enterprise agents and want to discuss architecture, we are happy to share what we have learned across dozens of production deployments.
