<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Code with Aashish]]></title><description><![CDATA[I share insights and practical stuff from my journey as a software engineer.]]></description><link>https://blog.chapainaashish.com.np</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 23:59:50 GMT</lastBuildDate><atom:link href="https://blog.chapainaashish.com.np/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Reasoning Patterns in LLM: CoT & ReAct]]></title><description><![CDATA[A direct prompt works well with LLM for a simple task, but when a task needs multiple steps to arrive at a particular solution, a direct prompt is likely to fail because of LLM hallucination and assum]]></description><link>https://blog.chapainaashish.com.np/reasoning-patterns-in-llm-cot-react</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/reasoning-patterns-in-llm-cot-react</guid><category><![CDATA[llm]]></category><category><![CDATA[reasonml]]></category><category><![CDATA[reasoning]]></category><category><![CDATA[React]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Mon, 13 Apr 2026 08:34:57 GMT</pubDate><content:encoded><![CDATA[<p>A direct prompt works well with LLM for a simple task, but when a task needs multiple steps to arrive at a particular solution, a direct prompt is likely to fail because of LLM hallucination and assumptions. To reduce this failure, we need to force the model to <em>reason before it acts</em>. The concepts of Chain-of-Thought and ReAct try to solve this kind of problem differently. But let's be clear on when to use each.</p>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Reasoning</th>
<th>External Data</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td>Direct prompt</td>
<td>❌</td>
<td>❌</td>
<td>Simple Q&amp;A</td>
</tr>
<tr>
<td>Chain-of-Thought</td>
<td>✅</td>
<td>❌</td>
<td>Classification, triage, scoring</td>
</tr>
<tr>
<td>ReAct</td>
<td>✅</td>
<td>✅</td>
<td>Research, workflows, agents</td>
</tr>
</tbody></table>
<h2>Part 1: Chain-of-Thought - When Your Prompt Has Everything It Needs</h2>
<p>Chain-of-Thought forces the model to reason step-by-step before arriving at the solution. It doesn't call any tools or fetch external data; it simply makes the LLM think through the problem before giving the final output. Let's see this with a simple example of support ticket triage, which categorizes and prioritizes tickets based on urgency.</p>
<pre><code class="language-python">import os

import instructor
from dotenv import load_dotenv
from litellm import Router
from pydantic import BaseModel

load_dotenv()

MODEL_LIST = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.getenv("OPENROUTER_KEY"),
            "base_url": "https://openrouter.ai/api/v1",
            "rpm": 6,
        },
    },
]


class SupportTriage(BaseModel):
    reasoning: str
    priority: str  # critical / high / medium / low
    category: str  # billing / technical / account / general


TRIAGE_PROMPT = """You are a senior support engineer triaging incoming tickets.
Think through each ticket before classifying it:
- What is the user actually experiencing?
- What is the business impact if unresolved?
- Which team has the expertise to resolve this?
- How urgently does this need attention?
Reason through these questions explicitly before producing your output.
"""


def triage_ticket(client: instructor.Instructor, ticket: str) -&gt; SupportTriage:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": f"Ticket:\n{ticket}"},
        ],
        response_model=SupportTriage,
        temperature=0,
        max_retries=3,
    )


if __name__ == "__main__":
    router = Router(model_list=MODEL_LIST)
    client = instructor.from_litellm(router.completion)

    tickets = [
        "Our entire team is locked out since the deployment 20 minutes ago. Client demo in 2 hours.",
        "I was charged twice for my subscription this month. Please refund the duplicate.",
    ]

    for ticket in tickets:
        result = triage_ticket(client, ticket)
        print(f"Ticket   : {ticket[:60]}...")
        print(f"Reasoning: {result.reasoning}")
        print(f"Priority : {result.priority}")
        print(f"Category : {result.category}")
        print()
</code></pre>
<p>Here, the <code>reasoning</code> field is the audit trail. So when a ticket gets mis-triaged, you can see exactly where the model's logic broke down and fix the prompt.</p>
<p>CoT is enough when the model already has all the facts, but the moment it needs to <em>fetch</em> something — like prices, documents, or database records — we need something called ReAct.</p>
<h2>Part 2: ReAct - When the Model Needs to Get Data</h2>
<p>ReAct stands for <strong>Reason + Act</strong>. Here, the model has the ability to call an external service or tool to get the final output. So the model enters a loop instead of producing a single-shot output.</p>
<p>The workflow of ReAct looks like this:</p>
<pre><code class="language-plaintext">Thought → Action → Observation → Thought → Action → ... → Final Answer
</code></pre>
<p>In each cycle, the model thinks about what it needs, calls a tool, reads the result, and decides what to do next.</p>
<p>Let's understand this concept with an example that analyzes and compares stock prices and news, then gives a final recommendation:</p>
<pre><code class="language-python">import json
import os
import re

import instructor
from dotenv import load_dotenv
from litellm import Router

load_dotenv()

MODEL_LIST = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.getenv("OPENROUTER_KEY"),
            "base_url": "https://openrouter.ai/api/v1",
            "rpm": 6,
        },
    },
]


# Tools
def get_stock_price(ticker: str) -&gt; dict:
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2, "volume": "52M"},
        "GOOGL": {"price": 141.80, "change_pct": -0.8, "volume": "21M"},
        "MSFT": {"price": 378.50, "change_pct": 0.5, "volume": "18M"},
    }
    ticker = ticker.upper()
    if ticker not in prices:
        return {"error": f"Ticker '{ticker}' not found"}
    return {"ticker": ticker, **prices[ticker]}


def get_company_news(ticker: str) -&gt; dict:
    news = {
        "AAPL": [
            "Vision Pro sales exceeded Q1 targets",
            "iPhone 16 supply chain ramp-up confirmed",
        ],
        "MSFT": [
            "Copilot reaches 1M enterprise users",
            "Azure OpenAI expands to 12 regions",
        ],
        "GOOGL": [
            "Gemini Ultra integrated into Workspace",
            "Antitrust ruling puts ads under scrutiny",
        ],
    }
    ticker = ticker.upper()
    if ticker not in news:
        return {"error": f"No news for '{ticker}'"}
    return {"ticker": ticker, "headlines": news[ticker]}


def compare_stocks(tickers: list[str]) -&gt; dict:
    results = {
        t: get_stock_price(t) for t in tickers if "error" not in get_stock_price(t)
    }
    if not results:
        return {"error": "No valid tickers"}
    best = max(results, key=lambda t: results[t]["change_pct"])
    return {"comparison": results, "best_performer_today": best}


TOOLS = {
    "get_stock_price": get_stock_price,
    "get_company_news": get_company_news,
    "compare_stocks": compare_stocks,
}

SYSTEM_PROMPT = """You are a stock research assistant with access to:

- get_stock_price(ticker: str) → price, change_pct, volume
- get_company_news(ticker: str) → recent headlines
- compare_stocks(tickers: list[str]) → side-by-side comparison

Respond ONLY in this format:

Thought: &lt;what you know and what you need&gt;
Action: &lt;tool_name&gt;
Action Input: &lt;valid JSON arguments&gt;

When done:

Thought: I have all the information needed.
Final Answer: &lt;your complete response&gt;

Rules: always think before acting, use exact tool names, Action Input must be valid JSON.
"""


def parse_action(text: str) -&gt; tuple[str | None, dict | None]:
    action_match = re.search(r"Action:\s*(\w+)", text)
    input_match = re.search(r"Action Input:\s*(\{.*?\}|\[.*?\])", text, re.DOTALL)
    if not action_match or not input_match:
        return None, None
    try:
        args = json.loads(input_match.group(1).strip())
    except json.JSONDecodeError:
        return action_match.group(1).strip(), None
    return action_match.group(1).strip(), args


def call_tool(tool_name: str, args: dict) -&gt; str:
    if tool_name not in TOOLS:
        return f"Error: Unknown tool '{tool_name}'. Available: {list(TOOLS.keys())}"
    try:
        result = (
            TOOLS[tool_name](**args)
            if isinstance(args, dict)
            else TOOLS[tool_name](args)
        )
        return json.dumps(result, indent=2)
    except Exception as e:
        return f"Error calling {tool_name}: {str(e)}"


def react_agent(
    client: instructor.Instructor, question: str, max_iterations: int = 6
) -&gt; str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    print(f"\nQuestion: {question}\n{'='*60}")

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_model=None,
            temperature=0,
            max_tokens=500,
            stop=["Observation:"],
        )
        text = response.choices[0].message.content.strip()
        print(f"\n[Step {i+1}]\n{text}")

        if "Final Answer:" in text:
            return text.split("Final Answer:")[-1].strip()

        tool_name, args = parse_action(text)
        if not tool_name:
            messages.append({"role": "assistant", "content": text})
            messages.append(
                {
                    "role": "user",
                    "content": "Follow the format: Thought / Action / Action Input.",
                }
            )
            continue

        obs = (
            call_tool(tool_name, args)
            if args is not None
            else f"Error: could not parse args for '{tool_name}'"
        )
        print(f"\nObservation: {obs}")

        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": f"Observation: {obs}"})

    return "Max iterations reached."


if __name__ == "__main__":
    router = Router(model_list=MODEL_LIST)
    client = instructor.from_litellm(router.completion)

    react_agent(
        client,
        "Which is performing better today Google or Microsoft? Give me a brief recommendation",
    )
</code></pre>
<p>Here, <code>get_stock_price</code>, <code>get_company_news</code>, and <code>compare_stocks</code> are the tools that can be called by the LLM to get the relevant information. Also, observations go in as user messages, not as assistant messages, so the model treats tool output as external input rather than as something it said itself.</p>
<p><code>stop=["Observation:"]</code> cuts generation off before the model can write its own observation, so the real tool output gets injected instead of a hallucinated one.</p>
<p><code>max_iterations=6</code> caps the reasoning loop at six cycles, which bounds cost and guards against infinite loops.</p>
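<p>As a sanity check, the parsing step can be exercised on its own. Here is the same regex logic, duplicated so the snippet runs standalone, applied to a sample model turn:</p>
<pre><code class="language-python">import json
import re


def parse_action(text):
    # Same extraction logic as in the agent loop above.
    action_match = re.search(r"Action:\s*(\w+)", text)
    input_match = re.search(r"Action Input:\s*(\{.*?\}|\[.*?\])", text, re.DOTALL)
    if not action_match or not input_match:
        return None, None
    try:
        args = json.loads(input_match.group(1).strip())
    except json.JSONDecodeError:
        return action_match.group(1).strip(), None
    return action_match.group(1).strip(), args


sample = """Thought: I need both stocks' performance to compare them.
Action: compare_stocks
Action Input: {"tickers": ["GOOGL", "MSFT"]}"""

print(parse_action(sample))
# ('compare_stocks', {'tickers': ['GOOGL', 'MSFT']})
</code></pre>
<p>If the model drifts from the format, both matches fail and the loop nudges it back with a reminder message instead of crashing.</p>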
<h2>Part 3: Building a ReAct Agent with LangChain</h2>
<p>As you keep building the ReAct loop from scratch, you might spend most of your time debugging it rather than working on the actual logic. So in production, we use frameworks such as LangChain or CrewAI to run the ReAct agent. These frameworks let you spend your time implementing logic rather than building infrastructure for things like token management, tracing, and error recovery.</p>
<p>Here is the same example using LangChain to build the ReAct agent rather than building it from scratch. The code below uses <code>create_agent</code> from LangChain, the successor to LangGraph's <code>create_react_agent</code>, which is the current recommended approach.</p>
<pre><code class="language-python">import os

from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

load_dotenv()


@tool
def get_stock_price(ticker: str) -&gt; dict:
    """Get the current price and daily change for a stock ticker"""
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2, "volume": "52M"},
        "GOOGL": {"price": 141.80, "change_pct": -0.8, "volume": "21M"},
        "MSFT": {"price": 378.50, "change_pct": 0.5, "volume": "18M"},
    }
    ticker = ticker.upper()
    if ticker not in prices:
        return {
            "error": f"Ticker '{ticker}' not found. Supported: {list(prices.keys())}"
        }
    return {"ticker": ticker, **prices[ticker]}


@tool
def get_company_news(ticker: str) -&gt; dict:
    """Get the latest news headlines for a company by stock ticker."""
    news = {
        "AAPL": [
            "Vision Pro sales exceeded Q1 targets",
            "iPhone 16 supply chain ramp-up confirmed",
        ],
        "MSFT": [
            "Copilot reaches 1M enterprise users",
            "Azure OpenAI expands to 12 regions",
        ],
        "GOOGL": [
            "Gemini Ultra integrated into Workspace",
            "Antitrust ruling puts ads under scrutiny",
        ],
    }
    ticker = ticker.upper()
    if ticker not in news:
        return {"error": f"No news for '{ticker}'"}
    return {"ticker": ticker, "headlines": news[ticker]}


@tool
def compare_stocks(tickers: list[str]) -&gt; dict:
    """Compare daily performance across multiple stock tickers."""
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2},
        "GOOGL": {"price": 141.80, "change_pct": -0.8},
        "MSFT": {"price": 378.50, "change_pct": 0.5},
    }
    results = {t.upper(): prices[t.upper()] for t in tickers if t.upper() in prices}
    if not results:
        return {"error": "No valid tickers provided"}
    best = max(results, key=lambda t: results[t]["change_pct"])
    return {"comparison": results, "best_performer_today": best}


llm = ChatOpenAI(
    model="openai/gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENROUTER_KEY"),
    openai_api_base="https://openrouter.ai/api/v1",
)

tools = [get_stock_price, get_company_news, compare_stocks]
agent = create_agent(llm, tools)


def run_agent_streamed(question: str) -&gt; None:
    print(f"\nQuestion: {question}\n{'='*60}")
    for chunk in agent.stream({"messages": [HumanMessage(content=question)]}):
        if "model" in chunk:
            for msg in chunk["model"]["messages"]:
                if msg.tool_calls:
                    for tc in msg.tool_calls:
                        print(f"  → Calling: {tc['name']}({tc['args']})")
                elif msg.content:
                    print(f"\nFinal Answer:\n{msg.content}")
        elif "tools" in chunk:
            for msg in chunk["tools"]["messages"]:
                print(f"  ← Result  : {msg.content[:120]}...")


if __name__ == "__main__":
    run_agent_streamed("Which is performing better today Google or Microsoft?")
    run_agent_streamed("What's the latest on Apple and is the stock up or down?")
</code></pre>
<p>As you can see from the code above, tool schemas are auto-generated from type hints and docstrings. The model gets a precise description of every tool, which reduces wrong-tool calls in production. Using <code>agent.stream()</code> surfaces each step of the analysis in real time, which is useful for showing progress in a UI. Frameworks like these are great in production, but they also add abstraction layers that can obscure costs, and when something breaks, you're debugging through those layers. So use them wisely and understand the trade-offs before jumping into code.</p>
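<p>The schema-from-type-hints idea itself is easy to see in plain Python. The sketch below is a simplified illustration of what a decorator like <code>@tool</code> derives, not LangChain's actual implementation:</p>
<pre><code class="language-python">import inspect


def get_stock_price(ticker: str):
    """Get the current price and daily change for a stock ticker"""
    ...


def tool_schema(fn):
    """Build a minimal tool description from a function's
    signature and docstring."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {
            name: param.annotation.__name__
            for name, param in sig.parameters.items()
        },
    }


print(tool_schema(get_stock_price))
# {'name': 'get_stock_price', 'description': 'Get the current price and
#  daily change for a stock ticker', 'parameters': {'ticker': 'str'}}
</code></pre>
<p>LangChain builds a richer JSON Schema than this, but the raw material is the same: the function name, its docstring, and its annotated parameters.</p>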
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Validation & Testing LLM Outputs]]></title><description><![CDATA[The moment we interact with LLMs, we get probabilistic output. They'll return "price": "$45.99" one time and "price": 45.99 the next. Sometimes, they even forget required fields. This might not look l]]></description><link>https://blog.chapainaashish.com.np/validation-testing-llm-outputs</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/validation-testing-llm-outputs</guid><category><![CDATA[AI]]></category><category><![CDATA[ai app development]]></category><category><![CDATA[llm]]></category><category><![CDATA[LLMTesting]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Fri, 03 Apr 2026 09:58:03 GMT</pubDate><content:encoded><![CDATA[<p>The moment we interact with LLMs, we get probabilistic output. They'll return <code>"price": "$45.99"</code> one time and <code>"price": 45.99</code> the next. Sometimes, they even forget required fields. This might not look like a big deal in a general chatbot, but in production, these inconsistencies crash pipelines and corrupt data. That's why we need validation to make sure LLM outputs are uniform and clean.</p>
<h2>1. Establishing Boundaries with Enums</h2>
<p>First of all, the simplest way to make LLM outputs strict is with enums. Enums are used when we want the LLM to pick from specific options:</p>
<pre><code class="language-python">from enum import Enum

class ExperienceLevel(str, Enum):
    ENTRY = "entry"
    MID = "mid"
    SENIOR = "senior"
</code></pre>
<p>Now the LLM returns only these exact values. Nothing like "Senior level" or "Sr."</p>
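<p>To see the guardrail in action, here is a quick standalone check. The <code>Candidate</code> model is a hypothetical wrapper added just for this demo, and <code>ExperienceLevel</code> is redefined locally so the snippet runs on its own:</p>
<pre><code class="language-python">from enum import Enum

from pydantic import BaseModel, ValidationError


class ExperienceLevel(str, Enum):
    ENTRY = "entry"
    MID = "mid"
    SENIOR = "senior"


class Candidate(BaseModel):
    level: ExperienceLevel


print(Candidate(level="senior").level.value)  # senior

try:
    Candidate(level="Sr.")  # not one of the allowed values
except ValidationError:
    print("rejected: 'Sr.' is not a valid ExperienceLevel")
</code></pre>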
<h2>2. Managing Complexity with Nested Pydantic Models</h2>
<p>But real data is more complex than single values. So, when we work with nested structures, we need nested models and validation functions. For example, a salary should always be between a minimum and maximum point:</p>
<pre><code class="language-python">from pydantic import BaseModel, Field, field_validator, model_validator
from typing import Optional

class SalaryRange(BaseModel):
    min_amount: Optional[float] = Field(default=None, ge=0)
    max_amount: Optional[float] = Field(default=None, ge=0)
    currency: str = "USD"

    @model_validator(mode="after")
    def validate_range(self) -&gt; "SalaryRange":
        if self.min_amount is not None and self.max_amount is not None:
            if self.min_amount &gt; self.max_amount:
                raise ValueError(
                    f"min_amount ({self.min_amount}) cannot exceed max_amount ({self.max_amount})"
                )
        return self


class JobPosting(BaseModel):
    job_title: str
    company_name: str
    experience_level: Optional[ExperienceLevel] = None
    salary: Optional[SalaryRange] = None
</code></pre>
<p>Here we are using the <code>SalaryRange</code> model inside the <code>JobPosting</code> model, and <code>SalaryRange</code> has its own rules. If those rules fail, the validation also fails, and we can queue this for retry.</p>
<h2>3. High-Fidelity Classification and Evidence Tracking</h2>
<p>We can also follow the same pattern in classification tasks. Here, we combine enums with confidence scores and evidence to get multi-label classification:</p>
<pre><code class="language-python">class CategoryLabel(str, Enum):
    ENGINEERING = "engineering"
    DATA = "data"
    PRODUCT = "product"
    DESIGN = "design"

class JobCategory(BaseModel):
    label: CategoryLabel
    confidence: float = Field(ge=0.0, le=1.0)
    evidence: str = Field(description="Why this category applies")
</code></pre>
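<p>A quick standalone check, with the models redefined locally so the snippet runs on its own, shows the confidence bounds doing their job:</p>
<pre><code class="language-python">from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class CategoryLabel(str, Enum):
    ENGINEERING = "engineering"
    DATA = "data"
    PRODUCT = "product"
    DESIGN = "design"


class JobCategory(BaseModel):
    label: CategoryLabel
    confidence: float = Field(ge=0.0, le=1.0)
    evidence: str = Field(description="Why this category applies")


ok = JobCategory(label="data", confidence=0.85, evidence="mentions SQL and dbt")
print(ok.label.value, ok.confidence)  # data 0.85

try:
    JobCategory(label="data", confidence=1.4, evidence="too sure")
except ValidationError:
    print("rejected: confidence must stay between 0.0 and 1.0")
</code></pre>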
<h2>4. Advanced Validation: Deduplication and Honesty Checks</h2>
<p>Things don't stop here. Real-world data is messy, so we should support partial extraction by making fields optional. On top of that, LLMs tend to report high confidence even when the evidence is thin, so we should also add validators that catch an overconfident model:</p>
<pre><code class="language-python">class Skill(BaseModel):
    name: str
    is_required: bool = Field(description="True if required, False if nice-to-have")


class JobPostingAnalysis(BaseModel):
    job_title: Optional[str] = None
    company_name: Optional[str] = None
    experience_level: Optional[ExperienceLevel] = None
    salary: Optional[SalaryRange] = None
    
    skills: list[Skill] = Field(default_factory=list)
    categories: list[JobCategory] = Field(default_factory=list)
    
    overall_confidence: float = Field(ge=0.0, le=1.0)
    missing_fields: list[str] = Field(default_factory=list)
    
    @field_validator('skills')
    @classmethod
    def deduplicate_skills(cls, v: list[Skill]) -&gt; list[Skill]:
        # Keep the first occurrence of each skill name, case-insensitively.
        # set.add returns None, so the `or` records the name as a side effect.
        seen = set()
        return [s for s in v if not (s.name.lower() in seen or seen.add(s.name.lower()))]
    
    @model_validator(mode='after')
    def check_confidence_honesty(self) -&gt; 'JobPostingAnalysis':
        key_fields = [self.job_title, self.experience_level]
        filled = sum(1 for f in key_fields if f is not None)
        completeness = filled / len(key_fields)
        
        if self.overall_confidence &gt; 0.8 and completeness &lt; 0.5:
            raise ValueError(
                f"Confidence {self.overall_confidence} too high when only "
                f"{filled}/{len(key_fields)} key fields extracted"
            )
        return self
    
    @property
    def required_skills(self) -&gt; list[str]:
        return [s.name for s in self.skills if s.is_required]
</code></pre>
<p>The <code>Skill</code> list is partial data here: the LLM fills it only when skills actually appear in the posting. We also added a <code>deduplicate_skills</code> validator so the final list contains each skill only once. Besides this, the <code>check_confidence_honesty</code> validator rejects outputs where the LLM claims 95% confidence but extracted only 1 of 2 key fields.</p>
<h2>5. Unit Testing: Validating Logic Without API Costs</h2>
<p>Now that we have our models, we need to test them. Let's first do unit testing on our models directly, which doesn't need an API key:</p>
<pre><code class="language-python">import pytest

def test_salary_validation_triggers_retry():
    with pytest.raises(ValueError, match="cannot exceed"):
        SalaryRange(min_amount=200000, max_amount=100000)

def test_confidence_honesty_check():
    with pytest.raises(ValueError, match="too high"):
        JobPostingAnalysis(overall_confidence=0.95)

def test_skill_deduplication():
    analysis = JobPostingAnalysis(
        overall_confidence=0.5,
        skills=[
            Skill(name="Python", is_required=True),
            Skill(name="python", is_required=False),
        ]
    )
    assert len(analysis.skills) == 1
</code></pre>
<h2>6. Integration Testing: Verifying Real-World Extraction</h2>
<p>After we are confident with our models, it's time to do integration testing with the LLM to test actual extraction:</p>
<pre><code class="language-python">@pytest.mark.integration
class TestAnalyzerIntegration:
    @pytest.fixture
    def analyzer(self):
        return JobAnalyzer()
    
    def test_complete_extraction(self, analyzer):
        posting = """
        Senior Software Engineer at TechCorp
        Salary: $150,000 - $200,000
        Requirements: Python, Kubernetes, AWS
        """
        result = analyzer.analyze(posting)
        
        assert result["success"]
        assert result["data"].experience_level == ExperienceLevel.SENIOR
        assert result["data"].overall_confidence &gt;= 0.7
    
    def test_partial_extraction(self, analyzer):
        result = analyzer.analyze("Python developer needed.")
        assert result["success"]
        assert len(result["data"].missing_fields) &gt; 0
</code></pre>
<p>Alright, now we can be confident in our models and LLM output. Run the unit tests on every CI build, but run the integration tests only before deployment, since they consume LLM tokens.</p>
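<p>One simple way to enforce that split is the pytest marker used above, registered once in your config. The file name and wording here are just a suggestion:</p>
<pre><code class="language-plaintext"># pytest.ini
[pytest]
markers =
    integration: tests that call the real LLM API and cost tokens

# CI pipeline:      pytest -m "not integration"
# Before deploying: pytest -m integration
</code></pre>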
<h2>Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><em>Questions or feedback? Open an issue on</em> <a href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Observability in AI application]]></title><description><![CDATA[Observability is an essential topic in modern software development. It is particularly useful when you move the application to the production. Observability in AI application is more extended than traditional application because you need to observe t...]]></description><link>https://blog.chapainaashish.com.np/observability-in-ai-application</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/observability-in-ai-application</guid><category><![CDATA[AI]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[#ai-tools]]></category><category><![CDATA[backend developments]]></category><category><![CDATA[observability]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Wed, 26 Nov 2025 08:44:06 GMT</pubDate><content:encoded><![CDATA[<p>Observability is an essential topic in modern software development. It becomes particularly important once you move an application to production. Observability in an AI application goes further than in a traditional one, because besides normal application operations you also need to observe model calls, tokens, errors, and latency.</p>
<p>In this post, we will deep-dive into a production-ready observability stack for AI applications. We will mainly focus on LangSmith for detailed tracing and OpenLLMetry as a vendor-neutral alternative. Besides this, we will also see how to use a modern library like LiteLLM for multi-provider routing and budget controls.</p>
<p>Now, let’s look at the problems and how these tools solve them.</p>
<h2 id="heading-problem-1-you-cant-see-whats-happening">Problem 1: You Can't See What's Happening</h2>
<p>When your LLM API fails, you need answers like:</p>
<ul>
<li><p>Which model was being called?</p>
</li>
<li><p>How long did the request take?</p>
</li>
<li><p>What was the specific error message?</p>
</li>
</ul>
<p>Without observability, you basically need to guess the problem. So, you need to add observability tools like LangSmith or OpenLLMetry.</p>
<h2 id="heading-solution-1-observability-with-langsmith">Solution 1: Observability with LangSmith</h2>
<p>LangSmith is a managed SaaS platform from LangChain that gives you complete observability. It's very simple to get started, as shown in the code below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> APIConnectionError, APIError, AsyncOpenAI, RateLimitError, Timeout

load_dotenv()

<span class="hljs-comment"># Initialize OpenRouter client</span>
client = AsyncOpenAI(
    base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
    api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
    max_retries=<span class="hljs-number">3</span>,  <span class="hljs-comment"># Automatic exponential backoff</span>
    timeout=<span class="hljs-number">30.0</span>,
)


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_llm</span>(<span class="hljs-params">
    messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span>, temperature: float = <span class="hljs-number">0.7</span>
</span>):</span>
    <span class="hljs-string">"""
    Call LLM with messages and automatically log to LangSmith.
    """</span>
    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        <span class="hljs-keyword">return</span> response
    <span class="hljs-keyword">except</span> (APIConnectionError, APIError, RateLimitError) <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"LLM API error: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located?"</span>}]

    response = <span class="hljs-keyword">await</span> call_llm(messages)

    <span class="hljs-keyword">if</span> response:
        content = response.choices[<span class="hljs-number">0</span>].message.content
        print(<span class="hljs-string">"Response:"</span>, content)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Failed to get response from LLM."</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    asyncio.run(main())
</code></pre>
<p>After you run this, head over to <a target="_blank" href="https://smith.langchain.com">smith.langchain.com</a> and you'll see:</p>
<ul>
<li><p>The complete trace with full request and response</p>
</li>
<li><p>Token counts broken down by input and output</p>
</li>
<li><p>Latency measured in milliseconds</p>
</li>
<li><p>Cost calculated automatically for you</p>
</li>
</ul>
<p>That <code>@traceable</code> decorator does all of this under the hood. The tradeoff of relying on LangSmith for a big project in the long run is vendor lock-in to the platform.</p>
<h2 id="heading-solution-2-vendor-neutral-observability-with-openllmetry">Solution 2: Vendor-Neutral Observability with OpenLLMetry</h2>
<p>If you already have observability infrastructure (Datadog, Grafana, Honeycomb, etc.) or need complete control over your data, OpenLLMetry is your answer. It's an open-source SDK built on OpenTelemetry that sends traces to any backend you want, so unlike LangSmith, you are not tied to a single vendor.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI
<span class="hljs-keyword">from</span> traceloop.sdk <span class="hljs-keyword">import</span> Traceloop
<span class="hljs-keyword">from</span> traceloop.sdk.decorators <span class="hljs-keyword">import</span> workflow

load_dotenv()

Traceloop.init(app_name=<span class="hljs-string">"production_api"</span>, api_key=os.getenv(<span class="hljs-string">"TRACELOOP_APIKEY"</span>))

client = AsyncOpenAI(
    api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
    base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
)


<span class="hljs-meta">@workflow(name="llm_completion")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_llm</span>(<span class="hljs-params">messages: list[dict]</span>):</span>
    response = <span class="hljs-keyword">await</span> client.chat.completions.create(
        model=<span class="hljs-string">"gpt-4o-mini"</span>, messages=messages
    )
    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Explain Nepal in 50 words"</span>}]

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_llm(messages)
        print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<p>For now, I am sending the data to Traceloop, but you can configure it to export to any observability backend you already use.</p>
<h2 id="heading-problem-2-you-need-cost-visibility-and-control">Problem 2: You Need Cost Visibility and Control</h2>
<p>Alright, most developers stop at the observability part, but in production you need more than that. Observability tells you what is happening inside your system, like latency and errors, but it doesn't stop things from going wrong. You need a way to act on that data, and one tool that complements observability here is LiteLLM.</p>
<p>Observability reveals cost spikes and provider failures; LiteLLM prevents them and routes around them. Together they give you both visibility and control, which is what a real production-ready LLM stack needs.</p>
<h2 id="heading-solution-1-litellm-for-cost-control">Solution 1: LiteLLM for Cost Control</h2>
<p>LiteLLM lets you set budgets and rate limits in production. This is very useful when you have a small budget and multiple operations. Here is sample code for budget control using LiteLLM:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> litellm <span class="hljs-keyword">import</span> Router

load_dotenv()

model_list = [
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
            <span class="hljs-string">"base_url"</span>: <span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[],
)

customer_budgets = {
    <span class="hljs-string">"customer_123"</span>: {<span class="hljs-string">"limit"</span>: <span class="hljs-number">50.00</span>, <span class="hljs-string">"spent"</span>: <span class="hljs-number">0.00</span>},
    <span class="hljs-string">"customer_456"</span>: {<span class="hljs-string">"limit"</span>: <span class="hljs-number">100.00</span>, <span class="hljs-string">"spent"</span>: <span class="hljs-number">0.00</span>},
}


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_with_budget</span>(<span class="hljs-params">
    customer_id: str, messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span>
</span>):</span>
    budget = customer_budgets.get(customer_id)
    <span class="hljs-keyword">if</span> budget <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>:
        <span class="hljs-keyword">raise</span> Exception(<span class="hljs-string">f"Unknown customer: <span class="hljs-subst">{customer_id}</span>"</span>)

    <span class="hljs-keyword">if</span> budget[<span class="hljs-string">"spent"</span>] &gt;= budget[<span class="hljs-string">"limit"</span>]:
        <span class="hljs-keyword">raise</span> Exception(
            <span class="hljs-string">f"Budget exceeded for <span class="hljs-subst">{customer_id}</span>: "</span>
            <span class="hljs-string">f"$<span class="hljs-subst">{budget[<span class="hljs-string">'spent'</span>]:<span class="hljs-number">.2</span>f}</span> / $<span class="hljs-subst">{budget[<span class="hljs-string">'limit'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>
        )

    response = <span class="hljs-keyword">await</span> router.acompletion(
        model=model,
        messages=messages,
    )

    cost = response._hidden_params.get(<span class="hljs-string">"response_cost"</span>, <span class="hljs-number">0</span>)
    customer_budgets[customer_id][<span class="hljs-string">"spent"</span>] += cost

    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Write a short poem about Kathmandu."</span>}]
    customer_id = <span class="hljs-string">"customer_123"</span>

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_with_budget(
            customer_id=customer_id, messages=messages, model=<span class="hljs-string">"gpt-4o-mini"</span>
        )
        print(response)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<p>In production, you should store these budgets in a database instead of a dict, reset them monthly or per billing cycle, and set up relevant alerts.</p>
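<p>A minimal sketch of what DB-backed budgets could look like with stdlib <code>sqlite3</code> (the schema and helper names are my own illustration, not part of LiteLLM):</p>

```python
import sqlite3

# In-memory DB for the sketch; use a real file or server DB in production
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE budgets (customer_id TEXT PRIMARY KEY, limit_usd REAL, spent_usd REAL)"
)
conn.execute("INSERT INTO budgets VALUES ('customer_123', 50.0, 0.0)")


def record_spend(customer_id: str, cost: float) -> None:
    # Single UPDATE so concurrent requests don't race on the spent counter
    conn.execute(
        "UPDATE budgets SET spent_usd = spent_usd + ? WHERE customer_id = ?",
        (cost, customer_id),
    )
    conn.commit()


def over_budget(customer_id: str) -> bool:
    row = conn.execute(
        "SELECT spent_usd >= limit_usd FROM budgets WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return bool(row and row[0])
```

A monthly reset is then just an <code>UPDATE budgets SET spent_usd = 0</code> run on your billing schedule.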
<h2 id="heading-solution-2-high-availability-through-multi-provider-fallbacks">Solution 2: High Availability Through Multi-Provider Fallbacks</h2>
<p>Since we already implemented LiteLLM, we can leverage it further to reduce application downtime. If your application depends on a single provider, it goes down with them. To avoid this, we can add fallback providers without any significant code changes.</p>
<p>In the example below, when OpenAI goes down, LiteLLM automatically routes to Anthropic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> litellm <span class="hljs-keyword">import</span> Router

load_dotenv()

model_list = [
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
            <span class="hljs-string">"base_url"</span>: <span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"claude-haiku-4.5"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"claude-haiku-4-5"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"ANTHROPIC_API_KEY"</span>),
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{<span class="hljs-string">"gpt-4o-mini"</span>: [<span class="hljs-string">"claude-haiku-4.5"</span>]}],
    retry_after=<span class="hljs-number">10</span>,
    allowed_fails=<span class="hljs-number">3</span>,
)


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_model</span>(<span class="hljs-params">messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span></span>):</span>
    response = <span class="hljs-keyword">await</span> router.acompletion(
        model=model,
        messages=messages,
    )
    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Write a short poem about Kathmandu."</span>}]

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_model(messages=messages, model=<span class="hljs-string">"gpt-4o-mini"</span>)
        print(response)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<h2 id="heading-resources">Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><strong>Documentation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://docs.smith.langchain.com">https://docs.smith.langchain.com</a></p>
</li>
<li><p><a target="_blank" href="https://smith.langchain.com/pricing">https://smith.langchain.com/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://www.traceloop.com/docs/openllmetry">https://www.traceloop.com/docs/openllmetry</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/traceloop/openllmetry">https://github.com/traceloop/openllmetry</a></p>
</li>
<li><p><a target="_blank" href="https://www.traceloop.com/docs/openllmetry/integrations">https://www.traceloop.com/docs/openllmetry/integrations</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/BerriAI/litellm">https://github.com/BerriAI/litellm</a></p>
</li>
<li><p><a target="_blank" href="https://docs.litellm.ai">https://docs.litellm.ai</a></p>
</li>
<li><p><a target="_blank" href="https://openai.com/api/pricing">https://openai.com/api/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://anthropic.com/pricing">https://anthropic.com/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://opentelemetry.io/docs/concepts/observability-primer/">https://opentelemetry.io/docs/concepts/observability-primer/</a></p>
</li>
</ul>
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a target="_blank" href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Think Before You Call LLM API]]></title><description><![CDATA[Overview
The first way to interact with any LLM is through the API in development. This is pretty basic which you will master easily but you can miss some critical aspect which I will cover in this post.
Part 1: Understand Tokens Before Getting Excit...]]></description><link>https://blog.chapainaashish.com.np/think-before-you-call-llm-api</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/think-before-you-call-llm-api</guid><category><![CDATA[llm]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Fri, 07 Nov 2025 06:29:41 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-overview">Overview</h2>
<p>The first way you will interact with any LLM during development is through its API. This is pretty basic and you will master it easily, but there are some critical aspects you can miss, which I will cover in this post.</p>
<h2 id="heading-part-1-understand-tokens-before-getting-excited">Part 1: Understand Tokens Before Getting Excited</h2>
<p>Before calling an LLM, it's important to understand tokens. When you send "Hello, world!" to GPT, the model doesn't process it as two words. It breaks the text down into smaller units called tokens. Tokens directly impact everything in LLM development.</p>
<p>OpenAI uses a tokenization algorithm called <strong>Byte Pair Encoding (BPE)</strong>, implemented in their <code>tiktoken</code> library. The BPE algorithm doesn't split text into tokens using spaces or punctuation; it's more complex than that. You can read more about it <a target="_blank" href="https://www.kaggle.com/code/qmarva/1-bpe-tokenization-algorithm-eng">here</a>.</p>
<p><strong>Some approximations:</strong></p>
<ul>
<li><p>1 token ≈ 4 characters in English</p>
</li>
<li><p>Non-English text uses significantly more tokens (sometimes 2-3x more)</p>
</li>
</ul>
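<p>These rules of thumb are easy to wrap in a quick estimator for back-of-the-envelope budgeting. The function below is just the ~4 characters/token heuristic (my own sketch); use <code>tiktoken</code> whenever you need exact counts:</p>

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate via the ~4 characters/token rule of thumb.

    Good enough for budget projections; real tokenizers will differ,
    especially for non-English text.
    """
    return max(1, -(-len(text) // 4))  # ceiling division, at least 1 token


print(estimate_tokens("Hello, world!"))  # 13 chars -> estimate of 4 tokens
```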
<h3 id="heading-tokens-are-currency">Tokens are Currency</h3>
<p><strong>1. Pricing is Per-Token</strong></p>
<p>Every API call costs money based on token count. So, if you miscalculate tokens by 50%, you're miscalculating your budget by 50%. At scale, this means thousands of dollars in unexpected costs.</p>
<p><strong>2. Context Windows Are Token-Limited</strong></p>
<p>When you see "GPT-4 has an 8K context window," that means 8,000 tokens total for your input AND the response. If you run out of tokens mid-conversation, the API fails. Your application breaks and god knows what will happen next.</p>
<p><strong>3. Rate Limits Are Token-Based</strong></p>
<p>OpenAI limits you by tokens per minute (TPM), not just requests per minute. A single large request with 10K tokens counts the same as 10 small requests with 1K tokens each.</p>
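<p>To make the TPM idea concrete, here is a minimal per-minute token budget tracker. This is my own illustration of the concept, not how OpenAI implements their limiter:</p>

```python
import time


class TokenRateLimiter:
    """Track token spend against a tokens-per-minute (TPM) budget."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.window_start = time.monotonic()
        self.tokens_used = 0

    def try_spend(self, tokens: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:  # a new minute window begins
            self.window_start = now
            self.tokens_used = 0
        if self.tokens_used + tokens > self.tpm_limit:
            return False  # this request would blow the TPM budget
        self.tokens_used += tokens
        return True
```

With a 10K TPM budget, one 10K-token request exhausts the window exactly as ten 1K-token requests would.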
<p>Okay, enough theory. Let's write code to see how tokenization works:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tiktoken


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">count_tokens</span>(<span class="hljs-params">text: str, model: str = <span class="hljs-string">"gpt-4"</span></span>) -&gt; int:</span>
    <span class="hljs-string">"""Count tokens in text for a specific model"""</span>
    encoding = tiktoken.encoding_for_model(model)
    <span class="hljs-keyword">return</span> len(encoding.encode(text))


<span class="hljs-comment"># Test different inputs</span>
examples = [<span class="hljs-string">"Hello"</span>, <span class="hljs-string">"Hello, world!"</span>, <span class="hljs-string">"Artificial Intelligence"</span>, <span class="hljs-string">"AI"</span>]

<span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> examples:
    tokens = count_tokens(text)
    print(<span class="hljs-string">f"'<span class="hljs-subst">{text}</span>' = <span class="hljs-subst">{tokens}</span> tokens"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">'Hello' = 1 tokens
'Hello, world!' = 4 tokens
'Artificial Intelligence' = 3 tokens
'AI' = 1 tokens
</code></pre>
<p>Look: "Hello, world!" is 4 tokens, not 2. The comma and the exclamation mark become tokens of their own, while the space attaches to the word after it. This is why you can't estimate tokens by counting words.</p>
<hr />
<h2 id="heading-part-2-calling-the-llm-api-right-way">Part 2: Calling the LLM API Right Way</h2>
<h3 id="heading-hide-secrets-in-env">Hide Secrets in .env</h3>
<p>Before we call the OpenAI API, we need to set up our API key properly. Never hardcode API keys in your code. It sounds pretty basic, but most people miss it.</p>
<p>Create a <code>.env</code> file in your project root:</p>
<pre><code class="lang-bash">OPENAI_API_KEY=sk-your-actual-api-key-here
</code></pre>
<p>Now add this file to <code>.gitignore</code> so it never gets committed:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">".env"</span> &gt;&gt; .gitignore
</code></pre>
<p>Look, it is that simple, ALWAYS DO IT….</p>
<h3 id="heading-get-advantage-of-async-call">Get Advantage of Async Call</h3>
<p>OpenAI can be called asynchronously, so you can send multiple requests in parallel and take advantage of async calls.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI

<span class="hljs-comment"># Load env vars</span>
load_dotenv()


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_llm_call</span>():</span>
    <span class="hljs-string">"""Simple API call"""</span>

    <span class="hljs-comment"># Using openrouter here as they provide free models (upto certain limits)</span>
    client = AsyncOpenAI(
        base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>, api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>)
    )

    response = <span class="hljs-keyword">await</span> client.chat.completions.create(
        model=<span class="hljs-string">"gpt-4o-mini"</span>,
        messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located ?"</span>}],
        temperature=<span class="hljs-number">0.7</span>,
        timeout=<span class="hljs-number">30.0</span>,
    )

    print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
    print(<span class="hljs-string">"\nToken Usage:"</span>)
    print(<span class="hljs-string">f"  Prompt: <span class="hljs-subst">{response.usage.prompt_tokens}</span>"</span>)
    print(<span class="hljs-string">f"  Completion: <span class="hljs-subst">{response.usage.completion_tokens}</span>"</span>)
    print(<span class="hljs-string">f"  Total: <span class="hljs-subst">{response.usage.total_tokens}</span>"</span>)


<span class="hljs-comment"># Run the async function</span>
asyncio.run(simple_llm_call())
</code></pre>
<p>When you run this, you'll get a response from GPT along with token usage statistics.</p>
<pre><code class="lang-plaintext">Response: Nepal is a landlocked country located in South Asia, situated mainly in the Himalayas. It is bordered by China to the north and India to the south, east, and west. Nepal is known for its diverse geography, which includes plains, hills, and the towering peaks of the Himalayas, including Mount Everest, the highest point on Earth. The capital city of Nepal is Kathmandu.

Token Usage:
  Prompt: 12
  Completion: 79
  Total: 91
</code></pre>
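<p>The real payoff of async shows up when you fan out several requests at once with <code>asyncio.gather</code>. Here is a stdlib-only sketch that simulates the latency win; the <code>fake_llm_call</code> coroutine is a stand-in for an <code>AsyncOpenAI</code> request, not a real API call:</p>

```python
import asyncio
import time


async def fake_llm_call(i: int) -> str:
    # Stand-in for an AsyncOpenAI request: the sleep simulates network latency
    await asyncio.sleep(0.2)
    return f"response {i}"


async def run_parallel(n: int):
    start = time.monotonic()
    # gather schedules all coroutines concurrently instead of one after another
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return results, time.monotonic() - start


results, elapsed = asyncio.run(run_parallel(5))
print(f"{len(results)} responses in {elapsed:.2f}s")  # ~0.2s total, not 1.0s
```

Sequential awaits would take roughly n × latency; gathered calls take roughly one latency, which is why batch workloads should always fan out.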
<h3 id="heading-understanding-the-response">Understanding the Response</h3>
<p>If you inspect the API response in detail, it returns a structured object with the following key fields:</p>
<ul>
<li><p><code>choices[0].message.content</code>: The actual text response from the model</p>
</li>
<li><p><code>usage.prompt_tokens</code>: How many tokens your input used</p>
</li>
<li><p><code>usage.completion_tokens</code>: How many tokens the response used</p>
</li>
<li><p><code>usage.total_tokens</code>: Sum of both (this is what you pay for)</p>
</li>
<li><p><code>finish_reason</code>:</p>
<ul>
<li><p><code>"stop"</code>: Model completed the response naturally</p>
</li>
<li><p><code>"length"</code>: Hit the token limit (response was cut off)</p>
</li>
<li><p><code>"content_filter"</code>: Response was blocked by safety filters</p>
</li>
</ul>
</li>
</ul>
<p>That <code>finish_reason</code> field matters in production. If you see <code>"length"</code>, the response was truncated and your application needs to handle that.</p>
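<p>One way to handle it is a small guard that checks <code>finish_reason</code> before trusting the content. The helper below is hypothetical, and the stub objects only mimic the response shape:</p>

```python
from types import SimpleNamespace


def read_completion(choice) -> str:
    """Return the message content only if the completion finished cleanly."""
    reason = choice.finish_reason
    if reason == "length":
        # Truncated output: surface it so the caller can retry with more max_tokens
        raise RuntimeError("Response truncated; retry with a higher max_tokens")
    if reason == "content_filter":
        raise RuntimeError("Response blocked by safety filters")
    return choice.message.content  # "stop": the model completed naturally


# Quick check with a stubbed choice object shaped like the real response
ok = SimpleNamespace(finish_reason="stop", message=SimpleNamespace(content="Nepal is..."))
print(read_completion(ok))
```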
<hr />
<h2 id="heading-part-3-calculating-the-cost">Part 3: Calculating The Cost</h2>
<p>Before deploying your app into production, understand token-based pricing first because LLM API calls can get expensive as you scale.</p>
<h3 id="heading-current-pricing">Current Pricing</h3>
<p>As of November 2025, OpenAI charges:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Model</th><th>Input (per 1M tokens)</th><th>Output (per 1M tokens)</th></tr>
</thead>
<tbody>
<tr>
<td>gpt-5-nano</td><td>$0.05</td><td>$0.40</td></tr>
<tr>
<td>gpt-5</td><td>$1.25</td><td>$10.00</td></tr>
</tbody>
</table>
</div><p><strong>What you can understand from this table:</strong></p>
<ul>
<li><p>Output tokens cost more than input tokens (8x more for both models)</p>
</li>
<li><p>GPT-5 is 25x more expensive than GPT-5 nano for both input and output</p>
</li>
</ul>
<h3 id="heading-real-world-cost-projection">Real-World Cost Projection</h3>
<p>Let's calculate what a real production system would cost. Imagine you're building some kind of customer support chatbot with this specific scenario:</p>
<p><strong>Scenario:</strong></p>
<ul>
<li><p>10,000 requests per day</p>
</li>
<li><p>Average 300 input tokens and 200 output tokens per request</p>
</li>
</ul>
<pre><code class="lang-python">daily_requests = <span class="hljs-number">10</span>_000
input_tokens_per_request = <span class="hljs-number">300</span>
output_tokens_per_request = <span class="hljs-number">200</span>

<span class="hljs-comment"># Calculate monthly costs</span>
monthly_requests = daily_requests * <span class="hljs-number">30</span>
monthly_input_tokens = monthly_requests * input_tokens_per_request
monthly_output_tokens = monthly_requests * output_tokens_per_request

<span class="hljs-comment"># GPT-5 nano costs</span>
nano_input_cost = (monthly_input_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">0.05</span>
nano_output_cost = (monthly_output_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">0.40</span>
nano_total = nano_input_cost + nano_output_cost

<span class="hljs-comment"># GPT-5 costs</span>
gpt5_input_cost = (monthly_input_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">1.25</span>
gpt5_output_cost = (monthly_output_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">10.00</span>
gpt5_total = gpt5_input_cost + gpt5_output_cost

print(<span class="hljs-string">f"GPT-5 Nano Monthly: $<span class="hljs-subst">{nano_total:,<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"GPT-5 Monthly: $<span class="hljs-subst">{gpt5_total:,<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"Savings with GPT-5 Nano: $<span class="hljs-subst">{gpt5_total - nano_total:,<span class="hljs-number">.2</span>f}</span>/month"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">GPT-5 Nano Monthly: $28.50
GPT-5 Monthly: $712.50
Savings with GPT-5 Nano: $684.00/month
</code></pre>
<p>Alright, now that you have seen the cost, the model choice depends on your use case. For straightforward tasks like summarization and classification, you can use the GPT-5 nano version; for complex tasks, use the full GPT-5. Furthermore, you can use <a target="_blank" href="https://openrouter.ai/rankings">Openrouter</a> to compare models for your use case.</p>
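<p>The same arithmetic from the script above can be packaged into a per-request helper. The prices are hardcoded from the table; they will drift over time, so treat them as a snapshot:</p>

```python
# Prices per 1M tokens, taken from the table above (a snapshot; check current pricing)
PRICING = {
    "gpt-5-nano": {"input": 0.05, "output": 0.40},
    "gpt-5": {"input": 1.25, "output": 10.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request from its token counts."""
    price = PRICING[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


# One chatbot request: 300 input + 200 output tokens on the nano model
print(f"${estimate_cost('gpt-5-nano', 300, 200):.6f}")
```

Multiplying the per-request figure by your monthly request volume reproduces the projection above.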
<h2 id="heading-part-4-error-handling-amp-retries-are-important">Part 4: Error Handling &amp; Retries are Important</h2>
<p>LLM APIs might fail, so you shouldn't assume they are immune to errors. Treat them like any other third-party API, or give them even more attention if you are building your system around them.</p>
<h3 id="heading-some-common-api-errors">Some Common API Errors</h3>
<p>1. Rate Limit Errors (HTTP 429)</p>
<p>2. Timeout Errors</p>
<p>3. API Errors (HTTP 500-599)</p>
<p>4. Authentication Errors (HTTP 401)</p>
<p>5. Invalid Request Errors (HTTP 400)</p>
<h3 id="heading-implementing-exponential-backoff-and-timeout">Implementing Exponential Backoff and Timeout</h3>
<p>The standard approach to handling retries is <strong>exponential backoff with jitter</strong>. The official OpenAI SDK offers built-in retries and timeouts, so you can leverage that.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> APIError, APIConnectionError, RateLimitError, APITimeoutError <span class="hljs-keyword">as</span> Timeout  <span class="hljs-comment"># v1 SDK names it APITimeoutError</span>

load_dotenv()

client = AsyncOpenAI(
    api_key=os.getenv(<span class="hljs-string">"OPENAI_API_KEY"</span>),
    max_retries=<span class="hljs-number">3</span>,  <span class="hljs-comment"># Automatic exponential backoff with jitter</span>
    timeout=<span class="hljs-number">30.0</span>
)

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_llm_call</span>():</span>
    <span class="hljs-string">"""OpenAI API Call with Safe Error Handling"""</span>
    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o-mini"</span>,
            messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located?"</span>}],
            temperature=<span class="hljs-number">0.7</span>,
        )

        print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
        print(<span class="hljs-string">"\nToken Usage:"</span>)
        print(<span class="hljs-string">f"  Prompt: <span class="hljs-subst">{response.usage.prompt_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Completion: <span class="hljs-subst">{response.usage.completion_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Total: <span class="hljs-subst">{response.usage.total_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Finish Reason: <span class="hljs-subst">{response.choices[<span class="hljs-number">0</span>].finish_reason}</span>"</span>)

    <span class="hljs-keyword">except</span> RateLimitError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Rate limit exceeded after retries:"</span>, e)

    <span class="hljs-keyword">except</span> Timeout <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Request timed out after retries:"</span>, e)

    <span class="hljs-keyword">except</span> APIConnectionError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Connection error after retries:"</span>, e)

    <span class="hljs-keyword">except</span> APIError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"API returned an error even after retries:"</span>, e)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Unexpected Error:"</span>, type(e).__name__, str(e))


asyncio.run(simple_llm_call())
</code></pre>
<p>You might wonder how <code>max_retries</code> works under the hood. It uses something called <strong>exponential backoff with jitter</strong>, which retries the API after a growing, slightly randomized delay. That delay is calculated by raising two to the power of the attempt number and adding jitter:</p>
<pre><code class="lang-python">wait_time = (<span class="hljs-number">2</span> ** attempt) + random.uniform(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>)
</code></pre>
<p>The first expression gives results like 1s, 2s, 4s, and the second expression is the jitter, a random value between 0 and 1. You might be wondering: why not just use <code>2 ** attempt</code> and get rid of the jitter?</p>
<p><strong>Without jitter:</strong></p>
<ul>
<li><p>Retry after 1s, 2s, 4s, 8s...</p>
</li>
<li><p>If 100 requests hit rate limits simultaneously, they all retry at the same time</p>
</li>
<li><p>This creates a <strong>thundering herd</strong> which means all requests slam the API at once</p>
</li>
<li><p>The API gets overwhelmed again, causing another round of failures</p>
</li>
</ul>
<p><strong>With jitter:</strong></p>
<ul>
<li><p>Retry after 1.0-2.0s, 2.0-3.0s, 4.0-5.0s...</p>
</li>
<li><p>Requests spread out over time and API load distributes evenly</p>
</li>
</ul>
<p>For more details on this pattern, see <a target="_blank" href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/">AWS's exponential backoff and jitter article</a>.</p>
<hr />
<h2 id="heading-resources">Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><strong>Documentation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://platform.openai.com/docs/api-reference">https://platform.openai.com/docs/api-reference</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/tokenizer">https://platform.openai.com/tokenizer</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/openai/tiktoken">https://github.com/openai/tiktoken</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/account/usage">https://platform.openai.com/account/usage</a></p>
</li>
<li><p><a target="_blank" href="https://openai.com/api/pricing">https://openai.com/api/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/docs/guides/best-practices/production">https://platform.openai.com/docs/guides/best-practices/production</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/docs/guides/safety-best-practices">https://platform.openai.com/docs/guides/safety-best-practices</a></p>
</li>
<li><p><a target="_blank" href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/">https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/</a></p>
</li>
<li><p><a target="_blank" href="https://realpython.com/async-io-python/">https://realpython.com/async-io-python/</a></p>
</li>
</ul>
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a target="_blank" href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item></channel></rss>