<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Code with Aashish]]></title><description><![CDATA[I share insights and practical stuff from my journey as a software engineer.]]></description><link>https://blog.chapainaashish.com.np</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 23:59:50 GMT</lastBuildDate><atom:link href="https://blog.chapainaashish.com.np/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Reasoning Patterns in LLM: CoT & ReAct]]></title><description><![CDATA[A direct prompt works well with LLM for a simple task, but when a task needs multiple steps to arrive at a particular solution, a direct prompt is likely to fail because of LLM hallucination and assum]]></description><link>https://blog.chapainaashish.com.np/reasoning-patterns-in-llm-cot-react</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/reasoning-patterns-in-llm-cot-react</guid><category><![CDATA[llm]]></category><category><![CDATA[reasonml]]></category><category><![CDATA[reasoning]]></category><category><![CDATA[React]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Mon, 13 Apr 2026 08:34:57 GMT</pubDate><content:encoded><![CDATA[<p>A direct prompt works well with LLM for a simple task, but when a task needs multiple steps to arrive at a particular solution, a direct prompt is likely to fail because of LLM hallucination and assumptions. To reduce this failure, we need to force the model to <em>reason before it acts</em>. The concepts of Chain-of-Thought and ReAct try to solve this kind of problem differently. But let's be clear on when to use each.</p>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Reasoning</th>
<th>External Data</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td>Direct prompt</td>
<td>❌</td>
<td>❌</td>
<td>Simple Q&amp;A</td>
</tr>
<tr>
<td>Chain-of-Thought</td>
<td>✅</td>
<td>❌</td>
<td>Classification, triage, scoring</td>
</tr>
<tr>
<td>ReAct</td>
<td>✅</td>
<td>✅</td>
<td>Research, workflows, agents</td>
</tr>
</tbody></table>
<h2>Part 1: Chain-of-Thought - When Your Prompt Has Everything It Needs</h2>
<p>Chain-of-Thought forces the model to reason step-by-step before arriving at the solution. It doesn't call any tools or fetch external data; it simply makes the LLM think through the problem before giving the final output. Let's see this with a simple example of support ticket triage, which categorizes and prioritizes tickets based on urgency.</p>
<pre><code class="language-python">import os

import instructor
from dotenv import load_dotenv
from litellm import Router
from pydantic import BaseModel

load_dotenv()

MODEL_LIST = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.getenv("OPENROUTER_KEY"),
            "base_url": "https://openrouter.ai/api/v1",
            "rpm": 6,
        },
    },
]


class SupportTriage(BaseModel):
    reasoning: str
    priority: str  # critical / high / medium / low
    category: str  # billing / technical / account / general


TRIAGE_PROMPT = """You are a senior support engineer triaging incoming tickets.
Think through each ticket before classifying it:
- What is the user actually experiencing?
- What is the business impact if unresolved?
- Which team has the expertise to resolve this?
- How urgently does this need attention?
Reason through these questions explicitly before producing your output.
"""


def triage_ticket(client: instructor.Instructor, ticket: str) -&gt; SupportTriage:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": f"Ticket:\n{ticket}"},
        ],
        response_model=SupportTriage,
        temperature=0,
        max_retries=3,
    )


if __name__ == "__main__":
    router = Router(model_list=MODEL_LIST)
    client = instructor.from_litellm(router.completion)

    tickets = [
        "Our entire team is locked out since the deployment 20 minutes ago. Client demo in 2 hours.",
        "I was charged twice for my subscription this month. Please refund the duplicate.",
    ]

    for ticket in tickets:
        result = triage_ticket(client, ticket)
        print(f"Ticket   : {ticket[:60]}...")
        print(f"Reasoning: {result.reasoning}")
        print(f"Priority : {result.priority}")
        print(f"Category : {result.category}")
        print()
</code></pre>
<p>Here, the <code>reasoning</code> field is the audit trail. So when a ticket gets mis-triaged, you can see exactly where the model's logic broke down and fix the prompt.</p>
<p>CoT is enough when the model already has all the facts, but the moment it needs to <em>fetch</em> something — like prices, documents, or database records — we need something called ReAct.</p>
<h2>Part 2: ReAct - When the Model Needs to Get Data</h2>
<p>ReAct stands for <strong>Reason + Act</strong>. Here, the model has the ability to call an external service or tool to get the final output. So the model enters a loop instead of producing a single-shot output.</p>
<p>The workflow of ReAct looks like this:</p>
<pre><code class="language-plaintext">Thought → Action → Observation → Thought → Action → ... → Final Answer
</code></pre>
<p>In each cycle, the model thinks about what it needs, calls a tool, reads the result, and decides what to do next.</p>
<p>Let's understand this concept with an example that analyzes and compares stock prices and news, then gives a final recommendation:</p>
<pre><code class="language-python">import json
import os
import re

import instructor
from dotenv import load_dotenv
from litellm import Router

load_dotenv()

MODEL_LIST = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.getenv("OPENROUTER_KEY"),
            "base_url": "https://openrouter.ai/api/v1",
            "rpm": 6,
        },
    },
]


# Tools
def get_stock_price(ticker: str) -&gt; dict:
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2, "volume": "52M"},
        "GOOGL": {"price": 141.80, "change_pct": -0.8, "volume": "21M"},
        "MSFT": {"price": 378.50, "change_pct": 0.5, "volume": "18M"},
    }
    ticker = ticker.upper()
    if ticker not in prices:
        return {"error": f"Ticker '{ticker}' not found"}
    return {"ticker": ticker, **prices[ticker]}


def get_company_news(ticker: str) -&gt; dict:
    news = {
        "AAPL": [
            "Vision Pro sales exceeded Q1 targets",
            "iPhone 16 supply chain ramp-up confirmed",
        ],
        "MSFT": [
            "Copilot reaches 1M enterprise users",
            "Azure OpenAI expands to 12 regions",
        ],
        "GOOGL": [
            "Gemini Ultra integrated into Workspace",
            "Antitrust ruling puts ads under scrutiny",
        ],
    }
    ticker = ticker.upper()
    if ticker not in news:
        return {"error": f"No news for '{ticker}'"}
    return {"ticker": ticker, "headlines": news[ticker]}


def compare_stocks(tickers: list[str]) -&gt; dict:
    results = {
        t: get_stock_price(t) for t in tickers if "error" not in get_stock_price(t)
    }
    if not results:
        return {"error": "No valid tickers"}
    best = max(results, key=lambda t: results[t]["change_pct"])
    return {"comparison": results, "best_performer_today": best}


TOOLS = {
    "get_stock_price": get_stock_price,
    "get_company_news": get_company_news,
    "compare_stocks": compare_stocks,
}

SYSTEM_PROMPT = """You are a stock research assistant with access to:

- get_stock_price(ticker: str) → price, change_pct, volume
- get_company_news(ticker: str) → recent headlines
- compare_stocks(tickers: list[str]) → side-by-side comparison

Respond ONLY in this format:

Thought: &lt;what you know and what you need&gt;
Action: &lt;tool_name&gt;
Action Input: &lt;valid JSON arguments&gt;

When done:

Thought: I have all the information needed.
Final Answer: &lt;your complete response&gt;

Rules: always think before acting, use exact tool names, Action Input must be valid JSON.
"""


def parse_action(text: str) -&gt; tuple[str | None, dict | None]:
    action_match = re.search(r"Action:\s*(\w+)", text)
    input_match = re.search(r"Action Input:\s*(\{.*?\}|\[.*?\])", text, re.DOTALL)
    if not action_match or not input_match:
        return None, None
    try:
        args = json.loads(input_match.group(1).strip())
    except json.JSONDecodeError:
        return action_match.group(1).strip(), None
    return action_match.group(1).strip(), args


def call_tool(tool_name: str, args: dict) -&gt; str:
    if tool_name not in TOOLS:
        return f"Error: Unknown tool '{tool_name}'. Available: {list(TOOLS.keys())}"
    try:
        result = (
            TOOLS[tool_name](**args)
            if isinstance(args, dict)
            else TOOLS[tool_name](args)
        )
        return json.dumps(result, indent=2)
    except Exception as e:
        return f"Error calling {tool_name}: {str(e)}"


def react_agent(
    client: instructor.Instructor, question: str, max_iterations: int = 6
) -&gt; str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    print(f"\nQuestion: {question}\n{'='*60}")

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_model=None,
            temperature=0,
            max_tokens=500,
            stop=["Observation:"],
        )
        text = response.choices[0].message.content.strip()
        print(f"\n[Step {i+1}]\n{text}")

        if "Final Answer:" in text:
            return text.split("Final Answer:")[-1].strip()

        tool_name, args = parse_action(text)
        if not tool_name:
            messages.append({"role": "assistant", "content": text})
            messages.append(
                {
                    "role": "user",
                    "content": "Follow the format: Thought / Action / Action Input.",
                }
            )
            continue

        obs = (
            call_tool(tool_name, args)
            if args is not None
            else f"Error: could not parse args for '{tool_name}'"
        )
        print(f"\nObservation: {obs}")

        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": f"Observation: {obs}"})

    return "Max iterations reached."


if __name__ == "__main__":
    router = Router(model_list=MODEL_LIST)
    client = instructor.from_litellm(router.completion)

    react_agent(
        client,
        "Which is performing better today Google or Microsoft? Give me a brief recommendation",
    )
</code></pre>
<p>Here, <code>get_stock_price</code>, <code>get_company_news</code>, and <code>compare_stocks</code> are the tools that can be called by the LLM to get the relevant information. Also, observations go in as user messages, not as assistant messages, so the model treats tool output as external input rather than as something it said itself.</p>
<p><code>stop=["Observation:"]</code> cuts generation off before the model can write its own observation, so the real tool output gets injected instead of a hallucinated one.</p>
<p><code>max_iterations=6</code> caps the reasoning loop at six cycles, which bounds cost and guards against infinite loops.</p>
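<p>As a sanity check, the parsing step can be exercised on its own. Here is the same regex logic, duplicated so the snippet runs standalone, applied to a sample model turn:</p>
<pre><code class="language-python">import json
import re


def parse_action(text):
    # Same extraction logic as in the agent loop above.
    action_match = re.search(r"Action:\s*(\w+)", text)
    input_match = re.search(r"Action Input:\s*(\{.*?\}|\[.*?\])", text, re.DOTALL)
    if not action_match or not input_match:
        return None, None
    try:
        args = json.loads(input_match.group(1).strip())
    except json.JSONDecodeError:
        return action_match.group(1).strip(), None
    return action_match.group(1).strip(), args


sample = """Thought: I need both stocks' performance to compare them.
Action: compare_stocks
Action Input: {"tickers": ["GOOGL", "MSFT"]}"""

print(parse_action(sample))
# ('compare_stocks', {'tickers': ['GOOGL', 'MSFT']})
</code></pre>
<p>If the model drifts from the format, both matches fail and the loop nudges it back with a reminder message instead of crashing.</p>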
<h2>Part 3: Building a ReAct Agent with LangChain</h2>
<p>As you keep building the ReAct loop from scratch, you might spend most of your time debugging it rather than working on the actual logic. So in production, we use frameworks such as LangChain or CrewAI to run the ReAct agent. These frameworks let you spend your time implementing logic rather than building infrastructure for things like token management, tracing, and error recovery.</p>
<p>Here is the same example using LangChain to build the ReAct agent rather than building it from scratch. The code below uses <code>create_agent</code> from LangChain, the successor to LangGraph's <code>create_react_agent</code>, which is the current recommended approach.</p>
<pre><code class="language-python">import os

from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

load_dotenv()


@tool
def get_stock_price(ticker: str) -&gt; dict:
    """Get the current price and daily change for a stock ticker"""
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2, "volume": "52M"},
        "GOOGL": {"price": 141.80, "change_pct": -0.8, "volume": "21M"},
        "MSFT": {"price": 378.50, "change_pct": 0.5, "volume": "18M"},
    }
    ticker = ticker.upper()
    if ticker not in prices:
        return {
            "error": f"Ticker '{ticker}' not found. Supported: {list(prices.keys())}"
        }
    return {"ticker": ticker, **prices[ticker]}


@tool
def get_company_news(ticker: str) -&gt; dict:
    """Get the latest news headlines for a company by stock ticker."""
    news = {
        "AAPL": [
            "Vision Pro sales exceeded Q1 targets",
            "iPhone 16 supply chain ramp-up confirmed",
        ],
        "MSFT": [
            "Copilot reaches 1M enterprise users",
            "Azure OpenAI expands to 12 regions",
        ],
        "GOOGL": [
            "Gemini Ultra integrated into Workspace",
            "Antitrust ruling puts ads under scrutiny",
        ],
    }
    ticker = ticker.upper()
    if ticker not in news:
        return {"error": f"No news for '{ticker}'"}
    return {"ticker": ticker, "headlines": news[ticker]}


@tool
def compare_stocks(tickers: list[str]) -&gt; dict:
    """Compare daily performance across multiple stock tickers."""
    prices = {
        "AAPL": {"price": 189.25, "change_pct": 1.2},
        "GOOGL": {"price": 141.80, "change_pct": -0.8},
        "MSFT": {"price": 378.50, "change_pct": 0.5},
    }
    results = {t.upper(): prices[t.upper()] for t in tickers if t.upper() in prices}
    if not results:
        return {"error": "No valid tickers provided"}
    best = max(results, key=lambda t: results[t]["change_pct"])
    return {"comparison": results, "best_performer_today": best}


llm = ChatOpenAI(
    model="openai/gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENROUTER_KEY"),
    openai_api_base="https://openrouter.ai/api/v1",
)

tools = [get_stock_price, get_company_news, compare_stocks]
agent = create_agent(llm, tools)


def run_agent_streamed(question: str) -&gt; None:
    print(f"\nQuestion: {question}\n{'='*60}")
    for chunk in agent.stream({"messages": [HumanMessage(content=question)]}):
        if "model" in chunk:
            for msg in chunk["model"]["messages"]:
                if msg.tool_calls:
                    for tc in msg.tool_calls:
                        print(f"  → Calling: {tc['name']}({tc['args']})")
                elif msg.content:
                    print(f"\nFinal Answer:\n{msg.content}")
        elif "tools" in chunk:
            for msg in chunk["tools"]["messages"]:
                print(f"  ← Result  : {msg.content[:120]}...")


if __name__ == "__main__":
    run_agent_streamed("Which is performing better today Google or Microsoft?")
    run_agent_streamed("What's the latest on Apple and is the stock up or down?")
</code></pre>
<p>As you can see from the code above, tool schemas are auto-generated from type hints and docstrings. The model gets a precise description of every tool, which reduces wrong-tool calls in production. Using <code>agent.stream()</code> surfaces each step of the analysis in real time, which is useful for showing progress in a UI. Frameworks like these are great in production, but they also add abstraction layers that can obscure costs, and when something breaks, you're debugging through those layers. So use them wisely and understand the trade-offs before jumping into code.</p>
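<p>The schema-from-type-hints idea itself is easy to see in plain Python. The sketch below is a simplified illustration of what a decorator like <code>@tool</code> derives, not LangChain's actual implementation:</p>
<pre><code class="language-python">import inspect


def get_stock_price(ticker: str):
    """Get the current price and daily change for a stock ticker"""
    ...


def tool_schema(fn):
    """Build a minimal tool description from a function's
    signature and docstring."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {
            name: param.annotation.__name__
            for name, param in sig.parameters.items()
        },
    }


print(tool_schema(get_stock_price))
# {'name': 'get_stock_price', 'description': 'Get the current price and
#  daily change for a stock ticker', 'parameters': {'ticker': 'str'}}
</code></pre>
<p>LangChain builds a richer JSON Schema than this, but the raw material is the same: the function name, its docstring, and its annotated parameters.</p>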
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Validation & Testing LLM Outputs]]></title><description><![CDATA[The moment we interact with LLMs, we get probabilistic output. They'll return "price": "$45.99" one time and "price": 45.99 the next. Sometimes, they even forget required fields. This might not look l]]></description><link>https://blog.chapainaashish.com.np/validation-testing-llm-outputs</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/validation-testing-llm-outputs</guid><category><![CDATA[AI]]></category><category><![CDATA[ai app development]]></category><category><![CDATA[llm]]></category><category><![CDATA[LLMTesting]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Fri, 03 Apr 2026 09:58:03 GMT</pubDate><content:encoded><![CDATA[<p>The moment we interact with LLMs, we get probabilistic output. They'll return <code>"price": "$45.99"</code> one time and <code>"price": 45.99</code> the next. Sometimes, they even forget required fields. This might not look like a big deal in a general chatbot, but in production, these inconsistencies crash pipelines and corrupt data. That's why we need validation to make sure LLM outputs are uniform and clean.</p>
<h2>1. Establishing Boundaries with Enums</h2>
<p>First of all, the simplest way to make LLM outputs strict is with enums. Enums are used when we want the LLM to pick from specific options:</p>
<pre><code class="language-python">from enum import Enum

class ExperienceLevel(str, Enum):
    ENTRY = "entry"
    MID = "mid"
    SENIOR = "senior"
</code></pre>
<p>Now the LLM returns only these exact values. Nothing like "Senior level" or "Sr."</p>
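<p>To see the guardrail in action, here is a quick standalone check. The <code>Candidate</code> model is a hypothetical wrapper added just for this demo, and <code>ExperienceLevel</code> is redefined locally so the snippet runs on its own:</p>
<pre><code class="language-python">from enum import Enum

from pydantic import BaseModel, ValidationError


class ExperienceLevel(str, Enum):
    ENTRY = "entry"
    MID = "mid"
    SENIOR = "senior"


class Candidate(BaseModel):
    level: ExperienceLevel


print(Candidate(level="senior").level.value)  # senior

try:
    Candidate(level="Sr.")  # not one of the allowed values
except ValidationError:
    print("rejected: 'Sr.' is not a valid ExperienceLevel")
</code></pre>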
<h2>2. Managing Complexity with Nested Pydantic Models</h2>
<p>But real data is more complex than single values. So, when we work with nested structures, we need nested models and validation functions. For example, a salary should always be between a minimum and maximum point:</p>
<pre><code class="language-python">from pydantic import BaseModel, Field, field_validator, model_validator
from typing import Optional

class SalaryRange(BaseModel):
    min_amount: Optional[float] = Field(default=None, ge=0)
    max_amount: Optional[float] = Field(default=None, ge=0)
    currency: str = "USD"

    @model_validator(mode="after")
    def validate_range(self) -&gt; "SalaryRange":
        if self.min_amount is not None and self.max_amount is not None:
            if self.min_amount &gt; self.max_amount:
                raise ValueError(
                    f"min_amount ({self.min_amount}) cannot exceed max_amount ({self.max_amount})"
                )
        return self


class JobPosting(BaseModel):
    job_title: str
    company_name: str
    experience_level: Optional[ExperienceLevel] = None
    salary: Optional[SalaryRange] = None
</code></pre>
<p>Here we are using the <code>SalaryRange</code> model inside the <code>JobPosting</code> model, and <code>SalaryRange</code> has its own rules. If those rules fail, the validation also fails, and we can queue this for retry.</p>
<h2>3. High-Fidelity Classification and Evidence Tracking</h2>
<p>We can also follow the same pattern in classification tasks. Here, we combine enums with confidence scores and evidence to get multi-label classification:</p>
<pre><code class="language-python">class CategoryLabel(str, Enum):
    ENGINEERING = "engineering"
    DATA = "data"
    PRODUCT = "product"
    DESIGN = "design"

class JobCategory(BaseModel):
    label: CategoryLabel
    confidence: float = Field(ge=0.0, le=1.0)
    evidence: str = Field(description="Why this category applies")
</code></pre>
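<p>A quick standalone check, with the models redefined locally so the snippet runs on its own, shows the confidence bounds doing their job:</p>
<pre><code class="language-python">from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class CategoryLabel(str, Enum):
    ENGINEERING = "engineering"
    DATA = "data"
    PRODUCT = "product"
    DESIGN = "design"


class JobCategory(BaseModel):
    label: CategoryLabel
    confidence: float = Field(ge=0.0, le=1.0)
    evidence: str = Field(description="Why this category applies")


ok = JobCategory(label="data", confidence=0.85, evidence="mentions SQL and dbt")
print(ok.label.value, ok.confidence)  # data 0.85

try:
    JobCategory(label="data", confidence=1.4, evidence="too sure")
except ValidationError:
    print("rejected: confidence must stay between 0.0 and 1.0")
</code></pre>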
<h2>4. Advanced Validation: Deduplication and Honesty Checks</h2>
<p>Things don't stop here. Real-world data is messy, so we should support partial extraction by making fields optional. On top of that, LLMs tend to report high confidence even when the evidence is thin, so we should also add validators that catch an overconfident model:</p>
<pre><code class="language-python">class Skill(BaseModel):
    name: str
    is_required: bool = Field(description="True if required, False if nice-to-have")


class JobPostingAnalysis(BaseModel):
    job_title: Optional[str] = None
    company_name: Optional[str] = None
    experience_level: Optional[ExperienceLevel] = None
    salary: Optional[SalaryRange] = None
    
    skills: list[Skill] = Field(default_factory=list)
    categories: list[JobCategory] = Field(default_factory=list)
    
    overall_confidence: float = Field(ge=0.0, le=1.0)
    missing_fields: list[str] = Field(default_factory=list)
    
    @field_validator('skills')
    @classmethod
    def deduplicate_skills(cls, v: list[Skill]) -&gt; list[Skill]:
        # Keep the first occurrence of each skill name, case-insensitively.
        # set.add returns None, so the `or` records the name as a side effect.
        seen = set()
        return [s for s in v if not (s.name.lower() in seen or seen.add(s.name.lower()))]
    
    @model_validator(mode='after')
    def check_confidence_honesty(self) -&gt; 'JobPostingAnalysis':
        key_fields = [self.job_title, self.experience_level]
        filled = sum(1 for f in key_fields if f is not None)
        completeness = filled / len(key_fields)
        
        if self.overall_confidence &gt; 0.8 and completeness &lt; 0.5:
            raise ValueError(
                f"Confidence {self.overall_confidence} too high when only "
                f"{filled}/{len(key_fields)} key fields extracted"
            )
        return self
    
    @property
    def required_skills(self) -&gt; list[str]:
        return [s.name for s in self.skills if s.is_required]
</code></pre>
<p>The <code>Skill</code> list is partial data here: the LLM fills it only when skills actually appear in the posting. We also added a <code>deduplicate_skills</code> validator so the final list contains each skill only once. Besides this, the <code>check_confidence_honesty</code> validator rejects outputs where the LLM claims 95% confidence but extracted only 1 of 2 key fields.</p>
<h2>5. Unit Testing: Validating Logic Without API Costs</h2>
<p>Now that we have our models, we need to test them. Let's first do unit testing on our models directly, which doesn't need an API key:</p>
<pre><code class="language-python">import pytest

def test_salary_validation_triggers_retry():
    with pytest.raises(ValueError, match="cannot exceed"):
        SalaryRange(min_amount=200000, max_amount=100000)

def test_confidence_honesty_check():
    with pytest.raises(ValueError, match="too high"):
        JobPostingAnalysis(overall_confidence=0.95)

def test_skill_deduplication():
    analysis = JobPostingAnalysis(
        overall_confidence=0.5,
        skills=[
            Skill(name="Python", is_required=True),
            Skill(name="python", is_required=False),
        ]
    )
    assert len(analysis.skills) == 1
</code></pre>
<h2>6. Integration Testing: Verifying Real-World Extraction</h2>
<p>After we are confident with our models, it's time to do integration testing with the LLM to test actual extraction:</p>
<pre><code class="language-python">@pytest.mark.integration
class TestAnalyzerIntegration:
    @pytest.fixture
    def analyzer(self):
        return JobAnalyzer()
    
    def test_complete_extraction(self, analyzer):
        posting = """
        Senior Software Engineer at TechCorp
        Salary: $150,000 - $200,000
        Requirements: Python, Kubernetes, AWS
        """
        result = analyzer.analyze(posting)
        
        assert result["success"]
        assert result["data"].experience_level == ExperienceLevel.SENIOR
        assert result["data"].overall_confidence &gt;= 0.7
    
    def test_partial_extraction(self, analyzer):
        result = analyzer.analyze("Python developer needed.")
        assert result["success"]
        assert len(result["data"].missing_fields) &gt; 0
</code></pre>
<p>Alright, now we can be confident in our models and LLM output. Run the unit tests on every CI build, but run the integration tests only before deployment, since they consume LLM tokens.</p>
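<p>One simple way to enforce that split is the pytest marker used above, registered once in your config. The file name and wording here are just a suggestion:</p>
<pre><code class="language-plaintext"># pytest.ini
[pytest]
markers =
    integration: tests that call the real LLM API and cost tokens

# CI pipeline:      pytest -m "not integration"
# Before deploying: pytest -m integration
</code></pre>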
<h2>Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><em>Questions or feedback? Open an issue on</em> <a href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Observability in AI application]]></title><description><![CDATA[Observability is an essential topic in modern software development. It is particularly useful when you move the application to the production. Observability in AI application is more extended than traditional application because you need to observe t...]]></description><link>https://blog.chapainaashish.com.np/observability-in-ai-application</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/observability-in-ai-application</guid><category><![CDATA[AI]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[#ai-tools]]></category><category><![CDATA[backend developments]]></category><category><![CDATA[observability]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Wed, 26 Nov 2025 08:44:06 GMT</pubDate><content:encoded><![CDATA[<p>Observability is an essential topic in modern software development. It becomes particularly important once you move an application to production. Observability in an AI application goes further than in a traditional one, because besides normal application operations you also need to observe model calls, tokens, errors, and latency.</p>
<p>In this post, we will deep-dive into a production-ready observability stack for AI applications. We will mainly focus on LangSmith for detailed tracing and OpenLLMetry as a vendor-neutral alternative. Besides this, we will also see how to use a modern library like LiteLLM for multi-provider routing and budget controls.</p>
<p>Now, let’s look at the problems and how these tools solve them.</p>
<h2 id="heading-problem-1-you-cant-see-whats-happening">Problem 1: You Can't See What's Happening</h2>
<p>When your LLM API fails, you need answers like:</p>
<ul>
<li><p>Which model was being called?</p>
</li>
<li><p>How long did the request take?</p>
</li>
<li><p>What was the specific error message?</p>
</li>
</ul>
<p>Without observability, you basically need to guess the problem. So, you need to add observability tools like LangSmith or OpenLLMetry.</p>
<h2 id="heading-solution-1-observability-with-langsmith">Solution 1: Observability with LangSmith</h2>
<p>LangSmith is a managed SaaS platform from LangChain that gives you complete observability. It's very simple to get started, as shown in the code below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> APIConnectionError, APIError, AsyncOpenAI, RateLimitError, Timeout

load_dotenv()

<span class="hljs-comment"># Initialize OpenRouter client</span>
client = AsyncOpenAI(
    base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
    api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
    max_retries=<span class="hljs-number">3</span>,  <span class="hljs-comment"># Automatic exponential backoff</span>
    timeout=<span class="hljs-number">30.0</span>,
)


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_llm</span>(<span class="hljs-params">
    messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span>, temperature: float = <span class="hljs-number">0.7</span>
</span>):</span>
    <span class="hljs-string">"""
    Call LLM with messages and automatically log to LangSmith.
    """</span>
    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        <span class="hljs-keyword">return</span> response
    <span class="hljs-keyword">except</span> (APIConnectionError, APIError, RateLimitError) <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"LLM API error: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located?"</span>}]

    response = <span class="hljs-keyword">await</span> call_llm(messages)

    <span class="hljs-keyword">if</span> response:
        content = response.choices[<span class="hljs-number">0</span>].message.content
        print(<span class="hljs-string">"Response:"</span>, content)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Failed to get response from LLM."</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    asyncio.run(main())
</code></pre>
<p>After you run this, head over to <a target="_blank" href="https://smith.langchain.com">smith.langchain.com</a> and you'll see:</p>
<ul>
<li><p>The complete trace with full request and response</p>
</li>
<li><p>Token counts broken down by input and output</p>
</li>
<li><p>Latency measured in milliseconds</p>
</li>
<li><p>Cost calculated automatically for you</p>
</li>
</ul>
<p>That <code>@traceable</code> decorator does all of this under the hood. The tradeoff of relying on LangSmith for a big project in the long run is vendor lock-in to the platform.</p>
<h2 id="heading-solution-2-vendor-neutral-observability-with-openllmetry">Solution 2: Vendor-Neutral Observability with OpenLLMetry</h2>
<p>If you already have observability infrastructure (Datadog, Grafana, Honeycomb, etc.) or need complete control over your data, OpenLLMetry is your answer. It's an open-source SDK built on OpenTelemetry that sends traces to any backend you want, so unlike LangSmith, you are not tied to a single vendor.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI
<span class="hljs-keyword">from</span> traceloop.sdk <span class="hljs-keyword">import</span> Traceloop
<span class="hljs-keyword">from</span> traceloop.sdk.decorators <span class="hljs-keyword">import</span> workflow

load_dotenv()

Traceloop.init(app_name=<span class="hljs-string">"production_api"</span>, api_key=os.getenv(<span class="hljs-string">"TRACELOOP_APIKEY"</span>))

client = AsyncOpenAI(
    api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
    base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
)


<span class="hljs-meta">@workflow(name="llm_completion")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_llm</span>(<span class="hljs-params">messages: list[dict]</span>):</span>
    response = <span class="hljs-keyword">await</span> client.chat.completions.create(
        model=<span class="hljs-string">"gpt-4o-mini"</span>, messages=messages
    )
    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Explain Nepal in 50 words"</span>}]

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_llm(messages)
        print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<p>For now, I am sending the data to Traceloop, but you can configure it to export to any observability backend you already use.</p>
<h2 id="heading-problem-2-you-need-cost-visibility-and-control">Problem 2: You Need Cost Visibility and Control</h2>
<p>Alright, most developers stop at the observability part, but in production you need more than that. Observability tells you what is happening inside your system, like latency and errors, but it doesn't stop things from going wrong. You need a way to act on that data, and one tool that complements observability here is LiteLLM.</p>
<p>Observability reveals cost spikes and provider failures; LiteLLM prevents them and routes around them. Together they give you both visibility and control, which is what a real production-ready LLM stack needs.</p>
<h2 id="heading-solution-1-litellm-for-cost-control">Solution 1: LiteLLM for Cost Control</h2>
<p>LiteLLM lets you set budgets and rate limits in production. This is very useful when you have a small budget and multiple operations. Here is sample code for budget control using LiteLLM:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> litellm <span class="hljs-keyword">import</span> Router

load_dotenv()

model_list = [
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
            <span class="hljs-string">"base_url"</span>: <span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[],
)

customer_budgets = {
    <span class="hljs-string">"customer_123"</span>: {<span class="hljs-string">"limit"</span>: <span class="hljs-number">50.00</span>, <span class="hljs-string">"spent"</span>: <span class="hljs-number">0.00</span>},
    <span class="hljs-string">"customer_456"</span>: {<span class="hljs-string">"limit"</span>: <span class="hljs-number">100.00</span>, <span class="hljs-string">"spent"</span>: <span class="hljs-number">0.00</span>},
}


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_with_budget</span>(<span class="hljs-params">
    customer_id: str, messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span>
</span>):</span>
    budget = customer_budgets.get(customer_id)
    <span class="hljs-keyword">if</span> budget <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>:
        <span class="hljs-keyword">raise</span> Exception(<span class="hljs-string">f"Unknown customer: <span class="hljs-subst">{customer_id}</span>"</span>)

    <span class="hljs-keyword">if</span> budget[<span class="hljs-string">"spent"</span>] &gt;= budget[<span class="hljs-string">"limit"</span>]:
        <span class="hljs-keyword">raise</span> Exception(
            <span class="hljs-string">f"Budget exceeded for <span class="hljs-subst">{customer_id}</span>: "</span>
            <span class="hljs-string">f"$<span class="hljs-subst">{budget[<span class="hljs-string">'spent'</span>]:<span class="hljs-number">.2</span>f}</span> / $<span class="hljs-subst">{budget[<span class="hljs-string">'limit'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>
        )

    response = <span class="hljs-keyword">await</span> router.acompletion(
        model=model,
        messages=messages,
    )

    cost = response._hidden_params.get(<span class="hljs-string">"response_cost"</span>, <span class="hljs-number">0</span>)
    customer_budgets[customer_id][<span class="hljs-string">"spent"</span>] += cost

    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Write a short poem about Kathmandu."</span>}]
    customer_id = <span class="hljs-string">"customer_123"</span>

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_with_budget(
            customer_id=customer_id, messages=messages, model=<span class="hljs-string">"gpt-4o-mini"</span>
        )
        print(response)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<p>In production, you should store these budgets in a database instead of a dict, reset them monthly or per billing cycle, and set up relevant alerts.</p>
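<p>A minimal sketch of what DB-backed budgets could look like with stdlib <code>sqlite3</code> (the schema and helper names are my own illustration, not part of LiteLLM):</p>

```python
import sqlite3

# In-memory DB for the sketch; use a real file or server DB in production
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE budgets (customer_id TEXT PRIMARY KEY, limit_usd REAL, spent_usd REAL)"
)
conn.execute("INSERT INTO budgets VALUES ('customer_123', 50.0, 0.0)")


def record_spend(customer_id: str, cost: float) -> None:
    # Single UPDATE so concurrent requests don't race on the spent counter
    conn.execute(
        "UPDATE budgets SET spent_usd = spent_usd + ? WHERE customer_id = ?",
        (cost, customer_id),
    )
    conn.commit()


def over_budget(customer_id: str) -> bool:
    row = conn.execute(
        "SELECT spent_usd >= limit_usd FROM budgets WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return bool(row and row[0])
```

A monthly reset is then just an <code>UPDATE budgets SET spent_usd = 0</code> run on your billing schedule.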
<h2 id="heading-solution-2-high-availability-through-multi-provider-fallbacks">Solution 2: High Availability Through Multi-Provider Fallbacks</h2>
<p>Since we already implemented LiteLLM, we can leverage it further to reduce application downtime. If your application depends on a single provider, it goes down with them. To avoid this, we can add fallback providers without any significant code changes.</p>
<p>In the example below, when OpenAI goes down, LiteLLM automatically routes to Anthropic:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langsmith <span class="hljs-keyword">import</span> traceable
<span class="hljs-keyword">from</span> litellm <span class="hljs-keyword">import</span> Router

load_dotenv()

model_list = [
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-4o-mini"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>),
            <span class="hljs-string">"base_url"</span>: <span class="hljs-string">"https://openrouter.ai/api/v1"</span>,
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
    {
        <span class="hljs-string">"model_name"</span>: <span class="hljs-string">"claude-haiku-4.5"</span>,
        <span class="hljs-string">"litellm_params"</span>: {
            <span class="hljs-string">"model"</span>: <span class="hljs-string">"claude-haiku-4-5"</span>,
            <span class="hljs-string">"api_key"</span>: os.getenv(<span class="hljs-string">"ANTHROPIC_API_KEY"</span>),
            <span class="hljs-string">"rpm"</span>: <span class="hljs-number">6</span>,
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{<span class="hljs-string">"gpt-4o-mini"</span>: [<span class="hljs-string">"claude-haiku-4.5"</span>]}],
    retry_after=<span class="hljs-number">10</span>,
    allowed_fails=<span class="hljs-number">3</span>,
)


<span class="hljs-meta">@traceable(run_type="llm")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_model</span>(<span class="hljs-params">messages: list[dict], model: str = <span class="hljs-string">"gpt-4o-mini"</span></span>):</span>
    response = <span class="hljs-keyword">await</span> router.acompletion(
        model=model,
        messages=messages,
    )
    <span class="hljs-keyword">return</span> response


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    messages = [{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Write a short poem about Kathmandu."</span>}]

    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> call_model(messages=messages, model=<span class="hljs-string">"gpt-4o-mini"</span>)
        print(response)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Error:"</span>, e)


asyncio.run(main())
</code></pre>
<h2 id="heading-resources">Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><strong>Documentation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://docs.smith.langchain.com">https://docs.smith.langchain.com</a></p>
</li>
<li><p><a target="_blank" href="https://smith.langchain.com/pricing">https://smith.langchain.com/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://www.traceloop.com/docs/openllmetry">https://www.traceloop.com/docs/openllmetry</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/traceloop/openllmetry">https://github.com/traceloop/openllmetry</a></p>
</li>
<li><p><a target="_blank" href="https://www.traceloop.com/docs/openllmetry/integrations">https://www.traceloop.com/docs/openllmetry/integrations</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/BerriAI/litellm">https://github.com/BerriAI/litellm</a></p>
</li>
<li><p><a target="_blank" href="https://docs.litellm.ai">https://docs.litellm.ai</a></p>
</li>
<li><p><a target="_blank" href="https://openai.com/api/pricing">https://openai.com/api/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://anthropic.com/pricing">https://anthropic.com/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://opentelemetry.io/docs/concepts/observability-primer/">https://opentelemetry.io/docs/concepts/observability-primer/</a></p>
</li>
</ul>
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a target="_blank" href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Think Before You Call LLM API]]></title><description><![CDATA[Overview
The first way to interact with any LLM is through the API in development. This is pretty basic which you will master easily but you can miss some critical aspect which I will cover in this post.
Part 1: Understand Tokens Before Getting Excit...]]></description><link>https://blog.chapainaashish.com.np/think-before-you-call-llm-api</link><guid isPermaLink="true">https://blog.chapainaashish.com.np/think-before-you-call-llm-api</guid><category><![CDATA[llm]]></category><category><![CDATA[AI Engineering]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Aashish Chapain]]></dc:creator><pubDate>Fri, 07 Nov 2025 06:29:41 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-overview">Overview</h2>
<p>The first way you will interact with any LLM during development is through its API. This is pretty basic and you will master it easily, but there are some critical aspects you can miss, which I will cover in this post.</p>
<h2 id="heading-part-1-understand-tokens-before-getting-excited">Part 1: Understand Tokens Before Getting Excited</h2>
<p>Before calling an LLM, it's important to understand tokens. When you send "Hello, world!" to GPT, the model doesn't process it as two words. It breaks the text down into smaller units called tokens. Tokens directly impact everything in LLM development.</p>
<p>OpenAI uses a tokenization algorithm called <strong>Byte Pair Encoding (BPE)</strong>, implemented in their <code>tiktoken</code> library. The BPE algorithm doesn't split text into tokens using spaces or punctuation; it's more complex than that. You can read more about it <a target="_blank" href="https://www.kaggle.com/code/qmarva/1-bpe-tokenization-algorithm-eng">here</a>.</p>
<p><strong>Some approximations:</strong></p>
<ul>
<li><p>1 token ≈ 4 characters in English</p>
</li>
<li><p>Non-English text uses significantly more tokens (sometimes 2-3x more)</p>
</li>
</ul>
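<p>These rules of thumb are easy to wrap in a quick estimator for back-of-the-envelope budgeting. The function below is just the ~4 characters/token heuristic (my own sketch); use <code>tiktoken</code> whenever you need exact counts:</p>

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate via the ~4 characters/token rule of thumb.

    Good enough for budget projections; real tokenizers will differ,
    especially for non-English text.
    """
    return max(1, -(-len(text) // 4))  # ceiling division, at least 1 token


print(estimate_tokens("Hello, world!"))  # 13 chars -> estimate of 4 tokens
```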
<h3 id="heading-tokens-are-currency">Tokens are Currency</h3>
<p><strong>1. Pricing is Per-Token</strong></p>
<p>Every API call costs money based on token count. So, if you miscalculate tokens by 50%, you're miscalculating your budget by 50%. At scale, this means thousands of dollars in unexpected costs.</p>
<p><strong>2. Context Windows Are Token-Limited</strong></p>
<p>When you see "GPT-4 has an 8K context window," that means 8,000 tokens total for your input AND the response. If you run out of tokens mid-conversation, the API fails. Your application breaks and god knows what will happen next.</p>
<p><strong>3. Rate Limits Are Token-Based</strong></p>
<p>OpenAI limits you by tokens per minute (TPM), not just requests per minute. A single large request with 10K tokens counts the same as 10 small requests with 1K tokens each.</p>
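<p>To make the TPM idea concrete, here is a minimal per-minute token budget tracker. This is my own illustration of the concept, not how OpenAI implements their limiter:</p>

```python
import time


class TokenRateLimiter:
    """Track token spend against a tokens-per-minute (TPM) budget."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.window_start = time.monotonic()
        self.tokens_used = 0

    def try_spend(self, tokens: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:  # a new minute window begins
            self.window_start = now
            self.tokens_used = 0
        if self.tokens_used + tokens > self.tpm_limit:
            return False  # this request would blow the TPM budget
        self.tokens_used += tokens
        return True
```

With a 10K TPM budget, one 10K-token request exhausts the window exactly as ten 1K-token requests would.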
<p>Okay, enough theory. Let's write code to see how tokenization works:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tiktoken


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">count_tokens</span>(<span class="hljs-params">text: str, model: str = <span class="hljs-string">"gpt-4"</span></span>) -&gt; int:</span>
    <span class="hljs-string">"""Count tokens in text for a specific model"""</span>
    encoding = tiktoken.encoding_for_model(model)
    <span class="hljs-keyword">return</span> len(encoding.encode(text))


<span class="hljs-comment"># Test different inputs</span>
examples = [<span class="hljs-string">"Hello"</span>, <span class="hljs-string">"Hello, world!"</span>, <span class="hljs-string">"Artificial Intelligence"</span>, <span class="hljs-string">"AI"</span>]

<span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> examples:
    tokens = count_tokens(text)
    print(<span class="hljs-string">f"'<span class="hljs-subst">{text}</span>' = <span class="hljs-subst">{tokens}</span> tokens"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">'Hello' = 1 tokens
'Hello, world!' = 4 tokens
'Artificial Intelligence' = 3 tokens
'AI' = 1 tokens
</code></pre>
<p>Look: "Hello, world!" is 4 tokens, not 2. The comma and the exclamation mark become tokens of their own, while the space attaches to the word after it. This is why you can't estimate tokens by counting words.</p>
<hr />
<h2 id="heading-part-2-calling-the-llm-api-right-way">Part 2: Calling the LLM API Right Way</h2>
<h3 id="heading-hide-secrets-in-env">Hide Secrets in .env</h3>
<p>Before we call the OpenAI API, we need to set up our API key properly. Never hardcode API keys in your code. It sounds pretty basic, but most people miss it.</p>
<p>Create a <code>.env</code> file in your project root:</p>
<pre><code class="lang-bash">OPENAI_API_KEY=sk-your-actual-api-key-here
</code></pre>
<p>Now add this file to <code>.gitignore</code> so it never gets committed:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">".env"</span> &gt;&gt; .gitignore
</code></pre>
<p>Look, it is that simple, ALWAYS DO IT….</p>
<h3 id="heading-get-advantage-of-async-call">Get Advantage of Async Call</h3>
<p>OpenAI can be called asynchronously, so you can send multiple requests in parallel and take advantage of async calls.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI

<span class="hljs-comment"># Load env vars</span>
load_dotenv()


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_llm_call</span>():</span>
    <span class="hljs-string">"""Simple API call"""</span>

    <span class="hljs-comment"># Using openrouter here as they provide free models (upto certain limits)</span>
    client = AsyncOpenAI(
        base_url=<span class="hljs-string">"https://openrouter.ai/api/v1"</span>, api_key=os.getenv(<span class="hljs-string">"OPENROUTER_KEY"</span>)
    )

    response = <span class="hljs-keyword">await</span> client.chat.completions.create(
        model=<span class="hljs-string">"gpt-4o-mini"</span>,
        messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located ?"</span>}],
        temperature=<span class="hljs-number">0.7</span>,
        timeout=<span class="hljs-number">30.0</span>,
    )

    print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
    print(<span class="hljs-string">"\nToken Usage:"</span>)
    print(<span class="hljs-string">f"  Prompt: <span class="hljs-subst">{response.usage.prompt_tokens}</span>"</span>)
    print(<span class="hljs-string">f"  Completion: <span class="hljs-subst">{response.usage.completion_tokens}</span>"</span>)
    print(<span class="hljs-string">f"  Total: <span class="hljs-subst">{response.usage.total_tokens}</span>"</span>)


<span class="hljs-comment"># Run the async function</span>
asyncio.run(simple_llm_call())
</code></pre>
<p>When you run this, you'll get a response from GPT along with token usage statistics.</p>
<pre><code class="lang-plaintext">Response: Nepal is a landlocked country located in South Asia, situated mainly in the Himalayas. It is bordered by China to the north and India to the south, east, and west. Nepal is known for its diverse geography, which includes plains, hills, and the towering peaks of the Himalayas, including Mount Everest, the highest point on Earth. The capital city of Nepal is Kathmandu.

Token Usage:
  Prompt: 12
  Completion: 79
  Total: 91
</code></pre>
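<p>The real payoff of async shows up when you fan out several requests at once with <code>asyncio.gather</code>. Here is a stdlib-only sketch that simulates the latency win; the <code>fake_llm_call</code> coroutine is a stand-in for an <code>AsyncOpenAI</code> request, not a real API call:</p>

```python
import asyncio
import time


async def fake_llm_call(i: int) -> str:
    # Stand-in for an AsyncOpenAI request: the sleep simulates network latency
    await asyncio.sleep(0.2)
    return f"response {i}"


async def run_parallel(n: int):
    start = time.monotonic()
    # gather schedules all coroutines concurrently instead of one after another
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return results, time.monotonic() - start


results, elapsed = asyncio.run(run_parallel(5))
print(f"{len(results)} responses in {elapsed:.2f}s")  # ~0.2s total, not 1.0s
```

Sequential awaits would take roughly n × latency; gathered calls take roughly one latency, which is why batch workloads should always fan out.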
<h3 id="heading-understanding-the-response">Understanding the Response</h3>
<p>If you inspect the API response in detail, it returns a structured object with the following key fields:</p>
<ul>
<li><p><code>choices[0].message.content</code>: The actual text response from the model</p>
</li>
<li><p><code>usage.prompt_tokens</code>: How many tokens your input used</p>
</li>
<li><p><code>usage.completion_tokens</code>: How many tokens the response used</p>
</li>
<li><p><code>usage.total_tokens</code>: Sum of both (this is what you pay for)</p>
</li>
<li><p><code>finish_reason</code>:</p>
<ul>
<li><p><code>"stop"</code>: Model completed the response naturally</p>
</li>
<li><p><code>"length"</code>: Hit the token limit (response was cut off)</p>
</li>
<li><p><code>"content_filter"</code>: Response was blocked by safety filters</p>
</li>
</ul>
</li>
</ul>
<p>That <code>finish_reason</code> field matters in production. If you see <code>"length"</code>, the response was truncated and your application needs to handle that.</p>
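<p>One way to handle it is a small guard that checks <code>finish_reason</code> before trusting the content. The helper below is hypothetical, and the stub objects only mimic the response shape:</p>

```python
from types import SimpleNamespace


def read_completion(choice) -> str:
    """Return the message content only if the completion finished cleanly."""
    reason = choice.finish_reason
    if reason == "length":
        # Truncated output: surface it so the caller can retry with more max_tokens
        raise RuntimeError("Response truncated; retry with a higher max_tokens")
    if reason == "content_filter":
        raise RuntimeError("Response blocked by safety filters")
    return choice.message.content  # "stop": the model completed naturally


# Quick check with a stubbed choice object shaped like the real response
ok = SimpleNamespace(finish_reason="stop", message=SimpleNamespace(content="Nepal is..."))
print(read_completion(ok))
```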
<hr />
<h2 id="heading-part-3-calculating-the-cost">Part 3: Calculating The Cost</h2>
<p>Before deploying your app into production, understand token-based pricing first because LLM API calls can get expensive as you scale.</p>
<h3 id="heading-current-pricing">Current Pricing</h3>
<p>As of November 2025, OpenAI charges:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Model</th><th>Input (per 1M tokens)</th><th>Output (per 1M tokens)</th></tr>
</thead>
<tbody>
<tr>
<td>gpt-5-nano</td><td>$0.05</td><td>$0.40</td></tr>
<tr>
<td>gpt-5</td><td>$1.25</td><td>$10.00</td></tr>
</tbody>
</table>
</div><p><strong>What you can understand from this table:</strong></p>
<ul>
<li><p>Output tokens cost more than input tokens (8x more for both models)</p>
</li>
<li><p>GPT-5 is 25x more expensive than GPT-5 nano for both input and output</p>
</li>
</ul>
<h3 id="heading-real-world-cost-projection">Real-World Cost Projection</h3>
<p>Let's calculate what a real production system would cost. Imagine you're building some kind of customer support chatbot with this specific scenario:</p>
<p><strong>Scenario:</strong></p>
<ul>
<li><p>10,000 requests per day</p>
</li>
<li><p>Average 300 input tokens and 200 output tokens per request</p>
</li>
</ul>
<pre><code class="lang-python">daily_requests = <span class="hljs-number">10</span>_000
input_tokens_per_request = <span class="hljs-number">300</span>
output_tokens_per_request = <span class="hljs-number">200</span>

<span class="hljs-comment"># Calculate monthly costs</span>
monthly_requests = daily_requests * <span class="hljs-number">30</span>
monthly_input_tokens = monthly_requests * input_tokens_per_request
monthly_output_tokens = monthly_requests * output_tokens_per_request

<span class="hljs-comment"># GPT-5 nano costs</span>
nano_input_cost = (monthly_input_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">0.05</span>
nano_output_cost = (monthly_output_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">0.40</span>
nano_total = nano_input_cost + nano_output_cost

<span class="hljs-comment"># GPT-5 costs</span>
gpt5_input_cost = (monthly_input_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">1.25</span>
gpt5_output_cost = (monthly_output_tokens / <span class="hljs-number">1</span>_000_000) * <span class="hljs-number">10.00</span>
gpt5_total = gpt5_input_cost + gpt5_output_cost

print(<span class="hljs-string">f"GPT-5 Nano Monthly: $<span class="hljs-subst">{nano_total:,<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"GPT-5 Monthly: $<span class="hljs-subst">{gpt5_total:,<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"Savings with GPT-5 Nano: $<span class="hljs-subst">{gpt5_total - nano_total:,<span class="hljs-number">.2</span>f}</span>/month"</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">GPT-5 Nano Monthly: $28.50
GPT-5 Monthly: $712.50
Savings with GPT-5 Nano: $684.00/month
</code></pre>
<p>Alright, now that you have seen the cost, the model choice depends on your use case. For straightforward tasks like summarization and classification, you can use the GPT-5 nano version; for complex tasks, use the full GPT-5. Furthermore, you can use <a target="_blank" href="https://openrouter.ai/rankings">Openrouter</a> to compare models for your use case.</p>
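<p>The same arithmetic from the script above can be packaged into a per-request helper. The prices are hardcoded from the table; they will drift over time, so treat them as a snapshot:</p>

```python
# Prices per 1M tokens, taken from the table above (a snapshot; check current pricing)
PRICING = {
    "gpt-5-nano": {"input": 0.05, "output": 0.40},
    "gpt-5": {"input": 1.25, "output": 10.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request from its token counts."""
    price = PRICING[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


# One chatbot request: 300 input + 200 output tokens on the nano model
print(f"${estimate_cost('gpt-5-nano', 300, 200):.6f}")
```

Multiplying the per-request figure by your monthly request volume reproduces the projection above.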
<h2 id="heading-part-4-error-handling-amp-retries-are-important">Part 4: Error Handling &amp; Retries are Important</h2>
<p>LLM APIs might fail, so you shouldn't assume they are immune to errors. Treat them like any other third-party API, or give them even more attention if you are building your system around them.</p>
<h3 id="heading-some-common-api-errors">Some Common API Errors</h3>
<p>1. Rate Limit Errors (HTTP 429)</p>
<p>2. Timeout Errors</p>
<p>3. API Errors (HTTP 500-599)</p>
<p>4. Authentication Errors (HTTP 401)</p>
<p>5. Invalid Request Errors (HTTP 400)</p>
<h3 id="heading-implementing-exponential-backoff-and-timeout">Implementing Exponential Backoff and Timeout</h3>
<p>The standard approach to handling retries is <strong>exponential backoff with jitter</strong>. The official OpenAI SDK offers built-in retries and timeouts, so you can leverage that.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> AsyncOpenAI
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> APIError, APIConnectionError, RateLimitError, APITimeoutError <span class="hljs-keyword">as</span> Timeout  <span class="hljs-comment"># v1 SDK names it APITimeoutError</span>

load_dotenv()

client = AsyncOpenAI(
    api_key=os.getenv(<span class="hljs-string">"OPENAI_API_KEY"</span>),
    max_retries=<span class="hljs-number">3</span>,  <span class="hljs-comment"># Automatic exponential backoff with jitter</span>
    timeout=<span class="hljs-number">30.0</span>
)

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_llm_call</span>():</span>
    <span class="hljs-string">"""OpenAI API Call with Safe Error Handling"""</span>
    <span class="hljs-keyword">try</span>:
        response = <span class="hljs-keyword">await</span> client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o-mini"</span>,
            messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Where is Nepal located?"</span>}],
            temperature=<span class="hljs-number">0.7</span>,
        )

        print(<span class="hljs-string">"Response:"</span>, response.choices[<span class="hljs-number">0</span>].message.content)
        print(<span class="hljs-string">"\nToken Usage:"</span>)
        print(<span class="hljs-string">f"  Prompt: <span class="hljs-subst">{response.usage.prompt_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Completion: <span class="hljs-subst">{response.usage.completion_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Total: <span class="hljs-subst">{response.usage.total_tokens}</span>"</span>)
        print(<span class="hljs-string">f"  Finish Reason: <span class="hljs-subst">{response.choices[<span class="hljs-number">0</span>].finish_reason}</span>"</span>)

    <span class="hljs-keyword">except</span> RateLimitError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Rate limit exceeded after retries:"</span>, e)

    <span class="hljs-keyword">except</span> Timeout <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Request timed out after retries:"</span>, e)

    <span class="hljs-keyword">except</span> APIConnectionError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Connection error after retries:"</span>, e)

    <span class="hljs-keyword">except</span> APIError <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"API returned an error even after retries:"</span>, e)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">"Unexpected Error:"</span>, type(e).__name__, str(e))


asyncio.run(simple_llm_call())
</code></pre>
<p>You might wonder how <code>max_retries</code> works under the hood. It uses something called <strong>exponential backoff with jitter</strong>, which retries the API after a growing, slightly randomized delay. That delay is calculated by raising two to the power of the attempt number and adding jitter:</p>
<pre><code class="lang-python">wait_time = (<span class="hljs-number">2</span> ** attempt) + random.uniform(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>)
</code></pre>
<p>The first expression gives results like 1s, 2s, 4s, and the second expression is the jitter, a random value between 0 and 1. You might be wondering: why not just use <code>2 ** attempt</code> and get rid of the jitter?</p>
<p><strong>Without jitter:</strong></p>
<ul>
<li><p>Retry after 1s, 2s, 4s, 8s...</p>
</li>
<li><p>If 100 requests hit rate limits simultaneously, they all retry at the same time</p>
</li>
<li><p>This creates a <strong>thundering herd</strong> which means all requests slam the API at once</p>
</li>
<li><p>The API gets overwhelmed again, causing another round of failures</p>
</li>
</ul>
<p><strong>With jitter:</strong></p>
<ul>
<li><p>Retry after 1.0-2.0s, 2.0-3.0s, 4.0-5.0s...</p>
</li>
<li><p>Requests spread out over time and API load distributes evenly</p>
</li>
</ul>
<p>For more details on this pattern, see <a target="_blank" href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/">AWS's exponential backoff and jitter article</a>.</p>
<hr />
<h2 id="heading-resources">Resources</h2>
<p><strong>Github repository</strong></p>
<ul>
<li>Get the working code on github: <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering">https://github.com/chapainaashish/production-ai-engineering</a></li>
</ul>
<p><strong>Documentation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://platform.openai.com/docs/api-reference">https://platform.openai.com/docs/api-reference</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/tokenizer">https://platform.openai.com/tokenizer</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/openai/tiktoken">https://github.com/openai/tiktoken</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/account/usage">https://platform.openai.com/account/usage</a></p>
</li>
<li><p><a target="_blank" href="https://openai.com/api/pricing">https://openai.com/api/pricing</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/docs/guides/best-practices/production">https://platform.openai.com/docs/guides/best-practices/production</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/docs/guides/safety-best-practices">https://platform.openai.com/docs/guides/safety-best-practices</a></p>
</li>
<li><p><a target="_blank" href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/">https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/</a></p>
</li>
<li><p><a target="_blank" href="https://realpython.com/async-io-python/">https://realpython.com/async-io-python/</a></p>
</li>
</ul>
<hr />
<p><em>Questions or feedback? Open an issue on</em> <a target="_blank" href="https://github.com/chapainaashish/production-ai-engineering"><em>Github</em></a> <em>or reach out on</em> <a target="_blank" href="https://www.linkedin.com/in/chapainaashish/"><em>LinkedIn</em></a></p>
]]></content:encoded></item></channel></rss>