Deploy AI Agents with FastAPI

Every agent so far ran in a terminal. To make one usable by an app, we serve it over HTTP. In this post, we deploy agents with FastAPI: first a simple chat endpoint, then a streaming server that pushes tokens as they are generated, and finally a Streamlit chat client that consumes the stream, shows tool calls, and exports the latest answer to PDF.

Note

This lesson reuses the MCP servers and the scripts/ helpers from earlier projects. The streaming server connects Gmail, Yahoo Finance, and Google Sheets, so have those configured as shown in Build a Daily Briefing AI Agent.

Installation

Install the web stack:

BASH

pip install fastapi uvicorn httpx streamlit markdown2 xhtml2pdf

A Simple FastAPI Chat Server

Start with a non-streaming endpoint. It validates the request with Pydantic, builds an agent per call, and returns the final text. CORS is open so a browser client can reach it during development.

FastAPI validates the request, runs the agent, and returns the answer

PYTHON

import sys
import os

root_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_dir)

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from dotenv import load_dotenv
load_dotenv()

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain.messages import HumanMessage

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=2)
    model: str = 'gemini-2.5-flash'

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/")
async def read_root():
    return {"status": "FastAPI agent server is up!"}

@app.post("/chat")
async def chat(request: ChatRequest):
    if not request.prompt.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        model = ChatGoogleGenerativeAI(model=request.model)
        agent = create_agent(model=model)
        response = agent.invoke({'messages': [HumanMessage(request.prompt)]})
        return {'response': response['messages'][-1].text}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host='0.0.0.0', port=8001)

Run it:

POWERSHELL

python "03 AI Projects/07_deploy_agents_with_fastapi/01_fastapi_server.py"

Tip

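During development we can also use fastapi dev 01_fastapi_server.py for auto-reload. The fastapi command comes with the fastapi[standard] extra, so install that if the command is missing. It serves on port 8000 by default, so pass --port 8001 to match the curl examples below. The __main__ block lets us run the file directly with python.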

Test it with curl:

BASH

# Health check
curl http://localhost:8001/

# Chat request
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"What is the capital of France?\"}"

OUTPUT

{"response": "The capital of France is Paris."}

Note

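The -d payload above uses escaped double quotes, which Windows CMD requires. The trailing backslashes are Bash line continuations, so on Windows put the whole command on one line. In Windows PowerShell, curl is an alias for Invoke-WebRequest, so call curl.exe explicitly there. On Linux/macOS, you can use single quotes instead: -d '{"prompt": "What is the capital of France?"}'.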
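
The same request can also be sent from Python, which avoids shell quoting altogether. This is a minimal standard-library sketch, assuming the server from this section is listening on localhost:8001; the helper names build_chat_request and ask are hypothetical and not part of the original project.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:8001/chat"  # assumes the server above is running locally


def build_chat_request(prompt, model="gemini-2.5-flash", url=CHAT_URL):
    """Build (but do not send) a POST request matching the ChatRequest schema."""
    payload = json.dumps({"prompt": prompt, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the 'response' field of the JSON reply."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, ask("What is the capital of France?") should return the answer text. Building the request separately from sending it keeps the payload easy to inspect without a live server.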

A Streaming Server with Tools and Memory

A real assistant should stream its response and use tools. This server loads MCP tools once at startup with a FastAPI lifespan, then streams agent output as newline-delimited JSON (application/x-ndjson).

Tools load once at startup; tokens stream as JSON lines

Loading Tools at Startup

The lifespan context manager runs get_tools() before the server accepts requests, so tool loading happens once rather than per request. Destructive email and spreadsheet tools are filtered out:

PYTHON

from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import json

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.checkpoint.memory import InMemorySaver
from langchain.messages import HumanMessage, AIMessageChunk

from scripts import base_tools, prompts, utils

checkpointer = InMemorySaver()
tools = None

class ChatRequest(BaseModel):
    query: str = Field(..., min_length=2)
    model: str = "gemini-2.5-flash"
    thread_id: str = "default"

async def get_tools():
    mcp_config = utils.load_mcp_config("gmail", "yahoo-finance", "google-sheets")
    client = MultiServerMCPClient(mcp_config)
    mcp_tools = await client.get_tools()
    tools = mcp_tools + [base_tools.web_search, base_tools.get_weather]

    filter_tools = ["delete_email", "batch_modify_emails", "batch_delete_emails",
                    "delete_label", "delete_filter", "update_cells"]
    safe_tools = [tool for tool in tools if tool.name not in filter_tools]

    print(f"Loaded {len(safe_tools)} Tools")
    return safe_tools

@asynccontextmanager
async def lifespan(app: FastAPI):
    global tools
    tools = await get_tools()
    print("Tools are loaded. Ready to create agent!")
    yield

app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], allow_methods=["*"], allow_headers=["*"],
)
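
To see why the tools load only once, here is a standard-library sketch of the same pattern outside FastAPI. The names fake_load_tools, handle_request, state, and events are hypothetical stand-ins; the sketch only illustrates that code before yield runs once at startup and code after yield runs once at shutdown.

```python
import asyncio
from contextlib import asynccontextmanager

events = []   # records the order in which things happen
state = {}    # hypothetical stand-in for the module-level `tools` variable


async def fake_load_tools():
    """Pretend to do the slow MCP tool discovery."""
    await asyncio.sleep(0)
    events.append("load_tools")
    return ["web_search", "get_weather"]


@asynccontextmanager
async def lifespan(app):
    state["tools"] = await fake_load_tools()  # startup: runs once, before any request
    yield
    events.append("shutdown")                 # shutdown: runs once, after the last request


async def handle_request(n):
    # Every request reuses the preloaded tools instead of reloading them.
    events.append(f"request {n} sees {len(state['tools'])} tools")


async def main():
    async with lifespan(app=None):
        for n in (1, 2):
            await handle_request(n)


asyncio.run(main())
print(events)
```

The code after yield is also a natural place for cleanup, such as closing client connections, if the server needs it.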

The Streaming Generator

stream_response builds the agent with the preloaded tools and a checkpointer, then iterates agent.astream(..., stream_mode='messages'). Each chunk is serialized to JSON, including any tool calls, and yielded as one line:

PYTHON

async def stream_response(query, model_name, thread_id):
    system_prompt = prompts.get_assistant_prompt()

    model = ChatGoogleGenerativeAI(model=model_name)
    agent = create_agent(model=model, tools=tools,
                         system_prompt=system_prompt, checkpointer=checkpointer)

    config = {"configurable": {"thread_id": thread_id}}

    async for chunk, metadata in agent.astream(
            {'messages': [HumanMessage(query)]},
            stream_mode='messages', config=config):

        data = {"type": chunk.__class__.__name__, "content": chunk.text}

        if isinstance(chunk, AIMessageChunk) and chunk.tool_calls:
            data['tool_calls'] = chunk.tool_calls

        yield (json.dumps(data) + "\n").encode()

The thread_id in the config is what gives the deployed agent memory. Repeated requests with the same thread_id continue the same conversation.
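
Conceptually, the checkpointer behaves like a store keyed by thread_id. The toy class below is not InMemorySaver; it is a hypothetical stand-in, with made-up messages, that only illustrates how two thread IDs keep separate histories.

```python
from collections import defaultdict


class ToyThreadMemory:
    """Hypothetical stand-in for a checkpointer: one message list per thread_id."""

    def __init__(self):
        self._threads = defaultdict(list)

    def append(self, thread_id, role, content):
        self._threads[thread_id].append({"role": role, "content": content})

    def history(self, thread_id):
        # Return a copy so callers cannot mutate the stored history.
        return list(self._threads[thread_id])


memory = ToyThreadMemory()
memory.append("user-123", "user", "My name is Sam.")
memory.append("user-123", "assistant", "Nice to meet you, Sam.")
memory.append("user-456", "user", "What is the weather in London?")

print(len(memory.history("user-123")))  # messages stored for user-123
print(len(memory.history("user-456")))  # messages stored for user-456
```

Like this toy store, InMemorySaver lives in the server process, so restarting the server clears every conversation.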

Note

get_assistant_prompt() returns a personal-assistant system prompt with the current date and tool guidance (Gmail, Yahoo Finance, Google Sheets, web, weather). It instructs the agent to always call a tool before answering, to never send email without confirmation, and to confirm before writing to a sheet. Customize the persona and defaults for your own user.

The Streaming Endpoint

Wrap the generator in a StreamingResponse:

PYTHON

@app.get("/")
async def read_root():
    return {"status": "Streaming agent server is up!"}

@app.post("/chat_stream")
async def chat_stream(request: ChatRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        return StreamingResponse(
            stream_response(request.query, request.model, request.thread_id),
            media_type="application/x-ndjson")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host="0.0.0.0", port=8002)
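
One caveat with the endpoint above: StreamingResponse consumes the generator lazily, after the handler has already returned, so the try/except in chat_stream cannot catch errors raised mid-stream. One option, sketched here with only the standard library, is to wrap the generator so a failure is reported as a final NDJSON line. The wrapper name with_error_line, the "error" type, and the flaky_stream generator are assumptions for illustration, not part of the original server.

```python
import asyncio
import json


async def with_error_line(stream):
    """Yield every chunk from `stream`; if it raises, emit one NDJSON error line."""
    try:
        async for chunk in stream:
            yield chunk
    except Exception as exc:
        error = {"type": "error", "content": str(exc)}
        yield (json.dumps(error) + "\n").encode()


async def flaky_stream():
    """Hypothetical generator that fails after the first chunk."""
    yield (json.dumps({"type": "AIMessageChunk", "content": "Hello"}) + "\n").encode()
    raise RuntimeError("model unavailable")


async def collect(stream):
    # Decode each NDJSON line back into a dict for inspection.
    return [json.loads(line) async for line in stream]


lines = asyncio.run(collect(with_error_line(flaky_stream())))
print(lines)
```

In the server this would mean passing with_error_line(stream_response(...)) to StreamingResponse, and the client would need to handle the extra "error" type.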

Run the streaming server and test it with curl. Reuse a thread_id to keep memory across calls:

POWERSHELL

python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"

BASH

curl -X POST http://localhost:8002/chat_stream \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"What is the weather in London?\", \"thread_id\": \"user-123\"}"

OUTPUT

{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}, ...}]}
{"type": "AIMessageChunk", "content": "The weather in London"}
{"type": "AIMessageChunk", "content": " is 14°C and cloudy..."}

Each line is a self-contained JSON object. Tool-call chunks announce which tool the agent is calling, and content chunks stream the answer token by token.
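
A client consumes this format by parsing one JSON object per line. The sketch below does that over a list of sample lines using only the standard library. The function name collect_stream is hypothetical, and the sample lines are modeled on the output above with the elided fields left out; the ToolMessage line is an assumed example of a non-AI chunk.

```python
import json


def collect_stream(lines):
    """Parse NDJSON lines into (answer_text, tool_names)."""
    answer, tool_names = "", []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        for call in chunk.get("tool_calls", []):
            tool_names.append(call["name"])
        # Only AI chunks carry answer text; tool results are ignored here.
        if chunk.get("type", "").startswith("AI"):
            answer += chunk.get("content", "")
    return answer, tool_names


sample = [
    '{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]}',
    '{"type": "ToolMessage", "content": "14C, cloudy"}',
    '{"type": "AIMessageChunk", "content": "The weather in London"}',
    '{"type": "AIMessageChunk", "content": " is 14°C and cloudy..."}',
]

answer, tool_names = collect_stream(sample)
print(tool_names)
print(answer)
```

Because every line parses on its own, the client can update the display as soon as each line arrives instead of waiting for the full response.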

A Streamlit Chat Client

The front end is a Streamlit app that posts to the streaming endpoint, renders tokens live, shows tool calls in a status box, and can export the last answer to PDF.

The Streamlit client streams tokens, shows tool calls, and exports PDF

Tip

New to Streamlit? This companion playlist walks through the basics:

https://www.youtube.com/watch?v=hff2tHUzxJM&list=PLc2rvfiptPSSpZ99EnJbH5LjTJ_nOoSWW

The client keeps chat history in st.session_state, with a sidebar to set the thread_id, clear messages, and download a PDF:

PYTHON

import streamlit as st
import httpx
import json
import os
from datetime import datetime
import markdown2
from xhtml2pdf import pisa

st.title("Personal Assistant")

thread_id = st.sidebar.text_input("Thread ID", value="default")

if "messages" not in st.session_state:
    st.session_state.messages = []

if st.sidebar.button("Clear Messages"):
    st.session_state.messages = []
    st.rerun()

When the user submits a query, the client opens a streaming POST with httpx, iterates the response lines, renders tool calls in a status container, and appends AI content to a live placeholder:

PYTHON

query = st.chat_input("Ask anything...")
if query:
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.markdown(query)

    with st.chat_message("assistant"):
        tool_container = st.container()
        placeholder = st.empty()
        full_response = ""

        with httpx.Client(timeout=None) as client:
            with client.stream(
                "POST",
                "http://localhost:8002/chat_stream",
                json={"query": query, "thread_id": thread_id},
            ) as response:
                for line in response.iter_lines():
                    if not line.strip():
                        continue
                    chunk = json.loads(line)
                    msg_type = chunk.get("type", "")
                    content = chunk.get("content", "")

                    if chunk.get("tool_calls"):
                        for tc in chunk["tool_calls"]:
                            with tool_container:
                                st.status(f"🔧 {tc['name']}", state="complete").write(
                                    f"```json\n{json.dumps(tc['args'], indent=2)}\n```"
                                )

                    if content and "AI" in msg_type:
                        full_response += content
                        placeholder.markdown(full_response.replace("$", "\\$"))

    st.session_state.messages.append({"role": "assistant", "content": full_response})

Note

Dollar signs are escaped (replace("$", "\\$")) so Streamlit's Markdown does not interpret them as LaTeX math. This matters when the agent returns prices like $260.00.
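
The escaping step can be isolated into a small helper to make its effect visible. This is a minimal sketch; the function name escape_dollars and the sample sentence are hypothetical.

```python
def escape_dollars(text):
    """Escape every dollar sign so Markdown renders it literally."""
    return text.replace("$", "\\$")


print(escape_dollars("AAPL closed at $260.00"))
```

The helper is not idempotent: applying it twice would escape the backslash it added. That is one reason to store the raw text in full_response and escape only at render time, as the client code above does.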

Exporting to PDF

The sidebar's Download PDF button converts the last assistant message from Markdown to HTML with markdown2, then renders it with xhtml2pdf to a PDF in the Downloads folder of the machine running Streamlit (your own machine when running locally):

PYTHON

if st.sidebar.button("Download PDF"):
    assistant_msgs = [m for m in st.session_state.messages if m["role"] == "assistant"]
    if assistant_msgs:
        last_msg = assistant_msgs[-1]
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_response.pdf"
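        downloads_dir = os.path.join(os.path.expanduser("~"), "Downloads")
        os.makedirs(downloads_dir, exist_ok=True)  # the folder may not exist on every system
        filepath = os.path.join(downloads_dir, filename)

        # Convert the Markdown answer to HTML, keeping tables and code blocks
        html = markdown2.markdown(
            last_msg["content"],
            extras=["tables", "fenced-code-blocks", "cuddled-lists"],
        )
        styled_html = f"<html><head><meta charset='utf-8'></head><body>{html}</body></html>"

        with open(filepath, "wb") as f:
            status = pisa.CreatePDF(styled_html, dest=f)
        # Check the error count reported by pisa before claiming success
        if status.err:
            st.sidebar.error("PDF generation failed.")
        else:
            st.sidebar.success(f"Saved to {filename}")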

Running Everything Together

Start the streaming server, then launch the Streamlit client in a second terminal:

POWERSHELL

# Terminal 1, backend
python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"

# Terminal 2, front end
streamlit run "03 AI Projects/07_deploy_agents_with_fastapi/03_streamlit_client.py"

Streamlit opens a browser tab. Ask "Summarize my unread emails and check the weather in London." We watch tool-call status boxes appear as the agent works, the answer stream in token by token, and the whole exchange stay in memory under our chosen thread_id.

Tip

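To expose the API publicly, deploy the FastAPI server behind a production ASGI setup (for example uvicorn in a managed container or on a cloud VM) and point the Streamlit client's URL at the deployed host instead of localhost. If you run multiple uvicorn workers, keep in mind that InMemorySaver stores conversations in each process's own memory, so requests for the same thread_id can land on a worker with no history. A shared, persistent checkpointer avoids this.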

That is how we deploy an agent over HTTP with a real chat client: a streaming agent served over the network. Next, we connect an agent to a production cloud database and stream answers over real-world data in Real-World Agent Project: MySQL Data & Streaming.
