Deploy AI Agents with FastAPI

Serve your agent through FastAPI streaming endpoints and build a Streamlit chat client that consumes the live token stream with memory and PDF export.

Jun 19, 2026 · 12 min read

Topics You Will Master

Wrapping an agent in a FastAPI endpoint with validation and CORS
Loading MCP tools once at startup with a FastAPI lifespan
Streaming agent tokens as newline-delimited JSON with memory
Building a Streamlit chat client with tool display and PDF export

Every agent so far ran in a terminal. To make an agent usable by an app, you serve it over HTTP. This lesson deploys agents with FastAPI — first a simple chat endpoint, then a streaming server that pushes tokens as they are generated — and builds a Streamlit chat client that consumes the stream, shows tool calls, and exports conversations to PDF.

Note

This lesson reuses the MCP servers and the scripts/ helpers from earlier projects. The streaming server connects Gmail, Yahoo Finance, and Google Sheets, so have those configured as shown in Build a Daily Briefing AI Agent.

Installation

Install the web stack:

BASH
pip install fastapi uvicorn httpx streamlit markdown2 xhtml2pdf

A Simple FastAPI Chat Server

Start with a non-streaming endpoint. It validates the request with Pydantic, builds an agent per call, and returns the final text. CORS is open so a browser client can reach it during development.

PYTHON
import sys
import os

root_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_dir)

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from dotenv import load_dotenv
load_dotenv()

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain.messages import HumanMessage

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=2)
    model: str = 'gemini-2.5-flash'

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/")
async def read_root():
    return {"status": "FastAPI agent server is up!"}

@app.post("/chat")
async def chat(request: ChatRequest):
    if not request.prompt.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        model = ChatGoogleGenerativeAI(model=request.model)
        agent = create_agent(model=model)
        # Use the async variant so the LLM call does not block the event loop.
        response = await agent.ainvoke({'messages': [HumanMessage(request.prompt)]})
        return {'response': response['messages'][-1].text}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host='0.0.0.0', port=8001)

Run it:

POWERSHELL
python "03 AI Projects/07_deploy_agents_with_fastapi/01_fastapi_server.py"

Tip

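During development you can also use fastapi dev 01_fastapi_server.py for auto-reload. The fastapi command ships with the fastapi[standard] extra, so install that if the command is not found. The __main__ block lets you run the file directly with python.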

Test it with curl:

BASH
# Health check
curl http://localhost:8001/

# Chat request
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"What is the capital of France?\"}"
OUTPUT
{"response": "The capital of France is Paris."}

Note

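The -d payload above uses escaped double quotes so it works in Windows PowerShell and CMD. The trailing backslashes are bash line continuations, so on Windows put the whole command on one line, and in Windows PowerShell call curl.exe, because curl there is an alias for Invoke-WebRequest. On Linux/macOS, you can use single quotes instead: -d '{"prompt": "What is the capital of France?"}'.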
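If shell quoting gets in the way, the same request can be built from Python. The following is a minimal sketch using only the standard library; the helper names build_chat_request and ask are illustrative and not part of the lesson's code, and ask assumes the server from this section is running on port 8001.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8001") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /chat endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the 'response' field (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Building the request needs no server, so it is safe to inspect offline.
req = build_chat_request("What is the capital of France?")
print(req.get_method(), req.full_url)
```

Because json.dumps produces the body, there is nothing to escape by hand, which sidesteps the quoting differences between shells.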

A Streaming Server with Tools and Memory

A real assistant should stream its response and use tools. This server loads MCP tools once at startup with a FastAPI lifespan, then streams agent output as newline-delimited JSON (application/x-ndjson).

Loading Tools at Startup

The lifespan context manager runs get_tools() before the server accepts requests, so tool loading happens once rather than per request. Destructive email and spreadsheet tools are filtered out:

PYTHON
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import json

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.checkpoint.memory import InMemorySaver
from langchain.messages import HumanMessage, AIMessageChunk

from scripts import base_tools, prompts, utils

checkpointer = InMemorySaver()
tools = None

class ChatRequest(BaseModel):
    query: str = Field(..., min_length=2)
    model: str = "gemini-2.5-flash"
    thread_id: str = "default"

async def get_tools():
    mcp_config = utils.load_mcp_config("gmail", "yahoo-finance", "google-sheets")
    client = MultiServerMCPClient(mcp_config)
    mcp_tools = await client.get_tools()
    tools = mcp_tools + [base_tools.web_search, base_tools.get_weather]

    filter_tools = ["delete_email", "batch_modify_emails", "batch_delete_emails",
                    "delete_label", "delete_filter", "update_cells"]
    safe_tools = [tool for tool in tools if tool.name not in filter_tools]

    print(f"Loaded {len(safe_tools)} Tools")
    return safe_tools

@asynccontextmanager
async def lifespan(app: FastAPI):
    global tools
    tools = await get_tools()
    print("Tools are loaded. Ready to create agent!")
    yield

app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], allow_methods=["*"], allow_headers=["*"],
)
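To see why the lifespan loads tools exactly once, it can help to run the same pattern without FastAPI. The sketch below is an illustration only: asyncio stands in for the server, and load_tools and handle_request are made-up stand-ins. Code before yield runs once at startup, and code after yield (which the lesson's lifespan does not need) runs once at shutdown.

```python
import asyncio
from contextlib import asynccontextmanager

events = []
state = {"tools": None}


async def load_tools():
    # Stand-in for an expensive startup step such as connecting to MCP servers.
    await asyncio.sleep(0)
    events.append("load")
    return ["web_search", "get_weather"]


@asynccontextmanager
async def lifespan(app):
    state["tools"] = await load_tools()  # runs once, before any request
    events.append("startup done")
    yield
    state["tools"] = None                # runs once, at shutdown
    events.append("shutdown")


async def handle_request(n):
    # Every request reuses the tools loaded at startup.
    events.append(f"request {n} sees {len(state['tools'])} tools")


async def main():
    # FastAPI enters and exits the context manager itself; here we do it by hand.
    async with lifespan(app=None):
        await handle_request(1)
        await handle_request(2)


asyncio.run(main())
print(events)
```

The recorded events show a single load followed by two requests that share its result, which is the behavior the global tools variable relies on.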

The Streaming Generator

stream_response builds the agent with the preloaded tools and a checkpointer, then iterates agent.astream(..., stream_mode='messages'). Each chunk is serialized to JSON — including any tool calls — and yielded as one line:

PYTHON
async def stream_response(query, model_name, thread_id):
    system_prompt = prompts.get_assistant_prompt()

    model = ChatGoogleGenerativeAI(model=model_name)
    agent = create_agent(model=model, tools=tools,
                         system_prompt=system_prompt, checkpointer=checkpointer)

    config = {"configurable": {"thread_id": thread_id}}

    async for chunk, metadata in agent.astream(
            {'messages': [HumanMessage(query)]},
            stream_mode='messages', config=config):

        data = {"type": chunk.__class__.__name__, "content": chunk.text}

        if isinstance(chunk, AIMessageChunk) and chunk.tool_calls:
            data['tool_calls'] = chunk.tool_calls

        yield (json.dumps(data) + "\n").encode()

The thread_id in the config is what gives the deployed agent memory — repeated requests with the same thread_id continue the same conversation.
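As a mental model for that behavior, here is a toy sketch of per-thread memory. It is not how InMemorySaver is implemented; ToyThreadMemory is a hypothetical class that only illustrates the idea of one message history per thread_id.

```python
class ToyThreadMemory:
    """Illustrative stand-in for a checkpointer: one message list per thread_id."""

    def __init__(self):
        self._threads = {}

    def add_turn(self, thread_id, user_text, reply_text):
        history = self._threads.setdefault(thread_id, [])
        history.append(("user", user_text))
        history.append(("assistant", reply_text))

    def history(self, thread_id):
        # Unknown threads start empty, like a brand-new conversation.
        return list(self._threads.get(thread_id, []))


memory = ToyThreadMemory()
memory.add_turn("user-123", "My name is Ada.", "Nice to meet you, Ada.")
memory.add_turn("user-123", "What is my name?", "Your name is Ada.")
memory.add_turn("user-456", "What is my name?", "I don't know yet.")

print(len(memory.history("user-123")))  # 4
print(len(memory.history("user-456")))  # 2
print(memory.history("new-thread"))     # []
```

One consequence worth noting: ChatRequest defaults thread_id to "default", so every client that omits the field shares a single conversation. Sending a distinct thread_id per user or session keeps histories separate.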

Note

get_assistant_prompt() returns a personal-assistant system prompt with the current date and tool guidance (Gmail, Yahoo Finance, Google Sheets, web, weather). It instructs the agent to always call a tool before answering, to never send email without confirmation, and to confirm before writing to a sheet. Customize the persona and defaults for your own user.

The Streaming Endpoint

Wrap the generator in a StreamingResponse:

PYTHON
@app.get("/")
async def read_root():
    return {"status": "Streaming agent server is up!"}

@app.post("/chat_stream")
async def chat_stream(request: ChatRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        return StreamingResponse(
            stream_response(request.query, request.model, request.thread_id),
            media_type="application/x-ndjson")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host="0.0.0.0", port=8002)
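One caveat: StreamingResponse only starts iterating the generator after the endpoint function has returned, so the try/except in chat_stream cannot turn a failure that happens mid-stream into a 500 response. A possible way to surface such failures is to catch them inside the generator and emit a final error line. The sketch below shows the idea with the standard library only; flaky_stream is a made-up source, and the "error" type is an assumed convention that the client would need to handle.

```python
import asyncio
import json


async def flaky_stream():
    # Made-up source: yields one good chunk, then fails mid-stream.
    yield (json.dumps({"type": "AIMessageChunk", "content": "partial"}) + "\n").encode()
    raise RuntimeError("tool backend unavailable")


async def with_error_line(source):
    """Pass lines through; on failure, emit one NDJSON error line instead of crashing."""
    try:
        async for line in source:
            yield line
    except Exception as exc:
        error = {"type": "error", "content": str(exc)}
        yield (json.dumps(error) + "\n").encode()


async def collect():
    return [json.loads(line) async for line in with_error_line(flaky_stream())]


lines = asyncio.run(collect())
print(lines)
```

In the server this wrapper would go around stream_response(...) before it is handed to StreamingResponse. The HTTP status stays 200 because the headers have already been sent, so the error line is the only signal the client receives.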

Run the streaming server and test it with curl. Reuse a thread_id to keep memory across calls:

POWERSHELL
python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"
BASH
curl -X POST http://localhost:8002/chat_stream \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"What is the weather in London?\", \"thread_id\": \"user-123\"}"
OUTPUT
{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}, ...}]}
{"type": "AIMessageChunk", "content": "The weather in London"}
{"type": "AIMessageChunk", "content": " is 14°C and cloudy..."}

Each line is a JSON object: tool-call chunks announce which tool the agent is calling, and content chunks stream the answer token by token.
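Any client can consume this format by parsing one JSON object per line. The following is a minimal sketch of that loop without any UI; consume_ndjson is an illustrative helper, and the sample lines are made up to resemble the output above (the ToolMessage line is an assumption about what a tool result chunk looks like).

```python
import json


def consume_ndjson(lines):
    """Accumulate the AI answer text and the names of tools that were called."""
    answer = []
    tools_called = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        chunk = json.loads(line)
        for call in chunk.get("tool_calls", []):
            tools_called.append(call["name"])
        # Only AI chunks contribute to the visible answer; tool output is skipped.
        if chunk.get("content") and "AI" in chunk.get("type", ""):
            answer.append(chunk["content"])
    return "".join(answer), tools_called


sample = [
    '{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]}',
    "",
    '{"type": "ToolMessage", "content": "14C, cloudy"}',
    '{"type": "AIMessageChunk", "content": "The weather in London"}',
    '{"type": "AIMessageChunk", "content": " is 14C and cloudy."}',
]

text, tools = consume_ndjson(sample)
print(text)
print(tools)
```

Newline-delimited JSON keeps the parser this simple: each line is complete on its own, so the client never has to wait for a closing bracket before rendering.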

A Streamlit Chat Client

The front end is a Streamlit app that posts to the streaming endpoint, renders tokens live, shows tool calls in a status box, and can export the last answer to PDF.

Tip

New to Streamlit? This companion playlist walks through the basics:

https://www.youtube.com/watch?v=hff2tHUzxJM&list=PLc2rvfiptPSSpZ99EnJbH5LjTJ_nOoSWW

The client keeps chat history in st.session_state, with a sidebar to set the thread_id, clear messages, and download a PDF:

PYTHON
import streamlit as st
import httpx
import json
import os
from datetime import datetime
import markdown2
from xhtml2pdf import pisa

st.title("Personal Assistant")

thread_id = st.sidebar.text_input("Thread ID", value="default")

if "messages" not in st.session_state:
    st.session_state.messages = []

if st.sidebar.button("Clear Messages"):
    st.session_state.messages = []
    st.rerun()

When the user submits a query, the client opens a streaming POST with httpx, iterates the response lines, renders tool calls in a status container, and appends AI content to a live placeholder:

PYTHON
query = st.chat_input("Ask anything...")
if query:
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.markdown(query)

    with st.chat_message("assistant"):
        tool_container = st.container()
        placeholder = st.empty()
        full_response = ""

        with httpx.Client(timeout=None) as client:
            with client.stream(
                "POST",
                "http://localhost:8002/chat_stream",
                json={"query": query, "thread_id": thread_id},
            ) as response:
                for line in response.iter_lines():
                    if not line.strip():
                        continue
                    chunk = json.loads(line)
                    msg_type = chunk.get("type", "")
                    content = chunk.get("content", "")

                    if chunk.get("tool_calls"):
                        for tc in chunk["tool_calls"]:
                            with tool_container:
                                st.status(f"🔧 {tc['name']}", state="complete").write(
                                    f"```json\n{json.dumps(tc['args'], indent=2)}\n```"
                                )

                    if content and "AI" in msg_type:
                        full_response += content
                        placeholder.markdown(full_response.replace("$", "\\$"))

    st.session_state.messages.append({"role": "assistant", "content": full_response})

Note

Dollar signs are escaped (replace("$", "\\$")) so Streamlit's Markdown does not interpret them as LaTeX math, which matters when the agent returns prices like $260.00.
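The escaping step on its own looks like this. The helper name escape_dollars is illustrative; the client above inlines the same replace call.

```python
def escape_dollars(text: str) -> str:
    """Prefix each dollar sign with a backslash so Markdown shows it literally."""
    return text.replace("$", "\\$")


raw = "AAPL closed at $260.00, up from $255.10"
print(escape_dollars(raw))  # AAPL closed at \$260.00, up from \$255.10
```

Note that the client escapes only at render time and stores the unescaped full_response in session state. Escaping already-escaped text would add a second backslash, so keeping the stored copy raw avoids that.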

Exporting to PDF

The sidebar's Download PDF button converts the last assistant message from Markdown to HTML with markdown2, then renders it with xhtml2pdf to a PDF in the Downloads folder of the machine running Streamlit (your own machine when you run the client locally):

PYTHON
if st.sidebar.button("Download PDF"):
    assistant_msgs = [m for m in st.session_state.messages if m["role"] == "assistant"]
    if assistant_msgs:
        last_msg = assistant_msgs[-1]
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_response.pdf"
        downloads_dir = os.path.join(os.path.expanduser("~"), "Downloads")
        filepath = os.path.join(downloads_dir, filename)

        html = markdown2.markdown(
            last_msg["content"],
            extras=["tables", "fenced-code-blocks", "cuddled-lists"],
        )
        styled_html = f"<html><head><meta charset='utf-8'></head><body>{html}</body></html>"

        with open(filepath, "wb") as f:
            pisa.CreatePDF(styled_html, dest=f)
        st.sidebar.success(f"Saved to {filename}")
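The export assumes a Downloads folder already exists, which may not hold on every machine. As a small sketch, a helper like the following could build the timestamped path and create the folder first. pdf_output_path is a hypothetical name, and the demo writes into a temporary directory rather than a real home folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def pdf_output_path(base_dir: Path, now: datetime) -> Path:
    """Return <base_dir>/<timestamp>_response.pdf, creating base_dir if needed."""
    base_dir.mkdir(parents=True, exist_ok=True)
    return base_dir / f"{now:%Y%m%d_%H%M%S}_response.pdf"


# Demo in a throwaway directory with a fixed timestamp.
with tempfile.TemporaryDirectory() as tmp:
    demo = pdf_output_path(Path(tmp) / "Downloads", datetime(2026, 6, 19, 9, 30, 0))
    created = demo.parent.is_dir()
    print(demo.name)  # 20260619_093000_response.pdf
```

In the client, the call would be pdf_output_path(Path.home() / "Downloads", datetime.now()), with the returned path passed to open(..., "wb").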

Running Everything Together

Start the streaming server, then launch the Streamlit client in a second terminal:

POWERSHELL
# Terminal 1 — backend
python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"

# Terminal 2 — front end
streamlit run "03 AI Projects/07_deploy_agents_with_fastapi/03_streamlit_client.py"

Streamlit opens a browser tab. Ask "Summarize my unread emails and tell me the weather in London" and you will watch tool-call status boxes appear as the agent works, the answer stream in token by token, and the whole exchange stay in memory under your chosen thread_id.

Tip

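To expose the API publicly, run the FastAPI server in a production ASGI setup (for example uvicorn with multiple workers, or a container on a cloud VM) and point the Streamlit client's URL at the deployed host instead of localhost. Keep in mind that InMemorySaver stores conversations inside a single process, so memory is lost on restart and is not shared between workers; a persistent checkpointer is needed if conversations must survive either. Also replace allow_origins=["*"] with the origins you actually serve.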

You now have a streaming agent served over HTTP with a real chat client. In the final lesson we connect an agent to a production cloud database and stream answers over real-world data in Real-World Agent Project: MySQL Data & Streaming.
