Every agent so far has run in a terminal. To make an agent usable by an app, you serve it over HTTP. This lesson deploys agents with FastAPI, first as a simple chat endpoint and then as a streaming server that pushes tokens as they are generated. It also builds a Streamlit chat client that consumes the stream, shows tool calls, and exports the latest answer to PDF.
Note
This lesson reuses the MCP servers and the scripts/ helpers from earlier projects. The streaming server connects Gmail, Yahoo Finance, and Google Sheets, so have those configured as shown in Build a Daily Briefing AI Agent.
Installation
Install the web stack:
pip install fastapi uvicorn httpx streamlit markdown2 xhtml2pdf
A Simple FastAPI Chat Server
Start with a non-streaming endpoint. It validates the request with Pydantic, builds an agent per call, and returns the final text. CORS is open so a browser client can reach it during development.
import sys
import os
root_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_dir)
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from dotenv import load_dotenv
load_dotenv()
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain.messages import HumanMessage
app = FastAPI()
class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=2)
    model: str = 'gemini-2.5-flash'

# Open CORS for local development only; restrict allow_origins in production.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/")
async def read_root():
    return {"status": "FastAPI agent server is up!"}

@app.post("/chat")
async def chat(request: ChatRequest):
    # min_length=2 still admits whitespace-only prompts, so reject those here.
    if not request.prompt.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        model = ChatGoogleGenerativeAI(model=request.model)
        agent = create_agent(model=model)
        response = agent.invoke({'messages': [HumanMessage(request.prompt)]})
        return {'response': response['messages'][-1].text}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host='0.0.0.0', port=8001)
Run it:
python "03 AI Projects/07_deploy_agents_with_fastapi/01_fastapi_server.py"
Tip
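During development you can also use fastapi dev 01_fastapi_server.py for auto-reload. That command comes from the FastAPI CLI, which is installed with pip install "fastapi[standard]", and it serves on port 8000 by default rather than 8001. The __main__ block lets you run the file directly with python.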
Test it with curl:
# Health check
curl http://localhost:8001/
# Chat request
curl -X POST http://localhost:8001/chat \
-H "Content-Type: application/json" \
-d "{\"prompt\": \"What is the capital of France?\"}"
{"response": "The capital of France is Paris."}
Note
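The -d payload above uses escaped double quotes so it works in Windows PowerShell and CMD. On Linux/macOS, you can use single quotes instead: -d '{"prompt": "What is the capital of France?"}'. The trailing backslashes are line continuations for bash-style shells; on Windows, put the whole command on one line.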
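The same request can also be issued from Python. The sketch below uses only the standard library and assumes the chat server from this section is listening on localhost:8001. The helper names build_chat_request and ask are illustrative and not part of the lesson's code.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8001") -> urllib.request.Request:
    """Build the POST /chat request (hypothetical helper for illustration)."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = "http://localhost:8001") -> str:
    """Send the prompt and return the 'response' field of the JSON reply."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, ask("What is the capital of France?") should return the same text that curl prints in the response field. Building the JSON with json.dumps avoids the shell-quoting differences described above.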
A Streaming Server with Tools and Memory
A real assistant should stream its response and use tools. This server loads MCP tools once at startup with a FastAPI lifespan, then streams agent output as newline-delimited JSON (application/x-ndjson).
Loading Tools at Startup
The lifespan context manager runs get_tools() before the server accepts requests, so tool loading happens once rather than per request. Destructive email and spreadsheet tools are filtered out:
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import json
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.checkpoint.memory import InMemorySaver
from langchain.messages import HumanMessage, AIMessageChunk
from scripts import base_tools, prompts, utils
checkpointer = InMemorySaver()
tools = None
class ChatRequest(BaseModel):
    query: str = Field(..., min_length=2)
    model: str = "gemini-2.5-flash"
    thread_id: str = "default"

async def get_tools():
    mcp_config = utils.load_mcp_config("gmail", "yahoo-finance", "google-sheets")
    client = MultiServerMCPClient(mcp_config)
    mcp_tools = await client.get_tools()
    tools = mcp_tools + [base_tools.web_search, base_tools.get_weather]
    # Names of destructive tools to withhold from the agent.
    filter_tools = ["delete_email", "batch_modify_emails", "batch_delete_emails",
                    "delete_label", "delete_filter", "update_cells"]
    safe_tools = [tool for tool in tools if tool.name not in filter_tools]
    print(f"Loaded {len(safe_tools)} Tools")
    return safe_tools

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once before the server starts accepting requests.
    global tools
    tools = await get_tools()
    print("Tools are loaded. Ready to create agent!")
    yield

app = FastAPI(lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], allow_methods=["*"], allow_headers=["*"],
)
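A name-based block list fails silently if an entry is misspelled or an MCP server renames a tool. One possible guard, sketched here with stand-in objects rather than real MCP tools, is to report blocked names that matched nothing. FakeTool, filter_tools_checked, and the tool name send_email are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeTool:
    """Stand-in for a loaded tool; only the .name attribute matters here."""
    name: str


def filter_tools_checked(tools, blocked_names):
    """Return (safe_tools, unmatched), where unmatched lists blocked names not found."""
    blocked = set(blocked_names)
    loaded = {tool.name for tool in tools}
    safe = [tool for tool in tools if tool.name not in blocked]
    unmatched = sorted(blocked - loaded)
    return safe, unmatched


loaded_tools = [FakeTool("send_email"), FakeTool("delete_email"), FakeTool("get_weather")]
safe, unmatched = filter_tools_checked(loaded_tools, ["delete_email", "update_cells"])
print([t.name for t in safe])  # ['send_email', 'get_weather']
print(unmatched)               # ['update_cells']
```

Logging the unmatched names at startup makes it visible when a filter entry no longer corresponds to any loaded tool.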
The Streaming Generator
stream_response builds the agent with the preloaded tools and a checkpointer, then iterates agent.astream(..., stream_mode='messages'). Each chunk is serialized to JSON — including any tool calls — and yielded as one line:
async def stream_response(query, model_name, thread_id):
    system_prompt = prompts.get_assistant_prompt()
    model = ChatGoogleGenerativeAI(model=model_name)
    agent = create_agent(model=model, tools=tools,
                         system_prompt=system_prompt, checkpointer=checkpointer)
    # The thread_id selects which saved conversation the checkpointer resumes.
    config = {"configurable": {"thread_id": thread_id}}
    async for chunk, metadata in agent.astream(
            {'messages': [HumanMessage(query)]},
            stream_mode='messages', config=config):
        data = {"type": chunk.__class__.__name__, "content": chunk.text}
        if isinstance(chunk, AIMessageChunk) and chunk.tool_calls:
            data['tool_calls'] = chunk.tool_calls
        # One JSON object per line (NDJSON).
        yield (json.dumps(data) + "\n").encode()
The thread_id in the config is what gives the deployed agent memory — repeated requests with the same thread_id continue the same conversation.
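As a mental model only, per-thread memory can be pictured as a dictionary of message lists keyed by thread_id. The toy below is not how InMemorySaver is implemented; ToyCheckpointer and toy_turn are hypothetical names, and the "agent" is a canned reply.

```python
from collections import defaultdict


class ToyCheckpointer:
    """Toy stand-in for a checkpointer: one message list per thread_id."""

    def __init__(self):
        self._threads = defaultdict(list)

    def history(self, thread_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return list(self._threads[thread_id])

    def append(self, thread_id, role, content):
        self._threads[thread_id].append({"role": role, "content": content})


def toy_turn(store, thread_id, query):
    """Record a user turn and a canned reply; return how many earlier messages were visible."""
    seen = len(store.history(thread_id))
    store.append(thread_id, "user", query)
    store.append(thread_id, "assistant", f"(reply to: {query})")
    return seen


store = ToyCheckpointer()
first = toy_turn(store, "user-123", "What is the weather in London?")
second = toy_turn(store, "user-123", "And tomorrow?")
other = toy_turn(store, "user-456", "Hello")
print(first, second, other)  # 0 2 0
```

The second call on user-123 sees the two messages from the first turn, while user-456 starts from an empty history.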
Note
get_assistant_prompt() returns a personal-assistant system prompt with the current date and tool guidance (Gmail, Yahoo Finance, Google Sheets, web, weather). It instructs the agent to always call a tool before answering, to never send email without confirmation, and to confirm before writing to a sheet. Customize the persona and defaults for your own user.
The Streaming Endpoint
Wrap the generator in a StreamingResponse:
@app.get("/")
async def read_root():
    return {"status": "Streaming agent server is up!"}

@app.post("/chat_stream")
async def chat_stream(request: ChatRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Empty prompt!")
    try:
        return StreamingResponse(
            stream_response(request.query, request.model, request.thread_id),
            media_type="application/x-ndjson")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Server error: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app=app, host="0.0.0.0", port=8002)
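One caveat: the try/except in the endpoint only covers constructing the StreamingResponse. The generator body runs later, while the response is being sent, so an exception raised mid-stream is not turned into that 500 response. A possible mitigation, sketched below with the standard library only, is to wrap the generator and emit a final error line. The wrapper name safe_ndjson and the "error" message type are assumptions, not part of the lesson's schema, and flaky_stream is a stand-in for stream_response.

```python
import asyncio
import json


async def safe_ndjson(source):
    """Re-yield NDJSON lines from `source`; on failure, emit one error line instead."""
    try:
        async for line in source:
            yield line
    except Exception as exc:  # broad on purpose: any failure becomes a final line
        # "error" is a hypothetical message type for illustration.
        yield (json.dumps({"type": "error", "content": str(exc)}) + "\n").encode()


async def flaky_stream():
    """Stand-in for stream_response that fails after one chunk."""
    yield (json.dumps({"type": "AIMessageChunk", "content": "Hello"}) + "\n").encode()
    raise RuntimeError("model unavailable")


async def collect(agen):
    """Drain an async generator of NDJSON lines into a list of dicts."""
    return [json.loads(line) async for line in agen]


lines = asyncio.run(collect(safe_ndjson(flaky_stream())))
print(lines)
```

In the server this would mean passing safe_ndjson(stream_response(...)) to StreamingResponse, and the client would need to recognize the extra message type.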
Run the streaming server and test it with curl. Reuse a thread_id to keep memory across calls:
python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"
curl -X POST http://localhost:8002/chat_stream \
-H "Content-Type: application/json" \
-d "{\"query\": \"What is the weather in London?\", \"thread_id\": \"user-123\"}"
{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}, ...}]}
{"type": "AIMessageChunk", "content": "The weather in London"}
{"type": "AIMessageChunk", "content": " is 14°C and cloudy..."}
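Each line is a JSON object. Tool-call chunks announce which tool the agent is invoking, and content chunks stream the answer piece by piece. Lines with a non-AI type, such as tool results, may also appear in the stream, which is why a client should check the type field before displaying content.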
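A consumer of this format reads one line at a time, parses it, and decides what to do based on type and tool_calls. The sketch below works on an in-memory list shaped like the output above (with the elided fields omitted); consume_ndjson is an illustrative helper name.

```python
import json


def consume_ndjson(lines):
    """Return (answer_text, tool_names) from an iterable of NDJSON lines."""
    answer_parts = []
    tool_names = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        for call in chunk.get("tool_calls", []):
            tool_names.append(call["name"])
        # Only AI message content belongs in the visible answer.
        if chunk.get("content") and "AI" in chunk.get("type", ""):
            answer_parts.append(chunk["content"])
    return "".join(answer_parts), tool_names


sample = [
    '{"type": "AIMessageChunk", "content": "", "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]}',
    '{"type": "AIMessageChunk", "content": "The weather in London"}',
    '{"type": "AIMessageChunk", "content": " is 14°C and cloudy."}',
]
text, tools_used = consume_ndjson(sample)
print(text)        # The weather in London is 14°C and cloudy.
print(tools_used)  # ['get_weather']
```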
A Streamlit Chat Client
The front end is a Streamlit app that posts to the streaming endpoint, renders tokens live, shows tool calls in a status box, and can export the last answer to PDF.
Tip
New to Streamlit? This companion playlist walks through the basics:
https://www.youtube.com/watch?v=hff2tHUzxJM&list=PLc2rvfiptPSSpZ99EnJbH5LjTJ_nOoSWW
The client keeps chat history in st.session_state, with a sidebar to set the thread_id, clear messages, and download a PDF:
import streamlit as st
import httpx
import json
import os
from datetime import datetime
import markdown2
from xhtml2pdf import pisa
st.title("Personal Assistant")
thread_id = st.sidebar.text_input("Thread ID", value="default")
if "messages" not in st.session_state:
    st.session_state.messages = []
if st.sidebar.button("Clear Messages"):
    st.session_state.messages = []
    st.rerun()
When the user submits a query, the client opens a streaming POST with httpx, iterates the response lines, renders tool calls in a status container, and appends AI content to a live placeholder:
query = st.chat_input("Ask anything...")
if query:
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.markdown(query)
    with st.chat_message("assistant"):
        tool_container = st.container()
        placeholder = st.empty()
        full_response = ""
        with httpx.Client(timeout=None) as client:
            with client.stream(
                "POST",
                "http://localhost:8002/chat_stream",
                json={"query": query, "thread_id": thread_id},
            ) as response:
                for line in response.iter_lines():
                    if not line.strip():
                        continue
                    chunk = json.loads(line)
                    msg_type = chunk.get("type", "")
                    content = chunk.get("content", "")
                    # Show each tool call with its arguments in a status box.
                    if chunk.get("tool_calls"):
                        for tc in chunk["tool_calls"]:
                            with tool_container:
                                st.status(f"🔧 {tc['name']}", state="complete").write(
                                    f"```json\n{json.dumps(tc['args'], indent=2)}\n```"
                                )
                    # Append only AI message content to the live answer.
                    if content and "AI" in msg_type:
                        full_response += content
                        placeholder.markdown(full_response.replace("$", "\\$"))
    st.session_state.messages.append({"role": "assistant", "content": full_response})
Note
Dollar signs are escaped with replace("$", "\\$") so that Streamlit's Markdown does not interpret them as LaTeX math, which matters when the agent returns prices like $260.00.
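A plain string replacement escapes every dollar sign, including any the model has already escaped. If that becomes a problem, one variant (an alternative, not what the client above does) is a regular expression that skips dollar signs already preceded by a backslash. The function name escape_dollars is illustrative.

```python
import re


def escape_dollars(text: str) -> str:
    """Backslash-escape each '$' that is not already preceded by a backslash."""
    return re.sub(r"(?<!\\)\$", r"\\$", text)


print(escape_dollars("AAPL closed at $260.00"))  # AAPL closed at \$260.00
```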
Exporting to PDF
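The sidebar's Download PDF button converts the last assistant message from Markdown to HTML with markdown2, then renders it to a PDF in the Downloads folder with xhtml2pdf. The file is written on the machine running Streamlit, which is the user's own machine only when the app runs locally: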
if st.sidebar.button("Download PDF"):
    assistant_msgs = [m for m in st.session_state.messages if m["role"] == "assistant"]
    if assistant_msgs:
        last_msg = assistant_msgs[-1]
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_response.pdf"
        downloads_dir = os.path.join(os.path.expanduser("~"), "Downloads")
        filepath = os.path.join(downloads_dir, filename)
        # Convert Markdown to HTML, then wrap it in a minimal document.
        html = markdown2.markdown(
            last_msg["content"],
            extras=["tables", "fenced-code-blocks", "cuddled-lists"],
        )
        styled_html = f"<html><head><meta charset='utf-8'></head><body>{html}</body></html>"
        with open(filepath, "wb") as f:
            pisa.CreatePDF(styled_html, dest=f)
        st.sidebar.success(f"Saved to {filename}")
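The code above assumes a Downloads folder already exists in the home directory, which is not guaranteed on every machine. A small sketch of building the same timestamped path with pathlib and creating the folder when it is missing follows; pdf_path is a hypothetical helper, and the demo writes into a temporary directory rather than the real home folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def pdf_path(base_dir: Path, now: datetime) -> Path:
    """Return <base_dir>/<YYYYmmdd_HHMMSS>_response.pdf, creating base_dir if needed."""
    base_dir.mkdir(parents=True, exist_ok=True)
    return base_dir / f"{now.strftime('%Y%m%d_%H%M%S')}_response.pdf"


with tempfile.TemporaryDirectory() as tmp:
    target = pdf_path(Path(tmp) / "Downloads", datetime(2025, 1, 2, 3, 4, 5))
    created = target.parent.is_dir()
    print(target.name)  # 20250102_030405_response.pdf
```

In the client, base_dir would be Path.home() / "Downloads" and now would be datetime.now().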
Running Everything Together
Start the streaming server, then launch the Streamlit client in a second terminal:
# Terminal 1 — backend
python "03 AI Projects/07_deploy_agents_with_fastapi/02_stream_server.py"
# Terminal 2 — front end
streamlit run "03 AI Projects/07_deploy_agents_with_fastapi/03_streamlit_client.py"
Streamlit opens a browser tab. Ask "Summarize my unread emails and tell me the weather in London": you will watch tool calls appear in status boxes as the agent works, the answer stream in chunk by chunk, and the whole exchange stay in memory under your chosen thread_id.
Tip
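To expose the API publicly, deploy the FastAPI server behind a production ASGI setup (for example uvicorn with multiple workers, or a managed container on a cloud VM) and point the Streamlit client's URL at the deployed host instead of localhost. InMemorySaver keeps conversation state inside a single process, so with multiple workers or after a restart a thread_id can lose its history; use a persistent checkpointer in that setting. Also replace the wildcard CORS origins with the origins you actually serve.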
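One way to avoid editing the client for each environment is to read the server address from an environment variable and fall back to localhost. This is a sketch under that assumption; the variable name AGENT_API_URL and the helper chat_stream_url are hypothetical.

```python
import os


def chat_stream_url(env=os.environ) -> str:
    """Build the /chat_stream URL from AGENT_API_URL (hypothetical variable name)."""
    base = env.get("AGENT_API_URL", "http://localhost:8002").rstrip("/")
    return f"{base}/chat_stream"


print(chat_stream_url({}))  # http://localhost:8002/chat_stream
print(chat_stream_url({"AGENT_API_URL": "https://agent.example.com/"}))
```

The Streamlit client would then pass chat_stream_url() to client.stream in place of the hard-coded address.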
You now have a streaming agent served over HTTP with a real chat client. In the final lesson we connect an agent to a production cloud database and stream answers over real-world data in Real-World Agent Project: MySQL Data & Streaming.