LangChain is a framework for building applications powered by language models. It provides a unified interface across model providers, cloud or local, so code written against GPT-4 can target a locally running Ollama model by swapping the chat model class.
This tutorial walks through installation, environment setup, and the two core invocation patterns: a single blocking call and a streamed token-by-token response.
Prerequisites: Ollama installed and running locally (see the Ollama Setup Guide). Python 3.9+.
Installation and Environment Setup
Installing the Packages
Install all three packages in one command. langchain-ollama provides the ChatOllama integration; python-dotenv handles .env loading.
pip install langchain langchain-ollama python-dotenv
On Linux/macOS: same command — pip or pip3 depending on your Python setup.
Tip
Use a virtual environment to keep your project dependencies isolated:
python -m venv .venv
.venv\Scripts\activate
On Linux/macOS: source .venv/bin/activate
Loading Environment Variables
Store your API keys and configuration (LangSmith tracing, endpoints, etc.) in a .env file at the root of your project. Load it at the top of your script:
from dotenv import load_dotenv
load_dotenv('.env')
True
A return value of True confirms at least one variable was loaded from the file. False means no variables were set: either the .env file was not found or it contains no entries. Check that the file exists in the working directory and holds your keys.
Note
LangSmith provides tracing and observability for LangChain runs. To enable it, add the following keys to your .env file:
LANGCHAIN_API_KEY=your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=my-project
Get your API key at smith.langchain.com.
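Because load_dotenv() returns True as soon as any single variable is set, it does not confirm that the specific LangSmith keys are present. The sketch below checks for them explicitly. The helper name missing_env_keys is hypothetical and not part of python-dotenv or LangChain; the key names are the ones listed in the note above.

```python
import os

# Key names taken from the LangSmith note above.
REQUIRED_KEYS = ("LANGCHAIN_API_KEY", "LANGCHAIN_TRACING_V2", "LANGCHAIN_PROJECT")


def missing_env_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper: `env` defaults to the real process environment,
    but any mapping can be passed in for testing.
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
sample_env = {"LANGCHAIN_API_KEY": "your_key_here", "LANGCHAIN_TRACING_V2": "true"}
print(missing_env_keys(env=sample_env))  # ['LANGCHAIN_PROJECT']
```

In a real script, call missing_env_keys() with no arguments right after load_dotenv('.env') and stop early if the returned list is not empty.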
Pulling a Model
Before running any code, make sure the target model is available locally:
ollama pull qwen3
On Linux/macOS: same command — ollama is cross-platform.
Important
ChatOllama will fail at runtime if the model hasn't been pulled. Run ollama list to confirm the model name appears in the list before executing your script.
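The same check can be done from Python before the model is created. This is a minimal sketch, assuming the Ollama server is reachable at its default address and using its /api/tags endpoint, which lists locally available models. The function names are hypothetical helpers, and the model names in the usage note are illustrative only.

```python
import json
import urllib.request


def fetch_local_models(base_url="http://localhost:11434"):
    """Return the model names reported by the Ollama server's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["name"] for entry in payload.get("models", [])]


def is_model_available(model, installed):
    """Return True if `model` matches an installed name, with or without a tag."""
    # Ollama lists models as "name:tag"; a bare name refers to the "latest" tag.
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in installed
```

With the server running, is_model_available("qwen3", fetch_local_models()) returns False when the model still needs to be pulled. The lookup is kept separate from the network call so it can be exercised with a plain list such as ["qwen3:latest", "llama3.2:3b"].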
Chat with a Local LLM
Creating a ChatOllama Instance
ChatOllama connects to the Ollama server running at localhost:11434. Configure the model name, temperature, and maximum output token count at initialization:
from langchain_ollama import ChatOllama
llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen3",
    temperature=0.8,
    num_predict=256,
)
Parameter guide:
| Parameter | Type | Effect |
|---|---|---|
| base_url | str | Ollama server URL (default: http://localhost:11434) |
| model | str | Model name exactly as it appears in ollama list |
| temperature | float | Randomness (0 = deterministic, 1 = creative). 0.8 is a good general-purpose default |
| num_predict | int | Maximum tokens to generate per response |
Note
base_url defaults to http://localhost:11434 if omitted. Include it explicitly when connecting to a remote Ollama instance or a custom port.
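One way to switch between a local and a remote server without editing code is to read the connection settings from the environment, which fits the .env workflow described earlier. This is only a sketch: the variable names OLLAMA_BASE_URL, OLLAMA_MODEL, OLLAMA_TEMPERATURE, and OLLAMA_NUM_PREDICT are assumptions made for illustration (LangChain does not read them on its own), and the remote address is a placeholder.

```python
import os


def ollama_settings(env=None):
    """Build ChatOllama keyword arguments from environment-style settings.

    The variable names are hypothetical; the defaults mirror the values
    used in this tutorial.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "model": env.get("OLLAMA_MODEL", "qwen3"),
        "temperature": float(env.get("OLLAMA_TEMPERATURE", "0.8")),
        "num_predict": int(env.get("OLLAMA_NUM_PREDICT", "256")),
    }


# Placeholder remote address, passed explicitly so the example is reproducible.
settings = ollama_settings(env={"OLLAMA_BASE_URL": "http://192.168.1.50:11434"})
print(settings["base_url"])
```

With langchain-ollama installed, the resulting dictionary can be unpacked directly: llm = ChatOllama(**settings).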
Invoking the Model
Pass any prompt string to llm.invoke(). It blocks until the full response is ready, then returns a message object — read the generated text from .content:
response = llm.invoke("What is the theory of relativity? Answer in 5 sentences.")
print(response.content)
The theory of relativity, developed by Albert Einstein, consists of two parts: special relativity (1905) and general relativity (1915). Special relativity introduced the idea that the laws of physics are the same for all non-accelerating observers and that the speed of light is constant regardless of the observer's motion. It also established the famous equation E=mc², showing that mass and energy are interchangeable. General relativity extended this to include gravity, describing it not as a force but as the curvature of spacetime caused by mass. Together, they revolutionized our understanding of space, time, and the structure of the universe.
Streaming Responses
For long outputs or real-time UIs, use llm.stream() to receive the response one token chunk at a time instead of waiting for the full generation:
for chunk in llm.stream("What is the theory of relativity? Answer in 5 sentences."):
    print(chunk.content, end="", flush=True)
Tip
Use end="" and flush=True in print() so tokens appear inline as they arrive instead of buffering line-by-line. This gives the same feel as ChatGPT's typewriter effect.
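A streamed response is often needed in full afterwards, for example to log it or append it to a chat history. The sketch below prints chunks as they arrive and also returns the concatenated text. It assumes only that each chunk exposes a .content string, as in the loop above; stream_and_collect is a hypothetical helper, and the stand-in chunks exist so the example runs without an Ollama server.

```python
from types import SimpleNamespace


def stream_and_collect(chunks, write=print):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        write(chunk.content, end="", flush=True)  # show the token immediately
        parts.append(chunk.content)               # keep it for later use
    return "".join(parts)


# Stand-in chunks that mimic objects with a .content attribute.
fake_chunks = [SimpleNamespace(content=text) for text in ("Space", " and", " time")]
full_text = stream_and_collect(fake_chunks)
```

With a live model, pass the generator directly: full_text = stream_and_collect(llm.stream("What is the theory of relativity? Answer in 5 sentences.")).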
Inspecting Response Metadata
The response object returned by llm.invoke() exposes runtime statistics via response_metadata. This is useful for debugging latency, tracking token usage, and profiling your prompts:
response.response_metadata
{
"model": "qwen3",
"created_at": "2025-10-22T12:16:31.9173234Z",
"done": true,
"done_reason": "stop",
"total_duration": 2630496300,
"load_duration": 2149414600,
"prompt_eval_count": 100,
"prompt_eval_duration": 17446400,
"eval_count": 140,
"eval_duration": 397223200,
"model_name": "qwen3",
"model_provider": "ollama"
}
Key fields explained:
| Field | Description |
|---|---|
| total_duration | End-to-end latency in nanoseconds (divide by 1e9 for seconds) |
| load_duration | Time spent loading the model into memory (nanoseconds); high on first call, near-zero after warmup |
| prompt_eval_count | Number of tokens in the input prompt |
| prompt_eval_duration | Time spent processing the prompt (nanoseconds) |
| eval_count | Number of tokens generated in the response |
| eval_duration | Time spent on generation (nanoseconds) |
| done_reason | Why generation stopped: "stop" means a natural end; "length" means num_predict was hit |
Tip
To calculate tokens per second for generation:
tps = response.response_metadata['eval_count'] / (response.response_metadata['eval_duration'] / 1e9)
print(f"{tps:.1f} tokens/sec")
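The same arithmetic extends to the other fields. The sketch below converts the nanosecond durations to seconds, computes throughput for both the prompt and generation phases, and flags responses cut off by num_predict. summarize_metadata is a hypothetical helper, and the sample dictionary copies the numeric fields from the example output shown earlier.

```python
NS_PER_SECOND = 1e9


def summarize_metadata(meta):
    """Convert Ollama timing fields to seconds and tokens-per-second figures."""
    prompt_s = meta["prompt_eval_duration"] / NS_PER_SECOND
    eval_s = meta["eval_duration"] / NS_PER_SECOND
    return {
        "total_s": meta["total_duration"] / NS_PER_SECOND,
        "load_s": meta["load_duration"] / NS_PER_SECOND,
        # Guard against zero durations to avoid ZeroDivisionError.
        "prompt_tps": meta["prompt_eval_count"] / prompt_s if prompt_s else None,
        "gen_tps": meta["eval_count"] / eval_s if eval_s else None,
        # "length" means generation stopped because num_predict was reached.
        "truncated": meta.get("done_reason") == "length",
    }


# Values copied from the example response_metadata above.
sample = {
    "done_reason": "stop",
    "total_duration": 2630496300,
    "load_duration": 2149414600,
    "prompt_eval_count": 100,
    "prompt_eval_duration": 17446400,
    "eval_count": 140,
    "eval_duration": 397223200,
}

summary = summarize_metadata(sample)
print(f"total {summary['total_s']:.2f}s, load {summary['load_s']:.2f}s, "
      f"generation {summary['gen_tps']:.1f} tokens/sec, truncated={summary['truncated']}")
```

In this sample most of the total time is model loading (about 2.15 of 2.63 seconds), which is the first-call pattern described in the table.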
Quick Reference
Package Install
pip install langchain langchain-ollama python-dotenv
Minimal Working Example
from dotenv import load_dotenv
from langchain_ollama import ChatOllama
load_dotenv('.env')
llm = ChatOllama(
    model="qwen3",
    temperature=0.8,
    num_predict=256,
)
# Single blocking call
response = llm.invoke("Explain transformers in 3 sentences.")
print(response.content)
# Streamed call
for chunk in llm.stream("Explain transformers in 3 sentences."):
    print(chunk.content, end="", flush=True)
ChatOllama Key Parameters
| Parameter | Default | Description |
|---|---|---|
| model | none | Model name (required) |
| base_url | http://localhost:11434 | Ollama server address |
| temperature | 0.8 | Output randomness |
| num_predict | 128 | Max tokens to generate |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 40 | Top-k sampling cutoff |