
LangChain Getting Started

Install LangChain and langchain-ollama, configure environment variables, connect to a local Ollama model, and invoke and stream chat completions in Python.

Jun 4, 2026 at 10:30 AM · 5 min read

Topics You Will Master

Installing langchain, langchain-ollama, and python-dotenv
Loading environment variables from a .env file with python-dotenv
Creating a ChatOllama instance connected to a local Ollama server
Invoking a local LLM and reading the response content
Streaming responses token-by-token with llm.stream()
Inspecting response_metadata for model stats and token counts

Best For

Python developers who have Ollama running locally and want to integrate local LLMs into their applications using the LangChain framework.

Expected Outcome

A working LangChain setup that can send prompts to any locally running Ollama model, receive full or streamed responses, and inspect runtime metadata.

LangChain is a framework for building applications powered by language models. It provides a unified interface across model providers — cloud or local — so the same code that talks to GPT-4 can talk to a locally running Ollama model by swapping one line.

This tutorial walks through installation, environment setup, and the two core invocation patterns: a single blocking call and a streamed token-by-token response.

Prerequisites: Ollama installed and running locally (see the Ollama Setup Guide). Python 3.9+.

Installation and Environment Setup

Installing the Packages

Install all three packages in one command. langchain-ollama provides the ChatOllama integration; python-dotenv handles .env loading.

BASH
pip install langchain langchain-ollama python-dotenv

On Linux/macOS: same command — pip or pip3 depending on your Python setup.
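
To confirm the install succeeded, one option is to query the installed versions from Python. The sketch below uses only the standard library's importlib.metadata; the helper name installed_versions is illustrative and not part of LangChain.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The three distributions installed by the pip command above.
report = installed_versions(["langchain", "langchain-ollama", "python-dotenv"])
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, the most likely cause is that pip ran against a different interpreter than the one executing the script.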

Tip

Use a virtual environment to keep your project dependencies isolated (the activation command shown is for Windows):

BASH
python -m venv .venv
.venv\Scripts\activate

On Linux/macOS: source .venv/bin/activate

Loading Environment Variables

Store your API keys and configuration (LangSmith tracing, endpoints, etc.) in a .env file at the root of your project. Load it at the top of your script:

PYTHON
from dotenv import load_dotenv

load_dotenv('.env')
OUTPUT
True

A return value of True means python-dotenv found the file and parsed at least one variable from it. False means nothing was loaded: either the .env file was not found or it contains no variables. Check that the file exists in the working directory and is not empty.
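
Because the return value only says that something was loaded, it can help to check that the specific keys your script needs are actually set. This is a minimal sketch using only the standard library; missing_env_vars is a hypothetical helper name, and the key names are the LangSmith variables used in this tutorial.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the result does not depend on your shell.
sample_env = {"LANGCHAIN_API_KEY": "your_key_here", "LANGCHAIN_PROJECT": ""}
print(missing_env_vars(["LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT"], sample_env))
# prints ['LANGCHAIN_PROJECT']
```

In a real script you would call missing_env_vars([...]) with no second argument right after load_dotenv('.env'), so it inspects the live process environment.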

Note

LangSmith provides tracing and observability for LangChain runs. To enable it, add the following keys to your .env file:

ENV
LANGCHAIN_API_KEY=your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=my-project

Get your API key at smith.langchain.com.


Pulling a Model

Before running any code, make sure the target model is available locally:

BASH
ollama pull qwen3

On Linux/macOS: same command — ollama is cross-platform.

Important

ChatOllama will fail at runtime if the model hasn't been pulled. Run ollama list to confirm the model name appears in the list before executing your script.
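
The same check can be done from Python before the model is used. The sketch below assumes the Ollama server exposes its list of pulled models at the /api/tags REST endpoint and that a name without a tag resolves to the :latest tag. The helper names are hypothetical and the matching logic is deliberately simple.

```python
import json
from urllib.request import urlopen


def normalize_model_name(name):
    """Ollama treats a name without a tag as `<name>:latest`."""
    return name if ":" in name else f"{name}:latest"


def has_model(local_names, target):
    """True if `target` matches one of the locally pulled model names."""
    wanted = normalize_model_name(target)
    return wanted in {normalize_model_name(n) for n in local_names}


def list_local_models(base_url="http://localhost:11434"):
    """Fetch pulled model names from the Ollama server's /api/tags endpoint."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


# Offline illustration with a hard-coded list (no server needed):
print(has_model(["qwen3:latest", "llama3.2:3b"], "qwen3"))  # True
print(has_model(["qwen3:8b"], "qwen3"))                     # False
```

With the server running, has_model(list_local_models(), "qwen3") performs the live check. The second offline example shows a common pitfall: pulling only a tagged variant such as qwen3:8b does not satisfy model="qwen3".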


Chat with a Local LLM

Creating a ChatOllama Instance

ChatOllama connects to the Ollama server running at localhost:11434. Configure the model name, temperature, and maximum output token count at initialization:

PYTHON
from langchain_ollama import ChatOllama

llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen3",
    temperature=0.8,
    num_predict=256
)

Parameter guide:

Parameter | Type | Effect
--- | --- | ---
base_url | str | Ollama server URL (default: http://localhost:11434)
model | str | Model name exactly as it appears in ollama list
temperature | float | Randomness (0 = deterministic, 1 = creative). 0.8 is a good general-purpose default
num_predict | int | Maximum tokens to generate per response

Note

base_url defaults to http://localhost:11434 if omitted. Include it explicitly when connecting to a remote Ollama instance or a custom port.
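
When the server address or model varies between machines, one approach is to read these settings from environment variables instead of hard-coding them. The sketch below only builds the keyword arguments; the APP_OLLAMA_* variable names and the example IP address are hypothetical choices for illustration, not names that LangChain or Ollama read on their own.

```python
import os


def chat_ollama_kwargs(env=None):
    """Build ChatOllama keyword arguments from environment-style settings.

    APP_OLLAMA_* are hypothetical variable names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("APP_OLLAMA_BASE_URL", "http://localhost:11434"),
        "model": env.get("APP_OLLAMA_MODEL", "qwen3"),
        "temperature": float(env.get("APP_OLLAMA_TEMPERATURE", "0.8")),
        "num_predict": int(env.get("APP_OLLAMA_NUM_PREDICT", "256")),
    }


# Example: point at a remote server while keeping the other defaults.
kwargs = chat_ollama_kwargs({"APP_OLLAMA_BASE_URL": "http://192.168.1.50:11434"})
print(kwargs)
# With langchain-ollama installed: llm = ChatOllama(**kwargs)
```

Keeping the fallback values identical to the ones used in this tutorial means the script behaves the same when none of the variables are set.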

Invoking the Model

Pass a prompt string to llm.invoke(). The call blocks until the full response is ready, then returns an AIMessage object; read the generated text from its .content attribute:

PYTHON
response = llm.invoke("What is the theory of relativity? Answer in 5 sentences.")
print(response.content)
OUTPUT
The theory of relativity, developed by Albert Einstein, consists of two parts: special relativity (1905) and general relativity (1915). Special relativity introduced the idea that the laws of physics are the same for all non-accelerating observers and that the speed of light is constant regardless of the observer's motion. It also established the famous equation E=mc², showing that mass and energy are interchangeable. General relativity extended this to include gravity, describing it not as a force but as the curvature of spacetime caused by mass. Together, they revolutionized our understanding of space, time, and the structure of the universe.

Streaming Responses

For long outputs or real-time UIs, use llm.stream() to receive the response one token chunk at a time instead of waiting for the full generation:

PYTHON
for chunk in llm.stream("What is the theory of relativity? Answer in 5 sentences."):
    print(chunk.content, end="", flush=True)

Tip

Use end="" and flush=True in print() so tokens appear inline as they arrive instead of buffering line-by-line. This gives the same feel as ChatGPT's typewriter effect.
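
Streaming prints the text but does not keep it. If you also need the complete response afterwards (for logging or further processing), you can accumulate the chunks while printing them. This is a sketch: stream_and_collect is a hypothetical helper, and it assumes each chunk's .content is a plain string, as in the example above. The stand-in chunks let it run without a server.

```python
from types import SimpleNamespace


def stream_and_collect(chunks, echo=True):
    """Print each chunk's text as it arrives and return the full response."""
    parts = []
    for chunk in chunks:
        text = chunk.content
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in chunks so the sketch runs without a server; with ChatOllama you
# would pass llm.stream("your prompt") instead.
fake_chunks = [SimpleNamespace(content=t) for t in ["Space", " and", " time."]]
full_text = stream_and_collect(fake_chunks)
```

Joining a list once at the end avoids repeated string concatenation inside the loop, which matters little for short answers but keeps the helper cheap for long generations.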


Inspecting Response Metadata

The response object returned by llm.invoke() exposes runtime statistics via response_metadata. This is useful for debugging latency, tracking token usage, and profiling your prompts:

PYTHON
response.response_metadata
JSON
{
  "model": "qwen3",
  "created_at": "2025-10-22T12:16:31.9173234Z",
  "done": true,
  "done_reason": "stop",
  "total_duration": 2630496300,
  "load_duration": 2149414600,
  "prompt_eval_count": 100,
  "prompt_eval_duration": 17446400,
  "eval_count": 140,
  "eval_duration": 397223200,
  "model_name": "qwen3",
  "model_provider": "ollama"
}

Key fields explained:

Field | Description
--- | ---
total_duration | End-to-end latency in nanoseconds (divide by 1e9 for seconds)
load_duration | Time spent loading the model into memory (nanoseconds); high on the first call, near zero after warmup
prompt_eval_count | Number of tokens in the input prompt
prompt_eval_duration | Time spent processing the prompt (nanoseconds)
eval_count | Number of tokens generated in the response
eval_duration | Time spent on generation (nanoseconds)
done_reason | Why generation stopped: "stop" means a natural end; "length" means num_predict was hit

Tip

To calculate tokens per second for generation:

PYTHON
tps = response.response_metadata['eval_count'] / (response.response_metadata['eval_duration'] / 1e9)
print(f"{tps:.1f} tokens/sec")
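
The same arithmetic extends to the other fields. The sketch below converts the nanosecond counters to seconds, computes prompt and generation throughput, and flags truncated responses via done_reason. summarize_metadata is a hypothetical helper, and the input dictionary is copied from the sample response_metadata shown earlier so it runs without a model.

```python
def summarize_metadata(meta):
    """Convert Ollama's nanosecond counters into seconds and throughput."""
    ns = 1e9

    def rate(count, duration_ns):
        # Guard against missing or zero durations.
        return count / (duration_ns / ns) if duration_ns else None

    return {
        "total_s": meta.get("total_duration", 0) / ns,
        "load_s": meta.get("load_duration", 0) / ns,
        "prompt_tokens_per_s": rate(meta.get("prompt_eval_count", 0),
                                    meta.get("prompt_eval_duration", 0)),
        "gen_tokens_per_s": rate(meta.get("eval_count", 0),
                                 meta.get("eval_duration", 0)),
        "truncated": meta.get("done_reason") == "length",
    }


# Values copied from the sample response_metadata shown above.
sample = {
    "done_reason": "stop",
    "total_duration": 2630496300,
    "load_duration": 2149414600,
    "prompt_eval_count": 100,
    "prompt_eval_duration": 17446400,
    "eval_count": 140,
    "eval_duration": 397223200,
}
summary = summarize_metadata(sample)
print(summary)
```

For the sample values, load_s is about 2.15 of the 2.63 seconds total, which illustrates the first-call model load dominating latency. With a live model you would pass response.response_metadata instead of sample.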

Quick Reference

Package Install

BASH
pip install langchain langchain-ollama python-dotenv

Minimal Working Example

PYTHON
from dotenv import load_dotenv
from langchain_ollama import ChatOllama

load_dotenv('.env')

llm = ChatOllama(
    model="qwen3",
    temperature=0.8,
    num_predict=256
)

# Single blocking call
response = llm.invoke("Explain transformers in 3 sentences.")
print(response.content)

# Streamed call
for chunk in llm.stream("Explain transformers in 3 sentences."):
    print(chunk.content, end="", flush=True)

ChatOllama Key Parameters

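Parameter | Default | Description
--- | --- | ---
model | (none) | Model name (required)
base_url | http://localhost:11434 | Ollama server address
temperature | 0.8 | Output randomness
num_predict | 128 | Max tokens to generate
top_p | 0.9 | Nucleus sampling threshold
top_k | 40 | Top-k sampling cutoff

The sampling defaults listed here are the values documented for Ollama, which the server applies when a parameter is left unset. They can differ between Ollama versions and models, so set explicitly any value your application depends on.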
