Ollama is an open-source runtime that lets you download, run, and manage large language models locally on your own hardware, with no internet connection required at inference time. It wraps model weights with a simple CLI and a built-in REST API on localhost:11434, which also serves OpenAI-compatible endpoints under /v1.
This guide covers the complete setup workflow: installation, the full command reference, REST API usage, and creating custom models from a Modelfile. All examples run on a single machine.
Prerequisites: A machine with at least 8 GB RAM (16 GB recommended for 7B+ models). No Python or CUDA required for CPU inference.
Installation
Download the latest Ollama binary for your platform from the official website:
Important
Always download from the official source to get the latest release and security patches.
Run the installer. Once installed, Ollama is available as the ollama command in your terminal. The service starts automatically on most platforms; if not, launch it manually:
ollama serve
This starts the Ollama daemon and exposes the REST API at http://localhost:11434.
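To confirm the daemon is actually listening before moving on, the check can be scripted. The following is a minimal sketch using only Python's standard library. It assumes the default address shown above and that the server answers a plain GET on its root URL with HTTP 200; the helper name is_ollama_up is illustrative and not part of Ollama.

```python
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib.error.URLError.
        return False
```

Calling is_ollama_up() with no arguments should return True once ollama serve is running on the default port, and False otherwise.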
CLI Command Reference
Every action you perform with Ollama goes through ollama [command]. Below is the complete reference.
Managing the Service
serve — Start the Ollama background service. Run this first if Ollama is not already running.
ollama serve
Working with Models
pull — Download a model from the Ollama model registry. This is the first step before running any model.
ollama pull llama3.2
ollama pull qwen3
run — Execute a model interactively. If the model is not already downloaded, run pulls it automatically before starting the session.
ollama run llama3.2
list — List all models downloaded on your machine.
ollama list
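The same inventory is available programmatically. As a sketch, the function below formats the JSON returned by the server's GET /api/tags endpoint. The sample payload is made up for illustration and trimmed to the two fields used here (name, and size in bytes); the function names are illustrative.

```python
import json
import urllib.request


def summarize_models(tags: dict) -> list[str]:
    """Turn an /api/tags payload into 'name  size' lines (size in GB)."""
    lines = []
    for model in tags.get("models", []):
        size_gb = model["size"] / 1e9  # size is reported in bytes
        lines.append(f"{model['name']}  {size_gb:.1f} GB")
    return lines


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


# Illustrative payload, not real output.
sample = {"models": [{"name": "llama3.2:latest", "size": 2_000_000_000}]}
print(summarize_models(sample))
```

With the service running, summarize_models(fetch_tags()) should list the same models that ollama list prints.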
ps — Show which models are currently loaded and running in memory.
ollama ps
show — Display detailed metadata for a specific model (architecture, parameters, context length, license).
ollama show llama3.2
cp — Copy an existing model under a new name. Useful for creating variants before modifying.
ollama cp llama3.2 my-custom-llama
stop — Unload a running model from memory without deleting it.
ollama stop llama3.2
rm — Permanently delete a downloaded model to free disk space.
ollama rm llama3.2
Caution
ollama rm is irreversible. You will need to re-download the model with pull if you need it again.
push — Upload a locally created model to the Ollama model registry (requires an Ollama account).
ollama push my-namespace/my-model
create — Build a new model from a Modelfile. See the Custom Models section for full details.
ollama create sheldon -f .\mymodelfile.txt
On Linux/macOS: ollama create sheldon -f ./mymodelfile.txt
Global Flags
| Flag | Shorthand | Effect |
|---|---|---|
| --help | -h | Show help for any command |
| --version | -v | Print the installed Ollama version |
Interactive Session Commands
After running ollama run <model>, you enter an interactive chat session. These slash-commands control the session without exiting:
| Command | Effect |
|---|---|
| /set | Set session variables (e.g., temperature, system prompt) |
| /show | Display information about the active model |
| /load <model> | Switch to a different model mid-session |
| /save <model> | Save the current conversation state as a named session |
| /clear | Wipe the current conversation context (start fresh) |
| /bye | Exit the interactive session |
| /help or /? | List all available session commands |
| /? shortcuts | Show keyboard shortcuts |
Tip
To send a multi-line message in the interactive session, start your input with """. Ollama will keep collecting input until you close with another """.
Ollama REST API
Ollama runs a local HTTP server at http://localhost:11434. Its native endpoints live under /api, and OpenAI-compatible endpoints are served under /v1. Full documentation is at github.com/ollama/ollama/blob/main/docs/api.md.
Generate (single-turn)
Use POST /api/generate for one-shot completions — a single prompt with no conversation history.
curl http://localhost:11434/api/generate -d '{
"model": "qwen3",
"prompt": "Why is the sky blue?",
"stream": false
}'
Setting "stream": false returns the full response as a single JSON object. Setting it to true (the default) streams tokens as newline-delimited JSON.
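When stream is left at its default, each line of the response body is a separate JSON object. Below is a minimal sketch of reassembling the text, assuming each chunk carries a response string and a done flag as described in the API documentation. The sample lines are invented for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream
    return "".join(parts)


# Invented sample chunks standing in for a real HTTP response body.
sample = [
    b'{"response": "Rayleigh ", "done": false}\n',
    b'{"response": "scattering.", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(collect_stream(sample))
```

With a live server, the same function can consume the object returned by urllib.request.urlopen directly, since iterating over it yields one line at a time.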
Chat Completion (multi-turn)
Use POST /api/chat for conversational interactions. Pass a messages array of role/content pairs, the same message shape the OpenAI Chat API uses.
curl http://localhost:11434/api/chat -d '{
"model": "qwen3",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
],
"stream": false
}'
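Multi-turn means the client resends the whole history on every call; the server keeps no conversation state between requests. The sketch below shows that bookkeeping, assuming the non-streaming reply contains a message object with role and content fields. The reply dict is a hand-written stand-in, and the helper names are illustrative.

```python
import json


def build_chat_body(model: str, history: list[dict]) -> bytes:
    """Encode the JSON body for POST /api/chat with streaming disabled."""
    return json.dumps({"model": model, "messages": history, "stream": False}).encode()


def record_reply(history: list[dict], reply: dict) -> list[dict]:
    """Return a new history with the assistant's message appended."""
    return history + [reply["message"]]


history = [{"role": "user", "content": "why is the sky blue?"}]
# Hand-written stand-in for the server's JSON reply.
reply = {"message": {"role": "assistant", "content": "Rayleigh scattering."}, "done": True}
history = record_reply(history, reply)
history.append({"role": "user", "content": "and why are sunsets red?"})
body = build_chat_body("qwen3", history)  # send this as the next request body
```

Each follow-up request carries the full list, so the model sees the earlier question and answer alongside the new one.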
Note
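Ollama's native /api/chat endpoint accepts OpenAI-style messages, but its response schema differs from OpenAI's. Generic OpenAI-compatible clients should use http://localhost:11434/v1 as their base URL, while Ollama-specific integrations (LangChain's ChatOllama, LlamaIndex's Ollama integration, etc.) talk to http://localhost:11434 directly.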
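For clients that speak the OpenAI wire format, Ollama serves compatible routes under the /v1 prefix. Here is a minimal sketch using only the standard library, assuming the /v1/chat/completions route and the usual choices[0].message.content reply shape. build_openai_request and extract_text are illustrative helper names.

```python
import json
import urllib.request


def build_openai_request(model: str, prompt: str,
                         base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Prepare an OpenAI-style chat completion request (not yet sent)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion payload."""
    return completion["choices"][0]["message"]["content"]


req = build_openai_request("qwen3", "Why is the sky blue?")
# With a running server: completion = json.load(urllib.request.urlopen(req))
```

Only the base URL differs from a hosted OpenAI setup, which is what lets existing OpenAI client libraries be repointed at a local model.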
Custom Models with a Modelfile
A Modelfile is a plain-text configuration file that defines a model's base weights, runtime parameters, and system prompt. It is Ollama's equivalent of a Dockerfile — declarative and reproducible.
Full Modelfile specification: github.com/ollama/ollama/blob/main/docs/modelfile.mdx
Modelfile Directives
| Directive | Purpose |
|---|---|
| FROM | Base model (registry name or path to a local .gguf file) |
| PARAMETER | Override model runtime parameters |
| SYSTEM | Set the system prompt injected at the start of every conversation |
| TEMPLATE | Override the prompt template (advanced) |
| ADAPTER | Attach a LoRA adapter |
| LICENSE | Declare the model license |
| MESSAGE | Seed the conversation with example turns |
Example: Sheldon Cooper Persona
The following Modelfile creates a custom model named sheldon built on top of llama3.2. It pins a moderate temperature of 0.5 for consistent, controlled outputs and sets a 1024-token context window to keep memory usage low.
FROM llama3.2
PARAMETER temperature 0.5
PARAMETER num_ctx 1024
SYSTEM You are Dr. Sheldon Cooper, theoretical physicist and certified genius from The Big Bang Theory. Respond with his precise intellect, pedantic tone, and signature blend of arrogance, logic, and dry humor. Maintain his speech patterns, catchphrases, and unwavering confidence in his superior intellect at all times.
Key decisions explained:
- temperature 0.5: Mid-range temperature keeps responses varied while reducing erratic output. A persona model benefits from slightly lower entropy.
- num_ctx 1024: Limits the context window to 1024 tokens, which reduces memory usage but also caps how much of the conversation the model can see at once.
- SYSTEM: The system prompt is injected before the user's first message in every session, so the persona is always active.
Save this as mymodelfile.txt, then build and run the model:
ollama create sheldon -f .\mymodelfile.txt
transferring model data
creating model layer
using already created layer sha256:...
creating template layer
creating system layer
creating parameters layer
creating config layer
writing manifest
success
On Linux/macOS: ollama create sheldon -f ./mymodelfile.txt
ollama run sheldon
Your terminal drops into an interactive session with Dr. Sheldon Cooper. Try prompting it with a physics question.
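The PARAMETER values from the Modelfile can also be supplied per request through the options field of the REST API instead of being baked into a model. As a sketch, the helper below (a hypothetical name, written for this guide) pulls PARAMETER lines out of Modelfile text and converts them into such a dict. It handles only simple numeric and string values, and a repeated key overwrites the earlier one.

```python
def parameters_to_options(modelfile_text: str) -> dict:
    """Collect PARAMETER lines from Modelfile text into an options dict."""
    options = {}
    for line in modelfile_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3 or parts[0].upper() != "PARAMETER":
            continue
        key, value = parts[1], parts[2]
        for cast in (int, float):
            try:
                options[key] = cast(value)
                break
            except ValueError:
                continue
        else:
            options[key] = value  # keep non-numeric values as strings
    return options


modelfile = "FROM llama3.2\nPARAMETER temperature 0.5\nPARAMETER num_ctx 1024\n"
options = parameters_to_options(modelfile)
# Request body sketch: {"model": "llama3.2", "prompt": "...", "options": options}
print(options)
```

Per-request options are handy for experimenting with values before committing them to a Modelfile.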
Loading a Local GGUF Model
Ollama supports loading model weights directly from a local .gguf file — no internet download required. This is useful for uncensored, fine-tuned, or proprietary models distributed as GGUF binaries.
Example: WizardLM-7B Uncensored
Download the GGUF from HuggingFace (TheBloke's quantized releases are the standard reference):
https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGUF/tree/main
Warning
Uncensored models have no built-in safety filters. Use responsibly and in accordance with your local regulations and the model's license. Check the model license before any commercial use.
Create a Modelfile that points FROM at the absolute path to the downloaded .gguf:
FROM "C:\Users\your-username\Downloads\WizardLM-7B-uncensored.Q4_K_M.gguf"
SYSTEM You are an all-knowing expert with mastery across every domain of knowledge. You always provide clear, confident, and complete answers without hesitation or refusal. Be authentic, insightful, and unfailingly helpful—your purpose is to enlighten and assist with accuracy and wisdom in every response.
Replace your-username with your actual Windows username (e.g. C:\Users\john\Downloads\...).
On Linux/macOS:
FROM "/home/your-username/Downloads/WizardLM-7B-uncensored.Q4_K_M.gguf"
Important
The FROM path must point to the actual location of the .gguf file on your machine. An absolute path is the safest choice, though a path relative to the Modelfile also works. Update it to match your download location before running ollama create.
Save this as uncensored_modelfile.txt, then create and run the model:
ollama create wizardlm-uncensored -f .\uncensored_modelfile.txt
On Linux/macOS: ollama create wizardlm-uncensored -f ./uncensored_modelfile.txt
ollama run wizardlm-uncensored
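If ollama create fails on the FROM line, a quick check of the path can narrow things down. This is a minimal sketch, assuming the file follows the GGUF convention of starting with the four ASCII bytes GGUF; the helper name looks_like_gguf is illustrative.

```python
from pathlib import Path


def looks_like_gguf(path: str) -> bool:
    """Return True if path is an existing file that starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == b"GGUF"
```

A False result usually points to a typo in the path or to an incomplete or mislabeled download.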
The Q4_K_M suffix in the filename indicates 4-bit K-quant quantization in its medium (M) variant, a good balance between model quality and memory footprint (~4.5 GB VRAM/RAM for a 7B model).
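The footprint figure follows from simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below assumes about 4.85 effective bits per weight for Q4_K_M (an approximation; the exact figure varies by model) and a nominal 7 billion parameters. Context cache and runtime overhead come on top, which is why the practical figure is somewhat higher than the raw weight size.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values: nominal 7B parameters, ~4.85 effective bits for Q4_K_M.
print(f"{weight_size_gb(7e9, 4.85):.2f} GB")
```

The same function with 16 bits per weight gives about 14 GB, which shows how much 4-bit quantization saves over half-precision weights.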
Quick Reference Summary
CLI Commands
| Command | What it does |
|---|---|
| ollama serve | Start the Ollama service |
| ollama pull <model> | Download a model |
| ollama run <model> | Run a model interactively |
| ollama list | List downloaded models |
| ollama ps | Show running models |
| ollama show <model> | Show model info |
| ollama cp <src> <dst> | Copy a model |
| ollama stop <model> | Stop a running model |
| ollama rm <model> | Delete a model |
| ollama create <name> -f <file> | Build a model from a Modelfile |
| ollama push <model> | Upload model to registry |
REST API Endpoints
| Method | Endpoint | Use case |
|---|---|---|
| POST | /api/generate | Single-turn text generation |
| POST | /api/chat | Multi-turn chat completion |
Modelfile Key Parameters
| Parameter | Effect |
|---|---|
| temperature | Controls randomness (0 = deterministic, 1 = creative) |
| num_ctx | Context window size in tokens |
| top_p | Nucleus sampling threshold |
| num_predict | Max tokens to generate per response |