#ollama #llm #local-ai #modelfile #cli #gguf #langchain #open-source

Ollama Setup Guide

Install Ollama, master every CLI command, call the REST API, and build a custom persona model using a Modelfile — all locally.

Jun 4, 2026 at 10:30 AM · 8 min read

Topics You Will Master

  • Installing Ollama and starting the local inference server
  • Using every ollama CLI command (serve, pull, run, create, stop, rm, ps, list, cp, push, show)
  • Navigating the interactive session commands (/set, /load, /save, /clear, /bye)
  • Calling Ollama's REST API endpoints for text generation and chat completion
  • Writing a Modelfile to define a custom LLM persona with system prompts and parameter tuning
  • Loading a local GGUF model file (e.g., WizardLM-7B uncensored) into Ollama

Best For

Developers and ML practitioners who want to run, manage, and customize large language models entirely on their local machine without cloud dependencies.

Expected Outcome

After finishing this guide you can install Ollama, pull any supported model, build and deploy a custom persona model (or load a local GGUF), and interact with it through the CLI, interactive shell, or REST API.

Ollama is an open-source runtime that lets you download, run, and manage large language models locally on your own hardware. Once a model is downloaded, inference needs no internet connection. Ollama wraps model weights with a simple CLI and a built-in REST API on localhost:11434; the native endpoints live under /api, and OpenAI-compatible routes are served under /v1.

This guide covers the complete setup workflow: installation, the full command reference, REST API usage, and creating custom models from a Modelfile. All examples run on a single machine.

Prerequisites: A machine with at least 8 GB RAM (16 GB recommended for 7B+ models). No Python or CUDA required for CPU inference.

Installation

Download the latest Ollama installer for your platform from the official website:

https://ollama.com

Important

Always download from the official source so that you get the current release and its security patches.

Run the installer. Once installed, Ollama is available as the ollama command in your terminal. The service starts automatically on most platforms; if not, launch it manually:

BASH
ollama serve

This starts the Ollama daemon and exposes the REST API at http://localhost:11434.
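If you want to confirm from code that the daemon is reachable, a request to the base URL is enough. The following is a minimal sketch using only Python's standard library. The helper name is_ollama_up is hypothetical, and the sketch assumes a default install that answers a plain GET on the base URL with HTTP 200.

```python
import http.client
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url.

    Assumption: a default Ollama install replies to GET / with HTTP 200.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and urllib's URLError/HTTPError
        return False
```

Calling is_ollama_up() before sending any API request gives a clear signal that ollama serve still needs to be started, instead of a raw connection error later on.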


CLI Command Reference

Every action you perform with Ollama goes through ollama [command]. Below is the complete reference.

Managing the Service

serve — Start the Ollama background service. Run this first if Ollama is not already running.

BASH
ollama serve

Working with Models

pull — Download a model from the Ollama model registry. This is the first step before running any model.

BASH
ollama pull llama3.2
ollama pull qwen3
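Downloads can also be started over HTTP with POST /api/pull, which streams one JSON status object per line while the layers transfer. The sketch below only formats those lines. The function name pull_progress is hypothetical, and the field names status, total, and completed are assumptions based on the public API documentation, so verify them against the version you run.

```python
import json


def pull_progress(line: str) -> str:
    """Turn one status line from a pull stream into a short progress string.

    Assumed line shape: {"status": "...", "total": <bytes>, "completed": <bytes>}
    where the byte counters are only present while a layer is downloading.
    """
    event = json.loads(line)
    status = event.get("status", "")
    total = event.get("total")
    completed = event.get("completed", 0)
    if total:
        return f"{status}: {100 * completed / total:.1f}%"
    return status
```

Lines without byte counters (for example a manifest step) are returned unchanged, so the same function can be applied to every line of the stream.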

run — Execute a model interactively. If the model is not already downloaded, run pulls it automatically before starting the session.

BASH
ollama run llama3.2

list — List all models downloaded on your machine.

BASH
ollama list
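The same inventory is available programmatically through GET /api/tags. Here is a small sketch, assuming the response body has the shape {"models": [{"name": ...}, ...]} described in the API documentation. Both function names are hypothetical.

```python
import json
import urllib.request


def model_names(tags: dict) -> list:
    """Extract sorted model names from an /api/tags response body."""
    return sorted(model["name"] for model in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server (the server must be up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

With the server running, model_names(fetch_tags()) returns the names you would otherwise read off the first column of ollama list.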

ps — Show which models are currently loaded and running in memory.

BASH
ollama ps

show — Display detailed metadata for a specific model (architecture, parameters, context length, license).

BASH
ollama show llama3.2

cp — Copy an existing model under a new name. Useful for creating variants before modifying.

BASH
ollama cp llama3.2 my-custom-llama

stop — Unload a running model from memory without deleting it.

BASH
ollama stop llama3.2

rm — Permanently delete a downloaded model to free disk space.

BASH
ollama rm llama3.2

Caution

ollama rm is irreversible. You will need to re-download the model with pull if you need it again.

push — Upload a locally created model to the Ollama model registry (requires an Ollama account).

BASH
ollama push my-namespace/my-model

create — Build a new model from a Modelfile. See the Custom Models section for full details.

BASH
ollama create sheldon -f .\mymodelfile.txt

On Linux/macOS: ollama create sheldon -f ./mymodelfile.txt

Global Flags

| Flag | Shorthand | Effect |
| --- | --- | --- |
| --help | -h | Show help for any command |
| --version | -v | Print the installed Ollama version |

Interactive Session Commands

After running ollama run <model>, you enter an interactive chat session. These slash-commands control the session without exiting:

| Command | Effect |
| --- | --- |
| /set | Set session variables (e.g., temperature, system prompt) |
| /show | Display information about the active model |
| /load <model> | Switch to a different model or saved session mid-session |
| /save <model> | Save the current session under a model name so it can be loaded again later |
| /clear | Wipe the current conversation context (start fresh) |
| /bye | Exit the interactive session |
| /help or /? | List all available session commands |
| /? shortcuts | Show keyboard shortcuts |

Tip

To send a multi-line message in the interactive session, start your input with """. Ollama will keep collecting input until you close with another """.


Ollama REST API

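Ollama runs a local HTTP server at http://localhost:11434. Its native REST API lives under /api (the endpoints covered below), and a separate set of OpenAI-compatible routes is served under /v1. Full documentation is at github.com/ollama/ollama/blob/main/docs/api.md.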

Generate (single-turn)

Use POST /api/generate for one-shot completions — a single prompt with no conversation history.

BASH
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Setting "stream": false returns the full response as a single JSON object. Setting it to true (the default) streams tokens as newline-delimited JSON.
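To make the streaming behavior concrete, here is a sketch in standard-library Python that reads the newline-delimited JSON and joins the fragments. The function names are hypothetical, and the sketch assumes each streamed object carries a "response" fragment and that the last one sets "done" to true.

```python
import json
import urllib.request
from typing import Iterable

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def generate_payload(model: str, prompt: str, stream: bool = True) -> bytes:
    """Encode the JSON body for POST /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the "response" fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # assumed: the final object carries done=true
    return "".join(parts)


def stream_generate(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send a streaming request to a running server and return the joined text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Iterating the response yields one bytes line per streamed JSON object
        return join_stream(raw.decode("utf-8") for raw in response)
```

With the server running and the model pulled, stream_generate("qwen3", "Why is the sky blue?") returns the full answer. In a real UI you would print each fragment as it arrives rather than joining them at the end.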

Chat Completion (multi-turn)

Use POST /api/chat for conversational interactions. Pass a messages array of role/content pairs, the same message shape the OpenAI Chat API uses. The response body follows Ollama's own schema rather than OpenAI's.

BASH
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "stream": false
}'
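A multi-turn conversation works by resending the whole messages array on every call. The sketch below keeps that history in a Python list. The helper names are hypothetical, and it assumes a non-streaming /api/chat response exposes the reply under a "message" object with "role" and "content" keys.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def chat_request(model: str, messages: list, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST /api/chat request for the given history."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, history: list, user_text: str, base_url: str = OLLAMA_URL) -> str:
    """Append a user turn, send the full history, and record the assistant reply."""
    history.append({"role": "user", "content": user_text})
    with urllib.request.urlopen(chat_request(model, history, base_url)) as response:
        reply = json.load(response)["message"]  # assumed response shape
    history.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]
```

Start with history = [] and call ask("qwen3", history, "why is the sky blue?") followed by a second question. Because the first exchange is still in history, the model can resolve references such as "explain that more simply".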

Note

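Ollama also serves OpenAI-compatible routes under /v1 (for example /v1/chat/completions). To reuse an OpenAI-compatible client, set its base URL to http://localhost:11434/v1; most clients insist on an API key, but Ollama ignores its value, so any placeholder works. Native integrations such as LangChain's ChatOllama talk to the /api endpoints instead and take http://localhost:11434 as their base URL.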
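For clients that speak the OpenAI schema, the compatible routes live under the /v1 prefix of the same server. Here is a hedged sketch of such a call with the standard library only. The function names are hypothetical, and the "ollama" API key is a placeholder, on the assumption that the server does not validate it.

```python
import json
import urllib.request


def openai_chat_request(
    model: str,
    messages: list,
    base_url: str = "http://localhost:11434/v1",
    api_key: str = "ollama",  # placeholder; assumed to be ignored by Ollama
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    body = json.dumps({"model": model, "messages": messages})
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def first_choice_text(completion: dict) -> str:
    """Read the assistant text from an OpenAI-style completion body."""
    return completion["choices"][0]["message"]["content"]
```

Note the difference from the native API: the reply sits under choices[0].message.content here, whereas /api/chat returns it under message.content.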


Custom Models with a Modelfile

A Modelfile is a plain-text configuration file that defines a model's base weights, runtime parameters, and system prompt. It is Ollama's equivalent of a Dockerfile — declarative and reproducible.

Full Modelfile specification: github.com/ollama/ollama/blob/main/docs/modelfile.mdx

Modelfile Directives

| Directive | Purpose |
| --- | --- |
| FROM | Base model (registry name or path to a local .gguf file) |
| PARAMETER | Override model runtime parameters |
| SYSTEM | Set the system prompt injected at the start of every conversation |
| TEMPLATE | Override the prompt template (advanced) |
| ADAPTER | Attach a LoRA adapter |
| LICENSE | Declare the model license |
| MESSAGE | Seed the conversation with example turns |
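If you generate Modelfiles from scripts (for instance, one persona per team), the three most common directives can be rendered from plain values. This is only a sketch: render_modelfile is a hypothetical helper, it covers FROM, PARAMETER, and SYSTEM only, and it assumes the system text itself contains no triple quotes.

```python
from typing import Mapping, Optional


def render_modelfile(
    base: str,
    parameters: Optional[Mapping[str, object]] = None,
    system: Optional[str] = None,
) -> str:
    """Render a minimal Modelfile with FROM, PARAMETER, and SYSTEM directives."""
    lines = [f"FROM {base}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes let the SYSTEM text span several lines;
        # assumption: the text does not contain triple quotes itself
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file and passing that file to ollama create -f gives the same result as authoring the Modelfile by hand.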

Example: Sheldon Cooper Persona

The following Modelfile creates a custom model named sheldon built on top of llama3.2. It sets a moderate temperature for more consistent outputs and a 1024-token context window to limit memory use.

DOCKERFILE
FROM llama3.2

PARAMETER temperature 0.5
PARAMETER num_ctx 1024

SYSTEM You are Dr. Sheldon Cooper, theoretical physicist and certified genius from The Big Bang Theory. Respond with his precise intellect, pedantic tone, and signature blend of arrogance, logic, and dry humor. Maintain his speech patterns, catchphrases, and unwavering confidence in his superior intellect at all times.

Key decisions explained:

  • temperature 0.5: A moderate temperature leaves some variety in the wording while making sampling less random, so the persona stays consistent from one reply to the next.
  • num_ctx 1024: Limits the context window to 1024 tokens, which reduces memory usage. The window covers the system prompt, the conversation history, and the reply, so long chats lose their earliest turns sooner.
  • SYSTEM: The system prompt is injected before the user's first message in every session, ensuring the persona is always active.
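The same two parameters can also be tried per request, without rebuilding the model, by sending an "options" object to the REST API. The sketch below only builds the request body. The function name is hypothetical, and the default values simply mirror the Modelfile above.

```python
def chat_body_with_options(
    model: str,
    prompt: str,
    temperature: float = 0.5,
    num_ctx: int = 1024,
) -> dict:
    """Build a /api/chat body that overrides runtime parameters for one call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Per-request overrides; these use the same names as PARAMETER lines
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
```

This is a convenient way to experiment with values before pinning the ones you like in a Modelfile.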

Save this as mymodelfile.txt, then build and run the model:

BASH
ollama create sheldon -f .\mymodelfile.txt
OUTPUT
transferring model data
creating model layer
using already created layer sha256:...
creating template layer
creating system layer
creating parameters layer
creating config layer
writing manifest
success

On Linux/macOS: ollama create sheldon -f ./mymodelfile.txt

BASH
ollama run sheldon

Your terminal drops into an interactive session with Dr. Sheldon Cooper. Try prompting it with a physics question.


Loading a Local GGUF Model

Ollama supports loading model weights directly from a local .gguf file — no internet download required. This is useful for uncensored, fine-tuned, or proprietary models distributed as GGUF binaries.

Example: WizardLM-7B Uncensored

Download the GGUF from Hugging Face (TheBloke's quantized releases are a widely used source):

https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGUF/tree/main

Warning

Uncensored models have no built-in safety filters. Use responsibly and in accordance with your local regulations and the model's license. Check the model license before any commercial use.

Create a Modelfile that points FROM at the absolute path to the downloaded .gguf:

DOCKERFILE
FROM "C:\Users\your-username\Downloads\WizardLM-7B-uncensored.Q4_K_M.gguf"

SYSTEM You are an all-knowing expert with mastery across every domain of knowledge. You always provide clear, confident, and complete answers without hesitation or refusal. Be authentic, insightful, and unfailingly helpful—your purpose is to enlighten and assist with accuracy and wisdom in every response.

Replace your-username with your actual Windows username (e.g. C:\Users\john\Downloads\...).

On Linux/macOS: FROM "/home/your-username/Downloads/WizardLM-7B-uncensored.Q4_K_M.gguf"

Important

The FROM path must resolve to the .gguf file on your machine. An absolute path is the safest choice; a path relative to the Modelfile's location also works. Update it to match your actual download location before running ollama create.
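Before running ollama create, it can save time to check that the path really points at a GGUF file, since an interrupted download or an HTML error page saved under a .gguf name will fail later. GGUF files begin with the four ASCII bytes "GGUF", which the sketch below tests for. Both function names are hypothetical, and the check says nothing about whether the rest of the file is intact.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def has_gguf_magic(header: bytes) -> bool:
    """Return True if the given leading bytes start with the GGUF magic."""
    return header[:4] == GGUF_MAGIC


def looks_like_gguf(path: str) -> bool:
    """Return True if path is an existing file whose first bytes are the GGUF magic."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        return has_gguf_magic(handle.read(4))
```

Pass the same path you put after FROM. A False result usually means a typo in the path or an incomplete download.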

Save this as uncensored_modelfile.txt, then create and run the model:

BASH
ollama create wizardlm-uncensored -f .\uncensored_modelfile.txt

On Linux/macOS: ollama create wizardlm-uncensored -f ./uncensored_modelfile.txt

BASH
ollama run wizardlm-uncensored

The Q4_K_M in the filename indicates 4-bit K-quant quantization in its medium variant, a good balance between model quality and memory footprint (about 4.5 GB of RAM or VRAM for a 7B model; the exact figure depends on context size).
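The memory figure can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.85 bits per weight for Q4_K_M, an approximate average rather than an exact constant, and it ignores the KV cache and runtime overhead that come on top.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_K_M averages roughly 4.85 bits per weight
Q4_K_M_BITS = 4.85

q4_size = estimate_weights_gb(7e9, Q4_K_M_BITS)  # 7B model, 4-bit K-quant
fp16_size = estimate_weights_gb(7e9, 16)         # same model at 16-bit precision
```

For a 7B model this gives a little over 4 GB for the quantized weights versus 14 GB at 16-bit precision, which is consistent with the footprint quoted above once context memory is added.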


Quick Reference Summary

CLI Commands

| Command | What it does |
| --- | --- |
| ollama serve | Start the Ollama service |
| ollama pull <model> | Download a model |
| ollama run <model> | Run a model interactively |
| ollama list | List downloaded models |
| ollama ps | Show running models |
| ollama show <model> | Show model info |
| ollama cp <src> <dst> | Copy a model |
| ollama stop <model> | Stop a running model |
| ollama rm <model> | Delete a model |
| ollama create <name> -f <file> | Build a model from a Modelfile |
| ollama push <model> | Upload model to registry |

REST API Endpoints

| Method | Endpoint | Use case |
| --- | --- | --- |
| POST | /api/generate | Single-turn text generation |
| POST | /api/chat | Multi-turn chat completion |

Modelfile Key Parameters

| Parameter | Effect |
| --- | --- |
| temperature | Controls randomness (lower = more deterministic, higher = more varied) |
| num_ctx | Context window size in tokens |
| top_p | Nucleus sampling threshold |
| num_predict | Max tokens to generate per response |
