Introduction
Running large language models locally has become practical thanks to quantization techniques, efficient inference engines, and a thriving open-source ecosystem. Whether for privacy, cost savings, or offline availability, local LLMs offer a compelling alternative to cloud APIs for many workloads. This guide covers two of the most popular local LLM platforms: Ollama and LM Studio.

Ollama
Ollama is one of the most popular tools for running LLMs locally, known for its simplicity and command-line focus.
Installation
macOS
brew install ollama
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
Download from https://ollama.ai/download
Getting Started
Ollama makes running a model a single command:
# Pull and run a model
ollama run llama3.2:3b
# List the models already downloaded to this machine
ollama list
# Pull a specific model without running it
ollama pull mistral:7b
Popular Models for Ollama
| Model | Size | RAM Required | Best For |
|-------|------|-------------|----------|
| Llama 3.2 3B | 2.0 GB | 4 GB | Fast responses, simple tasks |
| Llama 3.1 8B | 4.7 GB | 8 GB | General purpose Q&A |
| Mistral 7B | 4.1 GB | 8 GB | Code, reasoning, instruction following |
| Qwen2.5 7B | 4.8 GB | 8 GB | Strong multilingual, coding |
| Mixtral 8x7B | 26 GB | 48 GB | High quality, close to GPT-3.5 |
| DeepSeek-R1 7B | 4.5 GB | 8 GB | Strong reasoning, step-by-step |
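To make the table easier to act on, here is a minimal sketch that filters it by available memory. The figures are copied from the table above; the `MODELS` list and the `models_that_fit` helper are illustrative names, not part of Ollama.

```python
# Model data copied from the table above:
# (name, download size in GB, RAM required in GB).
MODELS = [
    ("Llama 3.2 3B", 2.0, 4),
    ("Llama 3.1 8B", 4.7, 8),
    ("Mistral 7B", 4.1, 8),
    ("Qwen2.5 7B", 4.8, 8),
    ("Mixtral 8x7B", 26.0, 48),
    ("DeepSeek-R1 7B", 4.5, 8),
]


def models_that_fit(ram_gb):
    """Return the names of models whose listed RAM requirement is within ram_gb."""
    return [name for name, _size, ram in MODELS if ram <= ram_gb]


if __name__ == "__main__":
    for ram in (4, 8, 48):
        print(f"{ram} GB RAM:", ", ".join(models_that_fit(ram)))
```

The RAM column is a rule of thumb, so treat the result as a shortlist to try rather than a guarantee.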
Using Ollama Programmatically
Ollama provides a REST API at http://localhost:11434:
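import requests
# Send a single non-streaming generation request to the local Ollama server
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "Explain quantum computing in three sentences.",
    "stream": False
})
response.raise_for_status()  # fail loudly on HTTP errors
# The generated text is in the "response" field of the JSON body
print(response.json()["response"])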
Or use the official Python library:
import ollama
response = ollama.chat(model="llama3.2:3b", messages=[
{"role": "user", "content": "What is the capital of France?"}
])
print(response["message"]["content"])
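The REST example above sets `"stream": False`. With `"stream": True`, the `/api/generate` endpoint instead sends newline-delimited JSON objects, each carrying a text fragment in `"response"`, with `"done": true` on the last one. The sketch below shows one way to consume that format. The helper names `iter_fragments` and `stream_generate` are made up for this example, and the network call assumes an Ollama server is running on the default port.

```python
import json


def iter_fragments(lines):
    """Yield the "response" text from each newline-delimited JSON chunk."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_generate(prompt, model="llama3.2:3b",
                    url="http://localhost:11434/api/generate"):
    """Print fragments as they arrive from a local Ollama server."""
    import requests  # imported here so iter_fragments works without it

    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for fragment in iter_fragments(resp.iter_lines()):
            print(fragment, end="", flush=True)
    print()


if __name__ == "__main__":
    # Hand-written sample chunks so the parser can be tried without a server.
    sample = [
        b'{"response": "Hello", "done": false}',
        b'{"response": " world", "done": false}',
        b'{"response": "", "done": true}',
    ]
    print("".join(iter_fragments(sample)))
```

Keeping the parsing separate from the HTTP call makes it possible to test the parser offline.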
Custom Modelfiles
Create custom models with system prompts and parameters:
FROM llama3.1:8b
# Set system prompt
SYSTEM "You are a helpful coding assistant. Provide concise code examples."
# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
Build and run:
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
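If you maintain several custom models, generating the Modelfile from a script can keep them consistent. The following is a small sketch that renders the same `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above. The `render_modelfile` function is a hypothetical helper written for this article, not something Ollama provides.

```python
def render_modelfile(base_model, system_prompt, parameters):
    """Render Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base_model}"]
    # Triple quotes let the system prompt contain double quotes or newlines.
    lines.append(f'SYSTEM """{system_prompt}"""')
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.1:8b",
        "You are a helpful coding assistant. Provide concise code examples.",
        {"temperature": 0.3, "top_p": 0.9},
    )
    print(text)
    # Save the text as a file named Modelfile, then build it with:
    #   ollama create my-coding-assistant -f Modelfile
```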
LM Studio
LM Studio is a GUI-focused alternative that excels for users who prefer visual interfaces and easy model browsing.
Key Features
- Built-in model browser: Search and download models from Hugging Face
- GUI chat interface: Familiar ChatGPT-like experience
- Local API server: OpenAI-compatible API endpoint
- Model configuration: Easy sliders for context length, GPU offloading, and temperature
- Multi-model support: Load multiple models and switch between them
Setup
1. Download from lmstudio.ai
2. Open the app and browse the model catalog
3. Download a model (start with Llama 3.2 3B or Mistral 7B)
4. Load the model and start chatting
API Server
LM Studio can serve models via an OpenAI-compatible API:
http://localhost:1234/v1/chat/completions
This means any tool that works with OpenAI's API can use your local model by changing the base URL:
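from openai import OpenAI
# Point the OpenAI client at the local LM Studio server instead of the cloud API
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # placeholder; the client only requires that a value is set
)
response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)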
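To see what "changing the base URL" amounts to on the wire, the sketch below builds the equivalent HTTP request using only the standard library. It constructs the request without sending it, so it runs even when no server is up. `build_chat_request` is an illustrative helper name, and `"local-model"` is the same placeholder used above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, api_key="not-needed"):
    """Build (but do not send) a POST request for an OpenAI-style chat completion."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("http://localhost:1234/v1", "local-model", "Hello!")
    print(req.get_method(), req.full_url)
    # To actually send it (requires the LM Studio server to be running):
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because only the base URL differs, the same function can target any other server that implements the same chat completions route.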
Performance Optimization
Quantization Levels
Quantization reduces model size at the cost of some accuracy:
- Q4_K_M: Best balance of quality and size (4-bit, recommended)
- Q5_K_M: Higher quality, larger (5-bit, use if you have RAM headroom)
- Q8_0: Near-full quality, about twice the RAM of Q4_K_M (8-bit)
- Q2_K: Minimal RAM, noticeable quality loss
Rule of thumb: model size scales roughly with the number of bits per weight, so going from 4-bit to 8-bit about doubles the size for a comparatively small quality gain. Q4_K_M is the sweet spot for most hardware.
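The arithmetic behind these size differences is simple: parameters times bits per weight, divided by 8 bits per byte. The sketch below applies it using nominal bit widths, which is an assumption made for illustration. Real K-quant files mix precisions and store extra metadata, so actual downloads come out somewhat larger (the table earlier lists Mistral 7B at 4.1 GB, for example).

```python
def estimate_size_gb(params_billion, bits_per_weight):
    """Approximate model size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths only (an assumption); real files are somewhat larger.
NOMINAL_BITS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "FP16": 16}

if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"7B at {name}: ~{estimate_size_gb(7, bits):.1f} GB")
```

Add a few gigabytes on top of the weights for the context cache and runtime overhead when deciding whether a model fits.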
GPU Acceleration
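Both Ollama and LM Studio support GPU acceleration on NVIDIA (CUDA), Apple Silicon (Metal), and AMD GPUs (ROCm or Vulkan, depending on the tool and platform):
# Ollama uses Metal automatically on Apple Silicon
# For NVIDIA, install CUDA and Ollama detects it
# Print timing statistics (including tokens per second) after each response
ollama run llama3.2:3b --verbose
# While a model is loaded, check whether it is running on the GPU or the CPU
ollama ps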
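A quick way to confirm that acceleration is helping is to compare generation speed before and after a configuration change. A non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which tokens per second follows directly. The sketch below uses hand-written sample values rather than a real measurement, and `tokens_per_second` is an illustrative helper name.

```python
def tokens_per_second(result):
    """Compute generation speed from an Ollama /api/generate response dict."""
    eval_count = result["eval_count"]            # number of tokens generated
    eval_duration_ns = result["eval_duration"]   # generation time in nanoseconds
    return eval_count * 1e9 / eval_duration_ns


if __name__ == "__main__":
    # Hand-written sample values, not a real measurement.
    sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")  # 30.0 tokens/s
```

In practice you would pass `response.json()` from a real request instead of the sample dictionary.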
Context Window
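Larger context windows consume more memory, because the key-value (KV) cache grows linearly with the number of tokens in context. Approximate totals at Q4_K_M are listed below. Treat them as rough estimates for moderate context lengths, since a full 128K context can need considerably more depending on the model architecture and KV cache precision:
- 7B model: ~8 GB total
- 13B model: ~14 GB total
- 70B model: ~48 GB total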
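To get a feel for how context length drives memory use, the sketch below estimates the KV cache on its own: two tensors (keys and values) per layer, per KV head, per token. The architecture numbers are assumptions chosen to resemble an 8B-class model with grouped-query attention, and the cache is assumed to be stored in 16-bit precision. Actual usage varies by model and runtime settings, so read the output as an order-of-magnitude guide that comes on top of the model weights.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


if __name__ == "__main__":
    # Assumed architecture (hypothetical values for illustration).
    layers, kv_heads, head_dim = 32, 8, 128
    for tokens in (8_192, 131_072):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, tokens) / 2**30
        print(f"{tokens:>7} tokens: {gib:.0f} GiB of KV cache")
```

Halving the context length halves this figure, which is why reducing the context window is often the quickest way to fit a model into limited memory.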
Use Cases for Local LLMs
- Privacy-sensitive data: Medical records, legal documents, personal information
- Offline environments: Air-gapped systems, travel, remote locations
- Cost-sensitive workloads: High-volume batch processing
- Experimentation: Rapid testing of different models without API costs
- Latency-critical applications: No network calls for inference
Conclusion
Running LLMs locally is easier than ever with Ollama and LM Studio. Ollama offers command-line simplicity and a rich set of pre-built models. LM Studio provides a polished GUI and OpenAI-compatible API. Start with a 7B model at Q4_K_M quantization on a machine with 8-16 GB of RAM, and scale up as your needs grow. Local LLMs won't replace cloud APIs for every use case, but they are an essential tool in the AI practitioner's toolkit.