Running LLMs Locally

Introduction

Running large language models locally has become practical thanks to quantization techniques, efficient inference engines, and a thriving open-source ecosystem. Whether for privacy, cost savings, or offline availability, local LLMs offer a compelling alternative to cloud APIs for many workloads. This guide covers two of the most popular local LLM platforms: Ollama and LM Studio.

Ollama

Ollama is one of the most popular tools for running LLMs locally, known for its simplicity and command-line focus.

Installation

macOS

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download from https://ollama.com/download

Getting Started

Ollama makes running a model a single command:

# Pull (if not already downloaded) and run a model
ollama run llama3.2:3b

# List downloaded models
ollama list

# Pull a specific model without running it
ollama pull mistral:7b

Popular Models for Ollama

| Model | Download Size | RAM Required | Best For |
|-------|------|-------------|----------|
| Llama 3.2 3B | 2.0 GB | 4 GB | Fast responses, simple tasks |
| Llama 3.1 8B | 4.7 GB | 8 GB | General purpose Q&A |
| Mistral 7B | 4.1 GB | 8 GB | Code, reasoning, instruction following |
| Qwen2.5 7B | 4.8 GB | 8 GB | Strong multilingual, coding |
| Mixtral 8x7B | 26 GB | 48 GB | High quality, close to GPT-3.5 |
| DeepSeek-R1 7B | 4.5 GB | 8 GB | Strong reasoning, step-by-step |
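As a rough illustration of how to read the table, the sketch below filters it by available RAM. The figures are copied from the table above; the function name `models_that_fit` is hypothetical, and the RAM column is treated as a simple minimum, which is an assumption rather than a guarantee that a model will run well.

```python
# Rows copied from the table above: (model, download size in GB, RAM required in GB).
MODELS = [
    ("Llama 3.2 3B", 2.0, 4),
    ("Llama 3.1 8B", 4.7, 8),
    ("Mistral 7B", 4.1, 8),
    ("Qwen2.5 7B", 4.8, 8),
    ("Mixtral 8x7B", 26.0, 48),
    ("DeepSeek-R1 7B", 4.5, 8),
]


def models_that_fit(ram_gb: float) -> list[str]:
    """Return the names of models whose RAM requirement is at most ram_gb."""
    return [name for name, _size_gb, ram_needed in MODELS if ram_needed <= ram_gb]


print(models_that_fit(8))
```

On a machine with other applications open, it is probably safer to pass a value somewhat below the installed RAM.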

Using Ollama Programmatically

Ollama provides a REST API at http://localhost:11434:

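import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "Explain quantum computing in three sentences.",
    "stream": False,  # return one JSON object instead of a stream of chunks
})
response.raise_for_status()  # fail loudly if the server returned an error
print(response.json()["response"])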

Or use the official Python library:

# Requires the client package: pip install ollama
import ollama

response = ollama.chat(model="llama3.2:3b", messages=[
    {"role": "user", "content": "What is the capital of France?"},
])
print(response["message"]["content"])
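The REST example above sets "stream" to False. With streaming enabled, the endpoint is expected to return newline-delimited JSON, where each line carries a text fragment in "response" and the last line has "done" set to true. The sketch below shows one way to join such a stream; the helper name `join_stream` is hypothetical, and the sample lines are hand-written stand-ins for real server output.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments from a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw:  # skip empty keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final chunk marks the end of the reply
            break
    return "".join(parts)


# Simulated stream, shaped like the chunks the server is assumed to send.
sample = [
    b'{"response": "Paris", "done": false}',
    b'{"response": " is the capital.", "done": false}',
    b'{"response": "", "done": true}',
]
print(join_stream(sample))
```

Against a live server, the same function should accept `response.iter_lines()` from a `requests.post(..., stream=True)` call whose JSON body sets "stream" to True.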

Custom Modelfiles

Create custom models with system prompts and parameters:

FROM llama3.1:8b

# Set system prompt
SYSTEM "You are a helpful coding assistant. Provide concise code examples."

# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9

Build and run:

ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
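When several variants of a custom model are needed, the Modelfile can be generated rather than written by hand. The following is a minimal sketch that emits only the FROM, SYSTEM, and PARAMETER lines shown above; the helper name `render_modelfile` is hypothetical, and it deliberately rejects prompts containing double quotes instead of guessing at quoting rules.

```python
def render_modelfile(base: str, system: str, parameters: dict[str, float]) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    if '"' in system:
        # Keep the sketch simple: embedded double quotes are not handled here.
        raise ValueError("system prompt must not contain double quotes")
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1:8b",
    "You are a helpful coding assistant. Provide concise code examples.",
    {"temperature": 0.3, "top_p": 0.9},
)
print(text)
```

Writing `text` to a file named Modelfile and running the `ollama create` command above would then build the model.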

LM Studio

LM Studio is a GUI-focused alternative that suits users who prefer a visual interface and easy model browsing.

Key Features

Built-in model browser: Search and download models from Hugging Face
GUI chat interface: Familiar ChatGPT-like experience
Local API server: OpenAI-compatible API endpoint
Model configuration: Easy sliders for context length, GPU offloading, and temperature
Multi-model support: Load multiple models and switch between them

Setup

1. Download from lmstudio.ai
2. Open the app and browse the model catalog
3. Download a model (start with Llama 3.2 3B or Mistral 7B)
4. Load the model and start chatting

API Server

LM Studio can serve models via an OpenAI-compatible API:

http://localhost:1234/v1/chat/completions

This means any tool that works with OpenAI's API can use your local model by changing the base URL:

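from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed",  # the client requires a value, but the local server does not need a real key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)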
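For tools that cannot use the OpenAI client, the same call can be made as a plain HTTP POST. The sketch below only builds the request, so it runs without a server. The helper name `build_chat_request` is hypothetical, and the body fields follow the OpenAI chat completions format that the endpoint above is described as compatible with.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:1234/v1", "local-model", "Hello!")
print(request.get_method(), request.full_url)
```

With the LM Studio server running, passing `request` to `urllib.request.urlopen` should return JSON whose reply text sits under `choices[0].message.content`, assuming the server follows the OpenAI response format.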

Performance Optimization

Quantization Levels

Quantization reduces model size at the cost of some accuracy:

Q4_K_M: Best balance of quality and size (4-bit, recommended)
Q5_K_M: Higher quality, larger (5-bit, use if you have RAM headroom)
Q8_0: Near-full quality, roughly 2x the RAM of Q4_K_M (8-bit)
Q2_K: Minimal RAM, noticeable quality loss (2-bit)

Rule of thumb: model size scales roughly with bits per weight, so going from 4-bit to 8-bit about doubles the size while improving quality only marginally. Q4_K_M is the sweet spot for most setups.
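The size trade-off can be estimated with simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below uses nominal bit widths (the 4-bit and 5-bit labels come from the list above; 8 and 2 are assumed from the names Q8_0 and Q2_K). Real files are somewhat larger, because these formats also store scaling data and may keep some tensors at higher precision, so treat the result as a lower bound. The function name `estimate_size_gb` is hypothetical.

```python
# Nominal bits per weight. Q4_K_M and Q5_K_M follow the labels in the list above;
# Q8_0 and Q2_K are assumed to be 8-bit and 2-bit from their names.
NOMINAL_BITS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8}


def estimate_size_gb(parameters: float, quant: str) -> float:
    """Lower-bound estimate of weight size in GB (1 GB = 1e9 bytes)."""
    bits = NOMINAL_BITS[quant]
    return parameters * bits / 8 / 1e9


# Compare the quantization levels for a model with 7 billion parameters.
for quant in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"{quant}: ~{estimate_size_gb(7e9, quant):.1f} GB")
```

For a 7B model this gives about 3.5 GB at 4 bits, in line with the 4 to 5 GB downloads typical of 7B models once overhead is included.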

GPU Acceleration

Both Ollama and LM Studio support GPU acceleration on NVIDIA (CUDA), Apple Silicon (Metal), and AMD (ROCm or Vulkan, depending on the tool and platform):

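# Ollama uses Metal automatically on Apple Silicon
# For NVIDIA, install the NVIDIA driver and Ollama detects the GPU

# While a model is loaded, check whether it is running on the GPU or the CPU
ollama ps

# Print timing statistics (such as tokens per second) after each response
ollama run llama3.2:3b --verbose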
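The same check can be scripted. The sketch below assumes that Ollama's REST API exposes a list of loaded models at `/api/ps` and that each entry reports a total `size` and a `size_vram` in bytes; verify both field names against your Ollama version before relying on them. The helper name `gpu_fraction` is hypothetical, and the sample payload is hand-written with made-up numbers.

```python
def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    total = model_entry.get("size", 0)
    if total <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / total


# Hand-written sample shaped like the assumed /api/ps payload; the numbers are made up.
sample_payload = {
    "models": [
        {"name": "llama3.2:3b", "size": 4_000_000_000, "size_vram": 3_000_000_000},
    ]
}

for entry in sample_payload["models"]:
    print(f'{entry["name"]}: {gpu_fraction(entry):.0%} in VRAM')
```

A fraction below 1.0 would suggest that part of the model has spilled to system RAM, which usually slows generation.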

Context Window

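Larger context windows consume more memory because the key/value cache grows linearly with the number of tokens. With Q4_K_M weights and a modest context window, total memory use is approximately:

7B model: ~8 GB total
13B model: ~14 GB total
70B model: ~48 GB total

A full 128K context adds substantially to these figures, often many gigabytes for the cache alone.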
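The memory added by the context window comes mostly from the key/value cache, which stores one key vector and one value vector per layer for every token. A rough estimate is sketched below; the function name `kv_cache_bytes` is hypothetical, and the architecture values (32 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions chosen for illustration, not figures from this guide.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache: one key and one value vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed architecture for illustration: 32 layers, 8 KV heads, head dimension 128,
# 16-bit cache entries. Check your model's configuration for the real values.
for tokens in (2_048, 8_192, 32_768):
    gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
    print(f"{tokens:>6} tokens: {gib:.2f} GiB")
```

Because the cache scales linearly with context length, halving the context halves this part of the memory budget, while the weights stay the same size.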

Use Cases for Local LLMs

Privacy-sensitive data: Medical records, legal documents, personal information
Offline environments: Air-gapped systems, travel, remote locations
Cost-sensitive workloads: High-volume batch processing
Experimentation: Rapid testing of different models without API costs
Latency-critical applications: No network calls for inference

Conclusion

Running LLMs locally is easier than ever with Ollama and LM Studio. Ollama offers command-line simplicity and a rich set of pre-built models. LM Studio provides a polished GUI and OpenAI-compatible API. Start with a 7B model at Q4_K_M quantization on a machine with 8-16 GB of RAM, and scale up as your needs grow. Local LLMs won't replace cloud APIs for every use case, but they are an essential tool in the AI practitioner's toolkit.