OpenClaw + Ollama on Mac mini M4: Local LLM AI Agent Setup Guide 2026
Developers using OpenClaw on VpsGona Mac mini M4 nodes can now pair it with Ollama to run fully local LLMs — no API key costs, no data leaving the machine, and Metal GPU acceleration delivering 20–45 tokens/sec on 7–14B models. This guide covers model selection for 16 GB unified memory, step-by-step installation, OpenClaw configuration, real workflow benchmarks, and the five most common issues you'll hit when setting this up for the first time.
Why Use Ollama with OpenClaw on Mac mini M4
Most OpenClaw users start with cloud LLM backends — OpenAI, Anthropic, or similar APIs. That works well for general tasks, but three scenarios consistently push teams toward a local model setup:
- Code privacy requirements — proprietary source code, internal tooling, or client IP that cannot be transmitted to third-party API endpoints
- Cost control at scale — agents running thousands of completions per day accumulate API costs that exceed the monthly VpsGona node rental fee within 2–3 days
- Latency and offline operation — a local Ollama server responds in milliseconds with no network round-trip; this matters for tight agent loops with many tool calls
Ollama is the easiest way to run quantized open-source LLMs on macOS. It handles model download, quantization format selection, server lifecycle, and the OpenAI-compatible REST API that OpenClaw already knows how to talk to. The Mac mini M4's unified memory architecture — where CPU and GPU share the same physical DRAM — means Ollama can load large models entirely into GPU-addressable memory with no PCIe bandwidth bottleneck, making it materially faster than Ollama on a Windows machine with a discrete GPU and 16 GB of separate VRAM.
Model Selection Guide for 16 GB Unified Memory
The single most common setup mistake is trying to run a model that is too large. On a 16 GB node, macOS itself consumes roughly 3–4 GB at idle, and OpenClaw's UI and agent runtime take another 300–600 MB. This leaves approximately 11–12 GB for the model weights. Here is a practical selection matrix:
| Model | Quantization | Disk Size | RAM Usage | Tokens/sec (M4) | Best For |
|---|---|---|---|---|---|
| Qwen2.5:14b | Q4_K_M | 8.9 GB | ~9.8 GB | 22–28 t/s | Code, reasoning, long context |
| Llama3.1:8b | Q4_K_M | 4.7 GB | ~5.2 GB | 38–45 t/s | Fast coding agent, chat |
| Mistral:7b | Q4_0 | 4.1 GB | ~4.6 GB | 40–48 t/s | Function calling, tool use |
| Gemma2:9b | Q4_K_M | 5.4 GB | ~5.9 GB | 32–38 t/s | Instruction following |
| DeepSeek-Coder-V2:16b | Q4_K_M | 9.1 GB | ~10.2 GB | 18–24 t/s | Complex code generation |
| Qwen2.5:32b | Q4_K_M | 19.8 GB | >20 GB | — (swap heavy) | Not recommended on 16 GB |
Recommendation for OpenClaw agent use: Start with llama3.1:8b for fast iteration during setup, then switch to qwen2.5:14b for production agent tasks that require stronger reasoning or longer context windows. Both fit comfortably in 16 GB with room for OpenClaw overhead.
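Before committing to a model, it's worth checking the actual headroom on your node. A quick sanity check with standard macOS tools (the arithmetic below is a rough heuristic, not a hard limit):

sysctl hw.memsize
# 17179869184 bytes = 16 GB total physical memory on the base M4 node
vm_stat | head -n 6
# multiply "Pages free" + "Pages inactive" by the page size (16384 bytes on Apple Silicon)
# for a rough estimate of memory reclaimable before the model loads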
Step-by-Step: Install Ollama on Your VpsGona Mac mini M4 Node
Step 1 — Install Ollama
Connect to your VpsGona node via SSH, then install Ollama. Note that the official curl installer (curl -fsSL https://ollama.com/install.sh | sh) targets Linux only; on macOS, install with Homebrew or download the desktop app from ollama.com:
brew install ollama
This installs the Ollama binary (at /opt/homebrew/bin/ollama on Apple Silicon). Verify installation:
ollama --version
# Expected: ollama version 0.7.x
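The Homebrew formula installs the binary but does not start the server by itself. A minimal way to run it as a background service (assuming a standard Homebrew setup):

# register Ollama with launchd so it starts automatically at login
brew services start ollama
# confirm the service is registered and running
brew services list | grep ollama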
Step 2 — Pull Your Model
Pull the model you selected from the table above. The download may take 5–15 minutes depending on model size and your node's network speed:
ollama pull qwen2.5:14b
# Or for a faster start:
ollama pull llama3.1:8b
After pulling, verify it appears in the local model list:
ollama list
# NAME ID SIZE MODIFIED
# qwen2.5:14b ... 8.9 GB ...
# llama3.1:8b ... 4.7 GB ...
Step 3 — Verify the Ollama API is Running
Ollama starts a local HTTP server on port 11434 by default. Verify it's responding:
curl http://localhost:11434/api/tags
# Should return JSON with your pulled models listed
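Beyond listing models, you can exercise the generation endpoint directly over HTTP, which is the same server OpenClaw will talk to. A minimal non-streaming request (the model tag assumes you pulled llama3.1:8b in Step 2):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Reply with the single word: ready",
  "stream": false
}'
# The JSON response carries the generated text in "response", plus
# eval_count and eval_duration fields you can use to compute tokens/sec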
Run a quick inference test to confirm GPU acceleration is active:
ollama run llama3.1:8b "Respond with exactly: GPU OK"
# Should respond in under 2 seconds on M4
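A fast response alone doesn't prove Metal acceleration is active. To confirm the model is fully resident in GPU-addressable memory, check the processor column reported by Ollama (available in recent releases):

ollama ps
# NAME          ID   SIZE   PROCESSOR   UNTIL
# llama3.1:8b   ...  ...    100% GPU    ...
# Anything below 100% GPU means layers spilled to CPU, which costs significant speed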
Step 4 — Expose Ollama for Remote OpenClaw (Optional)
If you want to use OpenClaw on your local Mac but run inference on the VpsGona node, create an SSH tunnel. A tunnel forwards to the node's own localhost, so Ollama's default 127.0.0.1 binding is sufficient; set OLLAMA_HOST=0.0.0.0 only if you need direct LAN access to the node, and firewall the port if you do:
# On the VpsGona node, make sure the server is running (in a tmux session or via brew services):
ollama serve
# On your local machine:
ssh -L 11434:localhost:11434 -p {PORT} user@{NODE_IP} -N
After establishing the tunnel, OpenClaw will see http://localhost:11434 as if Ollama were running locally — but inference runs on the M4 node.
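A plain -L tunnel dies silently when the connection drops, which strands a long-running agent mid-task. A more resilient variant using standard OpenSSH keep-alive options ({PORT} and {NODE_IP} as above):

# probe the connection every 30 s, exit after 3 missed replies instead of hanging,
# and fail fast if the port forward itself cannot be established
ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o ExitOnForwardFailure=yes \
    -L 11434:localhost:11434 -p {PORT} user@{NODE_IP} -N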
Configure OpenClaw to Use the Local Ollama Endpoint
OpenClaw supports Ollama as a first-class LLM provider as of version 2.2.0. The configuration requires three values:
- Open OpenClaw → Settings → LLM Providers
- Click Add Provider → select Ollama
- Set Base URL to http://localhost:11434 (or your SSH-tunneled address)
- Set Model to the model name exactly as shown in ollama list (e.g., qwen2.5:14b)
- Leave API Key empty — Ollama requires no key
- Click Test Connection — a green checkmark confirms the agent can reach the model
The Model value must match the ollama list output exactly — including the colon and tag (e.g., qwen2.5:14b, not qwen2.5-14b).
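If Test Connection fails, you can reproduce the check from a terminal. Ollama also exposes an OpenAI-compatible endpoint, the style of API that provider integrations typically call; a minimal probe (model name assumed from ollama list):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "ping"}]
  }'
# A normal chat-completion JSON object confirms that both the endpoint and the
# exact model name resolve; an error mentioning the model indicates a tag mismatch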
Enabling Tool Calling with Ollama Models
OpenClaw's agent capabilities (file operations, web search, terminal commands) depend on the LLM supporting structured tool/function calling. Not all Ollama models do. The models that reliably support tool calling with OpenClaw are:
- llama3.1:8b — strong tool calling, fastest on M4
- qwen2.5:14b — excellent tool calling and code generation
- mistral:7b — reliable function calling for structured tasks
- deepseek-coder-v2:16b — best for code-heavy agent pipelines
Models without tool calling support (e.g., some older Gemma versions) can still be used for chat and document summarization within OpenClaw's non-agent modes.
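You can probe a model's tool-calling behavior directly against Ollama's chat API before wiring it into OpenClaw. A minimal check (the weather tool here is purely illustrative):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is the weather in Tokyo right now?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
# A tool-capable model returns a message containing "tool_calls" that names
# get_weather; models without tool support answer in plain text or return an error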
Performance Benchmarks: OpenClaw + Ollama on Mac mini M4
The following benchmarks were measured on a VpsGona Mac mini M4 base model (16 GB / 256 GB) with Ollama 0.7.2. Each figure represents the mean of 5 runs after a warm model load (first-token cold-start excluded):
| Task | Model | Tokens Generated | Time | Effective t/s |
|---|---|---|---|---|
| Code review (200-line Swift file) | qwen2.5:14b | ~420 | 18.2 s | 23.1 t/s |
| Unit test generation (Python class) | llama3.1:8b | ~280 | 7.0 s | 40.0 t/s |
| Multi-step agent plan (5 tool calls) | qwen2.5:14b | ~650 | 28.5 s | 22.8 t/s |
| Document summarization (10 pages) | mistral:7b | ~380 | 8.4 s | 45.2 t/s |
| Shell command generation from description | llama3.1:8b | ~90 | 2.2 s | 40.9 t/s |
For reference, 23 tokens/sec is comfortably faster than typical human reading speed — users perceive responses as near-instant for outputs under 200 tokens. For longer agent outputs (400–800 tokens), the 20–25 second wait is acceptable for batch-style automation tasks but may feel slow for interactive chat workflows, where llama3.1:8b at 40 t/s is a better choice.
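To reproduce these measurements on your own node, the Ollama CLI prints timing statistics when run with --verbose; the prompt below is just an example workload:

# after the response, look for the "eval rate" line, reported in tokens/s
ollama run qwen2.5:14b --verbose "Explain the tradeoffs of Q4 quantization in three sentences."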
Real Workflow Examples
Workflow 1: Automated PR Code Review Agent
An OpenClaw agent configured with qwen2.5:14b can read a Git diff, identify potential bugs, and write review comments to a file — all without sending a single line of code to an external API. Set up the agent with this task template:
Read the git diff in /path/to/project using the terminal tool.
Identify: 1) potential null pointer dereferences, 2) missing error handling,
3) logic that doesn't match the commit message intent.
Write a structured review to /tmp/review-output.md
On a 300-line diff, this agent completes in approximately 45–60 seconds on the M4 node with no API costs and no code leaving the machine.
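As a quick smoke test of the same idea without the agent layer, you can pipe a diff straight into the model from the shell. The path, revision range, and prompt wording below are illustrative:

# ollama run appends piped stdin to the prompt argument
cd /path/to/project
git diff HEAD~1 | ollama run qwen2.5:14b \
  "Review this diff for null dereferences, missing error handling, and logic that contradicts the commit message. Write a structured markdown review." \
  > /tmp/review-output.md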
Workflow 2: Documentation Generator
Using mistral:7b for its speed and structured output reliability, an OpenClaw TaskFlow can iterate through source files, generate JSDoc or Swift DocC comments, and write them back to the files. A typical run on a 20-file module takes under 8 minutes at 45 t/s and produces consistent, style-guide-conformant documentation without manual effort.
Workflow 3: Test Suite Scaffolding
For each source file in a Python or TypeScript project, an OpenClaw agent using llama3.1:8b can read the public interface, generate a pytest or Jest test file skeleton, and save it alongside the source. This workflow is especially valuable at the start of a new module: the scaffolding takes 10–15 seconds per file and reduces the blank-page problem for developers writing tests from scratch.
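Outside OpenClaw, the core of this scaffolding loop fits in a few lines of shell; the file layout, model choice, and prompt are all assumptions to adapt:

# generate a pytest skeleton for every Python source file; overwrites existing output
mkdir -p tests
for f in src/*.py; do
  ollama run llama3.1:8b \
    "Write a pytest test skeleton covering the public functions of this file. Output only Python code." \
    < "$f" > "tests/test_$(basename "$f")"
done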
Common Issues and Fixes
Issue: Model runs very slowly (under 5 tokens/sec)
Cause: The model is too large and macOS is swapping to disk. Fix: Run ollama list to see model sizes and switch to a smaller quantization (e.g., from Q8 to Q4_K_M). Monitor memory pressure with memory_pressure or Activity Monitor — if the pressure indicator is red, the model definitely won't fit without swapping.
Issue: OpenClaw shows "model not found" even after ollama pull
Cause: Model name mismatch between OpenClaw's config and Ollama's naming. Fix: Copy the exact model name from ollama list — it must include the tag (e.g., qwen2.5:14b). Some model names use hyphens in Ollama's registry but colons in the local name — always use what ollama list shows.
Issue: OpenClaw agent fails to call tools with Ollama model
Cause: The selected model doesn't support the tool/function calling schema that OpenClaw sends. Fix: Switch to one of the verified tool-calling models listed above. You can verify tool calling support by checking the Ollama model card on ollama.com/library — models with "tools" listed in their capabilities will work with OpenClaw agents.
Issue: OpenClaw can't connect to Ollama ("connection refused")
Cause: Ollama server is not running, or it's bound to 127.0.0.1 only and you're accessing from a different process/tunnel. Fix: Verify with curl http://localhost:11434. If the service isn't running, start it with ollama serve. If using a remote node, confirm your SSH tunnel is active with lsof -i :11434 on your local machine.
Issue: Agent loses context mid-task on long documents
Cause: The document exceeds the model's effective context window. Most 7–14B models support a 4K–32K token context, but Ollama serves a smaller default context (historically 2K–4K tokens) unless you raise num_ctx. Fix: For longer inputs, use qwen2.5:14b (32K context) or split the task into chunks using OpenClaw's TaskFlow multi-step pipeline. Alternatively, raise Ollama's num_ctx parameter, either per-request via the API's options field or permanently with a Modelfile, as sketched below; larger contexts use more memory.
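A persistent way to raise the window is to bake num_ctx into a derived model with a Modelfile; a minimal sketch (the -16k suffix is an arbitrary name):

# derive a 16K-context variant; a larger num_ctx grows the KV cache, so watch memory pressure
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen2.5:14b-16k -f Modelfile
# then set OpenClaw's Model field to qwen2.5:14b-16k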
Why Mac mini M4 Is the Ideal Local LLM Server for OpenClaw
Running a local LLM server 24/7 requires a machine that is fast enough to be useful, quiet enough for office environments, and power-efficient enough not to be a significant cost line item. The Mac mini M4 hits all three criteria in ways that no x86 workstation or ARM SBC can match at its price point.
The unified memory architecture is the fundamental differentiator: on a Mac mini M4, all 16 GB is simultaneously accessible by the CPU, GPU, and memory-mapped model layers. This means Ollama can keep a 9 GB model fully in GPU memory while macOS, OpenClaw, and a browser run concurrently — without the model being evicted or split across CPU/GPU memory as it would be on a PC with a discrete GPU. The result is consistent, predictable inference speed with no "cold start" after the model is loaded.
VpsGona's Mac mini M4 nodes in HK, JP, KR, SG, and US East give AI teams an immediately available inference server in the geography that matters for their use case — for example, a Tokyo-based development team can use the JP node for sub-10ms local API latency to their OpenClaw + Ollama stack, while a US-focused team uses the US East node. Each node is an isolated physical machine, not a VM, so there's no noisy-neighbor effect on inference speed. Visit the pricing page to compare configurations, or read the setup documentation for first-time deployment guides.
Run Your Own Private AI Agent Sandbox
Rent a Mac mini M4 node and deploy OpenClaw + Ollama in minutes. No API key costs, no data leakage — fully local inference on Apple Silicon.