# mac-code — Free Local AI Agent on Apple Silicon

Skill by ara.so — Daily 2026 Skills collection.

Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (a Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. It supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.
## What It Does

- **LLM-as-router:** the model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
- **35B MoE at 30 tok/s:** via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
- **35B full Q4 on 16 GB:** via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
- **9B at 64K context:** via quantized KV cache (`q4_0` keys/values)
- **MLX backend:** adds persistent KV cache save/load, context compression, R2 sync
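The routing step above can be sketched as a tiny classifier over the local OpenAI-compatible endpoint. This is an illustrative sketch, not agent.py's actual code: `ROUTE_PROMPT`, `parse_route`, and `classify` are hypothetical names, and only the `/v1/chat/completions` endpoint comes from the docs below. Standard library only.

```python
import json
import urllib.request

# Hypothetical routing prompt; the real agent.py prompt may differ
ROUTE_PROMPT = (
    "Classify the user request as exactly one word: "
    "'search' (needs fresh web info), 'shell' (run a command), or 'chat'.\n"
    "Request: {prompt}\nAnswer:"
)

def parse_route(raw: str) -> str:
    """Map the model's free-form reply onto a known route, defaulting to 'chat'."""
    label = raw.strip().lower()
    for route in ("search", "shell"):
        if route in label:
            return route
    return "chat"

def classify(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    """Ask the local model which tool should handle the prompt."""
    body = json.dumps({
        "model": "local",
        "messages": [{"role": "user",
                      "content": ROUTE_PROMPT.format(prompt=prompt)}],
        "max_tokens": 4,
    }).encode()
    req = urllib.request.Request(f"{base}/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return parse_route(reply)
```

On a well-behaved model, `classify("who won the NBA finals")` would come back as `"search"`; `parse_route` falls back to `"chat"` whenever the reply is ambiguous.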
- **Tools:** DuckDuckGo search, shell execution, file read/write

## Installation

### Prerequisites

```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```

### Clone the repo

```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code
```

### Download models

**35B MoE** — fast daily driver (10.6 GB, fits in 16 GB RAM):

```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/'
)
"
```

**9B** — 64K context, long documents (5.3 GB):

```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf',
    local_dir='$HOME/models/'
)
"
```

## Starting the Backend

### Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)

```bash
llama-server \
  --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --host 127.0.0.1 \
  --flash-attn on --ctx-size 12288 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 --reasoning off -np 1 -t 4
```

### Option B: llama.cpp + 9B (64K context)

```bash
llama-server \
  --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
  --port 8000 --host 127.0.0.1 \
  --flash-attn on --ctx-size 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 --reasoning off -t 4
```

### Option C: MLX backend (persistent context, 9B)

```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```
### Start the agent (all options)

```bash
python3 agent.py
```
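On a first run the backend may still be downloading the model when the agent starts. A small readiness poll avoids that race; `wait_for_server` is a hypothetical helper (standard library only), and the `/health` endpoint is the one used later in Troubleshooting.

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url: str, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll url until it answers with any HTTP response, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True
        except urllib.error.HTTPError:
            return True           # server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # connection refused: server not up yet
    return False
```

For example, call `wait_for_server("http://localhost:8000/health")` before launching `agent.py`.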
## Agent CLI Commands

Inside the agent REPL, type `/` for all commands:

| Command | Action |
| --- | --- |
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search` | Force a DuckDuckGo web search |
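Handling these commands can be sketched as a small parser in front of the router. This is a hypothetical sketch, not agent.py's real implementation:

```python
def parse_command(line: str):
    """Split a REPL line into (command, argument) if it starts with '/', else None.

    Hypothetical sketch of slash-command handling; agent.py's real parser
    may differ.
    """
    if not line.startswith("/"):
        return None  # ordinary prompt: goes to the router
    name, _, arg = line[1:].strip().partition(" ")
    return name, arg.strip()
```

Here `parse_command("/model 9b")` returns `("model", "9b")`, `parse_command("/raw")` returns `("raw", "")`, and a plain prompt returns `None` so it falls through to routing.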
### Example Routing

- `find all Python files modified in the last 7 days` → routes to "shell", generates `find . -name "*.py" -mtime -7`
- `who won the NBA finals` → routes to "search", queries DuckDuckGo, summarizes
- `explain how attention works` → routes to "chat", streams directly

## MLX Backend — Persistent KV Cache API

The MLX engine exposes a REST API on `localhost:8000`.

### Save context after processing a large codebase

```bash
# jq (1.6+) builds the JSON safely; "$(cat README.md)" inside single-quoted
# JSON would be sent as a literal string instead of expanding
curl -X POST localhost:8000/v1/context/save \
  -H "Content-Type: application/json" \
  -d "$(jq -n --rawfile p README.md '{name: "my-project", prompt: $p}')"
```

### Load saved context instantly (0.0003s)

```bash
curl -X POST localhost:8000/v1/context/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}'
```

### Download context from Cloudflare R2 (cross-Mac sync)
```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name

curl -X POST localhost:8000/v1/context/download \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}'
```

### Standard OpenAI-compatible chat

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Write a Python quicksort"}],
        "stream": False,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

### Streaming chat

```python
import requests, json

with requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Explain transformers"}],
        "stream": True,
    },
    stream=True,
) as r:
    for line in r.iter_lines():
        if line == b"data: [DONE]":  # end-of-stream sentinel, not JSON
            break
        if line.startswith(b"data: "):
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```

## KV Cache Compression (MLX)

Compress context 4x with 99.3% similarity:

```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache

# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")

# Load later
kv = load_kv_cache("my-project-compressed")
```

## Flash Streaming — Out-of-Core Inference

For models larger than your RAM (research mode), `cd research/flash-streaming` and run:
```bash
# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py

# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```

### How F_NOCACHE direct I/O works

```python
import os, fcntl

# Open the model file, bypassing the macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache

# Aligned read (16 KB boundary for the DART IOMMU);
# layer_offset / layer_size are the tensor's location within the file
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```

### MoE Expert Sniper pattern
```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state)  # returns [8] indices

# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor

def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))

# ~14 MB loaded per layer instead of 221 MB (dense)
```
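The alignment arithmetic above generalizes into a reusable helper. This is a sketch, not code from the repo: `open_nocache` and `aligned_pread` are hypothetical names, and the `F_NOCACHE` flag is looked up with `getattr` because it only exists on macOS.

```python
import os
import fcntl

ALIGN = 16384  # 16 KB DART IOMMU boundary

def open_nocache(path: str) -> int:
    """Open a file read-only, bypassing the page cache where supported."""
    fd = os.open(path, os.O_RDONLY)
    nocache = getattr(fcntl, "F_NOCACHE", None)  # macOS-only fcntl flag
    if nocache is not None:
        fcntl.fcntl(fd, nocache, 1)
    return fd

def aligned_pread(fd: int, offset: int, size: int) -> bytes:
    """Read size bytes at an arbitrary offset using only ALIGN-aligned I/O."""
    start = (offset // ALIGN) * ALIGN                      # round start down
    skew = offset - start                                  # leading bytes to drop
    length = ((skew + size + ALIGN - 1) // ALIGN) * ALIGN  # round length up
    data = os.pread(fd, length, start)
    return data[skew:skew + size]
```

With these, the layer read above becomes `weights = aligned_pread(fd, layer_offset, layer_size)`.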
## Common Patterns

### Use as a Python library (direct API calls)

```python
import requests

BASE = "http://localhost:8000/v1"

def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(
        f"{BASE}/chat/completions",
        json={
            "model": "local",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        },
    )
    return r.json()["choices"][0]["message"]["content"]

# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```

### Process a large file with paged inference

```python
from mlx.paged_inference import PagedInference

engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")

with open("large_codebase.txt") as f:
    content = f.read()  # can exceed a single context window

# Automatically pages through the content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```

### Monitor server performance

```bash
python3 dashboard.py
```

## Model Selection Guide

| Your Mac RAM | Best Option | Command |
| --- | --- | --- |
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |

## Troubleshooting

### Server not responding on port 8000
```bash
# Check if the server is running
curl http://localhost:8000/health

# Check what's on port 8000
lsof -i :8000

# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --verbose
```

### Model download fails / incomplete
```bash
# Resume an interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/',
    resume_download=True
)
"
```

### Slow inference / RAM pressure on 16 GB Mac
```bash
# Reduce context size to free RAM (4096, down from 12288)
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --ctx-size 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 -t 4

# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b
```
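Why does shrinking the context free RAM? The KV cache grows linearly with context length and with cache precision. The calculator below is a back-of-envelope sketch: the layer and head dimensions are hypothetical placeholders (not Qwen3.5's real config), and q4_0 block scales are ignored.

```python
def kv_cache_bytes(n_layers: int, ctx: int, n_kv_heads: int,
                   head_dim: int, bits: int) -> int:
    """Bytes for K plus V caches: 2 tensors x layers x tokens x heads x dim x bits/8."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bits // 8

# Hypothetical model shape, for illustration only
n_layers, n_kv_heads, head_dim = 48, 8, 128

full  = kv_cache_bytes(n_layers, 12288, n_kv_heads, head_dim, bits=4)  # q4_0 cache
small = kv_cache_bytes(n_layers, 4096, n_kv_heads, head_dim, bits=4)
f16   = kv_cache_bytes(n_layers, 12288, n_kv_heads, head_dim, bits=16)

print(f"ctx 12288 @ q4_0: {full / 2**20:.0f} MiB")  # 3x the ctx-4096 figure
print(f"ctx  4096 @ q4_0: {small / 2**20:.0f} MiB")
print(f"ctx 12288 @ f16 : {f16 / 2**20:.0f} MiB")   # 4x the q4_0 figure
```

Dropping `--ctx-size` from 12288 to 4096 cuts the cache to a third, and `q4_0` keys/values cut it to a quarter of f16, which is exactly what the flags above exploit.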
### MLX engine crashes with memory error

```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
```

Then reduce the batch size in `mlx_engine.py`: change `max_batch_size = 512` to `max_batch_size = 128`.
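`vm_stat` reports pages, not bytes; on Apple Silicon the page size is 16 KB. A small parser (hypothetical helper, standard library only) converts the "Pages free" line into bytes:

```python
import re

PAGE_SIZE = 16384  # Apple Silicon page size; Intel Macs use 4096

def free_bytes(vm_stat_output: str, page_size: int = PAGE_SIZE) -> int:
    """Extract 'Pages free' from vm_stat output and convert it to bytes."""
    m = re.search(r"Pages free:\s+(\d+)\.", vm_stat_output)
    if m is None:
        raise ValueError("no 'Pages free' line found")
    return int(m.group(1)) * page_size
```

For example, `free_bytes(subprocess.check_output(["vm_stat"], text=True)) / 2**20` gives free memory in MiB.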
### F_NOCACHE not bypassing page cache (macOS Sonoma+)

```python
# Verify F_NOCACHE is active
import fcntl, os

model_path = os.path.expanduser("~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf")
fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```

### ddgs search fails

```bash
pip3 install --upgrade ddgs --break-system-packages
```

ddgs uses DuckDuckGo — no API key required, but it may rate-limit. Retry after 60 seconds if you get a 202 response.
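The retry advice can be wrapped in a small helper. `with_retries` is a hypothetical utility, not part of mac-code; `ddgs.DDGS().text()` in the comment is the search call the agent uses.

```python
import time

def with_retries(fn, attempts: int = 3, delay: float = 60.0, sleep=time.sleep):
    """Call fn(); on any exception, wait `delay` seconds and retry, up to `attempts` tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of retries: re-raise the last error
            sleep(delay)

# e.g. results = with_retries(lambda: DDGS().text("who won the NBA finals"))
```

The injectable `sleep` argument keeps the helper testable without actually waiting a minute per retry.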
### Wrong reshape on GGUF dequantization

```python
# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0])  # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T   # WRONG
```
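To see why the two reshapes differ, here is a pure-Python illustration (no numpy): `ne = [3, 2]` is a made-up shape where `ne[0]` is the fastest-varying GGUF dimension. Both expressions produce the same shape, but they place the flat elements in different slots.

```python
def reshape_rows(flat, rows, cols):
    """Row-major reshape: element (i, j) comes from flat[i * cols + j]."""
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]

def transpose(m):
    return [list(col) for col in zip(*m)]

flat = [0, 1, 2, 3, 4, 5]
ne = [3, 2]  # made-up GGUF shape: ne[0] is the fastest-varying dimension

correct = reshape_rows(flat, ne[1], ne[0])           # [[0, 1, 2], [3, 4, 5]]
wrong = transpose(reshape_rows(flat, ne[0], ne[1]))  # [[0, 2, 4], [1, 3, 5]]

# Same shape, but consecutive weights end up scattered in the wrong version
assert correct != wrong
```

The transpose trick keeps the shape right while scrambling which weight lands where, which is why it corrupts dequantized tensors silently.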
## Architecture Summary

```
agent.py
├── Intent classification → "search" | "shell" | "chat"
├── search → ddgs.DDGS().text() → summarize
├── shell  → generate command → subprocess.run()
└── chat   → stream directly

Backends (both expose an OpenAI-compatible API on :8000)
├── llama.cpp → fast, standard, no persistence
└── mlx/      → KV cache save/load/compress/sync

Flash Streaming (research/)
├── moe_expert_sniper.py → 35B Q4, 1.42 GB RAM
└── flash_stream_v2.py   → 32B dense, 4.5 GB RAM
    └── F_NOCACHE + pread + 16KB alignment
```