# llmfit Hardware Model Matcher

Skill by ara.so (Daily 2026 Skills collection).

llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions to tell you which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).

## Installation

### macOS / Linux (Homebrew)

    brew install llmfit

### Quick install script

    curl -fsSL https://llmfit.axjns.dev/install.sh | sh

    # Without sudo, installs to ~/.local/bin
    curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

### Windows (Scoop)

    scoop install llmfit

### Docker / Podman

    docker run ghcr.io/alexsjones/llmfit

    # With jq for scripting
    podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'

### From source (Rust)

    git clone https://github.com/AlexsJones/llmfit.git
    cd llmfit
    cargo build --release
    # binary at target/release/llmfit
## Core Concepts

- Fit tiers: perfect (runs great), good (runs well), marginal (runs but tight), too_tight (won't run)
- Scoring dimensions: quality, speed (tok/s estimate), fit (memory headroom), context capacity
- Run modes: GPU, CPU+GPU offload, CPU-only, MoE
- Quantization: automatically selects the best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
- Providers: Ollama, llama.cpp, MLX, Docker Model Runner

## Key Commands

### Launch Interactive TUI

    llmfit

### CLI Table Output

    llmfit --cli

### Show System Hardware Detection

    llmfit system
    llmfit --json system  # JSON output
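The `--json` flag makes llmfit easy to drive from a script. The following is a minimal sketch of a wrapper that runs `llmfit --json <subcommand>` and parses the result. The helper name `llmfit_json` and the injectable `run` parameter are illustrative assumptions, not part of llmfit, and the sketch makes no assumption about the JSON schema.

```python
import json
import subprocess
from typing import Any, Callable, Sequence


def llmfit_json(args: Sequence[str], run: Callable[..., Any] = subprocess.run) -> Any:
    """Run `llmfit --json <args>` and return the parsed JSON output.

    `run` is injectable (hypothetical convenience) so the wrapper can be
    exercised without llmfit installed.
    """
    cmd = ["llmfit", "--json", *args]
    result = run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr so detection failures are visible to the caller.
        raise RuntimeError(
            f"llmfit exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)
```

With llmfit on the PATH, `llmfit_json(["system"])` would return the same data as `llmfit --json system`, and `llmfit_json(["fit", "-n", "10"])` mirrors `llmfit --json fit -n 10`.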
### List All Models

    llmfit list

### Search Models

    llmfit search "llama 8b"
    llmfit search "mistral"
    llmfit search "qwen coding"

### Fit Analysis

    # All runnable models ranked by fit
    llmfit fit

    # Only perfect fits, top 5
    llmfit fit --perfect -n 5

    # JSON output
    llmfit --json fit -n 10

### Model Detail

    llmfit info "Mistral-7B"
    llmfit info "Llama-3.1-70B"

### Recommendations

    # Top 5 recommendations (JSON default)
    llmfit recommend --json --limit 5

    # Filter by use case: general, coding, reasoning, chat, multimodal, embedding
    llmfit recommend --json --use-case coding --limit 3
    llmfit recommend --json --use-case reasoning --limit 5

### Hardware Planning (invert: what hardware do I need?)

    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
    llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json

### REST API Server (for cluster scheduling)

    llmfit serve
    llmfit serve --host 0.0.0.0 --port 8787

## Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):

    # Override GPU VRAM
    llmfit --memory=32G
    llmfit --memory=24G --cli
    llmfit --memory=24G fit --perfect -n 5
    llmfit --memory=24G recommend --json

    # Megabytes
    llmfit --memory=32000M

    # Works with any subcommand
    llmfit --memory=16G info "Llama-3.1-70B"

Accepted suffixes: G / GB / GiB, M / MB / MiB, T / TB / TiB (case-insensitive).

## Context Length Cap

    # Estimate memory fit at 4K context
    llmfit --max-context 4096 --cli

    # With subcommands
    llmfit --max-context 8192 fit --perfect -n 5
    llmfit --max-context 16384 recommend --json --limit 5

    # Environment variable alternative
    export OLLAMA_CONTEXT_LENGTH=8192
    llmfit recommend --json

## REST API Reference

Start the server:

    llmfit serve --host 0.0.0.0 --port 8787

### Endpoints

    # Health check
    curl http://localhost:8787/health

    # Node hardware info
    curl http://localhost:8787/api/v1/system

    # Full model list with filters
    curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"

    # Top runnable models for this node (key scheduling endpoint)
    curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

    # Search by model name/provider
    curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"

### Query Parameters for /models and /models/top

| Param | Values | Description |
|---|---|---|
| limit / n | integer | Max rows returned |
| min_fit | perfect, good, marginal, too_tight | Minimum fit tier |
| perfect | true, false | Force perfect-only |
| runtime | any, mlx, llamacpp | Filter by runtime |
| use_case | general, coding, reasoning, chat, multimodal, embedding | Use case filter |
| provider | string | Substring match on provider |
| search | string | Free-text across name/provider/size/use-case |
| sort | score, tps, params, mem, ctx, date, use_case | Sort column |
| include_too_tight | true, false | Include non-runnable models |
| max_context | integer | Per-request context cap |

## Scripting & Automation Examples

### Bash: Get top coding models as JSON

    #!/bin/bash
    # Get the top 3 recommended coding models
    llmfit recommend --json --use-case coding --limit 3 | \
      jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'

### Bash: Check if a specific model fits

    #!/bin/bash
    MODEL="Mistral-7B"
    RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
    FIT=$(echo "$RESULT" | jq -r '.fit')

    if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
      echo "$MODEL will run well (fit: $FIT)"
    else
      echo "$MODEL may not run well (fit: $FIT)"
    fi

### Bash: Auto-pull top Ollama model

    #!/bin/bash
    # Get the top fitting model name and pull it with Ollama
    TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    echo "Pulling: $TOP_MODEL"
    ollama pull "$TOP_MODEL"

### Python: Query the REST API

    import requests

    BASE_URL = "http://localhost:8787"

    def get_system_info():
        resp = requests.get(f"{BASE_URL}/api/v1/system")
        return resp.json()

    def get_top_models(use_case="coding", limit=5, min_fit="good"):
        params = {
            "use_case": use_case,
            "limit": limit,
            "min_fit": min_fit,
            "sort": "score",
        }
        resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
        return resp.json()

    def search_models(query, runtime="any"):
        resp = requests.get(f"{BASE_URL}/api/v1/models/{query}", params={"runtime": runtime})
        return resp.json()

    # Example usage
    system = get_system_info()
    print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')} GB")

    models = get_top_models(use_case="reasoning", limit=3)
    for m in models.get("models", []):
        print(f"{m['name']} score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
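
For the cluster scheduling use case, the same `/api/v1/models/top` endpoint can be polled on several nodes to decide where a workload should land. The sketch below is one possible approach, not llmfit's own scheduler: `fetch_top_models` and `pick_node` are hypothetical helper names, the node URLs are placeholders, and it assumes each response carries a `models` list whose entries have a numeric `score`, as in the examples above. It uses only the standard library.

```python
import json
from typing import Callable, Iterable, Optional, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_top_models(base_url: str, use_case: str,
                     min_fit: str = "good", limit: int = 1) -> list:
    """GET /api/v1/models/top on one node and return its "models" list."""
    query = urlencode({"use_case": use_case, "min_fit": min_fit, "limit": limit})
    with urlopen(f"{base_url}/api/v1/models/top?{query}", timeout=5) as resp:
        return json.load(resp).get("models", [])


def pick_node(nodes: Iterable[str], use_case: str,
              fetch: Callable[[str, str], list] = fetch_top_models
              ) -> Optional[Tuple[str, dict]]:
    """Return (node_url, model) for the highest-scoring model across nodes.

    Unreachable nodes and nodes with no matching model are skipped.
    Returns None when no node has a candidate.
    """
    best = None
    for node in nodes:
        try:
            models = fetch(node, use_case)
        except OSError:
            continue  # urllib's URLError subclasses OSError
        if not models:
            continue
        # Assumption: every model entry has a numeric "score" field.
        top = max(models, key=lambda m: m["score"])
        if best is None or top["score"] > best[1]["score"]:
            best = (node, top)
    return best
```

A call such as `pick_node(["http://node1:8787", "http://node2:8787"], "reasoning")` would return the node whose best reasoning model scores highest. Ties go to the node listed first.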

### Python: Hardware-aware model selector for agents

    import json
    import subprocess
    from typing import Optional

    def get_best_model_for_task(use_case: str, min_fit: str = "good") -> Optional[dict]:
        """Use llmfit to select the best model for a given task."""
        # Note: min_fit is not passed to the CLI here; filter on the
        # returned model's fit tier if a minimum is required.
        result = subprocess.run(
            ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
            capture_output=True,
            text=True,
        )
        data = json.loads(result.stdout)
        models = data.get("models", [])
        return models[0] if models else None

    def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
        """Get hardware requirements for running a specific model."""
        result = subprocess.run(
            ["llmfit", "plan", model_name, "--context", str(context), "--json"],
            capture_output=True,
            text=True,
        )
        return json.loads(result.stdout)

    # Select best coding model
    best = get_best_model_for_task("coding")
    if best:
        print(f"Best coding model: {best['name']}")
        print(f"  Quantization: {best['quantization']}")
        print(f"  Estimated tok/s: {best['tps']}")
        print(f"  Memory usage: {best['mem_pct']}%")

    # Plan hardware for a specific model
    plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
    print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']} GB")
    print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']} GB")

### Docker Compose: Node scheduler pattern

    version: "3.8"
    services:
      llmfit-api:
        image: ghcr.io/alexsjones/llmfit
        command: serve --host 0.0.0.0 --port 8787
        ports:
          - "8787:8787"
        environment:
          - OLLAMA_CONTEXT_LENGTH=8192
        devices:
          - /dev/nvidia0:/dev/nvidia0  # pass GPU through
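
After the container starts, a scheduler usually needs to know when the API is ready before querying it. The following is a rough sketch of a readiness poll against the `/health` endpoint. The function name `wait_for_health` and its parameters are illustrative, and it assumes a ready server answers `/health` with HTTP 200, which the source does not state explicitly.

```python
import time
from typing import Callable, Optional
from urllib.request import urlopen


def wait_for_health(base_url: str, timeout_s: float = 30.0, interval_s: float = 1.0,
                    probe: Optional[Callable[[str], bool]] = None,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll GET <base_url>/health until it succeeds or timeout_s elapses.

    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    def default_probe(url: str) -> bool:
        # Assumption: HTTP 200 means the server is ready.
        with urlopen(url, timeout=2) as resp:
            return resp.status == 200

    check = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check(f"{base_url}/health"):
                return True
        except OSError:
            pass  # connection refused, timeouts, and HTTP errors all land here
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For the compose file above, `wait_for_health("http://localhost:8787")` would block for up to 30 seconds and report whether the API came up.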

## TUI Key Reference

| Key | Action |
|---|---|
| ↑ / ↓ or j / k | Navigate models |
| / | Search (name, provider, params, use case) |
| Esc / Enter | Exit search |
| Ctrl-U | Clear search |
| f | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| a | Cycle availability: All → GGUF Avail → Installed |
| s | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| t | Cycle color theme (auto-saved) |
| v | Visual mode (multi-select for comparison) |
| V | Select mode (column-based filtering) |
| p | Plan mode (what hardware needed for this model?) |
| P | Provider filter popup |
| U | Use-case filter popup |
| C | Capability filter popup |
| m | Mark model for comparison |
| c | Compare view (marked vs selected) |
| d | Download model (via detected runtime) |
| r | Refresh installed models from runtimes |
| Enter | Toggle detail view |
| g / G | Jump to top/bottom |
| q | Quit |

### Themes

t cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox

Theme saved to ~/.config/llmfit/theme

## GPU Detection Details

| GPU Vendor | Detection Method |
|---|---|
| NVIDIA | nvidia-smi (multi-GPU, aggregates VRAM) |
| AMD | rocm-smi |
| Intel Arc | sysfs (discrete) / lspci (integrated) |
| Apple Silicon | system_profiler (unified memory = VRAM) |
| Ascend | npu-smi |

## Common Patterns

### "What can I run on my 16GB M2 Mac?"

    llmfit fit --perfect -n 10

    # or interactively
    llmfit
    # press 'f' to filter to Perfect fit

### "I have a 3090 (24GB VRAM), what coding models fit?"

    llmfit recommend --json --use-case coding | jq '.models[]'

    # or with manual override if detection fails
    llmfit --memory=24G recommend --json --use-case coding

### "Can Llama 70B run on my machine?"

    llmfit info "Llama-3.1-70B"

    # Plan what hardware you'd need
    llmfit plan "Llama-3.1-70B" --context 4096 --json

### "Show me only models already installed in Ollama"

    llmfit
    # press 'a' to cycle to Installed filter
    # or
    llmfit fit -n 20  # run, press 'i' in TUI for installed-first

### "Script: find best model and start Ollama"

    MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    ollama serve &
    ollama run "$MODEL"

### "API: poll node capabilities for cluster scheduler"

    # Check node, get top 3 good+ models for reasoning
    curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
      jq '.models[].name'

## Troubleshooting

### GPU not detected / wrong VRAM reported

    # Verify detection
    llmfit system

    # Manual override
    llmfit --memory=24G --cli

### nvidia-smi not found but you have an NVIDIA GPU

    # Install CUDA toolkit or nvidia-utils, then retry
    # Or override manually:
    llmfit --memory=8G fit --perfect

### Models show as too_tight but you have enough RAM

    # llmfit may be using context-inflated estimates; cap context
    llmfit --max-context 2048 fit --perfect -n 10

### REST API: test endpoints

    # Spawn server and run validation suite
    python3 scripts/test_api.py --spawn

    # Test already-running server
    python3 scripts/test_api.py --base-url http://127.0.0.1:8787

### Apple Silicon: VRAM shows as system RAM (expected)

    # This is correct: Apple Silicon uses unified memory
    # llmfit accounts for this automatically
    llmfit system
    # should show backend: Metal

### Context length environment variable

    export OLLAMA_CONTEXT_LENGTH=4096
    llmfit recommend --json
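
The manual overrides used in the troubleshooting steps above (`--memory` and `--max-context`, placed before the subcommand) can be assembled programmatically. Below is a minimal sketch: `build_llmfit_cmd` is a hypothetical helper, and its validation is an assumption that accepts only a whole number followed by one of the documented suffixes, which may be stricter than llmfit's own parser.

```python
import re
from typing import List, Optional, Sequence

# Documented suffixes: G/GB/GiB, M/MB/MiB, T/TB/TiB (case-insensitive).
# Assumption: the numeric part is a whole number.
_MEMORY_RE = re.compile(r"^\d+(G|GB|GiB|M|MB|MiB|T|TB|TiB)$", re.IGNORECASE)


def build_llmfit_cmd(subcommand: Sequence[str],
                     memory: Optional[str] = None,
                     max_context: Optional[int] = None) -> List[str]:
    """Build an llmfit argv list with optional global overrides first."""
    cmd = ["llmfit"]
    if memory is not None:
        if not _MEMORY_RE.match(memory):
            raise ValueError(f"unsupported memory value: {memory!r}")
        cmd.append(f"--memory={memory}")
    if max_context is not None:
        if max_context <= 0:
            raise ValueError("max_context must be a positive integer")
        cmd += ["--max-context", str(max_context)]
    cmd += list(subcommand)
    return cmd
```

For example, `build_llmfit_cmd(["fit", "--perfect", "-n", "10"], memory="24G", max_context=2048)` produces the argument list for `llmfit --memory=24G --max-context 2048 fit --perfect -n 10`, ready to pass to `subprocess.run`.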