llmfit Hardware Model Matcher

Skill by ara.so, Daily 2026 Skills collection.

llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions, telling you which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).

Installation

macOS / Linux (Homebrew)

    brew install llmfit

Quick install script

    curl -fsSL https://llmfit.axjns.dev/install.sh | sh

    # Without sudo, installs to ~/.local/bin
    curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

Windows (Scoop)

    scoop install llmfit

Docker / Podman

    docker run ghcr.io/alexsjones/llmfit

    # With jq for scripting
    podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'

From source (Rust)

    git clone https://github.com/AlexsJones/llmfit.git
    cd llmfit
    cargo build --release
    # binary at target/release/llmfit

Core Concepts

Fit tiers: perfect (runs great), good (runs well), marginal (runs but tight), too_tight (won't run)
Scoring dimensions: quality, speed (tok/s estimate), fit (memory headroom), context capacity
Run modes: GPU, CPU+GPU offload, CPU-only, MoE
Quantization: automatically selects the best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
Providers: Ollama, llama.cpp, MLX, Docker Model Runner

Key Commands

Launch Interactive TUI

    llmfit

CLI Table Output

    llmfit --cli

Show System Hardware Detection

    llmfit system
    llmfit --json system   # JSON output

List All Models

    llmfit list

Search Models

    llmfit search "llama 8b"
    llmfit search "mistral"
    llmfit search "qwen coding"

Fit Analysis

    # All runnable models ranked by fit
    llmfit fit

    # Only perfect fits, top 5
    llmfit fit --perfect -n 5

    # JSON output
    llmfit --json fit -n 10

Model Detail

    llmfit info "Mistral-7B"
    llmfit info "Llama-3.1-70B"

Recommendations

    # Top 5 recommendations (JSON default)
    llmfit recommend --json --limit 5

    # Filter by use case: general, coding, reasoning, chat, multimodal, embedding
    llmfit recommend --json --use-case coding --limit 3
    llmfit recommend --json --use-case reasoning --limit 5

Hardware Planning (invert: what hardware do I need?)

    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
    llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json

REST API Server (for cluster scheduling)

    llmfit serve
    llmfit serve --host 0.0.0.0 --port 8787

Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):

    # Override GPU VRAM
    llmfit --memory=32G
    llmfit --memory=24G --cli
    llmfit --memory=24G fit --perfect -n 5
    llmfit --memory=24G recommend --json

    # Megabytes
    llmfit --memory=32000M

    # Works with any subcommand
    llmfit --memory=16G info "Llama-3.1-70B"

Accepted suffixes: G / GB / GiB, M / MB / MiB, T / TB / TiB (case-insensitive).

Context Length Cap

    # Estimate memory fit at 4K context
    llmfit --max-context 4096 --cli

    # With subcommands
    llmfit --max-context 8192 fit --perfect -n 5
    llmfit --max-context 16384 recommend --json --limit 5

    # Environment variable alternative
    export OLLAMA_CONTEXT_LENGTH=8192
    llmfit recommend --json

REST API Reference

Start the server:

    llmfit serve --host 0.0.0.0 --port 8787

Endpoints

    # Health check
    curl http://localhost:8787/health

    # Node hardware info
    curl http://localhost:8787/api/v1/system

    # Full model list with filters
    curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"

    # Top runnable models for this node (key scheduling endpoint)
    curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

    # Search by model name/provider (same path and runtime parameter as the Python client in this document)
    curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"
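The query strings above can also be assembled programmatically. The following is a minimal sketch, not part of llmfit: the helper names `top_models_url` and `meets_min_fit` are hypothetical, and the tier ordering (perfect > good > marginal > too_tight) is assumed from the order in which the fit tiers are listed under Core Concepts.

```python
from urllib.parse import urlencode

# Fit tiers from best to worst (ordering assumed from the Core Concepts list).
FIT_TIERS = ["perfect", "good", "marginal", "too_tight"]


def top_models_url(base_url, limit=5, min_fit="good", use_case=None):
    """Build a /api/v1/models/top URL using the parameters shown above."""
    if min_fit not in FIT_TIERS:
        raise ValueError(f"unknown fit tier: {min_fit!r}")
    params = {"limit": limit, "min_fit": min_fit}
    if use_case is not None:
        params["use_case"] = use_case
    return f"{base_url.rstrip('/')}/api/v1/models/top?{urlencode(params)}"


def meets_min_fit(fit, min_fit):
    """Return True if `fit` is at least as good as `min_fit`."""
    return FIT_TIERS.index(fit) <= FIT_TIERS.index(min_fit)


print(top_models_url("http://localhost:8787", use_case="coding"))
# http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding
```

Validating `min_fit` on the client side catches typos before a request is sent; how the server itself treats an unknown tier is not covered here.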

Bash: Get top coding models

    #!/bin/bash
    # Get top 3 coding models that fit perfectly
    llmfit recommend --json --use-case coding --limit 3 | \
      jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'

Bash: Check if a specific model fits

    #!/bin/bash
    MODEL="Mistral-7B"
    RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
    FIT=$(echo "$RESULT" | jq -r '.fit')
    if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
      echo "$MODEL will run well (fit: $FIT)"
    else
      echo "$MODEL may not run well (fit: $FIT)"
    fi

Bash: Auto-pull top Ollama model

    #!/bin/bash
    # Get the top fitting model name and pull it with Ollama
    TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    echo "Pulling: $TOP_MODEL"
    ollama pull "$TOP_MODEL"

Python: Query the REST API

    import requests

    BASE_URL = "http://localhost:8787"

    def get_system_info():
        resp = requests.get(f"{BASE_URL}/api/v1/system")
        return resp.json()

    def get_top_models(use_case="coding", limit=5, min_fit="good"):
        params = {
            "use_case": use_case,
            "limit": limit,
            "min_fit": min_fit,
            "sort": "score",
        }
        resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
        return resp.json()

    def search_models(query, runtime="any"):
        resp = requests.get(
            f"{BASE_URL}/api/v1/models/{query}",
            params={"runtime": runtime},
        )
        return resp.json()

    # Example usage
    system = get_system_info()
    print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')} GB")

    models = get_top_models(use_case="reasoning", limit=3)
    for m in models.get("models", []):
        print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")

Python: Hardware-aware model selector for agents

    import json
    import subprocess
    from typing import Optional

    def get_best_model_for_task(use_case: str, min_fit: str = "good") -> Optional[dict]:
        """Use llmfit to select the best model for a given task.

        Returns None when llmfit reports no models. Note that min_fit is
        accepted but not forwarded to the CLI in this example.
        """
        result = subprocess.run(
            ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
            capture_output=True,
            text=True,
        )
        data = json.loads(result.stdout)
        models = data.get("models", [])
        return models[0] if models else None

    def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
        """Get hardware requirements for running a specific model."""
        result = subprocess.run(
            ["llmfit", "plan", model_name, "--context", str(context), "--json"],
            capture_output=True,
            text=True,
        )
        return json.loads(result.stdout)

    # Select best coding model
    best = get_best_model_for_task("coding")
    if best:
        print(f"Best coding model: {best['name']}")
        print(f"  Quantization: {best['quantization']}")
        print(f"  Estimated tok/s: {best['tps']}")
        print(f"  Memory usage: {best['mem_pct']}%")

    # Plan hardware for a specific model
    plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
    print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']} GB")
    print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']} GB")

Docker Compose: Node scheduler pattern

    version: "3.8"
    services:
      llmfit-api:
        image: ghcr.io/alexsjones/llmfit
        command: serve --host 0.0.0.0 --port 8787
        ports:
          - "8787:8787"
        environment:
          - OLLAMA_CONTEXT_LENGTH=8192
        devices:
          - /dev/nvidia0:/dev/nvidia0  # pass GPU through
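On the scheduler side, the responses from several such nodes have to be compared. The sketch below shows one possible way to do that. It is an illustration only: `pick_node` is a hypothetical helper, the node and model names are made up, and the `{"models": [{"name", "score", "fit"}]}` response shape is assumed from the scripting examples in this document rather than taken from an API schema.

```python
def pick_node(node_responses, min_score=0.0):
    """Return (node, model) for the highest-scoring model across all nodes.

    node_responses maps a node name to the parsed JSON returned by that
    node's /api/v1/models/top endpoint (response shape is an assumption).
    Returns None when no node reports a model at or above min_score.
    """
    best = None
    best_score = None
    for node, payload in node_responses.items():
        for model in payload.get("models", []):
            score = model.get("score", 0.0)
            if score < min_score:
                continue
            if best_score is None or score > best_score:
                best = (node, model)
                best_score = score
    return best


# Made-up sample responses from three nodes.
sample = {
    "node1": {"models": [{"name": "model-a", "score": 81.0, "fit": "good"}]},
    "node2": {"models": [{"name": "model-b", "score": 92.5, "fit": "perfect"}]},
    "node3": {"models": []},
}

node, model = pick_node(sample)
print(node, model["name"])  # node2 model-b
```

In a real scheduler the `sample` dictionary would be filled by polling each node's endpoint; it is hard-coded here so the selection logic can be run without any node available.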

TUI Key Reference

    Key              Action
    ↑ / ↓ or j / k   Navigate models
    /                Search (name, provider, params, use case)
    Esc / Enter      Exit search
    Ctrl-U           Clear search
    f                Cycle fit filter: All → Runnable → Perfect → Good → Marginal
    a                Cycle availability: All → GGUF Avail → Installed
    s                Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case
    t                Cycle color theme (auto-saved)
    v                Visual mode (multi-select for comparison)
    V                Select mode (column-based filtering)
    p                Plan mode (what hardware needed for this model?)
    P                Provider filter popup
    U                Use-case filter popup
    C                Capability filter popup
    m                Mark model for comparison
    c                Compare view (marked vs selected)
    d                Download model (via detected runtime)
    r                Refresh installed models from runtimes
    Enter            Toggle detail view
    g / G            Jump to top/bottom
    q                Quit

Themes

t cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox. The theme is saved to ~/.config/llmfit/theme.

GPU Detection Details

    GPU Vendor       Detection Method
    NVIDIA           nvidia-smi (multi-GPU, aggregates VRAM)
    AMD              rocm-smi
    Intel Arc        sysfs (discrete) / lspci (integrated)
    Apple Silicon    system_profiler (unified memory = VRAM)
    Ascend           npu-smi

Common Patterns

"What can I run on my 16GB M2 Mac?"

    llmfit fit --perfect -n 10
    # or interactively
    llmfit
    # press 'f' to filter to Perfect fit

"I have a 3090 (24GB VRAM), what coding models fit?"

    llmfit recommend --json --use-case coding | jq '.models[]'
    # or with manual override if detection fails
    llmfit --memory=24G recommend --json --use-case coding

"Can Llama 70B run on my machine?"

    llmfit info "Llama-3.1-70B"
    # Plan what hardware you'd need
    llmfit plan "Llama-3.1-70B" --context 4096 --json

"Show me only models already installed in Ollama"

    llmfit
    # press 'a' to cycle to Installed filter
    # or
    llmfit fit -n 20  # run, press 'i' in TUI for installed-first

"Script: find best model and start Ollama"

    MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    ollama serve &
    ollama run "$MODEL"

"API: poll node capabilities for cluster scheduler"

    # Check node, get top 3 good+ models for reasoning
    curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
      jq '.models[].name'

Troubleshooting

GPU not detected / wrong VRAM reported

    # Verify detection
    llmfit system

    # Manual override
    llmfit --memory=24G --cli

nvidia-smi not found but you have an NVIDIA GPU

    # Install CUDA toolkit or nvidia-utils, then retry
    # Or override manually:
    llmfit --memory=8G fit --perfect

Models show as too_tight but you have enough RAM

    # llmfit may be using context-inflated estimates; cap context
    llmfit --max-context 2048 fit --perfect -n 10

REST API: test endpoints

    # Spawn server and run validation suite
    python3 scripts/test_api.py --spawn

    # Test already-running server
    python3 scripts/test_api.py --base-url http://127.0.0.1:8787

Apple Silicon: VRAM shows as system RAM (expected)

    # This is correct: Apple Silicon uses unified memory
    # llmfit accounts for this automatically
    llmfit system
    # should show backend: Metal

Context length environment variable

    export OLLAMA_CONTEXT_LENGTH=4096
    llmfit recommend --json


With OLLAMA_CONTEXT_LENGTH=4096 exported, llmfit uses 4096 as the context cap.
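Capping the context helps with too_tight results because the key/value cache of a transformer grows linearly with context length. The sketch below uses the standard KV-cache size formula to illustrate the effect. It is not llmfit's estimator, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption rather than a value read from any llmfit output.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


# Illustrative model shape (assumed): 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"context {ctx}: ~{gib:.2f} GiB KV cache")
# context 2048: ~0.25 GiB KV cache
# context 4096: ~0.50 GiB KV cache
# context 8192: ~1.00 GiB KV cache
```

Halving the context halves this part of the memory footprint, while the model weights stay the same size.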