Gemma Multimodal Fine-Tuner
Skill by ara.so, Daily 2026 Skills collection.
Fine-tune Gemma 4 and Gemma 3n models on text, image, and audio data entirely on Apple Silicon (MPS), with support for streaming large datasets from GCS/BigQuery without filling local storage.
What It Does
- Text LoRA: instruction-tuning or completion fine-tuning from local CSV
- Image + Text LoRA: captioning and VQA from local CSV
- Audio + Text LoRA: the only Apple-Silicon-native path for this modality
- Cloud streaming: train on terabytes from GCS/BigQuery without a local copy
- MPS-native: no NVIDIA GPU required; runs on MacBook Pro/Air/Mac Studio
Installation
Prerequisites
- macOS 12.3+ with Apple Silicon (arm64)
- Python 3.10+ (native arm64, not Rosetta)
- Hugging Face account with Gemma access
    # Install Python 3.12 if needed
    brew install python@3.12

    # Create venv
    python3.12 -m venv .venv
    source .venv/bin/activate

    # Verify arm64 (must show arm64, not x86_64)
    python -c "import platform; print(platform.machine())"

    # Install PyTorch
    pip install torch torchaudio

    # Clone and install
    git clone https://github.com/mattmireles/gemma-tuner-multimodal
    cd gemma-tuner-multimodal
    pip install -e .

    # For Gemma 4 support (separate venv recommended)
    pip install -r requirements/requirements-gemma4.txt
Authenticate with Hugging Face
    huggingface-cli login

    # Or set environment variable:
    export HF_TOKEN=your_token_here
CLI Commands
    # Check system is ready
    gemma-macos-tuner system-check

    # Guided setup wizard (recommended for first run)
    gemma-macos-tuner wizard

    # Prepare dataset
    gemma-macos-tuner prepare <dataset-profile>

    # Fine-tune a model
    gemma-macos-tuner finetune <profile> --json-logging

    # Evaluate a run
    gemma-macos-tuner evaluate <profile-or-run>

    # Export merged HF/SafeTensors (merges LoRA when adapter_config.json is present)
    gemma-macos-tuner export <run-dir-or-profile>

    # Blacklist bad samples from errors
    gemma-macos-tuner blacklist <profile>

    # List training runs
    gemma-macos-tuner runs list
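When the same prepare, finetune, export sequence is repeated for several profiles, it can help to script it. The following is a minimal sketch, not part of the project: `build_workflow` and `run_workflow` are hypothetical helper names, and the only things taken from the commands above are the executable name, the three subcommands, and the `--json-logging` flag. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List

CLI = "gemma-macos-tuner"  # executable name taken from the commands above


def build_workflow(dataset: str, profile: str, json_logging: bool = True) -> List[List[str]]:
    """Return argv lists for prepare -> finetune -> export, in that order."""
    finetune = [CLI, "finetune", profile]
    if json_logging:
        finetune.append("--json-logging")
    return [
        [CLI, "prepare", dataset],
        finetune,
        [CLI, "export", profile],
    ]


def run_workflow(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the chain at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_workflow(build_workflow("my-dataset", "my-text-run"))
```

Passing argv lists rather than a shell string avoids quoting problems if a profile name ever contains spaces.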
Configuration (config/config.ini)
The config is hierarchical INI: defaults → groups → models → datasets → profiles.
    [defaults]
    output_dir = output
    batch_size = 2
    gradient_accumulation_steps = 8
    learning_rate = 2e-4
    num_train_epochs = 3

    [model:gemma-3n-e2b-it]
    group = gemma
    base_model = google/gemma-3n-E2B-it

    [model:gemma-4-e2b-it]
    group = gemma
    base_model = google/gemma-4-E2B-it

    [dataset:my-audio-dataset]
    data_dir = data/datasets/my-audio-dataset
    audio_column = audio_path
    text_column = transcript

    [profile:my-audio-profile]
    model = gemma-3n-e2b-it
    dataset = my-audio-dataset
    modality = audio
    lora_r = 16
    lora_alpha = 32
    lora_dropout = 0.05
    max_seq_length = 512
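To make the layering concrete, here is a rough sketch of how a profile could be resolved with Python's standard `configparser`, with later layers overriding earlier ones. This is an illustration under assumptions, not the project's actual loader: `resolve_profile` is a hypothetical helper, the `[group:<name>]` section naming is a guess (no group section is shown above), and the profile-level `batch_size = 1` override is added only to show precedence.

```python
import configparser

SAMPLE = """
[defaults]
batch_size = 2
gradient_accumulation_steps = 8
learning_rate = 2e-4

[model:gemma-3n-e2b-it]
group = gemma
base_model = google/gemma-3n-E2B-it

[dataset:my-audio-dataset]
data_dir = data/datasets/my-audio-dataset
text_column = transcript

[profile:my-audio-profile]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
batch_size = 1
"""


def resolve_profile(parser: configparser.ConfigParser, name: str) -> dict:
    """Merge defaults -> group -> model -> dataset -> profile; later layers win."""
    profile = dict(parser[f"profile:{name}"])
    model = dict(parser[f"model:{profile['model']}"])
    dataset = dict(parser[f"dataset:{profile['dataset']}"])
    # Assumed naming for group sections; skipped when no such section exists.
    group_section = f"group:{model.get('group', '')}"
    group = dict(parser[group_section]) if parser.has_section(group_section) else {}
    merged = {}
    for layer in (dict(parser["defaults"]), group, model, dataset, profile):
        merged.update(layer)
    return merged


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
cfg = resolve_profile(parser, "my-audio-profile")

# Samples contributing to each optimizer step.
effective_batch = int(cfg["batch_size"]) * int(cfg["gradient_accumulation_steps"])
print(cfg["base_model"], cfg["learning_rate"], effective_batch)
```

In this sketch the profile's `batch_size = 1` overrides the default of 2, so the effective batch size is 1 × 8 = 8, while `learning_rate` is inherited unchanged from `[defaults]`.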
Use the GEMMA_TUNER_CONFIG env var to point to a config outside the repo root:
    export GEMMA_TUNER_CONFIG=/path/to/my/config.ini
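Scripts that wrap the tuner may want to find the same config file the CLI would use. A minimal sketch of that lookup, assuming the override simply takes precedence over the in-repo `config/config.ini` (the tool's real resolution logic may differ, and `resolve_config_path` is a hypothetical name):

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("config/config.ini")  # in-repo location named above


def resolve_config_path(environ=os.environ) -> Path:
    """Prefer GEMMA_TUNER_CONFIG when set and non-empty, else the in-repo default."""
    override = environ.get("GEMMA_TUNER_CONFIG", "").strip()
    return Path(override).expanduser() if override else DEFAULT_CONFIG


if __name__ == "__main__":
    print(resolve_config_path())
```

Accepting the environment as a parameter keeps the function easy to test without mutating `os.environ`.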
Modality Configuration
Text-Only Fine-Tuning
Instruction tuning (user/assistant pairs):
    [profile:text-instruction]
    model = gemma-3n-e2b-it
    dataset = my-text-dataset
    modality = text
    text_sub_mode = instruction
    prompt_column = prompt
    text_column = response
    max_seq_length = 2048
    lora_r = 16
    lora_alpha = 32
Completion tuning (full sequence trained):
    [profile:text-completion]
    model = gemma-3n-e2b-it
    dataset = my-text-dataset
    modality = text
    text_sub_mode = completion
    text_column = text
    max_seq_length = 2048
CSV format for instruction tuning (data/datasets/my-text-dataset/train.csv):
    prompt,response
    "What is photosynthesis?","Photosynthesis is the process by which plants..."
    "Explain LoRA fine-tuning","LoRA (Low-Rank Adaptation) is a parameter-efficient..."
Image Fine-Tuning
    [profile:image-caption]
    model = gemma-3n-e2b-it
    dataset = my-image-dataset
    modality = image
    image_sub_mode = captioning
    image_token_budget = 256
    prompt_column = prompt
    text_column = caption
    max_seq_length = 512
CSV format (data/datasets/my-image-dataset/train.csv):
    image_path,prompt,caption
    /data/images/img1.jpg,Describe this image,A dog sitting on a green lawn...
    /data/images/img2.jpg,What is shown here,A bar chart showing quarterly revenue...
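Real captions and prompts often contain commas, quotes, or line breaks, which break a hand-assembled CSV. One way to avoid that is to generate `train.csv` with Python's standard `csv` module, which quotes fields only when needed. The sketch below uses the column names from the example above; the rows are illustrative placeholders, and `write_image_csv` is a hypothetical helper, not part of the tuner.

```python
import csv
import io

FIELDS = ["image_path", "prompt", "caption"]

rows = [
    {
        "image_path": "/data/images/img1.jpg",
        "prompt": "Describe this image",
        "caption": "A dog sitting on a green lawn, facing the camera.",
    },
    {
        "image_path": "/data/images/img2.jpg",
        "prompt": 'What does the "Q3" bar show?',
        "caption": "Quarterly revenue,\nwith Q3 the highest.",
    },
]


def write_image_csv(rows, handle) -> None:
    """Write rows with a header; csv handles commas, quotes, and newlines."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Write to memory here; for a real file use
# open(path, "w", newline="", encoding="utf-8") as the handle.
buffer = io.StringIO()
write_image_csv(rows, buffer)

# Reading the text back should reproduce the original rows exactly.
round_trip = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(round_trip == rows)
```

The same pattern applies to the text and audio CSVs by changing `FIELDS` to the relevant column names.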
Audio Fine-Tuning
    [profile:audio-asr]
    model = gemma-3n-e2b-it
    dataset = my-audio-dataset
    modality = audio
    audio_column = audio_path
    text_column = transcript
    max_seq_length = 512
    lora_r = 16
    lora_alpha = 32
    lora_dropout = 0.05
CSV format (data/datasets/my-audio-dataset/train.csv):
    audio_path,transcript
    /data/audio/recording1.wav,The patient presents with acute respiratory symptoms
    /data/audio/recording2.wav,Counsel objects to the characterization of the evidence
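Because each row points at a file on disk, a quick pre-flight pass over the CSV can catch missing files or empty transcripts before a long run starts. This is a sketch under assumptions: `find_problems` is a hypothetical helper, not the tool's own validation, and it only checks that columns exist, text is non-empty, and paths resolve to files. The demo builds a throwaway dataset in a temporary directory.

```python
import csv
import tempfile
from pathlib import Path


def find_problems(csv_path, path_column, text_column):
    """Return (row_number, message) pairs for rows that would likely fail to load."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {path_column, text_column} - set(reader.fieldnames or [])
        if missing:
            return [(1, f"missing columns: {sorted(missing)}")]
        # Row 1 is the header, so data rows start at 2.
        for row_no, row in enumerate(reader, start=2):
            if not (row[text_column] or "").strip():
                problems.append((row_no, "empty text"))
            if not Path(row[path_column] or "").is_file():
                problems.append((row_no, f"file not found: {row[path_column]}"))
    return problems


# Demo with placeholder data: one existing file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    wav = Path(tmp) / "recording1.wav"
    wav.touch()  # empty placeholder, not real audio
    csv_path = Path(tmp) / "train.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["audio_path", "transcript"])
        writer.writerow([str(wav), "hello world"])
        writer.writerow([str(Path(tmp) / "missing.wav"), "second clip"])
    demo_problems = find_problems(csv_path, "audio_path", "transcript")

print(demo_problems)
```

The same check works for image datasets by passing `image_path` and `caption` as the column names.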
Supported Models
| Model Key | Hugging Face ID | Notes |
|---|---|---|
| gemma-3n-e2b-it | google/gemma-3n-E2B-it | Default, ~2B instruct |
| gemma-3n-e4b-it | google/gemma-3n-E4B-it | ~4B instruct |
| gemma-4-e2b-it | google/gemma-4-E2B-it | Needs requirements-gemma4.txt |
| gemma-4-e4b-it | google/gemma-4-E4B-it | Needs requirements-gemma4.txt |
| gemma-4-e2b | google/gemma-4-E2B | Base, needs Gemma 4 stack |
| gemma-4-e4b | google/gemma-4-E4B | Base, needs Gemma 4 stack |

Add custom models with a [model:your-name] section using group = gemma.
Dataset Directory Layout
data/
└── datasets/
└──
    # Load config
    config = load_config("config/config.ini")

    # Run fine-tuning for a profile
    run_finetune(profile="my-audio-profile", config=config, json_logging=True)
Using Device Utilities
    from gemma_tuner.utils.device import get_device, memory_hint

    device = get_device()  # Returns "mps", "cuda", or "cpu"
    print(f"Training on: {device}")

    hint = memory_hint(model_key="gemma-3n-e2b-it")
    print(hint)
Loading and Inspecting Datasets
    from gemma_tuner.utils.dataset_utils import load_csv_dataset

    train_df, val_df = load_csv_dataset(
        data_dir="data/datasets/my-text-dataset",
        text_column="response",
        prompt_column="prompt",
    )
    print(f"Train samples: {len(train_df)}, Val samples: {len(val_df)}")
Custom LoRA Config
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3n-E2B-it",
        torch_dtype="auto",
        device_map="mps",
    )
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
Common Patterns
Full Workflow: Text Instruction Tuning
    # 1. Prepare your data
    mkdir -p data/datasets/my-dataset
    cp train.csv data/datasets/my-dataset/
    cp validation.csv data/datasets/my-dataset/

    # 2. Add profile to config/config.ini
    cat >> config/config.ini << 'EOF'
    [dataset:my-dataset]
    data_dir = data/datasets/my-dataset

    [profile:my-text-run]
    model = gemma-3n-e2b-it
    dataset = my-dataset
    modality = text
    text_sub_mode = instruction
    prompt_column = prompt
    text_column = response
    max_seq_length = 2048
    lora_r = 16
    lora_alpha = 32
    EOF

    # 3. Prepare dataset
    gemma-macos-tuner prepare my-dataset

    # 4. Fine-tune
    gemma-macos-tuner finetune my-text-run --json-logging

    # 5. Export merged weights
    gemma-macos-tuner export my-text-run
GCS Streaming for Large Datasets
    [dataset:large-audio-gcs]
    source = gcs
    gcs_bucket = my-bucket
    gcs_prefix = audio-training-data/
    audio_column = audio_path
    text_column = transcript

    [profile:large-audio-run]
    model = gemma-3n-e4b-it
    dataset = large-audio-gcs
    modality = audio
    lora_r = 32
    lora_alpha = 64
Set credentials:
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
    gemma-macos-tuner finetune large-audio-run
Add a Custom Gemma Checkpoint
    [model:my-custom-gemma]
    group = gemma
    base_model = my-org/my-gemma-checkpoint

    [profile:custom-run]
    model = my-custom-gemma
    dataset = my-dataset
    modality = text
    text_sub_mode = instruction
Troubleshooting
Wrong architecture (x86_64 instead of arm64)
    python -c "import platform; print(platform.machine())"
    # Must be arm64; if x86_64, reinstall Python natively:
    brew install python@3.12
    python3.12 -m venv .venv && source .venv/bin/activate
MPS out of memory
- Reduce batch_size (try 1)
- Increase gradient_accumulation_steps to compensate
- Use a smaller model (e2b instead of e4b)
- Reduce max_seq_length
Gemma 4 model not loading
    # Gemma 4 requires the updated Transformers stack
    pip install -r requirements/requirements-gemma4.txt
    # Use a separate venv if you also need Gemma 3n
Config not found outside repo root
    export GEMMA_TUNER_CONFIG=/absolute/path/to/config/config.ini
    gemma-macos-tuner finetune my-profile
Hugging Face auth errors
    huggingface-cli login
    # Or:
    export HF_TOKEN=your_hf_token
    # Accept Gemma license at: https://huggingface.co/google/gemma-3n-E2B-it
System check before debugging anything else
    gemma-macos-tuner system-check
Audio tower loaded even for text-only runs
This is a known v1 issue: USM audio tower weights stay in memory even for modality = text. See README/KNOWN_ISSUES.md. Workaround: use a smaller model variant to stay within the RAM budget.
Architecture Reference
| File | Role |
|---|---|
| gemma_tuner/cli_typer.py | Main CLI entrypoint (gemma-macos-tuner) |
| gemma_tuner/core/ops.py | Dispatches prepare/finetune/evaluate/export |
| gemma_tuner/scripts/finetune.py | Router: Gemma models → models/gemma/finetune.py |
| gemma_tuner/models/gemma/finetune.py | Core training loop with LoRA |
| gemma_tuner/scripts/export.py | Merges LoRA → HF/SafeTensors tree |
| gemma_tuner/utils/device.py | MPS/CUDA/CPU selection and memory hints |
| gemma_tuner/utils/dataset_utils.py | CSV loading, blacklist/protection semantics |
| gemma_tuner/wizard/ | Interactive CLI wizard (questionary + Rich) |
| config/config.ini | Hierarchical INI configuration |