BigCode Evaluation Harness - Code Model Benchmarking Quick Start
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
Evaluate on HumanEval:
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
View available tasks:
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
Common Workflows

Workflow 1: Standard Code Benchmark Evaluation
Evaluate a model on the core Python code benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
Step 1: Choose benchmark suite
Python code generation (most common):
- HumanEval: 164 handwritten problems, function completion
- HumanEval+: the same 164 problems with 80× more tests (stricter)
- MBPP: 500 crowd-sourced problems, entry-level difficulty
- MBPP+: 399 curated problems with 35× more tests
Multi-language (18 languages):
MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.
Advanced:
- APPS: 10,000 problems (introductory/interview/competition)
- DS-1000: 1,000 data science problems across 7 libraries
Step 2: Configure model and generation
Standard HuggingFace model
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--do_sample True \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution
Quantized model (4-bit)
accelerate launch main.py \
--model codellama/CodeLlama-34b-hf \
--tasks humaneval \
--load_in_4bit \
--max_length_generation 512 \
--allow_code_execution
Custom/private model
accelerate launch main.py \
--model /path/to/my-code-model \
--tasks humaneval \
--trust_remote_code \
--use_auth_token \
--allow_code_execution
Step 3: Run evaluation
Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--temperature 0.8 \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution \
--save_generations \
--metric_output_path results/starcoder2-humaneval.json
Step 4: Analyze results
Results in results/starcoder2-humaneval.json:
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
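For reference, pass@k is usually computed per problem with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass the tests, and then averaged over problems. The sketch below assumes the harness reports this quantity; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and not part of the harness API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Because the estimator needs n >= k, reporting pass@100 requires at least 100 samples per problem, which is why the run above uses `--n_samples 200`.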
Workflow 2: Multi-Language Evaluation (MultiPL-E)
Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
Step 1: Generate solutions on host
Generate without execution (safe)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--max_length_generation 650 \
--temperature 0.8 \
--n_samples 50 \
--batch_size 50 \
--generation_only \
--save_generations \
--save_generations_path generations_multi.json
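Before copying generations into the container, it can help to confirm that the file holds the expected number of samples. The sketch below assumes the saved file is a JSON list with one entry per problem, each entry being a list of generated strings; that layout is an assumption, so check it against your own output. `summarize_generations` and `summarize_file` are illustrative helper names.

```python
import json
from collections import Counter


def summarize_generations(generations):
    """Summarize a list (one entry per problem) of lists of candidate strings."""
    # Histogram of how many samples each problem received.
    samples_per_problem = Counter(len(candidates) for candidates in generations)
    # Count candidates that are empty or whitespace only.
    empty = sum(
        1 for candidates in generations for text in candidates if not text.strip()
    )
    return {
        "problems": len(generations),
        "samples_per_problem": dict(samples_per_problem),
        "empty_samples": empty,
    }


def summarize_file(path):
    """Load a generations file (assumed layout, see above) and summarize it."""
    with open(path) as f:
        return summarize_generations(json.load(f))
```

With `--n_samples 50`, every problem should report 50 samples; any other count suggests an interrupted run.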
Step 2: Evaluate in Docker container
Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
Run evaluation inside container
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
-it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--load_generations_path /app/generations.json \
--allow_code_execution \
--n_samples 50
Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
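Step 3 of the checklist (comparing across languages) can be sketched as follows. It assumes the metrics file has the same layout as the HumanEval results in Workflow 1: one key per task holding pass@k values, plus a "config" entry. The helper names are illustrative.

```python
def pass_at_1_by_task(metrics: dict) -> dict:
    """Extract pass@1 for every task entry, skipping non-task keys like "config"."""
    return {
        task: scores["pass@1"]
        for task, scores in metrics.items()
        if isinstance(scores, dict) and "pass@1" in scores
    }


def rank_tasks(metrics: dict) -> list:
    """Return (task, pass@1) pairs sorted from strongest to weakest."""
    return sorted(pass_at_1_by_task(metrics).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    example = {
        "multiple-py": {"pass@1": 0.5},
        "multiple-js": {"pass@1": 0.25},
        "config": {"model": "bigcode/starcoder2-7b"},
    }
    for task, score in rank_tasks(example):
        print(f"{task}: {score:.3f}")
```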
Workflow 3: Instruction-Tuned Model Evaluation
Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
Step 1: Choose instruction tasks
- instruct-humaneval: HumanEval with instruction prompts
- humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks
Step 2: Configure instruction tokens
For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks instruct-humaneval \
--instruction_tokens "[INST],,[/INST]" \
--max_length_generation 512 \
--allow_code_execution
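The `--instruction_tokens` value is a comma-separated triple. The sketch below shows one plausible reading: user token, end token, assistant token, with the end token empty in the example above. The order of the tokens and the way they are concatenated around the instruction are assumptions for illustration, so verify them against the harness documentation for your task. Both function names are hypothetical.

```python
def parse_instruction_tokens(spec: str):
    """Split "user,end,assistant" into its three parts (any part may be empty)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected exactly three comma-separated tokens")
    user_token, end_token, assistant_token = parts
    return user_token, end_token, assistant_token


def build_prompt(instruction: str, spec: str) -> str:
    """Wrap an instruction in the tokens (assumed concatenation order)."""
    user_token, end_token, assistant_token = parse_instruction_tokens(spec)
    return f"{user_token}{instruction}{end_token}{assistant_token}"
```

For example, `build_prompt("Write f.", "[INST],,[/INST]")` yields `[INST]Write f.[/INST]` under these assumptions.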
Step 3: HumanEvalPack for instruction models
HumanEvalPack covers 6 languages; this example runs the Python and JavaScript synthesis tasks
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python,humanevalsynthesize-js \
--prompt instruct \
--max_length_generation 512 \
--allow_code_execution
Workflow 4: Compare Multiple Models
Run the same benchmark suite on several models and compare the results.
Step 1: Create evaluation script
#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

# Make sure the output directory exists before writing metrics to it.
mkdir -p results

for model in "${MODELS[@]}"; do
  # Replace '/' so the model ID can be used as a file name.
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
Step 2: Generate comparison table
import json

import pandas as pd

# File stems match the names written by eval_models.sh ('/' replaced by '-').
models = [
    "bigcode-starcoder2-7b",
    "codellama-CodeLlama-7b-hf",
    "deepseek-ai-deepseek-coder-6.7b-base",
]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
# DataFrame.to_markdown() needs the optional "tabulate" package.
print(df.to_markdown(index=False))
When to Use vs Alternatives
Use BigCode Evaluation Harness when:
- Evaluating code generation models specifically
- Need multi-language evaluation (18 languages via MultiPL-E)
- Testing functional correctness with unit tests (pass@k)
- Benchmarking for BigCode/HuggingFace leaderboards
- Evaluating fill-in-the-middle (FIM) capabilities
Use alternatives instead:
- lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
- EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
- SWE-bench: Real-world GitHub issue resolution
- LiveCodeBench: Contamination-free, continuously updated problems
- CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |

Common Issues
Issue: Different results than reported in papers
Check these factors:
1. Verify n_samples (must be at least k; 200 is the usual choice for stable pass@10 and pass@100 estimates)
--n_samples 200
2. Check temperature (0.2 is common for pass@1, 0.8 for pass@10 and pass@100)
--temperature 0.8
3. Verify task name matches exactly
--tasks humaneval # Not "human_eval" or "HumanEval"
4. Check max_length_generation
--max_length_generation 512 # Increase for longer problems
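These checks can be automated by comparing the settings recorded with a run against the settings you intended to reproduce. The sketch below assumes the metrics file stores them under a "config" key, as in the results example in Workflow 1; `config_mismatches` is an illustrative helper, not part of the harness.

```python
def config_mismatches(actual: dict, expected: dict) -> dict:
    """Return {key: (actual_value, expected_value)} for every setting that differs.

    Keys missing from `actual` are reported with an actual value of None.
    """
    return {
        key: (actual.get(key), want)
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    recorded = {"model": "bigcode/starcoder2-7b", "temperature": 0.2, "n_samples": 20}
    intended = {"temperature": 0.8, "n_samples": 200}
    for key, (got, want) in config_mismatches(recorded, intended).items():
        print(f"{key}: run used {got!r}, reference uses {want!r}")
```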
Issue: CUDA out of memory
Use quantization
--load_in_8bit
OR
--load_in_4bit
Reduce batch size
--batch_size 1
Set memory limit
--max_memory_per_gpu "20GiB"
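To judge which option is needed, a rough lower bound on memory is the parameter count times the bytes per parameter. The sketch below covers model weights only; activations, the KV cache, and quantization overhead come on top, so real usage is higher. The function and table names are illustrative.

```python
# Approximate storage per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Lower-bound estimate of weight memory in GB (1 GB = 1e9 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(f"7B model, {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

A 7B model in fp16 comes out at 14 GB, matching the fp16 column of the Hardware Requirements table; the 4-bit figures in that table are higher than this weights-only estimate because of runtime overhead.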
Issue: Code execution hangs or times out
Use Docker for safe execution:
Generate on host (no execution)
--generation_only --save_generations
Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
Issue: Low scores on instruction models
Ensure proper instruction formatting:
Use instruction-specific tasks
--tasks instruct-humaneval
Set instruction tokens for your model
--instruction_tokens "[INST],,[/INST]"
Issue: MultiPL-E language failures
Use the dedicated Docker image:
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
Command Reference

| Argument | Default | Description |
|---|---|---|
| --model | - | HuggingFace model ID or local path |
| --tasks | - | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required for tasks that run generated code) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | - | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | evaluation_results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |

Hardware Requirements

| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |

Resources

- GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
- Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
- BigCode Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- HumanEval Dataset: https://huggingface.co/datasets/openai/openai_humaneval
- MultiPL-E: https://github.com/nuprl/MultiPL-E