obliteratus-abliteration

Installs: 355
Rank: #6049

Install

```shell
npx skills add https://github.com/aradotso/trending-skills --skill obliteratus-abliteration
```

OBLITERATUS — LLM Abliteration Toolkit Skill by ara.so — Daily 2026 Skills collection. OBLITERATUS is an open-source toolkit for identifying and surgically removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA, projects them out of the weights, and preserves core language capabilities. Ships with a Gradio UI, CLI, Python API, and Colab notebook.
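The mechanism described above reduces to two linear-algebra steps: estimate a refusal direction from contrasting activations, then project that direction out of a weight matrix. A minimal NumPy sketch of the idea (toy data and hypothetical variable names, not the OBLITERATUS API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: rows are hidden states on refused vs. complied prompts;
# the refused set is shifted along one axis to stand in for a refusal signal
h_refused = rng.normal(size=(64, 16)) + 2.0 * np.eye(16)[0]
h_complied = rng.normal(size=(64, 16))

# Difference-of-means refusal direction, unit-normalized
r = h_refused.mean(axis=0) - h_complied.mean(axis=0)
r /= np.linalg.norm(r)

# Project the direction out of a weight matrix that writes to the residual
# stream: W' = W - r r^T W  (rank-1 orthogonal projection of the output space)
W = rng.normal(size=(16, 16))
W_ablated = W - np.outer(r, r) @ W

# The modified weights can no longer write along r
print(np.abs(r @ W_ablated).max())  # ~0 up to floating-point error
```

Everything else in the toolkit (layer selection, multiple directions, iterative passes) refines where and how this projection is applied.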

Installation

Core install

```shell
pip install obliteratus
```

With Gradio UI support

```shell
pip install "obliteratus[spaces]"
```

With all optional analysis modules

```shell
pip install "obliteratus[full]"
```

From source (latest)

```shell
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"
```

Requirements:

- Python 3.10+
- PyTorch 2.1+ with CUDA (recommended) or CPU
- transformers, accelerate, gradio>=5.29.0
- HuggingFace account + token for gated models

```shell
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```

CLI — Key Commands

Basic obliteration (default method)

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct
```

Advanced method (whitened SVD + bias projection + iterative refinement)

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced
```
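The advanced method's whitening step addresses a failure mode of raw mean-difference extraction: high-variance features unrelated to refusal can dominate the estimate. A toy sketch of diagonal (per-feature) whitening; the toolkit's full whitened-SVD pipeline is more involved, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy hidden states: feature 3 is high-variance noise, feature 0 carries
# a small but consistent refusal offset
base = rng.normal(size=(256, 8))
base[:, 3] *= 20.0
refused = base[:128].copy()
refused[:, 0] += 1.0
complied = base[128:]

# Per-feature std over all samples, used to rescale before differencing
std = np.concatenate([refused, complied]).std(axis=0)

direction = (refused / std).mean(axis=0) - (complied / std).mean(axis=0)
direction /= np.linalg.norm(direction)

# After whitening, the planted refusal feature dominates the direction
print(int(np.argmax(np.abs(direction))))
```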

Analysis-informed pipeline (auto-configures from geometry analysis)

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed
```

Specify output directory and push to Hub

```shell
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
  --method advanced \
  --output ./my-liberated-model \
  --push-to-hub your-username/mistral-7b-liberated
```

LoRA-based reversible ablation (non-destructive)

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
  --method lora \
  --lora-rank 1
```
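Why rank 1 suffices: the projection edit W' = W - r(rᵀW) is itself a rank-1 update, so it can be stored as a LoRA adapter (factors B = -r, A = rᵀW) instead of overwriting the base weights; dropping the adapter restores the original model exactly. A hedged NumPy sketch of the idea, not the toolkit's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
W = rng.normal(size=(d, d))   # frozen base weight
r = rng.normal(size=d)
r /= np.linalg.norm(r)        # unit refusal direction

# Rank-1 LoRA factors implementing W' = W - r (r^T W)
B = -r.reshape(d, 1)          # (d, 1)
A = (r @ W).reshape(1, d)     # (1, d)

def forward(x, use_adapter=True):
    """Base matmul plus optional low-rank delta (W + B A) applied to x."""
    y = W @ x
    if use_adapter:
        y = y + B @ (A @ x)
    return y

x = rng.normal(size=d)
y_ablated = forward(x)                       # adapter on: no component along r
y_original = forward(x, use_adapter=False)   # adapter off: base model intact
print(abs(r @ y_ablated))                    # ~0
```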

Strength sweep — find the capability/compliance tradeoff

```shell
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
  --strengths 0.2,0.4,0.6,0.8,1.0
```
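What the sweep is tuning: with a partial projection W' = W - a * r(rᵀW), the component the layer can still write along the refusal direction scales as (1 - a). A toy NumPy illustration (not the toolkit's evaluation, which measures perplexity and refusal rate on real prompts):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 16))
r = rng.normal(size=16)
r /= np.linalg.norm(r)

base = np.linalg.norm(r @ W)  # component writable along r before ablation
fractions = []
for alpha in [0.2, 0.4, 0.6, 0.8, 1.0]:
    W_a = W - alpha * np.outer(r, r) @ W  # partial projection at strength alpha
    fractions.append(np.linalg.norm(r @ W_a) / base)

print([round(f, 2) for f in fractions])  # → [0.8, 0.6, 0.4, 0.2, 0.0]
```

Lower strengths perturb the weights less (better capability retention) but leave more of the refusal component intact, which is exactly the tradeoff the sweep maps out.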

Run analysis modules only (no modification)

```shell
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
  --modules concept_cone,alignment_imprint,universality
```

Benchmark: compare methods on a model

```shell
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
  --methods basic,advanced,informed
```

Launch local Gradio UI

```shell
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry
```

Python API

Basic obliteration

```python
from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved
```

Step-by-step pipeline

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,     # number of refusal directions to extract
    strength=1.0,          # projection strength (0.0–1.0+)
    preserve_norm=True,    # norm-preserving biprojection
    project_biases=True,   # also remove from bias terms
    iterative_passes=3,    # re-probe after each pass
    layers="auto",         # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)
obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)

# Individual stages
obl.summon()                           # load model + tokenizer
activations = obl.probe()              # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)  # extract refusal directions via SVD
obl.excise(directions)                 # project out guardrail directions
metrics = obl.verify()                 # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata
```
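The distill stage above is described as extracting refusal directions via SVD, with num_directions controlling how many. The linear-algebra core of that idea can be sketched as taking the top right-singular vectors of stacked per-prompt activation differences (toy data; variable names are hypothetical, not the toolkit's internals):

```python
import numpy as np

rng = np.random.default_rng(4)
hidden, n_pairs, k = 32, 200, 4

# Per-prompt-pair activation differences (restricted minus unrestricted)
diffs = rng.normal(scale=0.1, size=(n_pairs, hidden))
# Plant two strong shared components to stand in for refusal structure
diffs[:, 0] += 1.0
diffs[:, 1] += 0.5

# Top-k right-singular vectors give an orthonormal basis for the
# dominant shared (refusal-like) subspace
_, s, vt = np.linalg.svd(diffs, full_matrices=False)
directions = vt[:k]  # shape (k, hidden)

print(directions.shape)                                   # (4, 32)
print(np.allclose(directions @ directions.T, np.eye(k)))  # orthonormal: True
```

Excising then projects each of these k directions out of the weights, rather than just the single mean-difference direction.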

Custom probe prompts

```python
from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ],
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")
```

Analysis modules

```python
from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()
```

```python
# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")  # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")
```

Analysis-informed obliteration

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# The "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)
result = obl.obliterate()
print(result.analysis_report)  # full auto-configuration decisions
```
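The concept-cone module's "distinct refusal clusters" metric can be pictured as clustering candidate refusal directions by cosine similarity and counting the clusters. A greedy toy version, purely illustrative (the suite's actual geometry analysis is not described in this document):

```python
import numpy as np

rng = np.random.default_rng(6)

def unit(v):
    return v / np.linalg.norm(v)

# Candidate refusal directions: noisy copies of two underlying mechanisms
m1, m2 = unit(rng.normal(size=64)), unit(rng.normal(size=64))
candidates = [unit(m + 0.02 * rng.normal(size=64))
              for m in (m1, m1, m2, m1, m2, m2)]

# Greedy cosine clustering: start a new centroid only when no existing
# centroid is sufficiently similar
centroids = []
for c in candidates:
    sims = [abs(c @ z) for z in centroids]
    if not sims or max(sims) < 0.8:
        centroids.append(c)

print(len(centroids))  # distinct refusal mechanisms recovered from the toy data
```

A model with several well-separated clusters needs more extracted directions (or more passes) than one whose refusal behavior lives in a single narrow cone.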

Chat with obliterated model

```python
from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)
response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)
```

A/B comparison

```python
from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."
original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)
```

Push obliterated model to Hub

```python
import os

from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")
result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)
```

Obliteration Methods

| Method | Description | Best For |
| --- | --- | --- |
| basic | Mean-difference direction extraction, single pass | Quick experiments |
| advanced | Whitened SVD + bias projection + iterative refinement | Production use |
| informed | Analysis-guided auto-configuration | Unknown models |
| lora | Reversible LoRA rank-1 adapters (no weight surgery) | Reversible ablation |
| pca | PCA-based direction extraction | Research/comparison |
| sparse | Sparse autoencoder decomposition | MoE models |
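The methods in the table differ mainly in how directions are extracted; the excision step is shared. One detail worth seeing concretely is bias projection (part of the advanced method): even with cleaned weight matrices, a bias vector can re-introduce a refusal offset, so the same projector is applied to biases. A hedged NumPy sketch, illustrative rather than the toolkit's code:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 16
r = rng.normal(size=d)
r /= np.linalg.norm(r)

W = rng.normal(size=(d, d))
b = rng.normal(size=d)

P = np.eye(d) - np.outer(r, r)  # projector onto r's orthogonal complement
W_clean, b_clean = P @ W, P @ b

# Layer output y = Wx + b now has no component along r for any input x
x = rng.normal(size=d)
y = W_clean @ x + b_clean
print(abs(r @ y))  # ~0
```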

Configuration

```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",         # abliteration method
    strength=1.0,              # projection strength (tune down if capability degrades)
    num_directions=32,         # refusal directions to extract

    # Layer selection
    layers="auto",             # "auto", "cosmic", or list of ints
    layer_selection="cosmic",  # COSMIC: most separable layers

    # Weight modification
    preserve_norm=True,        # norm-preserving biprojection (recommended)
    project_biases=True,       # project out bias terms too
    project_attention=True,    # modify attention projection weights
    project_mlp=True,          # modify MLP weights

    # Iterative refinement
    iterative_passes=3,        # re-probe after each pass (catches rotated directions)

    # MoE-specific
    expert_granular=False,     # Expert-Granular Abliteration for MoE models

    # CoT preservation
    cot_aware=True,            # preserve chain-of-thought directions

    # Hardware
    dtype="bfloat16",          # "float32", "float16", "bfloat16"
    device="cuda",             # "cuda", "cpu", "auto"
    load_in_4bit=False,        # bitsandbytes 4-bit loading

    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)
```
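On preserve_norm: a plain projection strictly shrinks the weights it touches, which can degrade capability by itself. A minimal illustration using a single global rescale after projection (the toolkit's norm-preserving biprojection is presumably finer-grained; this only shows why rescaling does not undo the ablation):

```python
import numpy as np

rng = np.random.default_rng(8)
d = 16
r = rng.normal(size=d)
r /= np.linalg.norm(r)
W = rng.normal(size=(d, d))

W_proj = W - np.outer(r, r) @ W  # projection strictly shrinks ||W||
scale = np.linalg.norm(W) / np.linalg.norm(W_proj)
W_norm = W_proj * scale          # restore the overall weight scale

# A scalar rescale preserves the ablation: r^T W_norm is still ~0
print(np.abs(r @ W_norm).max())
print(round(np.linalg.norm(W_norm) / np.linalg.norm(W), 6))  # 1.0
```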

Common Patterns

Tune strength to preserve capability

```python
from obliteratus import Obliterator
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
```

MoE model (Mixtral, DeepSeek-MoE)

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,  # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)
obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")
```

Batch benchmark multiple models

```python
from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")
```

Troubleshooting

Out of memory (OOM) on large models

```python
config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,        # fewer directions
)
```

Capability degradation after obliteration

Lower the strength, or use COSMIC layer selection (most separable layers):

```python
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,      # protect reasoning directions
    iterative_passes=1,  # fewer passes = less aggressive
)
```

Refusal persists after obliteration

Use the informed method and increase the number of passes:

```python
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,  # don't forget bias terms
    num_directions=64,    # extract more directions
)
```

Gated model access error

Accept the model license on HuggingFace Hub first, then:

```shell
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```

Gradio UI won't start

```shell
pip install "obliteratus[spaces]"

# Check port availability
obliteratus ui --port 7861
```

No-Code Options

- HuggingFace Space: spaces/pliny-the-prompter/obliteratus — free with HF Pro, ZeroGPU
- Colab notebook: notebooks/abliterate.ipynb — run all cells, no setup

Key Research References

- Arditi et al. (2024) — arXiv:2406.11717 — foundational abliteration paper
- Gabliteration — arXiv:2512.18901
- COSMIC layer selection — arXiv:2506.00085, ACL 2025
- Turner et al. (2023) — arXiv:2308.10248 — activation steering
- Rimsky et al. (2024) — arXiv:2312.06681 — contrastive activation addition
