audiocraft-audio-generation


Install

```shell
npx skills add https://github.com/davila7/claude-code-templates --skill audiocraft-audio-generation
```

AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft

Use AudioCraft when:

- You need to generate music from text descriptions
- You are creating sound effects and environmental audio
- You are building music generation applications
- You need melody-conditioned music generation
- You want stereo audio output
- You require controllable music generation with style transfer

Key features:

- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound-effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectrogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics

Quick start

Installation

```shell
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

Using HuggingFace Transformers

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
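A rough way to relate `max_new_tokens` to output length, assuming MusicGen emits about 50 audio-token frames per second (a figure from the model card; treat the rate as an assumption, not an API guarantee):

```python
# Rough duration estimate for a given max_new_tokens budget, assuming
# ~50 audio-token frames per second (an assumption from the model card).
def approx_duration_s(max_new_tokens, frames_per_second=50):
    return max_new_tokens / frames_per_second

print(approx_duration_s(256))  # 5.12, so max_new_tokens=256 yields ~5 s of audio
```

This is why the quick-start clip above is short; raise `max_new_tokens` for longer output.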

Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```

Core concepts

Architecture overview

```
AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                     Text Encoder (T5)                        │
│                     Text Embeddings                          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                Transformer Decoder (LM)                      │
│          Auto-regressively generates audio tokens            │
│          using efficient token interleaving patterns         │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  EnCodec Audio Decoder                       │
│           Converts tokens back to audio waveform             │
└──────────────────────────────────────────────────────────────┘
```
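The token interleaving can be illustrated with the "delay" pattern from the MusicGen paper: codebook k is shifted right by k steps, so a single autoregressive step advances every codebook in parallel instead of needing K steps per audio frame. A toy sketch for intuition only (audiocraft's actual implementation lives in its codebook pattern providers):

```python
# Toy sketch of the "delay" interleaving pattern: codebook k is offset
# by k steps, filling the rest of the grid with a pad token.
def delay_interleave(frames, num_codebooks=4, pad=-1):
    """frames: list of per-frame token lists, each of length num_codebooks.
    Returns a [num_codebooks x (T + K - 1)] grid of tokens."""
    steps = len(frames) + num_codebooks - 1
    grid = [[pad] * steps for _ in range(num_codebooks)]
    for t, frame in enumerate(frames):
        for k in range(num_codebooks):
            grid[k][t + k] = frame[k]
    return grid

# Two frames of 4 codebook tokens each:
grid = delay_interleave([[1, 2, 3, 4], [5, 6, 7, 8]])
print(grid[0])  # [1, 5, -1, -1, -1]  codebook 0 starts immediately
print(grid[3])  # [-1, -1, -1, 4, 8]  codebook 3 lags by 3 steps
```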

Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody quality |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |

Generation parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance |

MusicGen usage

Text-to-music generation

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

Melody-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

Stereo generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```
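The printed shape can be sanity-checked by hand: sample count = duration (s) × sample rate (Hz), which is where the 480,000 in the comment above comes from:

```python
# Sanity check: generated sample count = duration (s) * sample rate (Hz).
def expected_samples(duration_s, sample_rate=32000):
    return int(duration_s * sample_rate)

print(expected_samples(15))  # 480000, matching the [1, 2, 480000] shape above
```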

Audio continuation

```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```

MusicGen-Style usage

Style-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,           # RVQ quantizers (1-6)
    excerpt_length=3.0  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

Style-only generation (no text)

```python
# Generate matching the reference style without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only generation
)

wav = model.generate_with_style([None], style_audio, sr)
```

AudioGen usage

Sound effect generation

```python
import torchaudio
from audiocraft.models import AudioGen

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

EnCodec usage

Audio compression

```python
import torch
import torchaudio
from audiocraft.models import CompressionModel

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
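As a back-of-the-envelope check on why these token sequences are so compact: an RVQ codec's bitrate is frame rate × number of codebooks × bits per code. The figures used below (50 Hz frames, 4 codebooks of 2048 entries for the 32 kHz checkpoint) come from the MusicGen paper and are stated here as assumptions, not values read from the library:

```python
import math

# Back-of-the-envelope RVQ bitrate: frame rate x codebooks x bits per code.
# 50 Hz frames, 4 codebooks of 2048 entries are the MusicGen paper's
# figures for the 32 kHz EnCodec checkpoint (assumed, not queried).
def rvq_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

print(rvq_bitrate_bps(50, 4, 2048))  # 2200.0 bits/s, i.e. about 2.2 kbps
```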

Common workflows

Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```

Workflow 2: Sound design batch processing

```python
from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```

Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```

Performance optimization

Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```
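Building on the "use a smaller model" advice, a hypothetical helper that picks the largest MusicGen checkpoint fitting in available VRAM. The names and GB figures mirror this guide's GPU memory requirements table; they are rough estimates, not values queried from the hardware:

```python
# Hypothetical helper: pick the largest MusicGen checkpoint that fits the
# available VRAM, using the rough GB estimates from this guide's GPU
# memory table (estimates only; actual usage varies with duration/batch).
VRAM_GB = {
    "facebook/musicgen-small":  {"fp32": 4,  "fp16": 2},
    "facebook/musicgen-medium": {"fp32": 8,  "fp16": 4},
    "facebook/musicgen-large":  {"fp32": 16, "fp16": 8},
}

def pick_model(available_gb, precision="fp16"):
    """Return the largest model whose estimated footprint fits, or None."""
    best, best_need = None, -1
    for name, need in VRAM_GB.items():
        gb = need[precision]
        if gb <= available_gb and gb > best_need:
            best, best_need = name, gb
    return best

print(pick_model(6))           # facebook/musicgen-medium
print(pick_model(20, "fp32"))  # facebook/musicgen-large
```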

Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of:
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
```

GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|-------|-----------|-----------|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |

Common issues

| Issue | Solution |
|-------|----------|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase cfg_coef, write better prompts |
| Generation too short | Check the max duration setting |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant |

References

- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions

Resources

- GitHub: https://github.com/facebookresearch/audiocraft
- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small
- Demo: https://huggingface.co/spaces/facebook/MusicGen
