Transcription Automation

Comprehensive skill for automating audio/video transcription and content processing.

Core Workflows

1. Transcription Pipeline

    TRANSCRIPTION FLOW:
    ┌─────────────────┐
    │ Audio/Video     │
    │ Input           │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │ Pre-Processing  │
    │ - Convert       │
    │ - Enhance       │
    │ - Split         │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │ Transcription   │
    │ - STT Engine    │
    │ - Diarization   │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │ Post-Processing │
    │ - Format        │
    │ - Timestamps    │
    │ - Speakers      │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │ Output          │
    │ - Text/SRT/VTT  │
    │ - Summary       │
    └─────────────────┘

2. Transcription Configuration

    transcription_config:
      engine: whisper  # whisper, assembly_ai, deepgram
      audio_settings:
        sample_rate: 16000
        channels: mono
        format: wav
      transcription:
        language: auto  # or specific: en, zh, es
        model: large  # tiny, base, small, medium, large
        task: transcribe  # transcribe or translate
      features:
        speaker_diarization: true
        word_timestamps: true
        punctuation: true
        profanity_filter: false
      output:
        formats:
          - txt
          - srt
          - vtt
          - json
        include_confidence: true
        include_timestamps: true

Meeting Transcription

Meeting Notes Template

    meeting_transcript:
      metadata:
        title: "{{meeting_title}}"
        date: "{{date}}"
        duration: "{{duration}}"
        attendees: "{{speakers}}"
      output_template: |
        # {{title}}

        Date: {{date}}
        Duration: {{duration}}
        Attendees: {{attendees}}

        ## Summary
        {{ai_summary}}

        ## Key Points
        {{#each key_points}}
        - {{this}}
        {{/each}}

        ## Action Items
        {{#each action_items}}
        - [ ] {{task}} - @{{assignee}} - Due: {{due_date}}
        {{/each}}

        ## Full Transcript
        {{#each segments}}
        **[{{timestamp}}] {{speaker}}:** {{text}}
        {{/each}}

Speaker Diarization

    diarization_config:
      min_speakers: 2
      max_speakers: 10
      speaker_labels:
        - name: "Speaker 1"
          voice_sample: "sample_1.wav"  # Optional
        - name: "Speaker 2"
          voice_sample: "sample_2.wav"
      output_format:
        speaker_prefix: true
        speaker_timestamps: true
      example_output: |
        [00:00:05] SPEAKER_1: Welcome everyone to today's meeting.
        [00:00:12] SPEAKER_2: Thanks for having us.
        [00:00:18] SPEAKER_1: Let's start with the agenda.

Subtitle Generation

SRT Format

    subtitle_config:
      format: srt
      timing:
        max_duration: 7  # seconds per subtitle
        min_gap: 0.1  # seconds between subtitles
        chars_per_line: 42
        max_lines: 2
      style:
        case: sentence  # sentence, upper, lower
        numbers: words  # words, digits
      example_output: |
        1
        00:00:05,000 --> 00:00:08,500
        Welcome to today's presentation
        about transcription automation.

        2
        00:00:09,000 --> 00:00:12,000
        Let me start by explaining
        the basic concepts.

VTT Format

    vtt_config:
      format: vtt
      features:
        cue_settings: true
        styling: true
      example_output: |
        WEBVTT

        00:00:05.000 --> 00:00:08.500 align:center
        Welcome to today's presentation
        about transcription automation.

        00:00:09.000 --> 00:00:12.000 align:center
        <v Speaker 1>Let me start by explaining
        the basic concepts.

Integration Workflows

Zoom Integration

    zoom_transcription:
      trigger:
        event: recording_completed
      workflow:
        - step: download_recording
          source: zoom_cloud
        - step: transcribe
          engine: whisper
          language: auto
        - step: diarize
          identify_speakers: true
        - step: generate_notes
          template: meeting_notes
          include_summary: true
          extract_action_items: true
        - step: distribute
          destinations:
            - notion_page
            - slack_channel
            - email_attendees

YouTube Integration

    youtube_subtitles:
      trigger:
        event: video_uploaded
      workflow:
        - step: download_audio
          source: youtube_video
        - step: transcribe
          engine: whisper
          task: transcribe
        - step: generate_subtitles
          formats: [srt, vtt]
        - step: translate
          target_languages: [es, zh, ja, de, fr]
        - step: upload_subtitles
          destination: youtube
          as_cc: true

Podcast Processing

    podcast_workflow:
      input:
        source: rss_feed
        format: audio/mp3
      processing:
        - transcribe:
            engine: whisper
            model: large
        - generate_chapters:
            detect_topics: true
            min_duration: 60  # seconds
        - create_show_notes:
            summarize: true
            extract_links: true
            highlight_quotes: true
        - create_searchable_index:
            full_text: true
            timestamps: true
      output:
        - transcript_txt
        - chapters_json
        - show_notes_md
        - search_index

Language Support

Multi-Language Transcription

    multilingual:
      auto_detect: true
      supported_languages:
        - code: en
          name: English
          model: large
        - code: zh
          name: Chinese
          model: large
        - code: es
          name: Spanish
          model: large
        - code: ja
          name: Japanese
          model: medium
      translation:
        enabled: true
        target: en
        preserve_original: true

Code-Switching

    code_switching:
      enabled: true
      primary_language: en
      secondary_languages: [zh, es]
      output: |
        [00:01:23] The next topic is about 人工智能, which has been muy importante in recent years.
      handling:
        detect_language_per_segment: true
        tag_language_switches: true

Quality Enhancement

Post-Processing

    post_processing:
      text_cleanup:
        - remove_filler_words: ["um", "uh", "like"]
        - fix_common_errors: true
        - normalize_numbers: true
      formatting:
        - add_punctuation: true
        - capitalize_sentences: true
        - paragraph_breaks: true
      speaker_attribution:
        - merge_short_segments: true
        - min_segment_duration: 1.0
      output_enhancement:
        - add_timestamps: true
        - highlight_keywords: true
        - generate_summary: true

Accuracy Metrics

    TRANSCRIPTION QUALITY REPORT
    ═══════════════════════════════════════
    File: meeting_2024_01_15.mp3
    Duration: 45:32
    Engine: Whisper Large

    METRICS:
      Word Error Rate (WER): 4.2%
      Character Error Rate: 2.8%
      Confidence Score: 0.94

    SPEAKER DIARIZATION:
      Speakers Detected: 4
      Diarization Accuracy: 91%

    PROCESSING TIME:
      Total: 8m 23s
      Real-time Factor: 0.18x

    DETECTED ISSUES:
      • Low confidence at 12:34 (background noise)
      • Overlapping speech at 23:45
      • Unknown speaker at 34:12

API Examples

OpenAI Whisper

    from openai import OpenAI  # requires openai>=1.0

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Transcribe audio
    with open("meeting.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
        )

    # Access results (segment start times are in seconds)
    for segment in transcript.segments:
        print(f"[{segment.start:.2f}] {segment.text}")
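
The segment timings returned by a transcription call can be turned into SRT cues like the ones shown under Subtitle Generation. The following is a minimal sketch, not a definitive implementation: `Segment` is a hypothetical stand-in for any object exposing `start`, `end` (both assumed to be in seconds), and `text`, and the helper names `srt_timestamp` and `segments_to_srt` are illustrative only. It does not enforce `chars_per_line` or `max_duration`.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        Segment(5.0, 8.5, "Welcome to today's presentation."),
        Segment(9.0, 12.0, "Let me start by explaining the basic concepts."),
    ]
    print(segments_to_srt(demo))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids floating-point drift in the formatted fields.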

AssemblyAI

    import assemblyai as aai

    # Set aai.settings.api_key before running
    transcriber = aai.Transcriber()
    config = aai.TranscriptionConfig(
        speaker_labels=True,
        auto_chapters=True,
        entity_detection=True,
    )

    transcript = transcriber.transcribe(
        "https://example.com/meeting.mp3",
        config=config,
    )

    # Print each diarized utterance with its speaker label
    for utterance in transcript.utterances:
        print(f"Speaker {utterance.speaker}: {utterance.text}")
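
Diarized utterances can be rendered in the `[HH:MM:SS] SPEAKER_N: text` layout shown in the Speaker Diarization example output. The sketch below is one possible approach under stated assumptions: `Utterance` is a hypothetical container (not an SDK class) whose `start_ms` field is assumed to hold the start time in milliseconds, and merging consecutive turns from the same speaker is an illustrative choice rather than something the configuration above prescribes.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical stand-in for one diarized utterance."""
    start_ms: int  # assumed: start time in milliseconds
    speaker: str
    text: str


def clock(ms: int) -> str:
    """Format milliseconds as HH:MM:SS (sub-second part is dropped)."""
    total_seconds = ms // 1000
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def merge_consecutive(utterances):
    """Merge neighbouring utterances spoken by the same speaker."""
    merged = []
    for utt in utterances:
        if merged and merged[-1].speaker == utt.speaker:
            last = merged[-1]
            # Keep the earlier start time and join the text with a space.
            merged[-1] = Utterance(last.start_ms, last.speaker, f"{last.text} {utt.text}")
        else:
            merged.append(utt)
    return merged


def format_transcript(utterances) -> str:
    """Return one '[HH:MM:SS] SPEAKER_X: text' line per merged turn."""
    return "\n".join(
        f"[{clock(u.start_ms)}] SPEAKER_{u.speaker}: {u.text}"
        for u in merge_consecutive(utterances)
    )


if __name__ == "__main__":
    turns = [
        Utterance(5000, "1", "Welcome everyone to today's meeting."),
        Utterance(12000, "2", "Thanks for having us."),
        Utterance(18000, "1", "Let's start with the agenda."),
    ]
    print(format_transcript(turns))
```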

Best Practices

- Quality Audio: Clean input = better output
- Choose Right Model: Balance speed vs accuracy
- Use Diarization: Identify speakers clearly
- Post-Process: Clean up automated output
- Verify Critical Content: Human review is important
- Consider Privacy: Handle sensitive content
- Store Efficiently: Compress and index
- Provide Context: Vocabulary hints help
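
When verifying critical content against a human-reviewed reference, the Word Error Rate listed in the quality report can be estimated as the word-level edit distance divided by the number of reference words. The sketch below assumes simple lowercasing and whitespace tokenization; the function name `word_error_rate` is illustrative, and real evaluations usually also normalize punctuation and numbers first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "let us start with the agenda"
    hypothesis = "let us start with agenda"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.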