This is a submission for the AssemblyAI Voice Agents Challenge.
What I Built
I created a real-time voice-controlled spell casting system that transforms spoken Harry Potter spells into instant keyboard commands. This project addresses the Real-Time Performance category by achieving ultra-low latency voice recognition for gaming applications where every millisecond matters.
The system recognizes over 30 different spells (like "Lumos", "Wingardium Leviosa", "Stupefy") and instantly triggers the corresponding game actions through keyboard shortcuts. It uses fuzzy matching to tolerate pronunciation variations and acts on partial transcripts so a spell can fire before the utterance is finished, which suits immersive gaming experiences.
Demo
🎥 YouTube Demo Video - Watch the spell casting in action!
Key features demonstrated:
- ⚡ Sub-300ms response time from speech to action
- 🎯 Accurate recognition of complex spell names
- 🔄 Handles pronunciation variations and partial words
- 🛡️ Smart spam prevention for rapid casting
- 🎮 Seamless integration with game controls
GitHub Repository
⚡ Hogwarts Spell Caster
A real-time voice-controlled spell casting system that transforms spoken Harry Potter spells into instant keyboard commands using AssemblyAI's Ultra-Fast Universal-Streaming technology. Cast spells with your voice and watch them trigger game actions in under 300ms!
🎯 Features
- ⚡ Ultra-Low Latency: Sub-300ms response time from speech to action
- 🎭 30+ Harry Potter Spells: Complete spell repertoire from the wizarding world
- 🧠 Intelligent Recognition: Advanced fuzzy matching handles pronunciation variations
- 🚀 Partial Processing: Acts on incomplete words for instant response
- 🛡️ Spam Prevention: Smart cooldowns prevent accidental rapid-fire casting
- 🎮 Gaming Ready: Direct keyboard integration for seamless game control
- 🔧 Optimized Performance: Pre-computed variations and early-exit logic
🎬 Demo
Click to watch the magic in action!
🚀 Quick Start
Prerequisites
- Python 3.8 or higher
- Microphone access
- AssemblyAI API key (free tier includes $50 in credits)
Installation
- Clone the…
Technical Implementation & AssemblyAI Integration
Core Architecture
The system uses AssemblyAI's Universal-Streaming technology, with the turn-detection parameters tuned aggressively for minimal latency:
# Optimized streaming parameters for ultra-low latency
client.connect(
    StreamingParameters(
        sample_rate=16000,
        format_turns=True,
        # Aggressive turn detection for faster response
        end_of_turn_confidence_threshold=0.5,  # Lower threshold for faster detection
        min_end_of_turn_silence_when_confident=100,  # Reduced from 160ms
        max_turn_silence=1500,  # Reduced from 2400ms
    )
)
Real-Time Processing Innovation
The key innovation is dual-layer processing that handles both partial and complete transcripts:
def on_turn(self: Type[StreamingClient], event: TurnEvent):
    transcript = event.transcript
    is_partial = not event.end_of_turn
    if is_partial:
        # Process partial transcript for immediate response (min 4 characters)
        if len(transcript) >= 4:
            print(f"👂 Partial: {transcript}")
            if process_transcript(transcript, confidence_threshold=0.8, is_partial=True):
                print("✨ Spell cast from partial transcript!")
    else:
        # Process complete transcript with lower threshold
        print(f"🗣️ Complete: {transcript}")
        process_transcript(transcript, confidence_threshold=0.6, is_partial=False)
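The handler above calls `process_transcript`, which is not shown in this post. The following is a minimal sketch of what such a function might look like, not the project's actual implementation. The `SPELL_KEYS` bindings, the `press_key` callback, and the use of `difflib.SequenceMatcher` for scoring are all assumptions made for illustration. It shows why the two thresholds differ: a short partial such as "stup" scores below 0.8 and is ignored, while a slightly mispronounced complete transcript can still pass at 0.6.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# Hypothetical spell-to-key bindings; the project's real bindings are not shown in the post.
SPELL_KEYS: Dict[str, str] = {
    "lumos": "f",
    "stupefy": "q",
    "wingardium leviosa": "e",
}


def process_transcript(
    transcript: str,
    confidence_threshold: float,
    is_partial: bool,
    press_key: Callable[[str], None] = lambda key: None,
) -> bool:
    """Return True if a spell was matched and its key was triggered."""
    # Formatted turns may carry capitalization and trailing punctuation.
    text = transcript.lower().strip(" \t\n.,!?")
    if not text:
        return False

    # Score the text against every known spell and keep the best candidate.
    best_spell, best_score = max(
        ((spell, SequenceMatcher(None, text, spell).ratio()) for spell in SPELL_KEYS),
        key=lambda pair: pair[1],
    )
    if best_score < confidence_threshold:
        return False

    source = "partial" if is_partial else "complete"
    print(f"Casting {best_spell} from {source} transcript (score {best_score:.2f})")
    press_key(SPELL_KEYS[best_spell])
    return True
```

Passing the key press in as a callback keeps the sketch independent of any particular keyboard-automation library; in a real setup the callback would wrap whichever library sends the keystroke to the game.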
Intelligent Spell Matching
I implemented an optimized fuzzy matching system that prioritizes speed:
def optimized_fuzzy_match(text, spell_list, threshold=0.6):
    text = text.lower().strip()
    # First, try exact matches in pre-computed variations
    if text in SPELL_VARIATIONS:
        return SPELL_VARIATIONS[text]
    # Quick substring check for common patterns
    for spell in spell_list:
        if spell in text or text in spell:
            if len(text) >= len(spell) * 0.7:  # At least 70% of spell length
                return spell
    # Fallback to SequenceMatcher only when needed
    # ... fuzzy matching logic
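The fallback step is elided above. One way it could be written, as a sketch under assumptions rather than the project's actual code, is with `difflib.SequenceMatcher` from the standard library. The function name `fuzzy_match_with_fallback`, the sample `SPELLS` list, and the `SPELL_VARIATIONS` entries below are hypothetical. The sketch uses `real_quick_ratio()` and `quick_ratio()`, which are cheap upper bounds on `ratio()`, to skip spells that cannot reach the threshold before paying for the full comparison.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

# Hypothetical sample data for illustration only.
SPELLS: List[str] = ["lumos", "stupefy", "expelliarmus"]
SPELL_VARIATIONS: Dict[str, str] = {"loomos": "lumos", "stupify": "stupefy"}


def fuzzy_match_with_fallback(
    text: str, spell_list: List[str], threshold: float = 0.6
) -> Optional[str]:
    text = text.lower().strip()

    # Early exit: known variation, no similarity scoring needed.
    if text in SPELL_VARIATIONS:
        return SPELL_VARIATIONS[text]

    best_spell: Optional[str] = None
    best_ratio = 0.0
    for spell in spell_list:
        matcher = SequenceMatcher(None, text, spell)
        # Both quick ratios are upper bounds on ratio(), so a spell that
        # fails either check can never reach the threshold.
        if matcher.real_quick_ratio() < threshold or matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_spell, best_ratio = spell, ratio

    return best_spell if best_ratio >= threshold else None
```

With this sample data, `fuzzy_match_with_fallback("expeliarmus", SPELLS)` resolves to `"expelliarmus"` through the similarity fallback, while `"Loomos"` is resolved by the variation lookup without any scoring.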
Performance Optimizations
- Pre-computed Spell Variations: Common spell variations are cached for instant lookup
- Spam Prevention: Prevents accidental rapid-fire casting with time-based cooldowns
- Early Exit Logic: Avoids expensive fuzzy matching when exact matches are found
- Partial Processing: Acts on partial transcripts for sub-300ms response times
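The time-based cooldown mentioned in the list can be sketched as a small per-spell gate. This is an illustrative assumption, not the project's code: the `SpellCooldown` class, its `try_cast` method, and the 0.5-second default are all hypothetical. A gate like this matters when partial transcripts are processed, because the partial and the complete transcript of the same utterance can both match the same spell.

```python
import time
from typing import Callable, Dict


class SpellCooldown:
    """Reject repeat casts of the same spell inside a cooldown window."""

    def __init__(
        self,
        cooldown_seconds: float = 0.5,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_cast: Dict[str, float] = {}

    def try_cast(self, spell: str) -> bool:
        """Return True and record the cast if the spell is off cooldown."""
        now = self._clock()
        last = self._last_cast.get(spell)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down; a rejected attempt does not reset the timer
        self._last_cast[spell] = now
        return True
```

Using `time.monotonic` rather than wall-clock time avoids surprises if the system clock is adjusted mid-session, and tracking cooldowns per spell still allows different spells to be chained quickly.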
AssemblyAI Features Utilized
- Universal-Streaming: Core real-time transcription with about 300ms latency
- Turn Detection: Intelligent endpointing for natural speech flow
- High Accuracy: Handles complex fantasy terminology and pronunciation variations
- Partial Transcripts: Enables immediate response without waiting for complete utterances
Results
The system consistently achieves sub-300ms latency from speech input to game action, making spell casting feel truly magical and responsive. The combination of AssemblyAI's ultra-fast streaming with optimized processing creates an immersive gaming experience where voice commands feel as natural as pressing keys.
Perfect for Harry Potter games, VR experiences, or any application requiring instant voice command recognition! 🧙‍♂️✨