DevArt keeps this article discoverable at a fast, self-canonical URL and links clearly to the original DEV publication.

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

I created a real-time voice-controlled spell casting system that transforms spoken Harry Potter spells into instant keyboard commands. This project addresses the Real-Time Performance category by achieving ultra-low latency voice recognition for gaming applications where every millisecond matters.

The system recognizes over 30 different spells (like "Lumos", "Wingardium Leviosa", "Stupefy") and instantly triggers corresponding game actions through keyboard shortcuts. It features advanced fuzzy matching to handle pronunciation variations and partial transcript processing for immediate response - perfect for immersive gaming experiences.

Demo

🎥 YouTube Demo Video - Watch the spell casting in action!

Key features demonstrated:

⚡ Sub-300ms response time from speech to action
🎯 Accurate recognition of complex spell names
🔄 Handles pronunciation variations and partial words
🛡️ Smart spam prevention for rapid casting
🎮 Seamless integration with game controls

GitHub Repository

turazashvili / Hogwards-Legacy-Cast-with-Voice

⚡ Hogwarts Spell Caster

A real-time voice-controlled spell casting system that transforms spoken Harry Potter spells into instant keyboard commands using AssemblyAI's Ultra-Fast Universal-Streaming technology. Cast spells with your voice and watch them trigger game actions in under 300ms!

🎯 Features

⚡ Ultra-Low Latency: Sub-300ms response time from speech to action
🎭 30+ Harry Potter Spells: Complete spell repertoire from the wizarding world
🧠 Intelligent Recognition: Advanced fuzzy matching handles pronunciation variations
🚀 Partial Processing: Acts on incomplete words for instant response
🛡️ Spam Prevention: Smart cooldowns prevent accidental rapid-fire casting
🎮 Gaming Ready: Direct keyboard integration for seamless game control
🔧 Optimized Performance: Pre-computed variations and early-exit logic

🎬 Demo

Hogwarts Spell Caster Demo

Click to watch the magic in action!

🚀 Quick Start

Prerequisites

Python 3.8 or higher
Microphone access
AssemblyAI API key (free tier includes $50 credits)

Installation

Clone the…

View on GitHub

Technical Implementation & AssemblyAI Integration

Core Architecture

The system leverages AssemblyAI's Universal-Streaming technology with aggressive optimization for minimal latency:

# Optimized streaming parameters for ultra-low latency
client.connect(
    StreamingParameters(
        sample_rate=16000,
        format_turns=True,
        # Aggressive turn detection for faster response
        end_of_turn_confidence_threshold=0.5,  # Lower threshold for faster detection
        min_end_of_turn_silence_when_confident=100,  # Reduced from 160ms
        max_turn_silence=1500,  # Reduced from 2400ms
    )
)

Real-Time Processing Innovation

The key innovation is dual-layer processing that handles both partial and complete transcripts:

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    transcript = event.transcript
    is_partial = not event.end_of_turn

    if is_partial:
        # Process partial transcript for immediate response (min 4 characters)
        if len(transcript) >= 4:
            print(f"👂 Partial: {transcript}")
            if process_transcript(transcript, confidence_threshold=0.8, is_partial=True):
                print("✨ Spell cast from partial transcript!")
    else:
        # Process complete transcript with lower threshold
        print(f"🗣️ Complete: {transcript}")
        process_transcript(transcript, confidence_threshold=0.6, is_partial=False)

Intelligent Spell Matching

I implemented an optimized fuzzy matching system that prioritizes speed:

def optimized_fuzzy_match(text, spell_list, threshold=0.6):
    text = text.lower().strip()

    # First, try exact matches in pre-computed variations
    if text in SPELL_VARIATIONS:
        return SPELL_VARIATIONS[text]

    # Quick substring check for common patterns
    for spell in spell_list:
        if spell in text or text in spell:
            if len(text) >= len(spell) * 0.7:  # At least 70% of spell length
                return spell

    # Fallback to SequenceMatcher only when needed
    # ... fuzzy matching logic

Performance Optimizations

Pre-computed Spell Variations: Common spell variations are cached for instant lookup
Spam Prevention: Prevents accidental rapid-fire casting with time-based cooldowns
Early Exit Logic: Avoids expensive fuzzy matching when exact matches are found
Partial Processing: Acts on partial transcripts for sub-300ms response times

AssemblyAI Features Utilized

Universal-Streaming: Core real-time transcription with 300ms latency
Turn Detection: Intelligent endpointing for natural speech flow
High Accuracy: Handles complex fantasy terminology and pronunciation variations
Partial Transcripts: Enables immediate response without waiting for complete utterances

Results

The system consistently achieves sub-300ms latency from speech input to game action, making spell casting feel truly magical and responsive. The combination of AssemblyAI's ultra-fast streaming with optimized processing creates an immersive gaming experience where voice commands feel as natural as pressing keys.

Perfect for Harry Potter games, VR experiences, or any application requiring instant voice command recognition! 🧙‍♂️✨