Luis Montes
Founder, Iceddev
Introduction - who you are, what you do, why you're passionate about karaoke tech
First Song Played from Physical Media
1877 - Thomas Edison records on tin foil cylinder
"Mary Had a Little Lamb"
Fun opener - the very first "playback" was Edison singing a nursery rhyme.
Joke: "Someone probably sang along" - the first karaoke moment!
We've come a long way from tin foil to AI stems.
The Rise of Karaoke
1971 - Kobe, Japan
Daisuke Inoue invents karaoke ("empty orchestra")
1980s
Spreads globally; karaoke boxes become cultural phenomenon
1990s
Karaoke bars boom in US and Europe...
Inoue was a drummer who couldn't read music. Businessmen would ask him to back their singing.
He built a machine so they could sing without him.
Never patented it - considered one of the greatest missed opportunities in business history.
By the 90s, karaoke was a global phenomenon. The format wars began...
CD+G
A star is born
What is CD+G?
CD+Graphics - Philips/Sony, 1986
Standard audio CD with graphics embedded
Backwards compatible with regular CD players
Still extremely popular for karaoke today!
An extension of the Red Book CD audio standard.
The fact that a 1986 format is still dominant says a lot about the karaoke industry.
97.73% Audio
2.27% Graphics
2.27% of each frame (1/33 bytes × 6/8 channels)
Visual representation of how little space CD+G graphics get.
1/33 bytes per frame × 6/8 channels = 2.27% for graphics!
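The arithmetic on this slide can be checked in a few lines. This is only a worked version of the 1/33 × 6/8 figure above; the constant names are made up for illustration.

```python
from fractions import Fraction

# Figures taken from the slide: 1 subcode byte out of 33 bytes per frame,
# and 6 of the 8 subcode channels carry graphics.
BYTES_PER_FRAME = 33
SUBCODE_BYTES = 1
GRAPHICS_CHANNELS = 6
SUBCODE_CHANNELS = 8

graphics_share = (
    Fraction(SUBCODE_BYTES, BYTES_PER_FRAME)
    * Fraction(GRAPHICS_CHANNELS, SUBCODE_CHANNELS)
)
remaining_share = 1 - graphics_share

print(f"graphics:        {float(graphics_share):.2%}")
print(f"everything else: {float(remaining_share):.2%}")
```

The exact share works out to 1/44 of each frame.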
26.5 kbit/s
Less than a dial-up modem.
Let that sink in - you have dial-up modem bandwidth to display graphics.
This is why CD+G looks like it does!
Why Does This Look Like an Atari?
             CD+G (1986)    Atari 2600 (1977)
Resolution   288 × 192      160 × 192
Colors       16 of 4,096    128 total
Rendering    6×12 tiles     "Racing the beam"
9 years newer, same visual era!
Different constraints, similar results.
Atari was limited by RAM cost. CD+G was limited by subcode bandwidth.
Both forced to use clever tricks to display anything.
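To make the table's numbers concrete, here is a small sketch of what they imply: how many 6×12 tiles cover a 288 × 192 screen, and how a color from a 4,096-color space (4 bits per channel) expands to 8 bits per channel. The function names and the 0xRGB packing are assumptions for illustration, not how the disc actually stores palette entries.

```python
SCREEN_W, SCREEN_H = 288, 192   # resolution from the table
TILE_W, TILE_H = 6, 12          # tile size from the table


def tile_grid():
    """Return (columns, rows) of tiles needed to cover the screen."""
    return SCREEN_W // TILE_W, SCREEN_H // TILE_H


def expand_color(rgb12):
    """Expand a 12-bit color (assumed packed as 0xRGB) to an 8-bit RGB tuple."""
    r = (rgb12 >> 8) & 0xF
    g = (rgb12 >> 4) & 0xF
    b = rgb12 & 0xF
    # Multiplying by 17 maps 0..15 onto 0..255 exactly.
    return (r * 17, g * 17, b * 17)


cols, rows = tile_grid()
print(f"{cols} x {rows} tiles = {cols * rows} tiles per screen")
print("4-bit channels give", 16 ** 3, "possible colors")
```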
Why CD+G Won't Die
Massive existing library (100,000+ songs)
Professional publishers (Sound Choice, Sunfly, Chartbuster)
Reasonable file sizes (MP3 ~4MB + CDG ~4MB)
Every KJ has a CD+G collection
But it's 2026... can we do better?
Making your own CD+G is a pain: you need an instrumental track (studio re-recording or vocal removal), lyrics, and manual sync.
Tools exist (Karaoke Builder Studio, Power Karaoke) but cost $100+.
Most content comes from pro publishers who re-recorded songs.
NOT the original recordings - cover versions!
And now the AI stuff
From Karaoke to DJing?
The DJ's Dream (Stems)
🥁 🎸 🎹 🎤
DJs wanted to isolate parts of songs
Vocals for mashups and remixes
Drums for beat matching
Previously: expensive studio stems or nothing
For decades, getting stems meant either being the artist or paying big money.
AI Source Separation
2015: Early neural network attempts
2019: Spleeter (Deezer) - first practical solution
2021: Demucs (Facebook/Meta)
2024+: Stem separation in DJ software
The quality improved dramatically year over year.
Now it's good enough for professional use.
Demucs
Meta's Audio Source Separation
What is Demucs?
Open source (MIT license)
State-of-the-art source separation
Runs on PyTorch
Trained on huge dataset of music
Facebook AI Research project.
Multiple versions, each better than the last.
Demucs in Action
$ python -m demucs song.wav
# Output: separated/htdemucs/song/
#   vocals.wav
#   drums.wav
#   bass.wav
#   other.wav
Python-based, uses PyTorch under the hood.
4 stems = full control for mixing.
Needs ~7GB GPU RAM, or use -d cpu (slower).
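If you would rather drive this from a script than the shell, one option is to wrap the same command. A minimal sketch, assuming Demucs is installed in the current Python environment and writes to the separated/htdemucs/ layout shown above; the helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

STEMS = ("vocals", "drums", "bass", "other")


def demucs_command(song, device=None):
    """Build the same command as on the slide, optionally with -d <device>."""
    cmd = [sys.executable, "-m", "demucs"]
    if device:
        cmd += ["-d", device]
    cmd.append(str(song))
    return cmd


def expected_stems(song, model="htdemucs", out_dir="separated"):
    """Where each stem should land, assuming the default output layout."""
    folder = Path(out_dir) / model / Path(song).stem
    return {name: folder / f"{name}.wav" for name in STEMS}


def separate(song, device=None):
    """Run Demucs (this is the slow part) and return the stem paths."""
    subprocess.run(demucs_command(song, device), check=True)
    return expected_stems(song)


if __name__ == "__main__":
    print(" ".join(demucs_command("song.wav", device="cpu")))
```

Calling separate("song.wav", device="cpu") trades speed for not needing a GPU.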
MP3 Files
MP4 Stems
🎬
The New Standard?
Stems Structure
song.stem.mp4
├── Track 0: Master mix - plays in normal players
├── Track 1: Drums
├── Track 2: Bass
├── Track 3: Other (keys, guitars, etc.)
├── Track 4: Vocals
└── Metadata: atoms
Typically 4-5 stems plus the mixed track.
Some tools support more stems.
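The track layout above can be approximated with ffmpeg by muxing five audio inputs into one MP4. A rough sketch that only builds the command; the file names are placeholders, AAC is an assumed codec choice, and plain ffmpeg muxing like this does not write the Stems metadata atom, so treat it as the audio half only.

```python
def stem_mux_command(master, stems, output):
    """Build an ffmpeg command: track 0 is the master mix, then one track per stem."""
    inputs = [master, *stems]
    cmd = ["ffmpeg", "-y"]
    for path in inputs:
        cmd += ["-i", path]
    # Map the audio of every input, in order, so track numbers match the layout.
    for index in range(len(inputs)):
        cmd += ["-map", f"{index}:a"]
    cmd += ["-c:a", "aac", output]
    return cmd


if __name__ == "__main__":
    command = stem_mux_command(
        "song.wav",
        ["drums.wav", "bass.wav", "other.wav", "vocals.wav"],
        "song.stem.mp4",
    )
    print(" ".join(command))
```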
Why MP4 Stems for Karaoke?
Real original audio - not MIDI recreation!
Control vocal volume (or mute entirely)
Practice with just vocals + one instrument
This is where it gets interesting.
We can CREATE stems from any song.
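Muting or turning down the vocals is just a weighted sum of the stems. A toy sketch with plain Python lists standing in for decoded audio samples in the range -1.0 to 1.0; a real player would do this on audio buffers in real time, and the function name is made up.

```python
def mix_stems(stems, gains):
    """Mix stems sample by sample; stems missing from `gains` play at full volume."""
    length = min(len(samples) for samples in stems.values())
    mixed = []
    for i in range(length):
        value = sum(
            gains.get(name, 1.0) * samples[i] for name, samples in stems.items()
        )
        # Clamp so the sum of loud stems cannot leave the valid sample range.
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


stems = {"vocals": [0.5, 0.5], "drums": [0.25, -0.25]}
karaoke_mix = mix_stems(stems, {"vocals": 0.0})   # vocals muted
practice_mix = mix_stems(stems, {"drums": 0.5})   # quieter drums
```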
But Wait...
We have the music separated.
What about the lyrics?
Whisper
OpenAI's Speech Recognition
What is Whisper?
Open source speech recognition
Trained on 680,000 hours of audio
Multilingual (99 languages)
Timestamp generation!
The timestamp feature is crucial for karaoke.
Whisper for Lyrics
{
"text": "Never gonna give you up",
"start": 43.52,
"end": 45.84
}
Feed it the isolated vocals from Demucs
Get word-level timestamps
Embed directly into MP4 stems file
Process: Full song → Demucs → Vocals → Whisper → Timed lyrics
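A minimal sketch of that last hop, assuming the openai-whisper Python package is installed. The "base" model size is an arbitrary choice, and the LRC-style line format is only one illustrative way to print timed lyrics, not the format embedded in the stems file.

```python
def transcribe_vocals(vocals_path, model_name="base"):
    """Run Whisper on the isolated vocals and return its segments."""
    import whisper  # imported lazily so the formatting helper works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(vocals_path, word_timestamps=True)
    return result["segments"]


def to_lrc_line(segment):
    """Format one segment like the JSON above as '[mm:ss.cc]text'."""
    centiseconds = round(segment["start"] * 100)
    minutes, rest = divmod(centiseconds, 6000)
    seconds, hundredths = divmod(rest, 100)
    return f"[{minutes:02d}:{seconds:02d}.{hundredths:02d}]{segment['text'].strip()}"


example = {"text": "Never gonna give you up", "start": 43.52, "end": 45.84}
print(to_lrc_line(example))
```

With word_timestamps=True, each segment should also carry a list of per-word timings, which is what a karaoke highlighter needs.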
The Pipeline
┌───────────────┐ ┌─────────────┐ ┌─────────────┐
│ Any Song │ ──► │ Demucs │ ──► │ Whisper │
│ (mp3, flac, │ │ (Stems) │ │ (Lyrics) │
│ ogg, wav) │ │ │ │ │
└───────────────┘ └─────────────┘ └──────┬──────┘
│
▼
┌─────────────────────┐
│ MP4 Stems File │
│ with synced lyrics│
└─────────────────────┘
Fully automated pipeline.
Any song can become karaoke.
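In code, the diagram is three function calls in a row. A sketch of the wiring only: the three stage functions are passed in as parameters and are hypothetical stand-ins for Demucs, Whisper, and whatever writes the final file.

```python
def make_karaoke(song, separate, transcribe, package):
    """Song -> stems -> timed lyrics -> packaged file, as in the diagram."""
    stems = separate(song)                  # Demucs stage: returns {stem name: path}
    lyrics = transcribe(stems["vocals"])    # Whisper stage: only sees the vocals
    return package(song, stems, lyrics)     # final stage: stems file with lyrics


if __name__ == "__main__":
    # Dummy stages so the flow can be seen without running any models.
    result = make_karaoke(
        "song.mp3",
        separate=lambda song: {"vocals": "vocals.wav", "drums": "drums.wav"},
        transcribe=lambda vocals: [{"text": "la la la", "start": 0.0, "end": 1.0}],
        package=lambda song, stems, lyrics: "song.stem.mp4",
    )
    print(result)
```

Passing the stages in keeps each one swappable, for example a different separation model or a manual lyrics editor.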
Whisper Challenges
Not always perfect transcription
Timing can be slightly off
Struggles with some vocal styles
Trained on speech, not music
... insurmountable?
LLMs
Clankers to the rescue!
LLMs for Lyrics
Fix Whisper transcription errors
Look up actual lyrics and align
Keep lyric timing accurate
LLMs can cross-reference with known lyrics databases.
Example Corrections
Whisper: "Excuse me while I kiss this guy"
LLM: "Excuse me while I kiss the sky"
Whisper: "Hold me closer, Tony Danza"
LLM: "Hold me closer, tiny dancer"
Whisper: "I feel stupid and contagious"
LLM: "...actually that's correct" 🤷
Classic misheard lyrics that LLMs can fix.
The Nirvana one is real - sometimes the weird transcription IS the lyric!
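Once the correct lyrics are known, the trick is swapping the words while keeping Whisper's timings. The sketch below does that step with Python's difflib rather than an LLM, so it is a stand-in for the idea, not the correction method described in this talk; the word dictionaries mimic Whisper-style output and the function name is made up.

```python
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and drop surrounding punctuation so 'closer,' matches 'closer'."""
    return word.strip().strip(".,!?").lower()


def align_lyrics(whisper_words, reference_text):
    """Replace misheard words with reference words, keeping the original timestamps."""
    known = reference_text.split()
    heard_norm = [normalize(w["word"]) for w in whisper_words]
    known_norm = [normalize(w) for w in known]
    matcher = SequenceMatcher(a=heard_norm, b=known_norm, autojunk=False)
    corrected = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # One-to-one mapping: take the reference spelling, keep Whisper's timing.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                corrected.append({**whisper_words[i], "word": known[j]})
        else:
            # Word counts differ: keep what Whisper heard and flag it for manual editing.
            for i in range(i1, i2):
                corrected.append({**whisper_words[i], "needs_review": True})
    return corrected


heard = [
    {"word": w, "start": i * 0.5, "end": i * 0.5 + 0.4}
    for i, w in enumerate("Excuse me while I kiss this guy".split())
]
fixed = align_lyrics(heard, "Excuse me while I kiss the sky")
print(" ".join(w["word"] for w in fixed))
```

Spans where the word counts disagree are left for the manual editor instead of guessing at timings.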
Storing lyrics: MP4 Metadata
Standard atoms - Artist, title, album, cover art
stem atom - NI Stems metadata for DJ software
kara atom - Synced lyrics for karaoke
One .stem.mp4 file works in Traktor, Mixxx, AND Loukai
No conversion needed between DJ and karaoke use.
Software that doesn't understand an atom just ignores it.
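The "just ignores it" behavior falls out of how MP4 atoms are laid out: each one starts with a 4-byte big-endian size and a 4-byte type, so a reader can skip anything it does not recognize. A minimal sketch that walks one level of atoms in a byte string; it does not descend into nested atoms, and where the stem and kara atoms actually sit in the file is not something these slides specify.

```python
import struct


def iter_atoms(data):
    """Yield (type, payload) for each atom found at this level of `data`."""
    offset = 0
    while offset + 8 <= len(data):
        size, kind = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            size = struct.unpack_from(">Q", data, offset + 8)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the atom runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed atom; stop rather than loop forever
        yield kind.decode("latin-1"), data[offset + header:offset + size]
        offset += size


def find_atom(data, wanted):
    """Return the payload of the first atom of type `wanted`, skipping all others."""
    for kind, payload in iter_atoms(data):
        if kind == wanted:
            return payload
    return None
```

A DJ tool would call find_atom(data, "stem") and never look at kara; a karaoke player can read both.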
Loukai - Open Source Player
MP4 stems karaoke with real-time mixing
Also supports legacy CD+G format
Audio-reactive visualizations
Cross-platform (Linux, Windows, macOS)
Built with Electron, React, Web Audio API.
Karaoke Creator / Editor
Demucs - AI stem separation
Whisper - AI lyrics transcription
CREPE - Pitch tracking for musical key detection
LLM correction - Fix transcription errors
Manual Editing - better timings, lyrics adjustments
any audio file → karaoke-ready .stem.mp4
No separate tools needed. The Creator tab handles everything.
Tech Stack
Electron
React
Vite
Tailwind CSS
Butterchurn/WebGL
Web Audio API
Socket.IO
WASM
PyTorch
ffmpeg
Modern stack, fast iteration.
Demo Time!
Let's light this candle
Live demo of Loukai:
- Load a stems file
- Show stem mixing
- Show visualizations
- Show web remote
Karaoke in 2026
AI-generated stems from any song
Automatic lyrics with timestamps
Real-time pitch correction
Vocal coaching
WebGPU compute shaders?
.stem.mp4 is a worthy successor to CD+G!