How We Built Cross-Meeting Voice Fingerprinting
Using SpeechBrain's ECAPA-TDNN model, we can identify speakers across meetings with just seconds of audio. Here's how we engineered our voice recognition pipeline.
One of the most powerful features of Huddix is the ability to recognize speakers across different meetings — even when they join from different devices or accounts. This post dives deep into how we built our voice fingerprinting system.
The Challenge
In a typical meeting intelligence system, speaker diarization (figuring out "who spoke when") is done within a single meeting. But what if you could identify a speaker across meetings? What if "Sarah from Engineering" was automatically recognized even when she called in from her phone to a client call?
This requires a different approach — voice fingerprinting, not just diarization.
Our Approach: ECAPA-TDNN
We evaluated several approaches for voice fingerprinting, including traditional methods like i-vectors and modern deep learning approaches. After extensive testing, we chose ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network), using the implementation from SpeechBrain.
ECAPA-TDNN works by:
- Extracting mel-frequency features from audio (MFCCs in the original ECAPA-TDNN paper; SpeechBrain's pretrained VoxCeleb recipe uses log-mel filterbank features)
- Processing them through a series of dilated convolutional layers
- Applying channel-wise and context-dependent attention mechanisms
- Producing a compact, fixed-length embedding vector that characterizes a speaker's voice
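To make the middle steps concrete, here is a toy sketch of two of the ideas in that list: a dilated 1-D convolution (taps spaced several frames apart, which widens the temporal context without adding weights) and statistics pooling (collapsing a variable-length sequence into a fixed-length vector). It is single-channel and uses plain mean and standard deviation, whereas the real model is multi-channel and weights frames with attention. The function names and the example kernel are illustrative assumptions, not part of SpeechBrain.

```python
import math


def dilated_conv1d(frames, kernel, dilation):
    """Single-channel 1-D convolution whose taps are `dilation` frames apart."""
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    out = []
    for t in range(len(frames) - span):
        out.append(sum(w * frames[t + i * dilation] for i, w in enumerate(kernel)))
    return out


def stats_pool(frames):
    """Collapse a variable-length sequence into a fixed-length [mean, std] pair."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return [mean, math.sqrt(var)]


# Toy "feature track": one value per frame (a real model has many channels).
features = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# A 3-tap kernel with dilation 2 looks at frames t, t+2 and t+4.
hidden = dilated_conv1d(features, kernel=[1.0, 0.0, -1.0], dilation=2)

# Whatever the input length, pooling yields a vector of the same size.
embedding = stats_pool(hidden)
```

The point of the pooling step is that a 3-second segment and a 30-second segment both end up as vectors of the same length, which is what makes them directly comparable later.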
Building the Pipeline
Our voice fingerprinting pipeline has several stages:
1. Audio Preprocessing
First, we preprocess the audio to handle different sample rates, remove noise, and normalize volume levels. We use WebRTC VAD (Voice Activity Detection) to segment speech from silence.
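A VAD such as WebRTC's returns one speech/non-speech decision per short frame, so the segmentation step largely comes down to merging runs of speech frames into time ranges. The sketch below assumes those per-frame flags have already been computed; `frames_to_segments`, the 30 ms frame length (one of the frame sizes WebRTC VAD accepts), and the 90 ms minimum duration are illustrative choices, not our production settings.

```python
def frames_to_segments(speech_flags, frame_ms=30, min_speech_ms=90):
    """Turn per-frame speech flags into a list of (start_ms, end_ms) segments."""
    segments = []
    start = None
    # A trailing False acts as a sentinel so a run that reaches the end is closed.
    for i, is_speech in enumerate(list(speech_flags) + [False]):
        if is_speech and start is None:
            start = i  # a speech run begins
        elif not is_speech and start is not None:
            # The run ended at frame i; keep it only if it is long enough.
            if (i - start) * frame_ms >= min_speech_ms:
                segments.append((start * frame_ms, i * frame_ms))
            start = None
    return segments


# One flag per 30 ms frame, as a frame-level VAD would produce.
flags = [False, True, True, True, False, True, False, True, True, True, True]
segments = frames_to_segments(flags)
```

In this example the isolated single speech frame is discarded as too short, leaving two segments. Dropping very short runs is a common way to avoid feeding clicks and breaths to the embedding model.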
2. Speaker Embedding Generation
For each speaker segment, we generate a 192-dimensional embedding vector using our trained ECAPA-TDNN model. This happens in real time as meetings progress.
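For readers who want to try this step, the following is a minimal sketch using SpeechBrain's `EncoderClassifier`. It assumes the publicly available `speechbrain/spkrec-ecapa-voxceleb` checkpoint rather than our own trained model, and the helper names `load_encoder` and `embed_segment` are hypothetical. The import is done lazily because the module path differs between SpeechBrain releases.

```python
def load_encoder(source="speechbrain/spkrec-ecapa-voxceleb",
                 savedir="pretrained_models/spkrec-ecapa-voxceleb"):
    """Load a pretrained ECAPA-TDNN speaker encoder (downloads on first use)."""
    try:
        # Module path used by SpeechBrain 1.0 and later.
        from speechbrain.inference.speaker import EncoderClassifier
    except ImportError:
        # Module path used by earlier releases.
        from speechbrain.pretrained import EncoderClassifier
    return EncoderClassifier.from_hparams(source=source, savedir=savedir)


def embed_segment(encoder, waveform):
    """Return one embedding as a plain list of floats.

    `waveform` is assumed to be a torch tensor of shape [1, num_samples],
    sampled at the rate the checkpoint expects (16 kHz for the public model).
    """
    embeddings = encoder.encode_batch(waveform)  # shape: [batch, 1, dim]
    return embeddings.squeeze().tolist()
```

A typical call would load the encoder once at startup, then pass each speech segment's waveform to `embed_segment`; with the public checkpoint the resulting list has 192 entries.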
3. Clustering and Linking
Within a meeting, we use agglomerative clustering to group segments by speaker. Across meetings, we compare embedding vectors using cosine similarity to identify when the same person is speaking.
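Both halves of this stage rest on the same similarity measure, so a small sketch can cover them together. The code below is a simplified illustration, not our production implementation: it uses average-linkage agglomerative clustering with a stopping threshold, and a nearest-profile lookup for cross-meeting linking. The 0.6 threshold, the 3-dimensional vectors, and the profile names are assumptions chosen for readability; real embeddings have 192 dimensions and thresholds need tuning on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(embeddings, threshold=0.6):
    """Average-linkage agglomerative clustering; returns lists of segment indices."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        sims = [cosine_similarity(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        score, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if score < threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters


def link_speaker(embedding, profiles, threshold=0.6):
    """Return the id of the best-matching stored profile, or None if none qualifies."""
    best_id, best_score = None, threshold
    for speaker_id, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


segment_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
clusters = cluster_segments(segment_embeddings)

profiles = {"sarah": [1.0, 0.0, 0.0], "omar": [0.0, 1.0, 0.0]}
match = link_speaker([0.9, 0.1, 0.0], profiles)
unknown = link_speaker([0.0, 0.0, 1.0], profiles)
```

Returning `None` when no profile clears the threshold matters as much as finding a match: it is what lets a new speaker be enrolled instead of being forced onto the closest existing profile.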
4. Voice Profile Updates
Each user's voice profile is updated incrementally as they speak in more meetings. We use an exponential moving average to balance stability (not changing too frequently) with responsiveness (adapting to voice changes over time).
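As a rough illustration of the update rule, an exponential moving average keeps a fraction `1 - alpha` of the stored profile and mixes in a fraction `alpha` of the new embedding. The value `alpha=0.1` and the rescaling to unit length afterwards are assumptions made for this sketch; the rescaling simply keeps profiles on a common scale.

```python
import math


def update_profile(profile, new_embedding, alpha=0.1):
    """Blend a new embedding into a stored profile, then rescale to unit length."""
    blended = [(1 - alpha) * p + alpha * e for p, e in zip(profile, new_embedding)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended] if norm else blended


profile = [1.0, 0.0]
# A new observation that points in a different direction nudges the profile
# only slightly, because alpha is small.
updated = update_profile(profile, [0.0, 1.0], alpha=0.1)
```

A smaller `alpha` favors stability, since a single noisy meeting barely moves the profile, while a larger `alpha` favors responsiveness to genuine changes in someone's voice or microphone.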
Performance
On our benchmark dataset of 500 meetings with 2,000+ unique speakers, ECAPA-TDNN achieves:
- 95.3% speaker identification accuracy within meetings
- 89.7% cross-meeting speaker linking accuracy
- ~50 ms of processing per second of audio, a real-time factor of about 0.05 (fast enough for real-time use)
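The processing figure can be read as a real-time factor: processing time divided by audio duration, where anything below 1.0 keeps up with live audio. A quick worked example, with a hypothetical helper name:

```python
def real_time_factor(processing_ms, audio_ms):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_ms / audio_ms


rtf = real_time_factor(processing_ms=50, audio_ms=1000)

# Equivalent view: how many seconds of audio are handled per second of compute.
speedup = 1000 / 50

# Approximate compute time, in seconds, for a one-hour meeting at this rate.
hour_cost_s = rtf * 3600
```

At this rate, a one-hour meeting needs roughly three minutes of compute per stream.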
Privacy Considerations
Voice fingerprints are stored as encrypted embedding vectors, not as audio. An embedding summarizes speaker characteristics rather than the words spoken, so the original speech cannot be reconstructed from it, which protects user privacy while still enabling accurate identification.
Future Work
We're exploring several improvements, including handling multiple languages, improving recognition for accented speech, and reducing the amount of enrollment audio needed for new speakers.