What is Huddix and how does it work?

Huddix is an AI-powered meeting intelligence platform. Once installed, it automatically detects when you join a meeting on Zoom, Google Meet, or Microsoft Teams. It captures the audio, transcribes everything with speaker labels, generates AI summaries with action items, and stores searchable memories across all your meetings.

Which meeting platforms does Huddix support?

Huddix works with all major meeting platforms including Zoom, Google Meet, Microsoft Teams, Webex, and any app that uses your microphone. The desktop app auto-detects meetings — no browser extensions or bots needed.

Is Huddix free to use?

Yes. Huddix offers a free plan that includes local transcription and up to 5 meetings per month. Our Pro plan ($10/month) adds HD cloud transcription, AI summaries, semantic memory search, voice fingerprinting, and calendar integration, with an allowance of 30 hours per month.

How accurate is the transcription?

Huddix uses AssemblyAI for HD cloud transcription, achieving over 95% accuracy for clear audio. Our system includes automatic speaker diarization, so you always know who said what — even in meetings with many participants.

What is voice fingerprinting?

Voice fingerprinting uses neural network technology (ECAPA-TDNN) to create a unique voice profile for each speaker. This allows Huddix to automatically recognize and label speakers across different meetings — even when they switch devices or join from different accounts.

How does the cross-meeting memory work?

After each meeting, Huddix extracts key facts, decisions, and action items and stores them as semantic embeddings using pgvector. You can search across all your meetings using natural language — like asking "What did Sarah say about the Q4 budget?" and getting instant, accurate results.

Is my meeting data secure?

Absolutely. Huddix uses end-to-end encryption, row-level security in our database, and follows SOC 2 security practices. Your audio and transcripts are only accessible to you. You can export or delete all your data at any time.

What platforms is the desktop app available on?

The Huddix desktop app is available for macOS (Apple Silicon and Intel), Windows, and Linux. Download it from our download page and set it up in under a minute.


Engineering Mar 22, 2026 8 min read

How We Built Cross-Meeting Voice Fingerprinting

Using SpeechBrain's ECAPA-TDNN model, we can identify speakers across meetings with just seconds of audio. Here's how we engineered our voice recognition pipeline.

Huddix Team

One of the most powerful features of Huddix is the ability to recognize speakers across different meetings — even when they join from different devices or accounts. This post dives deep into how we built our voice fingerprinting system.

The Challenge

In a typical meeting intelligence system, speaker diarization (figuring out "who spoke when") is done within a single meeting. But what if you could identify a speaker across meetings? What if "Sarah from Engineering" was automatically recognized even when she called in from her phone to a client call?

This requires a different approach — voice fingerprinting, not just diarization.

Our Approach: ECAPA-TDNN

We evaluated several approaches for voice fingerprinting, including traditional methods like i-vectors and modern deep learning approaches. After extensive testing, we chose the ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) implementation from SpeechBrain.

ECAPA-TDNN works by:

Extracting mel-frequency cepstral coefficients (MFCCs) from audio
Processing them through a series of dilated convolutional layers
Applying channel-wise and context-dependent attention mechanisms
Producing a compact embedding vector that uniquely represents a voice
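
To make the last two steps more concrete, here is a minimal sketch of channel-wise attentive statistics pooling, the mechanism ECAPA-TDNN uses to collapse a variable-length sequence of frames into a fixed-size vector. It is an illustration in plain Python, not our production code. The function names are hypothetical, the attention scores are passed in directly (in the real model a small learned network produces them), and the final projection down to the embedding size is omitted.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats_pooling(frames, scores):
    """Pool T frames of C channels into a 2*C vector.

    frames, scores: non-empty lists of T rows, each a list of C floats.
    For every channel, the scores are turned into attention weights over
    time, then used to compute a weighted mean and weighted standard
    deviation. Returns the C means followed by the C standard deviations.
    """
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        weights = softmax([row[c] for row in scores])  # attention over time
        column = [row[c] for row in frames]
        mean = sum(w * x for w, x in zip(weights, column))
        var = sum(w * x * x for w, x in zip(weights, column)) - mean * mean
        means.append(mean)
        stds.append(math.sqrt(max(var, 0.0)))  # clamp tiny negative round-off
    return means + stds
```

With uniform scores this reduces to an ordinary mean and standard deviation per channel. Non-uniform scores let the model emphasize the frames that carry the most speaker information.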

Building the Pipeline

Our voice fingerprinting pipeline has several stages:

1. Audio Preprocessing

First, we preprocess the audio to handle different sample rates, remove noise, and normalize volume levels. We use WebRTC VAD (Voice Activity Detection) to segment speech from silence.
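
The idea behind this stage can be sketched in a few lines of Python. This is a simplified illustration, not our pipeline: it uses peak normalization and a plain energy gate as a stand-in for WebRTC VAD, and the function names, 30 ms frame size, and 0.05 threshold are all assumptions chosen for the example.

```python
import math

def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def frame_rms(samples, frame_len):
    """Root-mean-square energy of each full, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_segments(samples, sample_rate=16000, frame_ms=30, threshold=0.05):
    """Return (start_sec, end_sec) spans whose frame RMS reaches threshold.

    Hypothetical energy gate standing in for WebRTC VAD.
    """
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_rms(samples, frame_len)
    segments, start = [], None
    for idx, energy in enumerate(energies):
        t = idx * frame_len / sample_rate
        if energy >= threshold and start is None:
            start = t  # speech begins
        elif energy < threshold and start is not None:
            segments.append((start, t))  # speech ends
            start = None
    if start is not None:  # audio ended while still in speech
        segments.append((start, len(energies) * frame_len / sample_rate))
    return segments
```

A real voice activity detector is far more robust to background noise than an energy threshold, but the output has the same shape: a list of time spans that are passed on to the embedding stage.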

2. Speaker Embedding Generation

For each speaker segment, we generate a 192-dimensional embedding vector using our trained ECAPA-TDNN model. This happens in real time as the meeting progresses.

3. Clustering and Linking

Within a meeting, we use agglomerative clustering to group segments by speaker. Across meetings, we compare embedding vectors using cosine similarity to identify when the same person is speaking.
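
The cross-meeting comparison can be sketched as follows. This is a minimal example under stated assumptions: `link_speaker`, the dictionary of stored profiles, and the 0.7 threshold are hypothetical and chosen for illustration, and the toy vectors have 3 dimensions rather than 192.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def link_speaker(embedding, profiles, threshold=0.7):
    """Find the stored profile most similar to embedding.

    Returns (name, score) for the best match, or (None, score) when the
    best score falls below threshold, meaning the voice looks new.
    """
    best_name, best_score = None, -1.0
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy usage with made-up profiles.
profiles = {"sarah": [1.0, 0.0, 0.0], "raj": [0.0, 1.0, 0.0]}
match, score = link_speaker([0.9, 0.1, 0.0], profiles)
```

The threshold controls the trade-off between false links (two people merged into one profile) and missed links (one person split across profiles), so in practice it would be tuned on labeled data.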

4. Voice Profile Updates

Each user's voice profile is updated incrementally as they speak in more meetings. We use an exponential moving average to balance stability (not changing too frequently) with responsiveness (adapting to voice changes over time).
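
An exponential moving average update of this kind can be sketched as below. The names and the smoothing factor `alpha=0.1` are assumptions for illustration, as is the choice to re-normalize the profile to unit length after each update so that cosine comparisons stay well behaved.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def update_profile(profile, new_embedding, alpha=0.1):
    """Blend a new embedding into an existing voice profile.

    new_profile = (1 - alpha) * profile + alpha * new_embedding,
    followed by re-normalization. A small alpha favors stability,
    a larger alpha adapts faster to changes in the voice.
    """
    new_unit = l2_normalize(new_embedding)
    if profile is None:
        return new_unit  # first sighting: the embedding becomes the profile
    blended = [(1.0 - alpha) * p + alpha * n for p, n in zip(profile, new_unit)]
    return l2_normalize(blended)
```

With `alpha=0.1`, a single noisy segment moves the profile only slightly, while a consistent shift across many meetings gradually pulls it toward the new voice characteristics.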

Performance

On our benchmark dataset of 500 meetings with 2,000+ unique speakers, ECAPA-TDNN achieves:

95.3% speaker identification accuracy within meetings
89.7% cross-meeting speaker linking accuracy
~50 ms of processing per second of audio (about 20x faster than real time, which enables live processing)
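
The latency figure is easiest to read as a real-time factor: processing time divided by audio duration. The short calculation below shows the arithmetic for roughly 50 ms of compute per second of audio. The helper names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def compute_seconds(meeting_seconds, rtf):
    """Total processing time needed for a meeting of the given length."""
    return meeting_seconds * rtf

rtf = real_time_factor(0.050, 1.0)
print(f"RTF: {rtf:.2f}, speed-up: {1 / rtf:.0f}x")
print(f"60-minute meeting: {compute_seconds(3600, rtf):.0f} s of compute")
```

At this rate a one-hour meeting needs about three minutes of processing in total, which leaves ample headroom for running the model while the meeting is still in progress.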

Privacy Considerations

Voice fingerprints are stored as encrypted embedding vectors — not actual audio. These vectors cannot be reversed to reconstruct speech, ensuring user privacy while still enabling accurate identification.

Future Work

We're exploring several improvements, including handling multiple languages, improving recognition for accented speech, and reducing the amount of enrollment audio needed for new speakers.
