This system processes interviews in near real-time—first separating speakers in the audio stream, then transcribing and analyzing each utterance using deep learning models implemented in PyTorch.
We use Google Research’s Accurate Online Speaker Diarization (AO-SD), which combines a dual-stage architecture: a frame-level embedding extractor and a clustering head that groups frames by speaker identity.
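As an illustrative sketch only (not the actual AO-SD code), the two stages can be pictured as a frame-level speaker-embedding model followed by a clustering step. The `SpeakerEncoder` module, the mel-frame input shape, and the fixed speaker count below are assumptions made for the example.

```python
# Illustrative two-stage diarization sketch (not the actual AO-SD implementation):
# stage 1 extracts an embedding per audio frame, stage 2 clusters frames by speaker.
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering

class SpeakerEncoder(nn.Module):
    """Hypothetical frame-level embedding extractor (stands in for the real model)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, n_mels) -> (num_frames, emb_dim)
        return self.net(frames)

def diarize_frames(frames: torch.Tensor, num_speakers: int = 2) -> list[int]:
    """Assign a speaker label to every frame by clustering its embedding."""
    encoder = SpeakerEncoder()
    with torch.no_grad():
        embeddings = encoder(frames).numpy()
    clustering = AgglomerativeClustering(n_clusters=num_speakers)
    return clustering.fit_predict(embeddings).tolist()

# Example: 1000 mel-spectrogram frames from a two-speaker interview
labels = diarize_frames(torch.randn(1000, 80), num_speakers=2)
```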
Each diarization entry specifies the start and end times \( (t_i^{\text{start}}, t_i^{\text{end}}) \) and the speaker ID \( s_i \), which together guide the Whisper transcription of that segment.
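A minimal sketch of how those entries can drive transcription, assuming the diarization output is a list of (start, end, speaker_id) tuples and 16 kHz mono audio, and using the open-source openai-whisper package; the entry format and the helper name are assumptions for illustration:

```python
# Sketch: transcribe each diarized segment with Whisper, tagged by speaker ID.
# Assumes 16 kHz mono audio as a float32 numpy array and (start, end, speaker_id)
# diarization entries; both are assumptions made for this example.
import numpy as np
import whisper

SAMPLE_RATE = 16000
model = whisper.load_model("base")

def transcribe_segments(audio: np.ndarray, segments: list[tuple[float, float, int]]):
    """Return one transcript entry per diarized segment."""
    transcript = []
    for start, end, speaker_id in segments:
        clip = audio[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)]
        result = model.transcribe(clip.astype(np.float32))
        transcript.append({
            "speaker": speaker_id,
            "start": start,
            "end": end,
            "text": result["text"].strip(),
        })
    return transcript
```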
Each sentence goes through a series of neural transformations for contextual analysis:
Sentence → Tokens: The sentence is split into tokens \( x = (x_1, x_2, \ldots, x_n) \).
Tokens → Embeddings: Each token \( x_i \) is mapped to a vector \( e_i \in \mathbb{R}^d \).
Contextual Encoding: A transformer maps each vector to a contextual embedding \( h_i = \mathrm{Encoder}(e_1, \ldots, e_n)_i \).
Pooling → Sentence Vector: All contextual embeddings are averaged: \( v = \frac{1}{n} \sum_{i=1}^{n} h_i \).
Classification: A softmax classifier outputs probabilities over classes: \( p = \mathrm{softmax}(W v + b) \).
This allows the system to extract context-aware representations and classify the content accurately in real time.
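To make the five steps concrete, here is a minimal sketch using the Hugging Face transformers library with a generic BERT encoder and an untrained linear classifier; the model name, the number of classes, and the classifier head are placeholders, not the system's actual configuration.

```python
# Sketch of the sentence-level pipeline: tokenize -> embed -> contextual encoding
# -> mean pooling -> softmax classification. Model name and class count are
# placeholders, not the production setup.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 4)  # e.g. 4 content classes

def classify_sentence(sentence: str) -> torch.Tensor:
    # 1-2) Sentence -> tokens -> embeddings (handled inside the tokenizer/encoder)
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # 3) Contextual encoding: h_i for every token
        hidden = encoder(**inputs).last_hidden_state        # shape (1, n, d)
        # 4) Mean pooling over tokens: v = (1/n) * sum_i h_i
        sentence_vector = hidden.mean(dim=1)                # shape (1, d)
        # 5) Softmax over class logits: p = softmax(W v + b)
        probs = torch.softmax(classifier(sentence_vector), dim=-1)
    return probs.squeeze(0)

print(classify_sentence("I think this project has a lot of potential."))
```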
[Speaker A]: I think this project has a lot of potential.
[Speaker B]: Definitely—especially in real-time applications.
We are planning to expand this to full multimodal intelligence by integrating:
This will enable powerful applications in UX research, journalism, and real-time moderation—highlighting what was said, how it was said, and by whom.