Real-Time Interview Analysis Using Voice and Text Intelligence

This system processes interviews in near real-time—first separating speakers in the audio stream, then transcribing and analyzing each utterance using deep learning models implemented in PyTorch.

Speaker Diarization with Google Research

We use Google Research’s Accurate Online Speaker Diarization (AO-SD), which combines a dual-stage architecture: a frame-level embedding extractor and a clustering head that groups frames by speaker. The diarizer outputs a set of speaker-labeled time segments:

\[ \text{labels} = \{ (t_i^{\text{start}},\, t_i^{\text{end}},\, s_i) \}_{i=1}^{K} \]

Each of the \( K \) entries specifies a start time \( t_i^{\text{start}} \), an end time \( t_i^{\text{end}} \), and a speaker ID \( s_i \); these segments guide the Whisper transcription.
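
As a minimal sketch of how such labels can drive segment-wise transcription: the snippet below assumes a plain Python list of (start, end, speaker) tuples from the diarizer and uses the open-source openai-whisper package; the file name and segment values are illustrative, not taken from the production system.

import whisper  # openai-whisper

SAMPLE_RATE = 16000  # Whisper operates on 16 kHz mono audio

# Hypothetical diarization output: (start_sec, end_sec, speaker_id)
labels = [(0.0, 3.2, "A"), (3.2, 6.9, "B")]

# Load the interview audio as a float32 waveform resampled to 16 kHz
audio = whisper.load_audio("interview.wav")
model = whisper.load_model("base")

transcript = []
for start, end, speaker in labels:
    # Cut out the diarized span and transcribe only that segment
    segment = audio[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)]
    result = model.transcribe(segment, fp16=False)
    transcript.append(f"[Speaker {speaker}]: {result['text'].strip()}")

print("\n".join(transcript))

Transcribing each diarized span separately keeps speaker attribution exact, at the cost of losing some acoustic context at segment boundaries.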

Neural Network Pipeline

Each sentence goes through a series of neural transformations for contextual analysis:

  • Sentence → Tokens: The sentence \( s \) is split into tokens:

    \[ s \rightarrow \{ w_1, w_2, \dots, w_n \} \]
  • Tokens → Embeddings: Each token \( w_i \) is mapped to a vector \( \mathbf{x}_i \in \mathbb{R}^d \):

    \[ w_i \rightarrow \mathbf{x}_i \]
  • Contextual Encoding: A transformer maps each vector to a contextual embedding \( \mathbf{e}_i \):

    \[ \mathbf{x}_i \rightarrow \mathbf{e}_i \]
  • Pooling → Sentence Vector: All contextual embeddings are averaged into a single sentence vector:

    \[ \mathbf{v} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{e}_i \]
  • Classification: A softmax classifier outputs probabilities over the classes:

    \[ \hat{y} = \mathrm{softmax}(W \mathbf{v} + b) \]

This allows the system to extract context-aware representations and classify the content accurately in real time.
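
The same pipeline can be sketched compactly in PyTorch. The module below is illustrative only: the vocabulary size, model dimension, layer count, and number of classes are assumed placeholders, tokenization is left to an external tokenizer, and the production model may differ.

import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    # tokens -> embeddings -> transformer encoder -> mean pooling -> softmax
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                 # w_i -> x_i
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)          # x_i -> e_i
        self.classifier = nn.Linear(d_model, n_classes)                # W v + b

    def forward(self, token_ids):                 # token_ids: (batch, n)
        x = self.embed(token_ids)                 # (batch, n, d)
        e = self.encoder(x)                       # contextual embeddings e_i
        v = e.mean(dim=1)                         # mean pooling -> sentence vector v
        return torch.softmax(self.classifier(v), dim=-1)  # class probabilities

# Illustrative usage with dummy token IDs for a 12-token sentence
model = SentenceClassifier()
probs = model(torch.randint(0, 30522, (1, 12)))
print(probs.shape)  # torch.Size([1, 5])

In training code the softmax would normally be folded into the loss (e.g. nn.CrossEntropyLoss on raw logits); it is applied explicitly here to mirror the formula above.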

Speaker-Attributed Transcript

[Speaker A]: I think this project has a lot of potential.
[Speaker B]: Definitely—especially in real-time applications.
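
One way to render such a transcript is to merge consecutive segments from the same speaker before printing each line. A minimal sketch, assuming per-segment results arrive as (speaker_id, text) pairs in time order (the pairs below are dummy data):

# Hypothetical per-segment results in time order
segments = [
    ("A", "I think this project"),
    ("A", "has a lot of potential."),
    ("B", "Definitely, especially in real-time applications."),
]

merged = []
for speaker, text in segments:
    if merged and merged[-1][0] == speaker:
        # Same speaker as the previous segment: extend that line
        merged[-1] = (speaker, merged[-1][1] + " " + text)
    else:
        merged.append((speaker, text))

for speaker, text in merged:
    print(f"[Speaker {speaker}]: {text}")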

Future Work

We are planning to expand this to full multimodal intelligence by integrating:

  • Real-time face detection and tracking (YOLOv8, RT-DETR)
  • Linking face tracks to voice clusters for multimodal identity resolution
  • Combining facial expression features with speech-based emotion signals

This will enable powerful applications in UX research, journalism, and real-time moderation—highlighting what was said, how it was said, and by whom.