Detect multiple voices without speech recognition

Is there a way to detect in real time whether multiple people are speaking? Do I need a voice recognition API for that?

I don't want to separate the audio and I don't want to transcribe it either. My approach would be to record frequently using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...

I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box, preferably a mobile/web-friendly API.

Now this might sound like a shopping list for Christmas, but as mentioned, I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take a heavy toll on performance.


Most similar problems (adult/child classification, speech/music classification, single voice vs. voice mixture classification) are standard machine learning problems. You can solve them with a classifier such as a GMM (Gaussian mixture model). You only need to construct training data for your task:

  • Collect a set of clean single-speaker recordings; audiobooks are an easy source to download.
  • Prepare mixed data by mixing the clean recordings together.
  • Train one GMM on the clean data and another on the mixed data.
  • Score new audio under both the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the two likelihoods.
  • You can find some code samples here:

    https://github.com/littleowen/Conceptor

    For example you can try

    https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb
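
The decision step from the list above can be sketched roughly as follows. This is only an illustration under loud assumptions: each class is modelled by a single diagonal Gaussian (a one-component stand-in for a real GMM), and the feature frames are synthetic random vectors rather than features extracted from audio (in practice these would be something like MFCC frames). All function names here are hypothetical and not taken from the linked repository.

```python
import math
import random


def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance from a list of feature vectors."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        # Floor the variance so the log-likelihood stays finite.
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dims)
    ]
    return means, variances


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood under a diagonal Gaussian model."""
    means, variances = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def is_mixture(frames, clean_model, mixed_model, threshold=0.0):
    """Decide 'mixture' when the log-likelihood ratio exceeds the threshold."""
    llr = avg_log_likelihood(frames, mixed_model) - avg_log_likelihood(frames, clean_model)
    return llr > threshold


def synthetic_frames(rng, center, n=200, dims=4):
    """ASSUMPTION: random vectors standing in for real audio feature frames."""
    return [[rng.gauss(center, 1.0) for _ in range(dims)] for _ in range(n)]


rng = random.Random(0)
# Train one model per class, as in the "train on both" step.
clean_model = fit_diag_gaussian(synthetic_frames(rng, 0.0))
mixed_model = fit_diag_gaussian(synthetic_frames(rng, 3.0))

# Classify a new chunk of frames by comparing the two likelihoods.
print(is_mixture(synthetic_frames(rng, 3.0), clean_model, mixed_model))
```

With real data you would likely swap the single Gaussian for a multi-component mixture (for example scikit-learn's `GaussianMixture`) and tune `threshold` on held-out recordings, since the two classes overlap far more than these synthetic clusters do.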
