Speech or no speech detection in Python

I am writing a program that recognizes speech. It records audio from the microphone and converts it to text using Sphinx. My problem is that I want to start recording only when the user actually says something.

I experimented by reading the audio level from the microphone and recording only when the level is above a particular value, but that isn't very effective: the program starts recording whenever it detects anything loud. This is the code I used:

import audioop  # note: removed from the standard library in Python 3.13
import wave

import pyaudio as pa


class speech():
    def __init__(self):
        # soundtrack properties
        self.format = pa.paInt16
        self.rate = 16000
        self.channel = 1
        self.chunk = 1024
        self.threshold = 150
        self.file = 'audio.wav'

        # initialise microphone stream
        self.audio = pa.PyAudio()
        self.stream = self.audio.open(format=self.format,
                                      channels=self.channel,
                                      rate=self.rate,
                                      input=True,
                                      frames_per_buffer=self.chunk)

    def record(self):
        # wait until the input volume rises above the threshold
        while True:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)  # RMS level of 16-bit samples
            if rms > self.threshold:
                break

        # array to store frames, starting with the chunk that triggered recording
        frames = [data]
        # record until the level drops back below the threshold
        while rms > self.threshold:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)
            frames.append(data)

        print('finished recording.... writing file....')
        write_frames = wave.open(self.file, 'wb')
        write_frames.setnchannels(self.channel)
        write_frames.setsampwidth(self.audio.get_sample_size(self.format))
        write_frames.setframerate(self.rate)
        write_frames.writeframes(b''.join(frames))
        write_frames.close()

Is there a way to differentiate between human voice and other noise in Python? I hope somebody can suggest a solution.


I think the issue is that you are trying to decide what to record before any recognition has happened, so the program cannot discriminate: recognisable speech is whatever gives meaningful results after recognition, which makes this a catch-22. You could simplify matters by listening for an opening keyword. You can also filter on the voice frequency range, as the human ear and the telephone companies both do, and you can look at the mark-space ratio. I believe there were some publications on that a while back, but be careful: it varies from language to language. A quick Google search can be very informative. You may also find this article interesting.
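
To make the frequency-range idea concrete, here is a minimal sketch, not a tested detector. It measures how much of a frame's spectral energy falls inside the 300-3400 Hz telephone band. The function names, the band limits as defaults, and the naive DFT are my own assumptions for illustration; real code would use an FFT, and in-band noise such as music would still pass this check.

```python
import cmath
import math
import struct


def band_energy_ratio(pcm, rate, low_hz=300.0, high_hz=3400.0):
    """Fraction of a 16-bit mono PCM frame's spectral energy inside the band.

    low_hz/high_hz default to the classic telephone band (an assumption).
    """
    n = len(pcm) // 2
    samples = struct.unpack('<%dh' % n, pcm[:2 * n])
    total = 0.0
    in_band = 0.0
    # Naive DFT over the positive-frequency bins (DC skipped).
    # O(n^2), so only suitable for short frames.
    for k in range(1, n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if low_hz <= k * rate / n <= high_hz:
            in_band += power
    return in_band / total if total else 0.0


def tone(freq_hz, rate=16000, n=256, amplitude=10000):
    """Hypothetical demo helper: one frame of a 16-bit sine wave."""
    values = [int(amplitude * math.sin(2 * math.pi * freq_hz * i / rate))
              for i in range(n)]
    return struct.pack('<%dh' % n, *values)


if __name__ == '__main__':
    print(band_energy_ratio(tone(1000), 16000))  # tone inside the voice band
    print(band_energy_ratio(tone(6000), 16000))  # tone above the voice band
```

In the recording loop you could require both the RMS level and this ratio to exceed a threshold before starting to record; the ratio threshold would have to be tuned by experiment.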


I think what you are looking for is VAD (voice activity detection). VAD can be used to preprocess speech for ASR. Here is an open-source project that implements VAD: link. I hope it helps.


This is an example script using a VAD library. https://github.com/wiseman/py-webrtcvad/blob/master/example.py
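
For a quick idea of how that library is used, here is a minimal sketch rather than a copy of the linked example. It assumes the third-party webrtcvad package is installed and that the audio is 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz, cut into frames of 10, 20 or 30 ms, which is what the library accepts. The helper names split_frames, speech_flags and make_webrtc_vad are made up for illustration.

```python
def split_frames(pcm, rate, frame_ms=30):
    """Yield consecutive frame_ms-long frames of 16-bit mono PCM.

    A trailing partial frame is dropped.
    """
    frame_bytes = int(rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def speech_flags(pcm, rate, vad, frame_ms=30):
    """Return one True/False per frame, as reported by vad.is_speech()."""
    return [vad.is_speech(frame, rate)
            for frame in split_frames(pcm, rate, frame_ms)]


def make_webrtc_vad(aggressiveness=2):
    """Create a py-webrtcvad detector; aggressiveness ranges from 0 to 3."""
    import webrtcvad  # third-party; imported lazily so the helpers above work without it
    return webrtcvad.Vad(aggressiveness)
```

With the question's settings this would be called as speech_flags(pcm, 16000, make_webrtc_vad()). Recording could then start once several consecutive frames are flagged as speech and stop after a run of non-speech frames.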
