Google Speech Recognition API: timestamp for each word?

It's possible to use Google's Speech Recognition API to get a transcription of an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API gives me this:

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word has been said?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

i.e. the word "one" was said between 00:00:00.23 and 00:00:00.80,
and the word "two" between 00:00:01.03 and 00:00:01.45.

PS: I'm looking for an API that supports languages other than English, especially French.


This is not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

CMUSphinx - free offline speech recognition API

SpeechMatics SaaS speech recognition API

Speech Recognition API from IBM


I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets
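
As a rough sketch of how the time offsets could be turned into the format asked for in the question: the snippet below assumes the JSON response carries a `words` list per alternative, with `word`, `startTime` and `endTime` fields whose times are strings such as "0.230s" (which is my understanding of the documented format when word time offsets are enabled). The function names and the sample response are made up for illustration, reusing the numbers from the question, and no request is actually sent.

```python
def parse_offset(offset):
    """Convert an offset string such as '1.030s' to seconds as a float."""
    return float(offset.rstrip("s"))


def word_timestamps(response):
    """Return [word, start, end] triples from the top alternative of each result.

    Assumes `response` is the already-decoded JSON body (a dict) with the
    shape described above; missing keys are treated as empty.
    """
    rows = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # The first alternative is taken as the most likely transcription.
        for info in alternatives[0].get("words", []):
            rows.append([
                info["word"],
                parse_offset(info["startTime"]),
                parse_offset(info["endTime"]),
            ])
    return rows


# Hypothetical response, hand-written to mirror the example in the question.
sample_response = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "one two",
                    "words": [
                        {"word": "one", "startTime": "0.230s", "endTime": "0.800s"},
                        {"word": "two", "startTime": "1.030s", "endTime": "1.450s"},
                    ],
                }
            ]
        }
    ]
}

print(word_timestamps(sample_response))
```

Time offsets have to be requested explicitly in the recognition config; see the linked page for the exact option and for the languages supported.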
