IBM Watson Speech

I'm working through the tutorial for IBM Watson Speech-to-Text, using the WebSocket interface for real-time transcription. I'm using Angular.

The first 25 lines of code are copied from the API reference. This code successfully connects and initiates a recognition request, and Watson sends me the message {"state": "listening"}.

I wrote a function onClose() that logs when the connection closes.

I made a button that runs the handler $scope.startSpeechRecognition. This uses getUserMedia() to stream audio from the microphone and websocket.send() to stream the data to Watson. This isn't working: clicking the button closes the connection. I presume that I'm sending the wrong type of data, so Watson closes the connection?

I moved websocket.send(blob); from onOpen to my handler $scope.startSpeechRecognition, and changed websocket.send(blob); to websocket.send(mediaStream);. I might have this wrong: 'content-type': 'audio/l16;rate=22050'. How do I know what bit rate comes from the microphone?

Is there a tutorial for JavaScript? When I google "IBM Watson Speech-to-Text JavaScript tutorial", the top result is an 8,000-line SDK. Is the SDK required, or can I write a simple program to learn how the service works?

Here's my controller:

'use strict';
app.controller('WatsonController', ['$scope', 'watsonToken',  function($scope, watsonToken) {
  console.log("Watson controller.");

  var token = watsonToken;
  var wsURI = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
    + "?watson-token=" + token + '&model=en-US_BroadbandModel';

  var websocket = new WebSocket(wsURI); // opens connection to Watson
  websocket.onopen = function(evt) { onOpen(evt) }; // executes when a connection opens
  websocket.onclose = function(evt) { onClose(evt) }; // executes when a connection closes
  websocket.onmessage = function(evt) { onMessage(evt) }; // logs messages from Watson to the console
  websocket.onerror = function(evt) { onError(evt) }; // logs errors to the console

  function onOpen(evt) {
    var message = {
      action: 'start',
      'content-type': 'audio/flac',
      'interim_results': true,
      'max-alternatives': 3,
      keywords: ['colorado', 'tornado', 'tornadoes'],
      'keywords_threshold': 0.5
    };
    websocket.send(JSON.stringify(message));

    // Prepare and send the audio file.
    // websocket.send(blob);

    // websocket.send(JSON.stringify({action: 'stop'}));
  }

  function onClose() {
    console.log("Connection closed.");
  }

  function onMessage(evt) {
    console.log(evt.data); // log the message to the console
  }

  $scope.startSpeechRecognition = () => {
    console.log("Starting speech recognition.");
    var constraints = { audio: true, video: false };
    navigator.mediaDevices.getUserMedia(constraints)
    .then(function(mediaStream) {
      console.log("Streaming audio.");
      websocket.send(mediaStream);
    })
    .catch(function(err) { console.log(err.name + ": " + err.message); }); // log errors
  };

  $scope.stopSpeechRecognition = () => { // handler for button
    console.log("Stopping speech recognition.");
    websocket.send(JSON.stringify({action: 'stop'}));
  };

  $scope.closeWatsonSpeechToText = () => { // handler for button
    console.log("Closing connection to Watson.");
    websocket.close(); // closes connection to Watson?
  };

}]);

And here's my template:

<div class="row">
  <div class="col-sm-2 col-md-2 col-lg-2">
    <p>Watson test.</p>
  </div>
</div>

<div class="row">
  <div class="col-sm-2 col-md-2 col-lg-2">
    <button type="button" class="btn btn-primary" ng-click="startSpeechRecognition()">Start</button>
  </div>

  <div class="col-sm-2 col-md-2 col-lg-2">
    <button type="button" class="btn btn-warning" ng-click="stopSpeechRecognition()">Stop</button>
  </div>

  <div class="col-sm-2 col-md-2 col-lg-2">
    <button type="button" class="btn btn-danger" ng-click="closeWatsonSpeechToText()">Close</button>
  </div>
</div>

The SDK is not required, but, as German Attanasio said, it does make your life much easier.

On to your code, though: this line definitely won't work:

websocket.send(mediaStream);

The mediaStream object from getUserMedia() cannot be sent directly over the WebSocket: WebSockets only accept text and binary data (the blob in the original example). You have to extract the audio data and send only that, as in the sketch below.
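
Here's a minimal sketch (not the SDK's code) of pulling raw samples out of the mediaStream with the Web Audio API; the 4096 buffer size and single channel are arbitrary choices:

var audioContext = new AudioContext();

navigator.mediaDevices.getUserMedia({ audio: true, video: false })
  .then(function(mediaStream) {
    var source = audioContext.createMediaStreamSource(mediaStream);
    // 4096-sample buffer, 1 input channel, 1 output channel
    var processor = audioContext.createScriptProcessor(4096, 1, 1);

    processor.onaudioprocess = function(evt) {
      // raw 32-bit float samples -- see the conversion step below
      var float32Samples = evt.inputBuffer.getChannelData(0);
    };

    source.connect(processor);
    processor.connect(audioContext.destination); // some browsers won't fire onaudioprocess otherwise
  });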

But even that isn't sufficient in this case, because the Web Audio API provides the audio as 32-bit floats, which is not a format the Watson API natively understands. The SDK automatically extracts the audio and converts it to audio/l16;rate=16000 (16-bit ints).
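
For illustration, here's a rough sketch of that float-to-int conversion (the SDK's webaudio-l16-stream.js, linked below, does this more carefully, including down-sampling):

function float32ToInt16(float32Samples) {
  var int16Samples = new Int16Array(float32Samples.length);
  for (var i = 0; i < float32Samples.length; i++) {
    // clamp to [-1, 1], then scale to the signed 16-bit integer range
    var s = Math.max(-1, Math.min(1, float32Samples[i]));
    int16Samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return int16Samples;
}

// inside onaudioprocess:
// websocket.send(float32ToInt16(evt.inputBuffer.getChannelData(0)).buffer);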

How do I know what bit rate comes from the microphone?

It's available on the AudioContext, and if you add a ScriptProcessorNode, its callback is passed AudioBuffers that include the audio data and the sample rate. Multiply the sample rate by the size of each sample (32 bits before conversion to l16, 16 bits after) and by the number of channels (usually 1) to get the bit rate. For example, 44,100 samples per second x 16 bits x 1 channel = 705,600 bits per second after conversion.

BUT note that the number you put into the content-type after rate= is the sample rate, not the bit rate. So you can just copy it from the AudioContext or AudioBuffer without any multiplication. (Unless you down-sample the audio, as the SDK does; then it should be set to the target sample rate, not the input rate.)
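
For instance, if you send the audio at whatever rate the hardware captures it (no down-sampling), the start message could be built straight from the AudioContext's sampleRate property:

var audioContext = new AudioContext();
console.log(audioContext.sampleRate); // typically 44100 or 48000, hardware-dependent

var message = {
  action: 'start',
  'content-type': 'audio/l16;rate=' + audioContext.sampleRate,
  'interim_results': true
};
websocket.send(JSON.stringify(message));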

If you want to see how all of this works, the entire SDK is open source:

  • Extracting audio from the mediaStream: https://github.com/saebekassebil/microphone-stream/blob/master/microphone-stream.js
  • Converting and down-sampling: https://github.com/watson-developer-cloud/speech-javascript-sdk/blob/master/speech-to-text/webaudio-l16-stream.js
  • Managing the WebSocket: https://github.com/watson-developer-cloud/speech-javascript-sdk/blob/master/speech-to-text/recognize-stream.js

Familiarity with the Node.js Streams standard is helpful when reading these files.

FWIW, if you're using a bundling system like Browserify or Webpack, you can pick and choose only the parts of the SDK you need and get a much smaller file size. You can also set it up to download after the page loads and renders, since the SDK won't be part of your initial render.
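
As a hypothetical example (assuming the npm package exposes individual modules such as watson-speech/speech-to-text/recognize-microphone), picking just the microphone recognizer might look like this:

// require only one module instead of the full SDK bundle
var recognizeMic = require('watson-speech/speech-to-text/recognize-microphone');

var stream = recognizeMic({
  token: token,     // the same auth token used for the raw WebSocket
  objectMode: true  // emit result objects instead of plain text
});

stream.on('data', function(data) {
  console.log(data); // interim and final transcription results
});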
