Natural Language Understanding

Audiogum provides APIs for controlling devices through voice commands, either by sending raw audio from a microphone, or text that has been transcribed by a third-party speech-to-text service. The APIs return remote control actions that a device can act upon.

The voice APIs can be used in either streaming mode (via Remote Control WebSocket) or non-streaming mode (REST API).

Authorisation

To activate speech capabilities, a user (typically through a companion app) must acquire a "delegated" user token via the /v1/tokens API using grant_type=token. The token API issues a token that lets a device operate on the user's behalf; because the device cannot authenticate as the user itself, this token should be passed to the device to store securely and use.

In app

The user must use their own (pre-existing) user authentication access token to create a new token with the scope of speech. For more information about Audiogum authentication, see Authentication.

POST /v1/tokens HTTP/1.1
Content-Type: application/json
Authorization: Basic <client_id:secret in base64>

{
  "grant_type": "token",
  "token": "v1::...",
  "scope": "speech",
  "deviceid": "<deviceid>"
}

The response contains a body as follows. The access_token, expires_in and refresh_token should be passed to the device.

{
  "access_token": "v1::FNJUwKr...",
  "refresh_token": "v1::5AspKd...",
  "expires_in": 2592000
}

If the user has multiple devices, each device must receive its own dedicated speech token. The deviceid is required in the request.
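For illustration, here is a minimal sketch of this request from the companion app, written in Python with the requests library. The host name, client credentials, user token and device id are placeholder values:

import base64
import requests

# Placeholder values: substitute your real client credentials,
# the user's existing access token and the target device id.
CLIENT_ID, CLIENT_SECRET = "my-client-id", "my-client-secret"
USER_ACCESS_TOKEN = "v1::..."
DEVICE_ID = "<deviceid>"

basic = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()

resp = requests.post(
    "https://api.audiogum.com/v1/tokens",   # host assumed from the REST examples below
    headers={"Authorization": f"Basic {basic}"},
    json={
        "grant_type": "token",
        "token": USER_ACCESS_TOKEN,
        "scope": "speech",
        "deviceid": DEVICE_ID,
    },
)
resp.raise_for_status()
body = resp.json()

# Pass these values to the device over your own secure channel.
speech_token = body["access_token"]
refresh_token = body["refresh_token"]
expires_in = body["expires_in"]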

On device

The access_token passed to the device as above will be used to access either the speech REST or WebSocket API, and the refresh_token should be used to refresh the access token when it expires. See Authentication: Refreshing Tokens.

These tokens are long lived. A speech access token created in this way will last 30 days (or 24 hours for test client_ids), and the refresh token will never expire.

If the device receives a 401 response status (or a voiceerror message with status 401 from the WebSocket) when attempting to access the speech APIs, it should assume the access token has expired. If the speech device fails to refresh the access token, it may retry, but after a small number of retries the device must assume that authorization to use speech services on the user's behalf has been revoked and stop retrying.
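As a sketch of this retry policy in Python: the refresh call itself is represented by a hypothetical refresh_access_token callable (implementing Authentication: Refreshing Tokens), and the retry budget and backoff are arbitrary choices:

import time

MAX_REFRESH_ATTEMPTS = 3   # arbitrary small retry budget

def ensure_speech_token(refresh_access_token, refresh_token):
    """Called after a 401 from the speech REST API or a voiceerror with status 401.

    refresh_access_token is a hypothetical callable that performs the token
    refresh and returns a new access token, or raises on failure.
    """
    for attempt in range(MAX_REFRESH_ATTEMPTS):
        try:
            return refresh_access_token(refresh_token)
        except Exception:
            time.sleep(2 ** attempt)   # simple backoff between retries
    # After a small number of failures, treat the delegation as revoked:
    # stop retrying and disable voice features until re-authorised.
    return None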

Voice Streaming API

To use the streaming API (a minimal connection sketch follows this list):

  1. The device must have an open Remote Control WebSocket connection (see Remote Control)
  2. The device must send a TEXT message containing JSON content of type voiceproperties that contains the speech token and describes the audio format and nowplaying information.
  3. The device then sends BINARY messages that contain microphone audio data.
  4. The device must send a TEXT message of type voiceproperties with JSON content that includes nowplaying information whenever the player begins a new track during the voice interaction.
  5. The device will receive TEXT messages containing JSON content of various types:
    1. voiceresult
    2. voiceactivity
    3. voicephrase
    4. voiceerror
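The following is a minimal end-to-end sketch of these steps in Python using the websocket-client package. The WebSocket URL is a placeholder (see Remote Control for the real endpoint); read_mic_chunks, handle_actions, stop_microphone and handle_error are hypothetical device functions; step 4 (resending voiceproperties on a track change) is omitted; and a real device would interleave sending and receiving rather than doing them in sequence:

import json
import websocket  # pip install websocket-client

WS_URL = "wss://<remote-control-endpoint>"   # placeholder; see Remote Control docs
ws = websocket.create_connection(WS_URL)

# 1./2. Initiate the voice interaction with a voiceproperties TEXT message
#       (fields are described in the definitions below).
ws.send(json.dumps({
    "type": "voiceproperties",
    "speechtoken": "v1::FNJUwKr...",
    "ref": "171d8f27-71c4-44ab-a014-421a7c796643",
    "encoding": "pcm",
    "samplerate": 16000,
    "languagecode": "en-GB",
    "immediate": True,
    "reportactivity": True,
    "nowplaying": {"playstate": "idle", "volume": {"value": 5, "mute": False}},
}))

# 3. Stream microphone audio as BINARY messages; read_mic_chunks() is a
#    hypothetical generator yielding raw PCM chunks.
for chunk in read_mic_chunks():
    ws.send_binary(chunk)
ws.send_binary(b"")  # no trailing audio: signal that no more bytes follow

# 5. Handle TEXT messages from the service by type.
while True:
    message = json.loads(ws.recv())
    if message["type"] == "voicephrase":
        stop_microphone()                           # phrase recognised, stop sending audio
    elif message["type"] == "voiceresult":
        handle_actions(message.get("actions", []))  # act on the returned commands
        break
    elif message["type"] == "voiceerror":
        handle_error(message)
        break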

voiceproperties message

The voiceproperties message must be sent by the device on the Remote Control WebSocket to initiate a voice interaction.
It may also be sent again during the interaction as necessary to update nowplaying data if the player state changes.
Each interaction should have a ref generated for it that should be sent in these properties, as well as in voiceevents.

Example

{
  "type": "voiceproperties",
  "speechtoken": "v1::FNJUwKr...",
  "ref": "171d8f27-71c4-44...",
  "encoding": "pcm",
  "samplerate": 16000,
  "languagecode": "en-GB",
  "immediate": true,
  "reportactivity": true,
  "nowplaying": {
    "item": {
      "name": "A hundred moons",
      "artistdisplayname": "GoGo Penguin",
      "albumname": "A Humdrum Star",
      "duration": 267,
      "service": "tidal",
      "ref": "84245613",
      "id": "0a5d723756f71db06eaeac9cb309c7c9",
      "artists": [
        {
          "id": "26f645b5e354481292533a890c817a4b",
          "name": "GoGo Penguin"
        }
      ]
    },
    "playable": {
      "id": "e85d65fe39814a659af67ceb5853efc0",
      "userid": "f646b99cdb99455bbd0dfcfe2c5152b4",
      "name": "GoGo Penguin",
      "startindex": 10,
      "parameters": {
        "type": "dynamicplaylist",
        "service": "tidal",
        "variant": "artist"
      }
    },
    "playstate": "playing",
    "source": "playable",
    "presetnumber": 1,
    "offset": 1,
    "volume": {
      "value": 5,
      "mute": false
    }
  }
}

Definitions

Key | Description | Example
type | mandatory | "voiceproperties"
speechtoken | the access_token supplied to the device (see authorisation); mandatory to begin a voice interaction, optional subsequently |
ref | mandatory reference for the voice interaction; the app or device should generate a GUID and send it here | "171d8f27-71c4-44ab-a014-421a7c796643"
samplerate | mandatory to begin an interaction; the sample rate of the audio for every sample sent for the duration of the WebSocket | 16000
encoding | mandatory to begin an interaction | "pcm"
languagecode | optional, the language of the voice samples | "en-GB"
nowplaying | optional, but must be sent if the device is playing. See below. |
immediate | optional, set to true to enable receiving voicephrase messages | true
reportactivity | optional, set to true to enable receiving voiceactivity messages | true

nowplaying

Key | Description | Example
item | optional, the item that's currently playing. See below. |
playable | optional, the playable that's currently playing. See below. |
playstate | optional, the play state of the player; can be idle, buffering, playing or paused | "idle"
source | optional, only allowed characters 'a-z'; when playing an Audiogum playable, should be playable | "playable"
presetnumber | optional integer, supply if currently playing a preset | 8
offset | optional, an integer offset into the currently playing track, in seconds | 0
volume | optional, an object describing the playback volume. See below. |

item

See item analytics documentation

playable

See playable analytics documentation

volume

Key | Description | Example
value | mandatory, an integer | 11
mute | mandatory, a boolean; true if and only if the device is muted | false

BINARY audio messages

To transmit microphone audio for processing, the device must send BINARY messages on the WebSocket. The recommended chunk size is 8KB per message, up to a maximum of 32KB per message.

The device should not attempt to send more than 60 seconds of microphone data during the lifetime of a voice interaction. You may receive error messages after this time. Voice interaction can be re-started by sending voiceproperties with the initially required fields.

Terminating audio

The Audiogum API will detect voice activity in the incoming audio data. Once voice activity is no longer detected in the incoming bytes (silence, background noise, etc.), the speech will be processed.

If the client does not intend to send trailing audio bytes (silence, background noise, etc.), it must indicate that no more bytes are expected. To do this, the client sends an empty (zero-byte) BINARY message or uses the libaudiogum ag_voice_send_mic_terminator function.
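A sketch of the chunking and termination rules above, in Python. Here ws is an open Remote Control WebSocket (websocket-client) on which voiceproperties has already been sent, microphone is a hypothetical audio source yielding raw PCM chunks, and 16-bit mono PCM is assumed for the byte-budget arithmetic:

CHUNK_SIZE = 8 * 1024              # recommended 8KB per BINARY message (32KB max)
MAX_AUDIO_SECONDS = 60             # per-interaction limit described above
BYTES_PER_SECOND = 16000 * 2       # assuming 16kHz, 16-bit mono PCM

sent = 0
for chunk in microphone.chunks(CHUNK_SIZE):   # hypothetical chunk iterator
    if sent + len(chunk) > MAX_AUDIO_SECONDS * BYTES_PER_SECOND:
        break                      # stop rather than exceed the 60-second budget
    ws.send_binary(chunk)
    sent += len(chunk)

# No trailing silence will be sent, so terminate the audio explicitly with an
# empty (zero-byte) BINARY message.
ws.send_binary(b"")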

voiceresult message

When the Audiogum API has completed processing a voice phrase, it will return a TEXT message containing JSON with the type value voiceresult.

The voiceresult may include all the same fields as remotecommand messages from the Remote Control API. In particular, it may contain actions to be performed by the device and may contain respond indicating a voice response to play to the user.

See the commands documentation for details of actions and related features.

Example

{
  "type":"voiceresult",
  "text": "play me something",
  "actions": [
    {
      "action": "playplayable",
      "parameters": {
        "id": "bdfd3200cc6a49938035262cc4b7c5e6"
      }
    }
  ],
  "respond": {
    "languagecode": "en",
    "text": "sure, building a mix",
    "audio": "http://...08c0483166d78405aae44de9f8f514f24620848274.Amy.mp3"
  }
}
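A minimal sketch of handling a voiceresult on the device, in Python; perform_action and audio_player are hypothetical callables supplied by the device:

def handle_voiceresult(message, perform_action, audio_player):
    """Play the spoken response, then carry out the returned actions."""
    # Play the voice response first, if one was provided.
    respond = message.get("respond")
    if respond and respond.get("audio"):
        audio_player(respond["audio"])   # e.g. stream the returned mp3 URL

    # Then perform any actions, e.g. {"action": "playplayable", ...}.
    for action in message.get("actions", []):
        perform_action(action["action"], action.get("parameters", {}))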

voiceactivity message

If you wish to be notified whenever speech is detected in the incoming audio stream, you can include "reportactivity": true in the voiceproperties message when starting voice interaction. Whenever speech is detected, you will receive a message like:

{
  "type": "voiceactivity"
}

This message may be received many times, indicating that voice activity is still being recognised from the audio data. The standard voiceresult message will eventually follow when a phrase is processed.

This mode allows the client to implement smart microphone timeouts. You may wish to keep the microphone open longer whenever you receive a voiceactivity message, and close the microphone only if you have not received voiceactivity for some time.
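One possible shape for such a timeout, sketched in Python; receive_message and close_microphone are hypothetical device functions, and the quiet window is an arbitrary choice:

import time

QUIET_WINDOW = 2.0                              # seconds of quiet before closing the mic

last_activity = time.monotonic()
while True:
    message = receive_message(timeout=0.25)     # hypothetical: returns None on timeout
    if message is None:
        if time.monotonic() - last_activity > QUIET_WINDOW:
            close_microphone()                  # nothing heard for a while
            break
        continue
    if message["type"] == "voiceactivity":
        last_activity = time.monotonic()        # speech still being detected
    elif message["type"] in ("voicephrase", "voiceresult"):
        close_microphone()                      # phrase complete, stop sending audio
        break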

voicephrase message

If you wish to receive an instant reply that describes the transcribed text before Audiogum has processed the phrase, you can include "immediate": true in the voiceproperties message when starting voice interaction. Once a completed phrase has been recognised, you will immediately receive a message like:

{
  "type": "voicephrase",
  "text": "play me something"
}

This can be used as a signal to close the microphone and stop sending BINARY messages because no more speech will be processed for this interaction.

The standard voiceresult message will follow after the phrase has been processed.

Using libaudiogum for voice control

libaudiogum provides various functions that simplify the voice interaction using the Remote Control WebSocket.

  1. Store the speech token passed from the companion app with the ag_voice_store_delegated_token function.
  2. Connect with the ag_voice_open_session function, which handles initiating the voice interaction.
  3. Send now playing information with ag_voice_update_now_playing.
  4. Send binary audio data with ag_voice_send_mic_data.
  5. Actions will be indicated through the ag_action_callback.

Example voice interaction

In the following example, a user initiates voice control on a device and then talks to the device. The user says "play me something"; the service interprets the phrase and sends the device a voice response and details of the action to play music.


>> TEXT message sent by device to initiate voice interaction
{
  "type": "voiceproperties",
  "speechtoken": "v1::FNJUwKr...",
  "encoding": "pcm",
  "samplerate": 16000,
  "languagecode": "en-GB",
  "immediate": true,
  "reportactivity": true,
  "nowplaying": {
    "playstate": "idle",
    "volume": {
      "value": 5,
      "mute": false
    }
  }
}

>> BINARY message sent by device (audio data from mic)
>> BINARY message sent by device
>> ...

<< TEXT message received by device to indicate voice activity detected
{
  "type": "voiceactivity"
}

>> BINARY message sent by device
>> ...

<< TEXT message received by device to indicate voice activity detected
{
  "type": "voiceactivity"
}

>> BINARY message sent by device
>> ...

<< TEXT message received by device to indicate phrase detected, microphone can stop
{
  "type": "voicephrase",
  "text": "play me something"
}

<< TEXT message received by device with results of processing
{
  "type":"voiceresult",
  "text": "play me something",
  "respond": {
    "languagecode": "en",
    "text": "sure, building a mix",
    "audio": "http://...08c0483166d78405aae44de9f8f514f24620848274.Amy.mp3"
  },
  "actions": [
    {
      "action": "playplayable",
      "parameters": {
        "id": "bdfd3200cc6a49938035262cc4b7c5e6"
      }
    }
  ]
}

The following timeline diagram illustrates a typical interaction as above:

Example voice interaction timeline

Voice errors

If the device sends a TEXT message that cannot be understood by the server (validation error, incorrect structure, unparseable JSON content, etc.), the server will send an error TEXT message to the client via the WebSocket. Example:

{
  "type": "voiceerror",
  "status": 400,
  "message": "Received a message that failed validation",
  "errors": {"eencoding":"disallowed-key"}
}

If the device receives a message with status 401, it should assume the speech token is expired or invalid. See Authentication: On device. Example:

{
  "type": "voiceerror",
  "status": 401,
  "message": "Speech token expired"
}

If the device attempts to send BINARY messages before voice interaction is properly initiated, it may receive this message:

{
  "type": "voiceerror",
  "status": 400,
  "message": "Cannot send audio bytes before sending voiceproperties with speechtoken"
}

Voice errors may contain a respond map with text, languagecode and audio as in a voiceresult message. This allows an explanation of an error to be given as a voice message.

If the device successfully sends audio bytes and the content can be transcribed but not understood (e.g. the phrase "Blah, blah, blah"), it may receive this message:

{
  "type": "voiceerror",
  "status": 422,
  "message": "Detected an utterance, but we did not understand it",
  "respond": {
    "text" "Sorry, I didn't understand that",
    "languagecode": "en",
    "audio": "http://...eabd3a77bb33468aaf73fde4973a8f65.Amy.mp3"
  },
  "text": "Blah, blah, blah"
}

The Remote Control WebSocket may also close in cases of unexpected communication errors, see RFC6455 §7.4.1 for a detailed description of each status code.

If the Remote Control WebSocket closes, any voice interaction ongoing at that time will fail and must be abandoned.
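A sketch of voiceerror handling that ties these cases together, in Python; audio_player and refresh_speech_token are hypothetical device functions (see the token-refresh sketch in the authorisation section):

def handle_voiceerror(message, audio_player, refresh_speech_token):
    """Handle a voiceerror TEXT message from the WebSocket."""
    status = message.get("status")

    # Expired or invalid speech token: attempt a refresh as described above.
    if status == 401:
        refresh_speech_token()
        return

    # If the error carries a spoken explanation, play it to the user
    # (e.g. the 422 "Sorry, I didn't understand that" response).
    respond = message.get("respond")
    if respond and respond.get("audio"):
        audio_player(respond["audio"])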

Conversation

Audiogum can support two-way conversations with devices that indicate that they can use this feature. Conversations are enabled by setting the expectreply capability when connecting the Remote Control WebSocket (see Remote Control: Capabilities).

When enabled, some voiceresult messages may include "expectreply": true. In this case, after processing the respond and actions (if any), the device should establish another voice interaction by sending a voiceproperties message, opening the microphone and sending BINARY messages containing the encoded audio.
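A sketch of this flow in Python; play_response, perform_actions and start_voice_interaction are hypothetical device functions, where start_voice_interaction sends a fresh voiceproperties message and re-opens the microphone:

def on_voiceresult(message, play_response, perform_actions, start_voice_interaction):
    """Process a voiceresult, then continue the conversation if requested."""
    play_response(message.get("respond"))
    perform_actions(message.get("actions", []))

    # The service expects the user to reply, so begin a new voice interaction.
    if message.get("expectreply"):
        start_voice_interaction()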

Capabilities

Whether using the REST API or the WebSocket, a capabilities parameter should be specified to indicate to the service which features are desired. More details on capabilities can be found on the commands page.

Speech REST API

An alternative to the WebSocket API is the simpler REST API, which works for one-off commands as opposed to continuous listening.

API | Purpose
POST /v1/user/speech | Upload an audio file and receive spoken 'action'
POST /v1/user/speech/text | Upload text and receive 'action'

Text REST API

The POST /v1/user/speech/text API converts the text of an instruction taken from a separate ASR engine into action. The API requires a "delegated" user token - see authorisation. Note that the capabilities parameter passed in controls some feature availability - see the commands page.

Request:

POST /v1/user/speech/text HTTP/1.1
Host: api.audiogum.com
Authorization: Bearer …
Content-Type: application/json

{
  "capabilities": ["volumeup", "volumedown", "setvolume", "play", "stop"],
  "text": "Play some jazz",
  "deviceid": "4fac2cc33a05"
}

Response:

HTTP/1.1 201 Created
Content-Type: application/json;charset=utf-8

{
  "respond": {
    "text": "Sure. Fetching you some Jazz",
    "languagecode": "en",
    "audio": "http://...mp3"
  },
  "actions": [
    {
      "action": "playplayable",
      "parameters": {
        "id": "4b29667aa3024281af83ce009af647d4",
        "parameters": {
          "service": "tidal",
          "tags": [
            {
              "type": "genre",
              "id": "jazz"
            }
          ],
          "type": "dynamicplaylist",
          "allowunlinkedservice": true,
          "variant": "tag"
        }
      },
      "entities": {
        "genres": [
          {
            "id": "jazz",
            "name": "Jazz",
          }
        ]
      },
      "intent": {
        "name": "generateplaylist"
      }
    }
  ]
}

As shown, the successful case gives a 201 response code, but the Text REST API also returns error codes that match the WebSocket API:

Code | Meaning
401 | The speech token is expired or invalid - see Authentication: On device
410 | The phrase given contains a request to generate a dynamic playlist, but Audiogum could not create a playlist to fulfill the requirements (e.g. the requirements are too specific, or we have no relevant music available in the territory)
422 | The phrase given could not be understood; no intent was recognised
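A sketch of calling this API and branching on the documented status codes, in Python with the requests library; the token, capabilities and deviceid values are placeholders taken from the examples above:

import requests

resp = requests.post(
    "https://api.audiogum.com/v1/user/speech/text",
    headers={"Authorization": "Bearer v1::FNJUwKr..."},   # delegated speech token
    json={
        "capabilities": ["volumeup", "volumedown", "setvolume", "play", "stop"],
        "text": "Play some jazz",
        "deviceid": "4fac2cc33a05",
    },
)

if resp.status_code == 201:
    result = resp.json()     # contains respond and actions, as shown above
elif resp.status_code == 401:
    ...                      # refresh the speech token and retry
elif resp.status_code == 410:
    ...                      # no playlist could be generated for the request
elif resp.status_code == 422:
    ...                      # phrase not understood, no intent recognised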

Voice control features

The following examples illustrate some of the features enabled through Audiogum voice integration. Note that this is not an exhaustive list and Audiogum is continuously adding voice control features. New features will be automatically available to existing clients, subject to the capabilities declared by the device.

intent | example phrases | description
play | "play", "resume", "continue", "go" | Device plays/resumes
stop | "stop", "turn off" | Device stops
pause | "pause", "shush", "shut up" | Device pauses
skip next | "skip", "skip this", "next", "skip forward", "next song" | Device skips to the next track
skip previous | "skip previous", "go back", "previous", "previous song" | Device skips to the previous track
volume up | "louder", "volume up", "increase volume", "turn it up" | Device increases volume
volume down | "quieter", "volume down", "decrease volume", "turn it down" | Device decreases volume
volume quiet | "hush", "quiet", "make it quiet" | Device sets volume to specified value
volume loud | "loud", "play loud", "make it loud" | Device increases volume by specified relative value
shuffle | "shuffle", "random", "unshuffle", "turn off random" | Indicates that the client should activate or deactivate 'shuffle' mode
repeat | "repeat", "repeat this song", "stop repeating" | Indicates that the client should activate or deactivate 'repeat' mode

Voice informational features

The following examples illustrate some of the informational features enabled through Audiogum voice integration. As above, this is not an exhaustive list and Audiogum is continuously adding voice control features. New features will be automatically available to existing clients, subject to the capabilities declared by the device and potential licensing costs with 3rd party providers.

These informational responses use the nowplaying information sent by the device alongside the voice request. For the artist and album informational intents, specific named items can also be used.

intent | example phrases | description
get now playing info | "what's this?", "what is playing?", "what song is this?", "who is this?", "what artist is this?" | Returns voice response with info about the currently playing content. E.g. "This is Wonderwall by Oasis".
get artist info | "tell me about this artist", "can you tell me about this band", "tell me about Katy Perry" | Returns voice response with info about the currently playing artist or a specific artist. E.g. "Katheryn Elizabeth Hudson, known professionally as Katy Perry, is an American singer, songwriter, actress, and television personality."
get artist timespan | "when did this artist start?", "when did this band break up?", "when did David Bowie die?" | Returns voice response with info about when the currently playing artist or a specific artist started, broke up or died. E.g. "David Bowie died on 10th January 2016".
get performance info | "who performs on this?", "who played bass?", "who sung backing vocals?" | Where available, returns voice response with info about who performed the specified role(s) on the song in the now playing data reported by the device. E.g. "Bass was performed by Billy Gould".
get album info | "tell me about this album", "tell me about the album Simulation Theory by Muse" | Returns voice response with info about the current album or a specific album. E.g. "Simulation Theory is the eighth studio album by English rock band Muse. It was released on 9 November 2018 through Warner Bros. Records and Helium-3."
get concert info | "when are this band playing near me?" | Where available, returns voice response with info about upcoming concerts for the currently playing artist or a specific artist. E.g. "Elton John is playing at Royal Arena in Copenhagen on May 18th 2019".
get chart info | "what was number one on 3rd May 2016?", "what was number 1 this time last year?" | Where available, returns voice response with info about the chart. E.g. "One Dance by Drake featuring Wizkid and Kyla was number one on 3rd May 2016".
help | "what can I say?", "help" | Returns voice response with example phrases to try, based on the user's taste profile if available, or popular artists for the country if not. E.g. "I understand artists and genres. What about asking me to Play David Bowie."

Voice playback features

The following examples illustrate some of the dynamic playlisting and other forms of playback enabled through Audiogum voice integration. Again, this is not an exhaustive list and Audiogum is continuously adding voice control features. New features will be automatically available to existing clients, subject to the capabilities declared by the device and potential licensing costs with 3rd party providers.

intent | example phrases | description
create dynamic playlist | "play me something", "play something I'll like", "play something else", "please play 90s music", "play a mix", "play some fast music", "make a slow playlist", "play me Oasis", "play me the top ten from 1st September 1997", "play music from Bristol", "Play popular music" | Service generates a dynamic playlist with parameters based on recognised entities such as artists and genres, if any, or alternatively a personalised playlist based on the user's taste. Response includes the playable to play and a voice response, e.g. "OK, let me find something you'll like" or "Playing some Rock from the 80s". Device starts playback of the specified playable after playing the response. Supported entities include genres, artists, eras, years, tempo (fast, slow), locations, dates (for charts) and more.
stick to current [artist, genre, era] | "stick to this artist", "only this genre", "keep with this genre", "stick with this era", "more Foo Fighters", "stick to Celine Dion", "keep to Charlotte Gainsbourg" | Service adjusts or creates a new dynamic playlist based on what is currently playing and what is asked for. Response includes the new playable id and a voice response, e.g. "OK, let's stick with Oasis". Device replaces its cache of subsequent songs with those from the specified playable without interrupting the current song, so that the next track is from the new or updated playable. (Note: the refreshplayable action means the device should not interrupt the current song.)
modify playlist | "make it faster" | As above, but the change is immediate. Response includes the new playable and a voice response, e.g. "OK, speeding it up". Device starts playback of the specified playable and plays the audio response simultaneously.
play playlist | "play 70s Punk Classics" | Play a specific playlist from a music service.
play radio | "play BBC Radio 6 Music" | Play a specific internet radio station.
play album | "play Nevermind", "play OK Computer by Radiohead", "Play the new album by Rihanna" | Play a specific album.
play song | "play Hey Jude", "play Teardrop by Massive Attack" | Generate and play a dynamic playlist starting with a specific recognised song.
play song by lyric | "play the song that goes on a dark desert highway" | Generate and play a dynamic playlist starting with a song found by lyric search.
like | "I love this", "I like this" | Adds taste information to the user's profile based on what is currently playing.
dislike | "I hate this", "I don't like this" | Adds taste information to the user's profile based on what is currently playing and, where possible, generates and plays something else.
play playlist | "play my driving playlist" | Plays a specified playlist from the user's linked music service.
set preset | "set this as preset 2" | Stores whatever is currently playing as a preset.
play preset | "preset three", "switch to first preset", "start preset channel two", "play favourite 3" | Device plays the specified preset channel.