There are many ways to interact with apps. The keyboard has been, and still is, one of the most often-used devices for communicating with computers. More recently (in human years, not computer years), the mouse allowed us to move past text-based interfaces. And improving upon the mouse, touch has enabled more natural interactions. But when people communicate with each other in close proximity, we don't type what we are thinking or point to it; we use spoken language. So why should it be any different with a computer?
There are two sides to a spoken conversation: the speaker and the listener. The speaker generates sounds to express ideas, and the listener interprets those sounds to reconstruct the ideas. This might seem like stating the obvious, since people do this every day. But a computer doesn't, and can't, understand any of it unless these interactions are explained to it in great detail. This is why there are two speech APIs in Google Cloud.
First is the Text-to-Speech (or TTS) API. This service converts written (or typed) text into sounds resembling a human voice. It offers more than 200 voices across more than 40 languages. It supports Speech Synthesis Markup Language, or SSML, which lets you annotate written text with a set of "stage instructions" that customize the sounds for a more realistic effect. These include adding pauses and controlling the pronunciation of acronyms and abbreviations.
The other API is the complement, Speech-to-Text (or STT). If TTS is the speaker, then STT is the listener. The STT API can transcribe speech in more than 125 languages. And like the TTS API, it can be customized. A common use case is recognizing the jargon used in specific industries. The STT API can even transcribe streaming audio in real time.
Both the TTS and STT APIs offer client libraries for Python, Node.js, C#, Java, and other popular languages, and both expose a REST API. I'll be using the Python client libraries for this guide.
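Just to give a sense of the REST surface, here is roughly what a raw TTS request looks like with curl. This is a sketch based on the public REST reference; it assumes the gcloud CLI is installed and authenticated.
$ curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json" \
    --data '{"input": {"text": "Hello"}, "voice": {"languageCode": "en-US"}, "audioConfig": {"audioEncoding": "MP3"}}' \
    "https://texttospeech.googleapis.com/v1/text:synthesize"
The response is JSON containing a base64-encoded audioContent field. The rest of this guide sticks to Python.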
To get started with the API, you'll need to enable the Text-to-Speech API in a Cloud Console project. Then you'll need to download the credentials for a service account and store the path to the credentials file in an environment variable named GOOGLE_APPLICATION_CREDENTIALS. My previous guide, Computer Vision with Google Cloud Vision, walks through this process. Refer to the link for the details.
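For example, on Linux or macOS the variable might be set like this (the path is just a placeholder for wherever you saved the service account key):
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json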
To install the client library for Python, use pip.
$ pip install google-cloud-texttospeech
The simplest example begins by importing the texttospeech module.
from google.cloud import texttospeech
Next, create a new TextToSpeechClient.
tts_client = texttospeech.TextToSpeechClient()
The VoiceSelectionParams configure the generated voice.
params = texttospeech.VoiceSelectionParams(language_code='en-US')
The language_code keyword argument is required. There are currently 40 different language codes. Here the language code is set to United States English. You can also set the name of the voice to use one of the current 237 different voices, as shown in the sketch below. A voice is a combination of a language code and a gender. For example, the voice "en-GB-Standard-A" uses a language code of en-GB for British English and a gender of FEMALE (the enum value 2). Alternatively, you can specify a gender with the ssml_gender keyword argument.
params = texttospeech.VoiceSelectionParams(language_code='en-US', ssml_gender=texttospeech.SsmlVoiceGender.FEMALE)
The SsmlVoiceGender type is one of four values: SSML_VOICE_GENDER_UNSPECIFIED, MALE, FEMALE, or NEUTRAL.
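If you want to pick a specific voice by name instead, the client can enumerate what's available. Here's a minimal sketch, assuming the list_voices method and the name keyword argument of VoiceSelectionParams (both are in the client library, but verify against the version you have installed):
# List the voices available for British English.
voices = tts_client.list_voices(language_code='en-GB')
for voice in voices.voices:
    print(voice.name, voice.ssml_gender)

# Request a specific voice by name, such as the one mentioned above.
params = texttospeech.VoiceSelectionParams(language_code='en-GB', name='en-GB-Standard-A')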
The AudioConfig selects the format of the generated audio file.
audio = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)
Valid AudioEncoding values are MP3, LINEAR16, OGG_OPUS, and AUDIO_ENCODING_UNSPECIFIED.
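If you need uncompressed audio, you can swap in LINEAR16. As far as I know, the API returns LINEAR16 content with a WAV header, so it can be written straight to a .wav file:
# Request 16-bit linear PCM instead of MP3.
audio = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.LINEAR16)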
The text to speak is in a SynthesisInput.
si = texttospeech.SynthesisInput(text='Peter Piper picked a peck of pickled peppers.')
And finally, the synthesize_speech method will return a SynthesizeSpeechResponse containing the generated audio.
response = tts_client.synthesize_speech(input=si, voice=params, audio_config=audio)
The audio_content property of the response holds the audio data, which can be written to a file. Make sure to open the file in binary mode, wb.
with open('en_us_female.mp3', 'wb') as f:
    f.write(response.audio_content)
You can download the generated audio file (en_us_female.mp3) from GitHub. I've also generated a male voice (en_us_male.mp3) and another with a British accent (en_gb_male.mp3) by setting the language_code keyword argument to en-GB.
What happens if we generate an audio file for the text "My zip code is 20202"? Listen to the file (zip_code_no_ssml.mp3) on GitHub.
The API reads "20202" as "twenty thousand two hundred two". But that is not how we read zip codes. We speak each digit, as in "two oh two oh two". How can we make Google understand this?
The answer is SSML, or Speech Synthesis Markup Language. It is a set of tags used to mark up the text to be generated. Here is the SSML that tells Google to read out each digit of the zip code.
ssml = """
<speak>
  My zip code is
  <say-as interpret-as="characters">
    20202
  </say-as>
</speak>
"""
Telling Google to interpret the zip code as characters makes it read each digit. To have Google use SSML, go back to the SynthesisInput and change the text keyword argument to ssml.
si = texttospeech.SynthesisInput(ssml=ssml)
The generated audio file (zip_code_ssml.mp3) can be found on GitHub.
This is just one example of what SSML supports. For more details, consult the Text-to-Speech API documentation.
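For instance, remember the pauses and abbreviations mentioned at the start of this guide. The break tag inserts a pause, and the sub tag substitutes a spoken form for an abbreviation; here's a quick sketch of both:
ssml = """
<speak>
  Let me think about that.
  <break time="500ms"/>
  The <sub alias="World Wide Web">WWW</sub> changed everything.
</speak>
"""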
The setup for the STT API is similar to the TTS API. You'll need to enable the Speech-to-Text API for a Cloud Console project and store the path to the credentials file for a service account in an environment variable named GOOGLE_APPLICATION_CREDENTIALS. And finally, install the client library package with pip.
$ pip install google-cloud-speech
The audio file to be transcribed needs to be stored somewhere that the API can access it. A good choice is a bucket in Google Cloud Storage.
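If you have the gsutil command-line tool installed, copying a file into a bucket is a one-liner (the bucket name below is just a placeholder; substitute your own):
$ gsutil cp en_us_male.mp3 gs://your-bucket-name/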
First, import the speech package.
from google.cloud import speech_v1p1beta1 as gcp_speech
A SpeechClient handles all interaction with the API.
stt_client = gcp_speech.SpeechClient()
You must also tell the API the language being spoken in the audio file and the sample rate. I'm going to use one of the generated files; they have a sample rate of 24000 Hertz.
language = 'en-US'
sr = 24000
For MP3 files, the encoding must be given. The encoding is in the enums module.
from google.cloud.speech_v1p1beta1 import enums

MP3 = enums.RecognitionConfig.AudioEncoding.MP3
The recognize method will call the API.
response = stt_client.recognize(
    {
        'language_code': language,
        'sample_rate_hertz': sr,
        'encoding': MP3
    }, {
        'uri': 'gs://ps-guide-speech/en_us_male.mp3'
    }
)
The second dictionary tells the API the location of the audio file to transcribe. The response has a list of alternatives, each with a transcript and confidence score.
[transcript: "Peter Piper picked a peck of pickled peppers"
confidence: 0.9863739
]
This is the same text that was used to generate the audio file. And the confidence is quite high.
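The output above is the printed representation of the response. To pull out just the text in code, walk the results; a minimal sketch:
# Print the top transcript from each result in the response.
for result in response.results:
    print(result.alternatives[0].transcript)
And if you need the industry jargon recognition mentioned earlier, the config dictionary also accepts a speech_contexts key with phrase hints; see the RecognitionConfig documentation for details.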
Don't forget that Python is not the only language with client libraries. And it is always possible to call the APIs directly using any HTTP framework. If you combine the Google Cloud Text-to-Speech and Speech-to-Text APIs, you almost have enough to create a virtual assistant. The only thing remaining is to parse the text and extract meaning from it. The Google Cloud Natural Language API can provide that. The STT API lets you transcribe up to 60 minutes of audio every month for free. And you can generate audio for up to 4 million characters of text per month for free with the TTS API. Thanks for reading!