Faster Transcriptions, Lower Costs and Increased Accuracy Using Google Cloud Speech-to-Text API

By SADA Says | Cloud Computing Blog


Following its 2018 overhaul, Google Cloud’s Speech-to-Text API quickly became a flexible, cost-effective and business-friendly transcription option. The billion-dollar audio transcription industry, heavily relied upon by media and entertainment organizations, revolves around a tradeoff among cost, turnaround time, and accuracy, and new AI-based services have disrupted that tradeoff. The going rate for human-provided transcription services falls around $1 per minute of audio, with turnaround times between an hour and a few days.

But what if you just completed an interview that is supposed to hit the press tomorrow morning? Or, what if you are processing hours of lecture material for your university’s online learning platform, and $1 per minute adds up fast? Regardless of your transcription workflow, your organization may be feeling pressure from at least one of these three factors (cost, turnaround time, and accuracy), and Google’s highly accurate, affordable Speech-to-Text API could be your solution.

Let’s briefly explore the benefits of Google’s API.

Google’s Speech-to-Text API Hits the Cost-to-Accuracy Sweet Spot

Google’s offering comes in at two price points: the outrageously cheap standard Speech Recognition at $0.006 per 15 seconds, and the premium “Video” Speech Recognition (intended for higher-quality audio with many speakers and crosstalk) at $0.012 per 15 seconds. In accuracy comparisons, Google’s Video Speech Recognition output achieves an error rate similar to that of Temi, Rev’s comparable ASR (automatic speech recognition) service. Yet even Google’s premium Video rate comes in at half of Temi’s price.
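
To make the price gap concrete, here is a minimal Python sketch of the arithmetic using only the figures quoted in this article. The function names are hypothetical, rounding each file up to a whole 15-second increment is an assumption about how usage is billed, and the free tier is ignored.

```python
from decimal import Decimal

# Prices quoted in this article, in USD per 15-second increment of audio.
PRICE_PER_15S = {
    "default": Decimal("0.006"),
    "video": Decimal("0.012"),
}
# Approximate going rate for human transcription cited in this article.
HUMAN_RATE_PER_MINUTE = Decimal("1.00")


def api_cost(duration_seconds: int, tier: str = "default") -> Decimal:
    """Estimate API cost for one file (assumes rounding up to 15-second blocks)."""
    increments = -(-duration_seconds // 15)  # ceiling division
    return increments * PRICE_PER_15S[tier]


def human_cost(duration_seconds: int) -> Decimal:
    """Estimate human transcription cost at the per-minute rate above."""
    return HUMAN_RATE_PER_MINUTE * Decimal(duration_seconds) / 60


one_hour = 60 * 60
print("default:", api_cost(one_hour))          # 240 increments at $0.006
print("video:  ", api_cost(one_hour, "video"))  # 240 increments at $0.012
print("human:  ", human_cost(one_hour))
```

Under these assumptions, an hour of audio works out to $1.44 on the standard tier and $2.88 on the Video tier, against roughly $60 for human transcription.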

The low cost even permits companies that didn’t previously transcribe their video content to get more value from it, a practice that can be combined with the Google Cloud Video Intelligence API to create more valuable content from existing assets.

Turnaround Time: Seconds vs. Hours

While most human-powered transcriptions are returned with extremely high accuracy, longer transcriptions will require 12+ hours or even days. For media companies, this turnaround time can result in a missed window of opportunity for relevance.

With Google’s long audio file service, users can expect a turnaround time between 30 seconds and a few minutes (again, at a fraction of the cost of human transcriptions). Businesses can pair these quick results with a human review for accuracy and editing of any words or phrases that need changing. With this minimal effort, and at low relative cost, you’ll have a robust, accurate transcription much faster than a manual alternative.
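
As a rough illustration of what a long audio request looks like, the sketch below builds the JSON body for the v1 REST method `speech:longrunningrecognize`, which expects the audio to be referenced by a Cloud Storage URI. The helper name and the bucket path are hypothetical, and the sketch stops short of authenticating and sending the request.

```python
import json

# REST endpoint for asynchronous (long audio) recognition in the v1 API.
LONG_RUNNING_URL = "https://speech.googleapis.com/v1/speech:longrunningrecognize"


def build_long_audio_request(gcs_uri: str, language_code: str = "en-US") -> dict:
    """Return the JSON body for a long-running recognition request."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("long audio should be referenced by a gs:// URI")
    return {
        "config": {
            "languageCode": language_code,
            # Adds punctuation to the transcript, which eases human review.
            "enableAutomaticPunctuation": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and object name, for illustration only.
body = build_long_audio_request("gs://example-bucket/interview.flac")
print("POST", LONG_RUNNING_URL)
print(json.dumps(body, indent=2))
```

The service answers a request like this with an operation name rather than a transcript; the caller polls that operation until the result is ready, which is where the turnaround time described above comes from.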

Google’s AI-Powered Models are Increasingly Context Aware

Google knows that not all Speech-to-Text ML models should be trained equally. With its separate video, phone call, voice command, and default pre-built models, the Google Speech-to-Text API allows users to route each request to a model specifically trained on that audio source. This way, the background noise and static typical of low-sample-rate phone calls are handled by their own model rather than mixed with the high-quality audio typical of high-sample-rate videos. Short voice commands with a myriad of background noises receive the same custom treatment.
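
Routing a request to one of these models comes down to a single `model` field in the recognition config. The sketch below shows one way to set it; the source labels (`"phone"`, `"voice_command"`) and the helper name are hypothetical, while the model identifiers are the ones the API accepts.

```python
# Maps an application-level source label (hypothetical) to an API model name.
MODEL_FOR_SOURCE = {
    "video": "video",
    "phone": "phone_call",
    "voice_command": "command_and_search",
}


def recognition_config(source: str, sample_rate_hertz: int,
                       language_code: str = "en-US") -> dict:
    """Build a recognition config that targets the model for this audio source."""
    return {
        "languageCode": language_code,
        "sampleRateHertz": sample_rate_hertz,
        # Fall back to the general-purpose model for unknown sources.
        "model": MODEL_FOR_SOURCE.get(source, "default"),
    }


print(recognition_config("phone", 8000))
print(recognition_config("video", 44100))
print(recognition_config("podcast", 16000))
```

A dictionary like this can be used as the `config` portion of a recognition request, so one code path can serve call-center recordings, video soundtracks, and short voice commands.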

To further support this feature going forward, Google also permits developers to tag submitted audio with a description, such as “boardroom meeting minutes”, to create a wider array of categories for future model training and improved accuracy.

Get What You Give With Google’s Data Logging Opt-In

Unlike Amazon’s voice recognition services, Google does not repurpose all recorded audio for ML training. Instead, users are permitted to “Opt-In to Data Logging”, which ensures control over the third-party use of private audio data. For developers who opt in to data logging, Google makes available an enhanced Speech Recognition model built upon opt-in data. This feature balances the option for privacy with the desire for increased accuracy by letting developers choose an option appropriate to the sensitivity of the data in their workflow.

Curious to Give it a Shot? Try the 60-Minute Free Tier

Google’s Speech-to-Text documentation and Qwiklabs course make audio transcription extremely approachable, especially when coupled with a free 60-minute tier to test the service. Once underway with Google Cloud’s language APIs, companies can also explore the related GCP offering Dialogflow, which builds on both the Speech-to-Text and Text-to-Speech APIs. To learn where your organization can maximize value and minimize costs when transcribing audio content, contact SADA today and find out how to integrate Speech-to-Text into your larger workflow at scale.

