Google is taking sign-ups for Relate, a voice assistant that recognizes impaired speech

Google launched a beta app today that people with speech impairments can use as a voice assistant while contributing to a multiyear research effort to improve Google’s speech recognition. The goal is to make Google Assistant, as well as other features that use speech to text and speech to speech, more inclusive of users with neurological conditions that affect their speech.

The new app is called Project Relate, and volunteers can sign up at g.co/ProjectRelate. To be eligible to participate, volunteers need to be 18 or older and “have difficulty being understood by others.” They’ll also need a Google account and an Android phone running Android 8 or later. For now, the app is only available to English speakers in the US, Canada, Australia, and New Zealand. Participants will be asked to record 500 phrases, which should take between 30 and 90 minutes.

After sharing their voice samples, volunteers will get access to three new features in the Relate app. It can transcribe their speech in real time. It also has a feature called “Repeat” that restates what the user said in “a clear, synthesized voice.” That can help people with speech impairments in conversations or when giving voice commands to home assistant devices. The Relate app also connects to Google Assistant, letting users turn on the lights or play a song with their voice.

Without enough training data, other Google apps like Translate and Assistant haven’t been very accessible for people with conditions like ALS, traumatic brain injury (TBI), or Parkinson’s disease. In 2019, Google started Project Euphonia, a broad effort to improve its AI algorithms by collecting data from people with impaired speech. Google is also training its algorithms to recognize sounds and gestures so that it can better help people who cannot speak. That work is still ongoing; Google and its partners still appear to be collecting patients’ voices separately for Project Euphonia.

“I’m used to the look on people’s faces when they can’t understand what I’ve said,” Aubrie Lee, a brand manager at Google whose speech is affected by muscular dystrophy, said in a blog post today. “Project Relate can make the difference between a look of confusion and a friendly laugh of recognition.”

Deepgram launches $10M speech recognition startup program

Deepgram, a startup developing voice recognition models for the enterprise, today announced the Deepgram Startup Program to provide select companies with custom-trained speech models. Alongside this news, Deepgram has added a range of features to its platform, including a revamped developer console, software development kits, and API documentation.

“The pandemic accelerated the need for digital experiences, and voice technology emerged as a must-have for companies looking to stay connected to both employees and customers,” CEO and cofounder Scott Stephenson told VentureBeat via email. “With the Deepgram Startup Program, we’re arming the next generation of startups and developers with the tools they need to excel in the speech space.”

Deepgram Startup Program

According to Stephenson, the Deepgram Startup Program is designed to help entrepreneurs and developers “harness the power of speech recognition quickly and easily at no cost.” As a part of the program, Deepgram will offer $10 million in free credits, with a specific focus on startups in education and employee experiences.

While it’s smaller in scale than, say, Amazon’s Alexa Accelerator, companies participating in the Deepgram Startup Program will not need to pay a fee or give up equity. Moreover, recipients will be able to use the funds in conjunction with any existing grant, seed, incubator, and accelerator benefits.

“We firmly believe that speech is the next programmable interface, and with these updates we’re excited to see the opportunities speech recognition will continue to create,” Stephenson continued.

New features

Among the capabilities Deepgram released to coincide with the launch of the Deepgram Startup Program, the highlight is an enhanced developer console. One of the key features is Missions, which provides users with a learning path for getting started with Deepgram. The console also aims to simplify usage and billing by offering promotional credits and automated re-enrollment, as well as account management.

As for the software development kits, they let developers transcribe both real-time streaming and prerecorded audio, with client libraries, documentation, and code samples for Python and Node.js.
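
For a sense of what the prerecorded path looks like in practice, here is a minimal Python sketch that posts a hosted audio file to Deepgram’s speech-to-text REST endpoint. The endpoint and token-based authentication reflect Deepgram’s public API, but the placeholder key, audio URL, and the specific option shown are assumptions to check against the current documentation.

    # Minimal sketch: transcribe a prerecorded, hosted audio file with Deepgram's
    # REST API. The API key and audio URL below are placeholders.
    import requests

    DEEPGRAM_API_KEY = "YOUR_API_KEY"                      # placeholder
    AUDIO_URL = "https://example.com/sample-call.wav"      # placeholder hosted file

    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"punctuate": "true"},                      # assumed option; see the API docs
        headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
        json={"url": AUDIO_URL},
    )
    response.raise_for_status()

    result = response.json()
    # The transcript typically sits under results -> channels -> alternatives.
    print(result["results"]["channels"][0]["alternatives"][0]["transcript"])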

The Deepgram Startup Program and new features come after Deepgram raised $25 million in capital from Tiger Global and other investors. The San Francisco, California-based Y Combinator graduate, which was founded in 2015 by Noah Shutty and Stephenson, claims to have processed more than 100 billion words to date for its over 60 customers.

Deepgram is far from the only player in a voice and speech recognition tech market anticipated to be worth $31.82 billion by 2025. Tech giants like Nuance, Cisco, Google, Microsoft, and Amazon offer real-time voice transcription and captioning services, as do startups like Otter. There’s also Verbit, which recently raised $31 million for its human-in-the-loop AI transcription tech; Oto Systems, which in December 2019 snagged $5.3 million to improve speech recognition with intonation data; and Voicera, which has raked in tens of millions for AI that draws insights from meeting notes.

Microsoft buys AI speech tech company Nuance for $19.7 billion

Microsoft is buying AI speech tech firm Nuance for $19.7 billion, bolstering the Redmond, Washington-based tech giant’s prowess in voice recognition and giving it further leverage in the health care market, where Nuance sells many products. Microsoft will pay $56 per share for Nuance, a 23 percent premium over the company’s closing price last Friday. The deal includes Nuance’s net debt.

Nuance is best known for its Dragon software, which uses deep learning to transcribe speech and improves its accuracy over time by adapting to a user’s voice. Nuance has licensed this tech for many services and applications, including, most famously, Apple’s digital assistant Siri. (Though to what degree Siri currently relies on Dragon to answer users’ queries is unclear.) Dragon is an industry leader in terms of transcription accuracy.

The $19.7 billion acquisition of Nuance is Microsoft’s second-largest behind its purchase of LinkedIn in 2016 for $26 billion. It comes at a time when speech tech is improving rapidly, thanks to the deep learning boom in AI, and there are simultaneously more opportunities for its use.

Digital transcription has become more reliable in a range of settings, from medical consultations to board meetings and university lectures. The uptick in remote work has also created new opportunities. With so many meetings taking place via video, it is easier to offer customers transcriptions via software integrated directly into these calls. Zoom, for example, offers automatic transcription via integration with third-party services like Otter.

For Microsoft, which makes roughly two-thirds of its revenue from enterprise software sales and cloud computing, improving its transcription services for scenarios like these makes complete sense. The company could integrate Nuance’s technology into its existing software, like Teams, or offer it independently as part of its Azure cloud business.

The immediate focus, though, will be on health care, where the two companies have already worked together. In 2019, they announced a “strategic partnership” to use Nuance’s software to digitize health records for Microsoft’s clients. Nuance’s health tech, including its Dragon Medical One platform, which is tuned to identify medical terminology, is reportedly used by more than half a million physicians worldwide and in 77 percent of US hospitals.

“By augmenting the Microsoft Cloud for Healthcare with Nuance’s solutions, as well as the benefit of Nuance’s expertise and relationships with EHR systems providers, Microsoft will be better able to empower healthcare providers through the power of ambient clinical intelligence and other Microsoft cloud services,” said Microsoft in a blog post.

“Nuance provides the AI layer at the healthcare point of delivery and is a pioneer in the real-world application of enterprise AI,” said Microsoft CEO Satya Nadella in a statement. “AI is technology’s most important priority, and healthcare is its most urgent application. Together, with our partner ecosystem, we will put advanced AI solutions into the hands of professionals everywhere to drive better decision-making and create more meaningful connections, as we accelerate growth of Microsoft Cloud in Healthcare and Nuance.”

News of the Nuance acquisition was first reported over the weekend by Bloomberg, and it’s the latest example of Microsoft’s purchasing spree. Last month, the company completed its $7.5 billion acquisition of games company ZeniMax. Last year, it was considering an acquisition of social video app TikTok, and the company is also reportedly in “exclusive talks” to purchase communications app Discord.

Facebook Wav2vec-U learns to recognize speech from unlabeled data

Facebook today announced that it trained an AI model to build speech recognition systems that don’t require transcribed data. The company, which trained systems for Swahili, Tatar, Kyrgyz, and other languages, claims that the model, wav2vec Unsupervised (Wav2vec-U), is an important step toward building machines that can solve a range of tasks by learning from their observations.

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But the dominant form of AI for speech recognition falls into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data and predict outcomes, which, while effective, is time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And this same process has to be repeated for each language.

Unsupervised speech recognition

Facebook’s Wav2vec-U sidesteps the challenges of supervised learning by taking a self-supervised (also known as unsupervised) approach. With unsupervised learning, Wav2vec-U is fed “unknown” data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook itself announced SEER, an unsupervised model trained on a billion images that achieves state-of-the-art results on a range of computer vision benchmarks.

Wav2vec-U learns purely from recorded speech and text, eliminating the need for transcriptions. Using self-supervised representations from Facebook’s wav2vec 2.0 framework together with a clustering method, Wav2vec-U segments recordings into units that loosely correspond to particular sounds.

To learn to recognize words in a recording, Facebook trained a generative adversarial network (GAN) consisting of a generator and a discriminator. The generator takes audio segments and predicts a phoneme (i.e., a unit of sound) corresponding to a sound in the language. It’s trained by trying to fool the discriminator, which assesses whether the predicted sequences seem realistic. The discriminator, for its part, learns to distinguish the generator’s output from real text that has been “phonemized,” or converted into phoneme sequences.

While the GAN’s transcriptions are initially poor in quality, they improve with the feedback of the discriminator.
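
The paper’s full architecture is more involved, but the adversarial setup described above can be sketched in a few lines of PyTorch: a generator maps per-segment audio features to phoneme distributions, and a discriminator scores whether a phoneme sequence looks like phonemized real text. The layer sizes, feature dimensions, and training details below are illustrative assumptions, not Facebook’s implementation.

    # Illustrative GAN sketch (not Facebook's implementation): a generator maps
    # per-segment audio features to phoneme distributions, while a discriminator
    # tries to tell generated phoneme sequences apart from phonemized real text.
    import torch
    import torch.nn as nn

    FEAT_DIM, NUM_PHONEMES = 512, 40  # assumed sizes

    generator = nn.Sequential(        # audio segment features -> phoneme logits
        nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_PHONEMES)
    )
    discriminator = nn.Sequential(    # phoneme distribution -> "looks like real text?" score
        nn.Linear(NUM_PHONEMES, 128), nn.ReLU(), nn.Linear(128, 1)
    )
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(audio_feats, real_phonemes):
        """audio_feats: (batch, seq, FEAT_DIM) segment features;
        real_phonemes: (batch, seq, NUM_PHONEMES) one-hot phonemized text."""
        fake = torch.softmax(generator(audio_feats), dim=-1)
        real_label = torch.ones(*real_phonemes.shape[:2], 1)
        fake_label = torch.zeros(*fake.shape[:2], 1)

        # Discriminator step: real phonemized text -> 1, generator output -> 0.
        d_opt.zero_grad()
        d_loss = bce(discriminator(real_phonemes), real_label) + \
                 bce(discriminator(fake.detach()), fake_label)
        d_loss.backward()
        d_opt.step()

        # Generator step: try to make its output pass for real phonemized text.
        g_opt.zero_grad()
        g_loss = bce(discriminator(fake), torch.ones(*fake.shape[:2], 1))
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()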

“It takes about half a day — roughly 12 to 15 hours on a single GPU — to train an average Wav2vec-U model. This excludes self-supervised pre-training of the model, but we previously made these models publicly available for others to use,” Facebook AI research scientist manager Michael Auli told VentureBeat via email. “Half a day on a single GPU is not very much, and this makes the technology accessible to a wider audience to build speech technology for many more languages of the world.”

To get a sense of how well Wav2vec-U works in practice, Facebook says it first evaluated the model on the TIMIT benchmark. Trained on as little as 9.6 hours of speech and 3,000 sentences of text data, Wav2vec-U reduced the error rate by 63% compared with the next-best unsupervised method.

Wav2vec-U was also as accurate as the state-of-the-art supervised speech recognition method from only a few years ago, which was trained on hundreds of hours of speech data.

Future work

AI has a well-known bias problem, and unsupervised learning doesn’t eliminate the potential for bias in a system’s predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require specialized training of unsupervised models with additional, smaller datasets curated to “unteach” specific biases.

Facebook acknowledges that more research must be done to figure out the best way to address bias. “We have not yet investigated potential biases in the model. Our focus was on developing a method to remove the need for supervision,” Auli said. “A benefit of the self-supervised approach is that it may help avoid biases introduced through data labeling, but this is an important area that we are very interested in.”

In the meantime, Facebook is releasing the code for Wav2vec-U in open source to enable developers to build speech recognition systems using unlabeled speech audio recordings and unlabeled text. While Facebook didn’t use user data for the study, Auli says that there’s potential for the model to support future internal and external tools, like video transcription.

“AI technologies like speech recognition should not benefit only people who are fluent in one of the world’s most widely spoken languages. Reducing our dependence on annotated data is an important part of expanding access to these tools,” Facebook wrote in a blog post. “People learn many speech-related skills just by listening to others around them. This suggests that there is a better way to train speech recognition models, one that does not require large amounts of labeled data.”

Soniox taps unsupervised learning to build speech recognition systems

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But a new entrant launching out of beta this week claims its approach yields superior accuracy. Called Soniox, the company leverages vast amounts of unlabeled audio and text to teach its algorithms to recognize speech with accents, background noise, and far-field recordings. In practice, Soniox says its system correctly transcribes 24% more words than other speech-to-text systems, achieving “super-human” recognition on “most domains of human knowledge.”

Those are bold claims, but Soniox founder and CEO Klemen Simonic says the accuracy improvements arise from the platform’s unsupervised learning techniques. With unsupervised learning, an algorithm — in Soniox’s case, a speech recognition algorithm — is fed “unknown” data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.

Unsupervised speech

At the advent of the modern AI era, when people realized powerful hardware and datasets could yield strong predictive results, the dominant form of machine learning fell into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data, predict outcomes, and more.

Simonic, a former Facebook researcher and engineer who helped build the speech team at the social network, notes that supervised learning for speech-to-text is both time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And the same process has to be repeated for each language.

“Google and Facebook have more than 50,000 hours of transcribed audio. One has to invest millions — more like tens of millions — of dollars into collecting transcribed data,” Simonic told VentureBeat via email. “Only then can one train a speech recognition AI on the transcribed data.”

A technique known as semi-supervised learning offers a potential solution. It can accept partially labeled data, and Google recently used it to obtain state-of-the-art results in speech recognition. In the absence of labels, however, unsupervised learning — also known as self-supervised learning — is the only way to fill gaps in knowledge.

Above: Soniox founder and CEO Klemen Simonic.

According to Simonic, Soniox’s self-supervised learning pipeline sources audio and text from the internet. In the first iteration of training, the company used the Librispeech dataset, which contains 960 hours of transcribed audiobooks.
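
LibriSpeech is a public corpus, and loading it for experiments of this kind is straightforward. The snippet below is a minimal sketch using torchaudio’s bundled dataset loader, with the split name and local path as placeholders; it illustrates working with the corpus, not Soniox’s pipeline.

    # Minimal sketch: download and iterate over a LibriSpeech split with torchaudio.
    # The root path and split name are placeholders; this is not Soniox's pipeline.
    import torchaudio

    dataset = torchaudio.datasets.LIBRISPEECH(
        root="./data", url="train-clean-100", download=True
    )

    # Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
    waveform, sample_rate, transcript, *_ = dataset[0]
    print(sample_rate, waveform.shape, transcript[:60])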

Soniox’s iterative approach continuously refines the platform’s algorithms, enabling them to recognize more words as the system gains access to additional data. Currently, Soniox’s vocabulary spans names of people, places, and geographic terms, as well as domains including education, technology, engineering, medicine, health, law, science, art, history, food, sports, and more.

“To do fine-tuning of a particular model on a particular dataset, you would need an actual transcribed audio dataset. We do not require transcribed audio data to train our speech AI. We do not do fine-tuning,” Simonic said.

Dataset and infrastructure

Soniox claims to have a proprietary dataset containing over 88,000 hours of audio and 6.6 billion words of preprocessed text. By comparison, the latest speech recognition works from Facebook and Microsoft used between 13,100 and 65,000 hours of labeled and transcribed speech data. And Mozilla’s Common Voice, one of the largest public annotated voice corpora, has 9,000 hours of recordings.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook announced SEER, an unsupervised model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks.

Soniox collects more data on a weekly basis, with the goal of expanding the range of vocabulary the platform can transcribe. However, Simonic points out that more audio and text isn’t necessarily required to improve word accuracy. Soniox’s algorithms can “extract” more about familiar words with multiple iterations, essentially learning to recognize particular words better than before.

Above: Soniox’s cloud platform. Image Credit: Soniox

AI has a well-known bias problem, and unsupervised learning doesn’t eliminate the potential for bias in a system’s predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Simonic says Soniox has taken care to ensure its audio data is “extremely diverse,” with speakers from most countries and accents around the world represented. He admits that the data distribution across accents isn’t balanced but claims the system still manages to perform “extremely well” with different speakers.

Soniox also built its own training hardware infrastructure, which it runs on servers housed in a colocation data center facility. Simonic says the company’s engineering team installed and optimized the system and machine learning frameworks and wrote the inference engine from scratch.

“It is utterly important to have under control every single bit of transfer and computation when you are training AI models at large scale. You need a rather large amount of computation to do just one iteration over a dataset of more than 88,000 hours,” Simonic said. “[The inferencing engine] is highly optimized and can potentially run on any hardware. This is super important for production deployment because speech recognition is computationally expensive to run compared to most other AI models and saving every bit of computation on a large volume amounts to large sums in savings — think of millions of hours of audio and video per month.”

Scaling up

After launching in beta earlier this year, Soniox is making its platform generally available. New users get five hours per month of free speech recognition, which can be used in Soniox’s web or iOS app to record live audio from a microphone or upload and transcribe files. Soniox offers an unlimited number of free recognition sessions for up to 30 seconds per session, and developers can use the hours to transcribe audio through the Soniox API.

It’s early days, but Soniox says it recently signed its first customer in DeepScribe, a transcription startup targeting health care. DeepScribe switched from a Google speech-to-text model because Soniox’s transcriptions of doctor-patient conversations were more accurate, Simonic claims.

“To make a business, developing novel technology is not enough. Thus we developed services and products around our new speech recognition technology,” Simonic said. “I expect there will be a lot more customers like DeepScribe once the word about Soniox gets around.”

Speech recognition system trains on radio archive to learn Niger Congo languages

For many of the 700 million illiterate people around the world, speech recognition technology could provide a bridge to valuable information. Yet in many countries, these people tend to speak only languages for which the datasets necessary to train a speech recognition model are scarce. This data deficit persists for several reasons, chief among them the fact that creating products for languages spoken by smaller populations can be less profitable.

Nonprofit efforts are underway to close the gap, including 1000 Words in 1000 Languages, Mozilla’s Common Voice, and the Masakhane project, which seeks to translate African languages using neural machine translation. But this week, researchers at Guinea-based tech accelerator GNCode and Stanford detailed a new initiative that uniquely advocates using radio archives in developing speech systems for “low-resource” languages, particularly Maninka, Pular, and Susu in the Niger Congo family.

“People who speak Niger Congo languages have among the lowest literacy rates in the world, and illiteracy rates are especially pronounced for women,” the coauthors note. “Maninka, Pular, and Susu are spoken by a combined 10 million people, primarily in seven African countries, including six where the majority of the adult population is illiterate.”

The idea behind the new initiative is to make use of unsupervised speech representation learning, demonstrating that representations learned from radio programs can be leveraged for speech recognition. Where labeled datasets don’t exist, unsupervised learning can help to fill in domain knowledge by determining the correlations between data points and then training based on the newly applied data labels.

New datasets

The researchers created two datasets, West African Speech Recognition Corpus and the West African Radio Corpus, intended for applications targeting West African languages. The West African Speech Recognition Corpus contains over 10,000 hours of recorded speech in French, Maninka, Susu, and Pular from roughly 49 speakers, including Guinean first names and voice commands like “update that,” “delete that,” “yes,” and “no.” As for the West African Radio Corpus, it consists of 17,000 audio clips sampled from archives collected from six Guinean radio stations. The broadcasts in the West African Radio Corpus span news and shows in languages including French, Guerze, Koniaka, Kissi, Kono, Maninka, Mano, Pular, Susu, and Toma.

To create a speech recognition system, the researchers tapped Facebook’s wav2vec, an open source framework for unsupervised speech processing. Wav2vec uses an encoder module that takes raw audio and outputs speech representations, which are fed into a Transformer that ensures the representations capture whole-audio-sequence information. Created by Google researchers in 2017, the Transformer network architecture was initially intended as a way to improve machine translation. To this end, it uses attention functions instead of a recurrent neural network to predict what comes next in a sequence.
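
Pretrained wav2vec-style encoders are publicly available, and extracting the kind of speech representations described above takes only a few lines. The sketch below uses torchaudio’s wav2vec 2.0 bundle as an accessible stand-in for the encoder the researchers built on; the audio file path is a placeholder, and the study’s actual pipeline differs.

    # Sketch: extract self-supervised speech representations with a pretrained
    # wav2vec 2.0 model from torchaudio, a stand-in for the encoder described above.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE            # pretrained encoder, no ASR head
    model = bundle.get_model().eval()

    waveform, sample_rate = torchaudio.load("clip.wav")    # placeholder audio file
    if sample_rate != bundle.sample_rate:                  # the model expects 16 kHz audio
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

    with torch.inference_mode():
        features, _ = model.extract_features(waveform)     # one tensor per transformer layer

    print(len(features), features[-1].shape)               # (batch, frames, hidden_dim)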

Above: The accuracies of WAwav2vec.

Despite the fact that the radio dataset includes phone calls as well as background and foreground music, static, and interference, the researchers managed to train a wav2vec model with the West African Radio Corpus, which they call WAwav2vec. In one experiment with speech across French, Maninka, Pular, and Susu, the coauthors say that they achieved multilingual speech recognition accuracy (88.01%) on par with Facebook’s baseline wav2vec model (88.79%) — despite the fact that the baseline model was trained on 960 hours of speech versus WAwav2vec’s 142 hours.

Virtual assistant

As a proof of concept, the researchers used WAwav2vec to create a prototype of a speech assistant. The assistant — which is available in open source along with the datasets — can recognize basic contact management commands (e.g., “search,” “add,” “update,” and “delete”) in addition to names and digits. As the coauthors note, smartphone access has exploded in the Global South, with an estimated 24.5 million smartphone owners in South Africa alone, according to Statista, making this sort of assistant likely to be useful.

“To the best of our knowledge, the multilingual speech recognition models we trained are the first-ever to recognize speech in Maninka, Pular, and Susu. We also showed how this model can power a voice interface for contact management,” the coauthors wrote. “Future work could expand its vocabulary to application domains such as microfinance, agriculture, or education. We also hope to expand its capabilities to more languages from the Niger-Congo family and beyond, so that literacy or ability to speak a foreign language are not prerequisites for accessing the benefits of technology. The abundance of radio data should make it straightforward to extend the encoder to other languages.”

Google researchers boost speech recognition accuracy with more datasets

What if the key to improving speech recognition accuracy is simply mixing all available speech datasets together to train one large AI model? That’s the hypothesis behind a recent study published by a team of researchers affiliated with Google Research and Google Brain. They claim an AI model named SpeechStew that was trained on a range of speech corpora achieves state-of-the-art or near-state-of-the-art results on a variety of speech recognition benchmarks.

Training models on more data tends to be difficult, as collecting and annotating new data is expensive — particularly in the speech domain. Moreover, training large models is expensive and impractical for many members of the AI community.

Dataset solution

In pursuit of a solution, the Google researchers combined all available labeled and unlabeled speech recognition data curated by the community over the years. They drew on AMI, a dataset containing about 100 hours of meeting recordings, as well as corpora that include Switchboard (approximately 2,000 hours of telephone calls), Broadcast News (50 hours of television news), Librispeech (960 hours of audiobooks), and Mozilla’s crowdsourced Common Voice. Their combined dataset had over 5,000 hours of speech — none of which was adjusted from its original form.
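
The mixing step itself is conceptually simple: the individual corpora are concatenated into one training pool and sampled together. The sketch below shows that pattern in PyTorch, with two LibriSpeech splits standing in for the corpora listed above; it is an illustration of the idea, not Google’s actual pipeline.

    # Illustration of dataset "mixing": concatenate several speech corpora into a
    # single training set and sample from the combined pool. The LibriSpeech splits
    # stand in for AMI, Switchboard, Common Voice, and the rest; paths are placeholders.
    import torchaudio
    from torch.utils.data import ConcatDataset, DataLoader

    corpora = [
        torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True),
        torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-360", download=True),
        # ...additional corpora would be wrapped as datasets and appended here.
    ]

    combined = ConcatDataset(corpora)
    loader = DataLoader(combined, batch_size=8, shuffle=True,
                        collate_fn=lambda batch: batch)   # naive collate for variable-length audio
    print(len(combined), "utterances in the combined training pool")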

With the assembled dataset, the researchers used Google Cloud TPUs to train SpeechStew, yielding a model with more than 100 million parameters. In machine learning, parameters are the properties of the data that the model learned during the training process. The researchers also trained a 1-billion-parameter model, but it suffered from degraded performance.

Once the team had a general-purpose SpeechStew model, they tested it on a number of benchmarks and found that it not only outperformed previously developed models but demonstrated an ability to adapt to challenging new tasks. Leveraging Chime-6, a 40-hour dataset of distant conversations in homes recorded by microphones, the researchers fine-tuned SpeechStew to achieve accuracy in line with a much more sophisticated model.

Transfer learning entails transferring knowledge from one domain to a different domain with less data, and it has shown promise in many subfields of AI. By taking a model like SpeechStew that’s designed to understand generic speech and refining it at the margins, it’s possible for AI to, for example, understand speech in different accents and environments.
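
In code, that kind of transfer typically means freezing most of a pretrained model and training a small task-specific head on the new data. The sketch below shows the general pattern, using a torchaudio wav2vec 2.0 encoder and a CTC head as stand-ins; the model, vocabulary size, and training details are assumptions rather than SpeechStew specifics.

    # Generic fine-tuning pattern (not SpeechStew itself): freeze a pretrained
    # speech encoder and train a small output head on a new, smaller dataset.
    import torch
    import torch.nn as nn
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE        # stand-in pretrained encoder
    encoder = bundle.get_model().eval()
    for p in encoder.parameters():                     # keep the general-purpose weights fixed
        p.requires_grad = False

    VOCAB_SIZE = 32                                    # assumed character vocabulary
    head = nn.Linear(768, VOCAB_SIZE)                  # 768 = hidden size of the base encoder
    optimizer = torch.optim.Adam(head.parameters(), lr=3e-4)
    ctc_loss = nn.CTCLoss(blank=0)

    def finetune_step(waveforms, targets, target_lengths):
        """One training step on the downstream task (e.g., a CHiME-6-style dataset)."""
        with torch.no_grad():
            feats, _ = encoder.extract_features(waveforms)
        logits = head(feats[-1])                               # (batch, frames, VOCAB_SIZE)
        log_probs = logits.log_softmax(-1).transpose(0, 1)     # CTC expects (frames, batch, vocab)
        input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
        loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()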

Future applications

When VentureBeat asked via email how speech models like SpeechStew might be used in production — like in consumer devices or cloud APIs — the researchers declined to speculate. But they envision the models serving as general-purpose representations that are transferrable to any number of downstream speech recognition tasks.

“This simple technique of fine-tuning a general-purpose model to new downstream speech recognition tasks is simple, practical, yet shockingly effective,” the researchers said. “It is important to realize that the distribution of other sources of data does not perfectly match the dataset of interest. But as long as there is some common representation needed to solve both tasks, we can hope to achieve improved results by combining both datasets.”

Microsoft shopping for speech tech, in talks to buy Nuance for $16B

(Reuters) — Microsoft is in advanced talks to buy artificial intelligence and speech technology company Nuance Communications for about $16 billion, according to a source familiar with the matter.

The price being discussed could value Nuance at about $56 a share, the source said, adding that an agreement could be announced as soon as Monday.

Bloomberg News, which first reported the potential deal between Nuance and Microsoft, said talks are ongoing and the sale could still fall apart.

Burlington, Massachusetts-based Nuance, whose voice recognition technology helped launch Apple assistant Siri, makes software for sectors ranging from health care to the automotive industry.

A deal with Nuance would be Microsoft’s second-biggest after its $26.2 billion acquisition of LinkedIn in 2016.

Microsoft and Nuance did not immediately respond to Reuters’ request for comment.

Study finds that even the best speech recognition systems exhibit bias

Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That’s the top-line finding of a new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and the Delft University of Technology, which found that an ASR system for the Dutch language recognized speakers of specific age groups, genders, and countries of origin better than others.

Speech recognition has come a long way since IBM’s Shoebox machine and Worlds of Wonder’s Julie doll. But despite progress made possible by AI, voice recognition systems today are at best imperfect — and at worst discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users. More recently, the Algorithmic Justice League’s Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve word error rates of 35% for African American voices versus 19% for white voices.
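
For context, word error rate is the edit distance between a system’s transcript and the reference transcript, divided by the number of words in the reference. A minimal Python implementation of the metric looks like this (the sample sentences are made up).

    # Word error rate (WER): word-level edit distance between hypothesis and
    # reference, divided by the number of reference words.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Two wrong words out of five gives a WER of 0.4 (40%).
    print(wer("turn on the kitchen lights", "turn of the kitchen light"))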

The coauthors of this latest research set out to investigate how well an ASR system for Dutch recognizes speech from different groups of speakers. In a series of experiments, they observed whether the ASR system could contend with diversity in speech along the dimensions of gender, age, and accent.

The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize the Dutch language. CGN contains recordings of people ranging in age from 18 to 65 years old from the Netherlands and the Flanders region of Belgium, covering speaking styles including broadcast news and telephone conversations.

CGN has a whopping 483 hours of speech spoken by 1,185 women and 1,678 men. But to make the system even more robust, the coauthors applied data augmentation techniques to increase the total hours of training data “ninefold.”
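
The paper’s exact augmentation recipe isn’t spelled out here, but a common way to multiply speech training data is to perturb what already exists, for example SpecAugment-style time and frequency masking on spectrograms. The snippet below illustrates that general idea with torchaudio; it is an assumption about typical practice, not the study’s actual method.

    # Illustration only (not the study's actual recipe): create extra training
    # examples by masking random time and frequency bands of a spectrogram.
    import torchaudio

    waveform, sample_rate = torchaudio.load("utterance.wav")      # placeholder file
    to_spec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
    freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
    time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

    spec = to_spec(waveform)
    augmented = [time_mask(freq_mask(spec)) for _ in range(8)]    # eight perturbed copies
    print(spec.shape, len(augmented) + 1, "spectrograms from one recording")  # ninefold total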

When the researchers ran the trained ASR system on a test set derived from CGN, they found that it recognized female speech more reliably than male speech, regardless of speaking style. The system also struggled more with speech from older people than from younger ones, potentially because older speakers articulated less clearly, and it had an easier time with native speakers than with non-native speakers. Indeed, the worst-recognized native speech — that of Dutch children — had a word error rate around 20% better than that of the best non-native age group.

In general, the results suggest that teenagers’ speech was most accurately interpreted by the system, followed by seniors’ (over the age of 65) and children’s. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.

As the researchers point out, while it’s impossible to entirely remove the bias that creeps into datasets, one solution is mitigating it at the algorithmic level.

“[We recommend] framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice [to address bias in ASR systems],” the researchers wrote in a paper detailing their work. “A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, and more provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR.”

Sonantic uses AI to infuse emotion in automated speech for game prototypes

Sonantic has figured out how to use AI to turn written words into spoken dialogue in a script, and it can infuse those words with the proper emotion.

And it turns out this is a pretty good way to prototype the audio storytelling in triple-A video games. That’s why Sonantic’s technology is being used by 200 video game companies for audio engineering.

The AI can provide true emotional depth to the words, conveying complex human emotions from fear and sadness to joy and surprise. The breakthrough advancement revolutionizes audio engineering capabilities for gaming and film studios, culminating in hyper-realistic, emotionally expressive and controllable artificial voices.

“Our first pilots were for triple-A companies, and then when we started building this,” said cofounder Zeena Qureshi in an interview with GamesBeat. “We went a lot more vertical and deeper into just working very closely with these types of partners. And what we found is the highest quality bar is for these studios. And so it’s really helped us bring our technology into a very great place.”

Building upon the existing framework of text-to-speech, London-based Sonantic’s approach is what differentiates a standard robotic voice from one that sounds genuinely human. Creating that “believability” factor is at the core of Sonantic’s voice platform, which captures the nuances of the human voice.

Obsidian Entertainment audio director Justin Bell said in a video that the tech will enable game companies such as his own to cut production timelines and costs with this new capability. Bell said that his team could send a script through Sonantic’s application programming interface (API) and then get something back that isn’t just robotic dialogue. It comes back as a real human conversation, and Bell said that could empower the team to tell a better story.

Above: Zeena Qureshi and John Flynn are the founders of Sonantic. Image Credit: Sonantic

“It’s just really useful hearing something back very early in the process,” Qureshi said.

You could simply use these scripts and the voices generated to populate dialogue into the non-player characters of a game. But the point of this isn’t to put voice actors out of work, Qureshi said. Rather, it gives a readable, reviewable script to the creators much earlier in the creative process so that they can listen to the dialogue and change it much earlier in the process if it clearly doesn’t sound right, she said.

In order to demonstrate its voice-on-demand technology, Sonantic has released a demo video highlighting its partnership with Obsidian, maker of The Outer Worlds and a subsidiary of Microsoft’s Xbox Game Studios. Others using Sonantic include Splash Damage and Sumo Digital.

Sonantic partners with experienced actors to create voice models. Clients can choose from existing voice models or work with Sonantic to build custom voices for unique characters. Project scripts are then uploaded to Sonantic’s platform, where a client’s audio team can choose from a variety of high-fidelity speech synthesis options including pitch, pacing, projection, and an array of emotions.

Above: Sonantic’s tool helps audio engineers make better games and films. Image Credit: Sonantic

Film and game studios are not the only beneficiaries of Sonantic’s platform. Actors can maximize both their time and talent by turning their voices into a scalable asset, as the Sonantic technology takes their voices and uses them to create different variations. Sonantic’s revenue share model empowers actors to generate passive income every time his or her voice model is used for a client’s project, spanning development, preproduction, production, and post-production.

“This technology isn’t made to replace actors,” Qureshi said. “What it actually helps with is at the very beginning of game development. Triple-A games can take up to 10 years to make. But they typically get in actors at the very early stages, because they’re constantly iterating. So they use text-to-speech that’s been an industry standard for the last few decades. But we’ve created a way that helps actors work virtually as well as in person. And it helps studios get voices into their game, highly realistic voices into their game from the very beginning to help them feel out the story arc, fill out the pacing, really understand what needs to change, so that their iteration cycles can continue to go really fast.”

Sonantic’s official launch follows last year’s beta release, which was captured in a video entitled Faith: The First AI That Can Cry.

The result is a streamlined production process. Teams won’t have to call back actors for reshoots or engage in re-edits of voices as much.

“Some of our studios have told us they save a week of time for their team every month,” Qureshi said.

An accelerator meeting

Qureshi met cofounder John Flynn in 2018. He had a great demo of the technology, and Qureshi had a background in speech and language therapy.

“When I heard his demo, I was like, ‘This is insane!’” Qureshi said. “It sounded better than any text-to-speech I’ve ever heard. And then he told me how he did it. And I thought, ‘This is exactly how I teach children.’”

Before that demo, all the text-to-speech systems Qureshi had heard flattened the delivery of the performance, so that it sounded robotic.

“The technology before didn’t capture the highs and lows of the voice,” Flynn said. “I changed it to make it work better by looking for those highs and lows and trying to get the algorithm to focus on that more.”

Qureshi added, “The devil is in the details with communication. There are so many different ways to say something. So when I’m teaching a child, I have to teach them emotions. I have to teach them how to enunciate very clearly, how to project their voice, really use their voice as an instrument, and control it.”

Flynn said that most of the work of the past few years is to get models to do the same as what Qureshi could do with kids.

“Last year, we had the AI that could cry, with emotion and sadness,” Flynn said. “It’s really about the nuances in speech, that quiver of the voice for sadness, an exertion for anger. We try and model those really deeply. Once you add in those details and layer them on top, you start to get energy and it becomes really realistic.”

Besides games, Sonantic’s technology works for film and TV production. The company has 12 employees, and it has raised $3.5 million to date from investors including AME Cloud Ventures, EQT Ventures, and Krafton Ventures.
