Google announced Project Euphonia at I/O in May as an attempt to make speech recognition capable of understanding people whose voices differ from standard speech, including people with speech impediments. The company has now published a paper explaining some of the AI work behind the new capability.
Speech recognition systems often struggle with the voices of people with motor impairments, such as those caused by degenerative diseases like amyotrophic lateral sclerosis (ALS).
Google research scientist Dimitri Kanevsky, who himself has impaired speech, attempted to interact with one of the company’s own products with the help of Parrotron, a related project.
The research team describes the problem as follows:
ASR [automatic speech recognition] systems are most often trained from ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility.
…Current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR reliant technologies.
For the study, Google researchers collected hours of spoken audio from people with ALS. Each person is affected differently by the condition, so accommodating the effects of the disease is not the same as accommodating other speech differences such as an unusual accent.
A standard voice-recognition model was used as a base, then modified in a few experimental ways and trained on the new audio. This substantially reduced the word error rate. The result showed that the technology can be improved with relatively little change to the original model and without heavy computation when adjusting to a new voice.
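For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the researchers' evaluation code; the function name `word_error_rate` and the example sentences are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: lowercases and splits on whitespace, with no
    further text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: two of five words are misrecognized.
wer = word_error_rate("turn on the kitchen lights",
                      "turn of the chicken lights")
```

A lower value means fewer mistakes per spoken word, so a fine-tuned model that reduces this number is understanding the speaker more reliably.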
The researchers also found that when the model is still confused by a given phoneme (an individual speech sound like an “e” or “f”), it makes two kinds of errors. First, it may fail to recognize the phoneme as the one that was intended, and therefore fail to recognize the word itself. Second, the model has to guess which phoneme the speaker intended and might choose the wrong one when two or more words sound relatively similar.
The second kind of error is the more tractable one. The AI system may be able to use what it already knows about human language, the speaker’s own voice, or the context in which they are speaking to fill in the gaps intelligently.
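One common way to use context for this kind of disambiguation is to combine the recognizer's confidence in each candidate word with an estimate of how likely that word is to follow the previous one. The toy sketch below illustrates the idea only; it is not the method described in the paper. The `FOLLOW_COUNTS` table, the `pick_word` function, and all of the numbers are invented for this example.

```python
import math

# Invented counts of how often a word follows a given previous word.
# A real system would use a trained language model instead.
FOLLOW_COUNTS = {
    "ceiling": {"fan": 50},
    "delivery": {"van": 30, "fan": 1},
}


def pick_word(previous_word, acoustic_probs, context_weight=1.0):
    """Choose among similar-sounding candidates using context.

    acoustic_probs maps each candidate word to the recognizer's
    (hypothetical) probability for it; every value must be > 0.
    """
    counts = FOLLOW_COUNTS.get(previous_word, {})
    # Add-one smoothing so unseen candidates keep a nonzero probability.
    total = sum(counts.get(w, 0) for w in acoustic_probs) + len(acoustic_probs)

    def score(word):
        context_prob = (counts.get(word, 0) + 1) / total
        return (math.log(acoustic_probs[word])
                + context_weight * math.log(context_prob))

    return max(acoustic_probs, key=score)


# The audio alone slightly favors "van", but the context can override it.
candidates = {"fan": 0.45, "van": 0.55}
after_ceiling = pick_word("ceiling", candidates)    # context favors "fan"
after_delivery = pick_word("delivery", candidates)  # context favors "van"
```

When the previous word is not in the table, both candidates get the same context probability and the choice falls back to the acoustic score alone.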