Research

Is it possible to have a natural conversation with AI? The potential of speech recognition technology

Research Highlight — Five-minute Research Digest

Katsunobu Itou, Professor

Department of Digital Media, Faculty of Computer and Information Sciences
Posted Sep. 21, 2023

You are researching speech recognition. Could you tell us about this technology?

Speech recognition is a technology that is finding applications in various fields, such as voice assistants like Siri on the iPhone, which many people now use regularly, and vehicle navigation systems. The history of research on speech recognition goes back quite far, having begun in the 1940s. There were times when research stagnated due to limits on the performance of computers. However, with increases in memory capacity and computing speed and the emergence of AI, research has made a great deal of progress since the 1990s. Siri represents the result of those many years of research. I believe Siri has been widely accepted because, despite a degree of imperfection in speech recognition performance, people find it interesting, and it excels at information retrieval tasks that are difficult for humans. New technology spreads widely when people accept and use it in this way.

What kind of research have you yourself been conducting?

To conduct speech recognition, both audio signals and language information must be processed. I have primarily specialized in the field of natural language processing. When research on natural language processing first started, researchers tried to work out how to make computers memorize grammar, but they came to learn that there were limits to that approach. A different method was therefore developed: creating a statistical language model from huge amounts of language data and using it to predict what comes next in a sentence. For example, it predicts the probability that the words following "rain" will be "is falling." ChatGPT, which has recently been in the limelight, can be considered a recent natural language processing model that makes use of this technology.
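To give readers a rough sense of what a statistical language model does, here is a minimal sketch in Python, not the professor's actual system: it estimates next-word probabilities from bigram counts over a tiny invented corpus. Real systems are trained on vastly larger data and use smoothing or neural networks.

from collections import Counter, defaultdict

# Tiny invented corpus standing in for the huge text collections described above.
corpus = [
    "rain is falling",
    "rain is coming",
    "snow is falling",
    "the sun is shining",
]

# Count how often each word follows each preceding word (bigram counts).
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    """Estimate P(nxt | prev), i.e. how likely nxt is to follow prev."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# In this toy data, "falling" is the most likely word after "is".
print(next_word_probability("is", "falling"))  # 0.5
print(next_word_probability("is", "shining"))  # 0.25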

How is your research advancing the field of speech recognition?

As I mentioned previously, speech recognition requires a huge amount of language data. Older computers didn’t have the memory capacity we needed, so they weren’t up to the task. In the 1990s, computer performance improved, and I started my research when the language data that had been accumulated in the United States was shared. However, there was no such shared data in Japan. Therefore, in cooperation with universities and corporations involved in speech recognition research, we constructed a statistical language model based on 10 years’ worth of newspaper data that we received from a newspaper publisher. From there, we made a speech recognition toolkit that could be used freely. It continued to be used until just a few years ago.

What kind of research are you conducting presently?

Right now, rather than language, I am focusing on research into sound. As an example, I collaborated with the Nogami Memorial Noh Theatre Research Institute of Hosei University on a digital analytical study of the musicality of Noh singing, and we reproduced it using Vocaloid synthesizer software. Noh singing uses unique melodies not found in Western music and may seem difficult to understand. However, it is our hope that this approach will lead to an objective understanding of the traditional art and serve as a foundation for furthering Noh research.

Going forward, how will speech recognition technology be developed?

Conventional speech recognition technology could not recognize the emotion contained in spoken language. However, if we accumulate data going forward, we can create speech recognition systems that can read, understand, and use emotion. Visual expression with computers, such as CG animation, is evolving, but we still need to rely on humans to deliver the voice lines. If we push the technology forward, synthesized voices will be able to express emotion like the speech of a real human voice actor or actress. And if that is put into practical use, people may be able to have natural conversations with an AI as they would with family or friends.

Realizing Future Dreams through Research

As with my past research into speech recognition, I have been taking on the challenge of researching unexplored fields. There are few researchers in areas such as the analytical study of Noh singing as music. However, I feel greatly encouraged as results are being achieved, with students receiving encouragement awards from academic societies. Going forward, I will take on the challenge of new research, and I hope to use the power of technology to make possible what humans have not been able to do until now.

Katsunobu Itou, Professor

Department of Digital Media, Faculty of Computer and Information Sciences

In 1993, he completed the doctoral program at the Tokyo Institute of Technology Graduate School of Engineering, Department of Information and Communications Engineering, with a major in information engineering. In the same year, he joined the Electrotechnical Laboratory of Japan (now the National Institute of Advanced Industrial Science and Technology), where he engaged in the development of a voice conversation system using speech recognition technology and a statistical language model. In 2003, he took up a position as assistant professor at the Nagoya University Graduate School of Informatics. He took up his present post in 2006.