Chulalongkorn University’s Faculty of Engineering and Faculty of Arts have jointly developed the “Thai Speech Emotion Recognition Data Sets and Models”, now available for free download, to help sales operations and service systems respond better to customers’ needs.
The Thai Speech Emotion Recognition Model, a cutting-edge AI developed by Chula faculty members and now available for public download, is the product of interdisciplinary research by Dr. Ekapol Chuangsuwanich, a computer engineering scholar from the Faculty of Engineering, together with Asst. Prof. Dangkamol Na-pombejra and Patsupang Kongbumrung, two dramatic arts experts from the Faculty of Arts.
“Speech Emotion Recognition AI has many applications. For example, a call centre system can assess whether customers who call for service are angry or irritable, and record their feelings from their tone of voice throughout the conversation as statistics on dissatisfied customers. An AI that expresses more natural emotions while communicating with users can also be created to replace the monotonous, robotic voice that we are familiar with,” said Dr. Ekapol, explaining the goals of the project, which is a collaboration with Vidyasirimedhi Institute of Science and Technology (VISTEC) and is funded by the Digital Economy Promotion Agency (depa) and Advanced Info Service Public Co., Ltd. (AIS).
A library of emotionally diverse voices from performers
Before emotion classification models can be built, an audio library is required. This is where dramatic arts comes in, to help create a Thai Speech Emotion Data Set.
Two hundred performers, both male and female, recorded speech in five emotional categories: anger, sadness, frustration, happiness, and a standard, neutral tone. Each performer recorded all five emotions, both as a monologue and interactively as a dialogue.
“Usable voices have to express the real emotions that occur in our daily lives, not overacted ones. Therefore, a team of directors had to be present to help guide the actors to deliver realistic voices according to the moods,” said Asst. Prof. Dangkamol.
“Moreover, when it was time to switch to another emotion, some actors would still linger in the previous mood, so the team of directors coached them to induce the new emotion until they conveyed it in the most realistic manner.”
After the recordings were completed, sound patterns for all five types of emotion were derived from the audio data sets and later developed into emotion-classifying models which, according to Dr. Ekapol, reach an accuracy of 60-70%.
“We perceive a speaker’s mood by observing the composition of the speech: tone, volume, cadence, whimpers, laughter. AI works in much the same way as we sense emotions,” Dr. Ekapol explained.
“AI is tasked with classifying the input speech and matching it with the corresponding type of emotion by comparing the input against baseline voices. Once the AI learns from the input, it is able to detect the patterns. For example, a mournful voice would be slightly softer than normal, a happy voice would be mixed with laughter, and an angry voice would be louder than usual.”
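The idea of comparing an input voice against a baseline can be illustrated with a deliberately simplified sketch. It is not the project's model: it uses only loudness, the clips are synthetic sine waves standing in for recordings, and the function names and the 25% `margin` are assumptions chosen for illustration.

```python
import math


def rms_loudness(samples):
    """Root-mean-square amplitude, a rough proxy for perceived volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_by_loudness(samples, baseline_samples, margin=0.25):
    """Label a clip by comparing its loudness against a neutral baseline.

    margin is a hypothetical tolerance: loudness ratios within +/-25%
    of the baseline are treated as neutral.
    """
    ratio = rms_loudness(samples) / rms_loudness(baseline_samples)
    if ratio > 1 + margin:
        return "anger"    # noticeably louder than the baseline
    if ratio < 1 - margin:
        return "sadness"  # noticeably softer than the baseline
    return "neutral"


def sine_clip(amplitude, freq_hz=220.0, rate=16000, seconds=0.1):
    """Synthetic stand-in for a recorded voice clip (assumption)."""
    n = int(rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]


baseline = sine_clip(0.5)
print(classify_by_loudness(sine_clip(0.9), baseline))  # prints "anger"
print(classify_by_loudness(sine_clip(0.3), baseline))  # prints "sadness"
```

A real system would need far more than volume, since cues such as tone, cadence, and laughter cannot be captured by a single loudness ratio.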
Dr. Ekapol pointed out that the Speech Emotion Recognition Models can be used in many types of work, depending on the users’ imagination and what they want from the mood analysis.
“Usage is not limited to people who work with computers. You need to look at what users want to use the emotional assessment for. For example, it can be used in call centers to identify upset customers and to analyze what they talk about and which subjects upset them most. It can also be developed into avatars or AI robots that have facial expressions and moving lips and can respond to users.”
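As a rough illustration of the call center use case, the sketch below tallies how often calls on each subject carry an "upset" emotion label. The call records, topic names, and the choice to count anger and frustration as upset are all hypothetical; the emotion labels are assumed to come from some classifier that has already run.

```python
from collections import Counter


def upset_share_by_topic(calls, upset_labels=("anger", "frustration")):
    """Return, per topic, the fraction of calls labeled as upset.

    calls is an iterable of (topic, emotion) pairs; upset_labels is an
    assumption about which emotion labels count as an upset customer.
    """
    totals = Counter()
    upset = Counter()
    for topic, emotion in calls:
        totals[topic] += 1
        if emotion in upset_labels:
            upset[topic] += 1
    return {topic: upset[topic] / totals[topic] for topic in totals}


# Hypothetical, hand-written records for illustration only.
calls = [
    ("billing", "anger"),
    ("billing", "frustration"),
    ("billing", "neutral"),
    ("coverage", "happiness"),
    ("coverage", "anger"),
]

shares = upset_share_by_topic(calls)
# Topics sorted from most to least upsetting.
for topic in sorted(shares, key=shares.get, reverse=True):
    print(topic, round(shares[topic], 2))
```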
Dr. Ekapol added that speech-based emotion-classifying AI is also useful in various hotline operations, especially mental health hotlines.
“In the future, we plan to develop it for use in mental health hotlines serving patients with depression, so that the robots can respond emotionally to humans.”
Future models to increase diversity in both sounds and moods
Certainly, the five emotions in the database are not enough to capture the full range of human feelings. Each gender and age group also expresses the same emotion in a different way, which poses a new challenge for the researchers. They plan to improve the system’s effectiveness and the accuracy of its emotion detection, as well as expand the models to cover people of all ages.
“We have plans to improve the current models’ efficiency and expand the coverage to more groups of people. Because most of the recording actors were students and working-age people, the model may not perform well on the voices of children and the elderly. Also, since the recordings were done in a studio, the models may not work as well if the input voices have too much ambient noise,” Dr. Ekapol said.
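One common way to probe the sensitivity to ambient noise mentioned above is to mix synthetic noise into clean studio recordings at a chosen signal-to-noise ratio (SNR) and see how a model's predictions change. The sketch below shows only the mixing step, under stated assumptions: the clip is a synthetic sine wave, the noise is Gaussian, and the function names are illustrative rather than part of the released data sets or models.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a clip."""
    return sum(s * s for s in samples) / len(samples)


def add_noise(samples, snr_db, seed=0):
    """Mix Gaussian noise into a clip at the requested SNR in decibels."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    # Scale the noise so that signal_power / noise_power == 10 ** (snr_db / 10).
    target_noise_power = power(samples) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(samples, noise)]


# Synthetic stand-in for a clean studio recording (assumption).
clean = [0.5 * math.sin(2 * math.pi * 220.0 * i / 16000) for i in range(1600)]
noisy = add_noise(clean, snr_db=10)

# Recover the noise that was added and check the SNR actually achieved.
added = [n - c for n, c in zip(noisy, clean)]
achieved_db = 10 * math.log10(power(clean) / power(added))
print(round(achieved_db, 3))  # close to 10.0
```

Lower `snr_db` values mean louder noise relative to the voice, so sweeping it downward gives a simple picture of where a model starts to struggle.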