Why synthetic audio data is the sound of the future


18 Jan 2024


Treble’s Dr Finnur Pind explains why the future of audio sounds synthetic.

Sound is an integral part of our daily lives. Whether we’re jumping on a video call, listening to music or using voice commands to interact with our devices, the quality of our audio is of paramount importance. As such, continuing to improve audio technologies is vital for the companies that create these platforms.

Artificial intelligence (AI) has added a new layer of potential for streamlining sound recognition, enhancement and suppression. AI’s role in this sector largely centres on the data used to train the models behind audio products – from smart speakers to voice assistants.

Data is the best tune

Acoustic engineers require audio data to train, test and validate machine learning (ML) models that analyse, process or generate audio. To train these models effectively, a diverse and representative dataset is required, one that covers the range of scenarios the model is expected to encounter. Data is fundamental for research and development too, aiding, for example, in designing better sound insulation materials or optimising the performance of audio devices. The accuracy and reliability of acoustic models, simulations and technologies all depend on the quality of the data used for training and testing.

Real-world data (RWD) is data collected from actual, non-simulated environments or situations. It represents observations and measurements obtained from genuine occurrences, behaviours and conditions in the external world. Synthetic data, on the other hand, is artificially generated information created using models, algorithms or other computational methods. RWD ensures that the models are grounded in practical scenarios, while synthetic data facilitates controlled experimentation.

Synthetic audio data

RWD has its limitations, some of which stem from the inherent challenges of real acoustic environments and from human error. Poor recording equipment, for instance, can introduce issues that compromise the integrity of an entire dataset. Beyond these technical aspects, real-world datasets may not comprehensively represent the diverse range of acoustic variations present in the target application, leading to less robust models. And while real audio recordings offer authenticity, they may also contain sensitive information, making them unsuitable for training or testing audio processing models.

Synthetic audio data, by contrast, is artificially generated audio that mimics the characteristics of real-world recordings, and it brings several benefits. It makes training and testing audio processing algorithms and models more effective, helping to teach AI algorithms to discern, refine or dampen specific audio attributes. Developers can evaluate their models in controlled settings, streamlining targeted enhancements.

Addressing challenges related to data scarcity is another advantage. Acquiring ample real-world audio data can pose logistical and financial difficulties. Synthetic audio data acts as a bridge, ensuring that AI models access a diverse array of audio samples. Enhancing privacy is another important aspect. Synthetic audio data mitigates privacy concerns by generating data devoid of actual individuals or confidential information. Simulating varied audio scenarios is also made possible through synthetic data, allowing developers to create diverse testing environments. This proves essential in refining and fine-tuning AI algorithms by incorporating different accents, languages, noise levels and other factors, thereby contributing to the development of more robust models.

Clearing the noise clutter

Video conferencing and virtual meetings have become a fixture of almost every industry, and with this shift clear audio has become ever more important for effective communication. Background noise can muddle conversations and compromise audio quality, leading to misunderstandings and frustration. To combat this, AI technology is employed to distinguish between relevant audio and background noise, allowing devices to dynamically suppress unwanted sounds. Synthetic audio data is used to train algorithms to make these distinctions.
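To make that training concrete, a common recipe is to mix clean speech with background noise at controlled signal-to-noise ratios (SNRs), so a model sees the same utterance under many conditions. The short Python sketch below is a minimal illustration of this idea, assuming 16kHz mono signals; the function name mix_at_snr and the random stand-in signals are hypothetical, not any particular vendor’s pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix clean speech with noise at a target signal-to-noise ratio (in dB)."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the mixture hits the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# One clean utterance, many synthetic training conditions.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)  # stand-in for 1s of speech at 16kHz
noise = rng.standard_normal(16_000)   # stand-in for a background-noise clip
noisy_variants = [mix_at_snr(speech, noise, snr) for snr in (0, 5, 10, 20)]
```

Each (noisy mixture, clean speech) pair then serves as input and target when training a suppression model, and sweeping the SNR yields the diverse conditions described above.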

One practical application of this technology is found in the development of noise-cancelling headphones. Models trained on synthetic data help these headphones suppress background noise more effectively, creating a cleaner listening experience. Voice assistants like Siri or Alexa also rely on AI algorithms, often trained on the same type of data, to respond accurately even in noisy environments.

Software tools like Cleanvoice and Podcastle are also leveraging synthetic data to improve the audio quality of recordings, making it an invaluable asset for content creators, podcasters, and professionals in various audio-related industries.

Silencing the resonance

Echoes pose a considerable challenge in applications like video conferencing and telecommunication networks, degrading communication and audio quality. To address this, AI technologies, such as those used by the Microsoft Teams platform, employ advanced echo cancellation techniques. Microsoft used a colossal dataset of 30,000 hours of RWD to train its models, a process that involved incentivising users to record and play back their voices. This approach, though effective, was time-consuming and costly.

More efficient alternatives now exist. Acoustic engineering platforms, such as Treble, allow users to create vast quantities of synthetic data that can be customised for specific scenarios. Developers can configure various parameters, such as sound sources, receivers, environments and materials, to simulate diverse settings for training echo cancellation ML models. This approach eliminates the need for labour-intensive data collection, offering a cost-effective solution that minimises privacy concerns associated with user-recorded data. These platforms exemplify the potential of synthetic datasets in improving entire systems through efficient and customisable training scenarios.
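As a rough sketch of what such a pipeline produces – and nothing like the physics-based simulation a platform such as Treble performs, which models real geometry and materials – the Python example below fakes a room impulse response with exponentially decaying noise (a classic toy model of reverberation) and convolves a loudspeaker signal through it. Every name and parameter here is an illustrative assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def toy_room_impulse_response(duration_s: float, rt60_s: float,
                              sr: int = 16_000, seed: int = 0) -> np.ndarray:
    """Crude stand-in for a simulated room impulse response: white noise
    shaped by an exponential decay whose rate is set by the desired
    reverberation time (RT60, the time for sound to decay by 60 dB)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * sr)) / sr
    decay = np.exp(-6.91 * t / rt60_s)  # amplitude falls ~60 dB over rt60_s
    return rng.standard_normal(t.size) * decay

def simulate_echo(far_end: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """What the microphone hears: the loudspeaker (far-end) signal
    after passing through the simulated room."""
    return fftconvolve(far_end, rir)[: len(far_end)]

# Render one far-end signal through a dry room and a reverberant one.
far_end = np.random.default_rng(1).standard_normal(16_000)  # stand-in for 1s of speech
training_pairs = [
    (far_end, simulate_echo(far_end, toy_room_impulse_response(0.5, rt60, seed=s)))
    for s, rt60 in enumerate((0.3, 0.8))  # RT60 values in seconds
]
```

In a real acoustics platform the impulse response would come from simulating the configured sources, receivers, environment and materials, but the downstream recipe is the same: convolve, pair and train.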

Bridging linguistic divides

Voice commands have become ubiquitous, with smart assistants and automotive interfaces leading the way. However, for these systems to be truly universal, they must adapt to various accents, dialects and speech patterns. Synthetic audio data plays a crucial role in training speech recognition systems here, enabling seamless communication between humans and machines regardless of linguistic variations.

Companies such as OpenAI and Amazon have heavily invested in speech recognition technology. They utilise synthetic data to ensure their systems can understand and respond to users from diverse linguistic backgrounds. The potential applications are extensive, from home automation to multilingual customer service.

Meta is also on a mission to improve automatic speech recognition (ASR). It has faced challenges similar to Microsoft’s, using 27,000 command utterances collected from 595 paid US volunteers to train its models. Again, a platform dedicated to producing synthetic audio data, such as Treble, could have expedited the training process, eliminating the need for costly data collection and human effort.

Meta’s ambition extends to training ASR models without the aid of transcripts, recognising over 4,000 spoken languages and lip-reading more proficiently than human experts. However, the limitations of its training data – organised by demographics such as age group, gender, nationality and accent – can hinder its ASR tool’s ability to understand a broader cross-section of users. Synthetic data can offer a solution by introducing a wider range of pronunciations and linguistic diversity.

A harmonious future

As audio technology ushers in a new era of innovation, synthetic audio data emerges as a potent tool for enhancing sound quality and addressing the challenges faced in integrating AI into audio technology. Whether it’s improving speech recognition, cancelling echoes or suppressing background noise, the synergy of AI and synthetic audio data will play a pivotal role in elevating audio quality.

This partnership will go a long way towards transforming the way we create, perceive and interact with sound. The impact of this integration extends beyond consumers to a wide array of industries, from entertainment to communication and beyond. As we journey further into the realm of AI-driven audio technology, the possibilities for enhancing sound quality and solving real-world problems keep expanding. Synthetic audio data has cemented its place as a critical building block in this harmonious future, where the symphony of sound is clearer, richer and more accessible than ever before.

By Dr Finnur Pind

Graduating with a PhD in sound simulation technology from the Technical University of Denmark, Pind is an expert in the fields of acoustics, applied mathematics and software engineering. He is the co-founder and CEO of Treble Technologies, a company that makes acoustics simulation and spatial audio software.
