Ubicoustics: Plug-and-Play Acoustic Activity Recognition

Microphones are the most common sensor found in consumer electronics today, from smart speakers and phones to tablets and televisions. Despite sound being an incredibly rich source of information, offering powerful insights about physical and social context, modern computing devices do not use their microphones to understand what is going on around them. For example, a smart speaker sitting on a kitchen countertop cannot figure out whether it is in a kitchen, let alone know what a user is doing in the kitchen. Likewise, a smartwatch worn on the wrist is oblivious to its wearer cooking or cleaning. This inability of “smart” devices to recognize what is happening around them in the physical world is a major impediment to their truly augmenting human activities.


Real-time, sound-based classification of activities and context is not new. There have been many previous application-specific efforts that focus on a constrained set of recognized classes. For example, Ward et al. developed a microphone-equipped necklace in conjunction with accelerometers mounted on arms that could distinguish between nine shop tools. In these types of constrained uses, the training data for machine learning is generally domain-specific and captured by the researchers themselves.


We sought to build a more general-purpose and flexible sound recognition pipeline – one that could be deployed to an existing device as a software update and work immediately, requiring no end-user or in situ data collection (i.e., no training or calibration). Such a system should be “plug-and-play” – e.g., plug in your Alexa, and it can immediately discern all of your kitchen appliances by sound. This is a challenging task, and very few sound-based recognition systems achieve usable end-user accuracies, despite offering pre-trained models meant to be integrated into applications (e.g., YouTube-8M, SoundNet).


We propose a novel approach that brings the vision of plug-and-play activity recognition closer to reality. Our process starts by taking an existing, state-of-the-art sound labeling model and tuning it with high-quality data from professional sound effect libraries for specific contexts (e.g., a kitchen and its appliances). We found professional sound effect libraries to be a particularly rich source of high-quality, well-segmented, and accurately-labeled data for everyday events. These large databases are employed in the entertainment industry for post-production sound design (and to a lesser extent in live broadcast and digital games).
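As a rough illustration, this tuning step amounts to standard transfer learning: sound effect clips are converted into embeddings by a pretrained audio model, and a small classifier head is trained on top for a target context. The sketch below assumes such an embedding model already exists and produces fixed-length vectors; `SoundClassifierHead`, `tune_on_sound_effects`, and the example class list are hypothetical names for illustration, not the system's actual implementation.

```python
# Minimal sketch of tuning on sound-effect data, assuming a pretrained audio
# embedding model (e.g., a VGGish-style network) has already converted each
# clip into a fixed-length vector. Class names and helpers are illustrative.
import torch
import torch.nn as nn

KITCHEN_CLASSES = ["blender", "microwave", "chopping", "faucet", "kettle"]  # example context

class SoundClassifierHead(nn.Module):
    """Small classifier trained on top of frozen pretrained audio embeddings."""
    def __init__(self, embedding_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)

def tune_on_sound_effects(embeddings: torch.Tensor, labels: torch.Tensor,
                          num_classes: int, epochs: int = 20) -> nn.Module:
    """Fit the head on (clip embedding, class label) pairs from sound-effect libraries."""
    head = SoundClassifierHead(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(embeddings), labels)
        loss.backward()
        optimizer.step()
    return head
```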


Sound effects can also be easily transformed into hundreds of realistic variations (synthetically growing our dataset, as opposed to finding or recording more data) by adjusting key audio properties such as amplitude and persistence, as well as mixing sounds with various background tracks. We show that models tuned on sound effects can achieve superior accuracy to those trained on internet-mined data alone. We also evaluate the robustness of our approach across different physical contexts and device categories. Results show that our system can achieve human-level performance, both in terms of recognition accuracy and false positive rejection.
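To give a concrete sense of this kind of augmentation, the sketch below (NumPy only) scales a clip's amplitude and mixes it with randomly chosen background tracks to produce many variants. The gain ranges and the `augment_clip` helper are illustrative assumptions, and persistence/duration changes are omitted for brevity.

```python
# Minimal sketch of synthetically expanding one sound-effect clip into many
# variants by varying amplitude and mixing in background tracks.
# All parameter ranges here are illustrative, not values from the paper.
import numpy as np

def augment_clip(clip: np.ndarray, backgrounds: list[np.ndarray],
                 n_variants: int = 100,
                 rng: np.random.Generator | None = None) -> list[np.ndarray]:
    """Return amplitude-scaled copies of `clip`, each mixed with a random background.

    Assumes every background track is at least as long as `clip`.
    """
    rng = rng or np.random.default_rng()
    variants = []
    for _ in range(n_variants):
        gain = rng.uniform(0.25, 1.0)           # vary loudness (e.g., near vs. far source)
        bg_index = rng.choice(len(backgrounds))  # pick a background track at random
        bg_clip = backgrounds[bg_index][: len(clip)]  # trim background to clip length
        bg_gain = rng.uniform(0.05, 0.3)         # keep the background quieter than the event
        mixed = gain * clip + bg_gain * bg_clip
        variants.append(mixed / max(np.max(np.abs(mixed)), 1e-8))  # normalize to avoid clipping
    return variants
```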


Research Team: Gierad Laput, Karan Ahuja, Mayank Goel, Chris Harrison


Additional media can be found on Gierad Laput's site.

Citation

Laput, G., Ahuja, K., Goel, M. and Harrison, C. 2018. Ubicoustics: Plug-and-Play Acoustic Activity Recognition. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (Berlin, Germany, October 14 - 17, 2018). UIST '18. ACM, New York, NY. 213-224.