Using artificial intelligence (AI), engineers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new system that can extract the sound of a single instrument from a video. Not only that, the deep learning system can also make those sounds louder or softer as required.

The system is completely self-sufficient and needs no human intervention to do its job. Named “PixelPlayer,” it can identify certain instruments at the pixel level and then isolate the sounds linked with each instrument. This ability means we could someday see huge improvements in audio quality at concerts.


In a new paper, the researchers demonstrated how PixelPlayer can isolate more than 20 different instrument sounds, and they are confident that with more training the system could identify more. However, they acknowledge that it will likely still struggle to tell the difference between subclasses of instruments (e.g., a tenor sax versus an alto sax).

Previous attempts at separating individual sound sources have focused purely on audio, and the problem with that approach is that it requires a lot of human labeling. PixelPlayer, on the other hand, brings vision into the mix, making human labeling unnecessary. It works by locating the image regions responsible for making a particular sound, then separating the input audio into components that represent each pixel’s sound.
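The per-pixel separation idea can be sketched in a few lines. This is an illustrative toy, not PixelPlayer’s actual architecture: the function and array names below are hypothetical, and it simply shows the general scheme in which a vision network assigns each pixel weights over K audio components, which are then recombined into a sound estimate for every pixel.

```python
import numpy as np

def per_pixel_sound(audio_components, pixel_activations):
    """Hypothetical sketch of pixel-level source separation.

    audio_components: (K, F, T) spectrogram components from an audio network.
    pixel_activations: (H, W, K) per-pixel scores from a vision network.
    Returns an (H, W, F, T) estimated magnitude spectrogram for every pixel.
    """
    # Sigmoid gating squashes scores into [0, 1] mixing weights per pixel.
    weights = 1.0 / (1.0 + np.exp(-pixel_activations))
    # Each pixel's sound is its weighted sum of the shared audio components.
    return np.einsum("hwk,kft->hwft", weights, audio_components)

# Toy dimensions: a 4x4 frame, 3 audio components, 8 frequency bins, 5 time steps.
rng = np.random.default_rng(0)
spec = per_pixel_sound(rng.random((3, 8, 5)), rng.random((4, 4, 3)))
print(spec.shape)  # (4, 4, 8, 5)
```

Summing the per-pixel spectrograms over a region of the frame (say, the pixels covering a violin) would then give that instrument’s isolated sound.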

“We expected a best-case scenario where we could recognize which instruments make which kinds of sounds,” says Zhao, a CSAIL Ph.D. student. “We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video.” 


PixelPlayer works by using neural networks and deep learning techniques to find patterns in data. The researchers call the system “self-supervised,” as they don’t yet fully understand every part of how it learns which sounds belong to which instruments. But Zhao says he can tell when the system identifies certain aspects of the music. For example, fast, pulsing patterns tend to be linked to instruments such as the xylophone, while smoother, more harmonic frequencies tend to correlate with instruments such as the violin.
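That contrast between pulsing and sustained sounds can be illustrated with a simple signal statistic. The sketch below is purely illustrative and is not how PixelPlayer itself works: it measures how “spiky” a signal’s frame-by-frame energy envelope is, which is high for percussive, xylophone-like clicks and low for a steady, violin-like tone.

```python
import numpy as np

def pulsing_score(signal, frame=512):
    """Ratio of std to mean of frame-wise energy; higher means more pulsing."""
    n = len(signal) // frame
    env = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2)
                    for i in range(n)])
    return np.std(env) / (np.mean(env) + 1e-9)

sr = 8000
t = np.arange(sr) / sr
clicks = np.zeros(sr)          # sparse impulses, like a struck xylophone bar
clicks[::1000] = 1.0
sine = np.sin(2 * np.pi * 440 * t)  # steady tone, like a bowed violin string

print(pulsing_score(clicks) > pulsing_score(sine))  # True
```

A learned system picks up far richer cues than this single statistic, but the example shows why pulsing versus sustained energy is a separable signature in the audio alone.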
