
Audio compression & ear-modelling using convolutional and recurrent neural networks /

Sanket Kulkarni
Master of Science (MS), University of Miami
2018

Abstract

Subjects: Neural networks (Computer science); Data compression (Computer science)
Audio data compression is the process of reducing the transmission bandwidth and storage requirements of audio data. Lossy audio compression algorithms provide much greater compression ratios than lossless compression at the cost of some fidelity, yet remain practical in numerous audio applications. These techniques traditionally rely on the principle of psycho-acoustical masking to reduce the total number of bits required to represent each audio frame, thereby reducing the space required to store or transmit it. The purpose of this thesis project was to utilize modern recurrent neural network architectures with memory cells, such as LSTM (Long Short-Term Memory), along with convolutional neural networks to develop a prediction model that could be used for speech or music compression. Feature selection for representing the original audio can also be handled by neural networks, to further capture psycho-acoustical masking effects and minimize perceptual loss in audio quality, while still providing better reconstruction at the decoder with low audible artifacts. This work also contributes to the study of perceptual ear modelling and introduces a new scaled-additive reconstruction framework. The resulting codec was compared to previously established codecs such as MP3 and AAC using peak RMS error, area under the rate-distortion curve, and other perceptual metrics, with and without the help of binary entropy coding.
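The objective metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the 1024-sample frame length and the percent convention for the compression ratio are assumptions made here for concreteness.

```python
import math

def peak_rms_error(original, reconstructed, frame_len=1024):
    """Worst per-frame RMS error between original and decoded samples.
    frame_len = 1024 is an assumed analysis window, not a thesis value."""
    n = min(len(original), len(reconstructed))
    worst = 0.0
    for start in range(0, n - frame_len + 1, frame_len):
        sq = sum((o - r) ** 2
                 for o, r in zip(original[start:start + frame_len],
                                 reconstructed[start:start + frame_len]))
        worst = max(worst, math.sqrt(sq / frame_len))
    return worst

def compression_ratio_percent(original_bytes, compressed_bytes):
    """Percent size reduction: a 98.2 percent ratio means the compressed
    stream is only 1.8 percent of the original size."""
    return 100.0 * (1.0 - compressed_bytes / original_bytes)
```

Under this convention, 16-bit PCM at 44.1 kHz (705.6 kbps mono) compressed at a 98.2 percent ratio would come out to roughly 12.7 kbps.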
The final architecture was a sub-band coding architecture similar in format to MP3, but the traditional psycho-acoustical masking ear model was replaced with a neural network: dense layers first perform feature selection on the frequency-domain magnitude and phase of the input signal, followed by a recurrent neural network with memory cells (LSTM layers) for temporal masking, and finally a bottleneck convolutional neural network with residual connections for frequency-masking effects and their reconstruction. The model was trained separately on different datasets (speech and music) and encoded using an adaptive Huffman encoder to achieve industry-standard compression ratios. Smaller refinements handle transients separately and help suppress windowing effects. The best outputs were for speech trained on a single speaker, with almost no perceivable loss at around a 98.2 percent compression ratio, which is very useful for transmitting speech over low-bandwidth communication protocols such as BLE. For music, the compression ratio is in the range of 92 to 94 percent using variable-bit-rate encoding. The perceptual quality of the final compression codec was analyzed with the help of the PEAQ standard and listening tests such as MUSHRA; the listening test yielded an R-score of 0.7 compared to MP3 at 128 kbps encoding, at a probability of 0.95.
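As an illustration of the entropy-coding stage, here is a minimal static Huffman coder. The thesis uses an adaptive variant that updates its tree as symbols stream in, but the prefix-code construction is the same idea; in practice the symbols would be the quantized sub-band coefficients, while this sketch just uses characters.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code from symbol frequencies (static variant for
    illustration; an adaptive coder would rebuild the tree online)."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate one-symbol stream
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node: branch 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                               # leaf: record the codeword
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes
```

More frequent symbols receive shorter codewords, which is what lets the coder shave bits off the quantized coefficient stream without any additional loss.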
