Deep Scattering Network features, introduced for image processing, have recently proved useful in speech recognition as an alternative to log-mel features for Deep Neural Network (DNN) acoustic models. Scattering features use a wavelet decomposition that directly produces log-frequency spectrograms; they are robust to local time warping and carry additional information in their higher-order coefficients.
This paper extends previous work by showing how scattering features perform in a state-of-the-art spontaneous speech recognition system that uses a DNN acoustic model. We revisit feature normalization and compression in an extensive study, with emphasis on comparing models of the same size. We observe that scattering features outperform the log-mel baseline in all conditions, with additional gains from multi-resolution processing.