Title:
Data2vec-SG: Improving Self-supervised Learning Representations for Speech Generation Tasks

Abstract:
Self-supervised learning has been successfully applied to various speech recognition and understanding tasks. However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations have not shown substantial improvements. To address this problem, we propose data2vec-SG (Speech Generation), a teacher-student learning framework for speech generation tasks. Data2vec-SG introduces a reconstruction module into data2vec and encourages the representations to capture not only semantic information but also the acoustic knowledge needed to generate clean speech waveforms. Experimental results demonstrate that the proposed framework boosts the performance of various speech generation tasks, including speech enhancement, speech separation, and packet loss concealment. The learned representation also benefits other downstream tasks, as demonstrated by strong speech recognition performance in both clean and noisy conditions.
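
The following is a minimal PyTorch sketch of the kind of objective described above: a student predicts teacher representations at masked frames, and an added reconstruction head pushes the representation toward clean speech. The module sizes, the teacher's input, and the loss weighting are illustrative assumptions, not the paper's actual configuration.

import copy
import torch
import torch.nn.functional as F

class ToySSLModel(torch.nn.Module):
    def __init__(self, feat_dim=80, dim=256):
        super().__init__()
        self.encoder = torch.nn.GRU(feat_dim, dim, batch_first=True)  # stand-in for the real encoder
        self.decoder = torch.nn.Linear(dim, feat_dim)                  # hypothetical reconstruction head
        self.mask_emb = torch.nn.Parameter(torch.zeros(feat_dim))

def ema_update(teacher, student, decay=0.999):
    # Teacher weights track the student via an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

def training_step(student, teacher, noisy_feats, clean_feats, mask):
    # noisy_feats, clean_feats: (B, T, F); mask: (B, T) bool, True at masked frames.
    x = torch.where(mask.unsqueeze(-1), student.mask_emb.expand_as(noisy_feats), noisy_feats)
    student_repr, _ = student.encoder(x)
    with torch.no_grad():
        teacher_repr, _ = teacher.encoder(noisy_feats)                 # teacher sees the unmasked input
    pred_loss = F.mse_loss(student_repr[mask], teacher_repr[mask])     # data2vec-style regression
    recon_loss = F.l1_loss(student.decoder(student_repr), clean_feats) # pull representations toward clean speech
    return pred_loss + recon_loss

student = ToySSLModel()
teacher = copy.deepcopy(student)   # refreshed with ema_update after each optimizer step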

2023.02.26

Title:
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition

Abstract:
Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. We first observe that conventional SSL techniques do not work well on this task due to the poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset.
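
As a rough illustration of a bi-label masked-prediction objective, the sketch below predicts two label streams (one per speaker) at masked frames and resolves the speaker-to-head assignment by keeping the cheaper permutation; the head layout and the permutation handling are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def bi_label_masked_loss(encoder_out, labels_a, labels_b, mask, head_a, head_b):
    """encoder_out: (B, T, D); labels_a/labels_b: (B, T) discrete targets per speaker;
    mask: (B, T) bool of masked frames; head_a/head_b: torch.nn.Linear(D, vocab)."""
    logits_a = head_a(encoder_out)[mask]          # (N, vocab) at masked positions only
    logits_b = head_b(encoder_out)[mask]
    la, lb = labels_a[mask], labels_b[mask]
    loss_ab = F.cross_entropy(logits_a, la) + F.cross_entropy(logits_b, lb)
    loss_ba = F.cross_entropy(logits_a, lb) + F.cross_entropy(logits_b, la)
    return torch.minimum(loss_ab, loss_ba)        # keep the better speaker assignment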

2023.02.26

Title:
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Abstract:
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we apply them to streaming end-to-end speech translation (ST), which aims to convert audio signals directly to text in other languages. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time. Experimental results on a large-scale pseudo-labeled training set of 50 thousand (K) hours show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.
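
The abstract does not spell out how attention pooling is used inside the joint network, so the sketch below is only one plausible reading: each encoder frame is pooled over a short left-context window by self-attention before the usual additive transducer join. The dimensions, window size, and pooling choice are assumptions.

import torch

class AttentionPooledJoint(torch.nn.Module):
    def __init__(self, dim=512, joint_dim=640, vocab=4000, window=4):
        super().__init__()
        self.window = window
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.enc_proj = torch.nn.Linear(dim, joint_dim)
        self.pred_proj = torch.nn.Linear(dim, joint_dim)
        self.out = torch.nn.Linear(joint_dim, vocab)

    def forward(self, enc, pred):
        # enc: (B, T, dim) encoder frames; pred: (B, U, dim) prediction-network states.
        t = torch.arange(enc.size(1), device=enc.device)
        # Each frame may attend only to itself and `window` frames of left context,
        # keeping the pooling streaming-compatible.
        blocked = (t[None, :] > t[:, None]) | (t[None, :] <= t[:, None] - self.window)
        pooled, _ = self.attn(enc, enc, enc, attn_mask=blocked)
        joint = torch.tanh(self.enc_proj(pooled).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1))
        return self.out(joint)        # (B, T, U, vocab) transducer logits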

2022.03.29

Title:
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Abstract:
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition.
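
The integrated gradients attribution mentioned above is a standard technique; a generic sketch is given below, with the model, input shape, and step count as placeholders rather than the paper's setup.

import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Attribute model(x).sum() to the input features along a straight path
    from `baseline` to `x` (Riemann approximation of the path integral)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        grad = torch.autograd.grad(model(point).sum(), point)[0]
        total_grad += grad
    return (x - baseline) * total_grad / steps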

2022.03.29

Title:
A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Abstract:
This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation, but employs a Conformer encoder instead of the BLSTM network. The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves a 6.25% word error rate, outperforming the previous best system by 8.4% relatively. In addition, the proposed Conformer-based model is 18.3% smaller in model size and reduces total training time by 79.6%.
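
For reference, a compact Conformer block of the kind such an encoder stacks is sketched below (half-step feed-forward modules, self-attention, and a depthwise convolution module); relative positional encoding is omitted and the sizes are illustrative.

import torch
import torch.nn.functional as F

class ConformerBlock(torch.nn.Module):
    """Half-step FFN -> self-attention -> convolution module -> half-step FFN."""
    def __init__(self, dim=256, heads=4, kernel=15, ff_mult=4):
        super().__init__()
        def ffn():
            return torch.nn.Sequential(
                torch.nn.LayerNorm(dim),
                torch.nn.Linear(dim, ff_mult * dim), torch.nn.SiLU(),
                torch.nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = torch.nn.LayerNorm(dim)
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = torch.nn.LayerNorm(dim)
        self.pw1 = torch.nn.Conv1d(dim, 2 * dim, 1)                                   # pointwise, feeds GLU
        self.dw = torch.nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)  # depthwise
        self.bn = torch.nn.BatchNorm1d(dim)
        self.pw2 = torch.nn.Conv1d(dim, dim, 1)
        self.final_norm = torch.nn.LayerNorm(dim)

    def forward(self, x):                                # x: (B, T, dim)
        x = x + 0.5 * self.ff1(x)
        q = self.attn_norm(x)
        x = x + self.attn(q, q, q)[0]
        c = self.conv_norm(x).transpose(1, 2)            # (B, dim, T) for the conv module
        c = F.glu(self.pw1(c), dim=1)
        c = self.pw2(F.silu(self.bn(self.dw(c))))
        x = x + c.transpose(1, 2)
        return self.final_norm(x + 0.5 * self.ff2(x))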

2022.03.29

Title:
Continuous Speech Separation with Recurrent Selective Attention Network

Abstract:
While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves the conversation transcription accuracy, it often suffers from speech leakages and failures in separation at "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN by introducing dependencies between adjacent processing blocks in the CSS framework. It enables the network to utilize the separation results from the previous blocks to facilitate the current block processing. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves the speech recognition accuracy over PIT-based models. The proposed block-wise dependency modeling further boosts the performance of RSAN-CSS.
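
Schematically, the recursive extraction behind RSAN peels one speaker off at a time while a residual mask tracks what remains, which is what lets the number of outputs vary. A hedged sketch is below; `separator_net`, the stopping rule, and the carried-over block state are hypothetical placeholders.

import torch

def rsan_block(separator_net, mixture_spec, prev_state=None, max_spk=4, stop_thresh=0.05):
    """mixture_spec: (T, F) magnitude spectrogram of one processing block."""
    residual = torch.ones_like(mixture_spec)        # portion of the mixture still unexplained
    sources = []
    for _ in range(max_spk):
        # Condition on the mixture, the residual mask and, for the block-wise
        # dependency extension, on state carried over from the previous block.
        mask, prev_state = separator_net(mixture_spec, residual, prev_state)
        sources.append(mask * mixture_spec)         # one separated source
        residual = torch.clamp(residual - mask, min=0.0)
        if residual.mean() < stop_thresh:           # crude stand-in for active speaker counting
            break
    return sources, prev_state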

2021.10.30

Title:
Multitask Training with Text Data for End-to-End Speech Recognition

Abstract:
We propose a multitask training method for attention-based end-to-end speech recognition models to better incorporate language level information. We regularize the decoder in a sequence-to-sequence architecture by multitask training it on both the speech recognition task and a next-token prediction language modeling task. Trained on either the 100 hour subset of LibriSpeech or the full 960 hour dataset, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. Analyses of sample output sentences and the word error rate on rare words demonstrate that the proposed method can incorporate language level information effectively.
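
The core of the method can be written as a weighted sum of the ASR loss and a text-only next-token-prediction loss computed with the same decoder. The sketch below assumes a decoder interface that accepts an optional acoustic memory; the interface and the weight are illustrative.

import torch.nn.functional as F

def multitask_loss(decoder, enc_out, tok_in, tok_out, text_in, text_out, lm_weight=0.3):
    """tok_in/tok_out: shifted speech transcriptions; text_in/text_out: text-only batch."""
    # ASR branch: the decoder attends to the acoustic encoder output.
    asr_logits = decoder(tok_in, memory=enc_out)                   # (B, U, vocab)
    asr_loss = F.cross_entropy(asr_logits.transpose(1, 2), tok_out)
    # LM branch: the same decoder, no acoustic memory, plain next-token prediction.
    lm_logits = decoder(text_in, memory=None)
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), text_out)
    return asr_loss + lm_weight * lm_loss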

2021.06.03

Title:
Multi-Microphone Complex Spectral Mapping for Utterance-wise and Continuous Speaker Separation

Abstract:
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for offline utterance-wise and block-online continuous speaker separation in reverberant conditions, aiming at both speaker separation and dereverberation. Assuming a fixed array geometry between training and testing, we train deep neural networks (DNNs) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online continuous speaker separation (CSS). Although our system is trained on simulated room impulse responses (RIRs) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
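
To make the input/output arrangement concrete, the sketch below stacks the real and imaginary STFT components of all microphones as network input and trains against the RI components of the target at the reference microphone; the feature shapes and loss are illustrative and the network itself is left out.

import torch

def stack_ri(multichannel_wave, n_fft=512, hop=128):
    """multichannel_wave: (M, N) time-domain signals from M fixed microphones.
    Returns (2M, T, F) real/imaginary input features."""
    spec = torch.stft(multichannel_wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)   # (M, F, T)
    ri = torch.cat([spec.real, spec.imag], dim=0)                             # (2M, F, T)
    return ri.transpose(1, 2)

def ri_loss(pred_ri, target_spec):
    """pred_ri: (2, T, F) predicted RI at the reference mic; target_spec: (T, F) complex."""
    target_ri = torch.stack([target_spec.real, target_spec.imag], dim=0)
    return torch.nn.functional.l1_loss(pred_ri, target_ri)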

2021.06.03

Title:
Speaker Separation Using Speaker Inventories and Estimated Speech

Abstract:
We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation. SSUSIES contains two methods, speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES). SSUSI performs speaker separation with the help of a speaker inventory. By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches. SSUES is a widely applicable technique that can substantially improve speaker separation performance using the output of first-pass separation. We evaluate the models on both speaker separation and speech recognition metrics.

2020.05.13

Title:
Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

Abstract:
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortionless response (MVDR) beamformer. The RI components of the beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With the estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19% and 1.99% word error rates (WER) respectively on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91% and 2.24% WER.
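
The MVDR step in this pipeline is standard: spatial covariance matrices are accumulated from the DNN-estimated spectra and the beamformer is computed per frequency. The sketch below shows that computation; the shapes, weighting scheme, and reference-microphone choice are illustrative.

import numpy as np

def estimate_scm(stft, weights):
    """stft: (M, F, T) complex mixture; weights: (F, T) nonnegative frame weights
    derived from the DNN-estimated spectra (e.g., a magnitude-ratio mask)."""
    scm = np.einsum('ft,mft,nft->fmn', weights, stft, stft.conj())
    return scm / (weights.sum(-1)[:, None, None] + 1e-8)

def mvdr_weights(speech_scm, noise_scm, ref_mic=0):
    """speech_scm, noise_scm: (F, M, M). Returns (F, M) beamformer weights."""
    n_freq, n_mic, _ = speech_scm.shape
    u = np.zeros(n_mic)
    u[ref_mic] = 1.0
    w = np.zeros((n_freq, n_mic), dtype=complex)
    for f in range(n_freq):
        num = np.linalg.solve(noise_scm[f], speech_scm[f])   # R_n^{-1} R_s
        w[f] = (num / np.trace(num)) @ u
    return w

def apply_beamformer(w, mixture_stft):
    """mixture_stft: (M, F, T) complex; returns the (F, T) beamformed spectrum."""
    return np.einsum('fm,mft->ft', w.conj(), mixture_stft)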

2020.01.09

Title:
Speech Separation Using Speaker Inventory

Abstract:
Overlapped speech is one of the main challenges in conversational speech applications such as meeting transcription. Blind speech separation and speech extraction are two common approaches to this problem. Both of them, however, suffer from limitations resulting from the lack of abilities to either leverage additional information or process multiple speakers simultaneously. In this work, we propose a novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus solves their problems. SSUSI makes use of a speaker inventory, i.e. a pool of pre-enrolled speaker signals, and jointly separates all participating speakers. This is achieved by a specially designed attention mechanism, eliminating the need for accurate speaker identities. Experimental results show that SSUSI outperforms permutation invariant training based blind speech separation by up to 48% relatively in word error rate (WER). Compared with speech extraction, SSUSI reduces computation time by up to 70% and improves the WER by more than 13% relatively.
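
The key mechanism is an attention over the inventory: the mixture itself queries the pool of pre-enrolled profiles, so no exact speaker identity is required. A minimal sketch of that selection step follows; the embedding extractors and dimensions are placeholders.

import torch

def select_profiles(mixture_emb, inventory_embs):
    """mixture_emb: (D,) embedding of the mixture; inventory_embs: (K, D) enrolled profiles.
    Returns attention weights over the inventory and a pooled profile vector that can
    bias the separation network toward the speakers actually present."""
    scores = inventory_embs @ mixture_emb / mixture_emb.shape[0] ** 0.5   # (K,)
    weights = torch.softmax(scores, dim=0)
    return weights, weights @ inventory_embs                              # (K,), (D,)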

2019.09.16

Title:
Large Margin Training for Attention Based End-to-End Speech Recognition

Abstract:
End-to-end speech recognition systems are typically evaluated using the maximum a posteriori criterion. Since only one hypothesis is involved during evaluation, the ideal number of hypotheses for training should also be one. In this study, we propose a large margin training scheme for attention-based end-to-end speech recognition. Using only one training hypothesis, the large margin training strategy achieves the same performance as the minimum word error rate criterion using four hypotheses. The theoretical derivation in this study is widely applicable to other sequence discriminative criteria such as maximum mutual information. In addition, this paper provides a more succinct formulation of the large margin concept, paving the way towards a better combination of support vector machines and deep neural networks.
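
One common way to write a large-margin criterion for sequence models is a hinge on the log-probability gap between the reference and a single competing hypothesis, which matches the one-training-hypothesis setting described above; the margin value below is illustrative and the exact formulation in the paper may differ.

import torch

def large_margin_loss(logp_reference, logp_competitor, margin=1.0):
    """logp_*: (B,) sequence log-probabilities under the model. The reference must
    beat the competitor by at least `margin`, otherwise a linear penalty applies."""
    return torch.clamp(margin - (logp_reference - logp_competitor), min=0.0).mean()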

2019.07.05

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances in recent years. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem and propose a distortion-independent acoustic modeling scheme. Experimental results show that the distortion-independent acoustic model is able to overcome the distortion problem. Moreover, it can be used with various speech enhancement models. Both the distortion-independent and a noise-dependent acoustic model perform better than the previous best system on the CHiME-2 corpus. The noise-dependent acoustic model achieves a word error rate of 8.7%, outperforming the previous best result by 6.5% relatively.

2019.07.05

Title:
Enhanced Spectral Features for Distortion-Independent Acoustic Modeling

Abstract:
It has recently been shown that a distortion-independent acoustic modeling method is able to overcome the distortion problem caused by speech enhancement. In this study, we improve the distortion-independent acoustic model by feeding it with enhanced spectral features. Using enhanced magnitude spectra, the automatic speech recognition (ASR) system achieves a word error rate of 7.8% on the CHiME-2 corpus, outperforming the previous best system by more than 10% relatively. Compared with the corresponding enhanced waveform signal based system, systems using enhanced spectral features obtain up to 24% relative improvement. These comparisons show that speech enhancement is helpful for robust ASR and that enhanced spectral features are more suitable for ASR tasks than enhanced waveform signals.

2019.07.05

Title:
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling

Abstract:
Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.

2019.02.07

Title:
Improving Speech Recognition Error Prediction For Modern and Off-the-Shelf Speech Recognizers

Abstract:
Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks such as discriminative language modeling and improving the robustness of NLP systems when limited or even no audio data is available at training time. Previous work typically considered replicating the behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic-confusion-based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model; second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then by using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.
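
The sampling-based paradigm can be illustrated in a few lines: rather than always substituting the most confusable phone, recognized phones are drawn from a confusion distribution, which better mimics a posterior-based acoustic model. The toy function below handles substitutions only; insertions, deletions, and the actual confusion estimates are omitted.

import numpy as np

def simulate_errors(phone_seq, confusion, seed=0):
    """phone_seq: list of spoken phone indices; confusion: (P, P) row-stochastic
    matrix with confusion[i, j] = P(recognized j | spoken i)."""
    rng = np.random.default_rng(seed)
    return [int(rng.choice(len(confusion), p=confusion[p])) for p in phone_seq]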

2018.10.31

Title:
Token-Wise Training for Attention Based End-to-End Speech Recognition

Abstract:
In attention-based end-to-end (A-E2E) speech recognition systems, the dependency between output tokens is typically formulated as an input-output mapping in the decoder. Due to this dependency, decoding errors can easily propagate along the output sequence. In this paper, we propose a token-wise training (TWT) method for A-E2E models. The new method is flexible and can be combined with a variety of loss functions. Applying TWT to multiple hypotheses, we propose a novel TWT in beam (TWTiB) training scheme. Trained on the benchmark 300-hour Switchboard (SWBD) corpus, TWTiB outperforms the previous best training scheme on the SWBD evaluation subset.

2018.10.31

Title:
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions

Abstract:
The acoustic model and the language model (LM) have been two major components of conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between E2E systems and traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. One observation is that current attention-based E2E systems can produce better recognition results when decoded with LMs that are independently trained on the same resource.

In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on the 300-hour SWB corpus show that both loss functions significantly improve the baseline model performance. The additional gain from joint LM decoding remains the same for the CTC-trained model but is only marginal for the MBR-trained model. This implies that while the CTC loss function is able to capture more acoustic knowledge, the MBR loss function exploits more lexicon dependency.
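
A standard joint CTC/attention multitask loss of the kind investigated here is sketched below; the interpolation weight, blank index, and padding conventions are illustrative.

import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_logits, att_logits, targets, in_lens, tgt_lens, ctc_weight=0.3):
    """ctc_logits: (B, T, V) frame-level outputs; att_logits: (B, U, V) decoder outputs;
    targets: (B, U) token ids, padded with -100; blank id assumed to be 0."""
    ctc = F.ctc_loss(ctc_logits.transpose(0, 1).log_softmax(-1),
                     targets.clamp(min=0), in_lens, tgt_lens,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=-100)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att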

2018.07.30

Title:
Filter-and-Convolve: A CNN Based Multichannel Complex Concatenation Acoustic Model

Abstract:
We propose a convolutional neural network (CNN) based multichannel complex-domain concatenation acoustic model. The proposed model extracts speech-specific information from multichannel noisy speech signals. In addition, we design two CNN templates that have wide applicability and several speaker adaptation methods for the multichannel complex concatenation acoustic model. Even with a simple BeamformIt beamformer and the baseline language model, our method obtains a word error rate (WER) of 5.39% on the CHiME-4 corpus, outperforming the previous best result by 13.06% relatively. Using an MVDR beamformer, our model outperforms the corresponding best system by 9.77% relatively.

2017.11.06

Title:
Utterance-Wise Recurrent Dropout and Iterative Speaker Adaptation for Robust Monaural Speech Recognition

Abstract:
This study addresses monaural (single-microphone) automatic speech recognition (ASR) in adverse acoustic conditions. Our study builds on a state-of-the-art monaural robust ASR method that uses a wide residual network with bidirectional long short-term memory (BLSTM). We propose a novel utterance-wise dropout method for training LSTM networks and an iterative speaker adaptation technique. When evaluated on the monaural speech recognition task of the CHiME-4 corpus, our model yields a word error rate (WER) of 8.28% using the baseline language model, outperforming the previous best monaural ASR by 16.19% relatively.
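
Read literally, utterance-wise dropout draws a single dropout mask per utterance and reuses it at every frame rather than resampling per time step; the sketch below implements that reading for a generic hidden activation, with the rate and shapes as placeholders.

import torch

def utterance_wise_dropout(x, p=0.3, training=True):
    """x: (B, T, D) hidden activations for a batch of utterances."""
    if not training or p == 0.0:
        return x
    keep = torch.bernoulli(torch.full((x.size(0), 1, x.size(2)), 1.0 - p, device=x.device))
    return x * keep / (1.0 - p)     # one mask per utterance, broadcast over all T frames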

2017.11.06