In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variation and noise, which is challenging in low-resource settings. Self-supervised pretraining has been proposed as a way to improve both supervised and unsupervised speech recognition, including frame-level feature representations and Acoustic Word Embeddings (AWE) for variable-length segments. However, self-supervised models alone cannot fully separate the linguistic content, as they are trained to optimize indirect objectives. In this work, we experiment with different pre-trained self-supervised features as input to AWE models and show that they work best within a supervised framework. Models trained on English can be transferred to other languages with no adaptation and outperform self-supervised models trained solely on the target languages.
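As a rough illustration of the pipeline the abstract describes (pre-trained self-supervised frame-level features turned into a fixed-size embedding for a variable-length word segment), the sketch below extracts wav2vec 2.0 features for a word segment and mean-pools them. The model choice (wav2vec 2.0 base via torchaudio), the pooling strategy, and the `acoustic_word_embedding` helper are illustrative assumptions, not the paper's AWE architecture, which is trained on top of such features in a supervised setting.

```python
# Minimal sketch: frame-level self-supervised features pooled into an
# acoustic word embedding. Assumes wav2vec 2.0 base and mean-pooling;
# the paper's actual AWE models learn a dedicated embedding on top.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def acoustic_word_embedding(waveform: torch.Tensor,
                            sample_rate: int,
                            start_s: float,
                            end_s: float) -> torch.Tensor:
    """Embed the word spanning [start_s, end_s] seconds of `waveform`."""
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate)
    # Cut out the word segment (waveform shape: [channels, time]).
    lo = int(start_s * bundle.sample_rate)
    hi = int(end_s * bundle.sample_rate)
    segment = waveform[:, lo:hi]
    with torch.inference_mode():
        # Frame-level self-supervised features; take the last encoder layer.
        features, _ = model.extract_features(segment)
        frames = features[-1]             # [batch, frames, dim]
    # Mean-pool frames into one fixed-dimensional vector per segment.
    return frames.mean(dim=1).squeeze(0)  # [dim]
```

Under this setup, embeddings of the same word type spoken by different speakers should lie close together, e.g. under cosine distance, which is the property the AWE models are trained to strengthen.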
- Linguistics,
- Natural language processing systems,
- Speech recognition,
- Acoustic word embedding,
- Embeddings,
- Feature representation,
- Low-resource settings,
- Pre-training,
- Speaker variation,
- Transfer learning,
- Unsupervised ASR,
- Variable-length segments,