Phoné: An Initiative to Develop a Dataset for the Automatic Recognition of Spoken Italian
Published 2025-02-24
Keywords
- speech technology
- automatic speech recognition
- Large Language Models
- dataset
Copyright (c) 2025 Gianpaolo Coro, Francesco Cutugno, Loredana Schettino, Emilia Tanda, Alessandro Vietti, Vincenzo Norman Vitale

This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Large Language Models (LLMs) have revolutionised natural language processing and its applications. However, high-performance LLMs require vast amounts of data and computing resources to develop, and are rarely publicly available. The same holds for Large Acoustic Models (LAMs) for processing spoken language. The Phoné initiative seeks to build an open Italian speech dataset to advance Automatic Speech Recognition (ASR) systems and support public research. Spearheaded by institutions in Naples, Pisa, and Bolzano, the project gathers diverse Italian audio sources and applies advanced ASR architectures, including supervised and self-supervised models. This paper details Phoné’s dataset creation, ASR model evaluation, and ethical considerations, aiming to democratise access to Italian-language resources and foster innovation in ASR technologies.
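Since the abstract mentions ASR model evaluation, a worked example of the standard metric may be useful. Word Error Rate (WER) counts word-level substitutions, deletions, and insertions of a system hypothesis against a reference transcript (cf. Morris et al. 2004 in the references below). A minimal sketch in Python; the function name and the example sentences are illustrative, not taken from the Phoné project:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming:
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# Example: one substitution out of three reference words gives WER = 1/3.
print(word_error_rate("il gatto dorme", "il gatto corre"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one motivation for the alternative measures (MER, WIL) proposed by Morris et al.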
References
- Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. “wav2vec 2.0: A Framework for Self-supervised Learning of Speech Representations”. Advances in Neural Information Processing Systems, 33: 12449-60.
- Bridle, John Scott, and Michael D. Brown. 1974. “An Experimental Automatic Word-Recognition System”. Joint Speech Research Unit (JSRU) Report, 1003.5: 33.
- Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, et al. 2024. “A Survey on Evaluation of Large Language Models”. ACM Transactions on Intelligent Systems and Technology, 15(3): 1-45. https://doi.org/10.1145/3641289.
- Ghai, Wiqas, and Navdeep Singh. 2012. “Literature Review on Automatic Speech Recognition”. International Journal of Computer Applications, 41(8): 42-50. https://doi.org/10.5120/5565-7646.
- Giordano Orsini, Luigi Maria, Vincenzo Norman Vitale, and Francesco Cutugno. 2023. “Large Scale Acoustic Models: A New Perspective”. Sistemi Intelligenti, 35(2): 401-12. https://doi.org/10.1422/108137.
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
- Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks”. Proceedings of the 23rd International Conference on Machine Learning: 369-76.
- Graves, Alex. 2012. “Sequence Transduction with Recurrent Neural Networks”. arXiv preprint. https://doi.org/10.48550/arXiv.1211.3711.
- Graves, Alex. 2014. “Generating Sequences with Recurrent Neural Networks”. arXiv preprint. https://doi.org/10.48550/arXiv.1308.0850.
- Gulati, Anmol, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. “Conformer: Convolution-augmented Transformer for Speech Recognition”. Proceedings of INTERSPEECH 2020, October 25-29, 2020, Shanghai, China: 5036-40. https://doi.org/10.21437/Interspeech.2020-3015.
- Hinton, Geoffrey, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups”. IEEE Signal Processing Magazine, 29(6): 82-97. https://doi.org/10.1109/MSP.2012.2205597.
- Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing”. https://spacy.io/.
- Hsu, Wei-Ning, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. “HuBERT: Self-supervised Speech Representation Learning by Masked Prediction of Hidden Units”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 3451-60. https://doi.org/10.1109/TASLP.2021.3122291.
- Ježek, Elisabetta, and Rachele Sprugnoli. 2023. Linguistica computazionale. Introduzione all’analisi automatica dei testi. Bologna: il Mulino.
- Juang, Biing-Hwang, and Lawrence R. Rabiner. 2005. “Automatic Speech Recognition: A Brief History of the Technology Development”. Technical report, Georgia Institute of Technology, Atlanta; Rutgers University; University of California, Santa Barbara.
- Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing. Pearson Education.
- Karpagavalli, Shunmugam, and Erick Chandra. 2016. “A Review on Automatic Speech Recognition Architecture and Approaches”. International Journal of Signal Processing, Image Processing and Pattern Recognition, 9: 393-404. https://doi.org/10.14257/ijsip.2016.9.4.34.
- Këpuska, Veton Z., and Hussien A. Elharati. 2015. “Robust Speech Recognition System Using Conventional and Hybrid Features of MFCC, LPCC, PLP, RASTA-PLP and Hidden Markov Model Classifier in Noisy Conditions”. Journal of Computer and Communications, 3: 1-9. https://doi.org/10.4236/jcc.2015.36001.
- Kuchaiev, Oleksii, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, et al. 2019. “NeMo: A Toolkit for Building AI Applications Using Neural Modules”. arXiv preprint. https://arxiv.org/abs/1909.09577.
- Li, Jinyu. 2022. “Recent Advances in End-to-end Automatic Speech Recognition”. APSIPA Transactions on Signal and Information Processing, 11: 1-64. https://doi.org/10.1561/116.00000050.
- Malik, Mishaim, Muhammad Kamran Malik, Khawar Mehmood, and Imran Makhdoom. 2020. “Automatic Speech Recognition: A Survey”. Multimedia Tools and Applications, 80: 9411-57. https://doi.org/10.1007/s11042-020-10073-7.
- McCowan, Iain, Darren Moore, John Dines, Daniel Gatica-Perez, Mike Flynn, Pierre Wellner, and Hervé Bourlard. 2005. “On the Use of Information Retrieval Measures for Speech Recognition Evaluation”. IDIAP Research Report, 04-73. Martigny, Switzerland: IDIAP.
- Morris, Andrew Cameron, Viktoria Maier, and Phil Green. 2004. “From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition”. Proceedings of INTERSPEECH 2004, October 4-8, 2004, Jeju Island, Korea: 2765-68. https://doi.org/10.21437/Interspeech.2004-668.
- Nissim, Malvina, and Ludovica Pannitto. 2022. Che cos’è la linguistica computazionale. Carocci editore.
- Palmerini, Maria, and Renata Savy. 2014. “Gli errori di un sistema di riconoscimento automatico del parlato: analisi linguistica e primi risultati di una ricerca interdisciplinare”. Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, December 9-11, 2014, Pisa: 281-85.
- Povey, Daniel, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Veselý. 2011. “The Kaldi Speech Recognition Toolkit”. Proceedings of IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hawaii, US. IEEE Signal Processing Society.
- Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. “Robust Speech Recognition via Large-scale Weak Supervision”. Proceedings of the 40th International Conference on Machine Learning (ICML’23): 28492-518.
- Rekesh, Dima, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg. 2023. “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition”. Proceedings of 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): 1-8. https://doi.org/10.1109/ASRU57964.2023.10389701.
- Savy, Renata, and Francesco Cutugno. 2009. “CLIPS. Diatopic, Diamesic and Diaphasic Variations in Spoken Italian”. Proceedings of the 5th Corpus Linguistics Conference (CL2009): 20-23.
- Turing, Alan Mathison. 1950. “Computing Machinery and Intelligence”. Mind, LIX(236): 433-60. https://doi.org/10.1093/mind/LIX.236.433.
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention is All You Need”. Proceedings of the 31st International Conference on Neural Information Processing Systems: 6000-10.
- Vitale, Vincenzo Norman, Emilia Tanda, and Francesco Cutugno. 2024. “Towards a Responsible Usage of AI-based Large Acoustic Models for Automatic Speech Recognition: On the Importance of Data in the Self-supervised Era”. Atti quarto Convegno Nazionale CINI sull’Intelligenza Artificiale – Ital-IA 2024. https://hdl.handle.net/11588/974957.