
G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost


Abstract

This paper addresses the problem of recognizing a native language in a mixed-voice environment. G-Cocktail aids voice-driven applications in identifying commands given in Gujarati, even from a mixed voice stream. G-Cocktail operates in two phases: the first filters the voices and extracts features, and the second trains and classifies the dataset. The trained model is then used to recognize new voice signals. The main challenge in training for a native language is that only small datasets are available. The model takes single-word inputs and uses phrase benchmark datasets from Microsoft and the Linguistic Data Consortium for Indian Languages (LDC-IL). To overcome the overfitting caused by the small dataset, the CatBoost algorithm is used and the classification model is fine-tuned. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC; they are derived from a cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). MFCCs work well for human voices, but noise in the recording makes them less effective. To avoid this shortcoming, the voices are filtered first and the MFCCs are computed afterwards, and only the most relevant features are retained to make the model more robust. Pitch is added to the MFCC features, since pitch can vary with region, mood, age, and the speaker's familiarity with the language. A voice print of each sound file is constructed and fed as features to the classification model. A 70%/30% training/testing split is used for G-Cocktail and for baseline algorithms such as K-means, Naïve Bayes, and LightGBM. On the given dataset, G-Cocktail using CatBoost outperformed the others on all evaluation parameters.
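
To make the two-phase pipeline concrete, the sketch below shows one plausible realization in Python. It is not the authors' implementation: the pre-emphasis filter standing in for the filtering step, the librosa feature calls, the voice-print layout, and every CatBoost hyperparameter are illustrative assumptions, and `paths`/`labels` are hypothetical placeholders for the Microsoft and LDC-IL word and phrase data.

```python
# Hypothetical sketch of the two-phase G-Cocktail pipeline described in the
# abstract. Library choices (librosa, scikit-learn, catboost) and all parameter
# values are assumptions for illustration, not the paper's exact setup.
import numpy as np
import librosa
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

def voice_print(path, sr=16000, n_mfcc=13):
    """Phase 1: filter the signal, then build an MFCC + pitch feature vector."""
    y, sr = librosa.load(path, sr=sr)
    # Pre-emphasis as a stand-in for the paper's filtering step: MFCCs are
    # computed only after the voice has been cleaned up.
    y = librosa.effects.preemphasis(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pitch is appended because it varies with region, mood, age, and the
    # speaker's familiarity with the language.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=2093.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    pitch = [f0.mean(), f0.std()] if f0.size else [0.0, 0.0]
    # One fixed-length voice print per file: MFCC means/stds plus pitch stats.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), pitch])

# `paths` and `labels` are placeholders for the audio files of the word/phrase
# dataset and their Gujarati command labels.
X = np.stack([voice_print(p) for p in paths])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

# Phase 2: a CatBoost classifier kept small and regularised so that it does
# not overfit the limited training data.
model = CatBoostClassifier(
    iterations=500,
    depth=4,                  # shallow trees suit a small dataset
    l2_leaf_reg=10,           # stronger L2 regularisation against overfitting
    early_stopping_rounds=50,
    verbose=False,
)
model.fit(X_tr, y_tr, eval_set=(X_te, y_te))  # a separate validation split would be cleaner
print("test accuracy:", model.score(X_te, y_te))
```

The same voice prints can be fed to K-means, Naïve Bayes, and LightGBM baselines to reproduce the 70/30 comparison the abstract describes.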


Availability of Data and Material

The dataset is available from LDC-IL: The Indian repository of resources for language technology. Language Resources and Evaluation, 1–13. https://www.ldcil.org/publications.aspx

Code Availability

No software application or custom code was copied for this work.

References

1. Bhaskararao, P. (2011). Salient phonetic features of Indian languages in speech technology. Sadhana, 36(5), 587–599. https://doi.org/10.1007/s12046-011-0039-z

2. Bhat, G. S., Shankar, N., Reddy, C. K. A., & Panahi, I. M. S. (2019). A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone. IEEE Access, 7, 78421–78433. https://doi.org/10.1109/ACCESS.2019.2922370

3. Yarra, C., Aggarwal, R., Rajpal, A., & Ghosh, P. K. (2019). Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations. In: 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), Cebu, Philippines, pp. 1–6. https://doi.org/10.1109/O-COCOSDA46868.2019.9041230

4. Jeeva, M. P. A., Nagarajan, T., & Vijayalakshmi, P. (2020). Adaptive multi-band filter structure-based far-end speech enhancement. IET Signal Processing, 14(5), 288–299. https://doi.org/10.1049/iet-spr.2019.0226

5. Panda, S. P., Nayak, A. K., & Rai, S. C. (2020). A survey on speech synthesis techniques in Indian languages. Multimedia Systems, 26(4), 453–478. https://doi.org/10.1007/s00530-020-00659-4

6. Sarkar, P., Haque, A., Dutta, A. K., Gurunath Reddy, M., Harikrishna, D. M., Dhara, P., Rashmi, V., Narendra, N. P., Sunil Kr. S. B., Yadav, J., & Sreenivasa Rao, K. (2014). Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu. In: Seventh International Conference on Contemporary Computing (IC3), Noida, India, pp. 473–477. https://doi.org/10.1109/IC3.2014.6897219

7. Mishra, N., Shrawankar, U., & Thakare, V. M. (2010). An overview of Hindi speech recognition. In: Proceedings of the International Conference on Computational Systems and Communication Technology, Tamil Nadu, India, p. 6, May 2010.

8. Shrishrimal, P. P., Deshmukh, R. R., & Waghmare, V. B. (2012). Indian language speech database: A review. IJCA, 47(5), 17–21. https://doi.org/10.5120/7184-9893

9. Khan, S. D. (2012). The phonetics of contrastive phonation in Gujarati. Journal of Phonetics, 40(6), 780–795. https://doi.org/10.1016/j.wocn.2012.07.001

10. Wang, D., & Chen, J. (2018). Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10), 1702–1726. https://doi.org/10.1109/TASLP.2018.2842159

11. Roy, B. K., Biswas, S. C., & Mukhopadhyay, P. (2018). Designing unicode-compliant Indic-script based institutional digital repository with special reference to Bengali. International Journal of Knowledge Content Development & Technology, 8(3), 53–67. https://doi.org/10.5865/IJKCT.2018.8.3.053

12. Sproat, R. (2003). A formal computational analysis of Indic scripts. In: International Symposium on Indic Scripts: Past and Future, Tokyo, Dec. 2003.

13. Upadhyay, N., & Karmakar, A. (2015). Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study. Procedia Computer Science, 54, 574–584. https://doi.org/10.1016/j.procs.2015.06.066

14. Upadhyay, N. (2014). An improved multi-band speech enhancement utilizing masking properties of human hearing system. In: 2014 Fifth International Symposium on Electronic System Design, Surathkal, Mangalore, India, pp. 150–155. https://doi.org/10.1109/ISED.2014.38

15. Jo, J., Yoo, H., & Park, I. (2016). Energy-efficient floating-point MFCC extraction architecture for speech recognition systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(2), 754–758.

16. Chakroborty, S., Roy, A., & Saha, G. (2006). Fusion of a complementary feature set with MFCC for improved closed set text-independent speaker identification. In: 2006 IEEE International Conference on Industrial Technology, Mumbai, India, pp. 387–390. https://doi.org/10.1109/ICIT.2006.372388

17. Das, A., Guha, S., Singh, P. K., Ahmadian, A., Senu, N., & Sarkar, R. (2020). A hybrid meta-heuristic feature selection method for identification of Indian spoken languages from audio signals. IEEE Access, 8, 181432–181449. https://doi.org/10.1109/ACCESS.2020.3028241

18. Garg, K., & Jain, G. (2016). A comparative study of noise reduction techniques for automatic speech recognition systems. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, pp. 2098–2103. https://doi.org/10.1109/ICACCI.2016.7732361

19. Alim, S. A., & Rashid, N. K. A. (2018). Some commonly used speech feature extraction algorithms. In: From Natural to Artificial Intelligence – Algorithms and Applications. https://doi.org/10.5772/intechopen.80419

20. Nehe, N. S., & Holambe, R. S. (2012). DWT and LPC based feature extraction methods for isolated word recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2012(1), 7. https://doi.org/10.1186/1687-4722-2012-7

21. Hung, J., & Fan, H. (2009). Subband feature statistics normalization techniques based on a discrete wavelet transform for robust speech recognition. IEEE Signal Processing Letters, 16(9), 806–809. https://doi.org/10.1109/LSP.2009.2024113

22. Eltiraifi, O., Elbasheer, E., & Nawari, M. (2018). A comparative study of MFCC and LPCC features for speech activity detection using deep belief network. In: 2018 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum, pp. 1–5. https://doi.org/10.1109/ICCCEEE.2018.8515821

23. Dehak, N., Torres-Carrasquillo, P., Reynolds, D., & Dehak, R. (2011). Language recognition via I-vectors and dimensionality reduction. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 857–860. https://doi.org/10.21437/Interspeech.2011-328

24. Mohammad Amini, M., & Matrouf, D. (2021). Data augmentation versus noise compensation for x-vector speaker recognition systems in noisy environments. In: 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, Netherlands, pp. 1–5. https://doi.org/10.23919/Eusipco47968.2020.9287690

25. Wu, J., Hua, Y., Yang, S., Qin, H., & Qin, H. (2019). Speech enhancement using generative adversarial network by distilling knowledge from statistical method. Applied Sciences, 9(16), 3396. https://doi.org/10.3390/app9163396

26. Pulugundla, B., Karthick, M., Kesiraju, S., & Egorova, E. (2018). BUT system for low resource Indian language ASR. In: Interspeech 2018, pp. 3182–3186. https://doi.org/10.21437/Interspeech.2018-1302

27. Gogoi, S., & Bhattacharjee, U. (2017). Vocal tract length normalization and sub-band spectral subtraction based robust Assamese vowel recognition system. In: 2017 International Conference on Computing Methodologies and Communication (ICCMC), Erode, pp. 32–35. https://doi.org/10.1109/ICCMC.2017.8282709

28. Wang, J., Zhang, J., Honda, K., Wei, J., & Dang, J. (2016). Audio-visual speech recognition integrating 3D lip information obtained from the Kinect. Multimedia Systems, 22(3), 315–323. https://doi.org/10.1007/s00530-015-0499-9

29. Varalwar, M., & Patel, N. (2006). Characteristics of Indian Languages. Bhrigus Inc.

30. Sirsa, H., & Redford, M. A. (2013). The effects of native language on Indian English sounds and timing patterns. Journal of Phonetics, 41(6), 393–406. https://doi.org/10.1016/j.wocn.2013.07.004

31. Singh, J., & Kaur, K. (2019). Speech enhancement for Punjabi language using deep neural network. In: 2019 International Conference on Signal Processing and Communication (ICSC), Noida, India, pp. 202–204. https://doi.org/10.1109/ICSC45622.2019.8938309

32. Reddy, M. G., Sen, Manjunath, K., Sarkar, P., & Rao, K. S. (2015). Automatic pitch accent contour transcription for Indian languages. In: 2015 International Conference on Computer, Communication and Control (IC4), Indore, India, pp. 1–6. https://doi.org/10.1109/IC4.2015.7375669

33. Polasi, P. K., & Sri Rama Krishna, K. (2016). Combining the evidences of temporal and spectral enhancement techniques for improving the performance of Indian language identification system in the presence of background noise. International Journal of Speech Technology, 19(1), 75–85. https://doi.org/10.1007/s10772-015-9326-0

34. Patil, A., More, P., & Sasikumar, M. (2019). Incorporating finer acoustic phonetic features in lexicon for Hindi language speech recognition. Journal of Information and Optimization Sciences, 40(8), 1731–1739. https://doi.org/10.1080/02522667.2019.1703266

35. Parikh, R. B., & Joshi, D. H. (2020). Gujarati speech recognition – A review. No. 549, p. 6.

36. Nath, S., Chakraborty, J., & Sarmah, P. (2018). Machine identification of spoken Indian languages. p. 6.

37. Mullah, H. U., Pyrtuh, F., & Singh, L. J. (2015). Development of an HMM-based speech synthesis system for Indian English language. In: 2015 International Symposium on Advanced Computing and Communication (ISACC), Silchar, India, pp. 124–127. https://doi.org/10.1109/ISACC.2015.7377327

38. Morris, A., Maier, V., & Green, P. (2004). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. https://doi.org/10.21437/Interspeech.2004-668

39. Londhe, N. D., Ahirwal, M. K., & Lodha, P. (2016). Machine learning paradigms for speech recognition of an Indian dialect. In: 2016 International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, Tamil Nadu, India, pp. 0780–0786. https://doi.org/10.1109/ICCSP.2016.7754251

40. Li, Q., Yang, Y., Lan, F., Zhu, H., Wei, Q., Qiao, F., Liu, Z., & Yang, H. (2020). MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications. IEEE Access, 8, 48720–48730. https://doi.org/10.1109/ACCESS.2020.2979799

41. Lavanya, T., Nagarajan, T., & Vijayalakshmi, P. (2020). Multi-level single-channel speech enhancement using a unified framework for estimating magnitude and phase spectra. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1315–1327. https://doi.org/10.1109/TASLP.2020.2986877

42. Kiruthiga, S., & Krishnamoorthy, K. (2012). Design issues in developing speech corpus for Indian languages – A survey. In: 2012 International Conference on Computer Communication and Informatics, Coimbatore, India, pp. 1–4. https://doi.org/10.1109/ICCCI.2012.6158831

43. Khan, M. K. S., & Al-Khatib, W. G. (2006). Machine-learning based classification of speech and music. Multimedia Systems, 12(1), 55–67. https://doi.org/10.1007/s00530-006-0034-0

44. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, p. 9.

45. Joshi, M., Iyer, M., & Gupta, N. (2010). Effect of accent on speech intelligibility in multiple speaker environment with sound spatialization. In: 2010 Seventh International Conference on Information Technology: New Generations, Las Vegas, NV, USA, pp. 338–342. https://doi.org/10.1109/ITNG.2010.11

46. Hao, X., Wen, S., Su, X., Liu, Y., Gao, G., & Li, X. (2020). Sub-band knowledge distillation framework for speech enhancement. In: Interspeech 2020, pp. 2687–2691. https://doi.org/10.21437/Interspeech.2020-1539

47. Yang, C., Xie, L., Su, C., & Yuille, A. L. (2019). Snapshot distillation: Teacher-student optimization in one generation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2854–2863.

48. Desai Vijayendra, A., & Thakar, V. K. (2016). Neural network based Gujarati speech recognition for dataset collected by in-ear microphone. Procedia Computer Science, 93, 668–675. https://doi.org/10.1016/j.procs.2016.07.259

49. Patel, H. N., & Virparia, P. V. (2011). A small vocabulary speech recognition for Gujarati. vol. 2, no. 1.

50. Pipaliahoomikaave, D. S. (2015). An approach to increase word recognition accuracy in Gujarati language. International Journal of Innovative Research in Computer and Communication Engineering, 3(9), 6442–6450.

51. Jinal, H., & Dipti, B. (2016). Speech recognition system architecture for Gujarati language. International Journal of Computer Applications, 138(12), 28–31.

52. Valaki, S., & Jethva, H. (2017). A hybrid HMM/ANN approach for automatic Gujarati speech recognition. In: Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1–5.

53. Tailor, J. H., & Shah, D. B. (2017). HMM-based lightweight speech recognition system for Gujarati language. pp. 451–461.

54. Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., & Rao, K. (2018). Multilingual speech recognition with a single end-to-end model. In: ICASSP 2018, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4904–4908.

55. Vydana, H. K., Gurugubelli, K., Raju, V. V. V., & Vuppala, A. K. (2018). An exploration towards joint acoustic modeling for Indian languages: IIIT-H submission for the Low Resource Speech Recognition Challenge for Indian languages. In: INTERSPEECH 2018, pp. 3192–3196.

56. Sailor, H. B., Siva Krishna, M. V., Chhabra, D., Patil, A. T., Kamble, M. R., & Patil, H. A. (2018). DA-IICT/IIITV system for low resource speech recognition challenge 2018. In: INTERSPEECH 2018, pp. 3187–3191.

57. Billa, J. (2018). ISI ASR system for the low resource speech recognition challenge for Indian languages. In: INTERSPEECH 2018, pp. 3207–3211.

58. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2019). CatBoost: Unbiased boosting with categorical features. arXiv:1706.09516 [cs]. Available: https://arxiv.org/abs/1706.09516. Accessed 3 Mar 2021.

59. Padmapriya, J., Sasilatha, T., Karthickmano, J. R., Aagash, G., & Bharathi, V. (2021). Voice extraction from background noise using filter bank analysis for voice communication applications. In: 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), pp. 269–273. https://doi.org/10.1109/ICICV50876.2021.9388453

60. Choudhary, N. (2021). LDC-IL: The Indian repository of resources for language technology. Language Resources and Evaluation, pp. 1–13. https://www.ldcil.org/publications.aspx

61. Bahmaninezhad, F., Wu, J., Gu, R., Zhang, S.-X., Xu, Y., Yu, M., & Yu, D. (2019). A comprehensive study of speech separation: Spectrogram vs waveform separation. arXiv:1905.07497 [cs, eess]. Available: https://arxiv.org/abs/1905.07497. Accessed 11 Nov 2021.

62. Fischer, T., Caversaccio, M., & Wimmer, W. (2021). Speech signal enhancement in cocktail party scenarios by deep learning based virtual sensing of head-mounted microphones. Hearing Research, 408, 108294. https://doi.org/10.1016/j.heares.2021.108294


Funding

This research work is not supported by any funding agency.

Author information


Contributions

The authors implemented “G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost”.

Corresponding author

Correspondence to Monika Gupta.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest or competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Gupta, M., Singh, R.K. & Singh, S. G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost. Wireless Pers Commun 125, 261–280 (2022). https://doi.org/10.1007/s11277-022-09549-6

