CONDITIONAL FASTSPEECH 2 FOR LOW-RESOURCE INDIAN ACCENT SYNTHESIS: A PHONETIC ADAPTATION APPROACH

Authors

  • Ritesh Kumar Yadav, Department of Computer Science, Babasaheb Bhimrao Ambedkar University, Lucknow Satellite Centre, Amethi, UP, India
  • Divyanshi Srivastava, Department of Computer Science, Babasaheb Bhimrao Ambedkar University, Lucknow Satellite Centre, Amethi, UP, India
  • Sweacha Verma, Department of Computer Science, Babasaheb Bhimrao Ambedkar University, Lucknow Satellite Centre, Amethi, UP, India
  • Chaudhary Surya Prakash, Department of Computer Science, Babasaheb Bhimrao Ambedkar University, Lucknow Satellite Centre, Amethi, UP, India

DOI:

https://doi.org/10.63503/c.acset.2025.25

Keywords:

Text-to-Speech, Accent Synthesis, FastSpeech 2, Indian English, Bhojpuri, Phonetic Mapping, LLM, Speech Processing

Abstract

Text-to-speech (TTS) applications, which read text aloud, are becoming increasingly crucial for accessing digital content. However, they struggle with Indian languages and regional accents (for example, when a speaker speaks English with Indian accent features). This paper builds on the FastSpeech2 system, to which we add an accent embedding layer. This layer lets the system synthesize speech in three varieties: standard Hindi, Indian English, and Bhojpuri, a low-resource language. To identify accent-appropriate pronunciations, particularly for the less common Bhojpuri ones, we employ a transformer-based, LLM-adapted tool for speech and phonetic mapping. This adaptive phonetic mapping module performs the grapheme-to-phoneme (G2P) mapping. Our tests show notable improvements, with an enhancement of 27.5 reported for the Bhojpuri accent.

Essential Points in Plain English

 The primary issue: Current text-to-speech (TTS) systems perform poorly on low-resource languages and a variety of Indian accents.

 Our method: We enhanced a TTS system (FastSpeech2) by incorporating an accent embedding layer.

 What it can do: It can now speak in Standard Hindi, Bhojpuri, and Indian English.

 How it works: To adapt pronunciations for these distinct accents, we employed a large language model (LLM).

 Result: Human listeners judged the voices as natural, and the system speaks Bhojpuri much more accurately.
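
The two ideas summarized above, accent-aware phoneme mapping and an accent embedding, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The toy lexicon, the substitution rules, the accent IDs, and all function names are hypothetical. A dictionary lookup stands in for the LLM-adapted G2P module. The accent vector is simply added to every phoneme vector, which is one common way to condition a FastSpeech 2 style encoder; the abstract does not say where the layer is placed.

```python
import random

# Hypothetical accent inventory; the IDs are illustrative, not from the paper.
ACCENTS = {"hindi": 0, "indian_english": 1, "bhojpuri": 2}

# Toy base lexicon with ARPAbet-style symbols. A real system would call a
# G2P model here; this lookup is only a stand-in.
BASE_LEXICON = {
    "think": ["TH", "IH", "NG", "K"],
    "water": ["W", "AO", "T", "ER"],
}

# Illustrative accent-specific phoneme substitutions. These are rough
# placeholders (assumed), not rules taken from the paper.
ACCENT_RULES = {
    "indian_english": {"TH": "T_dental", "T": "T_retroflex", "W": "V"},
}


def adapt_phonemes(word, accent):
    """Look up base phonemes, then apply the accent's substitution rules."""
    base = BASE_LEXICON[word.lower()]
    rules = ACCENT_RULES.get(accent, {})  # accents without rules pass through
    return [rules.get(p, p) for p in base]


def make_table(symbols, dim, rng):
    """Build a small random embedding table (untrained stand-in for learned weights)."""
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


def condition_on_accent(phoneme_vectors, accent_vector):
    """Add the same accent vector to the vector at every phoneme position."""
    return [[h + a for h, a in zip(vec, accent_vector)] for vec in phoneme_vectors]


DIM = 4
rng = random.Random(0)
all_symbols = sorted(
    {p for ps in BASE_LEXICON.values() for p in ps}
    | {t for rules in ACCENT_RULES.values() for t in rules.values()}
)
PHONEME_TABLE = make_table(all_symbols, DIM, rng)
ACCENT_TABLE = make_table(ACCENTS, DIM, rng)


def encode(word, accent):
    """Return accent-adapted phonemes and their accent-conditioned vectors."""
    phonemes = adapt_phonemes(word, accent)
    hidden = [PHONEME_TABLE[p] for p in phonemes]
    return phonemes, condition_on_accent(hidden, ACCENT_TABLE[accent])


phonemes, conditioned = encode("think", "indian_english")
print(phonemes)          # ['T_dental', 'IH', 'NG', 'K']
print(len(conditioned))  # 4
```

In a trained model both tables would be learned jointly with the rest of the network, so that the same phoneme sequence yields different acoustic features depending on the selected accent.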

Published

2025-11-24

How to Cite

Ritesh Kumar Yadav, Divyanshi Srivastava, Sweacha Verma, & Chaudhary Surya Prakash. (2025). CONDITIONAL FASTSPEECH 2 FOR LOW-RESOURCE INDIAN ACCENT SYNTHESIS: A PHONETIC ADAPTATION APPROACH. Adroid Conference Series: Engineering and Technology, 1, 230-235. https://doi.org/10.63503/c.acset.2025.25