Yang Ai
- Special Associate Researcher
- Supervisor of Master's Candidates
- Name (English): Yang Ai
- Name (Pinyin): aiyang
- Education Level: Postgraduate (Doctoral)
- Degree: Doctor
- Professional Title: Special Associate Researcher
- Alma Mater: 图书馆VIP
- Teacher College: School of Information Science and Technology
- Discipline: Information and Communication Engineering
Profile
Yang Ai is a Special Associate Researcher in the Department of Electronic Engineering and Information Science, School of Information Science and Technology, 图书馆VIP. His research interests include speech synthesis, speech enhancement, speech separation, audio coding, and audio quality assessment. He has published more than 40 papers in leading venues in the speech field, including the journal IEEE/ACM TASLP and the conferences ICASSP and Interspeech.
Education
Sep. 2012 – Jun. 2016: B.Eng. in Communication Engineering, Xiamen University
Sep. 2016 – Jun. 2021: Ph.D. in Information and Communication Engineering, 图书馆VIP (advisor: Prof. Zhen-Hua Ling)
Research and Academic Experience
Feb. 2020 – Aug. 2020: Visiting Ph.D. student (joint training), National Institute of Informatics, Japan
Jul. 2021 – Mar. 2022: Lecturer, College of Electronic Countermeasures, National University of Defense Technology
Apr. 2022 – Dec. 2023: Postdoctoral Researcher, 图书馆VIP
Jan. 2024 – present: Special Associate Researcher, 图书馆VIP
Research Projects
As Principal Investigator
National Natural Science Foundation of China (NSFC), Young Scientists Fund, "Anti-wrapping phase spectrum prediction for speech generation", Jan. 2024 – Dec. 2026, CNY 300,000
Anhui Provincial Department of Science and Technology, Anhui Provincial Natural Science Foundation Youth Project, "High-quality and high-efficiency auxiliary speech enhancement combining phase prediction", Sep. 2023 – Aug. 2025, CNY 80,000
图书馆VIP, Youth Innovation Fund, "High-efficiency and highly robust neural vocoders", Jan. 2023 – Dec. 2024, CNY 90,000
As Participant
Ministry of Science and Technology (MOST), sub-task of a MOST key technologies project, "Intelligent speech porting models and algorithm toolkit", Jan. 2022 – Dec. 2024, CNY 5,000,000 (ranked 2/34)
NSFC, Joint Fund Project, "Perception-driven fine-grained speech representation disentanglement and cross-modal controllable speech synthesis", Jan. 2024 – Dec. 2027, CNY 2,600,000 (ranked 7/21)
Chinese Academy of Sciences, Strategic Priority Research Program (Category C) project, "Key technologies for multilingual speech synthesis", Jan. 2020 – Dec. 2022, CNY 16,320,000 (ranked 2/35)
MOST, National Key R&D Program project, "Key technologies for multilingual speech processing in Winter Olympics scenarios", Oct. 2019 – Jun. 2022, CNY 3,380,000 (ranked 3/31)
NSFC, General Program, "Neural vocoders for speech synthesis", Jan. 2019 – Dec. 2022, CNY 630,000 (ranked 7/8)
Publications
2022 and Later
First-Author and Corresponding-Author Papers
Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling*, “APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3256–3269, 2024.
Yang Ai and Zhen-Hua Ling*, “Low-latency neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses for speech generation tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2283–2296, 2024.
Yang Ai and Zhen-Hua Ling*, “APNet: An all-frame-level neural vocoder incorporating direct prediction of amplitude and phase spectra,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2145–2157, 2023.
Yang Ai*, Zhen-Hua Ling, Wei-Lu Wu, and Ang Li, “Denoising-and-dereverberation hierarchical neural vocoder for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2036–2048, 2022.
Yang Ai, Ye-Xin Lu and Zhen-Hua Ling*, “Long-frame-shift neural speech phase prediction with spectral continuity enhancement and interpolation error compensation,” IEEE Signal Processing Letters, vol. 30, pp. 1097-1101, 2023.
Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, and Zhen-Hua Ling*, “A low-bitrate neural audio codec framework with bandwidth reduction and recovery for high-sampling-rate waveforms,” in Proc. Interspeech, 2024, pp. 1765-1769.
Yang Ai and Zhen-Hua Ling*, “Neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses,” in Proc. ICASSP, 2023, pp. 1-5.
Rui-Chen Zheng, Yang Ai*, and Zhen-Hua Ling, “Speech reconstruction from silent lip and tongue articulation by diffusion models and text-guided pseudo target generation,” in Proc. ACM MM, 2024, pp. 6559-6568.
Ye-Xin Lu, Yang Ai*, Zheng-Yan Sheng, and Zhen-Hua Ling, “Multi-stage speech bandwidth extension with flexible sampling rates control,” in Proc. Interspeech, 2024, pp. 2270-2274.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “BiVocoder: A bidirectional neural vocoder integrating feature extraction and waveform generation,” in Proc. Interspeech, 2024, pp. 3894-3898.
Fei Liu, Yang Ai*, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, and Zhen-Hua Ling, “Stage-wise and prior-aware neural speech phase prediction,” in Proc. SLT, 2024, pp. 648-654.
Xiao-Hang Jiang, Yang Ai*, Rui-Chen Zheng, Hui-Peng Du, Ye-Xin Lu, and Zhen-Hua Ling, “MDCTCodec: A lightweight MDCT-based neural audio codec towards high sampling rate and low bitrate scenarios,” in Proc. SLT, 2024, pp. 550-557.
Yu-Fei Shi, Yang Ai*, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling, “Pitch-and-spectrum-aware singing quality assessment with bias correction and model fusion,” in Proc. SLT, 2024, pp. 821-827.
Hui-Peng Du, Yang Ai*, Rui-Chen Zheng, and Zhen-Hua Ling, “APCodec+: A spectrum-coding-based high-fidelity and high-compression-rate neural audio codec with staged training paradigm,” accepted by ISCSLP, 2024.
Yu-Fei Shi, Ye-Xin Lu, Yang Ai*, Hui-Peng Du, and Zhen-Hua Ling, “SAMOS: A neural MOS prediction model leveraging semantic representations and acoustic features,” accepted by ISCSLP, 2024.
Xiao-Hang Jiang, Hui-Peng Du, Yang Ai*, Ye-Xin Lu, and Zhen-Hua Ling, “ESTVocoder: An excitation-spectral-transformed neural vocoder conditioned on mel spectrogram,” accepted by NCMMSC, 2024.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “A neural denoising vocoder for clean waveform generation from noisy mel-spectrogram based on amplitude and phase predictions,” accepted by NCMMSC, 2024.
Rui-Chen Zheng, Yang Ai*, and Zhen-Hua Ling, “Speech reconstruction from silent tongue and lip articulation by pseudo target generation and domain adversarial training,” in Proc. ICASSP, 2023, pp. 1-5.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra,” in Proc. Interspeech, 2023, pp. 3834-3838.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “APNet2: High-quality and high-efficiency neural vocoder with direct prediction of amplitude and phase spectra,” in Proc. NCMMSC, 2023, pp. 66-80.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “Source-filter-based generative adversarial neural vocoder for high fidelity speech synthesis,” in Proc. NCMMSC, 2022, pp. 68-80.
Other Papers
Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1430–1444, 2024.
Kang-Di Mei, Zhao-Ci Liu, Hui-Peng Du, Heng-Yu Li, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Considering temporal connection between turns for conversational speech synthesis,” in Proc. ICASSP, 2024, pp. 11426-11430.
Heng-Yu Li, Kang-Di Mei, Zhao-Ci Liu, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Refining self-supervised learnt speech representation using brain activations,” in Proc. Interspeech, 2024, pp. 1480-1484.
Yuan Jiang, Shun Bao, Ya-Jun Hu, Li-Juan Liu, Guo-Ping Hu, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. ICDSP, 2024, pp. 112-117.
Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, and Zhen-Hua Ling, “Face-driven zero-shot voice conversion with memory-based face-voice alignment,” in Proc. ACM MM, 2023, pp. 8443-8452.
Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation,” in Proc. Interspeech, 2023, pp. 844-848.
Zheng-Yan Sheng, Yang Ai, and Zhen-Hua Ling, “Zero-shot personalized lip-to-speech synthesis with face image based voice control,” in Proc. ICASSP, 2023, pp. 1-5.
Hao-Chen Wu, Zhu-Hai Li, Lu-Zhen Xu, Zhen-Tao Zhang, Wen-Ting Zhao, Bin Gu, Yang Ai, Ye-Xin Lu, Jie Zhang, Zhen-Hua Ling, and Wu Guo, “The USTC-NERCSLIP system for the track 1.2 of audio deepfake detection (ADD 2023) challenge,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023, pp. 119-124.
Hao-Jian Lin, Yang Ai, and Zhen-Hua Ling, “A light CNN with split batch normalization for spoofed speech detection using data augmentation,” in Proc. APSIPA, 2022, pp. 1684-1689.
Before 2022
Papers
Yang Ai and Zhen-Hua Ling, “A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 839–851, 2020.
Yang Ai, Hong-Chuan Wu, and Zhen-Hua Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in Proc. ICASSP, 2018, pp. 5659-5663.
Yang Ai, Jing-Xuan Zhang, Liang Chen, and Zhen-Hua Ling, “DNN-based spectral enhancement for neural waveform generators with low-bit quantization,” in Proc. ICASSP, 2019, pp. 7025-7029.
Yang Ai and Zhen-Hua Ling, “Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders,” in Proc. Interspeech, 2020, pp. 190-194.
Yang Ai, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Reverberation modeling for source-filter-based neural vocoder,” in Proc. Interspeech, 2020, pp.3560-3564.
Yang Ai, Hao-Yu Li, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Denoising-and-dereverberation hierarchical neural vocoder for robust waveform generation,” in Proc. SLT, 2021, pp. 477-484.
Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai, “Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 883–894, 2018.
Yuan Jiang, Ya-Jun Hu, Li-Juan Liu, Hong-Chuan Wu, Zhi-Kun Wang, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “The USTC system for Blizzard Challenge 2019,” in Blizzard Challenge Workshop, 2019.
Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing voice synthesis using deep autoregressive neural networks for acoustic modeling,” in Proc. Interspeech, 2019, pp. 2593–2597.
Qiu-Chen Huang, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. APSIPA, 2020, pp. 815-820.
Hao-Yu Li, Yang Ai, and Junichi Yamagishi, “Enhancing low-quality voice recordings using disentangled channel factor and neural waveform model,” in Proc. SLT, 2021, pp. 2452-2456.
Chang Liu, Yang Ai, and Zhen-Hua Ling, “Phase spectrum recovery for enhancing low-quality speech captured by laser microphones,” in Proc. ISCSLP, 2021, pp. 1-5.
Kun Shao, Jun-An Yang, Yang Ai, Hui Liu, and Yu Zhang, “BDDR: An effective defense against textual backdoor attacks,” Computers & Security, vol. 110, Article 102433, 2021.
Patents
Granted Patents
Yang Ai; Zhen-Hua Ling; Neural network vocoder training method based on short-time spectral consistency, 2024-03-29, China, ZL 2020 1 1482467.6
Pending Patents
Yang Ai; Zhen-Hua Ling; Method for predicting phase with a parallel estimation architecture network trained using anti-wrapping losses, 2022-11-25, China, 202211489291.6
Yang Ai; Zhen-Hua Ling; Vocoder construction method, speech synthesis method, and related apparatus, 2023-01-16, China, 202310081092.X
Yang Ai; Ye-Xin Lu; Zhen-Hua Ling; Long-frame-shift speech phase spectrum prediction method and apparatus, 2023-06-19, China, 202310737506.X
Yang Ai; Zheng-Yan Sheng; Rui-Chen Zheng; Ye-Xin Lu; Xiao-Hang Jiang; Zhen-Hua Ling; Speech communication system and method, 2023-11-13, China, 202311498981.2
Yang Ai; Xiao-Hang Jiang; Rui-Chen Zheng; Ye-Xin Lu; Zhen-Hua Ling; Audio processing method, apparatus, storage medium, and electronic device, 2024-04-11, China, 202410438079X
Ye-Xin Lu; Yang Ai; Zhen-Hua Ling; Speech enhancement method and apparatus, 2023-05-17, China, 2023105730480
Ye-Xin Lu; Yang Ai; Hui-Peng Du; Zhen-Hua Ling; Speech waveform extension method, apparatus, device, and storage medium, 2024-01-10, China, 2024100399941
Honors and Awards
First place, high-sampling-rate vocoder track, Interspeech 2024 Discrete Speech Challenge (first contributor)
First place, Track 2, VoiceMOS Challenge 2024 (advisor and corresponding author)
Best Paper Award, the 18th National Conference on Man-Machine Speech Communication (NCMMSC 2023) (advisor and corresponding author)
First place, Track 1.2, Audio Deepfake Detection (ADD) Challenge 2023
Second Prize, 2022 Industry-University-Research Cooperation Innovation Achievement Award