VAR-Safe: Safety-Gated Variational Alignment for Chinese digital psychological counseling
Research Article
Open Access
CC BY


Xinyu Song 1*, Zhengjie Gao 2
1 School of Electronic Information Engineering, Geely University of China
2 School of Electronic Information Engineering, Geely University of China
*Corresponding author: songxinyu@guc.edu.cn
Published on 30 October 2025

Abstract

Large Language Models (LLMs) show immense potential for Chinese digital psychological counseling services. However, their alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), face challenges including implementation complexity, high computational cost, and training instability. These issues are particularly critical in the high-safety context of psychological counseling, where model hallucination and ethical risks urgently need to be addressed. Guided by the safety-first principle, this paper proposes the Safety-Gated Variational Alignment (VAR-Safe) method, built upon the Variational Alignment (VAR) technique. VAR-Safe introduces a safety-gated reward transformation mechanism that converts the professional-ethics and harmlessness constraints encoded in the reward model into hard penalty terms, thereby more effectively suppressing harmful or unprofessional hallucinated responses. From the perspective of variational inference, VAR-Safe transforms the complex RLHF objective into an offline, safety-driven, re-weighted Supervised Fine-Tuning (SFT) formulation. This ensures that all weights during optimization remain positive, fundamentally enhancing the robustness and convergence stability of alignment training. We trained a Chinese digital psychological counselor on the Chinese SoulChat corpus. Experimental results show that, while significantly improving the model's empathy and professionalism, VAR-Safe reduces the critical safety metric, the rate of professional-knowledge hallucination, to a level much lower than that of the baseline models, demonstrating its suitability for high-safety applications.
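
To make the abstract's description concrete, the following minimal PyTorch sketch illustrates one possible form of a safety-gated, reward-weighted SFT objective. The temperature beta, the penalty coefficient lambda_safety, the binary unsafe flag, and all function names are illustrative assumptions rather than the paper's exact formulation: reward-model scores are shifted by a hard safety penalty before an exponential transform, so flagged responses are sharply down-weighted while every training weight stays strictly positive.

# Illustrative sketch only; symbols (beta, lambda_safety, unsafe flags) are
# assumptions consistent with the abstract, not the authors' implementation.
import torch
import torch.nn.functional as F

def safety_gated_weights(rewards, unsafe_flags, beta=1.0, lambda_safety=5.0):
    """Map reward-model scores to strictly positive sample weights.

    Responses flagged as unsafe receive a hard penalty (lambda_safety)
    before the exponential transform, driving their weights toward zero
    while keeping every weight positive.
    """
    penalized = rewards - lambda_safety * unsafe_flags.float()
    return torch.exp(penalized / beta)

def reweighted_sft_loss(logits, labels, weights, ignore_index=-100):
    """Per-sequence cross-entropy combined with per-sample positive weights."""
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels,
        ignore_index=ignore_index, reduction="none")       # (batch, seq_len)
    mask = (labels != ignore_index).float()
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (weights * per_seq).sum() / weights.sum()

if __name__ == "__main__":
    # Dummy data standing in for counselor responses scored by a reward model.
    batch, seq_len, vocab = 4, 8, 100
    logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    labels = torch.randint(0, vocab, (batch, seq_len))
    rewards = torch.tensor([1.2, 0.3, 0.9, -0.5])   # reward-model scores
    unsafe = torch.tensor([0, 1, 0, 0])              # safety-gate flags
    w = safety_gated_weights(rewards, unsafe)
    loss = reweighted_sft_loss(logits, labels, w)
    loss.backward()
    print(w, loss.item())

Because the weights enter an ordinary cross-entropy loss and never change sign, the optimization behaves like standard SFT with per-sample importance weights, which is consistent with the stability argument made in the abstract.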

Keywords:

LLMs, psychological counseling, reinforcement learning


References

[1]. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. https://arxiv.org/abs/2203.02155

[2]. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

[3]. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290. https://arxiv.org/abs/2305.18290

[4]. Du, Y., Li, Z., Cheng, P., Chen, Z., Xie, Y., Wan, X., & Gao, A. (2025). Simplify RLHF as Reward-Weighted SFT: A Variational Method. arXiv preprint arXiv:2502.11026. https://arxiv.org/abs/2502.11026

[5]. Iftikhar, Z., Xiao, A., Ransom, S., Huang, J., & Suresh, H. (2025). How LLM Counselors Violate Ethical Standards in Mental Health Practice: A Practitioner-Informed Framework. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(2), 1311-1323. https://doi.org/10.1609/aies.v8i2.36632

[6]. Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008), Chicago, Illinois, USA, July 13-17, 2008.

[7]. Wang, X., Zhou, Y., & Zhou, G. (2025). The Application and Ethical Implication of Generative AI in Mental Health: Systematic Review. JMIR Mental Health, 12, e70610. https://doi.org/10.2196/70610

[8]. Xie, H., Chen, Y., Xing, X., Lin, J., & Xu, X. (2024). PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling. arXiv preprint arXiv:2412.13660. https://arxiv.org/abs/2412.13660

[9]. Kim, Y., Choi, C. H., Cho, S., Sohn, J. Y., & Kim, B. H. (2025). Aligning large language models for cognitive behavioral therapy: a proof-of-concept study. Frontiers in Psychiatry, 16, 1583739. https://doi.org/10.3389/fpsyt.2025.1583739

[10]. Weidinger, L., Mellor, J. F., Rauh, M., Griffin, C., Uesato, J., Huang, P., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S. M., Hawkins, W. T., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W. S., Legassick, S., Irving, G., & Gabriel, I. (2021). Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359

[11]. Lin, S. C., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. Annual Meeting of the Association for Computational Linguistics.

[12]. Yang, Q. A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., Qiu, Z., Quan, S., & Wang, Z. (2024). Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115. https://arxiv.org/abs/2412.15115

[13]. Chen, Y., Xing, X., Lin, J., Zheng, H., Wang, Z., Liu, Q., & Xu, X. (2023). SoulChat: Improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. arXiv preprint arXiv:2311.00273. https://arxiv.org/abs/2311.00273

Cite this article

Song, X., & Gao, Z. (2025). VAR-Safe: Safety-Gated Variational Alignment for Chinese digital psychological counseling. Advances in Engineering Innovation, 16(10), 21-27.

Data availability

The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.

About volume

Journal: Advances in Engineering Innovation

Volume number: Vol.16
Issue number: Issue 10
ISSN: 2977-3903 (Print) / 2977-3911 (Online)