References
[1]. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. https://arxiv.org/abs/2203.02155
[2]. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
[3]. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290. https://arxiv.org/abs/2305.18290
[4]. Du, Y., Li, Z., Cheng, P., Chen, Z., Xie, Y., Wan, X., & Gao, A. (2025). Simplify RLHF as Reward-Weighted SFT: A Variational Method. arXiv preprint arXiv:2502.11026. https://arxiv.org/abs/2502.11026
[5]. Iftikhar, Z., Xiao, A., Ransom, S., Huang, J., & Suresh, H. (2025). How LLM Counselors Violate Ethical Standards in Mental Health Practice: A Practitioner-Informed Framework. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(2), 1311-1323. https://doi.org/10.1609/aies.v8i2.36632
[6]. Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008.
[7]. Wang, X., Zhou, Y., & Zhou, G. (2025). The Application and Ethical Implication of Generative AI in Mental Health: Systematic Review. JMIR Mental Health, 12, e70610. https://doi.org/10.2196/70610
[8]. Xie, H., Chen, Y., Xing, X., Lin, J., & Xu, X. (2024). PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling. arXiv preprint arXiv:2412.13660. https://arxiv.org/abs/2412.13660
[9]. Kim, Y., Choi, C. H., Cho, S., Sohn, J. Y., & Kim, B. H. (2025). Aligning large language models for cognitive behavioral therapy: a proof-of-concept study. Frontiers in Psychiatry, 16, 1583739. https://doi.org/10.3389/fpsyt.2025.1583739
[10]. Weidinger, L., Mellor, J. F., Rauh, M., Griffin, C., Uesato, J., Huang, P., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S. M., Hawkins, W. T., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W. S., Legassick, S., Irving, G., & Gabriel, I. (2021). Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359
[11]. Lin, S. C., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
[12]. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., Qiu, Z., Quan, S., & Wang, Z. (2024). Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115. https://arxiv.org/abs/2412.15115
[13]. Chen, Y., Xing, X., Lin, J., Zheng, H., Wang, Z., Liu, Q., & Xu, X. (2023). SoulChat: Improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. arXiv preprint arXiv:2311.00273. https://arxiv.org/abs/2311.00273