Research Article
Open Access
CC BY

Detection and Mitigation of Factual Hallucinations in Large Language Models: A Comparative Review of the Timing and Effectiveness of External Retrieval, Post-hoc Verification, and Evaluation Methods

Bingchen Zhou 1*
1 School of Artificial Intelligence and Advanced Computing, Xi’an Jiaotong-Liverpool University, Suzhou, China
*Corresponding author: Bingchen.Zhou22@student.xjtlu.edu.cn
Published on 19 November 2025
ACE Vol.207
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-539-4
ISBN (Online): 978-1-80590-540-0

Abstract

Large language models (LLMs) excel at reasoning and generation but remain susceptible to factual hallucinations. This survey categorizes recent progress by the timing of intervention: first, retrieval augmentation and explicit reasoning during generation; and second, post-hoc verification and self-correction after generation. For the former, this survey reviews IRCoT, SELF-RAG, ReAct, and Atlas, highlighting how interleaving retrieval with chain-of-thought reasoning can reduce multi-step reasoning errors and improve answer grounding in knowledge-intensive question answering. For the latter, this study examines RARR, Chain-of-Verification (CoVe), CRITIC, and Reflexion, which retrieve supporting evidence, generate verification questions, or revise outputs through iterative critique while minimizing changes to the original style. This work also summarizes detection and evaluation practices (e.g., SelfCheckGPT, FActScore) that quantify attribution, support, and completeness. Across these approaches, this study analyzes effectiveness, computational cost, latency, and robustness to noisy retrieval, identifying failure modes such as retrieval dependency, ungrounded verification, and domain-shift gaps. This survey offers practical guidelines: for constrained question answering and multi-step reasoning, a combined retrieval-and-reasoning approach is most effective, whereas post-hoc verification is better suited to long-form generation, reporting, and outputs that cite external sources. Finally, this work identifies open problems: a unified, cost-aware controller that can adjust intervention frequency; stronger evidence attribution; richer benchmarks beyond open-domain question answering; and a clearer characterization of the efficiency-factuality trade-off for more reliable and scalable hallucination suppression.
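
To make the surveyed intervention points concrete, the sketch below outlines, in simplified Python, a during-generation retrieve-and-reason loop, a post-hoc verification pass, and a sampling-based consistency check. It is an illustrative sketch only: the llm and retrieve callables are hypothetical placeholders for a language model API and a document retriever, and none of the functions reproduce the actual implementations of IRCoT, SELF-RAG, RARR, CoVe, or SelfCheckGPT.

from typing import Callable, List

LLM = Callable[[str], str]               # prompt -> completion
Retriever = Callable[[str], List[str]]   # query -> evidence passages


def interleaved_retrieve_and_reason(question: str, llm: LLM,
                                    retrieve: Retriever,
                                    max_steps: int = 3) -> str:
    # During-generation intervention (IRCoT-style idea): alternate one
    # reasoning step with one retrieval call so that each new thought is
    # grounded in freshly retrieved evidence.
    evidence: List[str] = retrieve(question)
    thoughts: List[str] = []
    for _ in range(max_steps):
        prompt = (f"Question: {question}\n"
                  f"Evidence: {' '.join(evidence)}\n"
                  f"Reasoning so far: {' '.join(thoughts)}\n"
                  "Next reasoning step (or 'ANSWER: ...' if done):")
        step = llm(prompt)
        thoughts.append(step)
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        evidence.extend(retrieve(step))  # the new thought becomes the next query
    return llm(f"Question: {question}\n"
               f"Evidence: {' '.join(evidence)}\n"
               "Final answer:")


def post_hoc_verify(question: str, draft: str, llm: LLM,
                    retrieve: Retriever) -> str:
    # Post-generation intervention (CoVe/RARR-style idea): generate
    # verification questions about the draft, answer each one from retrieved
    # evidence only, then revise the draft while preserving its style.
    plan = llm(f"List factual verification questions for this answer:\n{draft}")
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]
    findings: List[str] = []
    for vq in verification_questions:
        evidence = retrieve(vq)
        findings.append(llm(f"Answer using only this evidence: {evidence}\nQ: {vq}"))
    return llm(f"Question: {question}\n"
               f"Draft answer: {draft}\n"
               f"Verification findings: {findings}\n"
               "Revise the draft, keeping its style and fixing unsupported claims:")


def self_consistency_score(sentence: str, resampled_outputs: List[str],
                           llm: LLM) -> float:
    # Detection (SelfCheckGPT-style idea): a sentence that most independently
    # re-sampled answers fail to support is more likely to be hallucinated.
    votes = [llm(f"Does this passage support the sentence?\n"
                 f"Passage: {sample}\nSentence: {sentence}\n"
                 "Answer yes or no:").lower().startswith("yes")
             for sample in resampled_outputs]
    return sum(votes) / max(len(votes), 1)

Plugged into real model and retriever backends, these three stages mirror the generate-with-retrieval, verify-and-revise, and detect-by-consistency pipeline contrasted in this survey.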

Keywords:

Large language models, Factual hallucinations, Retrieval enhancement, Verification and self-correction

References

[1]. Li, J., Chen, J., Ren, R., Cheng, X., Zhao, W. X., Nie, J.-Y., & Wen, J.-R. (2024). The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 10879–10899.

[2]. Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, 10014–10037.

[3]. Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv: 2310.11511.

[4]. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), Kigali, Rwanda.

[5]. Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., & Grave, E. (2023). Atlas: Few-shot Learning with Retrieval-Augmented Language Models. Journal of Machine Learning Research, 24(251), 1–43.

[6]. Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V. Y., Lao, N., Lee, H., Juan, D.-C., & Guu, K. (2023). RARR: Researching and Revising What Language Models Say, Using Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, 16477–16508.

[7]. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems 36 (NeurIPS), New Orleans, LA, USA.

[8]. Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., & Weston, J. (2024). Chain-of-Verification Reduces Hallucination in Large Language Models. Findings of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 3563–3578.

[9]. Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., & Chen, W. (2024). CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. International Conference on Learning Representations (ICLR).

[10]. Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 9004–9017.

[11]. Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-T., Koh, P. W., Iyyer, M., Zettlemoyer, L., & Hajishirzi, H. (2023). FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long-Form Text Generation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 12076–12100.

[12]. Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., & Zhou, D. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. International Conference on Learning Representations (ICLR), Vienna, Austria.

Cite this article

Zhou, B. (2025). Detection and Mitigation of Factual Hallucinations in Large Language Models: A Comparative Review of the Timing and Effectiveness of External Retrieval, Post-hoc Verification, and Evaluation Methods. Applied and Computational Engineering, 207, 83-90.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

About volume

Volume title: Proceedings of CONF-SPML 2026 Symposium: The 2nd Neural Computing and Applications Workshop 2025

ISBN: 978-1-80590-539-4 (Print) / 978-1-80590-540-0 (Online)
Editors: Marwan Omar, Guozheng Rao
Conference date: 21 December 2025
Series: Applied and Computational Engineering
Volume number: Vol.207
ISSN: 2755-2721 (Print) / 2755-273X (Online)