Vol. 19, No. 6, June 30, 2025
DOI: 10.3837/tiis.2025.06.008
Abstract
Visual question answering (VQA) is the task of predicting the correct answer to a natural-language question about an image. Although the field has made significant progress in recent years, complex VQA models have consistently performed poorly on counting tasks, which are relatively difficult and have few samples. To address this issue, this article proposes a training-strategy-based method that assists VQA model learning in order to improve performance on counting tasks. The algorithm has three steps. First, in the data preprocessing stage, the dataset is divided by task type into five types of tasks, including counting tasks, and the information entropy of the answer distribution is calculated for each task type. Second, the task types are sorted by information entropy, and this ordering is used as the model training order, with the aim of improving performance on counting tasks. Finally, in the multimodal feature fusion stage, the module results are fused with the basic VQA model, and a parallel classifier is designed to reduce the influence of other task types on counting questions during model prediction. Experiments on the VQA-CP v2 dataset show that the proposed method significantly improves the recognition and answer accuracy of counting questions. The gain is largest on UpDn, the most widely used basic model, where accuracy improves by up to 32.5%.
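The entropy-based ordering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `answer_entropy` and `training_order`, the `descending` parameter, and the toy samples are all hypothetical. The abstract does not say whether task types are trained from low to high entropy or the reverse, so the sort direction is left as a parameter, and ascending order is an assumption chosen only for the example.

```python
import math
from collections import Counter, defaultdict


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def training_order(samples, descending=False):
    """Group (task_type, answer) pairs by task type and sort the types by entropy.

    The sort direction is an assumption; the abstract does not specify it.
    """
    by_type = defaultdict(list)
    for task_type, answer in samples:
        by_type[task_type].append(answer)
    entropies = {t: answer_entropy(a) for t, a in by_type.items()}
    order = sorted(entropies, key=entropies.get, reverse=descending)
    return order, entropies


# Toy data, made up for illustration only (not taken from VQA-CP v2).
samples = [
    ("yes/no", "yes"), ("yes/no", "no"), ("yes/no", "yes"), ("yes/no", "no"),
    ("counting", "1"), ("counting", "2"), ("counting", "3"), ("counting", "4"),
    ("color", "red"), ("color", "red"), ("color", "red"), ("color", "red"),
]

order, entropies = training_order(samples)
```

In this toy data the task type with a single repeated answer has zero entropy, while the counting type, with four equally frequent answers, has the highest. A training loop would then visit the task types in the returned `order`.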
Cite this article
[IEEE Style]
F. Wang, T. Zhou, J. Zhao, "The Effect of Sequential Training on Visual Question Answering Counting Task," KSII Transactions on Internet and Information Systems, vol. 19, no. 6, pp. 1908-1921, 2025. DOI: 10.3837/tiis.2025.06.008.
[ACM Style]
Feng Wang, Tong Zhou, and Jia Zhao. 2025. The Effect of Sequential Training on Visual Question Answering Counting Task. KSII Transactions on Internet and Information Systems, 19, 6, (2025), 1908-1921. DOI: 10.3837/tiis.2025.06.008.
[BibTeX Style]
@article{tiis:102777, title="The Effect of Sequential Training on Visual Question Answering Counting Task", author="Feng Wang and Tong Zhou and Jia Zhao", journal="KSII Transactions on Internet and Information Systems", DOI={10.3837/tiis.2025.06.008}, volume={19}, number={6}, year="2025", month={June}, pages={1908-1921}}