Please use this identifier to cite or link to this item:
https://hdl.handle.net/2440/138267
Type: | Thesis |
Title: | Developing a System for Free-Form Visual Question Answering |
Author: | Shevchenko, Violetta |
Issue Date: | 2022 |
School/Discipline: | School of Computer and Mathematical Sciences |
Abstract: | In the past few years, we have witnessed significant advances in the field of visual question answering (VQA). This complex task connects the areas of computer vision and natural language processing research as a step towards solving artificial intelligence (AI). A crucial feature of any AI-complete problem is the ability to scale to real-world applications. For VQA in particular, this implies that a model should answer any question about any possible image. However, it seems unlikely that any model could learn all the required knowledge from a single training set, so the use of external knowledge has become a promising direction for VQA research. In this thesis, we investigate techniques that exploit external information to improve the performance of visual question answering methods. First, we explore the benefits of unsupervised image pre-training for VQA. We create a dataset of simple images, of which only a small fraction is annotated with VQA questions. We experiment with two self-supervised approaches and show that they can be used for VQA pre-training and generalise well from little annotated data. Next, we frame VQA as a multi-task problem and complement the traditional classification objective with an additional regression loss that learns vector representations of answers. This novel learning branch allows a model to embed prior knowledge about answer semantics, and we show that the information captured in the relations between answer embeddings is important for VQA. This method not only shows clear improvements in accuracy and consistency over a range of question types but also unlocks the potential for novel answer prediction. Finally, we implement a method that embeds information from external knowledge bases into vision-and-language transformers. The method optimises an additional objective that aligns learned word representations with the matching knowledge embeddings. We evaluate the applicability of various knowledge bases to multiple downstream tasks and show that the method brings clear improvement on knowledge-demanding and general visual reasoning datasets. |
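The multi-task objective described in the abstract (a standard answer classifier complemented by a regression branch trained towards answer embeddings) can be illustrated with a minimal sketch. All names, dimensions, and the cosine-distance choice below are assumptions for illustration, not details taken from the thesis itself:

```python
import torch
import torch.nn as nn

# Hypothetical multi-task VQA head: the usual answer classifier plus a
# regression branch that predicts a vector for the answer, trained to
# match fixed pre-trained answer embeddings (e.g. word vectors).
class MultiTaskVQAHead(nn.Module):
    def __init__(self, fused_dim, num_answers, emb_dim):
        super().__init__()
        self.classifier = nn.Linear(fused_dim, num_answers)  # classification branch
        self.regressor = nn.Linear(fused_dim, emb_dim)       # answer-embedding branch

    def forward(self, fused):
        # `fused` is a batch of fused image-question features
        return self.classifier(fused), self.regressor(fused)

def multitask_loss(logits, pred_emb, target_idx, answer_embs, alpha=0.5):
    """Cross-entropy over answer classes plus a cosine loss pulling the
    predicted vector towards the ground-truth answer's embedding."""
    ce = nn.functional.cross_entropy(logits, target_idx)
    gold = answer_embs[target_idx]  # look up gold answer vectors
    cos = 1.0 - nn.functional.cosine_similarity(pred_emb, gold).mean()
    return ce + alpha * cos

# Toy usage with random features and embeddings.
head = MultiTaskVQAHead(fused_dim=16, num_answers=10, emb_dim=8)
fused = torch.randn(4, 16)              # batch of 4 fused features
answer_embs = torch.randn(10, 8)        # one embedding per candidate answer
logits, pred = head(fused)
loss = multitask_loss(logits, pred, torch.tensor([1, 3, 0, 7]), answer_embs)
```

Because the regression branch maps into the answer-embedding space rather than onto a fixed label set, a prediction can be matched by nearest neighbour against embeddings of answers never seen at training time, which is what makes novel answer prediction possible.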
Advisor: | Dick, Anthony; van den Hengel, Anton; Teney, Damien |
Dissertation Note: | Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2022 |
Keywords: | Visual question answering; image understanding; natural language understanding; multi-modal reasoning |
Provenance: | This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals |
Appears in Collections: | Research Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Shevchenko2022_PhD.pdf | Thesis | 14.89 MB | Adobe PDF