Abstract

Visual Question Answering (VQA) is a challenging research area in which a model must understand image semantics together with the asked question in order to infer the correct answer. The ability of a VQA model to generalize to new questions about images that were not seen during training is called zero-shot capability, and good evaluation metrics are needed to compensate for dataset bias. In this thesis, the TDIUC dataset is redistributed to test this capability, and suitable evaluation metrics are applied. In addition, since transformer models for the VQA task require long training times, substituting self-attention layers with FNet sublayers improves training speed by 24% and testing speed by 12.7%, at a limited accuracy cost of 5.61%.
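The FNet substitution mentioned above replaces each self-attention sublayer with a parameter-free Fourier mixing sublayer. As a minimal sketch (a NumPy illustration of the core FNet mixing operation, not the thesis implementation), the sublayer applies a 2D discrete Fourier transform over the sequence and hidden dimensions and keeps only the real part:

```python
import numpy as np

def fnet_mixing(x: np.ndarray) -> np.ndarray:
    """FNet-style token mixing for one sublayer.

    x: array of shape (seq_len, d_model) holding token embeddings.
    Applies an FFT along the hidden dimension and then along the
    sequence dimension, returning the real part, as in the FNet paper.
    No learned parameters are involved, which is why it trains faster
    than self-attention.
    """
    return np.fft.fft2(x).real

# Example: mix a toy sequence of 4 tokens with 8 hidden features.
tokens = np.random.default_rng(0).standard_normal((4, 8))
mixed = fnet_mixing(tokens)
# The output keeps the input shape and is purely real-valued.
```

In a full transformer block this mixing output would still pass through the usual residual connection, layer normalization, and feed-forward sublayer; only the attention computation is swapped out.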