The Effect of Training Data Set Composition on the Performance of a Neural Image Caption Generator
Army Research Laboratory Adelphi United States
This research seeks to determine how many images of a particular object a training data set must contain to achieve caption quality saturation in neural image caption generators. Understanding the relationship between caption quality and the size and composition of training data sets could improve the efficiency of model training and lead to the development of optimized data sets for different tasks. We hypothesize that increasing a neural network's exposure to an object will improve its performance up to a point, after which caption quality will saturate, and that this saturation point may vary with the object's visual homogeneity. Using an existing code base, Neuraltalk2, we trained several image captioning models on subsets of the Microsoft Common Objects in Context (MS COCO) data set, each of which contained a precise number of images of some common object categories (e.g., cat and pizza). Performance at different levels of exposure to the selected objects was compared using two automated scoring metrics: the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and Consensus-Based Image Description Evaluation (CIDEr). The data indicate that increasing the number of images of a particular object in the training data set improved performance up to 1,500 images, but not beyond that.
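The subset construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_subset`, the toy image-to-category mapping, and the choice to keep every image that lacks the target object are all assumptions made for illustration. The abstract states only that each subset contained a precise number of images of the selected categories.

```python
import random


def build_subset(image_categories, target, n_target, seed=0):
    """Pick a training subset with exactly n_target images of `target`.

    image_categories: dict mapping image id -> set of category names
                      (a hypothetical stand-in for parsed annotations).
    Assumption: all images that do not contain the target are kept.
    """
    with_target = sorted(
        i for i, cats in image_categories.items() if target in cats
    )
    without_target = sorted(
        i for i, cats in image_categories.items() if target not in cats
    )
    if n_target > len(with_target):
        raise ValueError("not enough images containing the target category")

    # Seeded sampling so the same subset can be rebuilt for each experiment.
    rng = random.Random(seed)
    chosen = rng.sample(with_target, n_target)
    return sorted(chosen + without_target)


# Toy data: five images, three of which contain a cat.
toy = {
    1: {"cat"},
    2: {"cat", "sofa"},
    3: {"pizza"},
    4: {"dog"},
    5: {"cat"},
}
subset = build_subset(toy, "cat", 2)
```

Repeating the call with increasing values of `n_target` would give a series of training sets that differ only in their exposure to the target object, which is the quantity varied in the study.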
- Electrical and Electronic Equipment