ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Dataset and annotation framework for hyper-detailed image descriptions.
To address the challenge of creating accurate, detailed image descriptions for training vision-language models, the paper presents ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework and the dataset it produced. Unlike existing datasets, whose descriptions are short and low in granularity, IIW curates hyper-detailed descriptions through a carefully guided annotation process.
IIW's annotation methodology is seeded: machine-generated captions (PaLI-3 5B outputs) initiate each annotation, giving human annotators a draft to work from as they correct VLM hallucinations and fill in missing details on the way to a comprehensive description. Annotators focus on salient objects, considering attributes such as function, shape, size, color, texture, location, and relationships to other components.
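A minimal sketch of this seeded flow is shown below, assuming a simple record per image; all function and field names (vlm_seed_caption, human_refine, Annotation) are illustrative stand-ins, not the authors' actual tooling, and the VLM and annotator steps are stubbed so the example runs end to end.

```python
# Sketch of the seeded human-in-the-loop annotation flow (names are hypothetical).
from dataclasses import dataclass


@dataclass
class Annotation:
    image_id: str
    seed_caption: str          # machine-generated draft (e.g., from PaLI-3 5B)
    description: str = ""      # human-refined, hyper-detailed description
    hallucinations_fixed: int = 0
    details_added: int = 0


def vlm_seed_caption(image_id: str) -> str:
    """Stub for the VLM captioner; a real pipeline would call the model here."""
    return f"A photo related to {image_id}."


def human_refine(seed: str) -> tuple[str, int, int]:
    """Stub for the annotator pass that corrects hallucinations and adds detail.

    In the real framework this step is interactive; here we return the seed
    unchanged with zero edit counts so the sketch is runnable.
    """
    return seed, 0, 0


def annotate(image_ids: list[str]) -> list[Annotation]:
    records = []
    for image_id in image_ids:
        seed = vlm_seed_caption(image_id)        # 1) machine seed
        text, fixed, added = human_refine(seed)  # 2) human correction/enrichment
        records.append(Annotation(image_id, seed, text, fixed, added))
    return records


if __name__ == "__main__":
    for rec in annotate(["img_001", "img_002"]):
        print(rec.image_id, "->", rec.description)
```

The key design point the sketch captures is that the human never starts from a blank page: the machine draft fixes a floor on coverage, and the annotator's job shifts to verification and enrichment.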
When describing salient objects, annotators attend to the unique features that distinguish them, omitting default features unless they are visually significant. Text appearing in images is quoted verbatim, with attributes such as casing, alignment, font color, font type, and shadow effects noted. Descriptions of people cover visible body parts, facial visibility, activities, and attire, while avoiding assumptions about personal attributes.
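These guidelines lend themselves to a structured schema. The sketch below is a hypothetical encoding of the attributes listed above; the class and field names are assumptions for illustration, not the paper's actual annotation format.

```python
# Hypothetical annotation schema reflecting the guidelines above.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SalientObject:
    name: str
    function: Optional[str] = None   # what the object is for
    shape: Optional[str] = None
    size: Optional[str] = None
    color: Optional[str] = None
    texture: Optional[str] = None
    location: Optional[str] = None   # position within the image
    relations: list[str] = field(default_factory=list)  # ties to other objects


@dataclass
class RenderedText:
    quoted_text: str                 # quoted verbatim from the image
    casing: Optional[str] = None
    alignment: Optional[str] = None
    font_color: Optional[str] = None
    font_type: Optional[str] = None
    shadow: bool = False


example = SalientObject(
    name="ceramic mug",
    function="holds coffee",
    color="matte navy blue",
    location="left foreground, on a wooden desk",
    relations=["partially occludes a stack of books behind it"],
)
print(example)
```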
The annotation process yields detailed, coherent descriptions that effectively capture an image's visual content. IIW significantly improves upon existing datasets: fine-tuned models outperform prior work by +31%, and IIW descriptions used as text-to-image prompts yield reconstructions that closely match the original images. Moreover, IIW-produced descriptions are more compositionally rich, outperforming baselines by up to 6% across evaluation datasets and demonstrating the framework's value for vision-language understanding tasks.

