Recent research on LAION-style datasets has found that they carry insufficient information: the alt-texts were disastrously lacking in information about the actual images.
It is almost a miracle that models trained on these open datasets work despite such discrepancies.
Google and OpenAI have found that synthetic captions are far more beneficial for downstream tasks, and the CapsFusion project attempts to annotate large-scale datasets synthetically.
Meta is likewise trying to build a high-quality refined dataset in a hierarchical way.
Unfortunately, we do not have the local resources to process such large databases in full, but we can cover the gap in several ways:
- Tag retrieval
- Focal crop → tag → grouping
- Tag-relevance-based reordering
These should help clarify what the tags actually refer to in the image.
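As a sketch of the last idea, tag-relevance-based reordering can be as simple as sorting the tag list by a per-tag relevance score. The `reorder_by_relevance` helper and the scores below are hypothetical; in practice the scores could come from a CLIP-style image-text similarity:

```python
def reorder_by_relevance(tags, scores):
    """Sort tags so the most image-relevant ones come first.

    `scores` maps tag -> relevance (e.g. CLIP image-text similarity);
    unknown tags sink to the end with a default score of 0.0.
    """
    return sorted(tags, key=lambda t: scores.get(t, 0.0), reverse=True)

# Hypothetical relevance scores, for illustration only.
scores = {"1girl": 0.92, "blonde_hair": 0.81, "outdoors": 0.44, "smile": 0.63}
tags = ["outdoors", "smile", "1girl", "blonde_hair"]
print(reorder_by_relevance(tags, scores))
# ['1girl', 'blonde_hair', 'smile', 'outdoors']
```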
This file provides a Gradio demo to extract Stealth-PNGInfo-style image metadata.
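Stealth-PNGInfo hides metadata in the least-significant bits of the image's alpha channel. The sketch below shows only that bit-packing idea on a plain list of RGBA pixel tuples; the real format additionally uses a magic signature and gzip compression, and in practice the pixels would be read with Pillow. All function names here are illustrative, not the demo's actual API:

```python
def embed_alpha_lsb(pixels, payload):
    """Write each bit of `payload` into the alpha-channel LSB of
    successive RGBA pixels (MSB-first within each byte)."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    out = list(pixels)
    for idx, bit in enumerate(bits):
        r, g, b, a = out[idx]
        out[idx] = (r, g, b, (a & ~1) | bit)
    return out

def extract_alpha_lsb(pixels, n_bytes):
    """Reassemble `n_bytes` bytes from the alpha-channel LSBs."""
    out = bytearray()
    acc = 0
    for idx in range(n_bytes * 8):
        acc = (acc << 1) | (pixels[idx][3] & 1)
        if idx % 8 == 7:
            out.append(acc)
            acc = 0
    return bytes(out)

# Round-trip on a tiny fully-opaque "image" of 64 RGBA pixels.
pixels = [(255, 255, 255, 255)] * 64
stego = embed_alpha_lsb(pixels, b"tags")
print(extract_alpha_lsb(stego, 4))  # b'tags'
```

Because only the alpha LSB changes, the embedded payload is visually undetectable but survives lossless PNG round-trips.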
This file is an example template for querying the GPT-4V API to obtain annotations based on an image and its tags.
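A minimal sketch of the request body such a template might build, assuming the OpenAI chat-completions message format with a base64 data-URL image. The function name and prompt wording are placeholders, and the actual API call is deliberately left out; check the current API reference before relying on the exact field layout:

```python
import base64

def build_vision_messages(image_bytes, tags,
                          prompt="Describe the image. Known tags: {tags}"):
    """Construct a chat-completions `messages` payload pairing one image
    with its booru tags (illustrative helper, not the repo's actual code)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt.format(tags=" ".join(tags))},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```

The returned list can then be passed as the `messages` argument of a chat-completions request against a vision-capable model.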
In the directory, each image should have a tag .txt file with the same base name.
The txt file format:
```
copyright:
character: erica_blandelli
general tags: 1girl arm_up blonde_hair blue_eyes breasts choker cleavage closed_mouth day full_body high_heels long_hair looking_at_viewer miniskirt outdoors pleated_skirt red_skirt sitting skirt smile solo
```
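Such a tag file can be parsed into a dictionary with a few lines; the function name is illustrative, and the sample below uses a shortened tag list:

```python
def parse_tag_file(text):
    """Split each 'key: value' line of a tag .txt file into a dict of
    tag lists; the tags themselves are space-separated."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.split()
    return fields

sample = (
    "copyright:\n"
    "character: erica_blandelli\n"
    "general tags: 1girl blonde_hair smile solo\n"
)
print(parse_tag_file(sample)["character"])  # ['erica_blandelli']
```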
A Gradio demo of the annotation is provided. A human should refine the GPT-4V annotations. A sanitize cell, which shows the 'unused tags', will be added soon.
For such work, one might need a company contact or a fair-use agreement for these annotations.
Unfortunately, most open-source / crawled datasets carry the risk of containing unclassified data.
And since data poisoning is becoming a severe issue, this problem will matter soon ™️.
If the booru database recorded 'where' in the image each specific tag applies, it could have served as a novel dataset that also carries semantic location information.