Topic: [Tool] Linux Perl script to tag downloaded e621 images with EXIF metadata

Posted under e621 Tools and Applications

I wrote a bash perl script to embed tags from e621 into image exif metadata. This enables you to search the images directly on your computer in windows explorer.

tag_e621.pl (v2 using Perl) (FAST)


A small perl script to embed tags from e621.net into image EXIF metadata.

Prerequisites:
1) Text::CSV_XS
2) Path::Class
3) File::Basename
4) Image::ExifTool

How to use:

1) Download images from e621.net to some directory, for example /home/uwubanana/e621_files/
2) Download a posts db dump from https://e621.net/db_export/
3) Update "posts_csv" and "file_path" variables to the db dump and your e621 images.
4) Make the script executable: chmod +x tag_e621.pl
5) Run the script: ./tag_e621.pl

Notes:
EXIF METADATA TAGS ONLY WORK ON .JPG FOR WINDOWS SEARCH!
PNG and GIF files will still be tagged but Windows search doesn't support metadata for these files.

1) This perl script is much faster than the old BASH version, on an NVMe drive I was tagging over 150 images per second.
2) Parsing the CSV file takes a couple minutes, don't worry if it hangs for a while.
3) You may already have all prerequisites installed, if you're missing one try installing via:

perl -MCPAN -e 'install Text::CSV_XS'

perl -MCPAN -e 'install Path::Class'

perl -MCPAN -e 'install File::Basename'

perl -MCPAN -e 'install Image::ExifTool'

tag_e621.pl
#!/usr/bin/perl

# Add tags from e621.net to local image EXIF metadata.
#
# $post_csv = posts db dump from https://e621.net/db_export/
# $file_path = path containing images downloaded from e621.net
#
# Authors
# [email protected]
# kora [email protected]
#
# 2023-03-26

#### CHANGE THESE VARIABLES

my $posts_csv = "posts-2023-03-24.csv";
my $file_path = "files/e621_popular/2007";

#### DO NOT CHANGE BELOW THIS LINE

use Text::CSV_XS;
use Path::Class;
use File::Basename;
use Image::ExifTool;

print "Parsing " . $posts_csv . "...\n";
binmode STDOUT, ":utf8";
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<", "$posts_csv" or die "$posts_csv $!";
$junk = <$fh>;

while (my $row = $csv->getline ($fh))
  {
  $tags{@$row[3]} = @$row[8];
  }
close $fh or die "$posts_csv: $!";

print "Searching: " . $file_path . "...\n";
my @files;
dir($file_path)->recurse(callback => sub {
  my $file = shift;
  if($file =~ /(\.jpg|\.png|\.gif)\z/) {
    push @files, $file->absolute->stringify;
  }
});

print "Tagging files...\n";
for my $file (@files) {
  print "Tagging: " . $file . "\n";
  my $base_file_name = basename($file);
  (my $file_md5 = $base_file_name) =~ s/\.[^.]+$//;
  (my $formatted_tags = $tags{$file_md5}) =~ s/\ /; /g;
  my $exifTool = Image::ExifTool->new;
  if($file =~ /\.(jpg|gif)\z/) {
    $exifTool->Options(IgnoreMinorErrors => '1', OverwriteOriginal => '1', FastScan => '5');
    $exifTool->SetNewValue(Keywords => [$formatted_tags]);
    $exifTool->SetNewValue(Subject => [$formatted_tags]);
  }
  if($file =~ /\.png\z/) {
    $exifTool->Options(IgnoreMinorErrors => '1', OverwriteOriginal => '1', FastScan => '1');
    $exifTool->SetNewValue(Keywords => [$formatted_tags]);
    $exifTool->SetNewValue(Subject => [$formatted_tags]);
  }
  $exifTool->WriteInfo($file);
}

print "Done!\n"


GitHub link: https://github.com/bananahand/tag_e621

Let me know what you think!

Updated

I just updated the script, fixing issue where double quotes were being attached to first and last tags. Tags should be good now.

kora_viridian said:
General comments:

If you have exiftool, there's a 99.44% chance you have Perl. If you have Linux, there's a 99.9% chance you have Perl.

If you have Perl, then a Perl script that does this will probably be faster overall than a shell script that calls sed, awk, and mlr.

Also, if you have Perl, then pulling the tag list for each image from the posts.csv file is probably a lot simpler than converting to JSON first. Text::CSV is your friend. (You've probably already discovered that posts.csv has newlines in unexpected places, which makes simple-minded parsing difficult...)

Specific comments:

This seems like it depends on the problematic posts always being at those line numbers in the posts.csv file. If an early post, or a post in between the two currently-problematic posts, gets deleted, then this code probably won't do what you want. It might be better to have an array of the e621 post numbers of the problem posts, and look for those in the post-ID column of posts.csv, to know when you need to edit the description.

There's room for improvement here for sure. I can look into perl to see if I can get it working without jq. The issue is I'm not sure how fast perl would be searching a 3GB csv file. Also on removing the bugged posts, I think its safe to assume these line numbers will never change. Deleted posts remain in the CSV with the "is_deleted" flag being true or false:

"is_deleted": "f",

Also a big reason why I wanted to convert to json is so I could feed the md5 into grep for fast string searching in files which also returns the tags since its part of the same object on the same line:

uwubanana@e621 ~ $ time grep -m1 "076f60bfa74b4ec9ab73dd3d06dc2a00" /tmp/tags.json
{"md5":"076f60bfa74b4ec9ab73dd3d06dc2a00","tags":["3_toes accessory anthro barcode belly belt big_ears biped brown_belt butt collar collar_only corner cub digital_media_(artwork) dipstick_tail domestic_cat eyebrows feet felid feline felis full-length_portrait fur furgonomics half-closed_eyes handpaw hindpaw kneeling legband looking_at_viewer looking_back looking_back_at_viewer male mammal markings mirror monotone_butt multicolored_body multicolored_fur multicolored_tail narrowed_eyes nude pawpads paws pink_eyes pink_pawpads portrait pupils red_pupils reflection shadow short_snout side_view signature simple_background slim soles solo strapped straps submissive submissive_male tail tail_accessory tail_markings tailband tattoo toes totem.v tribal tribal_markings tribal_tattoo two_tone_body two_tone_fur two_tone_tail tysontan wall_(structure) white_background white_belly white_body white_fur white_tail yellow_body yellow_butt yellow_feet yellow_fur yellow_paws yellow_tail young"]}

real    0m0.051s
user    0m0.000s
sys     0m0.050s
uwubanana@e621 ~ $ wc -l /tmp/tags.json
3939530 /tmp/tags.json

Using my trimmed json file I can lookup tags for any file within ~50ms (on spinning HDD) with minimal formatting, just need to convert the spaces between the tags to commas:

tags="$(grep -m1 "${md5}" ${tags_json} | jq -r ".tags[]" | sed -e "s/ /,/g")"

I'm sure I could make this part faster by exporting the tags.json file to an array so lookups are happening directly in memory but its quite a lot of data and was OOMing during testing which is why I stuck with the file. From my tests, disregarding the couple minutes it takes to build the trimmed json file on the initial run, my script tags images on average in about ~300ms, so about 3 per second on my spinning hard drive. If you have an NVMe SSD this should be even faster.

Edit: Tested on an NVMe drive and I'm tagging images on average around ~200ms so 5 images a second.

Updated

Fixed another bug. I was referencing "test.json" for the ${tags} variable from my testing, updated this to correctly reference ${tags_json} which the script generates.

kora_viridian said:
General comments:

If you have exiftool, there's a 99.44% chance you have Perl. If you have Linux, there's a 99.9% chance you have Perl.

If you have Perl, then a Perl script that does this will probably be faster overall than a shell script that calls sed, awk, and mlr.

Where did you get those numbers from? Are the numbers based on what package is available on a per distro basis?
Exiftool and Perl are not installed on my system, but they're available in my package manager.

Script now prompts if you want to delete or keep temp files at the end so they're cached for the next run. Also moved the temp file posts.json to /tmp/ by default
Bugfix for prompting if you want to remake tags_json, variable was incorrectly set to posts_json. Prompt should now work correctly.

Updated

UPDATE: Complete script rewrite in Perl. Thanks kora viridian for recommendations! Tagging is now MUCH faster, over 150 images per second testing on an NVMe SSD.
BUGFIX: Added $exifTool->Options(IgnoreMinorErrors => '1'); to exiftool object options to enable writing of keywords beyond 64 character limit.

Updated

kora_viridian said:
The power of the camel compels you! :D

You just discovered Perl the same way I did. :) At a previous job, I was asked to write scripts to mess around with different kinds of text files. The very first one ended up being a shell script that made a simple call to sed, and the next few ended up being shell scripts that used various combinations of sed, awk, and grep. I ended up writing a script that had to write a couple of different temporary files, because it needed to make multiple passes through the input data. It worked, but it seemed clunky to me. I had heard of this Perl thingy so I decided to re-write that one in Perl, and it turned out much simpler - I could read in the data to memory and then do whatever I wanted, without temporary files.

It is good to know how to write a script using just things that are "always there", like the shell, grep, sed, and awk. Every once in a while you run into an embedded or otherwise low-resource system that may not have a full Perl or Python interpreter available, but will have at least Busybox versions of the standard stuff. Those can get you pretty far, especially if you're not handling a ton of data.

One drawback of both Perl and Python is deploying them to Windows users. There are .exe versions of both interpreters that work well, but then you have to talk people into 1) installing the .exe, 2) using something like cpan or pip to install the modules/libraries your script needs, and finally 3) running your script. WSL (embrace, extend, extinguish) may make this easier than it was in the past - I haven't used it, so I don't know.

Yeah this is much simpler and faster when using perl for sure. I've messed around with perl before, its just not something I've used much in my day as I tend to just stick with bash/zsh for most of my use cases. For csv parsing its clearly the better choice lol.

Side note, I tried the script on my 2014 popular downloads and its running a lot slower than my test on 2007 where I got 150+/s tagged, 2014 is struggling a lot more, hanging on a file for a second and bursting a couple and hanging again. I think this is just from the md5 lookup on the massive list in memory. I have an idea to fix this possibly... Need to look up the syntax for perl but I think I can do nested arrays and structure the array similar to how the static files are stored on static1.e621.net where file abcdef.jpg lives at static1.e621.net/data/ab/cd/abcdef.jpg. If i break the massive array into a couple thousand smaller arrays lookups will be much faster and consistent across all of the posts being tagged. Worst case I could skip the nesting and just take first 4 chars of md5 as the line is being read and feed that into array $abcd{$md5[tags]} for abcdef.jpg.

kora_viridian said:
You just discovered Perl the same way I did. :) At a previous job, I was asked to write scripts to mess around with different kinds of text files. The very first one ended up being a shell script that made a simple call to sed, and the next few ended up being shell scripts that used various combinations of sed, awk, and grep. I ended up writing a script that had to write a couple of different temporary files, because it needed to make multiple passes through the input data. It worked, but it seemed clunky to me. I had heard of this Perl thingy so I decided to re-write that one in Perl, and it turned out much simpler - I could read in the data to memory and then do whatever I wanted, without temporary files.

It *is* less clunky to avoid temp files. But if you're putting them in /tmp/, which you usually should, then they will be in-memory anyway, so, if there is a big performance difference between shell scripts and perl / python, then you might have missed something.

As for the current perl script. I have my doubts about whether partitioning by first digits will help.. at all. But there is a big optimization missing here: you don't want all the items, you only want the items that relate to files in file_path.
If you start by compiling a list of the md5s that you actually want (from file_path), you can just throw away all records other than that, at load time, and never have a huge array in memory.

Update: Made script a lot faster by adding FastScan exiftool option to minimize data read from file before write. Changed tag delimiter from "," to "; " so Windows now detects individual tags instead of one large tag string.
Updated OP with info stating .png and .gif searching will not work due to Windows not supporting exif data from these file types, only .jpg is searchable. The script still adds the tags to these file types even if Windows cannot utilize them.

kora_viridian said:
Are you sure it's not swapping? How much memory is the Perl process taking? Maybe close a few hundred tabs in your Web browser and try again. :D

I checked your popular-downloads post. 2007 had about 11,200 files, while 2014 had about 78,100 files. That's just about exactly seven times as many, but that doesn't seem to me like it should be that much of a difference.

Do you have the files it's tagging in the same year/month/day directory structure as mentioned in your popular-posts thread? If so, that should be OK. If you've got all 78,100 files for 2014 in one directory, maybe the filesystem being slow is contributing a little.

If nothing else, breaking it up on the first hex digit of the MD5 would give you sixteen hashes with about 250,000 items each, rather than one hash with almost 4,000,000 items. If that speeds it up enough, then you're done. If not, then break it up on the first two hex digits to get 256 hashes with about 15,600 items each, and see if you like that better.

I was curious, so I looked at the distribution of the digits in the MD5s in the (older) all-posts file I have. If I look only at the first digit of the MD5, I get all 16 possibilities. There are about 212,900 of the least-popular digit and 214,800 of the most-popular digit. If I look only at the last digit, I again get all 16 possibilities, with a range of about 213,000 to 214,200. (A perfect distribution would be about 213,700 for each one.)

If I look at the first four digits of the MD5, I get all 65,536 combinations, but the counts are a little lopsided - 25 for the least-popular and 90 of the most-popular. Cecking the last four digits also gives me all 65,536 combinations, but a similar lopsidedness - 22 for the least-popular and 84 of the most-popular. (A perfect distribution would be about 52 for each one.)

The slowness was due to not having FastScan option set so exiftools had to load all of the file metadata before writing. I guess the early 2007 photos just had less metadata which is why that was so much faster. Tagging is now extremely fast, over 150 images per second consistently.

  • 1