Topic: e6collector.py: A very bare-bones CLI tag downloader for Python 3 (updated for 2020)

Posted under e621 Tools and Applications

This is a bare-bones Python script I'm using to mirror my favourites. You might not find this useful or even usable. This is what it can do:

  • Work with just Python 3.6 or higher (no extra modules needed!) on Linux (other systems are probably fine as well)
  • Download images by tag using the API. Make sure your tag list is already URL encoded if you want multiple tags (see the encoding snippet below the usage example).
  • Save images as ID-SOME-TAGS.EXT to a directory
  • Save all sources and tags associated with each image to tags.csv in the destination folder (it does not update those, though, at least ATM). Great for grepping!
  • Don't redownload images just because their tags changed; it strictly goes by ID.
  • Optionally log in with username and API key (required to be able to see some images)
  • Spam your terminal
  • Fail hard if e621 is slow/down

Usage example: python3 e6collector.py /home/me/pics/e621/ "fav:myname" "myusername" "mYAp1k3y1234"
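
For reference, here is one way to encode a multi-tag query in Python before passing it in. This is just a sketch; the tag names are made up:

import urllib.parse

# Multiple tags are space-separated; encode the whole query string
# before passing it to the script.
tags = "fav:myname rating:safe"
print(urllib.parse.quote_plus(tags))  # fav%3Amyname+rating%3Asafe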

I might actually work on this a little more now; I had planned to add a less verbose mode from the beginning but never did. Oh, and putting sources in the CSV file works again as well, still in the same janky format.

Download version 2.1: https://pastebin.com/pf7snA9k

Have fun I guess!

Updated

Finally updated this thing to work with the not-recent-at-all API changes. It might blow up though, so be warned.

Updated by anonymous

stealthmode said:
Fail hard if e621 is slow/down

Sounds like fun :P

Also this is gonna sound really nerdy, but where are all the goody CLI flags? Any good CLI-tool needs those.

Important note: I'm not going to use this tool, so don't change anything for my sake, just found it interesting and gave me an idea or two for my own tool(s)!

Updated by anonymous

Chessax said:

Also this is gonna sound really nerdy, but where are all the goody CLI flags? Any good CLI-tool needs those.

I honestly don't know what I would add, except maybe verbose/quiet mode. It is meant to be bare-bones, after all.

Updated by anonymous

stealthmode said:
I honestly don't know what I would add, except maybe verbose/quiet mode. It is meant to be bare-bones, after all.

--help, -h        - Display this help message and also register you for psychiatric help
--quiet, -q       - Omit useful error messages; all other output is retained
--verbose, -v     - Leave comments describing, in haunting detail, exactly how much the image turns you on
--no-nsfw, -n     - Do not download anything at all
--contribute, -c  - Complain of missing tags but do not add them yourself
--desperate, -d   - Also search FurAffinity

Updated by anonymous

It would be nice to generate a metalink file instead of the script doing the image downloading itself. That way you could import it into a download manager like DownThemAll, and easily rate-limit and cope with downtime.
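
For illustration, a Metalink 4 (RFC 5854) file is just a small XML document listing file names and URLs, so generating one is straightforward. A minimal sketch; the post list is a made-up stand-in for the API results:

import xml.sax.saxutils

posts = [("1234-some-tags.png",
          "https://static1.e621.net/data/aa/bb/example.png")]

with open("e6collector.meta4", "w", encoding="utf-8") as f:
    # Metalink 4 is plain XML; download managers like DownThemAll
    # and aria2 can import it.
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<metalink xmlns="urn:ietf:params:xml:ns:metalink">\n')
    for name, url in posts:
        f.write('  <file name=%s>\n' % xml.sax.saxutils.quoteattr(name))
        f.write('    <url>%s</url>\n' % xml.sax.saxutils.escape(url))
        f.write('  </file>\n')
    f.write('</metalink>\n')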

  • Spam your terminal

Learn to use '\r', and output right-padded strings.
Your code looks simple enough that you should be able to use something like this:

import shutil

_termwidth = None

def say(message):
    global _termwidth
    if not _termwidth:
        # Cache the terminal width on first use.
        _termwidth = shutil.get_terminal_size()[0]
    # Left-justify and pad to the full width so leftovers from longer
    # messages get blanked out, then return the cursor to column 0.
    print(message.ljust(_termwidth - 1), end='\r')

Which will show only the latest message, rewriting the line each time a new message needs to be shown.
You could also alter it to dedicate part of the line to the message and part to a progress indicator, whenever you implement progress.

(disclaimer: not tested on Windows. Used many many times for different scripts on Linux)

  • updating the tag list

If you mean the CSV rather than the filenames, I suggest you look at TMSU -- if you want, all tagging can be managed by simply shelling out to it. Updates can be done with just two shell-outs: 1. remove all tags from the file, 2. tag it again with the new set of tags (see the sketch below).
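
A minimal sketch of that two-step update from Python, assuming TMSU's untag --all and tag subcommands behave as described above (untested):

import subprocess

def retag(path, tags):
    # 1. Remove all existing tags from the file.
    subprocess.run(['tmsu', 'untag', '--all', path], check=True)
    # 2. Tag it again with the new set of tags.
    subprocess.run(['tmsu', 'tag', path] + list(tags), check=True)

retag('1234-some-tags.png', ['canine', 'solo'])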

Updated by anonymous

Maxpizzle said:

--help, -h        - Display this help message and also register you for psychiatric help
--quiet, -q       - Omit useful error messages; all other output is retained
--verbose, -v     - Leave comments describing, in haunting detail, exactly how much the image turns you on
--no-nsfw, -n     - Do not download anything at all
--contribute, -c  - Complain of missing tags but do not add them yourself
--desperate, -d   - Also search FurAffinity

This post put me in verbose mode.

savageorange said:
It would be nice to generate a metalink file instead of the script doing the image downloading itself. That way you could import it into a download manager like DownThemAll, and easily rate-limit and cope with downtime.

That actually sounds like a useful feature; I'll think about it.

savageorange said:
Which will show only the latest message, rewriting the line each time a new message needs to be shown.
You could also alter it to dedicate part of the line to the message and part to a progress indicator, whenever you implement progress.

Nah, I'd rather just skip messages for already existing images to reduce the spamming.

savageorange said:
If you mean the CSV rather than the filenames,
I suggest you look at TMSU -- if you want, all tagging can be managed by simply shelling out to it (updates can be done just by two shell-outs : 1. removing all tags from the file, 2. tagging again with the new set of tags)

Interesting project, I'll look at it. Won't put it in as a dependency, though.

Updated by anonymous

hi i took this and made a quieter, faster, parallel downloadin', rate-limited, error-handling version: https://gist.github.com/anonymous/f9936e74cedca08368561e3e6d505b91

$ ./e6collector.py --help
usage: e6collector.py [-h] [--jobs JOBS] [--verbose] [--quiet]
                      destination tags [tags ...]

Download files by tag from e621

positional arguments:
  destination           Directory to store the files in
  tags                  Tags to look for. Try "fav:yourname"

optional arguments:
  -h, --help            show this help message and exit
  --jobs JOBS, -j JOBS  Downloads to run in parallel
  --verbose, -v
  --quiet, -q

Updated by anonymous

TheLuggage said:
hi i took this and made a quieter, parallel downloadin' version: https://gist.github.com/anonymous/8956030a367323d673943868bba3c076

$ ./e6collector.py -h
usage: e6collector.py [-h] [--jobs JOBS] [--verbose]
                      destination tags [tags ...]

Download files by tag from e621

positional arguments:
  destination           Directory to store the files in
  tags                  Tags to look for. Try "fav:yourname"

optional arguments:
  -h, --help            show this help message and exit
  --jobs JOBS, -j JOBS  Downloads to run in parallel
  --verbose, -v

Please don't use a `while True:` loop with no rate limits and no error handling to fetch posts. You're almost guaranteed to get the tool blocked doing that. Infinite loops and HTTP requests are bad, mmkay!

Some ideas:

  • The maximum number of requests you make for more posts can never be more than the maximum post ID returned on the first request divided by the number of posts per request, plus one.
  • The before_id should change on every request; if it is not changing, something is wrong.
  • Test for response codes other than 200 and delay a few seconds; if you keep getting non-200 responses more than 5 times, abort, because something is dreadfully wrong. A wrapper class around the requests would make this fairly trivial to implement (sketch below).
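
A minimal sketch of such a wrapper using only the standard library, along the lines described above (the limits are assumptions, not tested values):

import time
import urllib.error
import urllib.request

def fetch(url, max_failures=5, delay=3):
    # Retry on anything other than a clean 200 and abort after
    # max_failures consecutive failed attempts.
    failures = 0
    while failures < max_failures:
        try:
            with urllib.request.urlopen(url) as response:
                if response.status == 200:
                    return response.read()
        except urllib.error.HTTPError:
            pass  # non-200 status; fall through to the retry delay
        except urllib.error.URLError:
            pass  # connection trouble; also worth retrying
        failures += 1
        time.sleep(delay)
    raise RuntimeError('Aborting after %d failed attempts: %s'
                       % (failures, url))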

Updated by anonymous

Really good points. Thank you. I've added request rate limiting and HTTP error handling with retry and exponential backoff; see the updated link.

Updated by anonymous

TheLuggage said:
Really good points. Thank you. I've added request rate limiting and HTTP error handling with retry and exponential backoff; see the updated link.

It should be noted that the urllib library does not throw an exception on non-200 response statuses from the server. The request may be successful (you get a response), but the server may have rejected it for rate limiting or other error reasons. Exceptions are primarily limited to protocol violations and connection errors.

https://docs.python.org/3/library/http.client.html#http.client.HTTPResponse.status should be checked in this case for the value 200.

Updated by anonymous

KiraNoot said:
It should be noted that the urllib library does not throw an exception on non-200 response statuses from the server.

`urlopen` uses the globally installed `OpenerDirector`. The default global `OpenerDirector` has an `HTTPErrorProcessor` step that raises `HTTPError` on non-2xx responses. I also tested that it raises these against https://httpbin.org/status/ just in case: https://gist.github.com/anonymous/5c3fbd0cee301973f9c26002dc4854da
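
A quick way to see that behaviour for yourself (httpbin's status endpoint just echoes back the requested code):

import urllib.error
import urllib.request

# The default opener raises HTTPError for error statuses.
try:
    urllib.request.urlopen('https://httpbin.org/status/503')
except urllib.error.HTTPError as error:
    print('caught HTTPError:', error.code)  # caught HTTPError: 503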

edit: new ver with much faster checks if a post is already downloaded or tagged, a quiet mode, and stats: https://gist.github.com/anonymous/82f1512434d66d68e9d5cfa9fd6933c7

Updated by anonymous

TheLuggage said:
`urlopen` uses the globally installed `OpenerDirector`. The default global `OpenerDirector` has an `HTTPErrorProcessor` step that raises `HTTPError` on non-2xx responses. I also tested that it raises these against https://httpbin.org/status/ just in case: https://gist.github.com/anonymous/5c3fbd0cee301973f9c26002dc4854da

edit: new ver with much faster checks if a post is already downloaded or tagged, a quiet mode, and stats: https://gist.github.com/anonymous/f9936e74cedca08368561e3e6d505b91

Two thumbs up. I learned something. I'm way too used to using requests.

Updated by anonymous

Not really sure if this whole script works anymore, but I just get SSL errors on whatever I try to download.
