Topic: Search how many artists are attached to a tag?

Posted under General

Let's say I search for a tag that has more than 10 pages. Depending on the tag's popularity, there can be multiple artists who create posts for it.

For example, MLP has over 400 pages' worth of posts, which is something like 100,000 posts. No doubt there are at least 20,000 artists involved. So my question is whether there is a way to get a list of the artists who have posted MLP, or whatever else I may be looking for.

Thanks.

Updated

I could probably get that information via the API for you but I wouldn't really want to make so many requests without an admin telling me it's okay.

There are a few thousand fewer than 60,000 artist tags, so I'm not sure whether MLP artists could really account for more than a third of all artists here.

Updated by anonymous

Gotcha. Thanks. I really appreciate it. So I assume this isn't something a general user can do?

I think I get whatchu mean though, it'll cause a strain on the website right?

Updated by anonymous

Qmannn said:
If you have some competency using scripting languages, you probably can.

https://e621.net/help/show/api

I'm afraid that I don't understand these kinds of things but I will definitely take a look at what you posted to at least give it a shot.

Updated by anonymous

To expand on what Tuvalu said, this is basically what is involved:

1. Fetch all the JSON for all 400 pages of MLP stuff (i.e. make 400 requests to the post/index API and store the results)
2. You would also have to get a list of all tags of 'artist' type on the site. I'm not sure whether you can use the API for this or whether scraping HTML is necessary. Even in the most optimistic case, this would be at least 600 requests.
3. Then you would need to iterate over all the post data returned by step 1, building an artistname:number_of_posts histogram by correlating the post data's list of tags with the overall list of artist tags
4. Then you could use that histogram for whatever. The size of the histogram would equal the number of MLP artists. Or you could sort it, so you get artist_with_most_mlp_posts:count_of_posts, artist_with_2nd_most_mlp_posts:count_of_posts, etc.
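
Steps 3 and 4 could look roughly like this in Python. This is only a sketch under assumptions: each post is assumed to be a dict with a "tags" list, `artist_tags` stands in for the list from step 2, and the function name and sample data are made up for illustration.

```python
from collections import Counter

def artist_histogram(posts, artist_tags):
    """Count posts per artist by intersecting each post's tags with the known artist tags."""
    histogram = Counter()
    for post in posts:
        # set() guards against a tag being listed twice on one post
        for tag in set(post["tags"]) & artist_tags:
            histogram[tag] += 1
    return histogram

# Made-up sample data, not real API output
posts = [
    {"id": 3, "tags": ["my_little_pony", "artist_a", "solo"]},
    {"id": 2, "tags": ["my_little_pony", "artist_a", "artist_b"]},
    {"id": 1, "tags": ["my_little_pony", "artist_b", "artist_a"]},
]
artist_tags = {"artist_a", "artist_b", "artist_c"}

hist = artist_histogram(posts, artist_tags)
ranked = hist.most_common()  # sorted by post count, highest first
print(len(hist), ranked)
```

`len(hist)` is the number of distinct artists for the tag, and `most_common()` gives the sorted artist:count view described in step 4.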

Updated by anonymous

savageorange said:
To expand on what Tuvalu said, this is basically what is involved:

1. Fetch all the JSON for all 400 pages of MLP stuff (i.e. make 400 requests to the post/index API and store the results)
2. You would also have to get a list of all tags of 'artist' type on the site. I'm not sure whether you can use the API for this or whether scraping HTML is necessary. Even in the most optimistic case, this would be at least 600 requests.
3. Then you would need to iterate over all the post data returned by step 1, building an artistname:number_of_posts histogram by correlating the post data's list of tags with the overall list of artist tags
4. Then you could use that histogram for whatever. The size of the histogram would equal the number of MLP artists. Or you could sort it, so you get artist_with_most_mlp_posts:count_of_posts, artist_with_2nd_most_mlp_posts:count_of_posts, etc.

You only need to make the 400 requests, since the API returns a field named "artist", which is a list of the artist tags attached to the post. If you do it slowly (2 or fewer requests a second), then go for it.

Because the pages for popular subjects shift over time as new posts arrive, I suggest using the "before_id" pagination system when making this many requests so that the pagination stays stable. If you turn the limit up to 320, you can reduce the number of requests to a fairly low and easy-to-process count.
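
A minimal sketch of that before_id loop, for illustration only. `fetch_page` is a hypothetical callable, not part of any real library: a real one would send the tags, `limit` and `before_id` to the post/index API, while the fake below serves made-up posts from memory so the loop can be tried offline. The `id` field name is an assumption; the "artist" list field is the one described above.

```python
from collections import Counter

def fetch_all_posts(fetch_page, limit=320):
    """Walk a search from newest to oldest, using the lowest id seen as the cursor.

    fetch_page(before_id, limit) is a hypothetical callable that returns a list of
    post dicts with id < before_id (or the newest posts when before_id is None).
    """
    posts = []
    before_id = None
    while True:
        page = fetch_page(before_id, limit)
        if not page:
            break  # an empty page means nothing older is left
        posts.extend(page)
        # New uploads get higher ids, so they cannot shift what this cursor returns
        before_id = min(p["id"] for p in page)
    return posts

def fake_fetch_page(before_id, limit, _all_ids=range(10, 0, -1)):
    """Stand-in for the real request: ten made-up posts with ids 10 down to 1."""
    ids = [i for i in _all_ids if before_id is None or i < before_id]
    return [{"id": i, "artist": ["artist_%d" % (i % 3)]} for i in ids[:limit]]

posts = fetch_all_posts(fake_fetch_page, limit=4)
# Tally the per-post "artist" lists directly; no separate artist tag list is needed
counts = Counter(name for p in posts for name in p["artist"])
print([p["id"] for p in posts], counts.most_common())
```

The loop keeps going until it gets an empty page, which costs one extra request but avoids guessing whether a short page is really the last one.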

Updated by anonymous

Wow this all sounds like heavy duty stuff I've never done before and it'll take me a bit of time to get it but I'll definitely give it a shot. Thanks guys.

Updated by anonymous

So out of curiosity, I wrote a little program to do this for a given tag set.

After about twenty minutes, the following popped out for 'my_little_pony'
https://gist.github.com/zwagoth/518dbff79f0f8586a978683629f65692

Code for the program:
https://gist.github.com/zwagoth/77b394fe6fa8834fa5549e42bf786bba

I didn't add explicit rate limiting because it runs serially and e6 doesn't respond that quickly when you ask for 320 results (less than one request a second).

Updated by anonymous

That's interesting about the (non) rate limiting. I asked tony 'What is the ideal page size, from the PoV of server load?' a while ago, and he didn't know. Your comment suggests that maybe larger is better?

EDIT: I clicked on the link thinking .. "Is this gonna be written in Python with Requests?" and sure enough...

Requests is awesome. Thanks for the before_id demo.

Updated by anonymous

savageorange said:
That's interesting about the (non) rate limiting. I asked tony 'What is the ideal page size, from the PoV of server load?' a while ago, and he didn't know. Your comment suggests that maybe larger is better?

EDIT: I clicked on the link thinking .. "Is this gonna be written in Python with Requests?" and sure enough...

Requests is awesome. Thanks for the before_id demo.

I'd say that for bulk loading of information, larger requests are better, but only if not done in parallel. I pretty much belong to the 'one request at a time, process the results, check the rate limit, and issue another' crowd, since you can only pull information out of the systems so quickly, and doing it in parallel doesn't actually improve the overall data rate.
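
For anyone who wants that pattern spelled out, here is a rough sketch. The class and function names are made up, the 2 requests per second figure is the one mentioned earlier in the thread, and a fake clock is injected so the example runs instantly without touching the network.

```python
import time

class SerialThrottle:
    """Enforce a minimum gap between consecutive requests made one at a time."""

    def __init__(self, max_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_start = None

    def wait(self):
        """Block until min_interval has passed since the previous request started."""
        now = self.clock()
        if self.last_start is not None:
            remaining = self.min_interval - (now - self.last_start)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_start = now

def run_serially(jobs, throttle):
    """Check the rate limit, issue one request, keep its result, then repeat."""
    results = []
    for job in jobs:
        throttle.wait()
        results.append(job())  # the next request starts only after this one returns
    return results

class FakeClock:
    """Test double: sleeping just advances a counter instead of really waiting."""
    def __init__(self):
        self.now = 0.0
        self.slept = []
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds

clock = FakeClock()
throttle = SerialThrottle(max_per_second=2.0, clock=clock.time, sleep=clock.sleep)
results = run_serially([lambda: "page1", lambda: "page2", lambda: "page3"], throttle)
print(results, clock.slept)
```

With the default `clock` and `sleep` arguments the same class would pace real requests; if the server already takes longer than the minimum gap to respond, `wait()` returns immediately.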

Updated by anonymous

The answer is 194 artists with 125 or more MLP posts, 974 artists with 25+, 3126 artists with 5+, and 10240 artists with at least 1.
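
Threshold counts like these can be read straight off an artist:post_count histogram. A small sketch, with a made-up function name and made-up counts rather than the real MLP data:

```python
def artists_with_at_least(histogram, thresholds):
    """For each threshold, count the artists whose post count meets or exceeds it."""
    return {t: sum(1 for n in histogram.values() if n >= t) for t in thresholds}

# Made-up counts for illustration
histogram = {"artist_a": 130, "artist_b": 40, "artist_c": 25, "artist_d": 6, "artist_e": 1}
buckets = artists_with_at_least(histogram, [125, 25, 5, 1])
print(buckets)
```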

Updated by anonymous

KiraNoot said:
I'd say that for bulk loading of information, larger requests are better, but only if not done in parallel.

Sure. I was actually mainly talking about the e621 profile setting that controls how many posts appear on a page when browsing. 'one request at a time, process the results, check the rate limit, and issue another' is pretty much how I do it already for automated stuff.

Updated by anonymous

TonyCoon

Former Staff

KiraNoot said:
So out of curiosity, I wrote a little program to do this for a given tag set.

After about twenty minutes, the following popped out for 'my_little_pony'
https://gist.github.com/zwagoth/518dbff79f0f8586a978683629f65692

Code for the program:
https://gist.github.com/zwagoth/77b394fe6fa8834fa5549e42bf786bba

I didn't add explicit rate limiting because it runs serially and e6 doesn't respond that quickly when you ask for 320 results (less than one request a second).

Now modify it to show (and sort by) what percent of each artist's posts includes the given tag instead of raw post counts :D

e.g. for MLP, something like braeburned: 97.7% (338/346)

Updated by anonymous

TonyLemur said:
Now modify it to show (and sort by) what percent of each artist's posts includes the given tag instead of raw post counts :D

e.g. for MLP, something like braeburned: 97.7% (338/346)

D: That would take a suuuper long time to run, since it needs to collect the total post count for every artist tag. I mean, you could do it with /tag/show.json?name= but that is where rate limiting would have to come in.
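
Once the totals are in hand, the percentage part itself is cheap. A sketch, assuming the per-artist totals have already been collected into a dict (that is the slow part); the function name is made up, and apart from the braeburned figures quoted above the sample numbers are invented.

```python
def tag_share_by_artist(tagged_counts, total_counts):
    """Return (artist, percent, tagged, total) rows, highest percentage first."""
    rows = []
    for artist, tagged in tagged_counts.items():
        total = total_counts.get(artist, 0)
        if total == 0:
            continue  # no total available; skip rather than divide by zero
        rows.append((artist, 100.0 * tagged / total, tagged, total))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Posts with the tag per artist (from the histogram) and total posts per artist
tagged_counts = {"artist_b": 10, "braeburned": 338, "artist_c": 5}
total_counts = {"artist_b": 200, "braeburned": 346}  # artist_c has no total yet

rows = tag_share_by_artist(tagged_counts, total_counts)
for artist, percent, tagged, total in rows:
    print("%s: %.1f%% (%d/%d)" % (artist, percent, tagged, total))
```

Artists without a known total are skipped here; another reasonable choice would be to list them separately so they can be retried.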

Updated by anonymous

I'm curious: don't the scripts behind the "Related X" lists on the upload page function in a somewhat similar way?

Updated by anonymous

Circeus said:
I'm curious: don't the scripts behind the "Related X" lists on the upload page function in a somewhat similar way?

Yes and no. It finds the same kind of information, but it isn't complete: the numbers attached to them are not the total sum of the posts. Because related search samples the input, you get the number of posts within the sample that matched. You also can't get the complete list of related tags from those functions, so there is potential for missing artist tags.

Updated by anonymous

I just wanted to say thanks for this script. It's been a while but it's really handy. I had to rig up my own way to input all of the artists into hyperlinks in a word doc. I usually don't do things like this but when there are a lot of artists for a tag, I want to be able to browse each one at my own leisure.

Updated by anonymous

kiranoot said:
So out of curiosity, I wrote a little program to do this for a given tag set.

After about twenty minutes, the following popped out for 'my_little_pony'
https://gist.github.com/zwagoth/518dbff79f0f8586a978683629f65692

Code for the program:
https://gist.github.com/zwagoth/77b394fe6fa8834fa5549e42bf786bba

I didn't add explicit rate limiting because it runs serially and e6 doesn't respond that quickly when you ask for 320 results (less than one request a second).

This is exactly the tool I was looking for, but the link is dead and there does not seem to be a backup of this gist. Does anyone have a replacement?
