Crowdsourcing Image Labeling for Anime and Cartoon Dataset
This post is a continuation of Anime or Cartoon? – Let the AI decide, where I describe the project I’m collecting this data for.
In deep learning we often work with huge datasets, and this project is no exception. I collected around 300,000 images from both anime and cartoons. Processing this many images is not an easy task, though. After downloading the images from various sources, I skimmed through them and found a few potential issues…
I designed this simple web app to crowdsource the labeling of these images. I was heavily inspired by Galaxy Zoo, which is made for classifying different kinds of galaxies.
Now let’s talk about the problem. I wanted to go through every image so that I could filter out the problematic ones, but doing that by hand on my own would have been impossible. I found 4 main features that I wanted to record for each image:
- Some images contain text. This is bad because I especially don’t want the network to associate Japanese text with anime and English text with cartoons.
- Some images contain the logo of the TV channel. This is similar to the previous case: I don’t want specific TV channels to be associated with either category.
- Some images contain one or more characters. This is good because it is easier to classify based on the style of the characters (the backgrounds can be more similar between the two categories).
- Sometimes the images are completely black or unrecognizable. In this case I don’t want the neural network to learn from them.
For the “empty” images with minimal detail (only one color) I ended up checking the file size and deleting everything below ~1 kB, since JPEG compression makes such images very small. This was a quick and pretty effective method.
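The file-size check can be sketched in a few lines of Python. This is only an illustration, not the script I actually ran: the function name `find_tiny_images`, the `*.jpg` pattern and the exact 1024-byte cutoff are assumptions. It only lists the candidates, so they can be reviewed before anything is deleted.

```python
from pathlib import Path


def find_tiny_images(root, min_bytes=1024, pattern="*.jpg"):
    """Return the files under `root` that are smaller than `min_bytes`.

    Hypothetical helper: the threshold and the file pattern are assumptions.
    Nothing is deleted here; the caller decides what to do with the result.
    """
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and path.stat().st_size < min_bytes
    )
```

Once the list looks right, each entry can be removed with `path.unlink()`.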
I made a tutorial to teach users what to look for in the images.
For the database I chose MySQL since I was already familiar with it and I didn’t have THAT much data.
I designed the database schema, partly based on what I had learned at university. The simplified version can be seen below:
images (image_id, series_id, filename)
ratings (ratings_id, image_id, text, person, logo, empty)
series (series_id, title, folder, is_anime)
I add a row for every series with its title, directory name and whether it’s anime or not. I store the images on the server, with every series having its own directory. For every image I insert a row in the images table and link it back to its series through a foreign key.
Each submission is a row in the ratings table where I store the 4 options explained above and the image_id, time, etc.
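To make the relationships concrete, here is a minimal sketch of the three tables. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The column types, the `created_at` column (standing in for the "time" mentioned above) and the helper names `create_db` and `add_rating` are assumptions, not the real definitions.

```python
import sqlite3

# Assumed column types; the real MySQL schema may differ.
SCHEMA = """
CREATE TABLE series (
    series_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    folder    TEXT NOT NULL,
    is_anime  INTEGER NOT NULL
);
CREATE TABLE images (
    image_id  INTEGER PRIMARY KEY,
    series_id INTEGER NOT NULL REFERENCES series(series_id),
    filename  TEXT NOT NULL
);
CREATE TABLE ratings (
    ratings_id INTEGER PRIMARY KEY,
    image_id   INTEGER NOT NULL REFERENCES images(image_id),
    text       INTEGER NOT NULL,
    person     INTEGER NOT NULL,
    logo       INTEGER NOT NULL,
    empty      INTEGER NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP  -- hypothetical "time" column
);
"""


def create_db():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce the links
    conn.executescript(SCHEMA)
    return conn


def add_rating(conn, image_id, text, person, logo, empty):
    """Store one submission (the 4 flags for one image) and return its row id."""
    cur = conn.execute(
        "INSERT INTO ratings (image_id, text, person, logo, empty) "
        "VALUES (?, ?, ?, ?, ?)",
        (image_id, int(text), int(person), int(logo), int(empty)),
    )
    conn.commit()
    return cur.lastrowid
```

With the foreign keys enforced, a rating that points at a non-existent image is rejected instead of silently stored.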
Every time someone presses the next button I need to send out a new image to the user. This needs to be very fast or the app would feel sluggish. My first thought was to sort the entries by the number of “votes” and then select the one with the fewest. This had a couple of disadvantages:
- The query went through the images in order, because none of the images had any votes at first.
- It didn’t scale well. With 100,000 images it took 1 second on my machine and I was going to add a lot more of them.
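The vote-based query itself is not shown in this post, so the following is only a guess at what "pick the image with the fewest votes" could look like. It uses sqlite3 instead of MySQL and two stripped-down tables. The names `make_demo_db` and `next_image_by_votes`, and the tie-break on `image_id`, are assumptions.

```python
import sqlite3

# Count the ratings per image and take the image with the fewest.
# The image_id tie-break is an assumption that makes the order explicit.
FEWEST_VOTES_SQL = """
SELECT images.image_id, COUNT(ratings.image_id) AS votes
FROM images
LEFT JOIN ratings ON ratings.image_id = images.image_id
GROUP BY images.image_id
ORDER BY votes ASC, images.image_id ASC
LIMIT 1
"""


def make_demo_db(n_images):
    """Build a tiny in-memory database with n_images unrated images."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE images (image_id INTEGER PRIMARY KEY, filename TEXT);
        CREATE TABLE ratings (ratings_id INTEGER PRIMARY KEY, image_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO images (filename) VALUES (?)",
        [(f"{i:04d}.jpg",) for i in range(1, n_images + 1)],
    )
    return conn


def next_image_by_votes(conn):
    """Return (image_id, votes) for the image with the fewest ratings."""
    return conn.execute(FEWEST_VOTES_SQL).fetchone()
```

Until a rating comes in, every call returns the same first image, so everyone using the app at the same moment gets the same picture. The query also has to count the votes of every image on each request, which fits the slowdown described above.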
After a little bit of experimentation and profiling I settled on selecting the images randomly. This solved both problems, but I’m not sure whether every image will be selected over time. Here is the final query:
SELECT image_id, filename, series.title, series.folder
FROM images AS r1
JOIN series USING (series_id)
JOIN (
    SELECT CEIL(RAND() * (SELECT MAX(image_id) FROM images)) AS id
) AS r2
WHERE r1.image_id >= r2.id
ORDER BY r1.image_id ASC
LIMIT 1
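To get a feeling for whether every image can come up, the selection rule can be simulated outside the database. The sketch below mimics the query in plain Python: draw a random target between 0 and the largest id, then take the first existing id that is greater than or equal to it. The function names and the example ids are made up for illustration.

```python
import bisect
import math
import random
from collections import Counter


def pick_image_id(sorted_ids, rng=random):
    """Mimic the query: random target, then the first id >= target."""
    target = math.ceil(rng.random() * sorted_ids[-1])
    return sorted_ids[bisect.bisect_left(sorted_ids, target)]


def selection_counts(sorted_ids, draws, seed=0):
    """Count how often each id is picked over `draws` simulated requests."""
    rng = random.Random(seed)
    return Counter(pick_image_id(sorted_ids, rng) for _ in range(draws))


# Hypothetical ids with a gap between 3 and 10 (for example after deleting rows).
counts = selection_counts([1, 2, 3, 10], draws=10_000)
```

If this model matches what MySQL does, every existing id can be picked, so over enough requests each image should eventually be shown. The catch is gaps: an id that follows a gap catches all the targets that fall into the gap, so in the example above id 10 should come up roughly seven times as often as id 1. With contiguous ids the selection is close to uniform.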
As a starting point I used the Material Design Lite library, which as the name suggests provides Material Design themed elements. I built the design with it. I chose a card layout to make the app easy to use on mobile devices.
But of course, if the app shows the next image without loading a new page, I cannot go back, right? Wrong. This feature was new to me, but I had seen it on other pages so I knew where to look. We can manipulate the browser history with JS by pushing entries onto the window.history stack, and we can even store data in them. In my case that is the image URL and the title of the series.
I’m very proud to have finished a project of this scale, from backend to frontend, all by myself. I was very excited about Android Instant Apps, which became public just recently. Unfortunately, support for them is very limited and even restricted to Nexus devices at this time, so I gave up on the idea. It would have given a native feel to mobile users.
The project is on GitHub, if you want to check it out: https://github.com/Dawars/Anime-Image-Labeling