A few weeks ago I posted a few rather bizarre screenshots on Twitter: a composition of all the submissions for #ScreenshotSaturday, loosely ordered by colour. In this series of posts I’ll briefly explain how I did that using Python.
— Alan Zucconi (@AlanZucconi) May 9, 2015
You can download the original pictures (16Mb, 71Mb, 40Mb, 13Mb) here.
Step 1: How to retrieve all the screenshots
The first step is, of course, to download all the screenshots. After a few attempts, I decided to physically copy them to my HDD so that I could try several different visualisation techniques and analyses without querying ScreenshotSaturday every time. All the pages with previous screenshots are reachable from www.screenshotsaturday.com/week_x.html, which makes them very easy to access. Then, we just have to retrieve all the images linked in each page.
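import os
import cv2
import requests
from skimage import io
from bs4 import BeautifulSoup

def download_screenshots(week, folder="screenshots"):
    # Url of the screenshot page
    url = "http://screenshotsaturday.com/week%d.html" % week
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    os.makedirs(folder, exist_ok=True)
    # Finds all the references to screenshots
    # (assumption: the milkbox links carry a data-milkbox attribute)
    for link in soup.find_all("a", attrs={"data-milkbox": True}):
        url = 'http://screenshotsaturday.com/' + link.get('href')
        # Avoids GIFs because not supported in cv2
        if url.lower().endswith(".gif"):
            continue
        # Downloads and converts into the right format
        image = io.imread(url)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # Saves the file
        i = url.rfind("/")
        file_name = url[i + 1:] # File name, not file address
        cv2.imwrite(os.path.join(folder, file_name), image)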
The script uses BeautifulSoup to navigate the DOM of the retrieved HTML document. For ScreenshotSaturday there isn’t a specific tag which identifies the screenshots. However, they are the only elements using milkbox for the zoom-in effect, so the step commented "Finds all the references to screenshots" queries all of these hyperlinks.
The code refers directly to ScreenshotSaturday, but it can be easily adapted to any other website which has a similar structure.
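As a rough sketch of how that adaptation could look, the link-finding step can be isolated into a small function that takes the page markup, the base address and the name of the attribute that marks the screenshots. This version swaps BeautifulSoup for Python's built-in html.parser so that it has no dependencies. The attribute name data-milkbox and the sample markup are assumptions made for illustration, not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LightboxLinkParser(HTMLParser):
    """Collects the href of every <a> tag that carries a marker attribute."""

    def __init__(self, base_url, marker_attr):
        super().__init__()
        self.base_url = base_url
        self.marker_attr = marker_attr
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Keep only anchors that have both the marker attribute and an href
        if tag == "a" and self.marker_attr in attrs and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def find_image_links(html, base_url, marker_attr="data-milkbox"):
    """Returns absolute URLs of the marked links, skipping GIFs."""
    parser = LightboxLinkParser(base_url, marker_attr)
    parser.feed(html)
    return [u for u in parser.links if not u.lower().endswith(".gif")]


# Made-up markup for illustration; the real page may look different
sample = """
<a href="images/a.png" data-milkbox="gallery"><img src="thumbs/a.png"></a>
<a href="images/b.gif" data-milkbox="gallery"><img src="thumbs/b.gif"></a>
<a href="about.html">About</a>
"""
links = find_image_links(sample, "http://screenshotsaturday.com/")
```

For another website, only the base address and the marker attribute should need changing, provided its screenshots are also wrapped in hyperlinks.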
A problem I’ve been experiencing is that some pictures are decoded incorrectly. I believe this might be a problem with the cv2.cvtColor call, which makes strong assumptions about the way pixels are encoded.
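One possible cause, which I have not verified against the actual files, is that some images decode as grayscale (no channel axis) or as RGBA (four channels) rather than plain three-channel RGB. The sketch below picks a conversion from the shape of the decoded array instead of assuming one. The function name is hypothetical; the three constant names are real OpenCV ones.

```python
def bgr_conversion_name(shape):
    """Name of the cv2 colour-conversion constant for an array of this shape.

    Assumes the array came from a loader that returns RGB channel order
    (as skimage does) and will be written with cv2.imwrite, which expects BGR.
    """
    if len(shape) == 2:
        return "COLOR_GRAY2BGR"      # grayscale, no channel axis
    if len(shape) == 3 and shape[2] == 3:
        return "COLOR_RGB2BGR"       # plain RGB
    if len(shape) == 3 and shape[2] == 4:
        return "COLOR_RGBA2BGR"      # RGB plus alpha; the alpha channel is dropped
    raise ValueError("unexpected image shape: %r" % (shape,))


# With OpenCV installed, the name would be used like this:
#   code = getattr(cv2, bgr_conversion_name(image.shape))
#   image = cv2.cvtColor(image, code)
```

Raising an error on anything unexpected at least makes the odd files visible, rather than saving them with scrambled colours.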
Step 2: Download from different pages
Now, to download all the screenshots:
for week in range(1, 237):
    download_screenshots(week)  # the Step 1 code, wrapped in a function that takes the week number
Leave it running overnight, and make sure you have about 10Gb of free space on your HDD.
This post explains how to use Python to download all the images from a website. In the next posts I’ll discuss a more interesting (and less programming-oriented) topic: how to find the main colours in an image, using clustering techniques.
Before anybody attempts this, I want to remind you that the guys at ScreenshotSaturday might not be too happy if you overload their servers with too many requests. I grabbed the screenshots over a week, with some delay between requests to spread out the traffic. Download responsibly.
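A minimal way to add such a delay is to wrap the sequence of weeks in a generator that pauses between items. This is only a sketch: the name throttled and the 30-second figure are mine, not what I actually ran.

```python
import time


def throttled(items, delay_seconds, sleep=time.sleep):
    """Yields items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first item
        yield item


# Example: record the pauses instead of actually sleeping
pauses = []
weeks = list(throttled(range(1, 4), 30, sleep=pauses.append))
```

In the download loop this would read for week in throttled(range(1, 237), 30), which spaces the page requests out without changing anything else.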
- Part 1: How to retrieve all the images from a website
- Part 2: How to find the main colours in an image
- Part 3: The incredibly challenging task of sorting colours