How to retrieve all the images from a website

A few weeks ago I posted a few rather bizarre screenshots on Twitter: compositions of all the submissions for #ScreenshotSaturday, loosely ordered by colour. In this series of posts I’ll briefly explain how I did that using Python.

You can download the original pictures (16MB, 71MB, 40MB, 13MB) here.

Step 1: How to retrieve all the screenshots

The first step is, of course, to download all the screenshots. After a few attempts, I decided to save a local copy on my HDD so that I could try several different visualisation techniques and analyses without querying ScreenshotSaturday every time. All the pages with previous screenshots are reachable from the ScreenshotSaturday website, which makes them very easy to access. Then we just have to retrieve all the images linked in each page.

The script uses BeautifulSoup to navigate the DOM of the retrieved HTML document. For ScreenshotSaturday there isn’t a specific tag which identifies the screenshots. However, they are the only elements using milkbox for the zoom-in effect, so the script queries all of those hyperlinks.
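To make the idea concrete, here is a minimal sketch of the link extraction, not the original script. The function name extract_screenshot_urls is made up for illustration, and I am assuming that milkbox links carry either a data-milkbox attribute or a rel value starting with "milkbox"; check the actual markup of the page before relying on it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_screenshot_urls(html, base_url):
    """Return absolute URLs of every image linked through a milkbox anchor."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        # BeautifulSoup usually returns rel as a list of values;
        # normalise to a list in case a parser hands back a plain string.
        rel = anchor.get("rel") or []
        if isinstance(rel, str):
            rel = [rel]
        # Assumption: milkbox anchors are marked in one of these two ways.
        uses_milkbox = anchor.has_attr("data-milkbox") or any(
            value.startswith("milkbox") for value in rel
        )
        if uses_milkbox:
            # Resolve relative links against the URL of the page itself.
            urls.append(urljoin(base_url, anchor["href"]))
    return urls
```

Adapting this to another website should mostly be a matter of changing the test that decides which anchors count as screenshots.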

The code refers directly to ScreenshotSaturday, but it can be easily adapted to any other website which has a similar structure.

A problem I’ve been experiencing is that some pictures are decoded wrongly. I believe this might be a problem with the step that decodes the downloaded image data, which makes strong assumptions about the way pixels are encoded.
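One possible way around this kind of problem, assuming the images are decoded with Pillow (which may not match the original script), is to convert every picture to a single pixel format right after opening it. The helper name decode_image is hypothetical.

```python
from io import BytesIO

from PIL import Image


def decode_image(data):
    """Decode raw image bytes into an RGB image, whatever the source mode."""
    image = Image.open(BytesIO(data))
    # Greyscale, palette, CMYK and RGBA files all open fine, but their
    # pixels are not (R, G, B) triples. convert() normalises them, so the
    # rest of the code can safely assume three channels per pixel.
    return image.convert("RGB")
```

This is only a guess at the cause: it helps if the wrongly decoded pictures are the ones stored in a mode other than RGB, and it will not fix files that are corrupted or truncated.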

Step 2: Download from different pages

Now, to download all the screenshots:
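The loop over the pages could look roughly like the sketch below. It is not the original code: download_all, fetch and the extract_image_urls argument are names I am introducing for illustration. extract_image_urls stands for any function that takes the HTML of a page and its URL and returns the image URLs found in it, and the delay between requests is an arbitrary default.

```python
import os
import time
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch(url):
    """Download a URL and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def download_all(page_urls, extract_image_urls, out_dir, delay=5.0, fetch=fetch):
    """Save every image linked from every page, pausing between requests."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page_url in page_urls:
        html = fetch(page_url).decode("utf-8", errors="replace")
        for image_url in extract_image_urls(html, page_url):
            # Use the last path component of the URL as the file name.
            name = os.path.basename(urlparse(image_url).path)
            path = os.path.join(out_dir, name)
            # Skip nameless URLs and files that are already on disk, so an
            # interrupted run can be resumed without downloading twice.
            if name and not os.path.exists(path):
                with open(path, "wb") as handle:
                    handle.write(fetch(image_url))
                saved.append(path)
                time.sleep(delay)  # spread the traffic out
        time.sleep(delay)
    return saved
```

Called with a list of page URLs (for instance one URL per archive page, following whatever pattern the target website uses), it returns the paths of the files it saved. Note that two different images with the same file name would collide here; a real script might want to prefix the name with the page it came from.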

Leave it running overnight, and make sure you have about 10GB of free space on your HDD.


This series of posts explains how to use Python to download all the images from a website. In the next posts I’ll discuss a more interesting (and less programming-oriented) topic: how to find the main colours in an image using clustering techniques.

Before anybody attempts this, I want to remind you that the people at ScreenshotSaturday might not be too happy if you overload their servers with too many requests. I grabbed the screenshots over a week, with a delay between requests to spread out the traffic. Download responsibly.

    This is great. How did you create the thumbnail embeddings into one image?

