How to retrieve all the images from a website

A few weeks ago I posted on Twitter a few rather bizarre screenshots: a composition of all the submissions for #ScreenshotSaturday, loosely ordered by colour. In this series of posts I'll briefly explain how I did that using Python.

You can download the original pictures (16 MB, 71 MB, 40 MB, 13 MB) here.

Step 1: How to retrieve all the screenshots

The first step is, of course, to download all the screenshots. After a few attempts, I decided to physically copy them onto my HDD, so that I could try several different visualisation techniques and analyses without querying ScreenshotSaturday every time. All the pages with previous screenshots are reachable from www.screenshotsaturday.com/week_x.html, which makes them very easy to access. Then, we just have to retrieve all the images linked on each page.

import cv2
import requests
from skimage import io
from bs4 import BeautifulSoup

def downloadScreenshotsFromWeek(week):
	# URL of the screenshot page
	url = "http://screenshotsaturday.com/week%d.html" % week
	r = requests.get(url)
	soup = BeautifulSoup(r.text, "html.parser")

	# Finds all the references to screenshots
	for link in soup.find_all('a', {"data-milkbox": "gall1"}):
		url = 'http://screenshotsaturday.com/' + link.get('href')

		# Avoids GIFs, which are not supported by cv2
		if url.endswith('.gif'):
			continue

		# Downloads and converts from skimage's RGB to the BGR order cv2 expects
		image = io.imread(url)
		image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

		# Saves the file under its own name, stripping the path
		i = url.rfind("/")
		file_name = url[i+1:]
		cv2.imwrite(file_name, image)

The call to BeautifulSoup parses the retrieved HTML document so that we can navigate its DOM. ScreenshotSaturday has no specific tag which identifies the screenshots; however, they are the only elements using milkbox for the zoom-in effect, so the find_all call queries all of the hyperlinks carrying the data-milkbox attribute.

The code refers directly to ScreenshotSaturday, but it can be easily adapted to any other website which has a similar structure.
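To give an idea of such an adaptation, here is a minimal sketch that grabs every <img> tag from a page instead of milkbox links. The downloadImagesFromPage helper and its parameters are assumptions for illustration, not part of my original script:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def downloadImagesFromPage(page_url, out_dir="images"):
	# Downloads every <img> linked from page_url into out_dir
	os.makedirs(out_dir, exist_ok=True)
	soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

	for img in soup.find_all("img"):
		src = img.get("src")
		if not src:
			continue

		# Resolves relative links against the page address
		url = urljoin(page_url, src)

		# Keeps only the file name, as in the original script
		file_name = url.rsplit("/", 1)[-1]
		with open(os.path.join(out_dir, file_name), "wb") as f:
			f.write(requests.get(url).content)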

A problem I've been experiencing is that some pictures are decoded wrongly. I believe this might be caused by the cvtColor call, which makes strong assumptions about the way pixels are encoded.
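A more defensive version would inspect the image before converting. This is just a sketch of that idea, and the downloadImage helper is hypothetical:

import cv2
from skimage import io

def downloadImage(url):
	image = io.imread(url)

	# Grayscale images have no colour channels to swap
	if image.ndim == 2:
		return image

	# Transparent PNGs arrive as RGBA: drop the alpha channel too
	if image.shape[2] == 4:
		return cv2.cvtColor(image, cv2.COLOR_RGBA2BGR)

	# Plain RGB: swap to the BGR order cv2.imwrite expects
	return cv2.cvtColor(image, cv2.COLOR_RGB2BGR)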


Step 2: Download from different pages

Now, to download all the screenshots:

for week in range(1, 237):
	downloadScreenshotsFromWeek(week)

Leave it running overnight, and you will end up with about 10 GB of screenshots on your HDD.

Conclusion

This series of posts explains how to use Python to download all the images from a website. In the next post I'll discuss a more interesting (and less programming-oriented) topic: how to find the main colours in an image using clustering techniques.

Before everybody tries to attempt this, I want to point out that the folks at ScreenshotSaturday might not be too happy if you overload their servers with too many requests. I grabbed the screenshots over the course of a week, adding some delay between requests to spread out the traffic. Download responsibly.
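For reference, a throttled version of the loop above could look like the sketch below; the 60-second pause is an arbitrary value assumed for illustration:

import time

for week in range(1, 237):
	try:
		downloadScreenshotsFromWeek(week)
	except Exception as e:
		print("Week %d failed: %s" % (week, e))

	# Assumed pause between pages; tune it to keep the traffic gentle
	time.sleep(60)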

