Wget All Recent Images From a Tumblr

I love tumblr. I’m following a few blogs filled with pictures that inspire me. Economy of Space, for example, is a tumblr about using tiny spaces well, with an emphasis on tiny houses and small apartments. I’d like to save the images from the blog and think about them later. Since I know I love this blog, I’d like the images to download automatically. Wget to the rescue!

The command

<code># Download the images using wget
wget --quiet -rH -Dmedia.tumblr.com,economyofspace.tumblr.com -R "*avatar*" -A "[0-9]" \
 -A "*index*" -A jpeg,jpg,bmp,gif,png --level=10 -nd -nc \
http://economyofspace.tumblr.com/
</code>

Explanation of the options

--quiet tells wget not to print what it’s doing. That’s useful because this wget is part of a cron job for me: I know it works, so I don’t need to see the output. If you’re debugging or playing around, leave this flag off.

-rH tells wget to download the site recursively (-r) and to span hosts (-H). This means that wget can wander into hosts that aren’t economyofspace.tumblr.com. That’s pretty risky on its own, since it’s easy to end up downloading the whole internet. This brings us to the next flag.

-Dmedia.tumblr.com,economyofspace.tumblr.com tells wget to only visit domains that are part of media.tumblr.com and economyofspace.tumblr.com. What’s interesting is that subdomains end up on the approved list, so 29.media.tumblr.com is just fine.
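
To make the subdomain behavior concrete, here is a rough shell sketch of that kind of suffix matching. It’s my approximation of how the -D list behaves, not wget’s actual code, and the function name host_allowed is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): approximates -D matching by
# checking whether a host equals an allowed domain or ends in ".<domain>".
host_allowed() {
    host=$1
    for domain in media.tumblr.com economyofspace.tumblr.com; do
        case "$host" in
            "$domain"|*".$domain") echo yes; return 0 ;;
        esac
    done
    echo no
}

host_allowed 29.media.tumblr.com        # prints "yes"
host_allowed economyofspace.tumblr.com  # prints "yes"
host_allowed example.com                # prints "no"
```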

-R "*avatar*" tells wget to reject (that is, not to download) any files with avatar in their names.

-A "[0-9]" -A "*index*" -A jpeg,jpg,bmp,gif,png. -A is the counterpart to -R: it lists the files wget is allowed to keep. Entries without wildcards (jpeg, jpg, bmp, gif, png) are treated as file name suffixes, while "*index*" allows any file with index in its name and "[0-9]" allows a file whose name is a single digit from 0 to 9. Why the list? We need the index page to keep spidering, we need the numbers to download pages from the archive, and we need the file formats to download those image types.
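
If you want to reason about which file names survive the accept and reject lists, a small shell function with the same glob patterns can help. This is only a sketch of the rules as I understand them from the wget manual, not wget’s implementation, and would_keep is a name I invented for it.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): approximates the -A/-R rules
# from the command above using shell glob patterns.
would_keep() {
    name=$1
    # -R "*avatar*": a rejected name is dropped even if it also
    # matches something on the accept list.
    case "$name" in
        *avatar*) echo skip; return 0 ;;
    esac
    # -A list: an entry with wildcards must match the whole name,
    # while a plain entry such as jpg is treated as a suffix.
    case "$name" in
        [0-9]|*index*|*jpeg|*jpg|*bmp|*gif|*png) echo keep ;;
        *) echo skip ;;
    esac
}

would_keep tumblr_abc123_500.jpg   # prints "keep"
would_keep avatar_64.png           # prints "skip"
would_keep 2                       # prints "keep"
would_keep style.css               # prints "skip"
```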

--level=10 tells wget to go 10 levels deep. This would be dangerous except that we’ve restricted domains and limited the file names that can be downloaded.

-nd tells wget not to recreate the directory structure and to instead download all of the files into the current working directory.

-nc tells wget not to re-download files that already exist. This keeps tumblr happy with us and prevents files from being needlessly redownloaded.

Finally, http://economyofspace.tumblr.com/ is the URL wget starts downloading from.

GitHub project

I’ve packaged up a script that makes downloading from a tumblr easy. It also makes downloading a set of tumblrs easy. Check out tumbld on GitHub.
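
To give a feel for the idea of handling several blogs, here is a minimal sketch that prints one wget command per blog. It is not the tumbld script itself: the build_command name and the example.tumblr.com entry are placeholders I’m assuming for illustration, and the flags are simply the ones explained above.

```shell
#!/bin/sh
# Hypothetical sketch, not the tumbld script: print the wget command
# from this post once for each blog in a list.
build_command() {
    blog=$1
    printf 'wget --quiet -rH -Dmedia.tumblr.com,%s -R "*avatar*" -A "[0-9]" -A "*index*" -A jpeg,jpg,bmp,gif,png --level=10 -nd -nc http://%s/\n' "$blog" "$blog"
}

# example.tumblr.com is a placeholder; list your own blogs here.
for blog in economyofspace.tumblr.com example.tumblr.com; do
    build_command "$blog"
done
```

This only prints the commands, so you can read them over before running anything. Because -nd puts every file in the current directory, adding wget’s -P option with a directory per blog would keep each blog’s images apart.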

Comments or Questions? Contact Nick @nixterrimus on twitter.

Nick is a software engineer, geek, web enthusiast, open source contributor, home automation tinkerer, ocean admirer and all-around general optimist living in San Francisco. Want to get in touch about professional matters? Nick Rowe is also available on LinkedIn.