The Pug Automatic

The Dilbert Blog RSS feed with full entries (and RSS scraping with Ruby on Dreamhost)

Written December 14, 2007. Tagged Ruby, Hpricot, RSS.

The Dilbert Blog can be entertaining. However, its RSS feed recently switched from full entries to snippets only, to increase ad revenue by making people click through to the site.

I like full entries in my feed reader, so I made a feed that has them: dilbert.rss

That's all you need to know if all you wanted was the feed. Read on for technical details.

Technical details :)

I'm using Ruby as a CGI script on Dreamhost (referral link).

Most of the heavy lifting is done by Christoffer Sawicki's excellent Feedalizer gem.

I added very simple caching: rather than retrieving and parsing the web site on each hit, the output is written to a text file. The text file lives for 30 minutes and is then regenerated on the next hit.

This goes in dilbert.cgi, which should be executable:

dilbert.cgi
#!/usr/bin/env ruby

# This script generates an RSS feed for The Dilbert Blog with full entries,
# as opposed to summaries.

# Enable using gems I've installed, on Dreamhost
# http://nateclark.com/articles/2006/10/20/dreamhost-your-own-packages-and-gems
ENV['GEM_PATH'] = "/usr/lib/ruby/gems/1.8:/home/henrik/.gems"

require "rubygems"
require "time" # for Time.parse
require "feedalizer"

URL = "http://dilbertblog.typepad.com/the_dilbert_blog/"
CACHE_FILE = "dilbert.cache"
CACHE_LIFE = 30 # minutes

def uncached?
  !File.exist?(CACHE_FILE) ||
    (Time.now - File.mtime(CACHE_FILE))/60 > CACHE_LIFE
end

print "Content-Type: text/xml\n\n"

feedalize(URL) do
  feed.title = "The Dilbert Blog: Full Entries"
  feed.description = "Dilbert humor, business absurdity, the meaning of life. And full entries."
  feed.about = URL

  scrape_items("//div.entry") do |rss_item, html_entry|
    header = html_entry.at('.entry-header')
    rss_item.link = header.at('a').attributes['href']
    rss_item.date = Time.parse(html_entry.at('.post-footers').inner_text)
    rss_item.title = header.inner_text
    rss_item.description = html_entry.at('.entry-body').inner_html
  end

  File.open(CACHE_FILE, 'w') {|f| f.write output } # cache output
end if uncached?

print File.read(CACHE_FILE)
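
The caching idea in the script above can also be pulled out into a small standalone helper. This is only a sketch: fetch_cached is a name made up for illustration (it is not part of Feedalizer), and the write-then-rename step is an addition that the script above does not do. It assumes the cache directory is writable by the CGI user.

```ruby
require "tmpdir"

# Hypothetical helper, not part of Feedalizer: returns the contents of
# cache_file, regenerating it from the block when the file is missing
# or older than max_age_minutes.
def fetch_cached(cache_file, max_age_minutes)
  stale = !File.exist?(cache_file) ||
          (Time.now - File.mtime(cache_file)) / 60 > max_age_minutes

  if stale
    # Write to a temporary file and rename it into place, so a request
    # arriving mid-write never reads a half-written cache file.
    temp_file = "#{cache_file}.#{Process.pid}.tmp"
    File.open(temp_file, "w") { |f| f.write(yield) }
    File.rename(temp_file, cache_file)
  end

  File.read(cache_file)
end

# Usage: the block runs only on the first call, so both printed lines
# are identical.
Dir.mktmpdir do |dir|
  cache = File.join(dir, "example.cache")
  2.times do
    puts fetch_cached(cache, 30) { "generated at #{Time.now.to_f}" }
  end
end
```

In the CGI script, the block would be whatever produces the feed XML, and the return value is what gets printed after the Content-Type header.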

If you don't like .cgi in your URLs (I don't), put this in a .htaccess file in the same directory:

.htaccess
RewriteEngine On
RewriteRule ^(dilbert)\.rss$ $1.cgi [L]
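
To see what the rule does, here is a rough Ruby approximation of the same pattern. The rewrite method is made up for illustration, and Ruby's regexp engine is not the one Apache uses, so treat it as a sanity check for simple one-line paths rather than an exact model. In a .htaccess file, Apache matches the pattern against the path relative to that directory, which is why plain "dilbert.rss" is the input here.

```ruby
# Hypothetical stand-in for the RewriteRule: Apache's $1 backreference
# becomes \1 in Ruby's String#sub.
def rewrite(path)
  path.sub(/^(dilbert)\.rss$/, '\1.cgi')
end

puts rewrite("dilbert.rss") # the rule matches and swaps the extension
puts rewrite("dilbert.cgi") # no match, left as-is
puts rewrite("other.rss")   # no match, left as-is
```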