The Dilbert Blog RSS feed with full entries (and RSS scraping with Ruby on Dreamhost)

Tagged Hpricot, RSS, Ruby.

The Dilbert Blog can be entertaining. However, the RSS feed was recently changed from containing the full entries to containing only a snippet, to increase ad revenue (by having people click through to the site).

I like full entries in my feed reader, so I made a feed that has them: dilbert.rss

That’s all you need to know if all you wanted was the feed. Read on for technical details.

Technical details :)

I’m using Ruby as a CGI script on Dreamhost (referral link).

Most of the heavy lifting is done by Christoffer Sawicki’s excellent Feedalizer gem.

I added very simple caching: rather than retrieving and parsing the web site on each hit, the output is written to a text file. The text file lives for 30 minutes and is then regenerated on the next hit.

This goes in dilbert.cgi, which must be executable (for example, chmod 755 dilbert.cgi):

#!/usr/bin/env ruby

# This script generates an RSS feed for The Dilbert Blog with full entries,
# as opposed to summaries.

# Enable using gems I've installed, on Dreamhost
# http://nateclark.com/articles/2006/10/20/dreamhost-your-own-packages-and-gems
ENV['GEM_PATH'] = "/usr/lib/ruby/gems/1.8:/home/henrik/.gems"

require "rubygems"
require "time"  # Time.parse, used below for the item dates
require "feedalizer"

URL        = "http://dilbertblog.typepad.com/the_dilbert_blog/"
CACHE_FILE = "dilbert.cache"
CACHE_LIFE = 30  # minutes

def uncached?
  !File.exist?(CACHE_FILE) ||
  (Time.now - File.mtime(CACHE_FILE))/60 > CACHE_LIFE
end

print "Content-Type: text/xml\n\n"

feedalize(URL) do
  feed.title = "The Dilbert Blog: Full Entries"
  feed.description = "Dilbert humor, business absurdity, the meaning of life. And full entries."
  feed.about = URL

  scrape_items("//div.entry") do |rss_item, html_entry|
    header = html_entry.at('.entry-header')
    rss_item.link  = header.at('a').attributes['href']
    rss_item.date  = Time.parse(html_entry.at('.post-footers').inner_text)
    rss_item.title = header.inner_text
    rss_item.description = html_entry.at('.entry-body').inner_html
  end

  File.open(CACHE_FILE, 'w') {|f| f.write output }  # cache output
end if uncached?

print File.read(CACHE_FILE)
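
The caching idea is independent of Feedalizer, so it can be pulled out into a small helper. The following is only a sketch, not part of the script above: `fetch_cached` is a hypothetical name, and the write-to-a-temp-file-then-rename step is an addition of mine so that a request arriving mid-write never reads a half-written cache.

```ruby
# Hypothetical helper (not part of dilbert.cgi): return the contents of
# `path`, regenerating them with the block when the file is missing or
# older than `max_age` seconds.
def fetch_cached(path, max_age)
  fresh = File.exist?(path) && (Time.now - File.mtime(path)) <= max_age

  unless fresh
    # Write to a temporary file first, then rename it over the cache.
    # A rename within one filesystem replaces the old file in one step.
    tmp = "#{path}.#{Process.pid}.tmp"
    File.open(tmp, "w") { |f| f.write(yield) }
    File.rename(tmp, path)
  end

  File.read(path)
end
```

With something like this, the end of the script could become `print fetch_cached(CACHE_FILE, CACHE_LIFE * 60) { ... }`, where the block returns the generated feed as a string.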

If you don’t like .cgi in your URLs (I don’t), put this in a .htaccess file in the same directory:

.htaccess
RewriteEngine On
RewriteRule ^(dilbert)\.rss$    $1.cgi [L]
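
To get a feel for what that rule matches, the same pattern can be tried out in Ruby. This only mimics the regex and the `$1.cgi` substitution; it is not how Apache evaluates rules, and `rewrite` is a hypothetical helper. Ruby's `\A` and `\z` stand in for `^` and `$` here, since in Ruby `^` and `$` match at line boundaries rather than at the ends of the string.

```ruby
# The pattern from the RewriteRule, anchored to the whole string.
RULE = /\A(dilbert)\.rss\z/

# Hypothetical helper: return the rewritten path if the rule matches,
# otherwise return the path unchanged.
def rewrite(path)
  match = RULE.match(path)
  match ? "#{match[1]}.cgi" : path
end

puts rewrite("dilbert.rss")  # rewritten to the CGI script
puts rewrite("other.rss")    # no match, left alone
```

Only the exact name dilbert.rss is rewritten, so other files in the directory are unaffected.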