The Pug Automatic

Rake task to (re)index models for acts_as_solr

Written June 14, 2007. Tagged Ruby, Ruby on Rails, Rake.

I'm currently playing with Solr/acts_as_solr for a Rails project.

Alas, there doesn't seem to be any simple way to (re)index your models. If model objects are added or modified while the Solr server is running, the index is updated, but if you install acts_as_solr when you already have a bunch of data, you've got some work ahead of you.

You could loop over every object and run solr_save. A better idea is to run rebuild_solr_index on every model class. This method more or less amounts to running solr_save on each object and optimizing the index afterwards, though it can add items in batches to speed things up.

Better still would be to have this wrapped in a Rake task, so you can easily (re)index all acts_as_solr models without going into the console and processing the classes one by one.

This is solr_additions.rake. Stick it in lib/tasks inside your Rails project.

namespace :solr do

  desc %{Reindexes data for all acts_as_solr models. Clears index first to get rid of orphaned records and optimizes index afterwards. RAILS_ENV=your_env to set environment. ONLY=book,person,magazine to only reindex those models; EXCEPT=book,magazine to exclude those models. START_SERVER=true to solr:start before and solr:stop after. BATCH=123 to post/commit in batches of that size: default is 300. CLEAR=false to not clear the index first; OPTIMIZE=false to not optimize the index afterwards.}
  task :reindex => :environment do

    # Models named in ONLY, or every model file in app/models if ONLY is unset
    includes = env_array_to_constants('ONLY')
    if includes.empty?
      includes = Dir.glob("#{RAILS_ROOT}/app/models/*.rb").map { |path| File.basename(path, ".rb").camelize.constantize }
    end
    excludes = env_array_to_constants('EXCEPT')
    includes -= excludes

    optimize     = env_to_bool('OPTIMIZE', true)
    start_server = env_to_bool('START_SERVER', false)
    clear_first  = env_to_bool('CLEAR', true)
    batch_size   = ENV['BATCH'].to_i.nonzero? || 300

    if start_server
      puts "Starting Solr server..."
      Rake::Task["solr:start"].invoke
    end

    # Disable solr_optimize so it can run once at the end instead of per model
    module ActsAsSolr::CommonMethods
      def blank() end
      alias_method :deferred_solr_optimize, :solr_optimize
      alias_method :solr_optimize, :blank
    end

    # Only models that actually declare acts_as_solr
    models = includes.select { |m| m.respond_to?(:rebuild_solr_index) }
    models.each do |model|

      if clear_first
        puts "Clearing index for #{model}..."
        ActsAsSolr::Post.execute(Solr::Request::Delete.new(:query => "type_t:#{model}"))
      end

      puts "Rebuilding index for #{model}..."
      model.rebuild_solr_index(batch_size)

    end

    if models.empty?
      puts "There were no models to reindex."
    elsif optimize
      puts "Optimizing..."
      models.last.deferred_solr_optimize
    end

    if start_server
      puts "Shutting down Solr server..."
      Rake::Task["solr:stop"].invoke
    end

  end

  # "book, person" => [Book, Person]
  def env_array_to_constants(env)
    env = ENV[env] || ''
    env.split(/\s*,\s*/).map { |m| m.singularize.camelize.constantize }.uniq
  end

  # "true"/"false" in any case; anything else gives the default
  def env_to_bool(env, default)
    env = ENV[env] || ''
    case env
    when /^true$/i then true
    when /^false$/i then false
    else default
    end
  end

end

The description and code hopefully make things apparent. The simplest way to use it is just rake solr:reindex, which will (re)index all models with an acts_as_solr declaration inside them, and which assumes there is a Solr server already running.
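To make the flag handling a bit more concrete, here is a rough stand-alone sketch of how the ONLY, EXCEPT, OPTIMIZE, CLEAR and BATCH values get interpreted. It runs without Rails, so it stops at model names rather than turning them into classes. The names parse_list and parse_bool, and the env hash standing in for ENV, are made up for this illustration.

```ruby
# Stand-alone versions of the two helpers, with no Rails dependency.
# parse_list and parse_bool are illustrative names, not part of the task.

# Split a comma-separated value such as "book, person,magazine" into names.
def parse_list(value)
  (value || '').split(/\s*,\s*/).reject { |name| name.empty? }.uniq
end

# Interpret "true"/"false" (any case); anything else falls back to the default.
def parse_bool(value, default)
  case value.to_s
  when /\Atrue\z/i  then true
  when /\Afalse\z/i then false
  else default
  end
end

# A plain hash standing in for ENV (an assumption made for this example).
env = { 'ONLY' => 'book, person,magazine', 'EXCEPT' => 'magazine', 'OPTIMIZE' => 'FALSE' }

selected   = parse_list(env['ONLY']) - parse_list(env['EXCEPT'])
optimize   = parse_bool(env['OPTIMIZE'], true)
clear      = parse_bool(env['CLEAR'], true)      # unset, so the default applies
batch_size = env['BATCH'].to_i.nonzero? || 300   # unset or 0 falls back to 300

puts selected.inspect   # ["book", "person"]
puts optimize           # false
puts clear              # true
puts batch_size         # 300
```

Note that an unrecognized value such as OPTIMIZE=yes quietly falls back to the default rather than raising an error.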

Since it depends on the "environment" task, you can also use e.g. RAILS_ENV=production to set what environment it applies to. I think there are gotchas related to using acts_as_solr against multiple environments, though.

Update 2007-06-18

The entry has been updated to take advantage of the batch processing support in acts_as_solr 0.9. Batched reindexing is several times faster, since the overhead for posting and indexing additions one by one really adds up.
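As a toy illustration of where the savings come from, the sketch below counts round trips for one-by-one versus batched posting. CountingIndex is a hypothetical stand-in that only counts requests; it is not part of acts_as_solr and says nothing about how the plugin batches internally.

```ruby
# CountingIndex is a hypothetical stand-in for a search server. It only
# counts how many requests it receives.
class CountingIndex
  attr_reader :requests, :documents

  def initialize
    @requests = 0
    @documents = []
  end

  # One call models one round trip, however many documents it carries.
  def post(docs)
    @requests += 1
    @documents.concat(docs)
  end
end

records = (1..1000).to_a

one_by_one = CountingIndex.new
records.each { |record| one_by_one.post([record]) }

batched = CountingIndex.new
records.each_slice(300) { |batch| batched.post(batch) }

puts one_by_one.requests   # 1000
puts batched.requests      # 4 (300 + 300 + 300 + 100)
```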

Update 2007-06-19

Now takes an OPTIMIZE flag that defaults to true. Optimizing the index is recommended "once following large batch-like updates and/or once a day".
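The task only optimizes once because it swaps solr_optimize for a no-op with alias_method and calls the saved original at the very end. Here is a minimal sketch of that trick in isolation. Indexer and its methods are hypothetical stand-ins, and the sketch patches a class where the task patches a module.

```ruby
# Hypothetical stand-in for a model class with an expensive optimize step.
class Indexer
  class << self
    attr_accessor :optimize_calls
  end
  self.optimize_calls = 0

  def optimize
    self.class.optimize_calls += 1
  end

  def rebuild
    # ... indexing would happen here ...
    optimize
  end
end

# Reopen the class: keep the real method under a new name, then point the
# original name at a no-op so rebuild no longer triggers it.
class Indexer
  def noop; end
  alias_method :deferred_optimize, :optimize
  alias_method :optimize, :noop
end

indexers = [Indexer.new, Indexer.new, Indexer.new]
indexers.each { |indexer| indexer.rebuild }
calls_during_rebuild = Indexer.optimize_calls

# Run the real optimize exactly once, after all rebuilds are done.
indexers.last.deferred_optimize

puts calls_during_rebuild     # 0
puts Indexer.optimize_calls   # 1
```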

Update 2007-07-04

Now takes a CLEAR flag that defaults to true. Clearing means emptying the index before reindexing. rebuild_solr_index does not do this by default, which means it will not get rid of orphaned records: items that are indexed but no longer in the database.
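A tiny model of the orphan problem, using a Hash as a pretend index and an Array of ids as a pretend database. Nothing here touches Solr; the rebuild lambda is a hypothetical add-or-overwrite step, which is roughly what reindexing without clearing amounts to.

```ruby
# A Hash stands in for the search index, an Array of ids for the database.
index    = { 1 => 'Book 1', 2 => 'Book 2', 3 => 'Book 3' }
database = [1, 3]   # record 2 has since been deleted

# Hypothetical rebuild that only adds or overwrites, never deletes.
rebuild = lambda do |idx|
  database.each { |id| idx[id] = "Book #{id}" }
  idx
end

without_clear = rebuild.call(index.dup)   # reindex on top of the old index
with_clear    = rebuild.call({})          # start from an emptied index

puts without_clear.keys.sort.inspect   # [1, 2, 3] (record 2 is orphaned)
puts with_clear.keys.sort.inspect      # [1, 3]
```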