Rake task to (re)index models for acts_as_solr

Written . Tagged Rake, Ruby, Ruby on Rails.

I’m currently playing with Solr/acts_as_solr for a Rails project.

Alas, there doesn’t seem to be any simple way to (re)index your models. If model objects are added or modified while the Solr server is running, the index is updated, but if you install acts_as_solr when you already have a bunch of data, you’ve got some work ahead of you.

You could loop over every object and run solr_save. A better idea is to run rebuild_solr_index on every model class. This method more or less amounts to running solr_save on each object and optimizing the index afterwards, though it can add items in batches to speed things up.
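To get a feel for why batching helps, here is a minimal sketch in plain Ruby. It does not touch acts_as_solr at all: FakeIndex is a hypothetical stand-in that just counts how many posts it receives, on the assumption that each post costs one round trip to the Solr server.

```ruby
# Hypothetical stand-in for a search index (not part of acts_as_solr).
# It only counts posts, treating each one as a round trip to the server.
class FakeIndex
  attr_reader :posts, :documents

  def initialize
    @posts = 0
    @documents = []
  end

  def post(docs)
    @posts += 1
    @documents.concat(docs)
  end
end

records = (1..1000).to_a

# One post per record, like calling solr_save on each object.
one_by_one = FakeIndex.new
records.each { |record| one_by_one.post([record]) }

# One post per slice of 300 records, like a batched rebuild.
batched = FakeIndex.new
records.each_slice(300) { |slice| batched.post(slice) }

puts "one by one: #{one_by_one.posts} posts"  # 1000
puts "batched:    #{batched.posts} posts"     # 4
```

Both approaches end up with the same documents in the index; only the number of round trips differs.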

Better still would be to have this wrapped in a Rake task, so you can easily (re)index all models that act_as_solr without going into the console and processing the classes one by one.

This is solr_additions.rake. Stick it in lib/tasks inside your Rails project.

namespace :solr do

  desc %{Reindexes data for all acts_as_solr models. Clears index first to get rid of orphaned records and optimizes index afterwards. RAILS_ENV=your_env to set environment. ONLY=book,person,magazine to only reindex those models; EXCEPT=book,magazine to exclude those models. START_SERVER=true to solr:start before and solr:stop after. BATCH=123 to post/commit in batches of that size: default is 300. CLEAR=false to not clear the index first; OPTIMIZE=false to not optimize the index afterwards.}
  task :reindex => :environment do

    includes = env_array_to_constants('ONLY')
    if includes.empty?
      includes = Dir.glob("#{RAILS_ROOT}/app/models/*.rb").map { |path| File.basename(path, ".rb").camelize.constantize }
    end
    excludes = env_array_to_constants('EXCEPT')
    includes -= excludes

    optimize     = env_to_bool('OPTIMIZE',     true)
    start_server = env_to_bool('START_SERVER', false)
    clear_first  = env_to_bool('CLEAR',        true)
    batch_size   = ENV['BATCH'].to_i.nonzero? || 300

    if start_server
      puts "Starting Solr server..."
      Rake::Task["solr:start"].invoke
    end

    # Disable solr_optimize
    module ActsAsSolr::CommonMethods
      def blank() end
      alias_method :deferred_solr_optimize, :solr_optimize
      alias_method :solr_optimize, :blank
    end

    models = includes.select { |m| m.respond_to?(:rebuild_solr_index) }
    models.each do |model|

      if clear_first
        puts "Clearing index for #{model}..."
        ActsAsSolr::Post.execute(Solr::Request::Delete.new(:query => "type_t:#{model}"))
      end

      puts "Rebuilding index for #{model}..."
      model.rebuild_solr_index(batch_size)

    end

    if models.empty?
      puts "There were no models to reindex."
    elsif optimize
      puts "Optimizing..."
      models.last.deferred_solr_optimize
    end

    if start_server
      puts "Shutting down Solr server..."
      Rake::Task["solr:stop"].invoke
    end

  end

  def env_array_to_constants(env)
    env = ENV[env] || ''
    env.split(/\s*,\s*/).map { |m| m.singularize.camelize.constantize }.uniq
  end

  def env_to_bool(env, default)
    env = ENV[env] || ''
    case env
      when /^true$/i then true
      when /^false$/i then false
      else default
    end
  end

end

The description and code hopefully make things apparent. The simplest way to use it is just rake solr:reindex, which will (re)index all models with an acts_as_solr declaration inside them, and which assumes there is a Solr server already running.
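If you want to see how the flags resolve without running the task, the parsing can be sketched in plain Ruby. The helpers below are hypothetical (flag and batch_size are names I made up for illustration) and take a hash instead of reading ENV directly, but they follow the same rules as the task: anything other than true or false falls back to the default, and a missing or non-numeric BATCH falls back to 300.

```ruby
# Hypothetical helper mirroring env_to_bool, but reading from a hash
# so it can be tried without setting real environment variables.
def flag(env, name, default)
  case env[name].to_s
  when /\Atrue\z/i then true
  when /\Afalse\z/i then false
  else default
  end
end

# Hypothetical helper mirroring the BATCH handling in the task.
# nil.to_i and "abc".to_i are both 0, and 0.nonzero? is nil,
# so missing or junk values fall through to the default.
def batch_size(env, default = 300)
  env['BATCH'].to_i.nonzero? || default
end

env = { 'OPTIMIZE' => 'false', 'BATCH' => 'abc' }

puts flag(env, 'OPTIMIZE', true)      # false (explicitly switched off)
puts flag(env, 'CLEAR', true)         # true  (not set, default wins)
puts batch_size(env)                  # 300   (junk value, default wins)
puts batch_size('BATCH' => '50')      # 50
```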

Since it depends on the “environment” task, you can also use e.g. RAILS_ENV=production to set what environment it applies to. I think there are gotchas related to using acts_as_solr against multiple environments, though.

Update 2007-06-18
The entry has been updated to take advantage of the batch processing support in acts_as_solr 0.9. Batched reindexing is several times faster, since the overhead for posting and indexing additions one by one really adds up.
Update 2007-06-19
Now takes an OPTIMIZE flag that defaults to true. Optimizing the index is recommended “once following large batch-like updates and/or once a day”.
Update 2007-07-04
Now takes a CLEAR flag that defaults to true. Clearing means emptying the index before reindexing. rebuild_solr_index does not do this by default, which means it will not get rid of orphaned records: items that are indexed but no longer in the database.
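The orphan problem is easy to picture with a toy model. This is only an illustration using plain hashes, not how acts_as_solr stores anything; the reindex method and the sample data are made up for the example.

```ruby
# Toy model: the "database" and the "index" are both hashes keyed by id.
database = { 1 => "Dune", 2 => "Emma" }

# Record 3 was deleted from the database while Solr wasn't looking,
# so it is still sitting in the index.
index = { 1 => "Dune", 2 => "Emma", 3 => "Ulysses" }

# Hypothetical reindex: optionally empty the index, then add every
# record that currently exists in the database.
def reindex(index, database, clear)
  result = clear ? {} : index.dup
  database.each { |id, title| result[id] = title }
  result
end

without_clear = reindex(index, database, false)
with_clear    = reindex(index, database, true)

puts without_clear.keys.sort.inspect  # [1, 2, 3] (the orphan survives)
puts with_clear.keys.sort.inspect     # [1, 2]
```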