out of time

21 July 2009

Sunspot 0.9 Released

If you haven’t read the front page of this morning’s Times, then you heard it first here: Sunspot 0.9 is out. Here’s what I wrote about the upcoming version in my last post about Sunspot, on the occasion of the 0.8 release:

Sunspot 0.9 is up next; the main goal for that version is to replace solr-ruby with RSolr as the low-level Solr interface, which will open the door to more features in future versions (query-based faceting, LocalSolr support, etc.), but probably won’t have much effect on the API for that version (other than supporting use of the faster Curb library for the HTTP communication with Solr).

Turns out that was completely wrong: 0.9 introduces lots and lots of new features, inspired by requests from users, anticipated needs in my company’s application, and a close reading of the Solr wiki to find out more about what it’s capable of. Read on for the juicy details.

But first, this post is really long, so here’s the first table of contents I’ve ever put in a blog post:

If this is the first you’ve heard of Sunspot, I’d recommend checking out the home page and the README before reading on.

The new version introduces several improvements to how fulltext search is performed, giving you a lot more control over how it works and how relevance is calculated.

Dismax queries

Fulltext search in Sunspot 0.9 is performed using Solr’s dismax handler, an awesome feature that I had managed to be unaware of until fairly recently. You can read all about it in the Solr API docs, but the upshot is that Solr parses fulltext queries under the assumption that they are coming from user input. It provides a circumscribed subset of the usual Lucene query syntax: in particular, well-matched quotes can be used to demarcate a phrase, and the +/- modifiers work as usual. All other Lucene query syntax is escaped, and non-well-matched quotation marks are ignored.
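To make the "user-input-safe" idea concrete, here is a rough Ruby sketch of the behaviour described above. It is not Solr's dismax parser, and the method name and the "even number of quotes" test for well-matched quotes are simplifying assumptions made for illustration only.

```ruby
# Illustration only: NOT Solr's dismax implementation.
# Characters and operators with special meaning in Lucene query syntax,
# deliberately excluding +, - and the double quote.
LUCENE_SPECIALS = /[\\!(){}\[\]^~*?:]|&&|\|\|/

def sanitize_user_query(input)
  # Simplifying assumption: quotes are "well matched" if there is an even
  # number of them; otherwise they are dropped entirely.
  text = input.count('"').even? ? input : input.delete('"')
  # Backslash-escape everything else that Lucene would treat as syntax.
  text.gsub(LUCENE_SPECIALS) { |match| "\\#{match}" }
end

sanitize_user_query('"deep dish" +pizza -anchovies') #=> "\"deep dish\" +pizza -anchovies"
sanitize_user_query('pizza "deep dish')              #=> "pizza deep dish"
```

Phrases and the +/- modifiers pass through untouched, while anything like a field prefix or a parenthesis is treated as literal text.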

As well as providing user-input-safe query parsing, the use of dismax queries opens up a few more features. Read on.

Field and document boosting

Probably the most requested feature for Sunspot is boosting. Sunspot now supports boosting at both the document level and the field level. Document boosts can be dynamic (i.e., evaluate a method or block for each indexed object to determine the boost) or static; field boosts are always static.

Some examples:

Sunspot.setup(Post) do
  boost 2.0 # All Posts will have a document-level boost of 2.0
  text :title, :boost => 1.5 # The title field will have a boost of 1.5
  text :body # Body will have the default boost of 1.0
end

Sunspot.setup(User) do
  boost do # featured users get a big boost
    if featured?
      2.0
    else
      0.75
    end
  end
end
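For a feel of how a static boost and a dynamic boost can share one code path, here is a small standalone sketch. The names SketchUser and resolve_boost are hypothetical and this is not Sunspot's internal code; it only shows the general technique of evaluating a block in the context of each indexed object.

```ruby
# Hypothetical stand-in for a model being indexed.
SketchUser = Struct.new(:featured) do
  def featured?
    featured
  end
end

# Resolve a boost that may be a block, a method name, or a plain number.
def resolve_boost(boost, object)
  if boost.respond_to?(:call)
    object.instance_exec(&boost) # run the block with the object as self
  elsif boost.is_a?(Symbol)
    object.send(boost)           # call the named method on the object
  else
    boost                        # static boost: use the value as given
  end
end

dynamic = proc { featured? ? 2.0 : 0.75 }
resolve_boost(dynamic, SketchUser.new(true))  #=> 2.0
resolve_boost(dynamic, SketchUser.new(false)) #=> 0.75
resolve_boost(2.0, SketchUser.new(false))     #=> 2.0
```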

In Sunspot 0.8, fulltext search always searched all of the text fields. In 0.9, you can specify which fields you’d like to search:

Sunspot.search(Post) do
  keywords 'pizza restaurant', :fields => [:title, :abstract]
end

If you don’t specify which fields to search, the search will of course apply to all indexed text fields. Note that when searching for multiple types, the set of available text fields is the union of text fields configured for the types under search, not the intersection as in attribute field search.

Index multiple values in text fields

Sunspot 0.8 didn’t allow the indexing of multiple values for text fields. In 0.9, all text fields allow multiple values. The usual reason to disallow multiple values is that multi-valued fields cannot be used for sorting, but sorting by tokenized text fields is nonsense anyway. So this is fine:

Sunspot.setup(Post) do
  text :comment_bodies do
    comments.map { |comment| comment.body }
  end
end

Search API

The new release also adds several enhancements to the general search API, increasing the information available from results as well as enhancing the power and ease of use of building queries.

It’s a hit

The Search class now implements the #hits method, which returns objects encapsulating result data coming directly from Solr. #hits is an enhanced version of the #raw_results method available in 0.8; #raw_results is still aliased and the objects returned are backward-compatible.

As in 0.8, Hit objects give access to the class name and primary key of the result object. They also give access to the keyword relevance score, if they’re coming from a keyword search. You can call #instance to load the actual result instance - the first time you call that method on a Hit, all the Hit objects will have their instances populated, so don’t worry about losing batch data retrieval.

Finally, Hit objects give access to stored fields, another new feature in v0.9. Stored fields can be configured in the indexer setup:

Sunspot.setup(Post) do
  string :title, :stored => true
end

Then here’s how to get data out of the Hit object:

search = Sunspot.search(Post) { keywords 'pizza' }
hit = search.hits.first
hit.class_name #=> "Post"
hit.primary_key #=> "12"
hit.score #=> 8.27
hit.stored(:title) #=> "Best pizza joints in town"
hit.instance #=> #<Post:0xb7d4c0d0>

Stored fields are most useful if you store a few crucial fields that you’d like to be able to display without making the round trip to persistent storage to retrieve the data.
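The batch-loading behaviour of #instance mentioned earlier (the first call populates every hit) can be pictured with a sketch like the following. The classes SketchHit and SketchResultSet and the loader lambda are made up for illustration; Sunspot's real classes are organised differently.

```ruby
class SketchHit
  attr_reader :primary_key
  attr_writer :instance

  def initialize(primary_key, result_set)
    @primary_key = primary_key
    @result_set = result_set
  end

  # The first call on any hit asks the result set to load everything.
  def instance
    @result_set.populate_instances unless defined?(@instance)
    @instance
  end
end

class SketchResultSet
  attr_reader :hits, :load_count

  def initialize(primary_keys, loader)
    @loader = loader
    @load_count = 0 # how many times persistent storage was queried
    @hits = primary_keys.map { |key| SketchHit.new(key, self) }
  end

  # One query for all hits, then hand each hit its record.
  def populate_instances
    @load_count += 1
    records = @loader.call(@hits.map(&:primary_key))
    @hits.each { |hit| hit.instance = records[hit.primary_key] }
  end
end

# A Hash stands in for the database in this sketch.
STORE = { '12' => 'Post 12', '15' => 'Post 15' }
loader = ->(keys) { STORE.select { |key, _| keys.include?(key) } }

results = SketchResultSet.new(%w[12 15], loader)
results.hits.first.instance #=> "Post 12"
results.hits.last.instance  #=> "Post 15"
results.load_count          #=> 1
```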

Smarter shorthand restrictions

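Sunspot 0.9 expands the types that can be passed as a value into the short-form #with method: a single value matches by equality, an Array matches any of its elements, and a Range matches any value between its endpoints.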

For example:

Sunspot.search(Post) { with(:blog_id, 1) } # Find all posts with blog_id 1
Sunspot.search(Post) { with(:category_ids, [1, 3, 5]) } # Find all posts whose
                                                        # category_ids include
                                                        # 1, 3, or 5
Sunspot.search(Post) { with(:average_rating, 3.0..5.0) } # Find all posts whose
                                                         # average rating is
                                                         # between 3.0 and 5.0
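As a mental model for the shorthand, each value type maps naturally onto a piece of Lucene query syntax. The helper below is a hypothetical sketch, not Sunspot code; real indexed field names and value encoding differ from the bare names used here.

```ruby
# Hypothetical helper: turn a shorthand value into a Lucene-style clause.
def shorthand_filter(field, value)
  case value
  when Range then "#{field}:[#{value.begin} TO #{value.end}]" # between
  when Array then "#{field}:(#{value.join(' OR ')})"          # any of
  else            "#{field}:#{value}"                         # equal to
  end
end

shorthand_filter(:blog_id, 1)                #=> "blog_id:1"
shorthand_filter(:category_ids, [1, 3, 5])   #=> "category_ids:(1 OR 3 OR 5)"
shorthand_filter(:average_rating, 3.0..5.0)  #=> "average_rating:[3.0 TO 5.0]"
```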

Have your cake OR (eat it too AND enjoy it)

The query DSL now supports the #any_of and #all_of methods, which group the enclosed restrictions into disjunctions and conjunctions respectively. One good use case is if you have an expiry time field; you’d like to get results whose expiry is either in the future, or nil:

Sunspot.search(Post) do
  any_of do
    with(:expires_at).greater_than(Time.now)
    with(:expires_at, nil)
  end
end

If you’d like to AND together restrictions inside an OR, you can nest an #all_of block:

Sunspot.search(Post) do
  any_of do
    with(:average_rating).greater_than(3.0)
    all_of do
      with(:featured, true)
      with(:published_at).greater_than(Time.now - 2.weeks)
    end
  end
end

Note that using #all_of at the top level of a query block is a no-op, since query restrictions are already combined using AND semantics.
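To see why a top-level #all_of changes nothing, it can help to look at a toy version of the nesting. The Junction class below is a hypothetical sketch that joins placeholder clause strings; it is not how Sunspot builds its queries internally.

```ruby
# Toy boolean-query builder: clauses are joined with AND or OR.
class Junction
  def initialize(operator)
    @operator = operator
    @clauses = []
  end

  def with(clause)
    @clauses << clause
    self
  end

  def any_of(&block)
    nest('OR', &block)
  end

  def all_of(&block)
    nest('AND', &block)
  end

  def to_s
    # A group with a single clause needs no parentheses of its own.
    return @clauses.first.to_s if @clauses.size == 1
    "(#{@clauses.join(" #{@operator} ")})"
  end

  private

  def nest(operator, &block)
    child = Junction.new(operator)
    child.instance_eval(&block) # run the block with the child as self
    @clauses << child
    self
  end
end

query = Junction.new('AND') # the top level already uses AND semantics
query.any_of do
  with('average_rating:{3.0 TO *}')
  all_of do
    with('featured:true')
    with('published_at:{NOW-14DAYS TO *}')
  end
end
query.to_s
#=> "(average_rating:{3.0 TO *} OR (featured:true AND published_at:{NOW-14DAYS TO *}))"
```

Wrapping two clauses in a top-level all_of produces the same string as adding them directly, which is the no-op described above.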

Random ordering

By popular request, Sunspot now supports random ordering, which makes use of Solr’s RandomSortField:

Sunspot.search(Post) do
  order_by_random
end

Faceting

One of the biggest and most exciting changes in the new release is far fuller support for Solr’s faceting capabilities. While 0.8 supported basic field facets, I think it’s safe to say that 0.9 supports pretty much all of Solr’s built-in faceting features.

More facet control

The call to #facet inside the query DSL now takes the following options:

:sort
How the facet rows should be sorted. Options are :count, which orders by the number of results matching the row's value, and :index, which sorts the values lexically.
:limit
Maximum number of facet rows to return.
:minimum_count
The minimum count a facet row must have to be returned.
:zeros
Whether to return facet rows that match no documents in the scope. Default is false; setting to true is the same as setting :minimum_count to 0.

So, for example:

Sunspot.search(Post) do
  facet :author_name, :sort => :index
  facet :category_ids, :sort => :count, :limit => 5
end

Time Facets

Solr has special support for faceting over a time range, with the range divided into rows of a given interval. The new release adds an API for this type of facet; simply provide the :time_range key to use this type of faceting. Note that time faceting only works with time type fields - Sunspot will fail fast if you try to use it with another field type.

Available options for time faceting are:

:time_range
A Range object of Times. This is the full range over which times are returned. Specifying this option also enables time faceting.
:time_interval
Interval that each row should cover, in seconds. The default is 1 day.
:time_other
Times outside the range that should be returned as facet rows. Allowed values are :before, :after, :between, :none, and :all. The default is :none.

For example:

Sunspot.search(Post) do
  facet :published_at, :time_range => 1.year.ago..Time.now,
                       :time_interval => 1.month
end

This will return facets covering each month that a publish date can fall into, for the last year. The facet rows returned in the results will have Range values containing the Time range for that particular row.
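To picture how a range and an interval turn into rows, here is a standalone sketch that computes the buckets in plain Ruby. The method name is hypothetical, and whether the real rows use inclusive or exclusive end points is an assumption here; the actual bucketing is done by Solr, not by code like this.

```ruby
# Split a Range of Times into consecutive rows of `interval` seconds.
# Assumption: rows are modelled as end-exclusive Ranges.
def time_facet_rows(range, interval)
  rows = []
  start = range.begin
  while start < range.end
    rows << (start...(start + interval))
    start += interval
  end
  rows
end

day = 24 * 60 * 60
rows = time_facet_rows(Time.utc(2009, 7, 1)...Time.utc(2009, 7, 4), day)
rows.size        #=> 3
rows.first.begin #=> 2009-07-01 00:00:00 UTC
rows.last.end    #=> 2009-07-04 00:00:00 UTC
```

If the interval does not divide the range evenly, the last row in this sketch simply runs past the end of the range.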

See the Solr Wiki for more information on date faceting.

Query facets

Field and date facets are useful, but the real ultimate power lies in Solr’s query faceting. This allows you to specify an arbitrary set of conditions for each row, making the possibilities pretty much endless. Sunspot 0.9 supports building query facets using the same DSL that is used for building normal search scope:

search = Sunspot.search(Post) do
  facet :rating_ranges do
    row 1.0..2.0 do
      with :average_rating, 1.0..2.0
    end
    row 2.0..3.0 do
      with :average_rating, 2.0..3.0
    end
    row 3.0..4.0 do
      with :average_rating, 3.0..4.0
    end
    row 4.0..5.0 do
      with :average_rating, 4.0..5.0
    end
  end
end

A few things to point out about the above. First, the concept of grouping the various rows into a single “facet” is introduced by Sunspot; Solr itself simply accepts an undifferentiated set of query facets, with no grouping. I decided to introduce the grouping as it seems more intuitive to me, and helps keep the API consistent when retrieving facets from the search results. Also, the arguments to the #facet and #row methods are not passed on to Solr; they’re simply there to make it easy to make sense of the results. In particular, the argument passed to #facet should be a symbol, and it’s used to retrieve the facet from the Search#facet method. The argument to #row can be whatever you like; it becomes the #value associated with that facet row in the results. So, in the results from the previous example, we’d see:

ratings_facet = search.facet(:rating_ranges)
ratings_facet.rows.first.value #=> 3.0..4.0

Note that the field facet options aren’t supported by query facets; they’re always ordered by count, zeros are always returned, and there’s no limit. If there’s demand, I’d be happy to support those options in a post-processing stage in a later version.
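In the meantime, that kind of post-processing is easy enough to do on the caller's side. The sketch below uses a hypothetical FacetRow struct as a stand-in for the real row objects, assuming only that each row exposes a value and a count.

```ruby
# Stand-in for a facet row; only #value and #count are assumed.
FacetRow = Struct.new(:value, :count)

# Drop rows below a minimum count, order by count, and apply a limit.
def trim_rows(rows, limit: nil, minimum_count: 0)
  kept = rows.select { |row| row.count >= minimum_count }
  kept = kept.sort_by { |row| -row.count }
  limit ? kept.first(limit) : kept
end

rows = [
  FacetRow.new(1.0..2.0, 2),
  FacetRow.new(2.0..3.0, 0),
  FacetRow.new(3.0..4.0, 9),
  FacetRow.new(4.0..5.0, 5)
]
trim_rows(rows, limit: 2, minimum_count: 1).map(&:value)
#=> [3.0..4.0, 4.0..5.0]
```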

Instantiated Facets

It’s common to index database foreign keys in Solr; the new release adds explicit recognition of that fact where faceting is concerned, allowing you to specify that a field references a particular class, and then populate the facet row with the instance referenced by the row’s value. Instantiated facets are lazy-loaded, but when you request any facet row’s instance, all of the instances for the facet’s rows are loaded, so batch loading is still taken advantage of.

To specify that a field references a persisted class, just add the :references option to the field definition:

Sunspot.setup(Post) do
  integer :blog_id, :references => Blog
end

Then when you facet by the :blog_id field, you’ll have access to the #instance method on the rows:

search = Sunspot.search(Post) do
  facet :blog_id
end
search.facet(:blog_id).rows.first.instance #=> #<Blog:0xb7e1cd0c>

Facet by class

If you’re performing a search on multiple object types, you may want to facet based on the class of the documents. Sunspot now adds the :class field to all index setups, and allows faceting on it. The facet row values are Class objects:

search = Sunspot.search(Post, Comment) do
  keywords 'great pizza'
  facet :class
end
search.facet(:class).rows.first.value #=> Post

New features that don’t fit into a group

Batch indexing

In my company’s production application, we perform complex operations that initiate Solr indexing from disparate places within the application code. However, it’s more efficient to send all adds/updates as part of a single request; the Sunspot.batch method makes that simple:

Sunspot.batch do
  Sunspot.index(Post.new)
  Sunspot.index(Comment.new)
  Sunspot.index(User.new)
end

When the batch block exits, Sunspot will send all of the indexed documents in a single HTTP request.
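The idea behind batching is simple enough to sketch in a few lines. SketchIndexer below is a hypothetical stand-in that records "requests" in an array instead of talking to Solr; it does not handle nested batches and is not Sunspot's implementation.

```ruby
class SketchIndexer
  attr_reader :requests

  def initialize
    @requests = [] # each entry stands for one HTTP request to Solr
    @batch = nil
  end

  # Inside a batch, queue the document; otherwise send it straight away.
  def index(document)
    if @batch
      @batch << document
    else
      send_request([document])
    end
  end

  # Collect everything indexed inside the block, then send it at once.
  def batch
    @batch = []
    yield
    send_request(@batch) unless @batch.empty?
  ensure
    @batch = nil
  end

  private

  def send_request(documents)
    @requests << documents
  end
end

indexer = SketchIndexer.new
indexer.batch do
  indexer.index(:post)
  indexer.index(:comment)
  indexer.index(:user)
end
indexer.requests #=> [[:post, :comment, :user]]
```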

Date field type

Java doesn’t have a built-in type that contains date information without time information, like Ruby’s Date does; neither does Solr. For convenience, the new release creates a new date type, which indexes Ruby Date objects. Internally, the dates are stored as a time, with the time portion at midnight UTC. Facet values and stored values are returned as Ruby Date objects as expected.
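The round trip between a Ruby Date and a midnight-UTC timestamp looks roughly like this. The two helper names are hypothetical; the timestamp format shown is the ISO 8601 form that Solr uses for date values.

```ruby
require 'date'

# Represent a Date as midnight UTC on that day, formatted for Solr.
def date_to_solr(date)
  Time.utc(date.year, date.month, date.day).strftime('%Y-%m-%dT%H:%M:%SZ')
end

# Recover the Date by reading only the date portion of the timestamp.
def solr_to_date(value)
  Date.iso8601(value[0, 10])
end

date_to_solr(Date.new(2009, 7, 21))  #=> "2009-07-21T00:00:00Z"
solr_to_date('2009-07-21T00:00:00Z') #=> #<Date: 2009-07-21>
```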

Access to data accessors

Let’s say you’re running a Solr search against objects that are persisted with ActiveRecord; wouldn’t it be nice to be able to specify :include arguments for the database query? Toward this end, Sunspot now allows you to access the accessor for a given class from inside the query DSL; accessors can implement any methods they’d like to inform how data should be pulled from persistent storage.

For instance, let’s say your ActiveRecord adapter’s data accessor has an #includes= method, which tells it to pass the arguments into ActiveRecord’s :include option when performing the query. You can access that functionality like so:

Sunspot.search(Post, Comment) do
  adapter_for(Post).includes = [:blog, :comments]
end

Note that even if Post and Comment use the same adapter class (i.e., an ActiveRecord adapter), Sunspot will use a separate adapter instance for each, so you can safely set different options for each.

Easily configure your Solr installation with sunspot-configure-solr

While using the packaged Solr installation is great for development, I don’t recommend using it in production. The new release includes a new executable called sunspot-configure-solr, which writes a schema.xml file to the Solr installation of your choice, backing up the old schema.xml if it exists. sunspot-configure-solr includes a few options for areas where you can safely customize your schema:

--tokenizer
The tokenizer class to use for fulltext field tokenization; the default is solr.StandardTokenizerFactory
--extra-filters
Comma-separated list of extra filters to apply to fulltext fields. These will be applied after the default solr.StandardFilterFactory and solr.LowerCaseFilterFactory.
--dir
Solr home directory in which to install the schema file. This directory should contain a conf directory (it will be created if not). The default is the working directory from which the command is issued.

The tokenizer and filter classes can be specified with a shorthand: if the name passed is unqualified (i.e., doesn’t have any periods), it will be prefixed with “solr.” and suffixed with “TokenizerFactory” or “FilterFactory” respectively:

$ sunspot-configure-solr --dir /var/solr --tokenizer com.myapp.MyTokenizerFactory --extra-filters EnglishPorter,com.myapp.MyFilterFactory

This will set the tokenizer to com.myapp.MyTokenizerFactory and add the extra filters solr.EnglishPorterFilterFactory and com.myapp.MyFilterFactory. Note that more advanced Solr users will want to work with the schema file directly; just don’t change the naming scheme for the dynamic typed fields.
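The shorthand rule is small enough to express directly. The helper below is a hypothetical sketch of the expansion as described, not the executable's actual source.

```ruby
# Qualified names (containing a period) pass through unchanged;
# unqualified names get the "solr." prefix and the given suffix.
def expand_class_name(name, suffix)
  name.include?('.') ? name : "solr.#{name}#{suffix}"
end

expand_class_name('com.myapp.MyTokenizerFactory', 'TokenizerFactory')
#=> "com.myapp.MyTokenizerFactory"

filters = 'EnglishPorter,com.myapp.MyFilterFactory'.split(',').map do |name|
  expand_class_name(name, 'FilterFactory')
end
filters #=> ["solr.EnglishPorterFilterFactory", "com.myapp.MyFilterFactory"]
```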

RSolr replaces solr-ruby

solr-ruby has been the de facto low-level Solr interaction layer for several years; RSolr is a newer library that has several advantages over solr-ruby, among them support for the faster Curb library for HTTP communication with Solr.

Remove accidental ActiveSupport dependency

Sunspot 0.8 required WillPaginate into the spec suite by default, which in turn loaded ActiveSupport. Because of this, a few places in the code were inadvertently using ActiveSupport extensions, and the specs still passed even though they shouldn’t have. I modified the spec suite to only load WillPaginate if an environment variable is passed, and fixed the broken specs.

Toward the future

So, what’s next, you may wonder! Perhaps you have a few ideas of your own. Perhaps they are:

Highlighting

Solr supports keyword highlighting — this has never been a big priority for me but I have heard from other Sunspot users that it would be a nice thing to have, so I’m hoping to get support for that in a future version.

LocalSolr support

LocalSolr is a Solr extension that brings geographic searching to Solr; in particular, results can be restricted and sorted by distance from a given lat/long. Do want.

Query facet abstraction

I’ve just begun giving this thought, but it seems pretty clear from the query faceting example above that certain common use cases for query facets could be abstracted into a more concise API. For instance, wouldn’t it be nice to write that example as:

Sunspot.search(Post) do
  range_facet :average_rating, 1.0..2.0, 2.0..3.0, 3.0..4.0, 4.0..5.0
end

Install it.

$ sudo gem install outoftime-sunspot --source=http://gems.github.com

Be in touch.

My goal for Sunspot has always been for it to become the de facto Solr abstraction library for Rubyists. I’m always happy to get feature requests, bug reports, and especially patches.