out of time
21 July 2009
Sunspot 0.9 Released
If you haven’t read the front page of this morning’s Times, then you heard it first here: Sunspot 0.9 is out. Here’s what I wrote about the upcoming version in my last post about Sunspot, on the occasion of the 0.8 release:
Sunspot 0.9 is up next; the main goal for that version is to replace solr-ruby with RSolr as the low-level Solr interface, which will open the door to more features in future versions (query-based faceting, LocalSolr support, etc.), but probably won’t have much effect on the API for that version (other than supporting use of the faster Curb library for the HTTP communication with Solr).
Turns out that was completely wrong: 0.9 introduces lots and lots of new features, inspired by requests from users, anticipated needs in my company’s application, and a close reading of the Solr wiki to find out more about what Solr is capable of. Read on for the juicy details.
But first, this post is really long, so here’s the first table of contents I’ve ever put in a blog post:
- Dismax queries
- Field and document boosting
- Specifying fields for fulltext search
- Indexing multiple values in text fields
- Accessing keyword relevance score and stored field values
- Smarter shorthand restrictions
- Using disjunctions and conjunctions
- Random ordering
- More facet control
- Time range facets
- Get referenced objects from facets on foreign keys
- Facet by class
- Batch indexing
- New Date field type
- Direct access to data accessors
- Executable to configure production Solr instances
- RSolr is in; solr-ruby is out
- Sunspot no longer accidentally depends on ActiveSupport
- What to look for in future versions
- Installation
- Submit feature requests, bug reports, and patches
If this is the first you’ve heard of Sunspot, I’d recommend checking out the home page and the README before reading on.
A Better Fulltext Search
The new version introduces several improvements to how fulltext search is performed, giving you a lot more control over how it works and how relevance is calculated.
Dismax queries
Fulltext search in Sunspot 0.9 is performed using Solr’s dismax handler, an awesome feature that I had managed to be unaware of until fairly recently. You can read all about it in the Solr API docs, but the upshot is that Solr parses fulltext queries under the assumption that they are coming from user input. It provides a circumscribed subset of the usual Lucene query syntax: in particular, well-matched quotes can be used to demarcate a phrase, and the +/- modifiers work as usual. All other Lucene query syntax is escaped, and non-well-matched quotation marks are ignored.
As well as providing user-input-safe query parsing, the use of dismax queries opens up a few more features. Read on.
Field and document boosting
Probably the most requested feature for Sunspot is boosting. Sunspot now supports boosting at both the document level and the field level. Document boosts can be dynamic (i.e., evaluate a method or block for each indexed object to determine the boost) or static; field boosts are always static.
Some examples:
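As a rough illustration of what static and dynamic boosts mean at the Solr level, here is a minimal pure-Ruby sketch that builds a Solr update document with boost attributes on the document and on individual fields. The Post struct and the solr_doc helper are hypothetical stand-ins and are not Sunspot’s API; only the boost attributes on the doc and field elements come from Solr’s XML update format.

```ruby
# Hypothetical stand-in for a persisted model; not part of Sunspot.
Post = Struct.new(:title, :body, :featured)

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Builds a Solr <doc> element. document_boost may be a number (static) or a
# callable evaluated for each object (dynamic); field boosts are static.
def solr_doc(object, fields, document_boost: nil, field_boosts: {})
  boost = document_boost.respond_to?(:call) ? document_boost.call(object) : document_boost
  doc_attr = boost ? " boost=\"#{boost}\"" : ''
  field_xml = fields.map do |name|
    field_attr = field_boosts[name] ? " boost=\"#{field_boosts[name]}\"" : ''
    value = xml_escape(object[name])
    "<field name=\"#{name}\"#{field_attr}>#{value}</field>"
  end
  "<doc#{doc_attr}>#{field_xml.join}</doc>"
end

post = Post.new('Sunspot 0.9', 'Lots of new features', true)
puts solr_doc(post, [:title, :body],
              document_boost: ->(p) { p.featured ? 2.0 : 1.0 },
              field_boosts: { title: 1.5 })
```

The lambda plays the role of the block that is evaluated for each indexed object; passing a plain number instead gives a static document boost.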
Choose your fields for fulltext search
In Sunspot 0.8, fulltext search always searched all of the text fields. In 0.9, you can specify which fields you’d like to search:
If you don’t specify which fields to search, the search will of course apply to all indexed text fields. Note that when searching for multiple types, the set of available text fields is the union of text fields configured for the types under search, not the intersection as in attribute field search.
Index multiple values in text fields
Sunspot 0.8 didn’t allow the indexing of multiple values for text fields. In 0.9, all text fields allow multiple values. The main reason to disallow multiple values is that multi-valued fields cannot be used for sorting, but sorting by tokenized text fields is nonsense anyway. So this is fine:
Search API
The new release also adds several enhancements to the general search API, increasing the information available from results as well as enhancing the power and ease of use of building queries.
It’s a hit
The Search class now implements the #hits method, which returns Hit objects encapsulating result data coming directly from Solr. #hits is an enhanced version of the #raw_results method available in 0.8; #raw_results is still available as an alias, and the objects returned are backward-compatible.
As in 0.8, Hit objects give access to the class name and primary key of the result object. They also give access to the keyword relevance score, if they’re coming from a keyword search. You can call #instance to load the actual result instance - the first time you call that method on a Hit, all the Hit objects will have their instances populated, so don’t worry about losing batch data retrieval.
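The lazy-but-batched loading described above can be sketched in plain Ruby. This is only an illustration of the pattern under assumed names: HitSet, the RECORDS hash, and load_count are hypothetical, and the Hit class here is a toy, not Sunspot’s implementation.

```ruby
# Hypothetical in-memory store standing in for persistent storage.
RECORDS = { 1 => 'First post', 2 => 'Second post', 3 => 'Third post' }.freeze

class HitSet
  attr_reader :hits, :load_count

  def initialize(primary_keys)
    @hits = primary_keys.map { |key| Hit.new(key, self) }
    @load_count = 0
  end

  # A single batch lookup populates every hit at once.
  def populate_all
    @load_count += 1
    @hits.each { |hit| hit.instance = RECORDS[hit.primary_key] }
  end
end

class Hit
  attr_reader :primary_key
  attr_writer :instance

  def initialize(primary_key, hit_set)
    @primary_key = primary_key
    @hit_set = hit_set
  end

  # Loading is deferred until the first call, then done for all hits together.
  def instance
    @hit_set.populate_all unless instance_variable_defined?(:@instance)
    @instance
  end
end

set = HitSet.new([1, 2, 3])
puts set.hits.first.instance
puts set.hits.last.instance
puts set.load_count
```

However many hits are asked for their instance, the store is consulted once.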
Finally, Hit objects give access to stored fields, another new feature in v0.9. Stored fields can be configured in the indexer setup:
Then here’s how to get data out of the Hit object:
Stored fields are most useful if you store a few crucial fields that you’d like to be able to display without making the round trip to persistent storage to retrieve the data.
Smarter shorthand restrictions
Sunspot 0.9 expands the types that can be passed as a value into the short-form #with method:
- Passing a scalar value will scope to results where the field contains that value (this is not new).
- Passing an Array will scope to results where the field contains any of the values in the array.
- Passing a Range will scope to results where the field’s value is in the range.
For example:
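A minimal sketch of how the three cases could translate into Solr filter syntax follows. The restriction helper and the field names are hypothetical and are not how Sunspot builds its queries internally; value escaping and exclusive ranges are left out for brevity.

```ruby
# Hypothetical helper: renders one field restriction as a Solr filter fragment.
def restriction(field, value)
  case value
  when Range then "#{field}:[#{value.begin} TO #{value.end}]" # inclusive range
  when Array then "#{field}:(#{value.join(' OR ')})"          # any of the values
  else "#{field}:#{value}"                                    # exact scalar match
  end
end

puts restriction(:blog_id, 1)
puts restriction(:category_ids, [1, 3, 5])
puts restriction(:average_rating, 3.0..5.0)
```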
Have your cake OR (eat it too AND enjoy it)
The query DSL now supports the #any_of and #all_of methods, which group the enclosed restrictions into disjunctions and conjunctions respectively. One good use case is if you have an expiry time field; you’d like to get results whose expiry is either in the future, or nil:
If you’d like to AND together restrictions inside an OR, you can nest an #all_of block:
Note that using #all_of at the top level of a query block is a no-op, since query restrictions are already combined using AND semantics.
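To make the grouping semantics concrete, here is a small pure-Ruby sketch that composes already-rendered restriction strings into nested boolean expressions. The any_of and all_of functions below are hypothetical free functions operating on strings, not the DSL methods themselves, and the field names are made up.

```ruby
# Join clauses into a disjunction: at least one must match.
def any_of(*clauses)
  "(#{clauses.join(' OR ')})"
end

# Join clauses into a conjunction: every clause must match.
def all_of(*clauses)
  "(#{clauses.join(' AND ')})"
end

query = any_of(
  'expired_at:[NOW TO *]',
  all_of('featured:true', 'rating:[4 TO *]')
)
puts query
# → (expired_at:[NOW TO *] OR (featured:true AND rating:[4 TO *]))
```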
Random ordering
By popular request, Sunspot now supports random ordering, which makes use of Solr’s RandomSortField:
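As background on RandomSortField: the sort order is derived from the name of the field being sorted on, so varying the field name varies the order. The sketch below assumes the schema declares a dynamic field matching random_* with that type (as Solr’s example schema does); random_sort_param is a hypothetical helper, not a Sunspot method.

```ruby
# Builds a Solr sort parameter for a RandomSortField dynamic field.
# Assumes a schema entry along the lines of:
#   <dynamicField name="random_*" type="random" />
def random_sort_param(seed = rand(1 << 16), direction = :asc)
  "random_#{seed} #{direction}"
end

puts random_sort_param        # a different field name on each call, most of the time
puts random_sort_param(42)    # a fixed seed gives a repeatable order
```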
Faceting
One of the biggest and most exciting changes in the new release is far fuller support for Solr’s faceting capabilities. While 0.8 supported basic field facets, I think it’s safe to say that 0.9 supports pretty much all of Solr’s built-in faceting features.
More facet control
The call to #facet inside the query DSL now takes the following options:
- :sort: either :count, which orders by the number of results matching the row’s value, or :index, which sorts the values lexically.
- :limit
- :minimum_count
- :zeros
So, for example:
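One plausible mapping from these options onto Solr’s per-field facet parameters is sketched below. The helper name is hypothetical, and the interpretation of each option (in particular, treating :zeros as a minimum count of 0) is an assumption rather than a description of Sunspot’s internals. The count and index values for facet.sort are the ones accepted by Solr 1.4 and later.

```ruby
# Hypothetical translation of facet options into Solr request parameters.
def field_facet_params(field, options = {})
  params = { 'facet' => 'true', 'facet.field' => field.to_s }
  prefix = "f.#{field}.facet"
  params["#{prefix}.sort"] = options[:sort].to_s if options.key?(:sort)
  params["#{prefix}.limit"] = options[:limit] if options.key?(:limit)
  if options.key?(:minimum_count)
    params["#{prefix}.mincount"] = options[:minimum_count]
  elsif options.key?(:zeros)
    # Assumption: asking for zeros means rows with no matches are included.
    params["#{prefix}.mincount"] = options[:zeros] ? 0 : 1
  end
  params
end

p field_facet_params(:category_ids, sort: :count, limit: 5, zeros: true)
```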
Time Facets
Solr has special support for faceting over a time range, with a given interval to which rows should apply. The new release adds an API for this type of facet; simply provide the :time_range key to use this type of faceting. Note that time faceting only works with time type fields - Sunspot will fail fast if you try to use it with another field type.
Available options for time faceting are:
- :time_range
- :time_interval
- :time_other: one of :before, :after, :between, :none, and :all. The default is :none.
For example:
This will return facets covering each month that a publish date can fall into, for the last year. The facet rows returned in the results will have Range values containing the Time range for that particular row.
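To show what Range-valued rows look like, here is a small sketch that splits a time range into consecutive intervals. It assumes the interval is given in seconds and clamps the final row to the end of the range; both are choices made for this illustration, and time_facet_ranges is a hypothetical helper rather than part of Sunspot.

```ruby
# Splits time_range into consecutive Ranges of `interval` seconds each.
def time_facet_ranges(time_range, interval)
  ranges = []
  start = time_range.begin
  while start < time_range.end
    finish = [start + interval, time_range.end].min # clamp the last row
    ranges << (start..finish)
    start = finish
  end
  ranges
end

one_day = 24 * 60 * 60
week = Time.utc(2009, 7, 14)..Time.utc(2009, 7, 21)
time_facet_ranges(week, one_day).each do |row|
  puts "#{row.begin} to #{row.end}"
end
```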
See the Solr Wiki for more information on date faceting.
Query facets
Field and date facets are useful, but the real ultimate power lies in Solr’s query faceting. This allows you to specify an arbitrary set of conditions for each row, making the possibilities pretty much endless. Sunspot 0.9 supports building query facets using the same DSL that is used for building normal search scope:
A few things to point out about the above. First, the concept of grouping the various rows into a single “facet” is introduced by Sunspot; Solr itself simply accepts an undifferentiated set of query facets, with no grouping. I decided to introduce the grouping as it seems more intuitive to me, and helps keep the API consistent when retrieving facets from the search results. Also, the arguments to the #facet and #row methods are not passed on to Solr; they’re simply there to make it easy to make sense of the results. In particular, the argument passed to #facet should be a symbol, and it’s used to retrieve the facet from the Search#facet method. The argument to #row can be whatever you like; it becomes the #value associated with that facet row in the results. So, in the results from the previous example, we’d see:
Note that the field facet options aren’t supported by query facets; they’re always ordered by count, zeros are always returned, and there’s no limit. If there’s demand, I’d be happy to support those options in a post-processing stage in a later version.
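The grouping idea can be sketched in plain Ruby: each row contributes one facet.query parameter, and the counts Solr returns (keyed by query string in the facet_queries section of the response) are mapped back onto the row values. The QueryFacet class, the field name, and the counts below are hypothetical; only the facet.query parameter and the facet_queries response section are Solr’s.

```ruby
QueryFacetRow = Struct.new(:value, :count)

# Hypothetical grouping of several Solr query facets under one name.
class QueryFacet
  attr_reader :name

  def initialize(name)
    @name = name
    @queries = {} # row value => Solr query string
  end

  def row(value, query)
    @queries[value] = query
    self
  end

  # Solr only sees a flat list of facet.query parameters.
  def to_params
    { 'facet' => 'true', 'facet.query' => @queries.values }
  end

  # solr_counts mimics the facet_queries section of a Solr response.
  def rows(solr_counts)
    @queries
      .map { |value, query| QueryFacetRow.new(value, solr_counts.fetch(query, 0)) }
      .sort_by { |row| -row.count }
  end
end

facet = QueryFacet.new(:rating_ranges)
facet.row(1.0..2.0, 'average_rating:[1.0 TO 2.0]')
facet.row(2.0..3.0, 'average_rating:[2.0 TO 3.0]')

# Made-up counts, for illustration only.
counts = { 'average_rating:[1.0 TO 2.0]' => 2, 'average_rating:[2.0 TO 3.0]' => 7 }
facet.rows(counts).each { |row| puts "#{row.value}: #{row.count}" }
```

Rows come back ordered by count, and a row with no entry in the response is reported with a count of zero, matching the behavior described above.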
Instantiated Facets
It’s common to index database foreign keys in Solr; the new release adds explicit recognition of that fact where faceting is concerned, allowing you to specify that a field references a particular class, and then populate the facet row with the instance referenced by the row’s value. Instantiated facets are lazy-loaded, but when you request any facet row’s instance, all of the instances for the facet’s rows are loaded, so batch loading is still taken advantage of.
To specify that a field references a persisted class, just add the :references option to the field definition:
Then when you facet by the :blog_id field, you’ll have access to the #instance method on the rows:
Facet by class
If you’re performing a search on multiple object types, you may want to facet based on the class of the documents. Sunspot now adds the :class field to all index setups, and allows faceting on it. The facet row values are Class objects:
New features that don’t fit into a group
Batch indexing
In my company’s production application, we perform complex operations that initiate Solr indexing from disparate places within the application code. However, it’s more efficient to send all adds/updates as part of a single request; the Sunspot.batch method makes that simple:
When the batch block exits, Sunspot will send all of the indexed documents in a single HTTP request.
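The pattern behind this can be sketched with a toy indexer that queues documents while a batch block is open and flushes them once when it exits. BatchingIndexer and its requests array (each entry standing in for one HTTP request) are hypothetical and only illustrate the idea; this is not Sunspot’s implementation.

```ruby
class BatchingIndexer
  attr_reader :requests

  def initialize
    @requests = [] # each entry stands in for one HTTP request to Solr
    @queue = nil   # non-nil only while a batch block is running
  end

  # Outside a batch, every call is its own request; inside, it is queued.
  def index(*documents)
    if @queue
      @queue.concat(documents)
    else
      @requests << documents
    end
  end

  # Collects everything indexed inside the block into a single request.
  def batch
    @queue = []
    yield
    @requests << @queue unless @queue.empty?
  ensure
    @queue = nil
  end
end

indexer = BatchingIndexer.new
indexer.index('post 1')
indexer.batch do
  indexer.index('post 2')
  indexer.index('post 3', 'post 4')
end
p indexer.requests
# → [["post 1"], ["post 2", "post 3", "post 4"]]
```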
Date field type
Java doesn’t have a built-in type that contains date information without time information, like Ruby’s Date does; neither does Solr. For convenience, the new release creates a new date type, which indexes Ruby Date objects. Internally, the dates are stored as a time, with the time portion at midnight UTC. Facet values and stored values are returned as Ruby Date objects as expected.
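A minimal sketch of that round trip, assuming Solr’s usual ISO 8601 date format with a trailing Z. The two helper names are hypothetical and are not Sunspot’s own type conversion methods.

```ruby
require 'date'

# Date -> Solr date string, with the time portion at midnight UTC.
def date_to_solr(date)
  Time.utc(date.year, date.month, date.day).strftime('%Y-%m-%dT%H:%M:%SZ')
end

# Solr date string -> Date, keeping only the YYYY-MM-DD prefix.
def solr_to_date(value)
  Date.strptime(value[0, 10], '%Y-%m-%d')
end

indexed = date_to_solr(Date.new(2009, 7, 21))
puts indexed               # → 2009-07-21T00:00:00Z
puts solr_to_date(indexed) # → 2009-07-21
```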
Access to data accessors
Let’s say you’re running a Solr search against objects that are persisted with ActiveRecord; wouldn’t it be nice to be able to specify :include arguments for the database query? Toward this end, Sunspot now allows you to access the accessor for a given class from inside the query DSL; accessors can implement any methods they’d like to inform how data should be pulled from persistent storage.
For instance, let’s say your ActiveRecord adapter’s data accessor has an #includes= method, which tells it to pass the arguments into ActiveRecord’s :include option when performing the query. You can access that functionality like so:
Note that even if Post and Comment use the same adapter class (i.e., an ActiveRecord adapter), Sunspot will use a separate adapter instance for each, so you can safely set different options for each.
Easily configure your Solr installation with sunspot-configure-solr
While using the packaged Solr installation is great for development, I don’t recommend using it in production. The new release includes a new executable called sunspot-configure-solr, which writes a schema.xml file to the Solr installation of your choice, backing up the old schema.xml if it exists. sunspot-configure-solr includes a few options for areas where you can safely customize your schema:
- --tokenizer: the tokenizer class. The default is solr.StandardTokenizerFactory.
- --extra-filters: filters to apply in addition to solr.StandardFilterFactory and solr.LowerCaseFilterFactory.
- --dir: the directory containing the conf directory (it will be created if it doesn’t exist). The default is the working directory from which the command is issued.
The tokenizer and filter classes can be specified with a shorthand: if the name passed is unqualified (i.e., doesn’t have any periods), it will be prefixed with “solr.” and suffixed with “TokenizerFactory” or “FilterFactory” respectively:
This will set the tokenizer to com.myapp.MyTokenizerFactory and add the extra filter solr.EnglishPorterFilterFactory. Note that more advanced Solr users will want to work with the schema file directly; just don’t change the naming scheme for the dynamic typed fields.
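The shorthand rule for tokenizer and filter names described above can be sketched in a few lines of Ruby. The expand_class_name helper is hypothetical; it only restates the rule and is not the executable’s actual code.

```ruby
FACTORY_SUFFIXES = { tokenizer: 'TokenizerFactory', filter: 'FilterFactory' }.freeze

# Qualified names (containing a period) are used as given; unqualified names
# get the "solr." prefix and the suffix for their kind.
def expand_class_name(name, kind)
  return name if name.include?('.')
  "solr.#{name}#{FACTORY_SUFFIXES.fetch(kind)}"
end

puts expand_class_name('com.myapp.MyTokenizerFactory', :tokenizer)
# → com.myapp.MyTokenizerFactory
puts expand_class_name('EnglishPorter', :filter)
# → solr.EnglishPorterFilterFactory
```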
RSolr replaces solr-ruby
solr-ruby has been the de facto low-level Solr interaction layer for several years; RSolr is a newer library that has several advantages over solr-ruby:
- It’s more actively maintained.
- It passes queries directly to Solr without interpreting or modifying the parameters; this means that it implicitly supports any query parameters that are supported by Solr (or any Solr extensions that are installed).
- It gives you the choice between using Net::HTTP, which is slow, and curb, which is a Ruby interface to libcurl, and is fast. Sunspot uses Net::HTTP for HTTP interaction by default for maximum compatibility, but applications can easily switch to curb by setting Sunspot.config.http_client = :curb (do this before initiating any interaction with Solr).
Remove accidental ActiveSupport dependency
Sunspot 0.8 required WillPaginate into the spec suite by default, which in turn loaded ActiveSupport. Because of this, a few places in the code were inadvertently using ActiveSupport extensions, and the specs still passed even though they shouldn’t have. I modified the spec suite to only load WillPaginate if an environment variable is passed, and fixed the broken specs.
Toward the future
So, what’s next, you may wonder! Perhaps you have a few ideas of your own. Perhaps they are:
Highlighting
Solr supports keyword highlighting. This has never been a big priority for me, but I have heard from other Sunspot users that it would be a nice thing to have, so I’m hoping to get support for that in a future version.
LocalSolr support
LocalSolr is a Solr extension that brings geographical-based searching to Solr; in particular, results can be restricted and sorted by distance from a given lat/long. Do want.
Query facet abstraction
I’ve just begun giving this thought, but it seems pretty clear from the query faceting example above that certain common use cases for query facets could be abstracted into a more concise API. For instance, wouldn’t it be nice to write that example as:
Install it.
Be in touch.
My goal for Sunspot has always been for it to become the de facto Solr abstraction library for Rubyists. I’m always happy to get feature requests, bug reports, and especially patches.
- If you notice some missing functionality in Sunspot or have a sweet idea for a new feature, please shoot a message to the Sunspot mailing list.
- Found a bug? Submit a ticket on Lighthouse.
- Either of the above, and have a patch? Shoot me a pull request on GitHub.