Awesome Rails search with Solr and Sunspot

Mat Brown

Pivotal Labs Tech Talks

16 March 2010
The Road to Victory
- Get up and running with Sunspot::Rails — it's easy!
- How Solr works
- Exploring Sunspot and Solr's search features
- Sunspot in production
Let's get started!
Install Sunspot

Install the gem(s):
```
# gem install sunspot-rails
```
Generate the config file:
```
$ script/generate sunspot
```
Start Solr
```
$ rake sunspot:solr:start
```
This creates some files and directories in your Rails root:
```
solr/conf/solrconfig.xml
solr/conf/schema.xml
solr/data/
```
You can edit the XML files to customize Solr's behavior.

Index your data

Make your model searchable:

class Post < ActiveRecord::Base
  searchable do
    text :title, :body
  end
end

Add your data to Solr:

$ rake sunspot:reindex

You only need to do that once.

Add search to your app

Create a controller action:

class PostsController < ApplicationController
  def search
    @search = Post.search do
      keywords(params[:q])
    end
  end
end

Add search to your app

Output your results:

.results
  - @search.results.each do |post|
    .result
      %h1= h(post.title)
      %p= h(truncate(post.body))

.pagination= will_paginate(@search.results)

That was easy.
So what is this Solr thing?
Solr is a standalone HTTP server that provides a document-oriented, inverted index of fulltext and scalar data.
This is an inverted index.
Why is Solr awesome?
- Data is indexed by your application, how you want, when you want.
- Standalone web service provides multiple good paths to scaling.
- Wildly popular and maintained by the Apache Software Foundation.
What kind of data can Solr index?
- Fulltext
- Scalar types
  - String
  - Integer
  - Float
  - Time
  - Boolean
- Trie Fields

Attribute Fields in Sunspot

class Post < ActiveRecord::Base
  belongs_to :blog
  has_and_belongs_to_many :categories

  searchable do
    integer :blog_id
    integer :category_ids, :multiple => true
    time :published_at, :trie => true
  end
end

Attribute Field Scoping

Match a value exactly:
```
with(:blog_id, 1)
```
Attribute Field Scoping

Match by inequality:
```
with(:published_at).less_than(Time.now)
```
Attribute Field Scoping

Match by multiple values:
```
with(:category_ids).any_of([1, 3, 5])
```

Attribute Field Scoping

Match with a range:

with(:published_at).between(
  Time.parse('2010-01-01')..Time.parse('2010-02-01')
)

Attribute Field Scoping

Combine restrictions with connectives:

any_of do
  with(:expired_at).greater_than(Time.now)
  with(:expired_at, nil)
end

Attribute Field Scoping

Exclude specific instances from results:
```
without(current_post)
```
Drilling Down
Drill-down search: The Problem
- The user has performed a keyword search.
- We want to allow them to drill down by category.
- However, we only want to show categories which will return results given the keywords they've entered.

Facets: The Solution

Post.search do
  keywords params[:q]
  with :category_ids, params[:category_id] if category_id
  facet :category_ids
end

Facets: The Solution

- @search.facet(:category_ids).rows.each do |row|
  - category_id, count = row.value, row.count
  - category = Category.find(category_id)
  - params_with_facet = params.merge(:category_id => category_id)
  .facet
    = link_to(category.name, params_with_facet)
    == (#{count})


  But that's not very efficient!?
  
    Instantiated Facets
    class Post < ActiveRecord::Base
  has_and_belongs_to_many :categories
  searchable do
    integer :category_ids,
            :multiple => true,
            :references => Category
  end
end
    Now Sunspot knows that :category_id is a reference to Category objects.
  
  
    Instantiated Facets
    - @search.facet(:category_ids).rows.each do |row|
  - category, count = row.instance, row.count
  - params_with_facet = params.merge(:category_id => category.id)
  .facet
    = link_to(category.name, params_with_facet)
    == (#{count})
    The first time you call row.instance on any row, Sunspot
    will eager-load all of the Category objects referenced by the
    facet rows.
  

  
    But what if I've already selected a category? Can't I have more?
    
      The user has already selected a category.
      Most posts only have one or two categories.
      So, the only categories returned by the facet are the ones that are cross-assigned to posts in the selected category.
      We'd really like to be able to select more than one category.
    
  
  
    But what if I've already selected a category? Can't I have more?
    
      The user has already selected a category.
      Most posts only have one or two categories.
      So, the only categories returned by the facet are the ones that are cross-assigned to posts in the selected category.
      We'd really like to be able to select more than one category.
      No problem.
    
  
  
    Multiselect Faceting
    
      For the purposes of computing the category facet, we want to ignore the fact that a category has already been selected.
      But we want to take into account all of the other selections the user has made.
      Enter Multiselect Faceting, a new feature in Solr 1.4.
    
  
  
    Multiselect Faceting
    Post.search do
  keywords params[:q]
  if params[:category_ids]
    category_filter =
      with :category_ids, params[:category_ids]
  end
  facet :category_ids, :exclude => category_filter
end
  
  
    Tuning Fulltext Relevance
    Returning the most relevant results for a keyword search is crucial. Here's what we'd like to do:
    
      Keyword matches in the title field are more important than matches in the body field.
      If the exact search phrase is present in the title, consider that highly relevant.
      Posts published in the last 2 weeks are considered more relevant.
    
  
  
    Tuning Fulltext Relevance: Field Boost
    Post.search do
  keywords params[:q] do
    boost_fields :title => 2.0
  end
end
    Keyword matches in the title field are twice as relevant as keyword matches in the body field.
  
  
    Tuning Fulltext Relevance: Phrase Fields
    Post.search do
  keywords params[:q] do
    phrase_fields :title => 5.0
  end
end
    If the search phrase is found exactly in a post's title field, that post is 5 times more important than it would be otherwise.
  
  
    Tuning Fulltext Relevance: Boost Queries
    Post.search do
  keywords params[:q] do
    boost(2.0) do
      with(:published_at).greater_than(2.weeks.ago)
    end
  end
end
    Posts published in the last two weeks are twice as relevant as older posts.
  
  
    Solr in Production
  
  
    Running Solr in Production
    
      
        Don't:
        
          use Sunspot's embedded Solr instance.
          use package-managed Tomcat/Jetty/Solr packages.
        
      
      
        Do:
        
          set up and maintain your own Solr instance.
          give Solr its own machine.
          use this tutorial: http://wiki.apache.org/solr/SolrTomcat
        
      
    
  
  
    Running Solr in Production
    When you first install Solr:
    $ sunspot-installer -fv /path/to/my/solr/instance/home
    When you upgrade Sunspot:
    $ sunspot-installer -v /path/to/my/solr/instance/home
  
  
    Commit Frequency
    What happens when you index or delete a document:
    
      Solr stages your changes in memory.
      It's fast and inexpensive.
      But the changes aren't yet reflected in search results.
    
  
  
    Commit Frequency
    What happens when you commit the index:
    
      Solr writes all of the changes since the last commit to disk.
      Solr's active Searcher instance is deprecated, and will not service
      new search requests.
      Solr instantiates a new Searcher, which reads the updated index from
      disk into memory.
      Then it auto-warms your caches.
      Then it's ready to respond to search requests.
      It's slow and expensive.
    
  
  
    How not to commit too much
    
      By default, Sunspot::Rails commits at the end of every request that
      updates the Solr index. Turn that off.
      Use Solr's autoCommit functionality. That's configured in
      solr/conf/solrconfig.xml
      Be glad for assumed inconsistency. Don't use search where results need
      to be up-to-the-second.
    
  
  
    Scaling Solr
    
      Operating System Resources
      Caching
      Replication
      Sharding
    
  
  
    Scaling Solr: Operating System Resources
    
      Solr's memory needs will grow in proportion to your index size.
      Make sure Solr's heap size is sufficient to hold your index.
    
  
  
    Scaling Solr: Caching
    
      Filter Cache: Cache the set of documents matching a particular
      filter. Each filter is cached independently for reuse in subsequent
      searches.
      Query Result Cache: Cache the results of a particular query, in
      order.
      Document Cache: Cache stored fields for a given document.
    
    When a new Searcher is instantiated (after a commit), the caches are
    autowarmed, meaning that they are pre-populated with data from the
    new index. This reduces cache misses but means that starting up a searcher
    takes longer. You can configure how much autowarming you want.
  
  
    Scaling Solr: Replication
    
      Standard master/slave architecture.
      Scales with the frequency of search traffic.
      All master/slave communication done over HTTP.
      Slave instance(s) poll master at a configured interval.
      If the master index has been committed since the last poll, slaves
      receive the changes.
      Sunspot supports a master/slave configuration using the
      MasterSlaveSessionProxy.
      If you're using more than one slave, put a load balancer in front of
      them.
    
  
  
    Scaling Solr: Sharding
    
      Index is divided according to natural criteria (data type, geography,
      etc).
      Scales with the size of your index.
      Writes go to a single shard instance based on those criteria.
      Searches go to an single instance, which then aggregates the results
      from all the shards.
      Sunspot gives you a starting point for this using the
      ShardingSessionProxy, which you subclass to implement the
      business logic for determining shards.
    
  
  That is all.
  Questions?
  
    More Info
    
      Sunspot Home Page: http://outoftime.github.com/sunspot
      Sunspot Wiki: http://wiki.github.com/outoftime/sunspot
      Sunspot API Docs: http://outoftime.github.com/sunspot/docs
      Sunspot::Rails API Docs: http://outoftime.github.com/sunspot/rails/docs
      Solr Wiki: http://wiki.apache.org/solr
      Sunspot IRC Channel: #sunspot-ruby @ Freenode
      Sunspot mailing list: ruby-sunspot@googlegroups.com

Awesome Rails search with Solr and Sunspot

Mat Brown

Pivotal Labs Tech Talks

16 March 2010

The Road to Victory

Let's get started!

Install Sunspot

Start Solr

Index your data

Add search to your app

Add search to your app

That was easy.

So what is this Solr thing?

This is an inverted index.

Why is Solr awesome?

What kind of data can Solr index?

Attribute Fields in Sunspot

Attribute Field Scoping

Attribute Field Scoping

Attribute Field Scoping

Attribute Field Scoping

Attribute Field Scoping

Attribute Field Scoping

Drilling Down

Drill-down search: The Problem

Facets: The Solution

Facets: The Solution

But that's not very efficient!?

Instantiated Facets

Instantiated Facets

But what if I've already selected a category? Can't I have more?

But what if I've already selected a category? Can't I have more?

Multiselect Faceting

Multiselect Faceting

Tuning Fulltext Relevance

Tuning Fulltext Relevance: Field Boost

Tuning Fulltext Relevance: Phrase Fields

Tuning Fulltext Relevance: Boost Queries

Solr in Production

Running Solr in Production

Running Solr in Production

Commit Frequency

Commit Frequency

How not to commit too much

Scaling Solr

Scaling Solr: Operating System Resources

Scaling Solr: Caching

Scaling Solr: Replication

Scaling Solr: Sharding

That is all.

Questions?

More Info