• Awesome Rails search with Solr and Sunspot

    Mat Brown

    Pivotal Labs Tech Talks

    16 March 2010

  • The Road to Victory

    • Get up and running with Sunspot::Rails — it's easy!
    • How Solr works
    • Exploring Sunspot and Solr's search features
    • Sunspot in production
  • Let's get started!

  • Install Sunspot

    Install the gem(s):

    # gem install sunspot-rails

    Generate the config file:

    $ script/generate sunspot
  • Start Solr

    $ rake sunspot:solr:start

    This creates some files and directories in your Rails root:

    solr/conf/solrconfig.xml
    solr/conf/schema.xml
    solr/data/

    You can edit the XML files to customize Solr's behavior.

  • Index your data

    Make your model searchable:

    class Post < ActiveRecord::Base
      searchable do
        text :title, :body
      end
    end

    Add your data to Solr:

    $ rake sunspot:reindex

    You only need to do that once.

  • Add search to your app

    Create a controller action:

    class PostsController < ApplicationController
      def search
        @search = Post.search do
          keywords(params[:q])
        end
      end
    end
  • Add search to your app

    Output your results:

    .results
      - @search.results.each do |post|
        .result
          %h1= h(post.title)
          %p= h(truncate(post.body))
    
    .pagination= will_paginate(@search.results)
  • That was easy.

  • So what is this Solr thing?

  • Solr is a standalone HTTP server that provides a document-oriented, inverted index of fulltext and scalar data.

  • This is an inverted index.

  • Why is Solr awesome?

    • Data is indexed by your application, how you want, when you want.
    • Standalone web service provides multiple good paths to scaling.
    • Wildly popular and maintained by the Apache Software Foundation.
  • What kind of data can Solr index?

    • Fulltext
    • Scalar types

      • String
      • Integer
      • Float
      • Time
      • Boolean
    • Trie Fields
  • Attribute Fields in Sunspot

    class Post < ActiveRecord::Base
      belongs_to :blog
      has_and_belongs_to_many :categories
    
      searchable do
        integer :blog_id
        integer :category_ids, :multiple => true
        time :published_at, :trie => true
      end
    end
  • Attribute Field Scoping

    Match a value exactly:

    with(:blog_id, 1)
  • Attribute Field Scoping

    Match by inequality:

    with(:published_at).less_than(Time.now)
  • Attribute Field Scoping

    Match by multiple values:

    with(:category_ids).any_of([1, 3, 5])
  • Attribute Field Scoping

    Match with a range:

    with(:published_at).between(
      Time.parse('2010-01-01')..Time.parse('2010-02-01')
    )
  • Attribute Field Scoping

    Combine restrictions with connectives:

    any_of do
      with(:expired_at).greater_than(Time.now)
      with(:expired_at, nil)
    end
  • Attribute Field Scoping

    Exclude specific instances from results:

    without(current_post)
  • Drilling Down

  • Drill-down search: The Problem

    • The user has performed a keyword search.
    • We want to allow them to drill down by category.
    • However, we only want to show categories which will return results given the keywords they've entered.
  • Facets: The Solution

    Post.search do
      keywords params[:q]
      with :category_ids, params[:category_id] if category_id
      facet :category_ids
    end
  • Facets: The Solution

    - @search.facet(:category_ids).rows.each do |row|
      - category_id, count = row.value, row.count
      - category = Category.find(category_id)
      - params_with_facet = params.merge(:category_id => category_id)
      .facet
        = link_to(category.name, params_with_facet)
        == (#{count})
      
  • But that's not very efficient!?

  • Instantiated Facets

    class Post < ActiveRecord::Base
      has_and_belongs_to_many :categories
      searchable do
        integer :category_ids,
                :multiple => true,
                :references => Category
      end
    end

    Now Sunspot knows that :category_id is a reference to Category objects.

  • Instantiated Facets

    - @search.facet(:category_ids).rows.each do |row|
      - category, count = row.instance, row.count
      - params_with_facet = params.merge(:category_id => category.id)
      .facet
        = link_to(category.name, params_with_facet)
        == (#{count})
        

    The first time you call row.instance on any row, Sunspot will eager-load all of the Category objects referenced by the facet rows.

  • But what if I've already selected a category? Can't I have more?

    • The user has already selected a category.
    • Most posts only have one or two categories.
    • So, the only categories returned by the facet are the ones that are cross-assigned to posts in the selected category.
    • We'd really like to be able to select more than one category.
  • But what if I've already selected a category? Can't I have more?

    • The user has already selected a category.
    • Most posts only have one or two categories.
    • So, the only categories returned by the facet are the ones that are cross-assigned to posts in the selected category.
    • We'd really like to be able to select more than one category.
    • No problem.
  • Multiselect Faceting

    • For the purposes of computing the category facet, we want to ignore the fact that a category has already been selected.
    • But we want to take into account all of the other selections the user has made.
    • Enter Multiselect Faceting, a new feature in Solr 1.4.
  • Multiselect Faceting

    Post.search do
      keywords params[:q]
      if params[:category_ids]
        category_filter =
          with :category_ids, params[:category_ids]
      end
      facet :category_ids, :exclude => category_filter
    end
  • Tuning Fulltext Relevance

    Returning the most relevant results for a keyword search is crucial. Here's what we'd like to do:

    • Keyword matches in the title field are more important than matches in the body field.
    • If the exact search phrase is present in the title, consider that highly relevant.
    • Posts published in the last 2 weeks are considered more relevant.
  • Tuning Fulltext Relevance: Field Boost

    Post.search do
      keywords params[:q] do
        boost_fields :title => 2.0
      end
    end

    Keyword matches in the title field are twice as relevant as keyword matches in the body field.

  • Tuning Fulltext Relevance: Phrase Fields

    Post.search do
      keywords params[:q] do
        phrase_fields :title => 5.0
      end
    end

    If the search phrase is found exactly in a post's title field, that post is 5 times more important than it would be otherwise.

  • Tuning Fulltext Relevance: Boost Queries

    Post.search do
      keywords params[:q] do
        boost(2.0) do
          with(:published_at).greater_than(2.weeks.ago)
        end
      end
    end

    Posts published in the last two weeks are twice as relevant as older posts.

  • Solr in Production

  • Running Solr in Production

    • Don't:

      • use Sunspot's embedded Solr instance.
      • use package-managed Tomcat/Jetty/Solr packages.
    • Do:

      • set up and maintain your own Solr instance.
      • give Solr its own machine.
      • use this tutorial: http://wiki.apache.org/solr/SolrTomcat
  • Running Solr in Production

    When you first install Solr:

    $ sunspot-installer -fv /path/to/my/solr/instance/home

    When you upgrade Sunspot:

    $ sunspot-installer -v /path/to/my/solr/instance/home
  • Commit Frequency

    What happens when you index or delete a document:

    • Solr stages your changes in memory.
    • It's fast and inexpensive.
    • But the changes aren't yet reflected in search results.
  • Commit Frequency

    What happens when you commit the index:

    • Solr writes all of the changes since the last commit to disk.
    • Solr's active Searcher instance is deprecated, and will not service new search requests.
    • Solr instantiates a new Searcher, which reads the updated index from disk into memory.
    • Then it auto-warms your caches.
    • Then it's ready to respond to search requests.
    • It's slow and expensive.
  • How not to commit too much

    • By default, Sunspot::Rails commits at the end of every request that updates the Solr index. Turn that off.
    • Use Solr's autoCommit functionality. That's configured in solr/conf/solrconfig.xml
    • Be glad for assumed inconsistency. Don't use search where results need to be up-to-the-second.
  • Scaling Solr

    • Operating System Resources
    • Caching
    • Replication
    • Sharding
  • Scaling Solr: Operating System Resources

    • Solr's memory needs will grow in proportion to your index size.
    • Make sure Solr's heap size is sufficient to hold your index.
  • Scaling Solr: Caching

    • Filter Cache: Cache the set of documents matching a particular filter. Each filter is cached independently for reuse in subsequent searches.
    • Query Result Cache: Cache the results of a particular query, in order.
    • Document Cache: Cache stored fields for a given document.

    When a new Searcher is instantiated (after a commit), the caches are autowarmed, meaning that they are pre-populated with data from the new index. This reduces cache misses but means that starting up a searcher takes longer. You can configure how much autowarming you want.

  • Scaling Solr: Replication

    • Standard master/slave architecture.
    • Scales with the frequency of search traffic.
    • All master/slave communication done over HTTP.
    • Slave instance(s) poll master at a configured interval.
    • If the master index has been committed since the last poll, slaves receive the changes.
    • Sunspot supports a master/slave configuration using the MasterSlaveSessionProxy.
    • If you're using more than one slave, put a load balancer in front of them.
  • Scaling Solr: Sharding

    • Index is divided according to natural criteria (data type, geography, etc).
    • Scales with the size of your index.
    • Writes go to a single shard instance based on those criteria.
    • Searches go to an single instance, which then aggregates the results from all the shards.
    • Sunspot gives you a starting point for this using the ShardingSessionProxy, which you subclass to implement the business logic for determining shards.
  • That is all.

  • Questions?

  • More Info

    • Sunspot Home Page: http://outoftime.github.com/sunspot
    • Sunspot Wiki: http://wiki.github.com/outoftime/sunspot
    • Sunspot API Docs: http://outoftime.github.com/sunspot/docs
    • Sunspot::Rails API Docs: http://outoftime.github.com/sunspot/rails/docs
    • Solr Wiki: http://wiki.apache.org/solr
    • Sunspot IRC Channel: #sunspot-ruby @ Freenode
    • Sunspot mailing list: ruby-sunspot@googlegroups.com