
Setting up Sesat's Query Evaluation with a Solr index

Preface

  Sesat comes with query evaluation that occurs during the query parsing stage. Its purpose is to evaluate metadata against each clause (leaf and operator) within the Query tree as it is being constructed. There are several types of evaluation.

  Current implementations match clauses against regular expressions (RegExpTokenEvaluator), evaluate mathematical expressions (JepTokenEvaluator), and check for hits against a FAST Query Matching server (VeryFastTokenEvaluator) or a Solr server (SolrTokenEvaluator).

  Some examples of metadata used at sesam.no are first_name, last_name, full_name, company_name, english_words, geographical_province, geographical_city, geographical_suburb, and geographical_street. All of these metadata lists are stored in a Solr index as some of them are very large, e.g. the full_name list contains the national register of the Norwegian population - roughly 5 million names.

  After the Query tree has been constructed and the metadata associated with each clause, this information can be used to decide efficiently which searches to execute and ultimately federate. For example, there is no need for us to initiate a search against our 'white pages' (or people catalogue) if neither full_name nor a first_name + last_name combination exists as metadata within the query. Another example: our 'yellow pages' (or company catalogue) searches can be enhanced when we know which clauses within the query are geographical terms.
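
To make that concrete, here is a minimal sketch of how such a decision might look inside a search command. Only containsKnownPredicate follows the usage shown later on this page; the helper method itself and the exact FULLNAME/FIRSTNAME/LASTNAME constant names are illustrative assumptions.

// Illustrative sketch only: decide whether the 'white pages' search is worth running.
// containsKnownPredicate(..) is the Sesat call shown later in this document; the
// FULLNAME/FIRSTNAME/LASTNAME constants and this helper method are assumptions.
boolean shouldSearchWhitePages(final Clause root) {

    // A full name alone is enough ...
    if (root.containsKnownPredicate(Categories.FULLNAME)) {
        return true;
    }
    // ... otherwise require both a first name and a last name somewhere in the query.
    return root.containsKnownPredicate(Categories.FIRSTNAME)
            && root.containsKnownPredicate(Categories.LASTNAME);
}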

  It should be clear why we at sesam.no see this query evaluation as a crucial part of a federating search engine.

Introduction

Our query evaluation against large data lists used to be done via the VeryFastTokenEvaluator, which works against a FAST Query Matching server. In a desire to move away from a proprietary, closed solution that left us at the mercy of FAST consultants, and towards an open solution that we fully owned and were free to share, we decided to re-implement all of this functionality with a Solr index. What follows is our installation and setup of a Solr index to work successfully with the SolrTokenEvaluator, using the english_words metadata as the example.

Configuration

In Sesat's base skin "generic.sesam" you'll find war/src/main/conf/SolrEvaluators.xml containing:

<solr-evaluators>
    <list token="ENGLISHWORDS" list-name="common_english"/>
</solr-evaluators>

This declares the metadata (also referred to as a Token or TokenPredicate) ENGLISHWORDS, and connects it to hits in the Solr index where list_name is common_english.
This is all that is required to set up evaluation for the english_words metadata.
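
At runtime the SolrTokenEvaluator asks this Solr index whether parts of the user's query occur in the common_english list. As a hand-written illustration only (not the actual evaluator code), a single-word membership check could look like this with SolrJ; the list_entry field comes from the schema later on this page, the server URL from the configuration below, and the example word is arbitrary:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public final class EnglishWordLookup {
    public static void main(final String[] args) throws Exception {

        // Server URL as used by the development profile (see the configuration below).
        final SolrServer server = new CommonsHttpSolrServer("http://localhost:16000/solr");

        // Is "butterfly" an entry in the common_english list?
        // (Assumes the list entries are fed in lower case.)
        final SolrQuery query = new SolrQuery("list_name:common_english AND list_entry:butterfly");

        final long hits = server.query(query).getResults().getNumFound();
        System.out.println(hits > 0 ? "known english word" : "not in the list");
    }
}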

We also need to configure where the Solr index can be found.
In the same skin you'll find war/src/main/conf/configuration.properties with the line:

tokenevaluator.solr.serverUrl=@tokenevaluator.solr.serverUrl@

The @tokenevaluator.solr.serverUrl@ is filtered from values defined in the skin's pom.xml according to the currently active profile.
For the development profile this reads:

<tokenevaluator.solr.serverUrl>http://localhost:16000/solr</tokenevaluator.solr.serverUrl>

So you can either run Solr locally on port 16000 or create an ssh tunnel from your port 16000 to another Solr server.
Note: the values for the other profiles point to the host "sch-solr-test01.dev.osl.basefarm.net". This is our own Solr server and so naturally won't work for you - you'll need to override this setting in your skin's pom.xml for those profiles.
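
Whichever way you point Sesat at a Solr instance, a quick way to confirm the configured URL is reachable is a SolrJ ping. This is only a convenience check, assuming the development URL above and that the Solr home's /admin/ping handler is enabled (it is in the Solr example configuration):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public final class PingSolr {
    public static void main(final String[] args) throws Exception {
        // The URL must match tokenevaluator.solr.serverUrl for the profile you are running.
        new CommonsHttpSolrServer("http://localhost:16000/solr").ping();
        System.out.println("Solr is up");
    }
}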

Along with this configuration it is now presumed that you have a Sesat skin up and running. If you don't, read Tutorial - Building Sesam.com for help on how to do so.

Solr Installation

You'll need a recent version of Solr, 1.4 or later, so that the two patches found in https://issues.apache.org/jira/browse/LUCENE-1380 and https://issues.apache.org/jira/browse/SOLR-763 are included.

Deploy the solr.war to your container. Before starting it you'll need to configure your Solr home.
It is fine to use the example Solr home found in example/solr, but replace its schema.xml with:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99" enablePositions="false" />
      </analyzer>
    </fieldType>
    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField" />
  </types>

  <fields>
    <field name="id" type="string" stored="true" required="true" />
    <field name="list_name" type="string" indexed="true" stored="true"/>
    <field name="list_entry" type="string" indexed="true" stored="true"/>
    <field name="list_entry_shingle" type="shingleString" indexed="true" stored="true"/>
    <field name="list_entry_synonym" type="string" indexed="true" stored="true"/>
    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
  </fields>

  <uniqueKey>id</uniqueKey>
  <defaultSearchField>list_entry_shingle</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>
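
The important part of this schema is the shingleString type: list entries are indexed whole (KeywordTokenizerFactory), while at query time the incoming terms are shingled into word n-grams so that multi-word list entries can match anywhere inside a longer query. As a rough illustration of what the query-time analysis produces, here is a sketch against plain Lucene 2.9 (as bundled with Solr 1.4), without the extra attributes added by the two patches:

import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class ShingleDemo {
    public static void main(final String[] args) throws Exception {

        // Shingle the query terms much as the schema's query analyzer does.
        final ShingleFilter shingles = new ShingleFilter(
                new WhitespaceTokenizer(new StringReader("the quick brown fox")), 99);

        final TermAttribute term = shingles.addAttribute(TermAttribute.class);

        // Prints "the", "the quick", "the quick brown", ..., "quick", "quick brown", ...
        // so a multi-word list entry indexed as a single token can match a sub-phrase.
        while (shingles.incrementToken()) {
            System.out.println(term.term());
        }
    }
}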

Now start the container, then unzip and feed in the Solr XML document add_english_words.xml.gz, which contains all the English words.
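
If you prefer to feed entries programmatically, or want to add your own lists, the following is a minimal SolrJ sketch of what a single list entry looks like against the schema above. The id scheme and the example word are illustrative assumptions; the actual english_words data comes from add_english_words.xml.gz as described above.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public final class FeedListEntry {
    public static void main(final String[] args) throws Exception {

        // Assumes Solr is reachable on the development profile's URL.
        final SolrServer server = new CommonsHttpSolrServer("http://localhost:16000/solr");

        // One entry in the common_english list. The field names match the schema above;
        // the id value and the example word are illustrative only.
        final SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "common_english_hello");
        doc.addField("list_name", "common_english");
        doc.addField("list_entry", "hello");
        doc.addField("list_entry_shingle", "hello");

        server.add(doc);
        server.commit();
    }
}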

Coding with the metadata: TokenPredicates

Everything should now work.

When a query is parsed, WordClauses within the Query that match one of the English words will contain the TokenPredicate ENGLISHWORDS in the clause's knownPredicates list. For example:

true == clause.containsKnownPredicate(Categories.ENGLISHWORDS)

Known and Possible Predicates?

What is a possible predicate?

Sometimes metadata is position dependent; that is, the position of the term within the query has the final say over whether the metadata really applies. This can be used with the regular expression evaluators, and in addition every TokenPredicate has an "exactPeer" which is only ever true if the whole query matches the metadata. We cannot assign such metadata definitively to the root clause of any query because clauses are immutable and used in a fly-weight pattern across multiple Query trees, so it can only be recorded as a possible predicate and must be confirmed against the query as a whole.
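
Very roughly, and with the caveat that apart from containsKnownPredicate the accessor below is an assumed name rather than a verified Sesat signature, the distinction looks like this:

// Sketch only: apart from containsKnownPredicate(..), the accessor here is an assumed name.
//
// A known predicate is settled during parsing and can be read straight off the clause:
final boolean definitelyEnglish = clause.containsKnownPredicate(Categories.ENGLISHWORDS);

// A possible predicate only says "this clause could carry the metadata"; because the clause
// is an immutable fly-weight shared between Query trees, whether it really applies must be
// confirmed against the particular query (its position, or the exactPeer for the whole query).
final boolean maybeFullName = clause.getPossiblePredicates().contains(Categories.FULLNAME);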
