Lessons learned while building Solr Search for over 5 billion documents

As an analytics solution, one of the big SMS providers in Saudi Arabia gathers vast amounts of data. To capitalize effectively on the data they collect, they provide real-time search and analytics capabilities. To achieve this, I chose Apache Solr as the core of the search functionality.

It took much trial and error to achieve acceptable performance. In the process, I learned a lot, from which Amazon EC2 instance type works best for us to which simple query adjustments can improve query time considerably. We want to share our experience, and we hope it will help you with your Solr-based project.

The target audience of this post is software developers who are already comfortable with the basics of Solr.

The data

The nature and characteristics of the data are important. In our case, the data possesses some characteristics that impact query performance and index size:

  • Each Solr document is pretty small.
  • Some fields have high cardinality (the number of distinct values the field holds; e.g., URLs and text element selectors can be infinitely diverse).

Infrastructure

With over 5 billion documents, the index has to be sharded across multiple servers and replicated for fault tolerance. A common open source product that fulfills this requirement is SolrCloud.

Lessons we learned

Lesson 1 — Capacity planning

Here are some metrics to consider as you experiment:

  1. Indexing effort: How do your machines handle incoming documents? If your machines’ CPU usage levels are constantly high, it might be due to indexing strain.
  2. Index-size-to-RAM ratio: A common recommendation is to have more RAM than the size of your Solr index. We found this unnecessary, at least with SSD storage. Though our index size is about 500GB per server and we have considerably less RAM than that, performance is acceptable. A good starting point is RAM equal to 25–50% of your index size, combined with SSD storage.
  3. Number of documents per server: The more documents a server holds, the more work it has to do while executing a query and aggregating data. In our experience, a good number of documents per server is about 400 million. For really large data sets, you’ll need to watch out for the hard limit of two billion documents per Solr core.

There are many metrics to look at. To collect them, we use Grafana and Prometheus.

Our current hardware

Tip: Before throwing hardware at your performance issue, try to figure out the root cause of your problem. There is a good chance you’ll find that your issue arises from incorrect Solr usage rather than insufficient hardware. In fact, some of the following lessons are exactly that: performance issues that were solved by changing code, not hardware.

Lesson 2 — Understand Solr query execution

Phase 1: Each clause of a query will hit the index of the queried fields.

Phase 2: All the results of phase 1 will be merged according to the boolean logic the query defines.

Let’s see what Solr would do with the following query:

q=(username:Bill AND country:KSA) OR (username:Steve AND country:UAE)

Phase 1: Perform four index hits — get all docs with username:Bill (let’s call this set S1); all docs with country:KSA (set S2); all docs with username:Steve (set S3); and all docs with country:UAE (set S4).

Phase 2: Perform the following set operation: (S1 intersection S2) union (S3 intersection S4).

The result of this computation is the query result set.

Filter queries

The query execution procedure described above holds for filter queries as well. For a compound request with a q and several fq parameters, Solr will perform the search for each part independently and merge the resulting doc sets.
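For example, a request like the following (with hypothetical field names) is resolved as three independent searches whose doc sets are then intersected:

q=message:delivery
fq=country:KSA
fq=status:sent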

Filter cache

The relevant config XML, found in solrconfig.xml, is as follows:

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>

As you can see, the size of this cache is limited. If your system has a lot of unique filter queries, the effectiveness of the cache will decrease. In addition to increasing the size configuration, you should consider managing the uniqueness of your filter queries. For example, let’s think of a filter query that searches for all the visits the user Bill has made from a given country:

fq=username:Bill AND country:KSA

A similar query might be performed for the username Steve:

fq=username:Steve AND country:UAE

And so on for many other usernames.

Since we use a single filter query to filter two different fields, we increase the number of unique filter queries that will be performed. An alternative to this is to split the query into two filter queries:

fq=username:Bill
fq=country:KSA

This will cache all the visits by the user Bill and all the visits from KSA separately. These caches will be independently usable in any search that includes either of these filter queries.

Insights gained

  1. The query will take at least as long as the most costly filter query. It does not matter that one of the filter queries narrows down the result set considerably.
  2. An uncached filter query will always run against the entire index. (If a filter isn’t worth caching, you can say so explicitly; see the sketch after this list.)
  3. The order of the filter queries does not matter.
  4. The structure of the queries has a dramatic effect on caching, and therefore query runtime.
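One practical corollary of insights 2 and 4: when you know a filter is one-off and not worth caching, Solr lets you exclude it from the filter cache with a local parameter. A minimal sketch (hypothetical field and value):

fq={!cache=false}session_id:abc123

This keeps rare, unique filters from churning cache entries that your frequently repeated filters could reuse.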

Lesson 3 — Wildcard queries can be evil

Lucene’s underlying data structure is an inverted index. An inverted index manages a list of terms, and the IDs of the documents these terms appear in.

Inverted index

Lucene also supports wildcard queries against its inverted index. This is possible because Lucene keeps the indexed terms not in a flat list, but in a trie structure.

Trie inverted index

Let’s look at some example queries: q=username:St*, q=username:*ve, and q=username:*tev*.

These queries differ in the placement of the wildcard selector — 1) end, 2) start, and 3) both sides. Non-intuitively, their respective performance levels can be quite different; some will perform well, while others will perform poorly.

Wildcard query execution

For the first example, q=username:St*, Lucene simply walks the trie along the prefix ‘St’ and collects every term in that subtree, which is cheap. For the second example, q=username:*ve, execution is quite different. The trie structure is not helpful at all. Lucene has to traverse the entire trie and find all paths that end with ‘ve.’ This operation can take serious time, especially for fields with high cardinality.

Wildcard query performance improvement — Attempt #1:

A common technique is to index a reversed copy of each term (for example, with Solr’s ReversedWildcardFilterFactory), turning a leading wildcard into an efficient trailing one. This solves the leading wildcard issue, but we are still powerless to handle the username:*tev* example.
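For reference, here is a minimal sketch of such a field type, assuming whitespace tokenization fits your data; Solr’s query parser automatically rewrites leading-wildcard queries against the reversed terms:

<fieldType name="text_rev" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Index each token as-is and reversed, so *ve becomes an efficient ev* prefix lookup -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Note that withOriginal="true" indexes both forms, so trailing wildcards keep working, at the cost of a larger index.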

Wildcard query performance improvement — Attempt #2:

You need to ask yourself: ‘Do I really need to support an arbitrary substring search, or might a token search suffice?’ As Solr comes with extensive tokenizing capabilities, you might be able to tokenize your data in a way that suits your needs.

Let’s look at an example of helpful tokenizing:

Scenario: We want to implement a URL search. Product dictates that we need to support a sub-URL search.

Approach 1: You might be tempted to store URLs as simple strings and use a double-sided wildcard search to implement the feature, like so:

fq=url:*settings*

Approach 2: A better approach might be tokenizing the URL into its sections:

<fieldType class="solr.TextField" name="url">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" group="-1" pattern="[\s/:?\#&amp;.=]+"/>
  </analyzer>
</fieldType>

Solution: Using the analyzer above, you tokenize your URL data into its fragments, allowing you to support sub-URL search without wildcard operators. The query fq=url:settings will give you the same results as before, but without using a wildcard operator!
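To illustrate with a hypothetical URL: the pattern tokenizer splits on whitespace, slashes, colons, and the other listed separators, so indexing

https://example.com/account/settings?tab=privacy

produces the tokens https, example, com, account, settings, tab, and privacy, and fq=url:settings matches the document directly.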

Lesson 4 — Join performance

Let’s look at a join example:

Scenario: We want to query all users from Iceland that visited the settings page.

Solution: We’ll query the users collection and join against the visits collection.

fq=country:Iceland
fq={!join fromIndex=visits from=user_id to=user_id}url:settings

Solr will first execute a query over the visits collection with the query url:settings, then gather the user_id values of the resulting documents, and finally perform an in query with these IDs on the users collection.

Note that the result size of the inner query is very important to the performance of the overall query. A large inner result makes for a heavy in operation. In our example, the result of the url:settings query will include all visits globally, rendering this a large inner result.

Improved solution: We can improve the performance of this query easily: add country information for each visit. With this additional field, we can change our query to:

fq={!join fromIndex=visits from=user_id to=user_id}url:settings AND country:Iceland

Now, the inner query will return only a few documents, and the in query will be much lighter, making our entire query blazing fast.

Doc values on join fields

<field name="example" ... docValues="true"/>

Some Solr features require docValues to be set to true, but the Join query parser can function without it. It turns out that Join can utilize docValues to achieve much better performance.

If you set docValues=true on all from and to join fields, Solr will use the doc values and execute the join even faster.
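A minimal schema sketch (hypothetical field name; apply it to the join field in both collections):

<!-- Enable doc values on the join field so the join can read column-oriented data -->
<field name="user_id" type="string" indexed="true" stored="true" docValues="true"/>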

Lesson 5 — Facet performance

5a: Know your data — Cardinality

As it turns out, the cardinality of a field is an important factor in a faceted query over that field. Fields with high cardinality are harder for Solr to facet on.

5b: Different performance for different field types

We mapped textual values into a long field and performed facet operations over it. The performance gain was very significant: Queries that originally took 30–40 seconds took less than one second afterward.
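A minimal sketch of the idea, assuming your indexing pipeline maintains the string-to-number dictionary and your schema defines the default plong type: store the numeric code in a docValues-enabled long field and facet on it instead of the textual one.

<!-- Hypothetical field: holds a dictionary code such as 42 instead of the string it encodes -->
<field name="browser_code" type="plong" indexed="true" stored="true" docValues="true"/>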

5c: Use the json.facet API, not old facets
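The two APIs express the same request differently. Here is a sketch of faceting on a hypothetical country field, first with the legacy parameters and then with the JSON Facet API:

facet=true&facet.field=country&facet.limit=10

json.facet={
  countries: {
    type: terms,
    field: country,
    limit: 10
  }
}

Beyond generally performing better, json.facet also supports nested sub-facets and aggregation functions that the legacy API handles awkwardly at best.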

Summary

Most of our wins came from understanding how Solr executes queries and shaping our data and queries accordingly: plan capacity with real metrics, keep filter queries simple and cacheable, avoid leading wildcards through tokenizing or reversed indexing, keep join inner results small and backed by docValues, and facet on low-cardinality numeric fields through the json.facet API. Reach for bigger hardware only after you have ruled out incorrect Solr usage.
