Lessons learned while building Solr Search for over 5 billion documents

Ramzi Alqrainy
10 min read · Mar 19, 2021

One of the big SMS providers in Saudi Arabia gathers vast amounts of data, and its analytics offering needs to capitalize on that data by providing real-time search and analytics capabilities. To achieve this, I chose Apache Solr as the core search engine.

It took a lot of trial and error to achieve acceptable performance. In the process, I learned a lot, from which Amazon EC2 instance type works best for us to which simple query adjustments can improve query time considerably. We want to share our experience, and we hope it will help you with your Solr-based project.

The target audience of this post is software developers who are already comfortable with the basics of Solr.

The data

The data we collect largely consists of message interactions: the message text and the number of parts each message is split into. The provider currently holds around 5 billion such messages, a number that is growing rapidly and is expected to accelerate given the SMS provider's exponential growth.

The nature and characteristics of the data are important. In our case, the data possesses some characteristics that impact query performance and index size:

  • Each Solr document is pretty small.
  • Some fields have high cardinality (a measure of the number of distinct values in a set); for example, URLs and free-text message content can be infinitely diverse.

Infrastructure

Hosting a large Solr index requires a sharding solution; we need to be able to split the data between several servers, and have the flexibility to add more servers easily.

A common open-source solution that fulfills this requirement is SolrCloud, Solr's built-in distributed mode.

Lessons we learned

Lesson 1 — Capacity planning

The first question that should always come to mind when starting a data cluster is: ‘What hardware should I use?’ This is known as the sizing question, or capacity planning. There is, of course, no magic formula; you will have to find your ideal hardware by estimation and monitoring, AKA trial and error.

Here are some metrics to consider as you experiment:

  1. Indexing effort: How do your machines handle incoming documents? If your machines’ CPU usage levels are constantly high, it might be due to indexing strain.
  2. Index-size-to-RAM ratio: A common recommendation is to have more RAM than the size of your Solr index. We found this unnecessary, at least if you have SSD storage. Though our index size is about 500GB per server and we have considerably less RAM than that, the performance is acceptable. A good starting point is RAM sized at 25–50% of your index size, combined with SSD storage.
  3. Number of documents per server: The more documents a server holds, the more work it has to do while executing a query and aggregating data. In our experience, a good number of documents per server is about 400 million. For really large data sets, you'll also need to watch out for the hard limit of roughly two billion documents per Solr core.
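As a rough back-of-the-envelope illustration of how these numbers interact: if you aim for about 400 million documents per server, an index of 5 billion documents calls for on the order of 13 shards (5,000,000,000 / 400,000,000 ≈ 12.5), and whatever layout you choose, each individual core must stay below the two-billion-document hard limit.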

There are many metrics to look at. To collect them, we use Grafana and Prometheus.

Our current hardware

We currently run 6 Solr nodes hosted on Amazon EC2. The instance type we chose is i3.4xlarge, each with 16 vCPUs, 122GB RAM, and 3,800GB of NVMe SSD storage. We are actually considering downsizing to i3.2xlarge, as some experiments show it might hold up.

Tip: Before throwing hardware at your performance issue, try to figure out the root cause of your problem. There is a good chance you’ll find that your issue arises from incorrect Solr usage rather than insufficient hardware. In fact, some of the following lessons are exactly that: performance issues that were solved by changing code, not hardware.

Lesson 2 — Understand Solr query execution

Understanding how Solr executes a query can reveal a lot about performance. Solr (due to its reliance on Lucene) has a somewhat unique way of performing a query: most SQL databases begin query execution by hitting the index of a certain field and then scanning the results one item at a time to determine whether each item fulfills the entire query. In Solr, there is no table scan phase. We can summarize a Solr query plan as follows:

Phase 1: Each clause of a query will hit the index of the queried fields.

Phase 2: All the results of phase 1 will be merged according to the boolean logic the query defines.

Let’s see what Solr would do with the following query:

q=(username: Bill AND country: KSA) OR (username: Steve AND country: UAE)

Phase 1: Perform four index hits — get all docs with username: Bill (let's call this set S1); all the docs with country: KSA (set S2); all the docs with username: Steve (set S3); and all the docs with country: UAE (set S4).

Phase 2: Perform the following set operation: (S1 intersection S2) union (S3 intersection S4).

The result of this computation is the query result set.

Filter queries

Solr expands on this concept with the introduction of filter queries; in addition to the q parameter, we can pass an array of filter queries: fq. The final query result is the intersection of the q result set and fq result sets.

The query execution procedure described above is true for filter queries as well. For a compound query with a q and several fq's, Solr will perform the search for each part independently and merge the resulting doc sets.
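For illustration, here is what a compound request might look like (message_text and sent_date are hypothetical field names used only for this sketch; country appears in the examples below):

q=message_text:promo
fq=country: KSA
fq=sent_date:[NOW-7DAYS TO NOW]

Solr computes the doc set for the q clause and for each fq independently, then intersects the three sets to produce the final result.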

Filter cache

One of the benefits of filter queries is the filter cache: the resulting doc set of each filter query is cached by Solr. This cache is a key-value store that maps the filter query string to its resulting doc set.

The relevant config XML is as follows:

<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>

As you can see, the size of this cache is limited. If your system generates a lot of unique filter queries, the effectiveness of the cache will decrease. In addition to increasing the size configuration, you should consider managing the uniqueness of your filter queries. For example, let's consider a filter query that searches for all the documents belonging to the user Bill in the country KSA:

fq=username: Bill AND country: KSA

A similar query might be performed for the username Steve:

fq=username: Steve AND country: UAE

And so on for many other usernames.

Since we use a single filter query to filter two different fields, we increase the number of unique filter queries that will be performed. An alternative to this is to split the query into two filter queries:

fq=username: Bill
fq=country: KSA

This will cache the doc set for the user Bill and the doc set for the country KSA separately. These cached sets will be independently reusable in any search that includes either of these filter queries.
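To see the payoff, consider a later request that filters on the same user but a different country:

fq=username: Bill
fq=country: UAE

The doc set for username: Bill is already sitting in the filter cache from the earlier request, so only the country: UAE clause has to hit the index.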

Insights gained

Once we understood how Solr executes queries, we could extrapolate some insights:

  1. The query will take at least as long as the most costly filter query. It does not matter that one of the filter queries narrows down the result set considerably.
  2. An uncached filter query will always run against the entire index.
  3. The order of the filter queries does not matter.
  4. The structure of the queries has a dramatic effect on caching, and therefore query runtime.

Lesson 3 — Wildcard queries can be evil

The wildcard (‘*’) operator can be very helpful. The need to search for a subsection of a string field often arises in later stages of development, and the use of wildcard queries can meet this need easily. When we initially attempted to use the wildcard operator, however, we discovered that our performance had greatly deteriorated. To understand why, let’s dive into how wildcard search is implemented in Lucene (Solr’s indexing and searching engine).

Lucene’s underlying data structure is an inverted index. An inverted index manages a list of terms, and the IDs of the documents these terms appear in.

Inverted index

Lucene also supports wildcard queries against its inverted index. This is possible because Lucene keeps the indexed terms not in a flat list, but in a trie structure.

Trie inverted index

Let’s look at some example queries: q=username:St*, q=username:*ve, and q=username:*tev*.

These queries differ in the placement of the wildcard: 1) at the end, 2) at the start, and 3) on both sides. Counter-intuitively, their respective performance levels can be quite different; some will perform well, while others will perform poorly.

Wildcard query execution

Let's look at our first example: q=username:St*. Lucene's first phase in performing this query is to find all terms that start with 'St'. In order to do that, Lucene will traverse the trie along the path 'S' -> 't'; all the nodes under that 't' node fulfill the searched prefix. All Lucene needs to do now is collect all the referenced document IDs.

For the second example, q=username:*ve, execution is quite different. The trie structure is not helpful at all. Lucene will have to traverse the entire trie and find all paths that end with 've'. This operation can take serious time, especially for fields with high cardinality.

Wildcard query performance improvement — Attempt #1:

A cool method to handle the issue in the second example is to use the ReversedWildcardFilterFactory. This filter indexes a reversed version of each term alongside the original. During query execution, Solr knows to rewrite the query username:*ve as username:ev* against the reversed terms, thus reducing the problem to the prefix form the Lucene trie is good at.
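A minimal field type sketch using this filter might look like the following (the field type name and attribute values are illustrative, not tuned recommendations); the filter is applied in the index-time analyzer, and Solr detects its presence when it rewrites leading-wildcard queries:

<fieldType name="text_rev" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>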

This method solves the leading wildcard issue, but we are still powerless to handle the username:*tev* example.

Wildcard query performance improvement — Attempt #2:

This solution is less a way to improve wildcard performance, and more a reminder that wildcards might not be necessary at all; a slight compromise product-wise can allow you to avoid wildcard operators all together.

You need to ask yourself: ‘Do I really need to support an arbitrary substring search, or might a token search suffice?’ As Solr comes with extensive tokenizing capabilities, you might be able to tokenize your data in a way that suits your needs.

Let’s look at an example of helpful tokenizing:

Scenario: We want to implement a URL search. Product requirements dictate that we need to support sub-URL search.

Approach 1: You might be tempted to store URLs as simple strings and use a double-sided wildcard search to implement the feature, like so:

fq=url:*settings*

Approach 2: A better approach might be tokenizing the URL into its sections:

<fieldType class="solr.TextField" name="url">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" group="-1" pattern="[\s/:?\#&amp;.=]+"/>
</analyzer>
</fieldType>

Solution: Using the analyzer above, you tokenize your URL data into its fragments, allowing you to support sub-URL search without wildcard operators. The query fq=url:settings will give you the same results as before, but without using a wildcard operator!
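For instance, with this tokenizer a hypothetical URL such as https://example.com/account/settings?tab=privacy is broken into the tokens https, example, com, account, settings, tab, and privacy, so the plain token query above matches it directly.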

Lesson 4 — Join performance

The join query parser is an extremely powerful tool to query over data relations. It’s not exactly an SQL join; a better comparison would be an SQL inner select: SELECT * FROM table WHERE id in (SELECT id FROM other_table WHERE ...).

Let’s look at a join example:

Scenario: We want to query all users from Iceland that visited the settings page.

Solution: We’ll query over the users collection and join into the visits collection.

fq=country: Iceland
fq={!join fromIndex=visits from=user_id to=user_id} url:settings

Solr will first execute a query over the visits collection with the query url:settings, then it will gather all the IDs of resulting documents, and finally it will perform an in query with these IDs on the users collection.

Note that the result-size of the inner query is very important to the performance of the query. A large inner result will result in a heavy in operation. In our example, the result of the url:settings query will include all visits globally, rendering this a large inner result.

Improved solution: We can improve the performance of this query easily: add country information for each visit. With this additional field, we can change our query to:

fq={!join fromIndex=visits from=user_id to=user_id} url:settings AND country: Iceland

Now, the inner query will return only a few documents, and the in query will be much lighter, making our entire query blazing fast.

Doc values on join fields

Solr's docValues is a data structure that is pre-computed at index time to speed up access to document field values. It is one of the attributes you can set when defining a field:

<field name="example" ... docValues="true"/>

Some Solr features require docValues to be set to true, but the Join query parser can function without it. It turns out that Join can utilize docValues to achieve much better performance.

If you use docValues=true on all from and to join fields, Solr will make use of the doc values and make the query even faster.
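As a sketch (the field and type names here are assumptions about your schema), the join field in both the users and visits collections would be declared along these lines:

<field name="user_id" type="plong" indexed="true" stored="true" docValues="true"/>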

Lesson 5 — Facet performance

In data analytics, you often need to aggregate data. Solr's Facets API is a very powerful tool for this purpose. We use this feature extensively to generate analytics reports of our customers' data. Using the Facets API, we've learned some important lessons about facet performance:

5a: Know your data — Cardinality

The cardinality of a single digit field is 10. The cardinality of a first name field is a few thousand. The cardinality of an ID field might be as large as your whole data set.

As it turns out, the cardinality of a field is an important factor in a faceted query over that field. Fields with high cardinality are harder for Solr to facet on.

5b: Different performance for different field types

A trick we found for faceting on a high-cardinality text field is this: facet over a numeric field instead. It might sound ridiculous, but it works!

We mapped textual values into a long field and performed facet operations over it. The performance gain was very significant: Queries that originally took 30–40 seconds took less than one second afterward.
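A sketch of what this looks like in the schema (field names are hypothetical; the text-to-number dictionary itself lives in our indexing pipeline, outside Solr):

<field name="username" type="string" indexed="true" stored="true"/>
<field name="username_id" type="plong" indexed="true" stored="false" docValues="true"/>

Facet queries then run against username_id, and the application layer translates the numeric IDs back to names when rendering the report.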

5c: Use the json.facet API, not old facets

Solr's Facets API has changed over the years; in Solr 5, the new json.facet API was introduced. The old facet API is pretty much obsolete; you'll find all you need in the newer, faster json.facet.
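As a quick illustration of the difference in style (reusing the country field from the earlier examples), a classic facet request:

facet=true&facet.field=country&facet.limit=10

becomes a single JSON parameter with the newer API:

json.facet={ countries: { type: terms, field: country, limit: 10 } }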

Summary

In our couple of years working with Solr, we’ve overcome plenty of performance challenges. Sometimes, we could not find any way around the challenges and had to scale our hardware up. In most cases, though, we managed to improve performance considerably by optimizing our use of Solr’s features.
