Distributed Search with SOLR Multicore – Part 2

In the previous post we got the bare minimum SOLR install up and running with apache, tomcat, and SOLR with multicore/multiindex. So you should be able to navigate to http://localhost/ we should be seeing the manage your cores screen.

solr

All is well with this but we need to modify our default install a bit to more represent a real world multi-core setup.

Download SOLR Example

The schemas provided by the SOLR distribution are rather basic and do not include field types commonly found in production deployments. To simplify this I created a new template that you can use to get a jump-start. It includes;

  • Settings: A more robust schema.xml that includes field types such as text_ws, string, int, decimal, and ignore.
  • Performance: A solr core exists [see: distrib] that simply provides distributed search via a requestHandler that auto injects the shards parameter on any search request.
  • Performance: The solrconfg.xml of each has a more aggressive caching mechanism in place to provide better scalability as well as some further options to enhance performance out of the box (etag, cache control headers, etc).
  • Security: Disabled remote streaming.

You can download the example here.

How to issue multi-core/multi-index queries

To issue a multi-core query by hand we can use a url template like below:

view source
1.http://localhost/core1/select/?q=solr&shards=localhost/core01,localhost/core02

You will notice the use of the shards parameter. This allows us to query multiple cores (indexes) simultaneously. This proves invaluable as you horizontally scale your search as shards can be on different pcs all together. In our case, however, we are simply using it to query across indexes in a search on the same PC, but you don’t have to do it like this.

Distributed Search Scenarios

From the above we can explore two scenarios for deployment depending on the needs of our search architecture and current scale. Scenarios differ only in their physical deployment so moving from scenario 1 to scenario 2 is as painless as it can get. Note: The reasons for moving to a distributed platform usually revolve around bottlenecks at the IO level either IO is too slow, or size of index is too large (>1 million)

As the size of our indexes grow we may move a shard to a different server to get the benefit of higher IO. However, keep in mind, this will increase CPU on the core servers respectively. We might also consider still keeping 1-2 search servers in place and as IO becomes an issue mount new drives to take care of the load. I would strongly recommend SSD drives for any search server as 99% of the time you are concerned with how fast your “gets” are and not your “sets”.

figure1

Benefits

  • Easier to maintain
  • Allows migration to multi-server solutions
  • Allows higher up-time through multi-cores

Potential Drawbacks

  • Scalability at start is diminished

Alternate Considerations

  • Consider scenario 2 when you need massive scalability.
  • Consider a replicated index with multiple hosts.

figure2

Benefits

  • Massive horizontal scale

Potential Drawbacks

  • The increased scale can be hard to manage
  • Effective performance tuning is required
  • More testing to establish best configuration

Alternate Considerations

  • Reconsider scenario 1 if IO is the main bottleneck, instead of scaling horizontal with physical servers, mount different RAID configurations

Additional Resources

Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond

SOLR Performance Benchmark Single index vs Multiple (Multicore)

Multicore admin commands for SOLR

Add a Comment

Your email address will not be published. Required fields are marked *