Apache Solr with Zookeeper Ensemble

Vishal Khare
5 min read · Oct 14, 2018

Apache Solr is a powerful open source search engine. It is built on Apache Lucene and ranks results using relevance scoring based on TF-IDF (Term Frequency-Inverse Document Frequency). It has proved extremely useful in solving some of the most complex problems my team at AmEx has faced. Solr is most effective when run in cloud mode, also called SolrCloud: a distributed architecture focused on horizontal scaling, where multiple nodes each run a Solr instance and coordinate with each other through ZooKeeper. Apache Solr does ship with a built-in ZooKeeper; however, it is not advisable to use it in production. Instead, an external ZooKeeper ensemble (cluster) should be used in a production environment, since a dedicated ensemble can be scaled and replicated independently of Solr.

To create a ZooKeeper ensemble, a minimum of 3 nodes is required. It is advisable to always use an odd number of nodes (3, 5, 7, …) in a ZooKeeper ensemble. Why, you ask? Read below.

For a ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines.

Let us now see how to actually implement an Apache Solr cloud using an external ZooKeeper ensemble.

Let us assume we have 3 separate nodes, the minimum number required to create a ZooKeeper ensemble. This can be scaled to any number of nodes depending on the horizontal scaling required for the use case. ZooKeeper and Solr instances can also reside on different nodes, and in my opinion that is how it should be: if you think about it, keeping them separate makes the cluster more robust and fault tolerant.

First of all, let’s download ZooKeeper and Apache Solr on each of the 3 machines, whose IPs we assume are 10.17.153.112, 10.17.152.145 and 10.17.153.247, using the following steps for macOS or Linux. Windows users can directly download and unzip the packages.

Commands to download and extract Solr
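The embedded commands are not reproduced here; a typical sequence (the release number and archive URL are assumptions — substitute the version you are standardising on) might be:

```shell
# Download and extract Solr into /opt/solr
# (7.5.0 is an assumed version, current at the time of writing)
cd /opt
wget https://archive.apache.org/dist/lucene/solr/7.5.0/solr-7.5.0.tgz
tar -xzf solr-7.5.0.tgz
mv solr-7.5.0 solr
```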
Commands to download and extract ZooKeeper
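Similarly for ZooKeeper (again, 3.4.13 is an assumed version):

```shell
# Download and extract ZooKeeper into /opt/zookeeper
cd /opt
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
tar -xzf zookeeper-3.4.13.tar.gz
mv zookeeper-3.4.13 zookeeper
```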

Now that we have Solr and ZooKeeper in place, let’s start by setting up ZooKeeper.

Setting up Zookeeper Ensemble

Log in to 10.17.153.112 and do the following:

1. Make a new directory for ZooKeeper data and navigate into it using the following command

mkdir /opt/zookeeperdata && cd /opt/zookeeperdata

2. Create a new file at this location called myid, which should contain the serial number of the server. This indicates that server 10.17.153.112 is the 1st server of the ZooKeeper ensemble.

echo "1" > myid

3. Rename zoo_sample.cfg to zoo.cfg in the conf folder of ZooKeeper

mv /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg

4. Comment out the dataDir and clientPort lines in the zoo.cfg file by prepending them with #
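If you prefer to do this from the shell rather than in an editor, a one-liner such as the following would work (paths assume the layout used above):

```shell
# Prefix the stock dataDir and clientPort lines with '#'
# so that the values we add in the next step take effect
sed -i -e 's/^dataDir=/#dataDir=/' -e 's/^clientPort=/#clientPort=/' /opt/zookeeper/conf/zoo.cfg
```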

5. Add the following lines in zoo.cfg

zoo.cfg
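The embedded snippet is not reproduced here, but from the steps in this post it would look roughly like the following (2888 and 3888 are ZooKeeper’s conventional quorum and leader-election ports):

```
dataDir=/opt/zookeeperdata
clientPort=2181
server.1=10.17.153.112:2888:3888
server.2=10.17.152.145:2888:3888
server.3=10.17.153.247:2888:3888
```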

In server.x, the value of x should exactly match the value in the myid file we created in step 2.

clientPort=2181 means that ZooKeeper on this node will listen on port 2181.

6. Now let us spin up the ZooKeeper instance by invoking the zkServer.sh script in ZooKeeper’s bin folder

/opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg

7. Repeat steps 1 to 6 on every server you wish to install ZooKeeper on, except that in step 2 you should echo that server’s own number: "2" on server 10.17.152.145 and "3" on server 10.17.153.247. In this case all three servers will run ZooKeeper as well as Solr.

8. Ensure that there are no major errors in the log file using

cat /opt/zookeeper/bin/zookeeper.out
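Once all three nodes are up, each node’s role in the quorum can also be checked directly; one node should report itself as leader and the other two as followers:

```shell
# Ask the local ZooKeeper instance for its quorum role
/opt/zookeeper/bin/zkServer.sh status /opt/zookeeper/conf/zoo.cfg
```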

Now that the ZooKeeper ensemble is set up, let’s spin up the Solr instances as well.

Setting up Apache Solr

Log in to 10.17.153.112 and do the following:

1. Navigate to the bin folder of Apache Solr
cd /opt/solr/bin

2. Start the Solr server in cloud mode, pointing it to each of the ZooKeeper instances we spun up above.

./solr start -c -p 8983 -z 10.17.153.112:2181,10.17.152.145:2181,10.17.153.247:2181 -force

3. Repeat steps 1 and 2 on the other two servers.
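On each node you can confirm that Solr came up and registered with the ensemble:

```shell
# Print the status of local Solr instances, including the
# ZooKeeper ensemble they are connected to in cloud mode
/opt/solr/bin/solr status
```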

Creating collections

You can now create collections and apply sharding and a replication factor to each newly created collection.

Sharding — Breaking data into parts is called sharding. Each part of data is called a shard.

Replication factor — The number of copies kept of each shard. This ensures availability of the shards (data) even when some nodes are down.

Use the following command to create a collection named TEST with 3 shards and a replication factor of 3.

./solr create_collection -c TEST -s 3 -rf 3 -force

Validating cloud status

You can hit any of the three IPs to access the SolrCloud. For example, http://10.17.153.112:8983/solr will open the Solr UI in the browser. Navigate to the Cloud option on the left-hand side and you will see the TEST collection with its shards and replicas as follows-

Cloud depiction of a collection

This confirms that everything is up and running. Apache Solr cloud has been implemented using an external Zookeeper ensemble.
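The same state can be inspected without the UI via the Collections API (a sketch; assumes the cluster above is running):

```shell
# Fetch cluster state — shards, replicas and live nodes — for the TEST collection
curl "http://10.17.153.112:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=TEST"
```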


Vishal Khare

Engineering manager at TATA 1mg || Google Cloud Certified Cloud Engineer