You have a huge number of documents, and it would be nice if you could search them almost as well as Google does. Lucene (http://lucene.apache.org/) has been helping organisations search their data for years. Projects like Elasticsearch (http://www.elasticsearch.org/) build on top of Lucene to provide distributed, scalable solutions for searching huge volumes of data. A good example is the use of Elasticsearch at WordPress (http://gibrown.wordpress.com/2014/01/09/scaling-elasticsearch-part-1-overview/).

In this experiment, you start with three nodes on OpenStack (h-mstr, h-slv1 and h-slv2), as in the previous article. Download the RPM package from the Elasticsearch site and install it on each of the nodes. The configuration file is /etc/elasticsearch/elasticsearch.yml, and you will need to edit it on each of the three nodes. Consider the following settings on the h-mstr node:
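As a sketch, the fragment below renders settings consistent with the description that follows. The key names are assumed to be those of the elasticsearch.yml shipped with Elasticsearch 1.x releases, and the shard count of 10 is inferred from the 4 + 3 + 3 shards reported later rather than taken from the original file.

```python
# Hypothetical reconstruction of the h-mstr settings. The shard count is
# an assumption (4 + 3 + 3 shards are reported once the data is loaded).
H_MSTR_SETTINGS = [
    ("cluster.name", "es"),             # same value on all three nodes
    ("node.master", "true"),            # eligible to act as master
    ("node.data", "true"),              # also stores shards
    ("index.number_of_shards", "10"),   # assumed value, see above
    ("index.number_of_replicas", "0"),  # no replicas
]


def render_yaml(settings):
    """Render (key, value) pairs as flat 'key: value' YAML lines."""
    return "\n".join("%s: %s" % (key, value) for key, value in settings)


print(render_yaml(H_MSTR_SETTINGS))
```

The printed lines are what would go into /etc/elasticsearch/elasticsearch.yml on h-mstr under these assumptions.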
You have given the name es to the cluster. The same value should be used on the h-slv1 and h-slv2 nodes. This node is eligible to act as master and stores data as well. The master node manages cluster-wide state, such as creating indices and allocating shards to nodes; a search request is distributed to the data nodes, and the results consolidated, by whichever node receives it.

The next two parameters relate to the index. The number of shards is the number of sub-indices that are created and distributed among the data nodes; the default is five. The number of replicas is the number of additional copies of the index that are kept; the default is one, and here it is set to zero, so there are no replicas. You may use the same values on the h-slv1 and h-slv2 nodes, or set node.master to false on them.

Once you have loaded the data, you will find that the h-mstr node has four shards and that h-slv1 and h-slv2 have three shards each. The indices are stored in the directory /var/lib/elasticsearch/es/nodes/0/indices/ on each node.

Start Elasticsearch on each node by executing:

$ sudo systemctl start elasticsearch

You can check the status of the cluster by browsing http://h-mstr:9200/_cluster/health?pretty.

Loading the Data

You want to index the documents located on your desktop. Elasticsearch has a Python client, which is available in the Fedora 20 repository. So, on your desktop, install it:

$ sudo yum install python-elasticsearch

The following is a sample program to index LibreOffice documents. The comments embedded in the code should make it clear that this is not a complex task.
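A minimal sketch of such an indexing program follows, under stated assumptions. It treats each LibreOffice file as a zip archive and pulls the text out of content.xml with the standard library. The index name documents, the type name odf and the helper names are hypothetical. The document stores the directory as 'path' and the file name as 'title', matching the fields the search program joins later. The call es.index(index=..., doc_type=..., body=...) follows the older python-elasticsearch client; recent clients no longer take doc_type.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace for text elements such as <text:p> and <text:h>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ODF_EXTENSIONS = (".odt", ".ods", ".odp")


def odf_to_doc(filename):
    """Build the document to index: directory, file name and plain text."""
    with zipfile.ZipFile(filename) as odf:
        # An ODF file is a zip archive; the body lives in content.xml.
        root = ET.fromstring(odf.read("content.xml"))
    chunks = []
    for elem in root.iter():
        # Collect headings and paragraphs, flattening nested spans.
        if elem.tag in ("{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS):
            chunks.append("".join(elem.itertext()))
    path, title = os.path.split(filename)
    return {"path": path, "title": title, "text": "\n".join(chunks)}


def index_tree(es, top, index="documents", doc_type="odf"):
    """Index every ODF file under 'top'; return the number indexed."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.lower().endswith(ODF_EXTENSIONS):
                continue
            try:
                doc = odf_to_doc(os.path.join(dirpath, name))
            except (zipfile.BadZipFile, KeyError, ET.ParseError):
                continue  # skip damaged or non-ODF files
            # Older client signature (assumption); newer clients differ.
            es.index(index=index, doc_type=doc_type, body=doc)
            count += 1
    return count
```

With the older client, es could be created as Elasticsearch([{"host": "h-mstr", "port": 9200}]) from the elasticsearch package, and index_tree(es, os.path.expanduser("~/Documents")) would then walk a hypothetical documents directory.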
Once the index is created, you cannot increase the number of shards. However, you can change the replication value as follows:
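One way to make this change is a PUT to the index's _settings endpoint. The sketch below only builds the request, using the standard library; the index name documents is a hypothetical placeholder, and the host and port are taken from the cluster health URL used earlier.

```python
import json
import urllib.request


def replica_request(replicas, index="documents", host="h-mstr", port=9200):
    """Build a PUT request that updates number_of_replicas on an index."""
    url = "http://%s:%d/%s/_settings" % (host, port, index)
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = replica_request(1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply the change on a running cluster:
# urllib.request.urlopen(req)
```

Building the request separately from sending it lets you inspect the URL and body before touching the cluster.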
Searching the Data

The program search_documents.py below uses the query_string option of Elasticsearch to search the content field 'text' for the string passed as a parameter. It requests the fields 'path' and 'title' in the response, which are joined to print the full filenames of the documents found.
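A minimal sketch of such a search program, under stated assumptions: the index name documents is hypothetical, es.search(index=..., body=...) follows the older python-elasticsearch client, and the returned fields are limited with _source filtering, which is a swapped-in choice that may differ from the original listing.

```python
import os
import sys


def search_documents(es, query, index="documents", size=10):
    """Return full file names of documents whose 'text' matches query."""
    body = {
        "query": {
            "query_string": {"default_field": "text", "query": query}
        },
        # Return only these two fields from each stored document.
        "_source": ["path", "title"],
        "size": size,
    }
    result = es.search(index=index, body=body)
    return [
        os.path.join(hit["_source"]["path"], hit["_source"]["title"])
        for hit in result["hits"]["hits"]
    ]


def main(argv):
    # Imported here so search_documents() is usable without the package.
    from elasticsearch import Elasticsearch
    es = Elasticsearch([{"host": "h-mstr", "port": 9200}])
    for filename in search_documents(es, " ".join(argv[1:])):
        print(filename)

# Script entry point would be: main(sys.argv)
```

Passing the client in as an argument keeps the query-building logic easy to exercise with a stand-in object before pointing it at a real cluster.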
You can now search using expressions like the following:
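The list below gives a few illustrative expressions in the Lucene query syntax that query_string accepts. The search terms are made up for illustration and are not taken from the original examples.

```python
# Illustrative query_string expressions; the search terms are made up.
EXAMPLES = [
    ("python", "documents containing the term python"),
    ("python AND django", "both terms must be present"),
    ("python OR ruby", "either term may be present"),
    ("python NOT perl", "python present, perl absent"),
    ('"search engine"', "the exact phrase"),
    ("elast*", "any term starting with elast"),
]

for expression, meaning in EXAMPLES:
    print("%-22s %s" % (expression, meaning))
```

Each expression would be passed to the search program as a single quoted argument, so that the shell does not split it or expand the wildcard.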
More details can be found at the Lucene and Elasticsearch sites. Open source options let you build a custom, scalable search engine, and you can conveniently include information from your databases, documents, emails and so on. Hence, it is a shame to come across sites that do not offer an easy way to search their content; one hopes that website managers will add that functionality using tools like Elasticsearch!