Apache Solr Index Benchmarking

I recently started experimenting with Apache Solr and am currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am mainly interested in the throughput (documents indexed per second) and the index size on disk.

I am doing all this on Ubuntu.

Benchmarking Technique

* Run the following 5 times and take the average of the total time taken *

  • Index documents [ curl http://localhost:8983/solr/core/dataimport?command=full-import ]
  • Get the value of the 'Time taken' entry from the XML response once the status is 'idle' [ curl http://localhost:8983/solr/core/dataimport ]
  • Get size of 'data/index' directory
  • Delete Index [ curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8' ]
  • Commit [ curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8' ]
  • Re-index documents
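
The steps above could be scripted roughly as follows. This is only a minimal sketch under assumptions: the helper names (`http_get`, `post_xml`, `run_once`, `average_time`) are made up for illustration, the index-size step is left out, and it assumes the status response contains a 'Time taken' entry once the import is idle.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://localhost:8983/solr/core"  # assumed core URL, as in the curl commands


def http_get(url):
    """Return the body of a GET request as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def post_xml(url, body):
    """POST an XML update message, like curl --data with a text/xml header."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def import_status(xml_text):
    """Return (status, 'Time taken' or None) from a dataimport status response."""
    root = ET.fromstring(xml_text)
    fields = {el.get("name"): el.text for el in root.iter("str")}
    return fields.get("status"), fields.get("Time taken")


def to_seconds(time_taken):
    """Convert an 'h:m:s.ms' string such as '0:0:10.233' to seconds."""
    hours, minutes, seconds = time_taken.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def run_once(get=http_get, post=post_xml, poll_interval=1.0):
    """One pass: full-import, wait until idle, then delete everything and commit."""
    get(BASE + "/dataimport?command=full-import")
    while True:
        status, taken = import_status(get(BASE + "/dataimport"))
        if status == "idle" and taken is not None:
            break
        time.sleep(poll_interval)
    post(BASE + "/update", "<delete><query>*:*</query></delete>")
    post(BASE + "/update", "<commit/>")
    return to_seconds(taken)


def average_time(runs=5, **kwargs):
    """Average the import time in seconds over several passes."""
    times = [run_once(**kwargs) for _ in range(runs)]
    return sum(times) / len(times)
```

Calling `average_time()` against a running Solr instance would give the average import time in seconds. The `get` and `post` parameters exist only so the loop can be tried out without a server.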

Questions

  1. I intend to calculate my throughput by dividing the number of documents indexed by the average total time taken; is this fine?
  2. Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
  3. Is my approach fine?
  4. Is there an easier way of getting the index size than performing a 'du' on the data/index/ directory?
  5. Where can I find information on how to interpret the XML response attributes (see sample output below)? For instance, I would like to know the difference between the QTime and Time taken values.

* XML Response Used to Get Throughput *

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
      </lst>
      <lst name="initArgs">
        <lst name="defaults">
          <str name="config">w5-data-config.xml</str>
        </lst>
      </lst>
      <str name="status">idle</str>
      <str name="importResponse"/>
      <lst name="statusMessages">
        <str name="Total Requests made to DataSource">0</str>
        <str name="Total Rows Fetched">3200</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2012-12-11 14:06:19</str>
        <str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
        <str name="Total Documents Processed">1600</str>
        <str name="Time taken">0:0:10.233</str>
      </lst>
      <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
    </response>
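
For what it is worth, the throughput calculation described in the questions above could be done directly on such a response. The following is a minimal sketch, not a standard tool: the function name `throughput` is hypothetical, and `SAMPLE` is a trimmed copy of the response above containing only the fields the sketch uses.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the status response shown above.
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Documents Processed">1600</str>
    <str name="Time taken">0:0:10.233</str>
  </lst>
</response>"""


def throughput(xml_text):
    """Documents per second, from 'Total Documents Processed' and 'Time taken'."""
    root = ET.fromstring(xml_text)
    messages = {el.get("name"): el.text for el in root.iter("str")}
    # 'Time taken' is formatted as hours:minutes:seconds.milliseconds
    hours, minutes, seconds = messages["Time taken"].split(":")
    elapsed = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return int(messages["Total Documents Processed"]) / elapsed


print(round(throughput(SAMPLE), 1))  # roughly 156 documents/second for the sample
```

Averaging this value over the 5 runs, or dividing the document count by the averaged time, should give similar figures as long as the individual run times do not vary much.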
    

    To question 1:

    I would suggest indexing more than one XML file (with different datasets) and comparing the results. That way you will know whether it is OK to simply divide the number of documents by the time taken.

    To question 2:

    I did not find any such tools; I did it on my own by developing a short Java application.

    To question 3:

    Which approach do you mean? I would refer to my answer to question 1.

    To question 4:

    The size of the index folder gives you the correct size of the whole index. Why don't you want to use it?
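
If calling 'du' by hand is inconvenient, the folder size can also be computed from a script. A minimal sketch, assuming the index lives in a directory such as data/index; the function name `index_size_bytes` is made up for illustration:

```python
import os


def index_size_bytes(index_dir):
    """Sum the sizes of all files below index_dir, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(index_dir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Note that this adds up the file lengths, so the result may differ slightly from plain 'du', which reports the disk blocks actually used.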

    To question 5:

    The results you get in the posted XML are transformed through an XSL file. You can find it in the /bin/solr/conf/xslt folder. There you can look up what exactly the terms mean, and you can write your own XSL to display the results and information. Note: if you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don't want to make any changes, edit the existing file.

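    edit: QTime and Time taken measure different things. QTime (in the responseHeader) is the time in milliseconds that Solr spent processing the HTTP request itself, here the status request, which is why it is 0. Time taken is the duration of the import run, in hours:minutes:seconds format.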

    Best regards
