Lucene, Sphinx, Postgresql, MySQL?

I'm building a Django site and I am looking for a search engine.

A few candidates:

  • Lucene/Lucene with Compass/Solr

  • Sphinx

  • Postgresql built-in full text search

  • MySQl built-in full text search

  • Selection criteria:

  • result relevance and ranking
  • searching and indexing speed
  • ease of use and ease of integration with Django
  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
  • scalability
  • extra features such as "did you mean?", related searches, etc
  • Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

    EDIT: As for indexing needs, as users keep entering data into the site, those data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in index with no more than 15 - 30 minutes delay


    Good to see someone's chimed in about Lucene - because I've no idea about that.

    Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

  • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  • Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
  • I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
  • The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
  • Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
  • There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
  • Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.
  • I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

    I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.


    I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.

    Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.

    As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.

    It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.


    I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).

    The answer at the link below details a few things about Sphinx which also applies to Solr. Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

    Solr also provides the following additional features:

  • Supports replication
  • Multiple cores (think of these as separate databases with their own configuration and own indexes)
  • Boolean searches
  • Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
  • Update index via XML or delimited file
  • Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
  • PDF, Word document indexing
  • Dynamic fields
  • Facets
  • Aggregate fields
  • Stop words, synonyms, etc.
  • More Like this...
  • Index directly from the database with custom queries
  • Auto-suggest
  • Cache Autowarming
  • Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
  • Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
  • Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)
  • BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.

    Also, another benefit is that Solr (well, Lucene actually) is a document database (eg NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (ie Performance). Get creative with it :)

    链接地址: http://www.djcxy.com/p/38722.html

    上一篇: Django:获取模型字段列表?

    下一篇: Lucene,Sphinx,Postgresql,MySQL?