full text search

I have a database of 80,000 rows, and while testing some FULLTEXT queries I encountered an unexpected result. I have removed the stopwords from MySQL and set the minimum word length to 3.

When I do this query:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY') AS `relevance`
FROM `sentences`
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY')
ORDER BY `relevance` DESC

It gives this result:

NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS | 10.623517036438
I CAN FLY                                         | 7.61278629302979
I CAN FLY :)                                      | 7.61278629302979
CAN YOU FLY?                                      | 7.61278629302979
THEY CAN FLY                                      | 7.61278629302979
YOU AM NOT FLY                                    | 7.61278629302979
CAN YOU FLY                                       | 7.61278629302979
HAVE YOU EVER SWALLOWED A FLY?                    | 7.52720737457275
I JUST WANNA FLY                                  | 7.52720737457275

Why does "NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS" get the highest relevance when it contains only one of the words? And why isn't "CAN YOU FLY" at the top, given that it's an exact match?

I would like the results ordered by the number of matched keywords first, then by how many of those keywords appear in the same order as in the query, then by fewest words. That would give the logical result:

CAN YOU FLY
CAN YOU FLY?
I CAN FLY
THEY CAN FLY
I CAN FLY :)
YOU AM NOT FLY
HAVE YOU EVER SWALLOWED A FLY?
I JUST WANNA FLY
NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS

The formula used for the calculation is given in the MySQL Internals Manual:

w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)

where

dtf     is the number of times the term appears in the document
sumdtf  is the sum of (log(dtf)+1)'s for all terms in the same document
U       is the number of Unique terms in the document
N       is the total number of documents
nf      is the number of documents that contain the term

The first text obviously has more content than the others, and the formula depends heavily on U, the number of unique terms in the document.
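
As a rough illustration of how the pieces of that formula interact, here is a minimal Python sketch. The function name term_weight and the sample numbers in the usage note are my own, and I am assuming that log means the natural logarithm. It computes only the per-term weight w exactly as written above, so it should not be expected to reproduce the relevance values shown in the question.

```python
import math


def term_weight(dtf, sumdtf, U, N, nf):
    """Per-term weight w, following the formula quoted above.

    dtf    -- times the term appears in the document
    sumdtf -- sum of (log(dtf)+1) over all terms in the document
    U      -- number of unique terms in the document
    N      -- total number of documents
    nf     -- number of documents that contain the term
    """
    if not 0 < nf < N:
        # log((N-nf)/nf) is undefined when nf is 0 or equal to N
        raise ValueError("nf must satisfy 0 < nf < N")
    local = (math.log(dtf) + 1) / sumdtf          # weight of the term inside the document
    normalization = U / (1 + 0.0115 * U)          # depends only on the unique-term count
    global_weight = math.log((N - nf) / nf)       # rarer terms get a larger value
    return local * normalization * global_weight
```

For example, term_weight(1, 3.0, 3, 80000, 100) describes a term that occurs once in a three-term document and is found in 100 of 80,000 rows. Note that the last factor turns negative once a term appears in more than half of the rows.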

Based on your comment, I would suggest using Boolean full-text search:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE) AS `relevance`
FROM `sentences`
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE)
ORDER BY `relevance` DESC
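
If the ordering from MySQL is still not the one described in the question, another option is to re-rank the returned rows in application code. The following is only a sketch under my own assumptions: the names tokenize, lcs_length and rerank are hypothetical, words are taken to be runs of letters and digits, and "in order" is measured as the longest common subsequence between the query words and the sentence words. Rows that tie on all three criteria keep their incoming order.

```python
import re


def tokenize(text):
    """Upper-case the text and return its runs of letters and digits."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rerank(query, sentences):
    """Sort by most matched keywords, then most keywords in query order,
    then fewest words. sorted() is stable, so ties keep their input order."""
    query_words = tokenize(query)

    def key(sentence):
        words = tokenize(sentence)
        matched = len(set(query_words) & set(words))
        in_order = lcs_length(query_words, words)
        return (-matched, -in_order, len(words))

    return sorted(sentences, key=key)
```

Calling rerank('CAN YOU FLY', rows) on the sentences returned by the query puts the two "CAN YOU FLY" variants first and the "WINGLESS" sentence last. Because punctuation is ignored here, "I CAN FLY :)" counts as three words, which differs slightly from the ordering listed in the question.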