Text search in multilingual content in SQL Server

We have a site that supports several languages. It holds millions of rows, so we would like to implement search with SQL Server Full-Text Search.

Our current table structure is shown below.

CREATE TABLE Product
(
   ID INT IDENTITY(1,1),
   Code VARCHAR(50),
   ........
   ........
)

CREATE TABLE ProductLanguage
(
   ID INT,
   LanguageID INT,
   Name NVARCHAR(200),
   ........
   ........
)

We would like to implement full-text search on the "Name" column, so we have created a full-text index on it. But when creating a full-text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data for other languages such as Japanese, Chinese, French, etc.

So what is the best way to implement Full-Text Search in SQL Server for multilingual content?

Do we need to create a different table? If yes, what would the table structure be (keeping in mind that the languages are not fixed; new languages can be added later), and what would the search query look like?

We are using SQL Server 2008 R2.


Certain content (document) types support language settings, e.g. Microsoft Office documents, PDF, [X]HTML, or XML.

If you change the type of your Name column to XML, you can declare the language of each value (i.e. per row). For instance:

Instead of storing values as strings

name 1
name 2
name 3

...you could store them as XML documents with the appropriate language declarations:

<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-GB">name 3</content>

During full-text index population the correct word breaker/stemmer will be used, based on the language setting of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.

Of course, this would require a significant change in the way your data is managed and consumed.
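
A minimal sketch of this approach follows. The table ProductLanguageXml, its RowID key column, the catalog name ProductCatalog, and the sample product names are all hypothetical; the real table has further columns that are not shown in the question. The sketch assumes that the language given in CREATE FULLTEXT INDEX acts only as the default for values that carry no xml:lang attribute.

```sql
-- Hypothetical variant of ProductLanguage with the name stored as XML.
-- A full-text index needs a unique, single-column, non-nullable key,
-- so a surrogate RowID is added here.
CREATE TABLE ProductLanguageXml
(
   RowID INT IDENTITY(1,1) NOT NULL,
   ID INT NOT NULL,
   LanguageID INT NOT NULL,
   Name XML NOT NULL,
   CONSTRAINT PK_ProductLanguageXml PRIMARY KEY (RowID)
);

CREATE FULLTEXT CATALOG ProductCatalog;

-- LANGUAGE 0 (neutral) is assumed to be only the fallback for values
-- that have no xml:lang attribute.
CREATE FULLTEXT INDEX ON ProductLanguageXml (Name LANGUAGE 0)
   KEY INDEX PK_ProductLanguageXml
   ON ProductCatalog;

-- Each value declares its own language.
INSERT INTO ProductLanguageXml (ID, LanguageID, Name)
VALUES (1, 1, N'<content xml:lang="en-US">folding chair</content>'),
       (1, 2, N'<content xml:lang="fr-FR">chaise pliante</content>');

-- Query with the French word breaker (LCID 1036).
SELECT ID, LanguageID
FROM ProductLanguageXml
WHERE CONTAINS(Name, N'chaise', LANGUAGE 1036);
```

Index population runs asynchronously, so the query may return no rows until population has finished. The LANGUAGE clause in CONTAINS selects the word breaker used for the search term, so the application would pass the LCID that matches the user's language.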

ML


I'd be concerned about the performance of using XML instead of NVARCHAR(n), though I have no hard proof for it. One alternative could be to use dynamic SQL (generating the language-specific code on the fly), combined with language-specific indexed views on the Product table. The drawback of this is the lack of execution plan caching, i.e. performance again.
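
A rough sketch of what that alternative might look like, shown for French only. It uses ProductLanguage (where the Name column lives) and reuses the hypothetical catalog name ProductCatalog; the view name, index name, and the value LanguageID = 2 for French are made up for illustration. It also assumes that the table is in the dbo schema, that ID is declared NOT NULL, and that there is at most one row per product per language, since the full-text key must be unique and non-nullable.

```sql
-- Hypothetical indexed view holding only the French rows.
-- Assumes LanguageID = 2 means French and ID is NOT NULL.
CREATE VIEW dbo.ProductLanguage_French
WITH SCHEMABINDING
AS
SELECT ID, Name
FROM dbo.ProductLanguage
WHERE LanguageID = 2;
GO

-- The unique clustered index materializes the view and serves
-- as the full-text key.
CREATE UNIQUE CLUSTERED INDEX UX_ProductLanguage_French
   ON dbo.ProductLanguage_French (ID);
GO

-- Full-text index using the French word breaker (LCID 1036).
CREATE FULLTEXT INDEX ON dbo.ProductLanguage_French (Name LANGUAGE 1036)
   KEY INDEX UX_ProductLanguage_French
   ON ProductCatalog;
GO

-- Dynamic SQL: choose the view by name, pass the search term as a parameter.
DECLARE @viewName SYSNAME = N'ProductLanguage_French';
DECLARE @term NVARCHAR(200) = N'chaise';
DECLARE @sql NVARCHAR(MAX) =
   N'SELECT ID, Name FROM dbo.' + QUOTENAME(@viewName) +
   N' WHERE CONTAINS(Name, @term);';

EXEC sp_executesql @sql, N'@term NVARCHAR(200)', @term = @term;
```

Each newly supported language would need its own view and full-text index, which is the maintenance cost of this design. QUOTENAME guards the view name, and the search term is passed as a parameter rather than concatenated into the statement.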


Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.

  • When the indexed content is of binary type (such as a Microsoft Word document), the iFilter responsible for processing the text content before sending it to the word breaker might honor specific language tags in the binary file. When this is the case, at indexing time the iFilter invokes the correct word breaker for a specific document or section of a document specified in a particular language. All you need to do in this case is to verify after indexing that the multilanguage content was indexed correctly. Filters for Word, HTML, and XML documents honor language specification attributes in document content:
  • Word – language settings
  • HTML – <meta name="MS.locale"…>
  • XML – xml:lang attribute
  • When your content is plain text, you can convert it to the XML data type and add specific language tags to indicate the language corresponding to that specific document or document section. Note that for this to work, before you index you must know the language that will be used.
  • https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx
