use fuzzy matching in django queryset filter

Is there a way to use fuzzy matching in a django queryset filter?

I'm looking for something along the lines of Object.objects.filter(fuzzymatch(namevariable)__gt=.9)

or is there a way to use lambda functions, or something similar in django queries, and if so, how much would it affect performance time (given that I have a stable set of ~6000 objects in my database that I want to match to)

(realized I should probably put my comments in the question)

I need something stronger than contains, something along the lines of difflib. I'm basically trying to get around doing a Object.objects.all() and then a list comprehension with fuzzy matching.

(although I'm not necessarily sure that doing that would be much slower than trying to filter based on a function, so if you have thoughts on that I'm happy to listen)

also, even though it's not exactly what I want, I'd be open to some kind of tokenized opposite-contains, like Object.objects.filter(['Virginia', 'Tech']__in=Object.name), where something like "Virginia Technical Institute" would be returned. Although case insensitive, preferably.


When you're using the ORM, the thing to understand is that everything you do converts to SQL commands and it's the performance of the underlying queries on the underlying database that matter. Case in point:

SELECT COUNT (*) ...

Is that fast? Depends on whether your database stores any records to give you that information - MySQL/MyISAM does, MySQL/InnoDB does not. In English - this is one lookup in MYISAM, and n in InnoDB.

Next thing - in order to do exact match lookups efficiently in SQL you have to tell it when you create the table - you can't just expect it to understand. For this purpose SQL has the INDEX statement - in django, use db_index=True in the field options of your model. Bear in mind that this has an added performance hit on writes (to create the index) and obviously extra storage is needed (for the datastructure) so you cannot " INDEX all the things". Also, I don't think it will help for fuzzy matching - but it's worth noting anyway.

Next consideration - how do we do fuzzy matching in SQL? Well apparently LIKE and CONTAINS allow a certain amount of searching and wildcard-results to be executed in SQL. These are T-SQL links - translate for your database server :) You can achieve this via Model.objects.get(fieldname__contains=value) which will produce LIKE SQL, or similar. There are a number of options available there for different lookups.

This may or may not be powerful enough for you - I'm not sure.

Now, for the big question: performance. Chances are if you're doing a contains search that the SQL server will have to hit all of the rows in the database - don't take my word on that, but it would be my bet - even with indexing on. With 6000 rows this might not take all that long; then again, if you're doing this on a per-connection-to-your-app basis it's probably going to create a slowdown.

Next thing to understand about the ORM: if you do this:

Model.objects.get(fieldname__contains=value)
Model.objects.get(fieldname__contains=value)

You will issue two queries to the database server. In other words, the ORM doesn't always cache the results - so you might just want to do an .all() and search in memory. Do read about caching and querysets.

Further on on that last page, you'll also see Q objects - useful for more complicated queries.

So in summary then:

  • SQL contains some basic fuzzy matching-like parameters.
  • Whether or not these are sufficient depends on your needs.
  • How they perform depends on your SQL server - definitely measure it .
  • Whether you can cache these results in memory depends on how likely scaling is - again might be worth measuring the memory commit as a result - if you can share between instances and if the cache will be frequently invalidated (if it will be, don't do it).
  • Ultimately, I'd start by getting your fuzzy matching working, then measure, then tweak, then measure until you work out how to improve performance. 99% of this I learnt doing exactly that :)


    如果你需要比包含lookup更强的东西,看看正则表达式查找:https://docs.djangoproject.com/en/1.0/ref/models/querysets/#regex


    with postgres as database, you can use TrigramSimilarity to do fuzzy search and rank your results on different weight as well. Here is the link to documentation :

    https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/#trigram-similarity

    For full text search you can refer to https://czep.net/17/full-text-search.html

    链接地址: http://www.djcxy.com/p/38662.html

    上一篇: 如何使用Django ORM在两个传入模型上查询?

    下一篇: 在django queryset filter中使用模糊匹配