Software chain to find duplicate images

What I'm trying to achieve

I'm looking for a software chain to find duplicate images. First, here's how I define a duplicate image: there is an original image, coming directly from a camera, and one or more modified versions of that image. A modification can be any one, or a combination, of the following operations:

  • Changing brightness, contrast, coloring (a modified version of the image could be in Black & White)
  • Cropping
  • Resizing
  • Rotating
  • Adding a frame around the image
  • Writing on the frame
    A real-world example:

    The original image [image: original image]

    Luminosity + brightness change + resize [image: modified version #1]

    Cropping [image: modified version #2]

    Frame + text [image: modified version #3]

    Matching any pair of the images above should report a duplicate. As you can see, the modifications are not intended to be destructive, but rather ameliorative. For instance, the main subject of the image (here, the alarm clock) will never be cropped through its middle.

    Modifications can be chained (a new modification can be based on a previous modification rather than on the original image), so an image may end up being re-compressed many times.

    Then, the photographer can take another image:


    The viewpoint and the main subject have changed (it's now 0:02!) => when compared to any of the images above, this new image should not be considered as a duplicate.

    What I was doing so far

    #1: getting rid of frames

    First of all, I'm using OpenCV's Canny Detector + Hough algorithm to find vertical and horizontal lines on the image. Then, I crop the picture according to the lines the algorithm found.

    The problem I've been facing with that solution: when there are horizontal or vertical lines in the original picture's background, it's hard to tell which lines belong to the frame and which ones belong to the picture, so the image goes to manual review.

    I've also set a higher threshold to avoid getting too many false positives. Unfortunately, some elaborate frames (with a gradient, for instance) then go through undetected.

    Is there a better algorithm to detect these frames?

    #2: finding duplicates

    I've been using pHash and its DCT image hash so far. It computes a visual hash, and provides a very efficient way to search for similar images in a large database.
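    To illustrate the idea behind a DCT-based hash, here is a minimal pure-Python sketch. It is written in the spirit of pHash's DCT hash but is not pHash's actual code. It assumes the image has already been converted to grayscale and resized to 32 x 32, it skips the usual DCT normalisation factors, and the names `dct_hash` and `hamming` are made up for this example.

```python
import math
from statistics import median

N = 32  # side of the downscaled grayscale image (assumed already resized)
K = 8   # side of the low-frequency block kept for the hash

# 1-D DCT-II basis for the K lowest frequencies (normalisation omitted).
_BASIS = [[math.cos(math.pi * (2 * x + 1) * u / (2 * N)) for x in range(N)]
          for u in range(K)]


def dct_hash(pixels):
    """64-bit hash from the K x K lowest DCT frequencies of an N x N image."""
    # Row pass: rows[y][u] = sum over x of pixels[y][x] * basis[u][x]
    rows = [[sum(p * b for p, b in zip(row, _BASIS[u])) for u in range(K)]
            for row in pixels]
    # Column pass: coeffs[v][u] = sum over y of rows[y][u] * basis[v][y]
    coeffs = [[sum(rows[y][u] * _BASIS[v][y] for y in range(N))
               for u in range(K)]
              for v in range(K)]
    flat = [c for row in coeffs for c in row]
    threshold = median(flat[1:])  # ignore the DC term when thresholding
    bits = 0
    for c in flat:
        bits = (bits << 1) | (1 if c > threshold else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

    Two images are then treated as candidate duplicates when the Hamming distance between their hashes is below some cut-off. Because only the signs of low-frequency coefficients relative to their median are kept, a global brightness or contrast change leaves the hash almost untouched. A crop or rotation moves the coefficients around, which is consistent with combined modifications being missed.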

    Advantages:

  • It's very fast
  • You can search through thousands of images
  • It works well enough with all of my criteria (cropping, resizing, re-compressed images, rotation)

    Disadvantages:

  • Many false positives
  • It reports duplicates for images that were taken from completely different viewpoints
  • It can miss some duplicates when an image has gone through a combination of modifications

    All of the duplicates pHash finds end up in manual review as well. That's not a problem, except when the input data is thousands of images of the same subject. The number of duplicates to review then grows quadratically, which is not very convenient.
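    To put a number on the quadratic growth: in the worst case, where every image of a subject matches every other one, n images produce n(n-1)/2 pairs to review. A small sketch (the function name is only illustrative):

```python
from math import comb


def pairs_to_review(n_images):
    """Worst case: every image in a group matches every other image."""
    return comb(n_images, 2)


for n in (10, 100, 1000):
    print(n, pairs_to_review(n))
# 10 45
# 100 4950
# 1000 499500
```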

    Ideas on how to improve the duplicate detection

    I've been digging into how to reduce the number of false positives from pHash. My first idea was to add OpenCV's template matching to my existing software chain. The problem: it wouldn't work for rotated images.

    Then, I learned about feature detection, and I thought this might be the way to go. However, this is a very vast field and this is where I need help.

    I found on page 81 of this PDF an interesting comparison of feature detectors. If I get it right, I need "Rotation invariant" and "Scale invariant" but not "Affine invariant" (which seems to correspond to a change in the viewpoint). This would give me the following options:

  • Harris-Laplace
  • Hessian-Laplace
  • DoG
  • SURF

    Would these algorithms answer my needs? Should I integrate them into my existing chain, or should I start a new chain from scratch? Getting from feature detection to duplicate matching seems like a long way to go, so what would be the best approach?

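    You should take the local feature matching approach (SURF / ORB / BRISK ...). You can find a good tutorial here:

    If efficiency is very important, you can replace OpenCV's findHomography with custom find-rigid-transform code, but if it is not a big concern, findHomography will probably serve you well.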

