Software chain to find duplicate images
What I'm trying to achieve
I'm looking for a software chain to find duplicate images. First, here's how I define a duplicate image: there's an original image, coming directly from a camera, and modified version(s) of this image. Modifying the image can be any one, or a combination, of the following operations:

- resizing
- luminosity/brightness adjustment
- cropping
- adding a frame and/or text
- re-compression
A real-world example:

- The original image
- Luminosity + brightness change + resize
- Cropping
- Frame + text
Matching a pair of any of the images above should result in finding a duplicate. As you can see, the modifications are not meant to be destructive, but rather ameliorative: for instance, the main subject of the image (here, the alarm clock) will never be cropped through its middle.
Modifications can also be chained (a new modification can be based on a previous modification rather than on the original image), resulting in an image that has been re-compressed many times.
Then, the photographer can take another image:
The viewpoint and the main subject have changed (it's now 0:02!), so when compared to any of the images above, this new image should not be considered a duplicate.
What I was doing so far
#1: getting rid of frames
First of all, I'm using OpenCV's Canny edge detector + Hough transform to find vertical and horizontal lines in the image. Then, I crop the picture along the lines the algorithm found.
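Here's roughly what that step looks like in Python with OpenCV (a minimal sketch; the function name and all thresholds are illustrative, not my exact values):

```python
import cv2
import numpy as np

# Sketch: Canny edges + probabilistic Hough transform, keeping only the
# near-horizontal / near-vertical lines that could belong to a frame.
def find_frame_candidates(path):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)  # hysteresis thresholds: tune per dataset
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                            minLineLength=img.shape[1] // 2, maxLineGap=10)
    horizontal, vertical = [], []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            if abs(y1 - y2) <= 2:    # nearly horizontal
                horizontal.append((x1, y1, x2, y2))
            elif abs(x1 - x2) <= 2:  # nearly vertical
                vertical.append((x1, y1, x2, y2))
    return horizontal, vertical      # crop to the innermost lines afterwards
```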
Problem I've been facing with that solution: when there are horizontal or vertical lines in the original picture's background, it's hard to distinguish which lines come from the frame and which come from the picture itself, so these cases end up in manual review.
I've also set a higher threshold to avoid getting too many false positives; unfortunately, some elaborate frames (with a gradient, for instance) still slip through.
Is there a better algorithm to detect these frames?
#2: finding duplicates
I've been using pHash and its DCT image hash so far. It computes a perceptual hash and provides a very efficient way to search for similar images in a large database.
Advantages: hashing and comparison are fast, and searching a large database for near matches scales well.
Disadvantages: it returns false positives, so every candidate pair still has to be checked.
All of the duplicates pHash finds end up in manual review as well. That's not a problem in itself, except when the input data is thousands of images of the same subject: the number of candidate pairs to review then grows quadratically, which is not very convenient.
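For illustration, the same DCT-based perceptual hash is exposed by the Python imagehash package (a stand-in for the pHash library itself; the distance threshold below is an arbitrary example):

```python
from PIL import Image
import imagehash

# Sketch: DCT perceptual hash (imagehash.phash) as a stand-in for pHash.
# Subtracting two ImageHash objects gives their Hamming distance.
def probable_duplicate(path_a, path_b, max_distance=10):  # threshold is illustrative
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return hash_a - hash_b <= max_distance
```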
Ideas on how to improve the duplicate detection
I've been digging around for ways to reduce the number of false positives from pHash. My first idea was adding OpenCV's template matching to my existing software chain. Problem: it wouldn't work for rotated images.
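For completeness, the idea I had in mind looks like this (a sketch; cv2.TM_CCOEFF_NORMED is just one of several scoring methods OpenCV offers):

```python
import cv2

# Sketch: normalized cross-correlation template matching. This is the
# approach that breaks down as soon as one of the images is rotated.
def template_score(scene_path, template_path):
    scene = cv2.imread(scene_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_val, max_loc  # best score in [-1, 1] and its location
```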
Then I learned about feature detection, and I thought this might be the way to go. However, it's a very broad field, and this is where I need help.
I found on page 81 of this PDF an interesting comparison of feature detectors. If I get it right, I need "rotation invariant" and "scale invariant", but not "affine invariant" (which seems to cover a change in viewpoint). This would give me the following options:
Would these algorithms answer my needs? Should I integrate them into my existing chain, or should I start a new chain from scratch? Going from feature detection to duplicate matching seems like a long road; what would be the best approach?
You should take a local feature matching approach (SURF / ORB / BRISK ...). You can find a good tutorial here: http://docs.opencv.org/doc/tutorials/features2d/feature_flann_matcher/feature_flann_matcher.html. If efficiency is really important to you, you can replace OpenCV's findHomography with custom find-rigid-transform code, but if that's not a big concern, findHomography will probably serve you well.
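A minimal sketch of that pipeline in Python, using ORB (named above) with a brute-force Hamming matcher rather than the tutorial's SURF + FLANN; the ratio-test value and the inlier threshold are illustrative assumptions:

```python
import cv2
import numpy as np

# Sketch: ORB keypoints, Hamming-distance matching with Lowe's ratio test,
# then findHomography + RANSAC. Two images are declared duplicates when
# enough matches survive as RANSAC inliers (threshold is illustrative).
def is_duplicate(path_a, path_b, min_inliers=20):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return False

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:  # findHomography needs at least 4 correspondences
        return False

    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H is not None and int(mask.sum()) >= min_inliers
```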