Algorithm to match one input file against a given number of files
I had an interview last week and got stuck on one of the questions in the algorithm round. I answered it, but the interviewer did not seem convinced, which is why I am sharing it here.
Please suggest an optimized method for this question, so that it will help me in future interviews.
Question:
There are 20 ASCII text files, each smaller than 10^9 bytes. There is also one input file, also ASCII, say input.txt.
Our task is to match the content of this input file against the 20 given files and print the name of the closest matching file. The contents of the input file might match only partially.
Thanks in advance. Looking for your kind reply.
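Diff them and pipe the output through wc -l, or implement Levenshtein distance in C++, treating each line as a single character (or whatever unit is more appropriate for the subject domain).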
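A minimal sketch of the line-level Levenshtein idea, written in Python rather than C++ for brevity. The function names `line_edit_distance` and `closest_by_lines` are made up for illustration, and the candidate files are assumed to be already read into lists of lines. The dynamic program is quadratic in the number of lines, so for files near the 10^9-byte limit it would probably need a cheaper pre-filter first.

```python
def line_edit_distance(a_lines, b_lines):
    """Levenshtein distance where each whole line counts as one symbol."""
    # prev[j] holds the distance between the lines of a seen so far and b_lines[:j].
    prev = list(range(len(b_lines) + 1))
    for i, a in enumerate(a_lines, start=1):
        curr = [i]
        for j, b in enumerate(b_lines, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a line
                            curr[j - 1] + 1,      # insert a line
                            prev[j - 1] + cost))  # keep or substitute a line
        prev = curr
    return prev[-1]


def closest_by_lines(input_lines, candidates):
    """candidates maps a (hypothetical) file name to its list of lines."""
    return min(candidates,
               key=lambda name: line_edit_distance(input_lines, candidates[name]))
```

The smallest distance wins, so `closest_by_lines` returns the name of the candidate that needs the fewest line insertions, deletions, or substitutions to turn it into the input.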
You can build some kind of index (for example, a trie) to summarize the input file, then count how many index entries also occur in each document.
E.g., build a trie of every substring of length 10 in the input file. Then, for every (overlapping) string of length 10 in each text file, check whether it is found in the trie, and count the matches per file.
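A rough sketch of that counting step, under two assumptions: a Python `set` stands in for the trie (it answers the same "is this 10-character string in the input?" question), and the texts fit in memory. The names `kgrams` and `count_matches` are hypothetical.

```python
def kgrams(text, k=10):
    """All overlapping substrings of length k; plays the role of the trie."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def count_matches(index, text, k=10):
    """Count the overlapping length-k windows of text that occur in index."""
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] in index)


index = kgrams("abcdefghijkl")
print(count_matches(index, "xxabcdefghijkxx"))  # 2: "abcdefghij" and "bcdefghijk"
```

The file with the highest count would be reported as the closest match. Dividing the count by the number of windows in the file is one possible way to avoid favoring longer files.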
For designing really capable, scalable document-similarity systems, I'd suggest reading Chapter 3 of Mining Massive Datasets, which is freely available online. One approach presented there is to 'shingle' each document into a set of short overlapping word or character sequences, hash those sets into compact signatures (minhashing), and compare the signatures to estimate the Jaccard similarity between documents. Done right, this can work on petabytes of files with high precision. Explicit details with good diagrams are in Stanford's CS246 slides on Locality Sensitive Hashing. Simpler approaches, such as word frequency counting, are described in the book as well.
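A small sketch of the shingling and minhashing idea, not the book's exact construction: it assumes word shingles of length 3, 128 hash functions simulated by salting SHA-256 with a seed, and non-empty shingle sets. All function names here are illustrative.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    # One "hash function" per seed, simulated by salting SHA-256 (an assumption).
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_hashes=128):
    """Minimum hash value per hash function; shingle_set must be non-empty."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

For the interview problem, one would compute a signature for input.txt and for each of the 20 files, then print the file whose signature agrees with the input's in the most positions. The signatures are small, so each large file only has to be read once.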