Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).

  • Return only words with an occurrence greater than X
  • Return only words with a length greater than Y
  • Ignore common terms like "and, is, the, etc"
  • Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
  • Return results in a collection/array
  Extra Credit

  • Keep Quoted Statements together (ie. "They were 'too good to be true' apparently"),
    where 'too good to be true' would be the actual statement

  Extra-Extra Credit

  • Can your script determine words that should be kept together based upon their frequency of being found together? This should be done without knowing the words beforehand. Example: *"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."* Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too? (A rough sketch of one approach follows this list.)
  • Source text: http://sampsonresume.com/labs/c.txt
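
For reference, a rough sketch of the basic requirements in Python, plus a naive adjacent-pair count for the extra-extra credit, might look like the following. The function and parameter names (keywords, frequent_pairs, min_count, min_length) and the stop-word list are only illustrative and are not taken from any answer below:

    import re
    from collections import Counter

    STOP_WORDS = {"and", "is", "the", "a", "an", "of", "to", "in", "it"}  # illustrative only

    def keywords(text, min_count=1, min_length=2, stop_words=STOP_WORDS):
        """Return {word: count} for words longer than min_length that occur
        more than min_count times and are not stop words."""
        # Splitting on anything that is not a letter also handles "John's":
        # it becomes "john" plus a stray "s" that the length filter drops.
        words = re.findall(r"[a-z]+", text.lower())
        counted = Counter(w for w in words
                          if len(w) > min_length and w not in stop_words)
        return {w: n for w, n in counted.items() if n > min_count}

    def frequent_pairs(text, min_count=2, stop_words=STOP_WORDS):
        """Naive take on the extra-extra credit: count adjacent word pairs
        and keep the ones that repeat, e.g. ('fruit', 'fly')."""
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in stop_words]
        pairs = Counter(zip(words, words[1:]))
        return {pair: n for pair, n in pairs.items() if n >= min_count}

Over the fruit-fly example above, frequent_pairs should report ('fruit', 'fly') with a count of 3. Dropping stop words before pairing can create false adjacencies, but that is acceptable for a sketch.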

    Answer Format

  • It would be great to see the results of your code (its output), along with how long the operation took.

  • GNU scripting

    sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
    

    Results:

      7 be
      6 to
    [...]
      1 2.
      1 -
    

    With occurrence greater than X:

    sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
    

    Return only words with a length greater than Y (put Y+1 dots in second grep):

    sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
    

    Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')

    sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
    

    Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):

    sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c
    

    Return results in a collection/array: the output already behaves like an array for the shell: the first column is the count, the second is the word.
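
    If a real array/map structure is wanted rather than plain text, one option (a minimal sketch, not part of the pipeline above; the file name parse_counts.py is just a placeholder) is to read the uniq -c output into a Python dictionary:

    #!/usr/bin/env python3
    # Reads `sort | uniq -c` output (e.g. "      7 be") from stdin into a dict.
    import sys

    counts = {}
    for line in sys.stdin:
        count, word = line.split(None, 1)   # split on the first run of whitespace
        counts[word.strip()] = int(count)

    print(counts)

    Used as: sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | python3 parse_counts.py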


    Perl in only 43 characters.

    perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
    

    Here is an example of its use:

    echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
    
    ---
    a: 3
    aa: 1
    b: 2
    c: 1
    d: 1
    e: 1
    

    If you need to list only the lowercase versions, it requires two more characters.

    perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
    

    Getting it to work on the specified text requires 58 characters.

    curl http://sampsonresume.com/labs/c.txt |
    perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
    
    real    0m0.679s
    user    0m0.304s
    sys     0m0.084s
    

    Here is the last example expanded a bit.

    #! perl
    use 5.010;
    use YAML;
    
    while( my $line = <> ){
      # split on runs of non-word characters and count each lowercased word
      for my $elem ( split '\W+', $line ){
        $_{ lc $elem }++;
      }
      # the END block runs once, after all input has been read
      END{
        say Dump \%_;
      }
    }
    

    F#, 304 characters:

    let f =
        let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
        fun length occurrence msg ->
            System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
            |> Seq.countBy (fun a -> a)
            |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
    