Code Golf: Quickly Build List of Keywords from Text, Including # of Instances
I've already worked out a solution for myself in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also like to see how quickly this could be done in any other major language in use today (mostly C#, Java, etc.).
Extra Credit
Where 'too good to be true' would be the actual statement
Extra-Extra Credit
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
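For readers less used to shell pipelines, the stages above can be sketched in Python roughly as follows. This is an illustration rather than part of the original answer: the function name `count_words` is made up here, and `str.split` separates on any whitespace, not only on spaces.

```python
from collections import Counter

def count_words(text):
    """Return (count, word) pairs, highest count first."""
    # sed 's/ /\n/g' | grep -v '^ *$': one word per item, blanks dropped
    # (str.split breaks on any run of whitespace, an assumption of this sketch)
    words = text.split()
    # sort | uniq -c: tally identical words
    counts = Counter(words)
    # sort -nr: order by count, largest first
    return sorted(((n, w) for w, n in counts.items()), reverse=True)
```

Calling `count_words` on the contents of the source text would yield a list of pairs in the same count-then-word layout as the shell output.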
With occurrence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"):
sed -e 's/[,.:"'\'']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: the output is already array-like for the shell, with the count in the first column and the word in the second.
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for@F;say Dump%_'
Here is an example of its use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump %_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for@F;say Dump%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
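#! perl
use 5.010;
use YAML;

# Count every lowercased word on each input line in the global hash %_.
while( my $line = <> ){
  # Split on runs of non-word characters, as -F'\W+' does above.
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  # An END block runs once at exit even though it is written inside the
  # loop, which is where the -n switch places the one-liner's code.
  END{
    say Dump %_;
  }
}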
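Splitting on `\W+` instead of whitespace changes what counts as a word: punctuation becomes a separator, so "John's" contributes both `john` and `s`. The Python sketch below illustrates that effect; it is not part of the original answer, the helper name `split_words` and the sample lines are made up, and `re.split` only stands in for Perl's `split`.

```python
import re
from collections import Counter

def split_words(line):
    """Lowercased words of a line, split on runs of non-word characters."""
    # re.split can return empty strings at the ends of the list; drop them
    return [w.lower() for w in re.split(r"\W+", line) if w]

# Hypothetical sample input, tallied line by line like the Perl loop
counts = Counter()
for line in ["John's dog: too good to be true.", "Too good!"]:
    counts.update(split_words(line))
```

After the loop, `counts` holds one entry per lowercased word, including the stray `s` left over from "John's".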
F#: 304 characters
let f =
    // Set.ofSeq is the current name of what older F# releases called Set.of_seq
    let bad = Set.ofSeq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        // Split on runs of characters that are not word characters, '-' or '
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
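For comparison, the same filtering logic might look like this in Python. It is a rough sketch under assumptions, not a translation endorsed by the answer's author: the names `keywords` and `BAD` are made up, and the character class is written as `[^\w'-]` so the hyphen is unambiguously literal.

```python
import re
from collections import Counter

# Same stop words as in the F# snippet
BAD = {"and", "is", "the", "of", "are", "by", "it"}

def keywords(length, occurrence, msg):
    """Words longer than `length` that occur more than `occurrence` times."""
    # Split on runs of characters that are not word characters, ' or -
    words = re.split(r"[^\w'-]+", msg)
    counts = Counter(words)
    # Keep a word only if it passes all three filters
    return [w for w, n in counts.items()
            if len(w) > length and n > occurrence and w not in BAD]
```

Like the F# version, this returns only the words and discards the counts once the filters have been applied.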