Build an ASCII chart of the most commonly used words in a given text

2018-06-05 18:09:54

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

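The rules:

Only accept a-z and A-Z (alphabetic characters) as part of a word.

Ignore casing (She == she for our purpose).

Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is

Clarification: consider don't: this is taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).

Optionally (it's too late to formally change the spec now) you may choose to drop all single-letter 'words' (this could also shorten the ignore list).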
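The word rules above can be sketched in Python. This is only an illustration, not one of the submitted solutions; the names `IGNORED` and `count_words` and the `drop_single_letters` flag are made up for this example.

```python
import re
from collections import Counter

# Ignore list taken from the rules above.
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def count_words(text, drop_single_letters=False):
    # Lowercase first so that She == she, then keep only runs of a-z.
    # Anything else (digits, apostrophes, ...) splits words: don't -> don, t.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in IGNORED)
    if drop_single_letters:
        # Optional rule: discard every one-letter 'word'.
        counts = Counter({w: n for w, n in counts.items() if len(w) > 1})
    return counts

print(count_words("She said: don't do it, she won't."))
```

With the optional rule switched on, a, i and the stray t from don't all disappear without needing an entry in the ignore list.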

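Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).

The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.

Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possibly differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).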
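One way to read the fitting rule is as a search for the widest top bar that still lets every row fit. The sketch below assumes a non-empty list of (word, count) pairs sorted by descending count, and assumes bar widths are rounded down; the spec does not prescribe rounding, and `render_chart` is a name invented here.

```python
def render_chart(freqs, max_line=80):
    """Render (word, count) pairs, most frequent first, as chart rows."""
    top = freqs[0][1]

    def bar_widths(m):
        # m is the width of the longest bar; the others scale by count / top.
        return [count * m // top for _, count in freqs]

    def fits(m):
        # A row is "|" + bar + "|" + " " + word + " ": bar + len(word) + 4.
        return all(width + len(word) + 4 <= max_line
                   for width, (word, _) in zip(bar_widths(m), freqs))

    # Start from the widest candidate and shrink until every row fits.
    m = max_line - 4
    while m > 0 and not fits(m):
        m -= 1

    widths = bar_widths(m)
    rows = [" " + "_" * widths[0]]  # closes the first bar on top
    rows += ["|" + "_" * width + "| " + word + " "
             for width, (word, _) in zip(widths, freqs)]
    return rows

print("\n".join(render_chart([("she", 553), ("you", 481), ("said", 462)])))
```

Because rounded-down widths never shrink when the top bar grows, the first candidate that fits while counting down is also the widest one that fits.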

An example:

The example uses the text of Alice's Adventures in Wonderland, by Lewis Carroll.

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so

For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
 ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227),
 ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181),
 ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166),
 ('they', 155), ('so', 152)]

A second example (to check whether you implemented the complete spec): replace every occurrence of you in the Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so

The winner:

Shortest solution (by character count, per language). Have fun!

Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

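The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the "strict" solution. 'Unix Toolchain' signifies various solutions that use a traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).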

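LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

[Image: LabVIEW code]

The program flows from left to right:

[Image: LabVIEW code explained]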

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

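Instead of using command line switches like the other solutions, you simply pass the filename as an argument (i.e. ruby1.9 wordfrequency.rb Alice.txt).

Since I'm using character literals here, this solution only works in Ruby 1.9.

Edit: Replaced the semicolons with line breaks. :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)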
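For readers who don't speak golfed Ruby, here is a rough Python reading of the last two steps of the script above. It is a sketch, not a translation the author provided: `relaxed_chart` is a hypothetical name, and it takes an already counted word -> count mapping instead of reading a file.

```python
def relaxed_chart(counts, top_n=22):
    # Like the Ruby sort on [-count, word]: count descending, then word.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    if not ranked:
        return []
    top_word, top_count = ranked[0]
    # "Relaxed" sizing: only the first word's length is taken into account.
    m = 76 - len(top_word)
    rows = [" " + "_" * m]
    rows += ["|" + "_" * (count * m // top_count) + "| " + word
             for word, count in ranked]
    return rows

print("\n".join(relaxed_chart({"she": 553, "you": 481, "said": 462})))
```

Sizing the chart from the first word alone is what keeps it short, and also why a long second word such as superlongstringstring can push a row past 80 columns. To mimic Ruby's `$<`, the text could be read with `"".join(fileinput.input())`, which takes file names from the command line and falls back to standard input.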

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to lowercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
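The counting trick in the walkthrough (sort the words, then count runs of equal neighbours) can be sketched in Python as follows. `count_by_sorting` is a name invented for this illustration, and `itertools.groupby` stands in for the hand-rolled stack loop.

```python
from itertools import groupby

def count_by_sorting(words):
    # Sorting puts equal words next to each other, so every run of
    # identical neighbours can be counted in a single pass.
    pairs = [(word, sum(1 for _ in run)) for word, run in groupby(sorted(words))]
    # Highest count first; sorted() is stable, so ties stay alphabetical.
    return sorted(pairs, key=lambda pair: -pair[1])

print(count_by_sorting(["b", "a", "b", "c", "a", "b"])[:22])
```

No hash table is needed this way, which suits a language where sorting an array costs a single character.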

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/
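As far as I can read it, the extra step in this version computes, for every word, the widest top bar that word could tolerate, and then takes the minimum. A Python sketch of that reading, with the hypothetical name `top_bar_width` and integer division assumed throughout:

```python
def top_bar_width(freqs, max_line=80):
    """freqs: non-empty list of (word, count) pairs, most frequent first."""
    top = freqs[0][1]
    # A row needs bar + len(word) + 4 columns, so a word's own bar may be
    # at most max_line - 4 - len(word) wide. Scaling that limit by
    # top / count gives the widest top bar this word allows.
    return min((max_line - 4 - len(word)) * top // count
               for word, count in freqs)

print(top_bar_width([("she", 553), ("superlongstringstring", 481)]))
```

The result always fits, but rounding down twice can cost a column: for the superlongstringstring example this gives 63, while a top bar of 64 would still keep every row within 80 characters.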

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output is visible in the revision logs.
