Build an ASCII chart of the most commonly used words in a given text

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).

  • Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).

  • Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii ) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths, e.g. the second most common word could be a lot longer than the first while not differing much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
  • An example:

    The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

    This specific text would yield the following chart:

     _________________________________________________________________________
    |_________________________________________________________________________| she 
    |_______________________________________________________________| you 
    |____________________________________________________________| said 
    |____________________________________________________| alice 
    |______________________________________________| was 
    |__________________________________________| that 
    |___________________________________| as 
    |_______________________________| her 
    |____________________________| with 
    |____________________________| at 
    |___________________________| s 
    |___________________________| t 
    |_________________________| on 
    |_________________________| all 
    |______________________| this 
    |______________________| for 
    |______________________| had 
    |_____________________| but 
    |____________________| be 
    |____________________| not 
    |___________________| they 
    |__________________| so 
    
    
    

    For your information: these are the frequencies the above chart is built upon:

    [('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
     ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227),
     ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181),
     ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166),
     ('they', 155), ('so', 152)]
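
    For anyone who wants an ungolfed reference point, here is a minimal Python sketch of the counting rules (a-z only, case-insensitive, ignore list, top 22). It is not taken from any answer; the function name top_words and its limit parameter are made up for illustration.

```python
import re
from collections import Counter

# Ignore list copied from the rules of the challenge.
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def top_words(text, limit=22):
    """Return the `limit` most frequent words as (word, count) pairs."""
    # Lower-case first so "She" and "she" are the same word; every run of
    # a-z is a word, so "don't" is split into "don" and "t".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in IGNORED)
    # most_common() orders the pairs by descending count.
    return counts.most_common(limit)


print(top_words("She said: don't! She didn't; it is the end."))
```

    Reading the text from a file named on the command line or from stdin is left out to keep the sketch short.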
    

    A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

     ________________________________________________________________
    |________________________________________________________________| she 
    |_______________________________________________________| superlongstringstring 
    |_____________________________________________________| said 
    |______________________________________________| alice 
    |________________________________________| was 
    |_____________________________________| that 
    |______________________________| as 
    |___________________________| her 
    |_________________________| with 
    |_________________________| at 
    |________________________| s 
    |________________________| t 
    |______________________| on 
    |_____________________| all 
    |___________________| this 
    |___________________| for 
    |___________________| had 
    |__________________| but 
    |_________________| be 
    |_________________| not 
    |________________| they 
    |________________| so 
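
    The 80-column rule is the part most entries get wrong, so here is a hedged Python sketch of one way to satisfy it: start from the widest bar the top word allows and shrink until every line fits. The name render_chart and the round-down choice are assumptions of this sketch, not something the spec dictates, and the input list is assumed to be non-empty and sorted by descending count.

```python
def render_chart(freqs, max_line=80):
    """Render (word, count) pairs, sorted by descending count, as bars."""
    top = freqs[0][1]

    def bar(count, width):
        # Bar length proportional to the count, rounded down.
        return count * width // top

    def fits(width):
        # "|" + bar + "|" + " " + word + " " must not exceed max_line.
        return all(bar(c, width) + len(w) + 4 <= max_line for w, c in freqs)

    # The top word alone allows max_line - 4 - len(word) underscores;
    # shrink from there until every other line fits as well.
    width = max_line - 4 - len(freqs[0][0])
    while width > 0 and not fits(width):
        width -= 1

    lines = [" " + "_" * width]
    lines += ["|" + "_" * bar(c, width) + "| " + w + " " for w, c in freqs]
    return lines


freqs = [("she", 553), ("superlongstringstring", 481), ("said", 462)]
for line in render_chart(freqs):
    print(line)
```

    With these three counts the top bar comes out 64 underscores wide under this sketch: 73 (the most that she alone would allow) would push the superlongstringstring line to 88 characters.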
    

    The winner:

    Shortest solution (by character count, per language). Have fun!


    Edit : Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

    Language          Relaxed  Strict
    =========         =======  ======
    GolfScript          130     143
    Perl                        185
    Windows PowerShell  148     199
    Mathematica                 199
    Ruby                185     205
    Unix Toolchain      194     228
    Python              183     243
    Clojure                     282
    Scala                       311
    Haskell                     333
    Awk                         336
    R                   298
    Javascript          304     354
    Groovy              321
    Matlab                      404
    C#                          422
    Smalltalk           386
    PHP                 450
    F#                          452
    TSQL                483     507
    

    The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

    Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).


    LabVIEW 51 nodes, 5 structures, 10 diagrams

    Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

    [image: LabVIEW code]

    The program flows from left to right:

    [image: LabVIEW code, explained]


    Ruby 1.9, 185 chars

    (heavily based on the other Ruby solutions)

    w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
    k,l=w[0]
    puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]
    

    Instead of using any command line switches like the other solutions, you can simply pass the filename as an argument (i.e. ruby1.9 wordfrequency.rb Alice.txt).

    Since I'm using character-literals here, this solution only works in Ruby 1.9.

    Edit: Replaced semicolons by line breaks for "readability". :P

    Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

    Edit 3: Removed the trailing space again ;)


    GolfScript, 177 175 173 167 164 163 144 131 130 chars

    Slow - 3 minutes for the sample text (130)

    {32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
    |"\~1*2/0*'| '@}/
    

    Explanation:

    {           #loop through all characters
     32|.       #convert to lowercase and duplicate
     123%97<    #true if it is not a letter
     n@if       #return either the letter or a newline
    }%          #return an array (of ints)
    ]''*        #convert array to a string with magic
    n%          #split on newline, removing blanks (stack is an array of words now)
    "oftoitinorisa"   #push this string
    2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
    -           #remove any occurrences from the text
    "theandi"3/-#remove "the", "and", and "i"
    $           #sort the array of words
    (1@         #takes the first word in the array, pushes a 1, reorders stack
                #the 1 is the current number of occurrences of the first word
    {           #loop through the array
     .3$>1{;)}if#increment the count or push the next word and a 1
    }/
    ]2/         #gather stack into an array and split into groups of 2
    {~~\;}$     #sort by the latter element - the count of occurrences of each word
    22<         #take the first 22 elements
    .0=~:2;     #store the highest count
    ,76\-:1     #store the length of the first line
    '_':0*' '\@ #make the first line
    {           #loop through each word
    "
    |"\~        #start drawing the bar
    1*2/0       #"divide by zero": really count*width/max (1, 2 and 0 are variables here)
    *'| '@      #finish drawing the bar
    }/
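
    For readers who don't speak GolfScript, the counting trick in the explanation above (sort the words, then walk the sorted list and either bump the current count or start a new word at 1) looks roughly like this in Python. This is only a sketch of the idea, not a translation of the full program; count_sorted_runs is a made-up name.

```python
def count_sorted_runs(words):
    """Count words by sorting them and measuring runs of equal neighbours."""
    pairs = []  # one [word, count] pair per distinct word
    for word in sorted(words):
        if pairs and pairs[-1][0] == word:
            pairs[-1][1] += 1        # same word as the previous one: bump count
        else:
            pairs.append([word, 1])  # new word: start a fresh count at 1
    # Order by descending count, like the sort-by-count step.
    pairs.sort(key=lambda p: -p[1])
    return [tuple(p) for p in pairs]


print(count_sorted_runs(["b", "a", "b", "c", "a", "b"]))
```

    No hash table is needed, which is why the approach suits a stack language, at the cost of sorting the whole word list first.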
    

    "Correct" (hopefully). (143)

    {32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
    |"\~1*^/0*'| '@}/
    

    Less slow - half a minute. (162)

    '"'/' ':S*n/S*'"#{%q
    '\+"
    .downcase.tr('^a-z','
    ')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
    |"\~1*2/0*'| '@}/
    

    Output visible in revision logs.
