Performing Counts, Sorting/mapping Large Dicts

I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.

First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:

from urllib2 import urlopen

class Word(object):

    def __init__(self):
        self.word_count = {}    

    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x    

    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
            Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            elif word != None:
                self.word_count[word] = 1

class File(object):

    def __init__(self,book):
        self.book = urlopen(book)               
        self.word = Word()

    def strip_line(self,line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
            splits up the line and passes it along to the count method of the word object."""
        s = line.strip('nrt')
        if s > 1:
            self.word.count(s)

    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
            After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin == True:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()

book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')

book.process_book()

count = book.word.word_count

So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count or it gives me the exception ValueError: too many values to unpack , which rules out using list comprehension or mapping to a function to perform any kind of sorting.

I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:

s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)

# Delete the original dict, it is no longer needed
del book.word.word_count

Now I finally have a sorted list of words, s_count . PHEW! So, my questions are:

  • Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?

  • This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?

  • I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade off be that the program becomes more computationally expensive(higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?

  • Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.


  • Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack ? I expect it to give ValueError: need more than 1 value to unpack .

    When you use a dict as an iterator (eg in a for loop) you just get the keys, you don't get the values. If you want key, value pairs you need to use the dict 's iteritems() method as mentioned by figs in the comment (or in Python 3 the items() method).

    Of course, you can always do something like:

    for k in count:
        print k, count[k]
    

    ...

    I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)

    It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.

    The list comprehension in your count() method calls alpha_only() twice for each word, which is in efficient.

    You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.

    ...

    You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.

    ...

    As for the duplicated call of alpha_only() , whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,

    words = []
    for word in line.split():
        word = self.alpha_only(word)
        if word is not None:
            words.append(word) 
    
    链接地址: http://www.djcxy.com/p/76638.html

    上一篇: IF语句有2个变量和4个条件

    下一篇: 执行计数,排序/映射大型字典