Get the most probable color from a set of words

Are there any existing libraries or methods that let you figure out the most probable color for a set of words? For example, given cucumber, apple, and grass, it should return green. Has anyone worked in that direction before?


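If I had to do this, I would try searching Google Images (or another image search engine) for the words and identify the most common colors in the resulting images.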
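The color-identification half of that idea can be sketched roughly as follows. This assumes the images have already been fetched and decoded into (r, g, b) pixel tuples; the palette, the function names, and the sample pixels are all made up for illustration.

```python
from collections import Counter

# Hypothetical reference palette: names and RGB values are illustrative only
PALETTE = {
    'red': (255, 0, 0),
    'orange': (255, 165, 0),
    'yellow': (255, 255, 0),
    'green': (0, 128, 0),
    'blue': (0, 0, 255),
    'purple': (128, 0, 128),
}

def nearest_color(pixel):
    """Return the palette name closest to an (r, g, b) pixel (squared Euclidean distance)."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )

def dominant_color(pixels):
    """Return the palette name that the largest number of pixels map to."""
    counts = Counter(nearest_color(p) for p in pixels)
    return counts.most_common(1)[0][0]

# Made-up pixels standing in for a decoded image of grass
sample = [(30, 140, 40), (20, 120, 10), (250, 240, 20), (10, 130, 30)]
print(dominant_color(sample))
```

Plain RGB distance is a crude stand-in for perceived color difference, so a real version would probably want a better color space and a larger palette.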


That sounds like a pretty reasonable NLP problem, and one that's very easy to handle via map-reduce.

Identify a list of words and phrases that you call colors ['blue', 'green', 'red', ...]. Go over a large corpus of sentences, and for the sentences that mention a particular color, for every other word in that sentence, note down (word, color_name) in a file. (Map Step)

Then for each word you have seen in your corpus, aggregate all the colors you have seen for it to get something like {'cucumber': {'green': 300, 'yellow': 34, 'blue': 2}, 'tomato': {'red': 900, 'green': 430}, ...} (Reduce Step)

Provided you use a large enough corpus (something like Wikipedia) and you figure out how to prune really small counts and rare words, you should be able to build a fairly comprehensive and robust dictionary mapping millions of items to their colors.
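The map and reduce steps above might look something like this on a single machine. It is only a minimal sketch: the three-sentence corpus, the tiny color list, and the min_count parameter are placeholders I made up, and a real run would distribute the work over a much larger corpus.

```python
import re
from collections import Counter, defaultdict

COLORS = {'blue', 'green', 'red', 'yellow'}

def map_sentence(sentence):
    """Map step: emit (word, color) for every non-color word in a sentence that mentions a color."""
    words = re.findall(r"[a-z]+", sentence.lower())
    found = [w for w in words if w in COLORS]
    return [(w, c) for c in found for w in words if w not in COLORS]

def reduce_pairs(pairs, min_count=1):
    """Reduce step: aggregate color counts per word, dropping counts below min_count."""
    table = defaultdict(Counter)
    for word, color in pairs:
        table[word][color] += 1
    pruned = {}
    for word, counts in table.items():
        kept = {c: n for c, n in counts.items() if n >= min_count}
        if kept:
            pruned[word] = kept
    return pruned

# Toy corpus standing in for something like a Wikipedia dump
corpus = [
    "The cucumber is green.",
    "A green cucumber sat next to a red tomato.",
    "The tomato turned red.",
]
pairs = [p for s in corpus for p in map_sentence(s)]
table = reduce_pairs(pairs)
print(table['cucumber'])  # {'green': 2, 'red': 1}
```

Note that function words such as "the" pick up counts too, which is part of what the pruning step would need to handle.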


Another way to do that is to run a Google text search for each combination of a color and the word in question, and take the combination with the highest number of results. Here's a quick Python script for that:

import json
import urllib.parse
import urllib.request

def google_count(q):
    """Return Google's estimated result count for the query q."""
    query = urllib.parse.urlencode({'q': q})
    # legacy Google AJAX Search endpoint, kept as in the original script
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    with urllib.request.urlopen(url) as search_response:
        search_results = search_response.read().decode('utf-8')
    results = json.loads(search_results)
    data = results['responseData']
    return int(data['cursor']['estimatedResultCount'])

colors = ['yellow', 'orange', 'red', 'purple', 'blue', 'green']

# get a list of google search counts
res = [google_count('"%s grass"' % c) for c in colors]
# pair the results with their corresponding colors
res2 = list(zip(res, colors))
# get the color with the highest score
print("%s is %s" % ('grass', max(res2)[1]))

This will print:

grass is green
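
To extend this from one word to a set of words, as the question asks, one option is to sum the counts per color across all the words. The sketch below assumes a pluggable count function; most_probable_color, fake_count, and the numbers in FAKE_COUNTS are hypothetical and only stand in for a real search backend.

```python
def most_probable_color(words, colors, count_fn):
    """Pick the color whose '"<color> <word>"' phrase counts sum highest over all words."""
    totals = {
        c: sum(count_fn('"%s %s"' % (c, w)) for w in words)
        for c in colors
    }
    return max(totals, key=totals.get)

# Made-up counts standing in for real search results
FAKE_COUNTS = {
    '"green cucumber"': 300, '"yellow cucumber"': 34,
    '"green apple"': 500, '"red apple"': 700,
    '"green grass"': 900, '"yellow grass"': 50,
}

def fake_count(query):
    """Look up a canned count; unknown queries count as zero."""
    return FAKE_COUNTS.get(query, 0)

words = ['cucumber', 'apple', 'grass']
colors = ['yellow', 'red', 'green']
print(most_probable_color(words, colors, fake_count))
```

Raw sums let very common words dominate the total, so normalizing each word's counts before adding them may give better results.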