Text extraction from email in Python
My users will send me posts by email, à la Posterous.
I'm using Google App Engine (GAE) to receive and parse emails. GAE returns the text part of the message.
I need to extract the post from the plain text part of the message.
The plain text can be "contaminated" with promotional headers, footers, signatures, etc.
Also, I would like to leave out the "please post this:" or similar lead-ins that some people include.
How would you achieve this?
Are there any tools (simpler than regex) I can use?
UPDATE
Examples:
(In all these examples the post is "Lorem ipsum dolor sit amet...")
=====
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Victor P
victor.p@example.com
visit my blog at: www.example.com/victor
=====
Hello, I like your page. Please can you include this: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
=====
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
=====
If you find more examples of what an email can look like, please feel free to include them in the post.
I would go with a list of compiled regular expressions. Something along the lines of:
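import re

# Patterns for text that should be stripped; extend as new cases appear.
regexes = (
    re.compile(r"visit my blog at: .*$", re.IGNORECASE),
    re.compile(r"please post this:", re.IGNORECASE),
    re.compile(r"please can you include this:", re.IGNORECASE),
    # etc.
)

for file_path in files:  # files: an iterable of paths to saved message bodies
    with open(file_path) as f:
        for line in f:
            # Apply every pattern to the line before printing it.
            for regex in regexes:
                line = regex.sub("", line)
            print(line.rstrip("\n"))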
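Per-line substitution only removes phrases you have already listed, so it would still leave lines such as "Victor P" behind. As a rough sketch covering the three examples in the question, the function below first drops a short lead-in ending in a "please post/include this:" phrase, then trims trailing lines that look like a signature. The names `extract_post`, `LEAD_IN`, `SENTENCE_END` and `MAX_SIGNATURE_LINE` are hypothetical, and the "short line without closing punctuation" rule is an assumed heuristic rather than anything GAE provides, so expect to tune it against real mail.

```python
import re

# Hypothetical pattern: a lead-in of at most 200 characters that ends with a
# "please post this:" / "please can you include this:" style request.
LEAD_IN = re.compile(
    r"^.{0,200}?\bplease (?:can you )?(?:post|include) this:\s*",
    re.IGNORECASE | re.DOTALL,
)

# Assumed heuristic: signature lines are short and do not end like a sentence.
SENTENCE_END = (".", "!", "?", '"')
MAX_SIGNATURE_LINE = 60


def extract_post(text):
    """Return the post without a leading request or a trailing signature."""
    # Remove the lead-in, if one is found near the start of the message.
    text = LEAD_IN.sub("", text.strip(), count=1)
    lines = [line.rstrip() for line in text.splitlines()]

    # Walk up from the bottom, dropping blank lines and short lines that do
    # not end with sentence punctuation. Always keep at least one line.
    while len(lines) > 1 and (
        not lines[-1]
        or (
            len(lines[-1]) < MAX_SIGNATURE_LINE
            and not lines[-1].endswith(SENTENCE_END)
        )
    ):
        lines.pop()

    return "\n".join(lines).strip()


if __name__ == "__main__":
    sample = (
        "Please post this: Lorem ipsum dolor sit amet, consectetur "
        "adipisicing elit, sed do eiusmod tempor incididunt.\n"
        "Victor P\n"
        "victor.p@example.com\n"
    )
    print(extract_post(sample))
```

The trade-off is that a genuine post ending in a short line without punctuation (a one-line sign-off, say) would lose that line, so logging what gets trimmed is worthwhile while you tune the thresholds.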