Identifying the first and last items in a list
I need to transform some text files into HTML code. I'm stuck on transforming a list into an HTML unordered list. Example source:
some text in the document
* item 1
* item 2
* item 3
some other text
The output should be:
some text in the document
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
some other text
Currently, I have this:
r = re.compile(r'\*(.*)\n')
r.sub(r'<li>\1</li>\n', the_text_document)
which creates an HTML list without <ul> tags.
How can I identify the first and last items and surround them with <ul> tags?
Or use BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
edit
I apparently have to give you some hint on how to read documentation.
And many more things
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Don't stop reading after the first sentence... The last one is pretty important, and what's in the middle too.
In other words, you can create an empty document, let's say:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<div></div>", "html.parser")
document = soup.div
Then you read each line of your text, and whenever a line is plain text you do this:
document.append(line)
If the line starts with a *, create a ul tag and make it the current element:
ul = soup.new_tag('ul')
document.append(ul)
document = ul
Then push all the li tags onto the document, and once you stop reading * lines, just pop back to the parent so the document is the div again. And keep doing that... you can even do it recursively to insert ul tags into other ul tags.
Once you have parsed everything, you can do
str(document)
or
document.prettify()
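Put together, the steps above might look like the following minimal sketch. It assumes the bs4 package is installed and uses the built-in "html.parser". The function name text_to_html and the choice to skip blank lines are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def text_to_html(text):
    """Wrap runs of '* item' lines in <ul>/<li> tags inside a <div>."""
    soup = BeautifulSoup("<div></div>", "html.parser")
    root = soup.div
    current = root  # the tag we are currently appending to
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            if current is root:
                # first item of a list: open a <ul> and descend into it
                ul = soup.new_tag("ul")
                root.append(ul)
                current = ul
            li = soup.new_tag("li")
            li.string = stripped[1:].strip()
            current.append(li)
        else:
            # a non-item line ends any open list: pop back to the <div>
            current = root
            if stripped:
                root.append(stripped + "\n")
    # return only what is inside the <div>
    return root.decode_contents()
```

Calling text_to_html on the example from the question should give the text lines with a single <ul> element between them. The output is not pretty-printed, so the <li> tags end up on one line.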
Edit
I just realized that you weren't editing HTML but unformatted text. You could try using Markdown then.
http://daringfireball.net/projects/markdown/
You could just process your data line by line. The quick and dirty solution below could probably be tidied up, but for your data it does the trick.
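with open('data.txt') as inf:
    star_count = 0  # 1 while we are inside a run of '*' lines
    for line in inf:
        line = line.strip()
        if not line.startswith('*'):
            if star_count == 1:
                print('</ul>')
                star_count = 0  # reset so later lists get their own <ul>
            print(line)
        else:
            if star_count == 0:
                print('<ul>')
                star_count = 1
            # drop the leading '*' and the surrounding whitespace
            print('<li>%s</li>' % line[1:].strip())
    if star_count == 1:
        # close a list that runs to the end of the file
        print('</ul>')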
yields:
some text in the document
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
some other text
Depending on how complex your data is, or if you have repeated unnumbered lists etc., this will require modification. You may want to look for a more general solution, or modify this starter code to fit your needs; only you can decide.
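As one possible step toward a more general version, here is a small sketch that handles any number of lists, including one at the very end of the input. It uses itertools.groupby to split the lines into consecutive runs of item and non-item lines. The function name lists_to_html is made up for illustration, and returning a list of lines instead of printing is an assumption made so the result is easy to check.

```python
from itertools import groupby


def lists_to_html(lines):
    """Return output lines with each run of '*' lines wrapped in <ul>."""
    out = []
    stripped = (line.strip() for line in lines)
    # groupby yields consecutive runs of item lines and non-item lines
    for is_item, run in groupby(stripped, key=lambda s: s.startswith("*")):
        if is_item:
            out.append("<ul>")
            out.extend("<li>%s</li>" % s[1:].strip() for s in run)
            out.append("</ul>")
        else:
            out.extend(run)
    return out
```

A file object can be passed directly, for example print("\n".join(lists_to_html(open('data.txt')))).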
Update: Edited the <li> .. </li> print line to get rid of the * characters that were previously left.
After playing with some ideas, I've decided to go with a second regex. So basically, after running the first regex (from my original post, that creates the <li> tags), I run:
r = re.compile(r'(<li>.*?</li>\n(?!\s*<li>))', re.DOTALL)
r.sub(r'<ul>\1</ul>', string_with_li_tags)
This will find the first match of a <li> tag and the last match of a </li>\n combo that is not followed by a <li> tag (which essentially means the entire list) and add <ul> tags.
EDIT: I modified the regex a bit so it won't be greedy. This way it can handle multiple lists in the same document. The only requirement is that there are no spaces between list items, as @Aprillion mentioned below.
EDIT 2: Modified the negative lookahead to treat spaces between list items as well, so all cases are covered.
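For reference, the two passes can be combined roughly as follows. This is a sketch under some assumptions: the names ITEM, LIST and convert are made up here, the first pattern is a slightly adjusted variant that keeps the newline and drops the space after the *, the replacement adds newlines around the <ul> tags to match the output in the question, and the input is assumed to end with a newline (otherwise a list on the last line has no </li>\n to match).

```python
import re

# Pass 1 pattern: a '*' at the start of a line, then the item text
ITEM = re.compile(r'^\*[ \t]*(.*)$', re.MULTILINE)
# Pass 2 pattern: a run of <li> lines not followed by another <li>
LIST = re.compile(r'(<li>.*?</li>\n(?!\s*<li>))', re.DOTALL)


def convert(text):
    # Pass 1: turn every '* item' line into '<li>item</li>'
    with_li = ITEM.sub(r'<li>\1</li>', text)
    # Pass 2: wrap each run of consecutive <li> lines in <ul>...</ul>
    return LIST.sub(r'<ul>\n\1</ul>\n', with_li)
```

Because the lookahead skips whitespace, two lists separated only by blank lines are treated as one list.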