Regular expression for items listed in plain English

This is sort of a contrived example, but I'm trying to get at a general principle here.

Given phrases written in English using this list-like form:

I have a cat
I have a cat and a dog
I have a cat, a dog, and a guinea pig
I have a cat, a dog, a guinea pig, and a snake

Can I use a regular expression to get all of the items, regardless of how many there are? Note that the items may contain multiple words.

Obviously, if there is just one item, I can use I have a (.+), and if there are exactly two, I have a (.+) and a (.+) works.

But things get more complicated if I want to match more than one form. To extract the list items from the first two examples, I would think this would work: I have a (.*)(?: and a (.*))? While it works on the first phrase, giving me the groups cat and null, for the second one it gives me cat and a dog and null. Things only get worse when I try to match phrases in even more forms.

Is there any way I can use regexes for this purpose? It seems rather simple, and I don't understand why my regex that matches 2-item lists works, but the one that matches 1- or 2-item lists does not.
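
A minimal sketch of why the optional group stays empty: (.*) is greedy, so it consumes the rest of the line, and because the trailing group is optional the engine never has to backtrack. Making the first group lazy and anchoring the end with $ (both are changes introduced here for illustration, as are the names GREEDY and LAZY) forces the second group to take part in the match:

```python
import re

# Pattern from the question: the greedy (.*) swallows " and a dog",
# and the optional group is then allowed to match nothing.
GREEDY = re.compile(r'I have a (.*)(?: and a (.*))?')

# Illustrative variant: lazy first group plus an end anchor, so the
# engine must try the optional group before extending the first one.
LAZY = re.compile(r'I have a (.*?)(?: and a (.*))?$')

for sample in ('I have a cat', 'I have a cat and a dog'):
    print(GREEDY.match(sample).groups(), LAZY.match(sample).groups())
```

This only covers the one- and two-item forms; longer lists need a repeated search rather than a fixed number of groups.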


You can use a non-capturing group as a conditional delimiter (either a comma or end-of-line):
' a (.*?)(?:,|$)'

Example in Python:

import re
line = 'I have a cat, a dog, a guinea pig, and a snake'
mat = re.findall(r' a (.*?)(?:,|$)', line)
print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
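
One caveat worth checking: with only a comma or end-of-line as the delimiter, the two-item form 'I have a cat and a dog' comes back as the single item 'cat and a dog'. A possible variation, sketched here, swaps the non-capturing group for a lookahead that also accepts ' and ' as a delimiter. ITEM_REGEX and extract_items are illustrative names, and an item that itself contains ' and ' would still be cut short:

```python
import re

# Lazy capture that stops before a comma, before " and ", or at end of line.
# The lookahead does not consume the delimiter, so the next search can
# start right where the previous item ended.
ITEM_REGEX = re.compile(r' a (.*?)(?=,| and |$)')

def extract_items(line):
    """Return the items following each ' a ' in the sentence."""
    return ITEM_REGEX.findall(line)

for sample in ('I have a cat',
               'I have a cat and a dog',
               'I have a cat, a dog, and a guinea pig',
               'I have a cat, a dog, a guinea pig, and a snake'):
    print(extract_items(sample))
```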

I would use regex splitting for this. Note that it assumes the sentence format matches your input set exactly:

>>> import re
>>> SPLIT_REGEX = r', |I have|and|, and'
>>> for sample in ('I have a cat', 'I have a cat and a dog', 'I have a cat, a dog, and a guinea pig', 'I have a cat, a dog, a guinea pig, and a snake'):
...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])
... 
['a cat']
['a cat', 'a dog']
['a cat', 'a dog', 'a guinea pig']
['a cat', 'a dog', 'a guinea pig', 'a snake']
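
A slightly stricter variation on the same splitting idea, offered as a sketch rather than a drop-in replacement: requiring spaces around "and" keeps words such as "sandpiper" intact, and the leading article is stripped from each piece. SEPARATORS and split_items are hypothetical names chosen for this example:

```python
import re

# Separators: the sentence prefix, an optional comma followed by " and ",
# or a plain comma followed by a space.
SEPARATORS = re.compile(r'^I have |,? and |, ')

def split_items(sentence):
    """Split the sentence on the separators and drop the leading a/an."""
    parts = (p.strip() for p in SEPARATORS.split(sentence))
    return [re.sub(r'^an? ', '', p) for p in parts if p]

print(split_items('I have a cat, a sandpiper, and an owl'))
```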

What you can do is use the \G anchor with the find method:

(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?>[b-z]+|\Ba|a(?!nd\b))+(?> (?>[b-z]+|\Ba|a(?!nd\b))+)*)

or, more simply:

(?:\G(?!\A)(?:,? and|,)|\bI have) an? ((?!and\b)[a-z]+(?> (?!and\b)[a-z]+)*)

\G matches at the position in the string where the previous match ended. The pattern has two entry points. The first match uses the second entry point, \bI have; subsequent matches use the first entry point, which only allows contiguous results.

Note: \G matches the position after the last match, but it also matches at the start of the string. (?!\A) is there to rule out that case.
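
Python's built-in re module has no \G anchor, so the following is only a sketch of the same contiguous-matching idea, emulated by calling match() at the position where the previous match ended. HEAD, FIRST, NEXT and list_items are names made up for this example, and the word pattern is taken from the simpler regex above with the atomic group replaced by a plain non-capturing group:

```python
import re

# Words of an item: lowercase words, none of which is the word "and".
WORDS = r'((?!and\b)[a-z]+(?: (?!and\b)[a-z]+)*)'

HEAD = re.compile(r'\bI have')                      # second entry point
FIRST = re.compile(r' an? ' + WORDS)                # item right after the head
NEXT = re.compile(r'(?:,? and|,) an? ' + WORDS)     # separator, then an item

def list_items(sentence):
    """Collect items that follow 'I have' contiguously, like \\G would."""
    head = HEAD.search(sentence)
    if head is None:
        return []
    items = []
    # match(string, pos) only succeeds exactly at pos, which plays the
    # role of \G: each item must start where the previous one ended.
    m = FIRST.match(sentence, head.end())
    while m:
        items.append(m.group(1))
        m = NEXT.match(sentence, m.end())
    return items

print(list_items('I have a cat, a dog, a guinea pig, and a snake'))
```

Because every match has to begin where the last one stopped, text after the list (for example a following sentence) is not picked up.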

online demo

regex planet (click the Java button)
