Extracting substring from URL using regex

Regex newbie here. I have a bunch of URLs from which I need to extract some substrings, and I am using a regular expression to do it.

Example: if my URL is https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US , I need to extract (1) the vt-hokie-stone-theme part and (2) the enmbbbhbkojhbkbolmfgbmlcgpkjjlja part from this URL into two separate variables.

The initial part of my URL always remains constant, so I built the regular expression detail/([a-z0-9-]+)/([a-z]+) and I am testing it on http://www.pythonregex.com/

I see that regex.findall(string) gives me what I want, but I have the following questions:

  • I want them in two separate variables, instead of as a list in a single variable. How do I do that?

  • Also, while checking on pythonregex, the regex.findall(string) command gives the output as [(u'vt-hokie-stone-theme', u'enmbbbhbkojhbkbolmfgbmlcgpkjjlja')] . I understand that the preceding u means unicode but I don't want it in my output. How do I remove it?


  • You can use tuple/list assignment syntax to achieve this:

    import re

    try:
        # search() returns None when nothing matches, so .groups() raises AttributeError
        var1, var2 = re.search(r"detail/([a-z0-9-]+)/([a-z]+)", my_url).groups()
    except AttributeError:
        var1 = var2 = ""
    
  • The u prefix only shows up in the website's output. In plain Python, matching against a regular string returns regular strings, so you don't have to worry about it.
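
For what it's worth, here is a minimal sketch of the same unpacking idea that checks for a failed match explicitly instead of catching AttributeError. The helper name extract_theme_and_id is made up for illustration, the URL is the one from the question, and the code is written for Python 3.

```python
import re

# Pattern from the question, with the last character class written as [a-z].
DETAIL_RE = re.compile(r"detail/([a-z0-9-]+)/([a-z]+)")


def extract_theme_and_id(url):
    """Hypothetical helper: return (theme, extension_id), or ("", "") if no match."""
    match = DETAIL_RE.search(url)
    if match is None:
        return "", ""
    # groups() returns one string per capturing group, in order.
    return match.groups()


url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")

# Tuple unpacking puts each captured group in its own variable.
theme, extension_id = extract_theme_and_id(url)
print(theme)         # vt-hokie-stone-theme
print(extension_id)  # enmbbbhbkojhbkbolmfgbmlcgpkjjlja
```

Whether you prefer the None check or the try/except is mostly a matter of taste; both leave you with two plain variables.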


  • I personally don't see the issue with just setting the variables from the first element of the findall() list. But if you're confident that your regex will always match the URL from its very first character, you can try re.match, which only matches at the start of the string:

    In [22]: regex = re.compile('a(bc)(cd)')

    In [23]: regex.match('abccd').groups()

    Out[23]: ('bc', 'cd')

  • What's the issue with unicode? Why don't you want to keep it? Your regex can only match ASCII characters anyway, so that's not a problem. Either way, if it's really important that they be regular strings, just convert them with str().

    str(u'abc') == 'abc'
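
To make the findall() suggestion concrete, here is a rough sketch using the pattern and URL from the question (Python 3; the variable names are just for illustration). It also shows why re.match needs care with this particular pattern: match() only tries the start of the string, and the pattern begins with "detail/", not "https://".

```python
import re

url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")
pattern = re.compile(r"detail/([a-z0-9-]+)/([a-z]+)")

# findall() returns a list with one tuple per match; unpack the first tuple.
matches = pattern.findall(url)
theme, extension_id = matches[0] if matches else ("", "")

# match() is anchored at position 0, so it fails on the full URL here.
anchored = pattern.match(url)

# search() scans the whole string and finds the same two groups.
found = pattern.search(url).groups()

print(theme, extension_id)
print(anchored)
print(found == (theme, extension_id))
```

So with this pattern, either stick to findall()/search(), or extend the regex to cover the beginning of the URL before switching to match().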


  • If you are certain of the format of the URL, you can use the regex below to achieve the same thing. Note that the second .* inside the base group and the .* capturing the theme group are both non-greedy (.*?), while the tail group is greedy and takes the rest of the string.

    >>> import re
    >>> var = 'https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U'
    >>> match = re.match(r"(?P<base>.*/webstore/.*?/)(?P<theme>.*?)/(?P<tail>.*)", var)
    >>> if match:
    ...     print(match.group('base'))
    ...     print(match.group('theme'))
    ...     print(match.group('tail'))
    ...
    
    https://chrome.google.com/webstore/detail/
    vt-hokie-stone-theme
    enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U
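
Note that the tail group above still carries the query string. If you only want the two pieces from the question, one possible variation (a sketch, not the pattern from this answer) is to stop the last group at the first "?" or "/". The group name ext_id is invented for this example, and the code assumes Python 3.

```python
import re

var = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")

# [^/]+ captures one path segment; [^/?]+ also stops before the query string.
pattern = re.compile(r".*/webstore/.*?/(?P<theme>[^/]+)/(?P<ext_id>[^/?]+)")

match = pattern.match(var)
# groupdict() maps each named group to the text it captured.
parts = match.groupdict() if match else {}

print(parts.get("theme"))   # vt-hokie-stone-theme
print(parts.get("ext_id"))  # enmbbbhbkojhbkbolmfgbmlcgpkjjlja
```

Using negated character classes instead of .*? keeps each group from running past a slash, which tends to be easier to reason about than mixing greedy and non-greedy wildcards.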
    