Getting parts of a URL (Regex)

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

  • The Subdomain (test)
  • The Domain (example.com)
  • The path without the file (/dir/subdir/)
  • The file (file.html)
  • The path with the file (/dir/subdir/file.html)
  • The URL without the path (http://test.example.com)
  • (add any other that you think would be useful)
  • The regex should work correctly even if I enter the following URL:
    http://example.example.com/example/example/example.html

    Thank you.


    A single regex to parse and break up a full URL, including query parameters and anchors, e.g.:

    https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

    ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

    RegEx positions:

    url: RegExp['$&'],

    protocol:RegExp.$2,

    host:RegExp.$3,

    path:RegExp.$4,

    file:RegExp.$6,

    query:RegExp.$7,

    hash:RegExp.$8

    You could then further parse the host ('.' delimited) quite easily.
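
As a rough sketch of that last step, the snippet below runs the pattern with `exec` (reading the groups from the match array rather than the legacy `RegExp.$n` properties) and then splits the host on '.'. The backslash escapes in the pattern are assumed to be what the answer intended, and `splitHost` and `parseUrl` are hypothetical helper names. The split simply treats the last two labels as the domain, so it will be wrong for suffixes such as `co.uk`.

```javascript
// Pattern from the answer above, with the backslash escapes assumed restored.
const URL_PARTS = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/;

// Hypothetical helper: treats the last two labels as the domain,
// which is an assumption that fails for suffixes such as "co.uk".
function splitHost(host) {
  const labels = host.split('.');
  return {
    subdomain: labels.slice(0, -2).join('.'),
    domain: labels.slice(-2).join('.'),
  };
}

// Hypothetical helper: returns null when the pattern does not match.
function parseUrl(url) {
  const m = URL_PARTS.exec(url);
  if (m === null) return null;
  return Object.assign(
    { url: m[0], protocol: m[2], host: m[3], path: m[4], file: m[6] },
    splitHost(m[3])
  );
}

const parts = parseUrl('http://test.example.com/dir/subdir/file.html');
console.log(parts.subdomain, parts.domain, parts.path, parts.file);
```

One thing to watch: as written, the greedy `(.*)` in group 7 also swallows any `#hash`, so group 8 stays empty for URLs that carry both a query and a fragment.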

    What I would do is use something like this:

    /*
        ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
    */
    proto $1
    host $2
    port $3
    the-rest $4
    

    Then further parse 'the rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
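
A minimal sketch of that two-step approach in JavaScript, assuming the escapes inside the character class were lost in formatting. `splitRest` and `parseSimple` are hypothetical names, and the second pass uses plain string operations rather than another regex:

```javascript
// First pass: the coarse pattern from the answer above (escapes assumed).
const SIMPLE_URL = /^(.*:)\/\/([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$/;

// Hypothetical second pass over "the rest" ($4): peel off the fragment,
// then the query, then split the path at its last slash.
function splitRest(rest) {
  const hashAt = rest.indexOf('#');
  const hash = hashAt === -1 ? '' : rest.slice(hashAt);
  const beforeHash = hashAt === -1 ? rest : rest.slice(0, hashAt);
  const queryAt = beforeHash.indexOf('?');
  const query = queryAt === -1 ? '' : beforeHash.slice(queryAt);
  const fullPath = queryAt === -1 ? beforeHash : beforeHash.slice(0, queryAt);
  const lastSlash = fullPath.lastIndexOf('/');
  return {
    fullPath,
    dir: fullPath.slice(0, lastSlash + 1),
    file: fullPath.slice(lastSlash + 1),
    query,
    hash,
  };
}

function parseSimple(url) {
  const m = SIMPLE_URL.exec(url);
  if (m === null) return null;
  // Note: port keeps its leading colon (e.g. ":123"), as captured by $3.
  return Object.assign({ proto: m[1], host: m[2], port: m[3] }, splitRest(m[4]));
}

const simple = parseSimple('http://test.example.com/dir/subdir/file.html');
console.log(simple.dir, simple.file);
```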


    I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you, without a regex:

    var a = document.createElement('a');
    a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
    
    ['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
        console.log(k+':', a[k]);
    });
    
    /*//Output:
    href: http://www.example.com:123/foo/bar.html?fox=trot#foo
    protocol: http:
    host: www.example.com:123
    hostname: www.example.com
    port: 123
    pathname: /foo/bar.html
    search: ?fox=trot
    hash: #foo
    */
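
Where no DOM is available (for example in Node.js), the standard `URL` constructor exposes the same property names. This is a sketch of an alternative to the anchor-element trick, not part of the original answer. The directory/file split at the end is an added assumption, since `URL` has no separate property for either.

```javascript
// The URL constructor is available in modern browsers and in Node.js.
const parsed = new URL('http://www.example.com:123/foo/bar.html?fox=trot#foo');

for (const key of ['href', 'protocol', 'host', 'hostname', 'port', 'pathname', 'search', 'hash']) {
  console.log(key + ':', parsed[key]);
}

// Directory and file name are not separate properties, so split pathname
// at its last slash.
const cut = parsed.pathname.lastIndexOf('/') + 1;
const dir = parsed.pathname.slice(0, cut);
const file = parsed.pathname.slice(cut);

// origin is the URL without the path (scheme, host and port).
console.log('origin:', parsed.origin, 'dir:', dir, 'file:', file);
```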
    

    I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9
    

    The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

    http://www.ics.uci.edu/pub/ietf/uri/#Related

    results in the following subexpression matches:

    $1 = http:
    $2 = http
    $3 = //www.ics.uci.edu
    $4 = www.ics.uci.edu
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Related
    $9 = Related
    

    For what it's worth, I found that I had to escape the forward slashes in JavaScript:

    ^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
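
For illustration, here is the escaped form as a JavaScript regex literal, with the groups mapped to scheme ($2), authority ($4), path ($5), query ($7) and fragment ($9). `parseUri` is a hypothetical helper name. Every part of the pattern is optional, so it matches any string and no null check is needed.

```javascript
// The specification's pattern as a regex literal, forward slashes escaped.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: components that are absent come back as undefined.
function parseUri(uri) {
  const m = URI_PATTERN.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const related = parseUri('http://www.ics.uci.edu/pub/ietf/uri/#Related');
console.log(related.scheme, related.authority, related.path, related.fragment);
```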
