Getting parts of a URL (Regex)
Given the URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html
Thank you.
A single regex to parse and break up a full URL, including query parameters and anchors, e.g.:
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RegExp positions:
url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8
You could then further parse the host ('.'-delimited) quite easily.
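For what it's worth, the RegExp.$n properties are legacy statics, so here is a minimal sketch that reads the same groups from the match array instead. As far as I can tell, the greedy (.*)? in group 7 of the pattern above also swallows a trailing #hash and leaves group 8 empty, so the sketch assumes a narrowed group 7, ([^#]*). The names parseUrl and hostLabels are made up for illustration.

```javascript
// Pattern from the answer above, with its backslash escapes written out.
// Assumption: group 7 is narrowed from (.*)? to ([^#]*) so that a trailing
// "#fragment" is left for group 8 instead of ending up in the query.
const URL_PARTS = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)([^#]*)(#[\w\-]+)?$/;

// parseUrl is a hypothetical helper name, not part of the original answer.
function parseUrl(url) {
  const m = url.match(URL_PARTS);
  if (m === null) return null; // the pattern did not match at all
  return {
    url: m[0],
    protocol: m[2],
    host: m[3],
    path: m[4],
    file: m[6],
    query: m[7],
    hash: m[8],
    // Further parsing of the host: split it on its '.' delimiters.
    hostLabels: m[3].split('.')
  };
}

const parts = parseUrl('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash');
console.log(parts.host);       // www.google.com
console.log(parts.path);       // /dir/1/2/
console.log(parts.file);       // search.html
console.log(parts.query);      // ?arg=0-a&arg1=1-b&arg3-c
console.log(parts.hash);       // #hash
console.log(parts.hostLabels); // [ 'www', 'google', 'com' ]
```

Returning null on a failed match keeps the caller from reading groups that were never captured.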
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
Then further parse 'the rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
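A rough sketch of that two-step idea in JavaScript might look like the following. The first pattern is the one above with its slashes escaped for a regex literal; the second pattern, which splits 'the rest' into path, query and fragment, is my own assumption, as is the helper name splitUrl.

```javascript
// First pass: the coarse pattern from this answer.
const COARSE = /^(.*:)\/\/([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$/;
// Second pass (an assumption, not from the answer): path, then an optional
// "?query", then an optional "#fragment".
const REST = /^([^?#]*)(\?[^#]*)?(#.*)?$/;

// splitUrl is a hypothetical helper name.
function splitUrl(url) {
  const first = COARSE.exec(url);
  if (first === null) return null;
  const [, proto, host, port, rest] = first;
  // REST is built from optional pieces, so it matches any single-line string.
  const [, path, query, hash] = REST.exec(rest);
  return {
    proto: proto,
    host: host,
    port: port ? port.slice(1) : '', // drop the leading ':'
    path: path,
    query: query || '',
    hash: hash || ''
  };
}

console.log(splitUrl('http://www.example.com:123/foo/bar.html?fox=trot#foo'));
```

Keeping the second pattern separate means the first one stays easy to read, and 'the rest' can be parsed as strictly or as loosely as the caller needs.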
I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:
var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k + ':', a[k]);
});
/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
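If there is no DOM available (Node, for example), a similar sketch can use the standard URL constructor, which exposes the same property names as the anchor element. This is an alternative I'm adding for illustration, not part of the trick above, and the variable names are arbitrary.

```javascript
// Assumption: the runtime provides the global URL class
// (modern browsers and current Node versions do).
const u = new URL('http://www.example.com:123/foo/bar.html?fox=trot#foo');

['href', 'protocol', 'host', 'hostname', 'port', 'pathname', 'search', 'hash'].forEach(function (k) {
    console.log(k + ':', u[k]);
});

// The query string is also available pre-parsed.
console.log(u.searchParams.get('fox')); // trot
```

Unlike the anchor element, the constructor throws on a string it cannot parse as an absolute URL, so wrap it in try/catch when the input is untrusted.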
I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
For what it's worth, I found that I had to escape the forward slashes in JavaScript:
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
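Putting that together, a minimal sketch of using the specification's expression from JavaScript could look like this. Groups 2, 4, 5, 7 and 9 are the scheme, authority, path, query and fragment. The helper name parseUri and the extra dir/file split of the path are my own additions, aimed at the parts the question asks about.

```javascript
// The expression from the URI specification, with forward slashes escaped
// so that it can be written as a JavaScript regex literal.
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// parseUri is a hypothetical helper name.
function parseUri(uri) {
  // Every piece of the pattern is optional, so exec() always returns a match.
  const m = URI_RE.exec(uri);
  const path = m[5];
  // Assumption: the "file" is whatever follows the last '/' in the path.
  const cut = path.lastIndexOf('/') + 1;
  return {
    scheme: m[2],
    authority: m[4],
    path: path,
    query: m[7],
    fragment: m[9],
    dir: path.slice(0, cut),
    file: path.slice(cut)
  };
}

const spec = parseUri('http://www.ics.uci.edu/pub/ietf/uri/#Related');
console.log(spec.authority); // www.ics.uci.edu
console.log(spec.path);      // /pub/ietf/uri/
console.log(spec.query);     // undefined
console.log(spec.fragment);  // Related

const asked = parseUri('http://test.example.com/dir/subdir/file.html');
console.log(asked.dir);  // /dir/subdir/
console.log(asked.file); // file.html
```

Unmatched optional groups come back as undefined in JavaScript, which matches the <undefined> entries in the listing above.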