Grabbing the href attribute of an A element

Trying to find the links on a page.

My regex is:

/<a\s[^>]*href=("'??)([^"' >]*?)[^>]*>(.*)<\/a>/

but it seems to fail on

<a title="this" href="that">what?</a>

How would I change my regex to deal with href not placed first in the a tag?


Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHTML($node), PHP_EOL;
}

The above would find and output the "outerHTML" of all A elements in the $html string.

To get the text content of the node, you do

echo $node->nodeValue; 

To check if the href attribute exists you can do

echo $node->hasAttribute( 'href' );

To get the href attribute you'd do

echo $node->getAttribute( 'href' );

To change the href attribute you'd do

$node->setAttribute('href', 'something else');

To remove the href attribute you'd do

$node->removeAttribute('href'); 
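
Putting these calls together, a minimal sketch might look like the following. The sample markup and the /base/ prefix are made up for illustration. Note that getAttribute() returns an empty string for a missing attribute, which is why the hasAttribute() check comes first.

```php
<?php
// Sample markup (assumed): one real link and one anchor without href.
$html = '<p><a title="this" href="that">what?</a> <a name="top">anchor only</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $node) {
    // Skip A elements that have no href at all.
    if (!$node->hasAttribute('href')) {
        continue;
    }
    $old = $node->getAttribute('href');
    // Rewrite the link; the "/base/" prefix is purely illustrative.
    $node->setAttribute('href', '/base/' . $old);
    echo $node->nodeValue, ' => ', $node->getAttribute('href'), PHP_EOL;
}
```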

You can also query for the href attribute directly with XPath

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    echo $href->nodeValue;                       // echo current attribute value
    $href->nodeValue = 'new value';              // set new attribute value
    $href->parentNode->removeAttribute('href');  // remove attribute
}
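
On real pages loadHTML() tends to emit warnings for invalid markup. The following sketch wraps the XPath query in a function and silences those warnings with libxml_use_internal_errors(). The function name extractHrefs and the sample HTML are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: returns the href value of every A element in $html.
function extractHrefs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a/@href') as $href) {
        $hrefs[] = $href->nodeValue;
    }
    return $hrefs;
}

$html = '<p><a title="this" href="that">what?</a> <a href="/other">more</a> <a name="x">no link</a></p>';
print_r(extractHrefs($html));
```

A elements without an href attribute simply do not show up in the result, since the XPath expression selects the attribute nodes rather than the elements.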

Also see:

  • Best methods to parse HTML
  • DOMDocument in php

On a sidenote: I am sure this is a duplicate and you can find the answer somewhere in here.


    I agree with Gordon: you MUST use an HTML parser to parse HTML. But if you really want a regex, you can try this one:

    /^<a.*?href=(["'])(.*?)\1.*$/
    

    This matches <a at the beginning of the string, followed by any number of any character (non-greedy, .*?), then href= followed by the link surrounded by either " or '.

    $str = '<a title="this" href="that">what?</a>';
    preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
    var_dump($m);
    

    Output:

    array(3) {
      [0]=>
      string(37) "<a title="this" href="that">what?</a>"
      [1]=>
      string(1) """
      [2]=>
      string(4) "that"
    }
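
Because of the ^ and $ anchors, that pattern only works when the string is a single tag. To pull every link out of a larger chunk of markup, an unanchored variant with preg_match_all might look like this. It is a rough sketch with made-up sample HTML, not the pattern from the answer, and it shares the usual regex weaknesses: it would also match attributes such as data-href, and links inside comments.

```php
<?php
// Sample markup (assumed): two links using different quote styles.
$html = '<p><a title="this" href="that">what?</a> and <a href=\'/other\'>more</a></p>';

// \s after <a avoids matching tags such as <abbr>;
// [^>]*? keeps the search for href= inside the same tag;
// \1 requires the closing quote to be the same as the opening one.
preg_match_all('/<a\s[^>]*?href=(["\'])(.*?)\1/i', $html, $matches);

// $matches[2] holds the captured href values.
print_r($matches[2]);
```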
    

    The pattern you want to look for would be the link anchor pattern, like (something):

    $regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
    