Is using PHP's explode() for HTML scraping considered a bad practice?

2018-06-05 04:25:06

I have been coding for a while now but just can't seem to get my head around regular expressions.

This brings me to my question: is it bad practice to use PHP's explode() to break up a string of HTML and pick out bits of text? I need to scrape a page for various pieces of information, and because my regex knowledge is so poor (in a full software engineering degree I had to write maybe one), I decided to use explode().

I have provided my code below so someone more seasoned than me can tell me if it's essential that I use regex for this or not!

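public function split_between($start, $end, $blob)
{
    // Return the text between the first $start and the next $end ('' if $start is missing).
    $strip = explode($start, $blob, 2);
    if (!isset($strip[1])) {
        return '';
    }
    $strip2 = explode($end, $strip[1], 2);
    return $strip2[0];
}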

public function get_abstract($pubmed_id)
{
    $scrapehtml = file_get_contents("http://www.ncbi.nlm.nih.gov/m/pubmed/".$pubmed_id);
    $data['title'] = $this->split_between('<h2>','</h2>',$scrapehtml);
    $data['authors'] = $this->split_between('<div class="auth">','</div>',$scrapehtml);
    $data['journal'] = $this->split_between('<p class="j">','</p>',$scrapehtml);
    $data['aff'] = $this->split_between('<p class="aff">','</p>',$scrapehtml);
    $data['abstract'] = str_replace('<p class="no_t_m">','',str_replace('</p>','',$this->split_between('<h3 class="no_b_m">Abstract','</div>',$scrapehtml)));
    $ids = array();
    $strip = explode('<div class="ids">', $scrapehtml);
    if (isset($strip[1])) {
        $strip2 = explode('</div>', $strip[1]);
        $ids[] = $strip2[0];
    }
    // Check that a second ids block exists before searching it for a PMCID.
    if (isset($strip[2]) && strpos($strip[2], "PMCID") !== false)
    {
        $step = explode('</div>', $strip[2]);
        $ids[] = $step[0];
    }
    foreach ($ids as $value) {
        // Strip the heading and span tags that wrap each identifier.
        $data['ids'][] = array('id' => str_replace(array('<h3>', '</h3>', '<span>', '</span>'), '', $value));
    }

    $jsonAbstract = json_encode($data);

    echo $this->indent($jsonAbstract);
}

I highly recommend you try out the PHP Simple HTML DOM Parser library. It handles invalid HTML and has been designed to solve the same problem you're working on.

A simple example from the documentation is as follows:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) 
       echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href . '<br>';
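
If you would rather not add a third-party library, roughly the same thing can be sketched with PHP's built-in DOM extension (DOMDocument and DOMXPath). This is only a minimal sketch: the sample markup is made up to resemble the tags and classes in the question, and first_text() is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical sample markup, modelled on the selectors used in the question.
$sampleHtml = '<html><body>'
    . '<h2>Example title</h2>'
    . '<div class="auth">A. Author, B. Author</div>'
    . '<p class="j">Example Journal. 2018</p>'
    . '</body></html>';

/**
 * Return the trimmed text of the first node matching $query, or null if none match.
 */
function first_text(DOMXPath $xpath, $query)
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
// Real-world pages are rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$data = array(
    'title'   => first_text($xpath, '//h2'),
    'authors' => first_text($xpath, '//div[@class="auth"]'),
    'journal' => first_text($xpath, '//p[@class="j"]'),
    'aff'     => first_text($xpath, '//p[@class="aff"]'), // absent in the sample, so null
);

echo json_encode($data), "\n";
```

A missing element comes back as null instead of triggering an undefined-index notice, which is the main practical gain over chained explode() calls.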

It's not essential to use regular expressions for anything, although it'll be useful to get comfortable with them and know when to use them.

It looks like you're scraping PubMed, which I'm guessing has fairly static mark-up. If what you have works and performs as you hope, I can't see any reason to switch to regular expressions; they're not necessarily going to be any quicker in this example.

Learn regular expressions, and consider a language with good libraries for this kind of task, such as Perl or Python. It will save you a lot of time. Regular expressions might seem daunting at first, but they are easy enough for most tasks. Try reading this: http://perldoc.perl.org/perlre.html
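
To give an idea of how small the regex version can be, here is a rough PHP sketch using preg_match(). The helper name match_tag() and the sample string are made up for illustration, and the pattern assumes the tag is not nested inside another tag of the same name.

```php
<?php
// Hypothetical sample markup; a real page would come from file_get_contents().
$sampleHtml = '<h2>Example title</h2><p class="j">Example Journal. 2018</p>';

/**
 * Return the text between the first <$tag ...> and </$tag>, or null if absent.
 * The helper name and signature are illustrative only.
 */
function match_tag($tag, $html)
{
    $name = preg_quote($tag, '~');
    // [^>]* skips any attributes; (.*?) is non-greedy, so the match stops at
    // the first closing tag; the s flag lets . span newlines, i ignores case.
    $pattern = '~<' . $name . '\b[^>]*>(.*?)</' . $name . '>~si';
    if (preg_match($pattern, $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

echo match_tag('h2', $sampleHtml), "\n"; // the title
echo match_tag('p', $sampleHtml), "\n";  // the journal line
var_dump(match_tag('h3', $sampleHtml));  // no such tag in the sample
```

This is about as far as regex should be pushed on HTML; once nesting or optional attributes matter, a DOM parser is the safer tool.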
