2009-09-12

Scraping Links From HTML

Have you ever tried to scrape content from an HTML document using regular expressions? That is a bad idea (read here why!).

With FluentDOM it is easy:

Get all links

Just create a FluentDOM object from the HTML string, find all links using XPath and map the nodes to an array.

<?php
require('FluentDOM/FluentDOM.php');
// Fetch the page and load it as HTML.
$html = file_get_contents('http://www.papaya-cms.com/');
// Find every <a> element that has an href attribute
// and map the matched nodes to a plain array of URLs.
$links = FluentDOM($html, 'html')->find('//a[@href]')->map(
  function ($node) {
    return $node->getAttribute('href');
  }
);
var_dump($links);
?>
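The result is a plain PHP array, so the usual array functions apply. A minimal sketch, for example to remove duplicates (the sample data here is hypothetical; in practice it is the array returned by map()):

```php
<?php
// Hypothetical stand-in for the array returned by map() above.
$links = array('/download', 'http://example.com/', '/download', '#top');

// Drop duplicate URLs and reindex the array.
$unique = array_values(array_unique($links));
var_dump($unique);
```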

Extend local URLs

Need to edit the links? Pretty much the same:

<?php
require('FluentDOM/FluentDOM.php');
$url = 'http://www.papaya-cms.com/';
$html = file_get_contents($url);
// Prefix every relative href with the base URL.
$fd = FluentDOM($html, 'html')->find('//a[@href]')->each(
  function ($node) use ($url) {
    $item = FluentDOM($node);
    // Leave URLs that already start with a scheme (http://, ftp://, ...) alone.
    if (!preg_match('(^[a-zA-Z]+://)', $item->attr('href'))) {
      $item->attr('href', $url.$item->attr('href'));
    }
  }
);
// Output the modified document as XML.
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
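A note on the pattern: '(^[a-zA-Z]+://)' (parentheses are used as the PCRE delimiters here) only recognizes scheme://-style URLs as absolute, so anchors and mailto: links would get prefixed as well. A small sketch of its behaviour:

```php
<?php
$pattern = '(^[a-zA-Z]+://)';
var_dump((bool)preg_match($pattern, 'http://example.com/'));    // true: left alone
var_dump((bool)preg_match($pattern, '/relative/path'));         // false: gets prefixed
var_dump((bool)preg_match($pattern, 'mailto:foo@example.com')); // false: gets prefixed, too
```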
