2009-09-12

Scraping Links From HTML

Did you ever try to scrape content from an HTML document using regular expressions? This is a bad idea (read here why!).
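
To see what goes wrong, here is a quick, made up example: a naive pattern that only knows about double quoted href attributes silently misses everything else (single quotes, unquoted values, extra attributes).

<?php
// hypothetical snippet, not from the original post
$html = '<a href="/double.html">one</a> <a href=\'/single.html\'>two</a>';
// only matches href attributes in double quotes
preg_match_all('(<a href="([^"]*)">)', $html, $matches);
var_dump($matches[1]); // finds "/double.html" but misses "/single.html"
?>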

With FluentDOM it is easy:

Get all links

Just create a FluentDOM from the HTML string, find all links using XPath and map the nodes to an array.

<?php
require('FluentDOM/FluentDOM.php');
$html = file_get_contents('http://www.papaya-cms.com/');
$links = FluentDOM($html, 'html')->find('//a[@href]')->map(
  function ($node) {
    return $node->getAttribute('href');
  }
);
var_dump($links);
?>

Extend local URLs

Need to edit the links? Pretty much the same:

<?php
require('FluentDOM/FluentDOM.php');
$url = 'http://www.papaya-cms.com/';
$html = file_get_contents($url);
$fd = FluentDOM($html, 'html')->find('//a[@href]')->each(
  function ($node) use ($url) {
    $item = FluentDOM($node);
    if (!preg_match('(^[a-zA-Z]+://)', $item->attr('href'))) {
      $item->attr('href', $url.$item->attr('href'));
    }
  }
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>

2009-09-10

Speaking at the PHPNW09

I will be speaking at PHPNW09 in Manchester.

Optimizing Your Frontend Performance

The session takes a look at web application performance from the user's side. It starts in the browser, showing tools to measure and analyze performance, and then moves on to the server, explaining headers and possible solutions.

Looking forward to answering your questions and hearing about your experiences and solutions.
