Have you ever tried to scrape content from an HTML document using regular expressions? This is a bad idea (Read here why!).
With FluentDOM it is easy:
Get all links
Just create a FluentDOM object from the HTML string, find all links using XPath, and map the nodes to an array.
<?php
require('FluentDOM/FluentDOM.php');
// Fetch the page source as a string.
$html = file_get_contents('http://www.papaya-cms.com/');
// Select every <a> element that has an href attribute and collect the values.
$links = FluentDOM($html, 'html')->find('//a[@href]')->map(
  function ($node) {
    return $node->getAttribute('href');
  }
);
var_dump($links);
?>
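FluentDOM builds on PHP's DOM extension, so the same query can be sketched without the library. The following is only an illustration of the idea, not FluentDOM's implementation: it uses DOMDocument and DOMXPath directly on an inline sample string. The helper name extractLinks() and the sample markup are made up for this example.

```php
<?php
/**
 * Hypothetical helper (not part of FluentDOM): collect the href values
 * of all links in an HTML string using the plain DOM extension.
 */
function extractLinks(string $html): array {
  $document = new DOMDocument();
  // Real-world HTML is rarely valid, so buffer parser warnings instead of printing them.
  $previous = libxml_use_internal_errors(true);
  $document->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($document);
  $links = array();
  // Same XPath expression as in the FluentDOM example: anchors with an href attribute.
  foreach ($xpath->query('//a[@href]') as $node) {
    $links[] = $node->getAttribute('href');
  }
  return $links;
}

// Sample markup, invented for this example.
$html = '<html><body><p><a href="/news">News</a> <a name="top">Top</a> '
  . '<a href="http://example.com/">External</a></p></body></html>';
var_dump(extractLinks($html));
```

The anchor without an href attribute is skipped because the XPath predicate [@href] filters it out, which is something a naive regular expression tends to get wrong.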
Extend local URLs
Need to edit the links? Pretty much the same:
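<?php
require('FluentDOM/FluentDOM.php');
$url = 'http://www.papaya-cms.com/';
$html = file_get_contents($url);
// Visit every link and prefix local hrefs with the site URL.
$fd = FluentDOM($html, 'html')->find('//a[@href]')->each(
  function ($node) use ($url) {
    $item = FluentDOM($node);
    // Only hrefs without a scheme prefix such as http:// are treated as local.
    if (!preg_match('(^[a-zA-Z]+://)', $item->attr('href'))) {
      // Join with exactly one slash, even if the href starts with "/".
      $item->attr('href', rtrim($url, '/').'/'.ltrim($item->attr('href'), '/'));
    }
  }
);
// Switch the output format to XML before printing.
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>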
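The check in the example only recognises URLs that start with a scheme followed by ://, so links such as mailto: addresses, protocol-relative URLs (//host/path) or fragment links (#top) would be prefixed as well. A rough sketch of a more careful check could look like this. The helper extendLocalUrl() is hypothetical and not part of FluentDOM, it resolves every local href against the site root rather than the current document path, and it is not a full RFC 3986 resolver.

```php
<?php
/**
 * Hypothetical helper: prefix a local href with the base URL,
 * leave everything that is not a local path untouched.
 */
function extendLocalUrl(string $baseUrl, string $href): string {
  // Skip hrefs with a scheme (http://, mailto:), protocol-relative URLs (//host)
  // and fragment links (#top).
  if (preg_match('~^(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//|#)~', $href)) {
    return $href;
  }
  // Join base and path with exactly one slash.
  return rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
}

$url = 'http://www.papaya-cms.com/';
echo extendLocalUrl($url, '/index.html'), "\n";
echo extendLocalUrl($url, 'news/today.html'), "\n";
echo extendLocalUrl($url, 'http://example.com/'), "\n";
echo extendLocalUrl($url, 'mailto:info@example.com'), "\n";
```

Inside the each() callback, such a helper could replace the inline preg_match() check: read the href with attr('href'), pass it through the helper, and write the result back with attr('href', ...).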