A Basketful Of Papayas: 2017

FluentDOM 7.0 is out, so what has changed? Well, the FluentDOM namespace got a little crowded, so I moved all the DOM child classes into FluentDOM\DOM, made the Creator a top-level class and collected the utility classes. If you're updating, you might need to change some imports. FluentDOM now requires PHP 7 and uses scalar type hints. In other words, lots of cleanup.

FluentDOM\XMLReader\SiblingIterator

Large XML files usually consist of a list element with many record elements as its children. The whole list is too large to load into memory, but the individual records are small enough.

The SiblingIterator takes an XMLReader, a tag name and a filter callback. It matches the tag name and executes the filter callback. If the tag name matches and the filter callback returns TRUE, it will expand the node into DOM. After the first match it will only consider following siblings, so it never descends into the records it skips. This improves the read performance.

Here is an example that reads an XML sitemap including video information.

$reader = new FluentDOM\XMLReader();
$reader->open($sitemapFile);
$reader->registerNamespace(
  's', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$reader->registerNamespace(
  'v', 'http://www.google.com/schemas/sitemap-video/1.1'
);

foreach (new FluentDOM\XMLReader\SiblingIterator($reader, 's:url') as $url) {
  /** @var FluentDOM\DOM\Element $url */
  var_dump(
    [
      $url('string(v:video/v:title)'),
      $url('string(s:loc)')
    ]
  );
}
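
To make the traversal more concrete, here is a rough sketch of the same idea using only PHP's built-in XMLReader and DOM extensions. The function name iterateSiblings() and the sample data are made up for this illustration; it is not how FluentDOM implements the iterator, it leaves out the filter callback, and it takes a local name plus namespace URI instead of a prefixed tag name.

```php
<?php
/**
 * Yield every matching sibling element as the root of its own small DOM document.
 * Hypothetical helper for illustration only, not part of FluentDOM.
 */
function iterateSiblings(XMLReader $reader, string $localName, string $namespaceUri): Generator
{
    $matches = function () use ($reader, $localName, $namespaceUri): bool {
        return $reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === $localName
            && $reader->namespaceURI === $namespaceUri;
    };

    // Step 1: walk down the document with read() until the first match.
    $found = false;
    while ($reader->read()) {
        if ($matches()) {
            $found = true;
            break;
        }
    }
    if (!$found) {
        return;
    }
    $depth = $reader->depth;

    // Step 2: from here on only move sideways with next(), which skips subtrees.
    do {
        if ($matches()) {
            // Copy the current record into a standalone document.
            $record = new DOMDocument();
            $record->appendChild($record->importNode($reader->expand(), true));
            yield $record->documentElement;
        }
    } while ($reader->next() && $reader->depth >= $depth);
}

// Sample data, assumed for this sketch.
$sitemapUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$xml = '<urlset xmlns="' . $sitemapUri . '">'
    . '<url><loc>http://example.com/a</loc></url>'
    . '<url><loc>http://example.com/b</loc></url>'
    . '</urlset>';

$reader = new XMLReader();
$reader->XML($xml);

$locations = [];
foreach (iterateSiblings($reader, 'url', $sitemapUri) as $url) {
    $xpath = new DOMXPath($url->ownerDocument);
    $xpath->registerNamespace('s', $sitemapUri);
    $locations[] = $xpath->evaluate('string(s:loc)', $url);
}
print_r($locations);
```

Only one record is held as DOM at any time, which is the point of the approach: memory use depends on the size of a record, not on the size of the file.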

FluentDOM\XMLWriter::collapse()

FluentDOM 7.0 adds a collapse() method to XMLWriter. It is the missing opposite of XMLReader::expand(). Using the two methods allows you to work with large XML files in a really easy way.

The collapse() method takes any DOM node or node list and will write it to the output stream. You can use the extended DOM classes, FluentDOM\Creator or FluentDOM\Query to create the record node.

$writer = new FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace(
  '', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$writer->registerNamespace(
  'video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('urlset');
$writer->writeAttribute(
  'xmlns:video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

$_ = FluentDOM::create();
$_->registerNamespace(
  '', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$_->registerNamespace(
  'video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

foreach ($videos as $video) {
  $writer->collapse(
    $_(
      'url',
      $_('loc', $video['url']),
      $_(
        'video:video',
        $_('video:title', $video['title'])
      )
    )
  );
}
$writer->endElement();
$writer->endDocument();

XMLWriter::writeAttribute() recognizes when you write a namespace definition, so it will not add that definition again to descendant nodes.
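
If you wonder what collapsing a node means in practice, the following sketch shows the basic idea with the plain PHP extensions: walk the DOM subtree and replay it as XMLWriter calls. The writeNode() function is a hypothetical helper written for this post, not the FluentDOM implementation, and it deliberately ignores namespaces, comments and processing instructions, which the real method has to deal with.

```php
<?php
/**
 * Write a DOM node and its descendants to an XMLWriter.
 * Hypothetical helper: handles elements, attributes, text and CDATA only,
 * and does not treat namespaces specially.
 */
function writeNode(XMLWriter $writer, DOMNode $node): void
{
    if ($node instanceof DOMElement) {
        $writer->startElement($node->nodeName);
        foreach ($node->attributes as $attribute) {
            $writer->writeAttribute($attribute->nodeName, $attribute->value);
        }
        foreach ($node->childNodes as $child) {
            writeNode($writer, $child);
        }
        $writer->endElement();
    } elseif ($node instanceof DOMCdataSection) {
        // DOMCdataSection extends DOMText, so it has to be checked first.
        $writer->writeCdata($node->data);
    } elseif ($node instanceof DOMText) {
        $writer->text($node->data);
    }
}

// Build a small record with DOM (sample data, assumed for this sketch).
$document = new DOMDocument();
$record = $document->createElement('url');
$record->setAttribute('id', '1');
$loc = $document->createElement('loc');
$loc->appendChild($document->createTextNode('http://example.com/?a=1&b=2'));
$record->appendChild($loc);

// Stream the list element and collapse the record into it.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startElement('urlset');
writeNode($writer, $record);
$writer->endElement();
$output = $writer->outputMemory();
echo $output, "\n";
```

The writer takes care of escaping, so the ampersand in the URL ends up as an entity in the output.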

Put Together

If you combine the expand iterator with collapse you can easily write mappers that can consume large XML files. You can basically use each record as a separate DOM document.

For example, you can use it to merge XML documents and change the namespaces:

$writer = new \FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace('p', 'urn:persons');
$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('p:persons');

// iterate the example sources
foreach ($data as $sourceFile) {
  // load the source into a reader
  $reader = new \FluentDOM\XMLReader();
  $reader->open($sourceFile);

  // iterate the person elements
  $persons = new FluentDOM\XMLReader\SiblingIterator($reader, 'person');

  foreach ($persons as $person) {
    // use the transformer to move the nodes into the namespace
    $writer->collapse(
      new \FluentDOM\Transformer\Namespaces\Replace(
        $person,
        // namespaces to replace
        ['' => 'urn:persons', 'urn:example' => 'urn:persons'],
        // prefix for target namespace
        ['urn:persons' => 'p']
      )
    );
  }
}

$writer->endElement();
$writer->endDocument();

Release: FluentDOM 6.1.0

MultiByte HTML

Thanks to some issues reported by Kyle Tse, the multibyte handling for HTML was improved. It should now work properly. The HTML loader can read the encoding/charset from meta tags, or you can specify it as a loader option. The default is UTF-8. FluentDOM\Document::saveHTML() has got some additional logic as well.

XMLReader/XMLWriter

If you need to handle huge XML files, the XMLReader and XMLWriter APIs are the way to do it. Well, you could try using SAX, but believe me, THAT is no fun. XMLReader and XMLWriter are nice APIs by themselves, so FluentDOM adds only slight changes for namespace handling.

XMLReader::read()/XMLReader::next()

Of the two traversing methods, only next() allows you to specify a local name as a condition. FluentDOM extends the signature of both methods to allow for a tag name and a namespace URI. As a result, source code that reads XML with namespaces can be simplified:

$sitemapUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$reader = new FluentDOM\XMLReader();
$reader->open($file);
if ($reader->read('url', $sitemapUri)) {
  do {
    //...
  } while ($reader->next('url', $sitemapUri));
}

XMLReader::registerNamespace()

Additionally, you can register namespaces on the XMLReader object itself. This allows it to resolve namespace prefixes in tag name arguments.

Namespace definitions will be propagated to a FluentDOM\Document instance created by FluentDOM\XMLReader::expand().

$reader = new FluentDOM\XMLReader();
$reader->open($file);
$reader->registerNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
if ($reader->read('s:url')) {
  do {
    $url = $reader->expand();
    var_dump(
      $url('string(s:loc)')
    );
  } while ($reader->next('s:url'));
}
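
Resolving a prefixed tag name comes down to a lookup in the registered namespaces. Here is a minimal sketch of that step; the resolveTagName() function is hypothetical and only illustrates the idea, and the treatment of names without a prefix (looked up under the empty prefix) is an assumption of this sketch, not a statement about FluentDOM's behaviour.

```php
<?php
/**
 * Split a tag name like "s:url" and look up the namespace URI for its prefix.
 * Hypothetical helper: returns [namespace URI, local name].
 */
function resolveTagName(string $tagName, array $namespaces): array
{
    $position = strpos($tagName, ':');
    $prefix = $position === false ? '' : substr($tagName, 0, $position);
    $localName = $position === false ? $tagName : substr($tagName, $position + 1);

    if ($prefix !== '' && !isset($namespaces[$prefix])) {
        throw new InvalidArgumentException('Unknown namespace prefix: ' . $prefix);
    }
    // Assumption: an unprefixed name uses the namespace registered for '' (if any).
    return [$namespaces[$prefix] ?? '', $localName];
}

$namespaces = ['s' => 'http://www.sitemaps.org/schemas/sitemap/0.9'];
[$namespaceUri, $localName] = resolveTagName('s:url', $namespaces);
var_dump($namespaceUri, $localName);
```

The resulting pair is what gets compared against the reader's namespaceURI and localName properties, so the prefix used in your code does not have to be the one used in the document.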

XMLWriter::registerNamespace()

The same registration is possible on a FluentDOM\XMLWriter. It keeps track of the namespaces defined in the current context and avoids adding unnecessary definitions to the output (PHP Bug).

XMLWriter has many methods that have a tag name argument and this change allows all of them to become namespace aware.

$writer = new FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace('', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('urlset');

foreach ($urls as $url) {
  $writer->startElement('url');
  $writer->writeElement('loc', $url['href']);
  // ...
  $writer->endElement();
}

$writer->endElement();
$writer->endDocument();
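
Keeping track of the namespaces in the current context can be pictured as a stack with one prefix map per open element. The NamespaceScope class below is a hypothetical sketch of that bookkeeping, written only to illustrate the idea; the class and method names are made up and FluentDOM's internals may look quite different.

```php
<?php
/**
 * Tracks which namespace definitions are in scope while elements are opened and closed.
 * Hypothetical class for illustration only.
 */
final class NamespaceScope
{
    /** @var array[] one prefix => URI map per open element */
    private $stack = [[]];

    public function enterElement(): void
    {
        // A new element inherits everything its parent has in scope.
        $this->stack[] = $this->stack[count($this->stack) - 1];
    }

    public function leaveElement(): void
    {
        if (count($this->stack) > 1) {
            array_pop($this->stack);
        }
    }

    /** Returns true if the definition has to be written, and records it as written. */
    public function requiresDeclaration(string $prefix, string $uri): bool
    {
        $index = count($this->stack) - 1;
        if (($this->stack[$index][$prefix] ?? null) === $uri) {
            return false;
        }
        $this->stack[$index][$prefix] = $uri;
        return true;
    }
}

$sitemap = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$video = 'http://www.google.com/schemas/sitemap-video/1.1';

$scope = new NamespaceScope();
$scope->enterElement();                                      // <urlset>
$onRoot = $scope->requiresDeclaration('', $sitemap);         // true, first use
$scope->enterElement();                                      // first <url>
$onChild = $scope->requiresDeclaration('', $sitemap);        // false, inherited
$firstVideo = $scope->requiresDeclaration('video', $video);  // true, new here
$scope->leaveElement();                                      // </url>
$scope->enterElement();                                      // second <url>
$secondVideo = $scope->requiresDeclaration('video', $video); // true, sibling scope is gone
var_dump($onRoot, $onChild, $firstVideo, $secondVideo);
```

The last call shows why defining a namespace once on the list element pays off when you write many records: a definition made on a record is out of scope again for the next one.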

2017-10-03

FluentDOM 7.0, The Next Step