A Basketful Of Papayas

2017-10-03

FluentDOM 7.0, The Next Step

FluentDOM 7.0 is out, so what has changed? Well, the FluentDOM namespace got a little crowded, so I moved all the DOM child classes into FluentDOM\DOM, made the Creator a top-level class and collected the utility classes. If you're updating, you might need to change some imports. FluentDOM now requires PHP 7 and uses scalar type hints. In other words, lots of cleanup.

FluentDOM\XMLReader\SiblingIterator

Large XML files usually consist of a list element with many record elements as its children. The whole list is too large to load into memory, but the individual records are small enough.

The SiblingIterator takes an XMLReader, a tag name and a filter callback. It reads until the tag name matches and then executes the filter callback. If the tag name matches and the filter callback returns TRUE, it will expand the node into DOM. After the first match it will only consider following siblings. This allows you to improve the read performance.

Here is an example that reads an XML sitemap including video information.

$reader = new FluentDOM\XMLReader();
$reader->open($sitemapFile);
$reader->registerNamespace(
  's', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$reader->registerNamespace(
  'v', 'http://www.google.com/schemas/sitemap-video/1.1'
);

foreach (new FluentDOM\XMLReader\SiblingIterator($reader, 's:url') as $url) {
  /** @var FluentDOM\DOM\Element $url */
  var_dump(
    [
      $url('string(v:video/v:title)'),
      $url('string(s:loc)')
    ]
  );
}
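
To illustrate the idea behind the iterator, here is a rough sketch that uses only PHP's built-in XMLReader and DOM extensions. It is not FluentDOM's implementation: siblingElements() is a hypothetical helper name, it matches local names only (no namespace handling), and the sample XML is made up.

```php
<?php
// Sketch only: plain ext/xmlreader and ext/dom, not FluentDOM's implementation.
// siblingElements() is a hypothetical helper name.
function siblingElements(
  XMLReader $reader, string $localName, ?callable $filter = null
): Generator {
  // read in document order until the first element with that local name
  $found = false;
  while ($reader->read()) {
    if (
      $reader->nodeType === XMLReader::ELEMENT &&
      $reader->localName === $localName
    ) {
      $found = true;
      break;
    }
  }
  $document = new DOMDocument();
  while ($found) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
      // expand only the current record into DOM
      $node = $reader->expand($document);
      if ($node instanceof DOMElement && ($filter === null || $filter($node))) {
        yield $node;
      }
    }
    // next() skips the subtree of the current node instead of descending
    $found = $reader->next($localName);
  }
}

$reader = new XMLReader();
$reader->XML(
  '<list><item id="1"/><item id="2"><item id="nested"/></item>'.
  '<other/><item id="3"/></list>'
);
$ids = [];
foreach (siblingElements($reader, 'item') as $item) {
  $ids[] = $item->getAttribute('id');
}
echo implode(',', $ids), "\n";
```

The nested item is never reported, because next() does not descend into a record that has already been handled. That skipping is where the performance gain comes from.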

FluentDOM\XMLWriter::collapse()

FluentDOM 7.0 adds a collapse() method to XMLWriter. It is the missing opposite of XMLReader::expand(). Using the two methods allows you to work with large XML files in a really easy way.

The collapse() method takes any DOM node or node list and will write it to the output stream. You can use the extended DOM classes, FluentDOM\Creator or FluentDOM\Query to create the record node.

$writer = new FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace(
  '', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$writer->registerNamespace(
  'video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('urlset');
$writer->writeAttribute(
  'xmlns:video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

$_ = FluentDOM::create();
$_->registerNamespace(
  '', 'http://www.sitemaps.org/schemas/sitemap/0.9'
);
$_->registerNamespace(
  'video', 'http://www.google.com/schemas/sitemap-video/1.1'
);

foreach ($videos as $video) {
  $writer->collapse(
    $_(
      'url',
      $_('loc', $video['url']),
      $_(
        'video:video',
        $_('video:title', $video['title'])
      )
    )
  );
}
$writer->endElement();
$writer->endDocument();

XMLWriter::writeAttribute() recognizes if you write a namespace definition, so the writer will not add it again to descendant nodes.
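
As a rough illustration of what "collapsing" a DOM node into a writer involves, here is a minimal sketch with PHP's built-in XMLWriter and DOM classes. It is not FluentDOM's implementation: writeNode() is a hypothetical helper, and it deliberately ignores namespaces, which is exactly the part FluentDOM takes care of.

```php
<?php
// Sketch only: writes a DOM node to an XMLWriter recursively.
// writeNode() is a hypothetical helper; namespaces are NOT handled here.
function writeNode(XMLWriter $writer, DOMNode $node): void {
  if ($node instanceof DOMElement) {
    $writer->startElement($node->nodeName);
    foreach ($node->attributes as $attribute) {
      $writer->writeAttribute($attribute->nodeName, $attribute->value);
    }
    foreach ($node->childNodes as $child) {
      writeNode($writer, $child);
    }
    $writer->endElement();
  } elseif ($node instanceof DOMCdataSection) {
    // check CDATA first, DOMCdataSection extends DOMText
    $writer->writeCdata($node->data);
  } elseif ($node instanceof DOMText) {
    $writer->text($node->data);
  } elseif ($node instanceof DOMComment) {
    $writer->writeComment($node->data);
  }
}

$document = new DOMDocument();
$document->loadXML('<url id="1"><loc>http://example.com/a</loc></url>');

$writer = new XMLWriter();
$writer->openMemory();
writeNode($writer, $document->documentElement);
$xml = $writer->outputMemory();
echo $xml, "\n";
```

Because each record is written as soon as it has been built, only one record has to be held in memory at a time.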

Put Together

If you combine the expand iterator with collapse you can easily write mappers that can consume large XML files. You can basically use each record as a separate DOM document.

For example, you can use it to merge XML documents and change the namespaces:

$writer = new \FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace('p', 'urn:persons');
$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('p:persons');

// iterate the example sources
foreach ($data as $sourceFile) {
  // load the source into a reader
  $reader = new \FluentDOM\XMLReader();
  $reader->open($sourceFile);

  // iterate the person elements
  $persons = new FluentDOM\XMLReader\SiblingIterator($reader, 'person');

  foreach ($persons as $person) {
    // use the transformer to move the nodes into the namespace
    $writer->collapse(
      new \FluentDOM\Transformer\Namespaces\Replace(
        $person,
        // namespaces to replace
        ['' => 'urn:persons', 'urn:example' => 'urn:persons'],
        // prefix for target namespace
        ['urn:persons' => 'p']
      )
    );
  }
}

$writer->endElement();
$writer->endDocument();

2017-07-02

FluentDOM 6.1 released - Improvements

Release: FluentDOM 6.1.0

MultiByte HTML

Thanks to some issues reported by Kyle Tse, the multibyte handling for HTML was improved. It should now work properly. The HTML loader can read the encoding/charset from meta tags, or you can specify it as a loader option. The default is UTF-8. FluentDOM\Document::saveHTML() has got some additional logic as well.

XMLReader/XMLWriter

If you need to handle huge XML files, the XMLReader and XMLWriter APIs are the way to do it. Well, you could try using SAX, but believe me, THAT is no fun. XMLReader and XMLWriter are nice APIs by themselves, so FluentDOM adds only slight changes for namespace handling.

XMLReader::read()/XMLReader::next()

Of the two traversing methods, only next() allows you to specify a local name as a condition. FluentDOM extends the signature of both methods to allow for a tag name and a namespace URI. As a result, code that reads XML with namespaces can be simplified:

$sitemapUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$reader = new FluentDOM\XMLReader();
$reader->open($file);
if ($reader->read('url', $sitemapUri)) {
  do {
    //...
  } while ($reader->next('url', $sitemapUri));
}
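
For comparison, here is a sketch of the same kind of filtering with the built-in XMLReader alone, where the local name and the namespace URI have to be checked by hand on every node. sitemapLocations() is a hypothetical helper name and the input is passed as a string to keep the example self-contained.

```php
<?php
// Sketch only: plain ext/xmlreader, namespace check done manually.
// sitemapLocations() is a hypothetical helper name.
function sitemapLocations(string $xml): array {
  $sitemapUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';
  $reader = new XMLReader();
  $reader->XML($xml);
  $document = new DOMDocument();
  $locations = [];
  while ($reader->read()) {
    $isUrl = $reader->nodeType === XMLReader::ELEMENT
      && $reader->localName === 'url'
      && $reader->namespaceURI === $sitemapUri;
    if (!$isUrl) {
      continue;
    }
    // expand the matched record and read its loc child via DOM
    $url = $reader->expand($document);
    foreach ($url->getElementsByTagNameNS($sitemapUri, 'loc') as $loc) {
      $locations[] = $loc->textContent;
    }
  }
  $reader->close();
  return $locations;
}

$locations = sitemapLocations(
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'.
  '<url><loc>http://example.com/a</loc></url>'.
  '<url xmlns="urn:other"><loc>ignored</loc></url>'.
  '<url><loc>http://example.com/b</loc></url>'.
  '</urlset>'
);
echo implode("\n", $locations), "\n";
```

The url element in the urn:other namespace is skipped, because only the namespace URI counts, not the prefix or the local name alone.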

XMLReader::registerNamespace()

Additionally, you can register namespaces on the XMLReader object itself. This allows it to resolve namespace prefixes in tag name arguments.

Namespace definitions will be propagated to a FluentDOM\Document instance created by FluentDOM\XMLReader::expand().

$reader = new FluentDOM\XMLReader();
$reader->open($file);
$reader->registerNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
if ($reader->read('s:url')) {
  do {
    $url = $reader->expand();
    var_dump(
      $url('string(s:loc)')
    );
  } while ($reader->next('s:url'));
}

XMLWriter::registerNamespace()

The same registration is possible on a FluentDOM\XMLWriter. It keeps track of the namespaces defined in the current context and avoids adding unnecessary definitions to the output (PHP Bug).

XMLWriter has many methods that have a tag name argument and this change allows all of them to become namespace aware.

$writer = new FluentDOM\XMLWriter();
$writer->openURI('php://stdout');
$writer->registerNamespace('', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$writer->setIndent(2);
$writer->startDocument();
$writer->startElement('urlset');

foreach ($urls as $url) {
  $writer->startElement('url');
  $writer->writeElement('loc', $url['href']);
  // ...
  $writer->endElement();
}

$writer->endElement();
$writer->endDocument();

2016-12-24

FluentDOM 6.0 - What's New

FluentDOM 6.0 is released.

A major version jump means that there are backwards compatibility breaks and major new features. File loading has changed to improve security. FluentDOM\Query has got a major overhaul to improve the integration of alternative formats. This affected the interfaces and usage. I tried to keep the breaks to a minimum and easily fixable.

Loading Files

To load a file, you now have to allow it explicitly using the options argument. There are two options. FluentDOM\Loader\Options::ALLOW_FILE basically restores the previous behaviour: the loader still checks whether the source is a file or a string and will load both. FluentDOM\Loader\Options::IS_FILE means that only a file will be loaded. This has to be implemented in the respective loader, so if you notice that some loader behaves differently, please drop me a note.

All loading functions allow you to provide the option.

$fd = FluentDOM($file, 'xml', [ FluentDOM\Loader\Options::ALLOW_FILE => TRUE ]);
$document = FluentDOM::load(
  $file, 'xml', [ FluentDOM\Loader\Options::IS_FILE => TRUE ]
);

Fragment Loading / Query Content Types

Several loaders now support fragment loading. It is used by methods like FluentDOM\Query::append(). It allows the Query API to keep the content type that you loaded. So if you load HTML, the fragments are expected to be HTML, if you load XML, the fragments are expected to be XML, if you load JSON ... well, you get the picture. :-)

$fd = FluentDOM('<form></form>', 'text/html')->find('//form');
$fd->append('<input type="text" name="example">');
echo $fd;

You can change the behaviour by setting the content type. It works with additional loaders, so if you install fluentdom/html5 you get transparent support for HTML5.

$fd = FluentDOM('<form></form>', 'text/html5')->find('//html:form');
$fd->append('<input type="text" name="example">');
echo $fd;

The changes mean that all existing loaders need to be updated for FluentDOM 6.0.

Serializers

Serializers need to register themselves on the FluentDOM class. This allows FluentDOM\Query objects to output the same content type they loaded. Additionally, you can use FluentDOM::getSerializerFactories()->createSerializer() to get a serializer for a node by content type. I am thinking about adding something like a FluentDOM::save() function as a shortcut for that, but I am not sure about the name and implementation yet. If you have a suggestion, please add it to the Issue.

Replace Whole Text

Character nodes (Text nodes and CDATA sections) have a property $wholeText. It returns the text content of the node together with that of its adjacent character nodes. It even resolves entity references for that. The property is read-only, but DOM Level 3 specifies a method replaceWholeText() as a write method for it. FluentDOM 6.0 now implements that method in its extended DOM classes.

$document = new FluentDOM\Document();
$document->loadXML(
  '<!DOCTYPE p ['."\n".
  '  <!ENTITY t "world">'."\n".
  ']>'."\n".
  '<p>Hello &t;<br/>, Hello &t;</p>'
);
/** @var \FluentDOM\Text $text */
$text = $document->documentElement->firstChild;
$text->replaceWholeText('Hi universe');
echo $document->saveXML();
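
To make the idea more concrete, here is a simplified sketch of such a method on top of the built-in DOM classes. It is an assumption-laden approximation, not FluentDOM's code: replaceWholeTextSketch() is a hypothetical function, it only merges adjacent Text and CDATA nodes, and it does not handle entity references.

```php
<?php
// Sketch only: approximates replaceWholeText() with plain ext/dom.
// replaceWholeTextSketch() is a hypothetical helper; entity references
// are NOT resolved or removed here.
function replaceWholeTextSketch(DOMText $text, string $content): ?DOMText {
  // remove adjacent character nodes (DOMCdataSection extends DOMText)
  while ($text->previousSibling instanceof DOMText) {
    $text->parentNode->removeChild($text->previousSibling);
  }
  while ($text->nextSibling instanceof DOMText) {
    $text->parentNode->removeChild($text->nextSibling);
  }
  if ($content === '') {
    // an empty replacement removes the node itself
    if ($text->parentNode !== null) {
      $text->parentNode->removeChild($text);
    }
    return null;
  }
  $text->data = $content;
  return $text;
}

$document = new DOMDocument();
$document->loadXML('<p>Hello <![CDATA[big ]]>world<br/>, bye</p>');
$text = $document->documentElement->firstChild;

$before = $text->wholeText;
replaceWholeTextSketch($text, 'Hi universe');
$after = $document->saveXML($document->documentElement);

echo $before, "\n", $after, "\n";
```

Only the run of character nodes before the br element is replaced. The text after it belongs to a separate run and stays untouched.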

Examples

The examples directory grew a little confusing over the years. I restructured and refactored it. Some examples got removed because the features are shown by newer examples.

What's Next?

I still have to update some of the plugin repositories (loaders, serializers) and add some more documentation to the wiki. After that I plan to take a look into DOM Level 4. If you have suggestions, please add a ticket to the issue tracker.
