2014-08-17

FluentDOM + HTML5

HTML5 is not directly supported by PHP's DOM extension, which means FluentDOM cannot read it either. But there is a solution: HTML5-PHP is a library that can parse HTML5 into a DOM document.

Both libraries use Composer:
"require": {
  "fluentdom/fluentdom": "5.*",
  "masterminds/html5": "2.*"
}

Read HTML5 into FluentDOM:
$html5 = new Masterminds\HTML5();
$fd = FluentDOM($html5->loadHTML($html));

Or write it:
echo $html5->saveHTML($fd->document);

HTML5-PHP puts the elements into the XHTML namespace. To use XPath expressions, you will need to register a prefix for it:
$html5 = new Masterminds\HTML5();
$fd = FluentDOM($html5->loadHTML($html));
$fd->registerNamespace(
  'xhtml', 'http://www.w3.org/1999/xhtml'
);
echo $fd->find('//xhtml:p')->text();
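
To see why the prefix is needed, here is a minimal sketch that uses only PHP's DOM extension. The document is built by hand with createElementNS() as a stand-in for the result of HTML5-PHP, and the variable names are illustrative:

```php
<?php
// Stand-in for a parsed HTML5 document: all elements are in the XHTML namespace.
$xhtml = 'http://www.w3.org/1999/xhtml';
$document = new DOMDocument();
$html = $document->appendChild($document->createElementNS($xhtml, 'html'));
$body = $html->appendChild($document->createElementNS($xhtml, 'body'));
$p = $body->appendChild($document->createElementNS($xhtml, 'p'));
$p->appendChild($document->createTextNode('Hello'));

$xpath = new DOMXPath($document);
$xpath->registerNamespace('xhtml', $xhtml);

// An unprefixed name test only matches elements without a namespace.
$withoutPrefix = $xpath->evaluate('count(//p)');
// The registered prefix resolves to the XHTML namespace.
$withPrefix = $xpath->evaluate('count(//xhtml:p)');
$text = $xpath->evaluate('string(//xhtml:p)');

var_dump($withoutPrefix, $withPrefix, $text);
```

The unprefixed expression finds nothing, while the prefixed one finds the paragraph.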

2014-08-09

Xpath 1.0 - Quoting Strings

Strings in Xpath 1.0 can be enclosed in single or double quotes. The following expressions are equivalent.

//div[@id = 'foo']
//div[@id = "foo"]


This is nice because you can use the variant that requires less or no escaping. I prefer single quotes for PHP strings because they need less escaping (only single quote and backslash). I usually end up with something like this:

$xpath->evaluate('//div[@id = "foo"]');

However, a problem comes up if the value is dynamic.

$xpath->evaluate('//div[@id = "'.$_GET['foo'].'"]');

If $_GET['foo'] contains a double quote, it will break the expression. This is comparable to an SQL injection and should be avoided, don't you think?
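
A small sketch of what such an injection can look like. The document and the input value are made up for illustration; $input stands in for $_GET['foo']:

```php
<?php
$document = new DOMDocument();
$document->loadXML('<root><div id="foo"/><div id="bar"/></root>');
$xpath = new DOMXPath($document);

// Hypothetical user input, standing in for $_GET['foo'].
$input = 'foo" or "1"="1';
$expression = '//div[@id = "'.$input.'"]';

// The double quote in the input closes the literal early, the rest
// of the input is parsed as part of the expression.
echo $expression, "\n";

$harmless = $xpath->evaluate('//div[@id = "foo"]')->length;
$injected = $xpath->evaluate($expression)->length;
var_dump($harmless, $injected);
```

The intended expression matches one element, the injected one matches every div in the document.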

The Xpath 1.0 specification for a literal is:

Literal   ::=   '"' [^"]* '"'
                | "'" [^']* "'"

It disallows the use of the enclosing quote in the literal itself, and there is no way to escape it.

Hint: This is different in Xpath 2.0. You can duplicate the quotes to escape them.

Deciding Which Quote To Use

The first and obvious step is to check the value for quotes and use the kind that it does not contain:

function quote($value) {
  $char = strpos($value, '"') === FALSE ? '"' : "'";
  return $char.$value.$char;
}


But a value could contain both kinds of quotes. This would still break the expression.

Divide And Conquer

If it is not possible to quote the whole value because it contains both kinds of quotes, you need to divide it into parts that can be quoted. You can then use the Xpath function concat() to rebuild the original value:

//div[. = concat("Single Quote: '", 'Double Quote: "')]

Matching text structures is the domain of regular expressions, so let's use them:

preg_match_all('("[^\']*|[^"]+)', 'Double Quote ", Single Quote \'', $matches);
var_dump($matches);


Output:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(13) "Double Quote "
    [1]=>
    string(16) "", Single Quote "
    [2]=>
    string(1) "'"
  }
}

The pattern matches either a string that starts with a double quote and contains no single quote, or a string that does not contain any double quote.

All that is left is quoting the parts and joining them back together:

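$result = '';
foreach ($matches[0] as $part) {
  // A part that starts with a double quote contains no single quote.
  $quoteChar = (substr($part, 0, 1) == '"') ? "'" : '"';
  $result .= ", ".$quoteChar.$part.$quoteChar;
}
// Strip the leading ", " before wrapping the arguments in concat().
return 'concat('.substr($result, 2).')';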

Put Together

It does not make sense to create the function call for a single argument. So the check is still needed:

  1. If the value contains no single quote, use single quotes
  2. If the value contains no double quote, use double quotes
  3. Otherwise divide the string and use concat()

A complete implementation can be found in FluentDOM\Xpath::quote().
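
The three steps can be sketched as a single function. This is a minimal version assembled from the snippets in this post, not a copy of the FluentDOM method; the function name quoteXpathLiteral() is made up for this example:

```php
<?php
/**
 * Quote a string as an Xpath 1.0 literal (hypothetical helper name).
 */
function quoteXpathLiteral(string $value): string {
  // 1. No single quote inside: single quotes are safe.
  if (strpos($value, "'") === false) {
    return "'".$value."'";
  }
  // 2. No double quote inside: double quotes are safe.
  if (strpos($value, '"') === false) {
    return '"'.$value.'"';
  }
  // 3. Both kinds: split into quotable parts and rebuild with concat().
  preg_match_all('("[^\']*|[^"]+)', $value, $matches);
  $arguments = [];
  foreach ($matches[0] as $part) {
    $quoteChar = (substr($part, 0, 1) === '"') ? "'" : '"';
    $arguments[] = $quoteChar.$part.$quoteChar;
  }
  return 'concat('.implode(', ', $arguments).')';
}

$value = 'Double Quote ", Single Quote \'';
$literal = quoteXpathLiteral($value);
echo $literal, "\n";

// Round trip: let Xpath evaluate the literal back into a string.
$document = new DOMDocument();
$document->loadXML('<root/>');
$xpath = new DOMXPath($document);
$roundTrip = $xpath->evaluate('string('.$literal.')');
var_dump($roundTrip === $value);
```

A value that contains both kinds of quotes always splits into at least two parts, so concat() never ends up with a single argument.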

2014-08-06

FluentDOM 5 + XML Namespaces

FluentDOM 5 allows you to register namespaces on the DOM document class. These are not the namespaces of a loaded document, but a definition of namespaces for your programming logic.

Namespaces In An XML File

People often mistake the namespace prefixes for the namespaces. The namespace definitions are the xmlns:* attributes. Let's take a really simple example:

<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>

This is a feed element in the Atom namespace. The actual namespace is "http://www.w3.org/2005/Atom". Because this would be difficult to read and write, the prefix/alias 'atom' is defined for the namespace.

You can read the element name as '{http://www.w3.org/2005/Atom}:feed'

It is possible to define a default namespace for elements in an XML document.

<feed xmlns="http://www.w3.org/2005/Atom"/> 

This should still be interpreted as '{http://www.w3.org/2005/Atom}:feed'. The namespace prefix in the source document is not relevant. You need a way to match the namespace itself and not the alias.
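
A short check with PHP's standard DOM classes shows that both serializations describe the same element. This is only a sketch, and the variable names are illustrative:

```php
<?php
$prefixed = new DOMDocument();
$prefixed->loadXML('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>');
$default = new DOMDocument();
$default->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"/>');

foreach ([$prefixed, $default] as $document) {
  $node = $document->documentElement;
  // nodeName includes the prefix from the source document,
  // namespaceURI and localName do not depend on it.
  printf(
    "%s => {%s}:%s\n", $node->nodeName, $node->namespaceURI, $node->localName
  );
}
```

Only nodeName differs between the two documents.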

Hint 1: Namespace prefixes can be redefined on any element node in the document.

Hint 2: Attributes always need a prefix to use a namespace. Any attribute without a prefix is in the "none/empty" namespace.
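
Hint 2 can be verified with the standard DOM. The following sketch uses a made-up link element with one plain and one prefixed attribute; the XLink namespace is only there to have a second namespace at hand:

```php
<?php
$document = new DOMDocument();
$document->loadXML(
  '<feed xmlns="http://www.w3.org/2005/Atom"'.
  ' xmlns:xlink="http://www.w3.org/1999/xlink">'.
  '<link href="http://example.com/" xlink:type="simple"/>'.
  '</feed>'
);
$link = $document->documentElement->firstChild;

// The element inherits the default namespace.
$elementNamespace = $link->namespaceURI;
// The attribute without a prefix does not.
$plainAttribute = $link->getAttributeNode('href')->namespaceURI;
// The prefixed attribute is in the namespace of its prefix.
$prefixedAttribute = $link
  ->getAttributeNodeNS('http://www.w3.org/1999/xlink', 'type')
  ->namespaceURI;

var_dump($elementNamespace, $plainAttribute, $prefixedAttribute);
```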

Namespaces in XPath

An XPath expression that could match the Atom feed element would need to use a prefix and needs a way to resolve that prefix into the actual namespace.

A PHP Example:

$xpath = new DOMXpath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$nodes = $xpath->evaluate('/a:feed');


In PHP the DOMXpath class provides the evaluate() method and the namespace resolver. You register your own namespace prefixes. '/a:feed' gets resolved to '/{http://www.w3.org/2005/Atom}:feed'. The prefixes in the document and the expression can be different, but the resolved namespace has to be the same.
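
A runnable sketch of that last point. The document uses the prefix 'atom', the expression uses 'a'; the title element and the second namespace 'urn:example:other' are made up for the example:

```php
<?php
$document = new DOMDocument();
$document->loadXML(
  '<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'.
  '<atom:title>Example</atom:title>'.
  '</atom:feed>'
);
$xpath = new DOMXPath($document);

// Different prefix, same namespace: the expression matches.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$title = $xpath->evaluate('string(/a:feed/a:title)');

// Same local name, different namespace: the expression does not match.
$xpath->registerNamespace('other', 'urn:example:other');
$count = $xpath->evaluate('count(/other:feed)');

var_dump($title, $count);
```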

A JavaScript example:

var xmlns = {
  namespaces : {
    'a' : 'http://www.w3.org/2005/Atom'
  },
  lookupNamespaceURI : function(prefix) {
    return this.namespaces[prefix] || null;
  }
};
var nodes = xmlDocument.evaluate('/a:feed', xmlDocument, xmlns, XPathResult.ANY_TYPE, null); 


In JavaScript the document object provides the evaluate() method and the namespace resolver is an argument for it. This will work in most of the current browsers, but not IE. 

Namespaces in FluentDOM

In FluentDOM 5 the document and the element classes both have an evaluate() method. To define your namespace mapping you register the namespace on the document object:

$document->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$nodes = $document->evaluate('/a:feed');


Now because the document has a way to resolve namespaces, it can do this for other methods, too. Basically, this applies to all methods that have a namespace-aware variant, like 'createElement()' and 'createElementNS()' or 'setAttribute()' and 'setAttributeNS()'. If FluentDOM can resolve the prefix of a tag name, it will call the namespace-aware variant of the method.

// Standard DOM
$feed = $document->createElementNS(
  'http://www.w3.org/2005/Atom', 'atom:feed'
);
// FluentDOM
$document->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
$feed = $document->createElement('atom:feed');

Default Namespace 

FluentDOM adds the concept of a default element namespace, too.

// Standard DOM
$feed = $document->createElementNS(
  'http://www.w3.org/2005/Atom', 'feed'
);
// FluentDOM
$document->registerNamespace('#default', 'http://www.w3.org/2005/Atom');
$feed = $document->createElement('feed');
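
The idea behind this resolution can be sketched with the standard DOM classes. The helper below is hypothetical and not FluentDOM's code; it assumes a plain prefix => namespace array in place of the registered namespaces:

```php
<?php
/**
 * Hypothetical helper: resolve the prefix of a tag name against a
 * prefix => namespace map and call the matching DOM method.
 */
function createElementResolved(
  DOMDocument $document, array $namespaces, string $name
): DOMElement {
  $position = strpos($name, ':');
  // Names without a prefix use the '#default' entry, if there is one.
  $prefix = ($position === false) ? '#default' : substr($name, 0, $position);
  if (isset($namespaces[$prefix])) {
    return $document->createElementNS($namespaces[$prefix], $name);
  }
  // Unknown prefix or no default: fall back to the plain variant.
  return $document->createElement($name);
}

$namespaces = [
  'atom' => 'http://www.w3.org/2005/Atom',
  '#default' => 'http://www.w3.org/2005/Atom'
];
$document = new DOMDocument();
$prefixed = createElementResolved($document, $namespaces, 'atom:feed');
$unprefixed = createElementResolved($document, $namespaces, 'feed');

echo $document->saveXML($prefixed), "\n";
echo $document->saveXML($unprefixed), "\n";
```

Both elements end up in the Atom namespace, only the serialized prefix differs.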

appendElement()


FluentDOM\Document and FluentDOM\Element implement an appendElement() method. It is a shortcut for several methods. It creates the element, adds attributes and a text node and appends it to the parent node. Together with the namespace handling, creating an XML document becomes a lot simpler. You can find an example in the wiki.
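
To illustrate which steps the shortcut combines, here is a sketch written against the standard DOM classes. The function appendElementTo() and its signature are made up for this example and are not FluentDOM's implementation; namespace handling is left out:

```php
<?php
/**
 * Hypothetical stand-in: create an element, add attributes and a
 * text node, then append the element to the parent node.
 */
function appendElementTo(
  DOMNode $parent, string $name, string $content = '', array $attributes = []
): DOMElement {
  $document = ($parent instanceof DOMDocument)
    ? $parent : $parent->ownerDocument;
  $element = $document->createElement($name);
  foreach ($attributes as $attributeName => $value) {
    $element->setAttribute($attributeName, $value);
  }
  if ($content !== '') {
    $element->appendChild($document->createTextNode($content));
  }
  $parent->appendChild($element);
  return $element;
}

$document = new DOMDocument();
$list = appendElementTo($document, 'ul', '', ['class' => 'links']);
appendElementTo($list, 'li', 'First item');
$xml = $document->saveXML($document->documentElement);
echo $xml, "\n";
```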


2014-08-01

FluentDOM 5

FluentDOM 5.0.0 is now released.

Up to version 4.1, FluentDOM was an implementation of the jQuery Traversing and Manipulation APIs in PHP. Version 5 is a complete rewrite and adds a secondary focus. FluentDOM now provides extended variants of PHP's DOM classes, too. This allows workarounds for bugs, syntax sugar and shortcuts.

Bugs

#39521

The second argument of DOMDocument::createElement() breaks if it contains an entity. FluentDOM avoids that by creating and appending a text node. Additionally, it adds a third argument to provide attributes.
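
With the standard DOM classes, the same workaround looks roughly like this. It is a sketch of the text node approach, not FluentDOM's code, and the value is made up:

```php
<?php
$document = new DOMDocument();
$value = 'Tom & Jerry';

// Do not pass the value to createElement(). Append it as a text
// node instead, so the ampersand is escaped on serialization.
$element = $document->createElement('title');
$element->appendChild($document->createTextNode($value));
$document->appendChild($element);

$xml = $document->saveXML($element);
echo $xml, "\n";
var_dump($element->textContent === $value);
```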

#55700

By default, PHP registers the namespace definitions of the current context. This uses the namespace prefixes as identifiers, but they are not identifiers. They are allowed to change (even on different elements in the same document). You should always register your own prefixes so you do not depend on the prefixes in the document - they are just not relevant. So in the best case, the automatic registration costs performance without a real gain. In the worst case, it overrides your namespace registration and you cannot fetch the data you want.

FluentDOM adds a property to change this behavior. The automatic namespace registration is disabled by default and can be activated using the property or the third argument for evaluate()/query().
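
In plain PHP, the third argument of DOMXPath::evaluate() controls the automatic registration for a single call. The following sketch uses that argument, not the FluentDOM property; the document and the namespace 'urn:example:other' are made up so that the prefix 'a' is bound differently in the document and in the program:

```php
<?php
$atom = 'http://www.w3.org/2005/Atom';
// The document binds the prefix "a" to a different (hypothetical) namespace.
$document = new DOMDocument();
$document->loadXML(
  '<a:root xmlns:a="urn:example:other"><feed xmlns="'.$atom.'"/></a:root>'
);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', $atom);
$context = $document->documentElement;

// Third argument false: only your own registrations are used.
$own = $xpath->evaluate('count(//a:feed)', $context, false);
// Third argument true (the PHP default): the prefixes in scope at the
// context node are registered as well and can shadow your own.
$automatic = $xpath->evaluate('count(//a:feed)', $context, true);

var_dump($own, $automatic);
```

With the automatic registration disabled, the expression finds the feed element regardless of how the document uses the prefix.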

Syntax Sugar

Cast To String


Most of the nodes can be cast to string. For example, the following will return the whole text content of a document.

echo $dom->documentElement;

Iterator For Child Nodes


FluentDOM\Element is iterable. Using foreach() on it will iterate over the child nodes. The Iterator is a RecursiveIterator, too.

ArrayAccess

FluentDOM\Element allows array syntax. A numeric key like $element[1] will access a child node. A qualified name string like $element['href'] will access an attribute.

Namespaces

FluentDOM\Document allows you to register namespaces on the document. If methods like setAttribute() or createElement() recognize a colon in the tag name, they will resolve the namespace prefix and call their namespace-aware variant.

Shortcuts

FluentDOM\Document::createElement() allows you to provide content and attributes. FluentDOM\Element::appendElement() allows you to create and append an element with a single call.

Other methods of the FluentDOM\Element class are variants of the document methods that use the element node as the default context.

Backwards Compatibility

FluentDOM 5 is mostly backwards compatible. The FluentDOM() function still exists and the returned FluentDOM\Query instance has the same jQuery-like API. Only the loaders were changed.

CSS Selectors

The FluentDOM\Query class allows you to set a callback function that is used to convert the provided selectors to XPath expressions. FluentDOM::QueryCss() returns a FluentDOM\Query instance that supports CSS selectors. You will need to have Carica PhpCss or Symfony CSS-Selector installed in your project.


