2010-05-09

Highlight Words In HTML

I got an interesting question on IRC recently. How can you highlight some words/word parts in an HTML document?

The Challenge

  • Wrap given words in text content with a span.
  • Add a class to the span depending on the word.
  • Do not touch elements, attributes, comments or processing instructions.
  • Do it case-insensitively.
  • Do it the safe way.

Select The Text Content

This is the easy part: create a FluentDOM object, find the part of the document to edit, and select all text nodes in it.

$fd = FluentDOM($html, 'html')
  ->find('/html/body')
  ->find('descendant-or-self::text()');

I used two XPath expressions because these are two separate steps. This way I can separate them later. In a single expression I could use the abbreviated syntax for the axis, shortening it to "/html/body//text()".
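To compare the two forms without FluentDOM, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The sample markup and the variable names are my own assumptions for illustration; this is not how FluentDOM works internally.

```php
<?php
// Sample input, made up for this sketch.
$html = '<html><body><p>Hello <b>World</b></p><!-- comment --></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Step 1: find the part of the document to edit.
$body = $xpath->query('/html/body')->item(0);
// Step 2: select all text nodes below it, relative to that context node.
$twoSteps = $xpath->query('descendant-or-self::text()', $body);

// The same selection as a single expression with the abbreviated axis.
$oneStep = $xpath->query('/html/body//text()');

// Collect the text values; the comment node is not part of the selection.
$texts = array();
foreach ($twoSteps as $textNode) {
  $texts[] = $textNode->nodeValue;
}
```

Both queries should return the same text nodes here. Keeping the steps apart simply makes it easier to reuse the first one for a different part of the document.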

Loop

FluentDOM provides an "each()" method that expects a callback as its argument. The callback is executed for each node (in this case each text node). The first argument of the callback is the node itself.

$fd->each(
  function ($node) use ($check, $highlights) {
    ...
  }
);

Prepare The Words

$highlights = array(
  'word' => 'classNameOne',
  'word_two' => 'classNameTwo'
);

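I need to check each node against the words and split it at the words. It is a plain text value at this point, so the tool of choice is PCRE. To build a pattern from the words I sort them by length first (longest first), then loop over them, escape each one and concatenate them. The sorting is important if one word is part of another: the alternation tries its branches from left to right, so the longer word has to come first.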

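// sort the words by length, longest first
uksort(
  $highlights,
  function ($stringOne, $stringTwo) {
    $lengthOne = strlen($stringOne);
    $lengthTwo = strlen($stringTwo);
    if ($lengthOne > $lengthTwo) {
      return -1;
    } elseif ($lengthOne < $lengthTwo) {
      return 1;
    } else {
      // same length, fall back to alphabetical order
      return strcmp($stringOne, $stringTwo);
    }
  }
);
// escape each word and join them into one alternation
$check = '';
foreach ($highlights as $string => $class) {
  $check .= '|'.preg_quote(strtolower($string));
}
// "(" and ")" are the delimiters, the inner group captures the word
$check = '(('.substr($check, 1).'))iS';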
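A small sketch of why the order matters, using the two example words from above. The helper buildHighlightPattern() is a hypothetical name introduced only for this demonstration; it builds the same kind of pattern from a list that is already in the wanted order.

```php
<?php
// Hypothetical helper: escape the words and join them into one pattern.
function buildHighlightPattern(array $words) {
  $quoted = array();
  foreach ($words as $word) {
    $quoted[] = preg_quote(strtolower($word));
  }
  return '(('.implode('|', $quoted).'))iS';
}

// Shorter word first: "word" is tried before "word_two".
$unsorted = buildHighlightPattern(array('word', 'word_two'));
// Longer word first: "word_two" is tried before "word".
$sorted = buildHighlightPattern(array('word_two', 'word'));

preg_match($unsorted, 'A Word_Two example', $matchUnsorted);
preg_match($sorted, 'A Word_Two example', $matchSorted);
// $matchUnsorted[0] is "Word", $matchSorted[0] is "Word_Two"
```

With the shorter word first, only the "Word" part would be wrapped and "_Two" would stay outside the span.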

Check And Divide

This pattern can now be used to check the text as well as to divide it. A direct replace would be a bad idea, because I need to insert a new element node (the span). Creating nodes using the DOM functions takes care of any special characters.

if (preg_match($check, $node->nodeValue)) {
  $parts = preg_split(
    $check, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE
  );
  ...
}

The option PREG_SPLIT_DELIM_CAPTURE puts the submatch into the $parts array, too. So it is possible to loop over all parts in their original order.
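To see what the split produces, here is a standalone sketch with a hard-coded pattern for the two example words. The sample strings are assumptions of mine.

```php
<?php
// The pattern as it would be built for "word_two" and "word".
$check = '((word_two|word))iS';

$parts = preg_split(
  $check, 'A Word and a word_two.', -1, PREG_SPLIT_DELIM_CAPTURE
);
// $parts: array('A ', 'Word', ' and a ', 'word_two', '.')

// A match at the very start (or end) of the text yields an empty string part.
$edge = preg_split($check, 'word!', -1, PREG_SPLIT_DELIM_CAPTURE);
// $edge: array('', 'word', '!')
```

The matched words keep their original case in the result, which is why the lookup later lowercases each part before checking it.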

To Wrap Or Not To Wrap

The $parts array contains the words as well as the text around them in separate strings. For each word, a span with the class is needed; all other parts become separate text nodes.

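// start with an empty list for each text node
$items = array();
foreach ($parts as $part) {
  $string = strtolower($part);
  if (isset($highlights[$string])) {
    // a word: wrap it in a span with the matching class
    $span = $node
      ->ownerDocument
      ->createElement('span');
    $items[] = FluentDOM($span)
      ->addClass($highlights[$string])
      ->text($part)
      ->item(0);
  } else {
    // anything else: keep it as a text node
    $items[] = $node
      ->ownerDocument
      ->createTextNode($part);
  }
}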

You now see the reason why I used lowercase versions of the words for keys in the $highlights array. It is easy to check if the $part is a word and get the class for the span.

Replace The Text

The last step is easy again: replace the node with the list of created ones.

FluentDOM($node)->replaceWith($items);
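For readers without FluentDOM at hand, the steps can be put together with the plain DOM extension. This is only a sketch under my own assumptions: highlightWords() is a hypothetical function name, the input is expected to be a complete HTML document, and saveHTML() with a node argument needs PHP 5.3.6 or later. It is not the FluentDOM implementation.

```php
<?php
// Hypothetical function: wraps the given words in spans, plain DOM only.
function highlightWords($html, array $highlights) {
  // Lowercase keys so parts can be looked up case-insensitively.
  $highlights = array_change_key_case($highlights, CASE_LOWER);
  // Longest words first, so "word_two" wins over "word".
  uksort(
    $highlights,
    function ($one, $two) {
      $diff = strlen($two) - strlen($one);
      return $diff !== 0 ? $diff : strcmp($one, $two);
    }
  );
  $quoted = array_map('preg_quote', array_keys($highlights));
  $check = '(('.implode('|', $quoted).'))iS';

  $document = new DOMDocument();
  $document->loadHTML($html);
  $xpath = new DOMXPath($document);

  // Copy the result into an array as a precaution, the loop changes the tree.
  $textNodes = iterator_to_array($xpath->query('/html/body//text()'));
  foreach ($textNodes as $node) {
    if (!preg_match($check, $node->nodeValue)) {
      continue;
    }
    $parts = preg_split(
      $check, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE
    );
    $fragment = $document->createDocumentFragment();
    foreach ($parts as $part) {
      $key = strtolower($part);
      if (isset($highlights[$key])) {
        // A word: wrap it in a span with the matching class.
        $span = $document->createElement('span');
        $span->setAttribute('class', $highlights[$key]);
        $span->appendChild($document->createTextNode($part));
        $fragment->appendChild($span);
      } elseif ($part !== '') {
        // Surrounding text: keep it as a text node, skip empty parts.
        $fragment->appendChild($document->createTextNode($part));
      }
    }
    // Replace the original text node with the new nodes.
    $node->parentNode->replaceChild($fragment, $node);
  }
  return $document->saveHTML($xpath->query('/html/body')->item(0));
}

$result = highlightWords(
  '<html><body><p title="word">A Word and word_two &amp; more</p></body></html>',
  array('word' => 'classNameOne', 'word_two' => 'classNameTwo')
);
```

Because only text nodes are selected and every new node is created through the DOM, the title attribute stays untouched and the ampersand is escaped again on output.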

More

This is the basic solution. It uses anonymous functions and therefore only works with PHP 5.3, but I also created another version that defines a class. You can find the full source of the class example in the FluentDOM SVN at svn://svn.fluentdom.org in examples/tasks/highlightWords.php or on Gist.