A Basketful Of Papayas: Highlight Words In HTML

2010-05-09

Highlight Words In HTML

I got an interesting question on IRC recently. How can you highlight some words/word parts in an HTML document?

The Challenge

Wrap given words in text content with a span.
Add a class to the span depending on the word.
Do not touch elements, attributes, comments or processing instructions.
Do it case-insensitively.
Do it the safe way.

Select The Text Content

Well, this is the easy part. Create a FluentDOM object, find the part of the document to edit, and select all text nodes in it.

    $fd = FluentDOM($html, 'html')
      ->find('/html/body')
      ->find('descendant-or-self::text()');

I used two XPath expressions because these are two separate steps. This way I can split them up later. In a single expression I could use the abbreviated syntax for the axis, shortening it to "/html/body//text()".
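The single-expression form can be tried without FluentDOM, using only PHP's built-in DOM extension. This is a minimal sketch; the sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample document, only for illustration.
$html = '<html><body><p>Hello <b>World</b></p><!-- note --></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "//" is the abbreviated form of "/descendant-or-self::node()/".
$texts = array();
foreach ($xpath->query('/html/body//text()') as $textNode) {
  $texts[] = $textNode->nodeValue;
}

// Only text nodes are selected, the comment node is not among them.
var_dump($texts);
```

The element nodes and the comment never show up in the result, which is what keeps the later replacement away from markup.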

Loop

FluentDOM provides an "each()" method that expects a callback as its argument. The callback is executed for each node (in this case each text node). The first argument of the callback is the node itself.

    $fd->each(
      function ($node) use ($check, $highlights) {
        ...
      }
    );

Prepare The Words

    $highlights = array(
      'word' => 'classNameOne',
      'word_two' => 'classNameTwo'
    );

I need to check each node against the words and split it at the words. It is a text value now, so the tool of choice is PCRE. To build a pattern from the words I sort them by length first (longest first), then loop over them, escape each one and concatenate them. The sorting is important if one word is part of another: the regex engine tries the alternatives from left to right, so the longer word has to come first.

    // sort the words by length, longest first
    uksort(
      $highlights,
      function ($stringOne, $stringTwo) {
        $lengthOne = strlen($stringOne);
        $lengthTwo = strlen($stringTwo);
        if ($lengthOne > $lengthTwo) {
          return -1;
        } elseif ($lengthOne < $lengthTwo) {
          return 1;
        } else {
          return strcmp($stringOne, $stringTwo);
        }
      }
    );
    // escape each word and join them into one alternation
    $check = '';
    foreach ($highlights as $string => $class) {
      $check .= '|'.preg_quote(strtolower($string));
    }
    // the outer parentheses are the pattern delimiters, the inner ones the capture group
    $check = '(('.substr($check, 1).'))iS';
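To see why the order matters, compare the two possible orders of the example words. The patterns below are written out by hand for illustration and the sample text is made up.

```php
<?php
// The same two words, once shortest first and once longest first.
$unsorted = '((word|word_two))iS';
$sorted = '((word_two|word))iS';

// Hypothetical sample text.
$text = 'A word_two example';

preg_match($unsorted, $text, $matchUnsorted);
preg_match($sorted, $text, $matchSorted);

// Alternatives are tried left to right, so the first pattern stops at "word".
var_dump($matchUnsorted[0], $matchSorted[0]);
```

With the shorter word first, "word_two" would be split into a highlighted "word" and a leftover "_two".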

Check And Divide

This pattern can now be used to check the text as well as to divide it. A direct replace would be a bad idea, because I need to insert a new element node (the span). Creating nodes using the DOM functions takes care of any special chars.
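A small sketch of what "takes care of any special chars" means, using only the built-in DOM extension. The sample text is made up.

```php
<?php
$dom = new DOMDocument();
$span = $dom->createElement('span');
$span->setAttribute('class', 'classNameOne');

// The text is added as a text node, so "<" and "&" are data, not markup.
$span->appendChild($dom->createTextNode('a < b & c'));
$dom->appendChild($span);

// Serializing the node escapes the special characters.
$markup = $dom->saveXML($span);
echo $markup, "\n";
```

A string replace on the serialized HTML would have to do this escaping by hand, and would also risk matching words inside tags or attributes.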

    if (preg_match($check, $node->nodeValue)) {
      $parts = preg_split(
        $check, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE
      );
      ...
    }

The option PREG_SPLIT_DELIM_CAPTURE puts the submatch into the $parts array, too. So it is possible to loop over all parts in their original order.
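For illustration, here is what the split looks like on a made-up sentence, with the pattern for the two example words written out by hand.

```php
<?php
// Pattern as built from the two example words, longest first.
$check = '((word_two|word))iS';

// Hypothetical sample text.
$parts = preg_split(
  $check, 'One Word and word_two.', -1, PREG_SPLIT_DELIM_CAPTURE
);

// The matched words keep their original case and position.
var_dump($parts);
```

The words sit between the surrounding text pieces, and "Word" keeps its capital letter, so the original spelling survives the case-insensitive match.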

To Wrap Or Not To Wrap

The $parts array contains the words as well as the text around them, in separate strings. Each word needs a span with the matching class; all other parts become separate text nodes.

    $items = array();
    foreach ($parts as $part) {
      $string = strtolower($part);
      if (isset($highlights[$string])) {
        // a word: wrap it in a span with the matching class
        $span = $node
          ->ownerDocument
          ->createElement('span');
        $items[] = FluentDOM($span)
          ->addClass($highlights[$string])
          ->text($part)
          ->item(0);
      } else {
        // anything else stays a plain text node
        $items[] = $node
          ->ownerDocument
          ->createTextNode($part);
      }
    }

You now see the reason why I used lowercase versions of the words for keys in the $highlights array. It is easy to check if the $part is a word and get the class for the span.

Replace The Text

The last step is easy again: replace the node with the list of created ones.

    FluentDOM($node)->replaceWith($items);
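The steps above can also be put together without FluentDOM. The following is only a sketch under some assumptions: it swaps in PHP's built-in DOM extension for FluentDOM, the function name highlightWords() is hypothetical, the keys of $highlights are expected to be lowercase, and the sample markup is made up.

```php
<?php
// Sketch only: plain ext/dom instead of FluentDOM.
// highlightWords() is a hypothetical helper name.
function highlightWords($html, array $highlights) {
  // Longest words first, so "word_two" wins over "word".
  uksort(
    $highlights,
    function ($one, $two) {
      $diff = strlen($two) - strlen($one);
      return ($diff != 0) ? $diff : strcmp($one, $two);
    }
  );
  $quoted = array();
  foreach (array_keys($highlights) as $word) {
    $quoted[] = preg_quote($word);
  }
  $check = '(('.implode('|', $quoted).'))iS';

  $dom = new DOMDocument();
  $dom->loadHTML($html);
  $xpath = new DOMXPath($dom);

  // Copy the node list first, because the loop modifies the tree.
  $textNodes = iterator_to_array($xpath->query('/html/body//text()'));
  foreach ($textNodes as $node) {
    $parts = preg_split(
      $check, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE
    );
    if (count($parts) < 2) {
      continue; // no word in this text node
    }
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $part) {
      $key = strtolower($part);
      if (isset($highlights[$key])) {
        // a word: wrap it in a span with the matching class
        $span = $dom->createElement('span');
        $span->setAttribute('class', $highlights[$key]);
        $span->appendChild($dom->createTextNode($part));
        $fragment->appendChild($span);
      } elseif ($part !== '') {
        // surrounding text stays a plain text node
        $fragment->appendChild($dom->createTextNode($part));
      }
    }
    $node->parentNode->replaceChild($fragment, $node);
  }
  return $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
}

// Hypothetical sample input: the title attribute must stay untouched.
$result = highlightWords(
  '<html><body><p title="word">A Word and word_two.</p></body></html>',
  array('word' => 'classNameOne', 'word_two' => 'classNameTwo')
);
echo $result, "\n";
```

In the result, both words in the text are wrapped in spans while the "word" inside the title attribute is left alone, because only text nodes were ever selected.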

This is the basic solution. It requires PHP 5.3, because it uses closures, but I created another version defining a class. You can find the full source of the class example in the FluentDOM SVN at svn://svn.fluentdom.org in examples/tasks/highlightWords.php or on Gist.

6 comments:

Anonymous, Sunday, May 09, 2010 11:59:00 AM
If you were to do this to, for example, highlight words that were searched on through a search engine, I would recommend doing this all in JavaScript, and don't do the post-processing in PHP.

It will make it easier to cache the PHP response, and IMHO a bit more reliable.

ThW, Sunday, May 09, 2010 1:31:00 PM
This would be a basic function of the result page, so it should not depend on JavaScript.

Anonymous, Sunday, May 09, 2010 8:11:00 PM
To expect JavaScript to be disabled is the biggest failure nowadays.

ThW, Sunday, May 09, 2010 10:46:00 PM
I expect it to be selectively enabled.

Shashikant Kore, Monday, May 10, 2010 4:50:00 AM
Here is how you can do it with Lucene and Solr. It can support highlighting phrases, wildcards, boolean queries, etc.

http://sigabrt.blogspot.com/2010/04/highlighting-query-in-entire-html.html

Anonymous, Monday, May 10, 2010 10:36:00 AM
@Anonymous on my terms I do use the NoScript Firefox plugin to ensure a minimum of anonymity while I am surfing through the net (which btw. is not too seldom) or I am using the Lynx command-line browser in an ssh session to find a quick answer to a question. With this I do have JavaScript disabled by default or, in case I use Lynx, not available.
But on the other hand, in my opinion it is a shame that I do have to activate JavaScript to browse a plain webpage just returning plain text representing a search result, for example. Did everyone forget about 'seamless degradation' and responsible and lightweight web design?
