Aug 11, 2009
PHP DOM getElementsByTagName() Is Obsolete
getElementsByTagName is cool, but it is not enough for my DOM scraping and parsing needs. So I wrote a new function: el_dom_childrenByTagCB. It works a bit differently. It can be limited in depth and it accepts callbacks as parameters. Why callbacks? Read on.
The function declaration looks like this:
function el_dom_childrenByTagCB($node, $tag = null, $depth = 0, $validator_callback = null, $keeper_callback = null)
This is a breakdown of the parameters.
- $node is the DOMNode the function searches in. It can be any DOMNode type, for example a DOMDocument or a DOMElement.
- $tag is a string and can be one of three types. Tag name: A, IMG, SPAN, DIV, ... or node name: #comment, #text or attribute: @src. For a tag name or node name, the node is added to the results only if its name matches. With an attribute, the node is added only if it has that attribute, regardless of its value.
- $depth is the recursion depth inside $node. 0 means only direct children, 1 means the children and their children ... and so on. -1 means infinite.
- $validator_callback is where the magic kicks in. If it is null, any node matching $tag is added. If it is a callback, it is asked whether the node is worth keeping. The callback gets $child as its only parameter and decides, based on attributes or whatever else, whether to add it to the results.
- $keeper_callback is another magical parameter. If it is null, the DOMElement $child itself is added to the results. If it is a callback, $child is passed to it as the only parameter and the callback decides what to add to the results. It can return the node itself, or attribute values, or an array ... or an object ... or the current time ... and so on.
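To make the parameter list above more concrete, here is a rough sketch of how a function with this signature could work. It is not the actual el_dom_childrenByTagCB source. The names sketch_matchesTag and sketch_childrenByTagCB are made up for illustration. The sketch assumes that depth 0 means direct children only, that tag names compare case-insensitively, and that a null $tag matches every child node.

```php
<?php
// Hypothetical sketch, NOT the original el_dom_childrenByTagCB.

// Decide whether $child matches $tag ("IMG", "#text", "@src" or null).
function sketch_matchesTag($child, $tag)
{
    if ($tag === null) {
        return true; // assumption: no tag filter means every node matches
    }
    if ($tag !== '' && $tag[0] === '@') {
        // Attribute form: only elements can carry attributes.
        return $child instanceof DOMElement && $child->hasAttribute(substr($tag, 1));
    }
    // Tag or node name; assumption: case-insensitive comparison.
    return strcasecmp($child->nodeName, $tag) === 0;
}

function sketch_childrenByTagCB($node, $tag = null, $depth = 0, $validator = null, $keeper = null)
{
    $results = array();
    foreach ($node->childNodes as $child) {
        // Validator (if any) has the final say on matching nodes.
        if (sketch_matchesTag($child, $tag)
            && ($validator === null || call_user_func($validator, $child))) {
            // Keeper (if any) decides what is stored instead of the node.
            $results[] = $keeper === null ? $child : call_user_func($keeper, $child);
        }
        // Descend while depth remains; -1 never runs out.
        if ($depth !== 0 && $child->hasChildNodes()) {
            $next = $depth > 0 ? $depth - 1 : -1;
            $results = array_merge(
                $results,
                sketch_childrenByTagCB($child, $tag, $next, $validator, $keeper)
            );
        }
    }
    return $results;
}

// Usage on a made-up fragment: collect HREF values of nodes that have one.
$doc = new DOMDocument();
$doc->loadHTML('<div><p><a href="/a">A</a></p><a href="/b">B</a><a>no href</a></div>');
$div = $doc->getElementsByTagName('div')->item(0);
$hrefOf = function ($n) {
    return $n->getAttribute('href');
};

$shallow = sketch_childrenByTagCB($div, '@href', 0, null, $hrefOf);  // direct children only
$deep    = sketch_childrenByTagCB($div, '@href', -1, null, $hrefOf); // whole subtree
```

With depth 0 only the link sitting directly under the DIV is seen, while -1 also reaches the one nested inside the P. Results come back in document order because each node is handled before its descendants.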
I'm pretty sure that did not make everything clear, even though I hope it did. But just as a picture is worth a thousand words ... so are a few lines of code. Here is an example of how to get the SRC attribute values of all IMG elements inside a parent node where SRC ends with .jpg or .png.
<?php
// Closures need PHP 5.3 or later.
$urls = el_dom_childrenByTagCB(
    $DOMDocument, /* DOMNode to look in */
    'IMG',        /* Tag Name */
    -1,           /* Infinite Depth */
    /* Validator checks that SRC exists and ends with .jpg or .png */
    function ($n) {
        return $n->hasAttribute('src') && preg_match('~^.+\.(jpg|png)$~i', $n->getAttribute('src'));
    },
    /* Keeper returns the value of SRC */
    function ($n) {
        return $n->getAttribute('src');
    }
);
?>
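For comparison, here is roughly what the same job looks like with only the built-in getElementsByTagName. This is a sketch on a made-up HTML fragment. The filtering and the collecting, which the validator and keeper callbacks take care of in the example above, have to be written by hand in the loop, and there is no way to limit the depth.

```php
<?php
// Comparison sketch using only built-in DOM calls; the sample HTML is invented.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div><img src="a.jpg"><img src="b.gif"><p><img src="c.PNG"></p><img alt="no src"></div>'
);

$urls = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute() returns an empty string when SRC is missing,
    // so the pattern below rejects those nodes as well.
    $src = $img->getAttribute('src');
    if (preg_match('~^.+\.(jpg|png)$~i', $src)) {
        $urls[] = $src; // the "keeper" step: store the value, not the node
    }
}
```

After the loop, $urls holds a.jpg and c.PNG, in document order.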
What do you think? Do you see the potential? If you do ... here's the code. I take tips ... but I'm not a waiter ;)
Registration is quick, painless and worth its weight in gold.


OOT: Surprised that the source code is shown without me being logged in. But after a second check, it turns out that I had already logged in yesterday :D.
Back to topic: great function! This will come in handy for my next experiment.
You scared me there :) Glad it helps.
I’ll publish some advanced examples of using it in the next few days.
[...] nice functions and routines and is definitely worth checking out. Here is a recent article on the PHP DOM getElementsByTagName function. He leans towards the Blackhat side of things so that might be another reason to check out how he [...]
[...] case-insensitive version of : PHP DOM’s getElementsByTagName. Even if I know you remember the ultimate el_dom_childrenByTagCB function I shared here, do put to good its lighter version. No comments [...]