Even though HTML allows case-insensitive tags, PHP and its getElementsByTagName are case-sensitive. I don't know about you, but to me this is really messed up. I think you remember the ultimate el_dom_childrenByTagCB function I shared here.
Now I'm sharing its lighter sister: el_dom_childrenByTagName, the one it evolved from. It's the same function without the bells and whistles [callbacks]. And it's CASE-INSENSITIVE, like it's supposed to be.
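To give a rough idea of what a case-insensitive lookup can look like, here is a minimal sketch. It is not the actual el_dom_childrenByTagName; the name children_by_tag_name_ci and the sample XML are made up for illustration. It walks the tree and compares tag names with strcasecmp.

```php
<?php
// Hypothetical helper (NOT the el_dom_childrenByTagName from the post):
// collects descendant elements whose tag name matches $tag, ignoring case.
function children_by_tag_name_ci(DOMNode $node, string $tag): array
{
    $matches = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text nodes, comments, etc.
        }
        if (strcasecmp($child->nodeName, $tag) === 0) {
            $matches[] = $child;
        }
        // Recurse so nested matches are found as well.
        $matches = array_merge($matches, children_by_tag_name_ci($child, $tag));
    }
    return $matches;
}

$doc = new DOMDocument();
// loadXML preserves the case of tag names, which makes the problem visible.
$doc->loadXML('<root><Item>a</Item><ITEM>b</ITEM><other><item>c</item></other></root>');

$strict = $doc->getElementsByTagName('item');                   // case-sensitive
$found  = children_by_tag_name_ci($doc->documentElement, 'item'); // ignores case

echo $strict->length, ' vs ', count($found), "\n";
```

On this sample the built-in call only sees the lowercase tag, while the helper picks up all three spellings.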
getElementsByTagName is cool but not enough for my DOM scraping and parsing needs. So I wrote a new function: el_dom_childrenByTagCB. It works a bit differently: it can be limited in depth and can accept callbacks as parameters. Why callbacks? Read on.
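As a rough illustration of the two ideas (a depth limit and a callback), here is a minimal sketch. It is not the real el_dom_childrenByTagCB: the name children_by_tag_cb, its parameters and the sample markup are assumptions, and the callback is used here simply as a filter, which is only one of the things a callback could do.

```php
<?php
// Hypothetical sketch (NOT the el_dom_childrenByTagCB from the post).
// $maxDepth limits recursion (1 = direct children only);
// $callback, if given, decides whether a matching element is kept.
function children_by_tag_cb(DOMNode $node, string $tag, int $maxDepth = 1, ?callable $callback = null): array
{
    $matches = [];
    if ($maxDepth < 1) {
        return $matches; // depth budget used up
    }
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        if (strcasecmp($child->nodeName, $tag) === 0
            && ($callback === null || $callback($child))) {
            $matches[] = $child;
        }
        // Go one level deeper with a smaller depth budget.
        $matches = array_merge($matches, children_by_tag_cb($child, $tag, $maxDepth - 1, $callback));
    }
    return $matches;
}

$doc = new DOMDocument();
$doc->loadXML('<div><a href="/x">one</a><p><a href="http://example.com/">two</a><a>three</a></p></div>');

// Depth 1: only direct children of <div>.
$direct = children_by_tag_cb($doc->documentElement, 'a', 1);

// Depth 2 with a callback: keep only anchors that carry an href attribute.
$withHref = children_by_tag_cb($doc->documentElement, 'a', 2, function (DOMElement $el) {
    return $el->hasAttribute('href');
});

echo count($direct), ' ', count($withHref), "\n";
```

Passing the filter in as a callback keeps the tree walk generic: the same function can pick links, images with a given class, or anything else the caller can test for.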
Got a comment today asking how to get Domain Age using Archive.org and, as the comment was too flattering to resist ... here it is. It uses an XPath query, though it could have been done with RegExp alone. But ... it's a case study.
Keep in mind that this only estimates the age. It's not exact, as not all sites get crawled by, or accept crawling from, the archive.org robot. More accurate results can be achieved using Whois Domain Age.
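To show the XPath part of the idea without hitting the network, here is a minimal sketch that works on a hard-coded sample page. The markup is an assumption and the real archive.org listing may look different; the function name earliest_snapshot is made up too. The only thing taken for granted is that snapshot links have the form /web/<14-digit timestamp>/<url>.

```php
<?php
// Sample markup standing in for an archive listing page (an assumption;
// the real page layout may differ).
$html = '<html><body>
<a href="/web/20050312081530/http://example.com/">Mar 12, 2005</a>
<a href="/web/19981205034512/http://example.com/">Dec 05, 1998</a>
<a href="/about">About</a>
</body></html>';

// Hypothetical helper: returns the date of the oldest snapshot link, or null.
function earliest_snapshot(string $html): ?DateTimeImmutable
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy markup
    $xpath = new DOMXPath($doc);

    $earliest = null;
    foreach ($xpath->query('//a[starts-with(@href, "/web/")]/@href') as $attr) {
        // Timestamps are fixed-width digits, so string comparison orders them by date.
        if (preg_match('#^/web/(\d{14})/#', $attr->value, $m)
            && ($earliest === null || strcmp($m[1], $earliest) < 0)) {
            $earliest = $m[1];
        }
    }
    if ($earliest === null) {
        return null;
    }
    $date = DateTimeImmutable::createFromFormat('YmdHis', $earliest, new DateTimeZone('UTC'));
    return $date ?: null;
}

$first = earliest_snapshot($html);
// A fixed "now" keeps the example reproducible; use new DateTimeImmutable() in practice.
$now = new DateTimeImmutable('2010-01-01', new DateTimeZone('UTC'));
$ageYears = $first->diff($now)->y;

echo $first->format('Y-m-d'), ' (', $ageYears, " years)\n";
```

The result is the age of the first snapshot, not of the domain registration, which is exactly why it is only an estimate.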
I wrote a post on some XPath magic for all you evil scrapers out there. Now I will show you how to scrape RSS feeds. I used to do it the RegExp way, but I have now decided to move over to XML parsing and DOM processing. Being lazy, I looked for a ready-made version and actually found quite a good one. It was close to my needs but not exactly what I wanted. I took it, used and abused the source (and ended up changing it almost completely), and arrived at the one I needed. The good thing about the RSS Scraper using DOM XML + PHP is that it's way shorter and much more reliable than the RegExp version.
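For a taste of the DOM approach, here is a minimal sketch. It is not the scraper from the post: the function name parse_rss_items and the inline sample feed are assumptions, and only the title and link of each item are read.

```php
<?php
// Inline sample feed so the sketch runs without network access (made-up data).
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns one array per <item> with its title and link.
function parse_rss_items(string $xml): array
{
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return []; // not well-formed XML
    }
    $items = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        $entry = [];
        foreach (['title', 'link'] as $field) {
            $node = $item->getElementsByTagName($field)->item(0);
            $entry[$field] = $node ? trim($node->textContent) : null;
        }
        $items[] = $entry;
    }
    return $items;
}

$items = parse_rss_items($rss);
foreach ($items as $entry) {
    echo $entry['title'], ' -> ', $entry['link'], "\n";
}
```

Because the parser deals with the XML structure, there are no fragile patterns to maintain when a feed changes its whitespace or attribute order, which is where RegExp versions tend to break.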