Apr 2, 2009
RSScraping | Scraping RSS With PHP, DOM and XPath Magic
I wrote a post on some XPath magic for all you evil scrapers out there. Now I will show you how to scrape RSS feeds. I used to do it the RegExp way, but I have now switched to XML parsing and DOM processing. Being lazy, I looked for a ready-made version and found quite a good one: close to my needs, but not exactly. I took it, used and abused the source (I ended up changing it almost completely), and got the one I needed. The good thing about an RSS scraper built on DOM XML + PHP is that it's way shorter and much more reliable than the RegExp version.
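To give a feel for why the DOM route ends up short, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. It is not the RSScraper source; the function name rss_titles() is made up for illustration, and it assumes a plain RSS 2.0 feed that is already loaded into a string.

```php
<?php
// Hypothetical helper, not part of RSScraper: returns every item title
// found in an RSS 2.0 document passed in as a string.
function rss_titles(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);               // parse the feed into a DOM tree

    $xpath = new DOMXPath($doc);
    $titles = [];
    // One XPath expression does the work a RegExp would need several patterns for.
    foreach ($xpath->query('//item/title') as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}
```

Calling rss_titles() on a feed string returns a flat array of title strings. Entities and CDATA sections are decoded by the parser, which is one of the cases where RegExp matching tends to go wrong.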
The Changes I Made In Functionality
RSS items are grouped by channel. I added a lot more typecasting and changed some of the XPath calls and some of the namespace usage. There are also two more helper functions: one for combined channel items and one for titles + links only. It also supports full-content feeds (why people would do that ... I don't know).
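The features listed above could look roughly like the sketch below. This is an illustration under assumptions, not the actual RSScraper code: the function names scrape_channels(), all_items() and titles_and_links() are invented here, and full content is assumed to live in the content:encoded element of the RSS content module.

```php
<?php
// Hypothetical sketch: group items by channel, with explicit string casts.
function scrape_channels(string $xml): array
{
    $doc = new DOMDocument();
    // Bail out on empty or malformed input instead of emitting warnings.
    if ($xml === '' || !@$doc->loadXML($xml)) {
        return [];
    }
    $xpath = new DOMXPath($doc);
    // Namespace of the RSS content module, used by full-content feeds.
    $xpath->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

    $channels = [];
    foreach ($xpath->query('/rss/channel') as $channel) {
        $items = [];
        foreach ($xpath->query('item', $channel) as $item) {
            $items[] = [
                // string(...) evaluates to '' when the element is missing.
                'title'   => trim((string) $xpath->evaluate('string(title)', $item)),
                'link'    => trim((string) $xpath->evaluate('string(link)', $item)),
                'content' => trim((string) $xpath->evaluate('string(content:encoded)', $item)),
            ];
        }
        $channels[] = [
            'title' => trim((string) $xpath->evaluate('string(title)', $channel)),
            'items' => $items,
        ];
    }
    return $channels;
}

// Helper 1 (assumed shape): the items of all channels combined into one list.
function all_items(array $channels): array
{
    $all = [];
    foreach ($channels as $channel) {
        foreach ($channel['items'] as $item) {
            $all[] = $item;
        }
    }
    return $all;
}

// Helper 2 (assumed shape): only titles + links, dropping the content.
function titles_and_links(array $channels): array
{
    return array_map(function (array $item) {
        return ['title' => $item['title'], 'link' => $item['link']];
    }, all_items($channels));
}
```

For a feed without full content, the 'content' key is simply an empty string, so callers can check it without extra isset() tests.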
The RSScraper Source Code
Registration is quick, painless and worth its weight in gold.
Enjoy it and scrape the full content RSS publishers to death (they're asking for it) ;) ... but don't infringe copyright!


You’re reinventing the wheel somewhat. Take a look at SimplePie.
My wheel rolls better :) I’m crazy that way.
You are aware that SimplePIE.inc is about 350KB in size, while mine is just one simple function…?
I only need what mine does, I’m cheap about includes’ sizes … and performance.
But to compare them you need to have a look at the code and, as you are not registered/logged in, I don’t think you have ;)
Okay, fair point :)
Thanks for this. Do you have a simple script for scraping Google for RSS feeds related to a subject area?
At the moment I do a search like
subject rss filetype:xml
and then just copy and paste the URLs from the SERP. What I could really use is a script that scrapes those SERPs and puts the URLs in a text file, but my PHP isn’t quite up to it!
Check out my old blog.