RSScraping | Scraping RSS With PHP, DOM and XPath Magic

I wrote a post on some XPath magic for all you evil scrapers out there. Now I will show you how to scrape RSS feeds. I used to do it the RegExp way, but I have now switched to XML parsing and DOM processing. Being lazy, I looked for a ready-made version and found quite a good one: close to my needs, but not exactly what I wanted. I took it, used and abused the source (I ended up changing it almost completely), and arrived at the one I needed. The good thing about an RSS scraper built on DOM XML + PHP is that it is way shorter and much more reliable than the RegExp version.

The Changes I Made In Functionality

RSS items are grouped by channel. I added a lot more typecasting and changed some of the XPath calls and namespace usage. There are also two more helper functions: one for the combined items of all channels and one for titles + links only. It supports full-content feeds too (why would people do that ... I don't know?).
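
To give an idea of what this looks like, here is a minimal sketch of the approach. It is not the RSScraper source itself. The function names (rss_scrape_sketch, rss_all_items_sketch, rss_titles_links_sketch) and the array keys are made up for illustration, and it assumes PHP 7+ with the DOM extension and an RSS 2.0 feed whose full content, if any, lives in content:encoded.

```php
<?php
// Hypothetical sketch, NOT the actual RSScraper code.
// Parses an RSS 2.0 string with DOM + XPath and groups items by channel.
function rss_scrape_sketch(string $xml): array
{
    $doc = new DOMDocument();
    // Silence libxml warnings on sloppy feeds; a failed load returns false.
    if ($xml === '' || !@$doc->loadXML($xml)) {
        return [];
    }
    $xp = new DOMXPath($doc);
    // Full-content feeds usually carry the post body in <content:encoded>.
    $xp->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

    $channels = [];
    foreach ($xp->query('/rss/channel') as $channel) {
        $items = [];
        foreach ($xp->query('item', $channel) as $item) {
            // string(...) yields '' for missing nodes; cast and trim anyway.
            $items[] = [
                'title'   => trim((string) $xp->evaluate('string(title)', $item)),
                'link'    => trim((string) $xp->evaluate('string(link)', $item)),
                'summary' => trim((string) $xp->evaluate('string(description)', $item)),
                'content' => trim((string) $xp->evaluate('string(content:encoded)', $item)),
            ];
        }
        $channels[] = [
            'title' => trim((string) $xp->evaluate('string(title)', $channel)),
            'link'  => trim((string) $xp->evaluate('string(link)', $channel)),
            'items' => $items,
        ];
    }
    return $channels;
}

// Helper (hypothetical): the items of all channels combined into one flat list.
function rss_all_items_sketch(array $channels): array
{
    $all = [];
    foreach ($channels as $channel) {
        foreach ($channel['items'] as $item) {
            $all[] = $item;
        }
    }
    return $all;
}

// Helper (hypothetical): only the title and link of every item.
function rss_titles_links_sketch(array $channels): array
{
    $out = [];
    foreach (rss_all_items_sketch($channels) as $item) {
        $out[] = ['title' => $item['title'], 'link' => $item['link']];
    }
    return $out;
}
```

Feed it the raw XML you fetched (for example with file_get_contents) and loop over the result of rss_titles_links_sketch(rss_scrape_sketch($xml)). Using string(...) in the XPath expressions means a missing element simply comes back as an empty string, so there is no need to check every node for null.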

The RSScraper Source Code

Zone unavailable to unregistered users.
Registration is quick, painless and worth its weight in gold.

Enjoy it and scrape the full content RSS publishers to death (they're asking for it) ;) ... but don't infringe copyright!

Category: PHP, Scraping

5 Responses

  1. You’re reinventing the wheel somewhat. Take a look at SimplePie.

    • $@5ubliminal64:361 — #1 says:

      My wheel rolls better :) I’m crazy that way.

      You are aware that SimplePIE.inc is about 350 KB in size while mine is just one simple function…?
      I only need what mine does, I’m cheap about includes’ sizes … and performance.

      But to compare them you need to have a look at the code and, as you are not registered/logged in, I don’t think you have ;)

  2. +time2lite1:8 — #14 says:

    Thanks for this. Do you have a simple script for scraping Google for RSS feeds related to a subject area?

    At the moment I do a search like

    subject rss filetype:xml

    and then just copy and paste the URL from the SERP. What I could really use is a script that scrapes those SERPs and puts the URLs in a text file, but my PHP isn’t quite up to it!