DOMDocument ft. DOMXPath | Scraping Google Search Results

Scraping with RegExp is effective enough most of the time but, every now and then, you will need to make more advanced queries on the DOM to reach the elements you need, or to have the DOM fix very peculiar HTML errors that just don't make sense in the mind of an HTML-aware coder. I'm not saying it's not doable in RegExp, but this approach might actually be easier for those who think RegExp is EVAL.

The step by step guide

When using the DOM to retrieve slices of an HTML document, you need to do the following:

  1. Retrieve the page ... using file_get_contents or elHttpClient
  2. Load the page into a DOMDocument
  3. Prepare a DOMXPath object and attach the document to it
  4. XPath Query the Document
  5. Loop through the results
  6. Get what you need
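
A minimal sketch of the six steps, assuming a hard-coded HTML string in place of a real download so it runs offline. The markup, the menu id and the variable names are made up for illustration.

```php
<?php
// Step 1: retrieve the page. A hard-coded string stands in for
// file_get_contents('http://...') so the sketch runs without a network.
$html = '<html><body><ul id="menu">'
      . '<li><a href="/one">One</a></li>'
      . '<li><a href="/two">Two</a></li>'
      . '</ul></body></html>';

// Step 2: load the page into a DOMDocument.
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);

// Step 3: prepare a DOMXPath object attached to the document.
$xPather = new DOMXPath($domDoc);

// Step 4: run the XPath query against the document.
$domNodes = $xPather->query("//ul[@id='menu']/li/a");

// Steps 5 and 6: loop through the results and keep what you need,
// here the link text keyed by its href.
$links = array();
foreach ($domNodes as $domNode) {
    $links[$domNode->getAttribute('href')] = $domNode->textContent;
}
print_r($links);
```

Swap the string for a real download and the query for your own XPath and the rest stays the same.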

Pretty simple huh ...?

What Is XPath Anyway?

If you're asking yourself this ... not cool. Anyway, XPath is to the DOM what SQL is to MySQL / MSSQL. It's the XML query language and it lets you reach branches within the XML tree using combinations of tags, attributes and indexes. I'm not gonna teach you XPath here, but this is good reading for you: Learning by example, courtesy of Microsoft.
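
To make "tags, attributes and indexes" concrete, here is a small sketch run against a made-up document. The markup and ids are assumptions for illustration only.

```php
<?php
$html = '<html><body>'
      . '<div id="nav"><a href="/home">Home</a></div>'
      . '<div id="res"><ol>'
      . '<li><a href="/a">First</a></li>'
      . '<li><a href="/b">Second</a></li>'
      . '<li><a href="/c">Third</a></li>'
      . '</ol></div>'
      . '</body></html>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
$xPather = new DOMXPath($domDoc);

// Tag only: every <a> anywhere in the tree.
$all = $xPather->query('//a');

// Attribute predicate: only links below the <div> whose id is "res".
$byAttribute = $xPather->query("//div[@id='res']//a");

// Index predicate: the second <li> only (XPath indexes start at 1).
$byIndex = $xPather->query("//div[@id='res']/ol/li[2]/a");

echo $all->length, "\n";
echo $byAttribute->length, "\n";
echo $byIndex->item(0)->textContent, "\n";
```

The three queries should match four links, three links and the single "Second" link respectively.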

Finding The XPath of DOMElements

Finding the XPath for a DOMElement is not really easy but there are two ways:

  1. By hand
  2. Using XPather for Firefox

I use the second now, but I learnt XPath with the first method.

Using XPather

XPather for Firefox allows you to do two things. First, you can right click any element and click Show In XPather to retrieve its exact XPath. Exact means it includes the element's internal index. So, in a repetitive pattern like, let's say, the Google search results, the index will bind the query to one element by its position (much like LIMIT 5,1 picks a single row in MySQL, although XPath indexes start at 1 rather than 0).

The second thing it allows you to do is test your XPath queries and fine-tune them to reach what you need.

Learning by example

  1. Open Firefox
  2. Install the Firefox plugin
  3. Go to google.com (in English)
  4. Search 5ubliminal
  5. Right click first result
  6. Click Show In XPather

What you will see will blow your mind :) The XPath will look like this; notice the [#] index:

  • Result #1: /html/body[@id='gsr']/div[@id='res']/div/ol/li[1]/h3/a
  • Result #2: /html/body[@id='gsr']/div[@id='res']/div/ol/li[2]/h3/a

The [#] is the index and limits results to that one position. Remove [#] and you are left with:

/html/body[@id='gsr']/div[@id='res']/div/ol/li/h3/a

The above is the exact XPath pattern required to reach all the Google search result links. If you append /@href you get:

/html/body[@id='gsr']/div[@id='res']/div/ol/li/h3/a/@href
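
You can check the effect of the index and of /@href in PHP as well. This sketch runs the three variants against mock markup shaped like the path above; it is not real Google HTML, and the example.com URLs are placeholders.

```php
<?php
// Mock markup that follows the structure of the XPath in this post.
$html = '<html><body id="gsr"><div id="res"><div><ol>'
      . '<li><h3><a href="http://example.com/1">One</a></h3></li>'
      . '<li><h3><a href="http://example.com/2">Two</a></h3></li>'
      . '</ol></div></div></body></html>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
$xPather = new DOMXPath($domDoc);

$base = "/html/body[@id='gsr']/div[@id='res']/div/ol";

// With an index: exactly one link.
$first = $xPather->query($base . '/li[1]/h3/a');
// Without the index: every result link.
$all = $xPather->query($base . '/li/h3/a');
// With /@href appended: attribute nodes instead of <a> elements.
$hrefs = $xPather->query($base . '/li/h3/a/@href');

echo $first->length, ' ', $all->length, "\n";
foreach ($hrefs as $href) {
    echo $href->nodeValue, "\n";
}
```

Note that the last query returns attribute nodes, so you read the URL from nodeValue rather than calling getAttribute().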

And, after you eval it in XPather, you will see that this directly retrieves the href attributes of the search result links. Bullzeye!

//a/@href will get all links from a page. XPath power!

Wrappin' it up

Time to wrap the above pieces together into the most basic method for parsing Google search results. This is so basic it's embarrassing for me to publish it, but I'll do it.
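
What follows is only a rough sketch of how the pieces could fit together, not the original listing. The function name extractResultLinks is hypothetical, the mock page stands in for a live download, and the XPath assumes Google's markup still matches the pattern found above, which may no longer hold.

```php
<?php
// Hypothetical helper: returns title/URL pairs for every link matched
// by the XPath pattern built in this post.
function extractResultLinks($html)
{
    $domDoc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so silence parser warnings.
    @$domDoc->loadHTML($html);

    $xPather = new DOMXPath($domDoc);
    $domNodes = $xPather->query(
        "/html/body[@id='gsr']/div[@id='res']/div/ol/li/h3/a"
    );

    $results = array();
    foreach ($domNodes as $domNode) {
        $results[] = array(
            'title' => trim($domNode->textContent),
            'url'   => $domNode->getAttribute('href'),
        );
    }
    return $results;
}

// Offline stand-in for a results page (assumed structure, placeholder URLs).
// Against the live site you would pass in the output of file_get_contents()
// or your HTTP client instead.
$html = '<html><body id="gsr"><div id="res"><div><ol>'
      . '<li><h3><a href="http://example.com/1">First result</a></h3></li>'
      . '<li><h3><a href="http://example.com/2">Second result</a></h3></li>'
      . '</ol></div></div></body></html>';

$results = extractResultLinks($html);
foreach ($results as $result) {
    echo $result['title'], ' => ', $result['url'], "\n";
}
```

If the query comes back empty on a real page, re-check the path in XPather first, since the markup is the part most likely to have changed.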


Can it get easier than this? The real scraping method using DOM will be published in a few days to subscribers only.

What have you learned today?

  • XPath queries and DOM make scraping a breeze, especially for complex sites
  • Any site goes as long as there's a repetitive pattern
  • XPather helps you find the repetitive patterns and test your XPath queries
  • PHP DOM + XPath are magic
  • ... 5ubliminal owns!

Category: PHP, Scraping

18 Responses

  1. Good stuff! I definitely prefer RegEx, but that has more to do with familiarity than usability.

    • $@5ubliminal58:361 — #1 says:

      True. But RegExp is usually custom tailored on (rather) static HTML.
      The DOM parses HTML by (HTML) rules and even if things change a bit (in code) same results can be achieved easier.

      Nevertheless I choose RegExp over DOM almost anytime.

  2. Man, you made me take a leap forward! Thanks a lot, I didn’t even know it existed! The nicest thing I’ve learnt since Curl :)))

  3. +twola1:12 — #6 says:

    Thanks a lot for this post.

    I am trying to grasp XPath but am still a newb. How would you use/modify your example to get, let’s say, a domain’s age from the Wayback Machine?

    (I really can’t think of any way to calculate that)

    I love your blog and consider you an awesome authority. Your shit is practical. Anyhow help me if you get a chance.

  4. +gorthal1:2 — #55 says:

    Hi all !
    Thank you man for this great piece of work.
    I didn’t know XPath, and it’s a blast!

    Now I am trying to scrape Google image results, but I am a stupid boy.
    Have you tried this kind of shit?

  5. +xentech10:17 — #4 says:

    Hey 5ub. This technique is the titties, regex confuses the shit out of me. Anyway, I’ve been able to reproduce this and scrape a few different sites, however when I try and use this on phazeddl.com the $domDocument->loadHTML bit throws up all kinds of errors. Any ideas?

    • $@5ubliminal163:361 — #1 says:

      It works well for me:

      // Load the page and attach an XPath object to it.
      $domDoc = new DOMDocument();
      $domDoc->loadHTML(file_get_contents('http://www.phazeddl.com/'));
      $xPather = new DOMXPath($domDoc);

      // Collect the href attribute of every link on the page.
      $domNodes = $xPather->query("//a/@href");
      $linkHrefs = array();
      foreach ($domNodes as $domNode) {
          $linkHrefs[] = strval($domNode->textContent);
      }

  6. +xentech11:17 — #4 says:

    Even using your code I get the same:

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Opening and ending tag mismatch: form and table in Entity, line: 2 in C:\XXXXXXXXXXXXXX\scrape\phaz.php on line 4

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID searchResult already defined in Entity, line: 2 in C:\XXXXXXXXXXXXXX\scrape\phaz.php on line 4

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID tableHead already defined in Entity, line: 2 in C:\XXXXXXXXXXXXXX\scrape\phaz.php on line 4

    etc, etc, etc, etc

    Very weird.. It works with other sites, any ideas? Thanks for the response.

    • $@5ubliminal164:361 — #1 says:

      I’m sorry but warnings are irrelevant to me. I always use error_reporting(E_ERROR | E_PARSE | E_USER_ERROR); That’s it. Warnings are made … for the weak :) and most warnings are of virtually no use. What if their HTML is malformed ??? … you can’t fix it. So just silence the warning.

      I’m pretty sure many PHP coders would crucify me for disabling warnings but … we’ll talk 5 yrs. later.

      To skip the warning either reset your error_reporting as above or use @$domDoc->loadHTML(...), which silences errors for that single call. I think the @ is your best choice.
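
For anyone who would rather not use @, a third option is to let libxml collect the parse errors itself. This is a minimal sketch on a deliberately broken, made-up snippet rather than the site discussed here; how many errors get reported depends on the libxml version installed.

```php
<?php
// Deliberately broken markup: a duplicate id and a stray closing tag.
$html = '<html><body><table id="x"><tr><td>one</td></tr></table>'
      . '<div id="x"><a href="/dl">Download</a></form></div></body></html>';

// Ask libxml to collect parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);

// The collected errors can be inspected, logged or simply thrown away.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable despite the broken markup.
$xPather = new DOMXPath($domDoc);
$domNodes = $xPather->query('//a/@href');

echo count($errors), " parse error(s) collected\n";
echo $domNodes->item(0)->nodeValue, "\n";
```

Unlike @, this keeps other warnings visible and still lets you look at what the parser complained about when a query comes back empty.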

  7. +xentech12:17 — #4 says:

    Well the warnings are gone but I still can’t grab the data. I’m trying to get download names from http://www.phazeddl.com/pg/apps1.html with this string:

    /html/body/form/table/tbody/tr/td/div[@id='SearchResults']/div[@id='content']/div[@id='main-content']/table[@id='searchResult']/tbody/tr/td[2]/a

    Can’t seem to return anything. Thanks for your help.

    • $@5ubliminal166:361 — #1 says:

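      I might have forgotten to mention that TBODY is not a valid XPath element unless you can actually find it in the source code. Firefox adds a TBODY to every table in its own DOM, so XPather reports it, but PHP's DOMDocument does not add it.
      Remove both TBODY steps from your XPath query and it will work: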

      /html/body/form/table/tr/td/div[@id='SearchResults']/div[@id='content']/div[@id='main-content']/table[@id='searchResult']/tr/td[2]/a
      or use the shortie:
      //table[@id='searchResult']/tr/td[2]/a

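      // $xPather is the DOMXPath object built from the page, as in my earlier reply.
      $nodes = $xPather->query("//table[@id='searchResult']/tr/td[2]/a");
      ob_start();
      foreach ($nodes as $node) {
          echo $node->tagName, " :: ", $node->textContent, "\r\n", $node->getAttribute('href'), "\r\n\r\n";
      }
      $text = ob_get_clean();
      echo '<pre>', $text, '</pre>';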
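
The TBODY point can be checked on a small made-up table. This sketch assumes source markup without a tbody, as in the page discussed above; the rows and URLs are invented.

```php
<?php
// Source markup with no <tbody>, which is how many pages are written.
$html = '<html><body><table id="searchResult">'
      . '<tr><td>1</td><td><a href="/dl/1">First app</a></td></tr>'
      . '<tr><td>2</td><td><a href="/dl/2">Second app</a></td></tr>'
      . '</table></body></html>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
$xPather = new DOMXPath($domDoc);

// Path as a browser tool would report it, with the implied tbody.
$withTbody = $xPather->query("//table[@id='searchResult']/tbody/tr/td[2]/a");
// Path matching the markup as it was written.
$withoutTbody = $xPather->query("//table[@id='searchResult']/tr/td[2]/a");
// A // between table and tr matches with or without a tbody.
$either = $xPather->query("//table[@id='searchResult']//tr/td[2]/a");

echo $withTbody->length, ' ', $withoutTbody->length, ' ', $either->length, "\n";
```

With DOMDocument the first query should find nothing while the other two find both links, so the // form is the safer choice when you are not sure whether the source contains a tbody.
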
      • +xentech13:17 — #4 says:

        Sex on toast. Thanks dude. I’m going to learn more about these and how they work, although I bet there’s a lot less info about them than RegExp.

        Thanks again!

  8. +elescondite1:2 — #60 says:

    Nothing to be embarrassed about, I know a lot of good coders that know zip about XPath.

    By the way, to make it more accurate while staying simple, add an extra slash before the /h3 — this will pick up entries such as youtube that wrap the result in a table (to embed the thumbnail).

    /html/body[@id='gsr']/div[@id='res']/div/ol/li//h3/a/@href

    I will leave it at that. Those who know, know how to make this flexible, those who don’t can stick with it as is.

    Then again, //h3/a/@href works just fine as well.
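
The difference between /h3 and //h3 shows up on a small mock list where one entry wraps its heading in a table. The markup below is invented for illustration and is not real Google output.

```php
<?php
// Mock results: the second entry wraps its heading in a table,
// the way thumbnail results are described above.
$html = '<html><body><ol>'
      . '<li><h3><a href="http://example.com/plain">Plain</a></h3></li>'
      . '<li><table><tr><td><h3><a href="http://example.com/video">Video</a></h3></td></tr></table></li>'
      . '</ol></body></html>';

$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
$xPather = new DOMXPath($domDoc);

// Single slash: <h3> must be a direct child of <li>.
$direct = $xPather->query('//ol/li/h3/a/@href');
// Double slash: <h3> may sit at any depth below <li>.
$anyDepth = $xPather->query('//ol/li//h3/a/@href');

echo $direct->length, ' ', $anyDepth->length, "\n";
```

The single-slash query should match only the plain entry, while the double-slash one picks up both.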

    • $@5ubliminal180:361 — #1 says:

      Thanks for the hint but I intentionally avoid Youtube and other funny out of place results. No point in scraping those … is there ?:)

      As for the long ass example … it’s educational … look at the comment right above yours for a // relative path example ;)

      • +elescondite2:2 — #60 says:

        Agreed. Depends on what you are trying to achieve. Most of the twits, oops, valued clients, I code for want their rankings - remove enough “out of place junk” and their SERPS start to look pretty good ;-)

  9. +donnykurnia1:1 — #211 says:

    Hi, I found this blog via Google, but I am unable to read it using Google Chrome, since you only allow Opera, Firefox, and IE. Maybe it’s time to include Google Chrome too, since I’m using it all the time. Thank you.

    PS: This is my second attempt to post this comment. I have the NoScript add-on in Firefox and by default it blocks the JS on this blog. You should put a warning near the comment form to make sure that visitors activate JavaScript in their browser before posting a comment. If this were an e-commerce website, you would surely lose a customer because of this unnecessary hassle.