
Scraping Pages Written By HTML-Challenged People

There are moments when you scrape a page and notice that, even though your browser shows a perfectly functional HTML page, you just can't parse it with RegExp. Parsing it with the DOM would work, but RegExp just won't. This usually means the HTML is malformed: the browser's DOM parser quietly repairs it, but doing those repairs yourself would be quite difficult, as there's a lot of messed up shit people do in HTML.

When I scrape in PHP, I fix HTML pages automatically every time. I don't trust those amateurs out there who don't close tags and so on. Fortunately, fixing pages in PHP is easy with tidy_repair_string (it needs the tidy extension), and the tidy config settings are listed in the HTML Tidy documentation.
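As a rough illustration of the idea, here is a minimal sketch of a repair helper. The function name fix_html and the chosen settings are assumptions for this example, not the code from the original post; it only relies on tidy_repair_string and on two standard tidy options (output-xhtml and wrap).

```php
<?php
// Hypothetical helper (the name is an assumption): repair malformed HTML
// with the tidy extension before running regular expressions over it.
function fix_html(string $html, string $encoding = 'utf8'): string
{
    if (!function_exists('tidy_repair_string')) {
        throw new RuntimeException('The PHP tidy extension is not loaded.');
    }

    // Assumed settings; adjust them to what your scraper needs.
    $config = [
        'output-xhtml' => true, // emit well-formed XHTML with closed tags
        'wrap'         => 0,    // do not wrap long lines
    ];

    $fixed = tidy_repair_string($html, $config, $encoding);
    if ($fixed === false) {
        throw new RuntimeException('tidy could not repair the document.');
    }
    return $fixed;
}

// Unclosed <li>, <b> and <p> tags, the kind of thing browsers silently fix.
$broken = '<ul><li>one<li>two</ul><p>unclosed <b>bold';
echo fix_html($broken);
```

Once the markup is repaired, the closing tags your patterns rely on are actually there, so the same RegExp that failed on the raw page has a fair chance of matching.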


Have fun and fix thy scraped HTML pages.

Category: PHP, RegExp, Scraping

2 Responses

  1. +kundi2:5 — #23 says:

    Should I read the encoding from the HTML file before processing it with el_tidy_fixML, so that it doesn't get messed up if it isn't UTF-8?

      • I think they detect it themselves. Still, you should read it and add the input encoding as a parameter.
        You can find the encoding in the HTTP Content-Type header or in the meta http-equiv=Content-Type tag.

      PS: Haven’t tried it against weird encodings.
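
The lookup suggested in the reply above can be sketched as follows. Both function names are hypothetical, the charset map covers only a few common cases, and the fallback to utf8 is an assumption; the right-hand names (utf8, latin1, win1252, ascii) are encoding names tidy accepts.

```php
<?php
// Hypothetical helper: find the charset declared in the HTTP Content-Type
// header or in a <meta> tag. Returns null when nothing is declared.
function detect_charset(string $html, ?string $contentType = null): ?string
{
    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }

    // 2. <meta http-equiv="Content-Type" content="...; charset=...">
    //    or the shorter <meta charset="...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }

    return null;
}

// Hypothetical helper: translate a declared charset into a tidy encoding
// name. Unknown or missing charsets fall back to utf8 (an assumption).
function tidy_encoding_name(?string $charset): string
{
    $map = [
        'utf-8'        => 'utf8',
        'iso-8859-1'   => 'latin1',
        'windows-1252' => 'win1252',
        'us-ascii'     => 'ascii',
    ];
    return $map[$charset ?? ''] ?? 'utf8';
}
```

The resulting name can then be passed as the third (encoding) argument of tidy_repair_string. Like the reply says, this has not been tried against weird encodings.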