Stupid Simple Web Scraping with SimpleXML

Stupid Simple Web Scraping with SimpleXML


Add Comment
Source
Tutorials
Bookmark/Search this post with:
Delicious Delicious
Digg Digg
StumbleUpon StumbleUpon
Facebook Facebook
Google Google
Yahoo Yahoo
Technorati Technorati
20
Wednesday, February 20, 2008
By admin
The other day, I was tasked with building a data scraper. Having never built such a contraption, I naturally turned to the Internets for preexisting code. I was horrified with what I found.

The “free” PHP scripts (that’s “free” as in “free baby vomit”) were all infested with the worst sorts of newfangled regex, and PHP 4 era DOM traversing.

Making matters worse, the scripts didn’t offer much of an API, or interface for data mining – rather they provided a rigid, and worthless example – leaving their hapless users to mutilate whatever useful lines they could find, and create an even more horrid fraken-script.*

GOD OFFERED SIMPLEXML, AND IT WAS GOOD
It didn’t take me long to realize that PHP 5’s simpleXML was the answer. And indeed, after an hour of practice, simpleXML turned me into a scraping Ninja.

Below, is a very simple example [for drupal 6] that parses the drupal planet blogroll, and makes this neat little table out of it. Hopefully, you’ll find this method as easy, and useful as I did.

*Disclosure: I am not among the sadistic few that think Perl’s regular expressions are the greatest invention since sex. So you call simpleXML a crutch, and I’ll call you sick.

Leave a Reply

Your email address will not be published. Required fields are marked *