Scraping Websites With cURL

Web page scraping is a hot topic around the Internet, as more and more people look to create applications that pull in data from many different sources and websites.

But what if you want to download pictures, graphics, or video from a number of websites and store them on your server? This is where PHP’s file_get_contents function alone cannot help us.
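For plain HTML, file_get_contents is usually all you need. Here is a minimal sketch of that text-only approach (the URL is just a placeholder, and it assumes allow_url_fopen is enabled in php.ini):

<?php
// Fetch the raw HTML of a page as a string.
// Requires allow_url_fopen = On in php.ini.
$html = file_get_contents('http://www.example.com/');

if ($html !== false) {
    echo strlen($html) . " bytes of HTML retrieved\n";
}
?>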

Introducing cURL!

cURL is a command-line tool and library for transferring files with URL syntax, which means we can transfer almost any type of file with it. Most (though not all) web servers already have the cURL library module installed, so you usually won’t have to do anything before you can begin using this powerful library.
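To give you a taste before we dig into the details, here is a minimal sketch of fetching a page with PHP’s cURL functions (again, the URL is a placeholder):

<?php
// Initialize a cURL session for the target URL.
$ch = curl_init('http://www.example.com/');

// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Execute the request and clean up.
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false) {
    echo strlen($html) . " bytes retrieved\n";
}
?>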

cURL has the ability to transfer files using an extensive list of protocols, including:

FTP
FTPS
HTTP
HTTPS
TFTP
SCP
SFTP
Telnet
DICT
FILE
LDAP

As you can see, cURL can use not only the HTTP protocol (which is what PHP’s file_get_contents function uses) but also the FTP protocol, which can prove very useful if you want to create a web spider that uploads files to a server automatically or FTPs videos to video-sharing sites.
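For example, here is a rough sketch of an FTP upload with cURL; the host, credentials, and file names are all placeholders:

<?php
// Local file to upload and the FTP destination (both placeholders).
$localFile = 'video.mp4';
$ch = curl_init('ftp://ftp.example.com/uploads/video.mp4');
$fp = fopen($localFile, 'rb');

curl_setopt($ch, CURLOPT_USERPWD, 'username:password'); // placeholder credentials
curl_setopt($ch, CURLOPT_UPLOAD, true);                 // switch cURL into upload mode
curl_setopt($ch, CURLOPT_INFILE, $fp);                  // stream the file from this handle
curl_setopt($ch, CURLOPT_INFILESIZE, filesize($localFile));

$ok = curl_exec($ch);
curl_close($ch);
fclose($fp);

echo $ok ? "Upload finished\n" : "Upload failed\n";
?>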

The good news is that cURL is so powerful it can do almost everything you will ever need when it comes to web page scraping. The downside is that cURL can be very tricky to deal with, because there are a tremendous number of options to set and pitfalls to sidestep.
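As a taste of those options, here is a sketch that downloads a picture straight to disk (the use case mentioned earlier), following redirects and timing out rather than hanging forever; the URL and file name are placeholders:

<?php
// Open a local file handle so cURL can write the image bytes to disk.
$fp = fopen('picture.jpg', 'wb');

$ch = curl_init('http://www.example.com/images/picture.jpg'); // placeholder URL
curl_setopt($ch, CURLOPT_FILE, $fp);            // write the response body to this file
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds

if (!curl_exec($ch)) {
    // curl_error() tells you which pitfall you hit.
    echo 'cURL error: ' . curl_error($ch) . "\n";
}

curl_close($ch);
fclose($fp);
?>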

What I hope to do in this series of tutorials is show you how to work with cURL and how to create your own web scraping class in PHP so you can reuse the code time and time again. So let’s begin…

cURL and Your Web Server

As I mentioned, most of the time cURL is already set up on your web server if you are using a hosted plan. (Sometimes on the “cheaper” plans cURL is disabled, so contact your administrator to see if they will enable it for you.)
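A quick way to find out whether cURL is enabled is a one-line test script; if it prints “no”, ask your host (or follow the XAMPP steps below):

<?php
// extension_loaded() reports whether the cURL extension is loaded.
echo extension_loaded('curl') ? "yes\n" : "no\n";
?>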

I personally do most of my web page scraping using my local web server. That’s right, you don’t even need to pay for a hosted server to scrape web pages. All you need is a computer and a web server stack like XAMPP!

If you are using XAMPP, as I recommended in my tutorial Creating a Local Development Environment, you will need to enable the cURL module in PHP.

To do this, open the php.ini file in your xampp/php folder (and the one in your xampp/apache/bin folder) and uncomment the “php_curl.dll” line by removing the semicolon:

; Windows Extensions
; Note that ODBC support is built in, so no dll is needed for it.
; Note that many DLL files are located in the extensions/ (PHP 4) ext/ (PHP 5)
; extension folders as well as the separate PECL DLL download (PHP 5).
; Be sure to appropriately set the extension_dir directive.

;extension=php_apc.dll
;extension=php_apd.dll
;extension=php_bcompiler.dll
;extension=php_bitset.dll
;extension=php_blenc.dll
;extension=php_bz2.dll
;extension=php_bz2_filter.dll
;extension=php_classkit.dll
;extension=php_cpdf.dll
;extension=php_crack.dll
extension=php_curl.dll
;extension=php_cvsclient.dll
;extension=php_db.dll
;extension=php_dba.dll
;extension=php_dbase.dll
;extension=php_dbx.dll

Save the changes and restart your web server.
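To confirm the module loaded, you can drop a small test script into your web root; curl_version() returns information about the cURL build PHP is using:

<?php
// Only works once the extension is enabled; prints the libcurl version.
$info = curl_version();
echo 'cURL version: ' . $info['version'] . "\n";
?>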

You are now ready to start scraping the web. In the next tutorial, I will show you how you can create your own web scraping class in PHP using cURL.
