A PHP web crawler, spider, bot, or whatever you want to call it, is a program that automatically fetches and processes data from websites, for many uses.

Google, for example, indexes and ranks pages automatically via powerful spiders, crawlers and bots. We also have link checkers, HTML validators, automated optimizers, and web spies. Yeah, web spies. This is what we will be building now.


Actually I don’t know if this is a common term, or if it’s ever been used before, but I think it perfectly describes this kind of application. The main goal here is to create software that monitors your competitors’ prices so you can always be up to date with market changes.

You might think “Well, it is useless to me. You know, I’m a freelancer, I don’t have to deal with this ‘price comparison’ thing.” Don’t worry, you are right. But you may have customers that have a lot of competitors they want to watch closely. So you can always offer this as a “plus” service (feel free to charge for it, I’ll be glad to know that), and learn a little about this process.

So, let’s rock!


Table of Contents:

  1. PHP Web Crawler Tutorial
  2. Getting Smarter Code with PHP Variable Variables and Variable Functions

PHP Web Crawler Tutorial

In this tutorial you will learn how to create a great app for gathering data from other sites, which you can use to your advantage over competitors.


1 – Requirements

  • PHP server with Linux – We need to use crontab here, so it is better to get a good online server
  • MySQL – We will store data with it, so you will need a database

2 – Basic crawling

We will start by trying a basic crawling function: get some data. Let’s say that I sell shoes, and Zappos is my competitor (just dreaming about it). The first product I want to monitor is a beautiful pair of Nike Free Run+. We will now use fopen to open the page, fgets to read each line of the page and feof to check when we have reached the end. For this, you need allow_url_fopen enabled on your server (you can check it via phpinfo()). Our first piece of code will be:

<?php
	if (!$fp = fopen("http://www.zappos.com/nike-free-run-black-victory-green-anthracite-white?zlfid=111", "r")) {
		return false;
	}
	//our fopen is right, so let's go
	$content = "";
	while (!feof($fp)) {
		//while it is not the last line, we will add the current line to our $content
		$content .= fgets($fp, 1024);
	}
	fclose($fp); //we are done here, don't need the main source anymore
?>

At this point, if you echo $content in a browser you will notice that the whole page shows up without any CSS or JS, because the Zappos site loads them all through relative paths.
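
If you prefer a single call over the fopen loop, the same fetch step can be sketched as a small function. This is only a sketch under my own assumptions: the name fetchPage and the 10-second timeout are not part of the original code, and it relies on allow_url_fopen just like fopen does.

```php
<?php
// Hypothetical helper (not from the original tutorial): returns the page
// source as a string, or false when the URL cannot be opened.
function fetchPage($url, $timeoutSeconds = 10)
{
    // The 'http' options apply to http:// and https:// URLs;
    // they are simply ignored for local files.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));

    // @ silences the warning on failure; the caller checks for false instead.
    return @file_get_contents($url, false, $context);
}
```

Calling fetchPage() with the Zappos URL from the snippet above should give you the same $content, and a strict check for false tells you when the page could not be loaded.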

Now that we have the content, we need to extract the product price.

How do you tell a price apart from the other ordinary data on the page? Well, it is easy to notice that every price has a “$” before it, so we will take the whole content and run a regular expression that finds every dollar sign followed by an amount.

But our regular expression will match every price on the page. Since Zappos is a good friend of spies, it always puts the “official” price first. The others are just used in JavaScript, so we can ignore them.

Our REGEX and price output will be something like this:

<?php
//our fopen, fgets here

//our magic regex here: a dollar sign, digits with optional thousands commas, a dot and two decimals
	preg_match_all('/(\$[0-9][0-9,]*\.[0-9]{2})/', $content, $prices, PREG_SET_ORDER);
	echo $prices[0][0]."\n";
?>

Wow, we now have the price. Don’t throw away the other prices; we will need them if Zappos changes something on their site.
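
To see what the regular expression step returns without hitting the live site, here is a small sketch that runs a price pattern against a hard-coded HTML fragment. The fragment, the function name extractPrices and the exact pattern are my assumptions for illustration; the real Zappos markup will differ.

```php
<?php
// Hypothetical helper: returns every "$123.45"-style price found in $html,
// in the order it appears on the page.
function extractPrices($html)
{
    // A dollar sign, one digit, optional digits or commas, a dot, two decimals.
    preg_match_all('/\$[0-9][0-9,]*\.[0-9]{2}/', $html, $matches);
    return $matches[0];
}

// Made-up sample markup, standing in for the downloaded page.
$sample = '<span class="price">$84.99</span><script>var msrp = "$1,100.00";</script>';

$found = extractPrices($sample);
echo $found[0] . "\n";              // the first match is treated as the "official" price
echo implode(' | ', $found) . "\n"; // every price that was found
```

Note the single quotes around the pattern and the sample: inside double quotes PHP would try to treat the dollar sign as the start of a variable or an escape.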

3 – Save data in MySQL

Let’s prepare our DB to receive this data. Let’s create a table called zappos. Inside of it we will have four columns:

  • ID – Primary key on this table
  • Date – When data was stored. It’s good to store this so you can do some reports.
  • Value – Value that you’ve found
  • Other_Values – Values that aren’t what you want, but it’s important to store them so if the site owner changes the code you have a “backup” of the possible values

In phpMyAdmin I’ve created a database called spy, and inside it my table zappos, this way:

CREATE TABLE IF NOT EXISTS `zappos` (
  `ID` int(5) NOT NULL AUTO_INCREMENT,
  `Date` date NOT NULL,
  `Value` float NOT NULL,
  `Other_Values` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Once you’ve created your table, we will start adding some data. So we will need to do a mysql connect in our PHP and prepare our prices to be saved.

Since our data is not made of perfect floats, we need to prepare it so we have just numbers and a dot.
To connect to our DB we will use mysqli (the old mysql_connect family was removed in PHP 7): new mysqli() opens the connection and selects “spy”, and then we can use prepare and query to save or get our data.

<?php

//preparing to save all the prices found, as a backup besides our "official" price
	$otherValues = "";
	foreach ($prices as $price) {
		$otherValues .= str_replace( array("$", ",", " "), '', $price[0]); //we need to save it as a "float" value, so strip spaces, commas and any other string stuff here
		$otherValues .= ","; //so we can separate each value via explode when we need
	}

//if someday Zappos changes its order (or you change the site you want to spy on), just change here
	$mainPrice = str_replace( array("$", ",", " "), '', $prices[0][0]);

//let's save our date in format YYYY-MM-DD
	$date = date('Y-m-d');

	$dbhost  = 'localhost';
	$dbuser  = 'root';
	$dbpass  = '';
	$dbname  = "spy";
	$dbtable = "zappos";

	$conn = new mysqli($dbhost, $dbuser, $dbpass, $dbname);
	if ($conn->connect_error) {
		die('Error connecting to MySQL: ' . $conn->connect_error);
	}
	echo "Connected to MySQL and to the database\n";

	//save data (ID is AUTO_INCREMENT, so we leave it out; the prepared statement escapes our values)
	$insert = $conn->prepare("INSERT INTO `$dbtable` (`Date`, `Value`, `Other_Values`) VALUES (?, ?, ?)");
	$insert->bind_param('sds', $date, $mainPrice, $otherValues);
	$insert->execute();
	$insert->close();

	//get data
	$results = $conn->query("SELECT * FROM `$dbtable`");

//all data comes as a mysqli_result object, so we need to prepare it to be shown
	while ($row = $results->fetch_assoc()) {
		echo "ID : {$row['ID']} " .
			 "Date : {$row['Date']} " .
			 "Value : {$row['Value']}";
		echo "\n";
	}

	$results->free();
	$conn->close(); //close only after we have read all the rows

?>
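
The price clean-up step above can also be sketched as two small functions, which makes it easier to try out without a database. The names normalizePrice and joinPrices are hypothetical, and the sketch assumes US-style prices (comma for thousands, dot for decimals).

```php
<?php
// Hypothetical helper: "$1,100.00" -> 1100.0
function normalizePrice($raw)
{
    // Strip the currency sign, thousands separators and spaces, then cast.
    return (float) str_replace(array('$', ',', ' '), '', $raw);
}

// Hypothetical helper: builds the comma-separated backup string for Other_Values.
function joinPrices(array $rawPrices)
{
    $clean = array();
    foreach ($rawPrices as $raw) {
        $clean[] = str_replace(array('$', ',', ' '), '', $raw);
    }
    // implode avoids the trailing comma that a concatenation loop leaves behind.
    return implode(',', $clean);
}

$raw = array('$84.99', '$1,100.00');
echo normalizePrice($raw[1]) . "\n"; // 1100
echo joinPrices($raw) . "\n";        // 84.99,1100.00
```

Without the trailing comma, explode(',', ...) gives you back exactly one entry per price when you need to read the backup values again.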

4 – Smarter spy with Crontab

Well, with crontab we can schedule tasks in our (Linux) system so they run automatically. It is useful for backup routines, site optimizing routines and many more things that you just don’t want to do manually.

Since our crawler needs some fresh data, we will create a cron job that runs every day at 2am. On net.tuts+ we have a really good tutorial on how to schedule tasks with cron, so if you aren’t too familiar with it, feel free to check it out.

In short, there are two command lines that we could use for it (the second is my favorite):

#here we load php and give the physical path of the file
#0 2 * * * says that it should run at minute zero, hour two, any day of month, any month and any day of week
0 2 * * * /usr/bin/php /www/virtual/username/cron.php > /dev/null 2>&1

#my favorite, with wget the page is processed as if it were loaded in a common browser
#-q keeps wget quiet and -O /dev/null discards the downloaded page instead of saving a copy on every run
0 2 * * * wget -q -O /dev/null http://whereismycronjob/cron.php
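
If the five schedule fields are new to you, this little sketch labels each one so you can check a schedule before putting it in your crontab. The function name cronFields is made up, and it only handles the plain five-field form used above.

```php
<?php
// Hypothetical helper: splits a five-field cron schedule into named parts.
function cronFields($schedule)
{
    $names  = array('minute', 'hour', 'day_of_month', 'month', 'day_of_week');
    $values = preg_split('/\s+/', trim($schedule));

    if (count($values) !== count($names)) {
        return false; // not a plain five-field schedule
    }
    return array_combine($names, $values);
}

print_r(cronFields('0 2 * * *'));
```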

5 – Let’s do some pretty charts

If you are planning to use this data, just a db record won’t be too useful. So after all this work we need to present it in a sexier way.

Almost all our work here will be done by the gvChart jQuery plugin. It gets all our data from tables and makes some cool charts out of it. What we actually have to do is print our results as a table, so it can be used by gvChart. Our code this time will be (download our demo for more info!):

<?php
	$dbhost  = 'localhost';
	$dbuser  = 'root';
	$dbpass  = '';
	$dbname  = "spy";
	$dbtable = "zappos";

	$conn = new mysqli($dbhost, $dbuser, $dbpass, $dbname);
	if ($conn->connect_error) {
		die('Error connecting to MySQL: ' . $conn->connect_error);
	}

	//get data
	$results = $conn->query("SELECT * FROM `$dbtable` ORDER BY `ID` DESC LIMIT 15");

	$dates  = array();
	$values = array();
	while ($row = $results->fetch_assoc()) {
		$dates[]  = explode('-', $row['Date']); //array(year, month, day)
		$values[] = $row['Value'];
	}
	$conn->close();

	//label for the data row: year and month of the last date read (empty when there is no data yet)
	$last  = end($dates);
	$label = $last ? $last[0] . "-" . $last[1] : "";

	echo "<table id='real'>\n";
	echo "<caption>Real Prices on Zappos.com</caption>\n";
	echo "<thead>\n";
	echo "<tr>\n";
	echo "<th></th>\n";
	foreach ($dates as $date) {
		echo "<th>" . $date[2] . "</th>\n"; //day of month
	}
	echo "</tr>\n";
	echo "</thead>\n";
	echo "<tbody>\n";
	echo "<tr>\n";
	echo "<th>" . $label . "</th>\n";
	foreach ($values as $value) {
		echo "<td>" . $value . "</td>\n";
	}
	echo "</tr>\n";
	echo "</tbody>\n";
	echo "</table>\n";
?>

Are you hungry yet?

I think there’s a lot to improve on yet. You could, for example, build a “waiting list” of URLs so you could crawl a lot of URLs with a single call (of course each URL could have its own REGEX and “official price”, if they are from different sites).
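
As a starting point, the waiting list idea could look something like the sketch below. Everything in it is an assumption for illustration: the function name crawlTargets, the shape of the target array, the example.com-style shops and the fake fetcher that stands in for a real download so the example runs without network access.

```php
<?php
// Hypothetical sketch of the "waiting list": each entry carries its own URL,
// its own regex, and the position of the "official" price among the matches.
function crawlTargets(array $targets, $fetch)
{
    $report = array();
    foreach ($targets as $name => $target) {
        $html = call_user_func($fetch, $target['url']);
        if ($html === false) {
            $report[$name] = null; // page could not be loaded
            continue;
        }
        preg_match_all($target['regex'], $html, $matches);
        $index = $target['official'];
        $report[$name] = isset($matches[0][$index]) ? $matches[0][$index] : null;
    }
    return $report;
}

// Made-up targets and pages.
$targets = array(
    'shop-a' => array('url' => 'http://shop-a.example/shoe', 'regex' => '/\$[0-9][0-9,]*\.[0-9]{2}/', 'official' => 0),
    'shop-b' => array('url' => 'http://shop-b.example/shoe', 'regex' => '/[0-9]+,[0-9]{2} EUR/', 'official' => 1),
);
$fakePages = array(
    'http://shop-a.example/shoe' => '<b>$84.99</b> <i>$99.00</i>',
    'http://shop-b.example/shoe' => 'was 95,00 EUR now 79,90 EUR',
);
// Fake fetcher: returns the canned page, or false like a failed download would.
$fetch = function ($url) use ($fakePages) {
    return isset($fakePages[$url]) ? $fakePages[$url] : false;
};

print_r(crawlTargets($targets, $fetch));
```

In a real run you would pass a function that actually downloads the URL instead of the fake one, and save each result to its own table or with a site column.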

And what do you think we could improve?

Now that we are already in PHP let’s take a look at PHP more in depth.


Getting Smarter Code with PHP Variable Variables and Variable Functions

Oh, variables. Who doesn’t love them? They are such nice guys, you know. When you need something, just call $something and your value is there. But something really cool is that you actually don’t need the name of the variable. You can use other variables to access a value of one variable.


For example, let’s say you have two variables $jude = "hey" and $hey = "jude". If you echo $$hey . " " . $$jude (yeah, double “$”) your output will be “hey jude”, because $$hey is the variable named by the value of $hey, which is $jude.

But, as you might be thinking, it is not just about variables. You can dynamically name functions, methods, arrays, and almost anything you want to.

This time we will see some uses for it, with arrays, functions, classes and how this technique can help you write better code.

So, let’s rock!

Why You Should Use It

Sometimes we need software that is extremely flexible and that we can parametrize. You have to prepare the whole thing, of course, but part of it just comes from user input, and we have no time to change the software just because the user needs a new input.

With variable variables and variable functions you can solve problems that would be much harder to solve without them. We will see some examples below, so I won’t spend long on this, I just want you to keep in mind that you have an option, dear padawan. You can do what your customer wants in a simple way.

Why You Shouldn’t Use It

Like anything in life, this too has a downside.

Truth be told, I don’t use variable variables all the time, and this is because they can make your code a mess. Believe me, for real, a mess.

Instead of reading $imTheProductID you have to read $$product and remember that it refers to product ID. Thus, use it with caution, and when you use it comment the code. I don’t want anybody saying “Oh, man, I’ll kill that Rochester. This guy made my code impossible to read!”

Let’s see what you can do with it.

Variable Ordinary Variables

This is the most basic usage. As I said in my introduction, you can write:

<?php
	$she = "loves";
	$loves = "you";
	$you = "she";

	echo $$loves." ".$$you." ".$$she; // ♫♫ yeah, yeah, yeah ♫♫
	// same as:
	// echo $you." ".$she." ".$loves;
?>

If you try this, you will see the output “she loves you”. As you may notice, this is quite confusing, so be careful when you use it.

But you can do much more than echo a song through unreadable variables. Let’s say you want to generate some dummy vars for testing purposes. As Andre exemplified in the php.net documentation, you can do this (modified a little bit):

<?php
$a = "a";
for ($i = 1; $i <= 5; $i++) {
  ${$a.$i} = "value";
}

echo "$a1, $a2, $a3, $a4, $a5";
//Output is value, value, value, value, value
?>

The important thing to note here is the curly brackets. In this context they work like parentheses in math operations; they tell the PHP processor “Hey, you should first join $a to $i and then create the variable”. With them you can create variables by joining other values and mixing in strings, if you want (if you do that, I would recommend you use single quotes to prevent PHP warnings).
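
Here is a tiny sketch of that mixing, with a single-quoted string prefix joined to a counter and then to another variable. The price_ prefix and the values are made up for the example.

```php
<?php
// Build variable names from a string prefix plus a counter.
for ($i = 1; $i <= 3; $i++) {
    ${'price_' . $i} = $i * 10; // creates $price_1, $price_2, $price_3
}

// The same works with a string taken from another variable.
$suffix = 'total';
${'price_' . $suffix} = $price_1 + $price_2 + $price_3; // creates $price_total

echo $price_total . "\n"; // 60
```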

Variable Arrays

Let’s say you have two groups of data, stored in arrays. If you want to switch between them, the usual way is to create an if / else statement, right?

Well, you could do it via variable arrays. Let’s see how it could be:

<?php
	$i = 3; //we want the 4th item in the array

	$product = array ( 'TLP2844', 'OSX214Plus', 'E-4205', 'TTP244Plus' );
	$manufacturer = array ( 'Zebra', 'Argox', 'Datamax', 'TSC' );

	$select = isset($_GET['filter']) ? $_GET['filter'] : 'product';

	//hard way
	if ( $select == "product" ) {
		echo $product[$i];
	} else {
		echo $manufacturer[$i];
	}

	//easy way (only for names we expect, since $select comes from user input)
	if ( in_array($select, array('product', 'manufacturer'), true) ) {
		echo ${$select}[$i];
	}
?>

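Again, look at the curly brackets. Without them, PHP 5 reads $$select[$i] as ${$select[$i]} and looks up the wrong variable, which ends in an ugly error. PHP 7 and later evaluate it left to right, but the brackets still make the intent explicit.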

This is a very simple example, but you could apply this to many other things, and the main advantage is that if you add, let’s say, another 100 arrays, you don’t have to create 99 “if / elseif / else” statements (or switch, for the smarter programmers :D).
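
For comparison, here is a different technique that gets the same flexibility without variable variables: one array keyed by group name. This is my own alternative sketch, not the approach of this article, and the names $groups and pickItem are made up. It is handy when the name comes from user input, because an unknown name can never reach an unrelated variable.

```php
<?php
// Hypothetical alternative: one array keyed by group name instead of one
// variable per group.
$groups = array(
    'product'      => array('TLP2844', 'OSX214Plus', 'E-4205', 'TTP244Plus'),
    'manufacturer' => array('Zebra', 'Argox', 'Datamax', 'TSC'),
);

// Returns the item, or null when the group or the position does not exist.
function pickItem(array $groups, $select, $i)
{
    return isset($groups[$select][$i]) ? $groups[$select][$i] : null;
}

echo pickItem($groups, 'manufacturer', 3) . "\n"; // TSC
var_dump(pickItem($groups, 'dbpass', 3));          // NULL
```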

Variable Functions and get_class_methods

As in the code above, variable functions are a good alternative to endless if / else statements or switches.
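
A minimal sketch of a plain variable function call, assuming two made-up formatter functions. The list of allowed names is my own addition, so a name coming from a form or a database can only ever call one of the functions you expect.

```php
<?php
// Hypothetical formatters; the names are made up for this example.
function formatAsText($price)
{
    return 'Price: ' . number_format($price, 2);
}

function formatAsHtml($price)
{
    return '<strong>' . number_format($price, 2) . '</strong>';
}

// Calls the function whose name is stored in $formatter, but only if it is
// one of the names we expect.
function render($formatter, $price)
{
    $allowed = array('formatAsText', 'formatAsHtml');
    if (!in_array($formatter, $allowed, true)) {
        return false;
    }
    return $formatter($price); // variable function call
}

echo render('formatAsHtml', 84.99) . "\n"; // <strong>84.99</strong>
```

Adding a third formatter means writing the function and adding its name to the list; the calling code stays the same.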

But another really good use of variable functions is to dynamically define which method should be called, based on a variable. Well, examples are always better for this.

Let’s say you sell gadgets online. As a good seller, you work with a lot of transport companies, but which one you use depends on the product bought. When one of your 1,000 employees registers a new product, the shipping company it can use is saved along with it.

Again, if you use common logic you would use a switch, and when you add a new method, it would be a nightmare.

Here, what we could do is use variable functions: when you save the product data, you also save its shipping method. The magic here is to use get_class_methods to list the available method names, store the chosen one with the product, and later use it as the method name when we calculate the shipment price.

So it would be something like this:

<?php
/*******************
OUR CLASS
********************/
	class Shipping {
		public function free( $data ) {
			return 0;
		}

		public function smallProduct( $data ) {
			$price = 100;
			return $price;
		}

		public function mediumProduct( $data ) {
			$price = 300;
			return $price;
		}

		public function fragileProduct( $data ) {
			$price = 1000;
			return $price;
		}
	}

/****************
WHEN WE SAVE OUR PRODUCT
*****************/
	include ('pathToOurClassFile.php'); //only needed when the class lives in its own file
	$class_methods = get_class_methods('Shipping'); // Shipping is our class name!
	//$class_methods output Array ( [0] => free [1] => smallProduct [2] => mediumProduct [3] => fragileProduct )

/***************
WHEN WE SET OUR SHIPMENT PRICE
****************/
	$myMethod = "mediumProduct"; //it should come from our DB, stored in product data
	$data = array(); //product data that the shipping method may need

	$shipment = new Shipping();
	$price = $shipment->$myMethod( $data );
	echo $price;

?>
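
Since the method name comes from stored data, it can be worth checking it against get_class_methods again right before the call. The sketch below does that with a stand-in class; DemoShipping, shipmentPrice and the prices are made up for the example and are not part of the code above.

```php
<?php
// Stand-in class for the example; the method names and prices are made up.
class DemoShipping
{
    public function free($data)
    {
        return 0;
    }

    public function smallProduct($data)
    {
        return 100;
    }
}

// Returns the shipment price, or null when $method is not a method of the class.
function shipmentPrice($method, $data)
{
    if (!in_array($method, get_class_methods('DemoShipping'), true)) {
        return null;
    }
    $shipment = new DemoShipping();
    return $shipment->$method($data); // variable method call
}

var_dump(shipmentPrice('smallProduct', array())); // int(100)
var_dump(shipmentPrice('dropTable', array()));    // NULL
```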

Are you hungry yet?

I think it is a really interesting topic. Why not read a little bit more about it? My main sources were the php.net manual pages on variable variables and variable functions, and on the magic get_class_methods PHP function.
