January 1, 2012

How to skip links containing file extensions while web scraping using PHP

Question by Spoilt

Here is a function that validates a .edu TLD and checks that the URL does not point to a .pdf or .doc document.

public function validateEduDomain($url) {
    // Accept only URLs on a .edu host that do not end in .pdf or .doc
    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9.-]+\.edu/i', $url) && !preg_match('/\.(pdf|doc)$/i', $url) )  {
        return TRUE;
    }
    return FALSE;
}
Now I am encountering links that point to .jpg, .rtf and other files that simple_html_dom tries to parse and return the content of. I want to avoid this by skipping all such links, but the list of extensions is non-exhaustive. How am I supposed to skip them all??

Answer by Maerlyn

Trying to filter URLs by guessing what's behind them will always fail in a number of cases. Assuming you are using cURL to download, you should check whether the response Content-Type header is among the acceptable ones:

<?php

require "simple_html_dom.php";

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //default is to output it

$urls = array(
  "google.com", 
  "https://www.google.com/logos/2012/newyearsday-2012-hp.jpg", 
  "http://cran.r-project.org/doc/manuals/R-intro.pdf",
);
$acceptable_types = array("text/html", "application/xhtml+xml");

foreach ($urls as $url) {
  curl_setopt($curl, CURLOPT_URL, $url);
  $contents = curl_exec($curl);

  //we need to handle content-types like "text/html; charset=utf-8"
  list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));

  if (in_array($response_type, $acceptable_types)) {
    echo "accepting {$url}n";
    // create a simple_html_dom object from string
    $obj = str_get_html($contents);
  } else {
    echo "rejecting {$url} ({$response_type})n";
  }
}

running the above results in:

accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)
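
One refinement worth noting (my addition, not part of the original answer): with this approach cURL still downloads the whole file before the type check, so a large PDF is fetched only to be thrown away. If that matters, you can probe the Content-Type with a HEAD request first using CURLOPT_NOBODY, and only fetch the body when the type is acceptable. A minimal sketch, reusing the same acceptable-types list and an example URL:

<?php

// Probe the Content-Type with a HEAD request before downloading the body.
$acceptable_types = array("text/html", "application/xhtml+xml");
$url = "http://cran.r-project.org/doc/manuals/R-intro.pdf"; // example URL

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);         // send a HEAD request: headers only, no body
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_exec($curl);

// handle content-types like "text/html; charset=utf-8"
list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));
curl_close($curl);

if (in_array($response_type, $acceptable_types)) {
  echo "would download and parse {$url}\n";
} else {
  echo "skipping {$url} ({$response_type})\n";
}

Keep in mind that some servers answer HEAD requests incorrectly or not at all, so you may still want to fall back to the full-GET check above when the probe returns an empty Content-Type.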

Answer by Starx

Update the last regex to something like this

!preg_match('/\.(pdf|doc|jpg|rtf)$/i', $url) )

This will filter out the .jpg and .rtf documents.

You have to add each extension you want to skip to the regex above.

Update

I don't think it's possible to block every sort of extension, and I personally do not recommend it for scraping either. You will have to skip some extensions to keep crawling. Why don't you change your regex filter to match only the extensions you would like to accept, like

preg_match('/\.(htm|html|php|aspx)$/i', $url) )
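
As a sketch (the accepted extensions are only examples), the whitelisted version of the original function would simply drop the negation and keep everything else the same:

public function validateEduDomain($url) {
    // Accept only .edu URLs whose path ends in a whitelisted extension.
    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9.-]+\.edu/i', $url) && preg_match('/\.(htm|html|php|aspx)$/i', $url) )  {
        return TRUE;
    }
    return FALSE;
}

Note that this also rejects extension-less URLs such as http://www.stanford.edu/about (an example), so the whitelist approach trades missed pages for safety.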

Author: Nabin Nepal (Starx)
