How to skip links containing file extensions while web scraping using PHP
Question by Spoilt
Here is a function that validates the .edu TLD and checks that the URL does not point to a .pdf or a .doc document.
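public function validateEduDomain($url) {
    // Escape the slashes and dots, and group the extensions so that "$" applies to every alternative
    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $url) && !preg_match('/\.(pdf|doc)$/i', $url) ) {
        return TRUE;
    }
    return FALSE;
}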
Now I am encountering links that point to jpg, rtf and other files, which simple_html_dom tries to parse and return the content of. I want to avoid this by skipping all such links. The problem is that the list of extensions is open-ended, and I want the code to skip every such link. How can I do that?
Answer by Maerlyn
Trying to filter URLs by guessing what is behind them will always fail in some cases. Assuming you are using curl to download, you should check whether the response's Content-Type header is among the acceptable ones:
<?php
require "simple_html_dom.php";

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //default is to output it

$urls = array(
    "google.com",
    "https://www.google.com/logos/2012/newyearsday-2012-hp.jpg",
    "http://cran.r-project.org/doc/manuals/R-intro.pdf",
);
$acceptable_types = array("text/html", "application/xhtml+xml");

foreach ($urls as $url) {
    curl_setopt($curl, CURLOPT_URL, $url);
    $contents = curl_exec($curl);

    //we need to handle content-types like "text/html; charset=utf-8"
    list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));

    if (in_array($response_type, $acceptable_types)) {
        echo "accepting {$url}\n";
        // create a simple_html_dom object from string
        $obj = str_get_html($contents);
    } else {
        echo "rejecting {$url} ({$response_type})\n";
    }
}
Running the above results in:
accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)
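The header handling in the loop above could be pulled into a small helper, which also makes it possible to check the type before downloading the body. This is only a sketch: the function names normalize_content_type and fetch_content_type are made up for illustration, and it assumes the server answers a HEAD request with the same Content-Type it would send for a GET, which not every server does.

```php
<?php
// Hypothetical helper (not from the answer): reduce a raw Content-Type value
// such as "Text/HTML; charset=utf-8" to the bare media type "text/html".
function normalize_content_type($header)
{
    if (!is_string($header) || $header === '') {
        return ''; // curl_getinfo() yields no usable string when the header is missing
    }
    list($type) = explode(';', $header, 2);
    return strtolower(trim($type));
}

// Hypothetical helper: request the headers only and return the normalized
// media type, so that large binaries are never downloaded.
function fetch_content_type($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);         // HEAD instead of GET
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // report the type of the final target
    curl_exec($curl);
    return normalize_content_type(curl_getinfo($curl, CURLINFO_CONTENT_TYPE));
}
```

A crawler could call fetch_content_type($url) first and only run a full download plus str_get_html() when the result is in $acceptable_types. The price is one extra request per URL.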
Answer by Starx
Update the last regex to something like this:
!preg_match('/\.(pdf|doc|jpg|rtf)$/i', $url) )
This will also filter out the jpg and rtf documents.
You have to add every extension you want to omit to the regex above.
Update
I don't think it's possible to block every kind of extension, and I personally don't recommend it for scraping either. You will have to skip some extensions to keep crawling. Why don't you change your regex filter to list the extensions you would like to accept instead, like
preg_match('/\.(htm|html|php|aspx)$/i', $url) )
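One way to apply such a whitelist without being tripped up by query strings is to look only at the path part of the URL. The following is a rough sketch rather than part of the answer: the function name has_allowed_extension, the default list of extensions, and the decision to accept paths with no extension at all (such as http://example.edu/about/) are all assumptions.

```php
<?php
// Hypothetical helper: accept a URL if its path has no extension or an
// extension from the whitelist. Query string and fragment are ignored.
function has_allowed_extension($url, array $allowed = array('htm', 'html', 'php', 'aspx'))
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return true; // no path component, e.g. "http://example.edu"
    }
    // pathinfo() takes the extension from the last path segment only
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $extension === '' || in_array($extension, $allowed, true);
}
```

Note that an extensionless URL can still serve a PDF or an image, so a check like this only narrows the candidate list. The Content-Type header of the response remains the more reliable signal.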