March 3, 2013

How to extract HTML tags from the web page generated at runtime

Question by shailbenq

I am using a SimpleHTMLDOM parser to extract HTML data from web pages, but I came across websites such as www.coursera.com where the page is generated at runtime.

Has anyone tried parsing such pages?

I am new to this field, so some theory on this topic would help me understand how to parse web pages.

Answer by Starx

John Resig wrote an HTML Parser.

Demo: http://ejohn.org/blog/pure-javascript-html-parser/

This may work for you.
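
As some background on why such pages are hard to parse: a server-side parser only sees the markup the server sends, not what JavaScript adds later in the browser. The sketch below tries to illustrate this with PHP's built-in DOMDocument and a made-up page. The `courses` id and the inline script are assumptions for illustration and are not taken from any real site.

```php
<?php
// Hypothetical page: the server sends an empty container, and a script
// fills it in only when a browser runs it.
$html = <<<'HTML'
<html><body>
<div id="courses"></div>
<script>
document.getElementById("courses").textContent = "Course list loaded at runtime";
</script>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$container = $xpath->query('//div[@id="courses"]')->item(0);

// The script was parsed as text but never executed, so the div is still empty.
$staticText = trim($container->textContent);
var_dump($staticText);   // string(0) ""
```

The parser finds the container but no content, because nothing ran the script. To get at the generated markup you would most likely need a tool that executes the JavaScript first, or the data source the script itself loads from.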

March 31, 2012

preg_replace script, link tag not working

Question by john

I used the following code to remove script and link tags from my string:

$contents='<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
$ss=preg_replace('#<script(.*?)>(.*?)</script>#is', '', $contents);
echo htmlspecialchars($ss);

It works fine, but can I use something closer to HTML parsing rather than preg_replace for this?

Answer by Starx

Here are a few things you can do:

  1. htmlspecialchars() escapes those tags so the browser shows them as plain text
  2. strip_tags() removes all HTML tags (but keeps the text between them)
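
As a rough illustration of how those two functions behave, here they are applied to the string from the question:

```php
<?php
$contents = '<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';

// htmlspecialchars() escapes the markup so a browser displays it as text.
$escaped = htmlspecialchars($contents);
echo $escaped, "\n";
// &lt;script&gt;inside tag&lt;/script&gt;hfgkdhgjh&lt;script&gt;inside 2&lt;/script&gt;

// strip_tags() drops the tags themselves but keeps the text between them.
$stripped = strip_tags($contents);
echo $stripped, "\n";
// inside taghfgkdhgjhinside 2
```

Neither call removes the body of the script element, so they are not a drop-in replacement for the regex in the question.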

But the technique you are using is the correct one. However, here is an improved version:

echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $contents);
...
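
For the "something closer to HTML parsing" part of the question, one possible approach is PHP's built-in DOMDocument. The following is only a minimal sketch: `removeScriptAndLinkTags` and the `wrapper` id are hypothetical names introduced here, and the code assumes the input is a small HTML fragment in an ASCII-compatible encoding.

```php
<?php
/**
 * Remove every <script> and <link> element (with its contents) from an
 * HTML fragment. The function name is a hypothetical helper.
 */
function removeScriptAndLinkTags(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Wrap the fragment so only its own nodes are serialized afterwards.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Collect the matches into an array, then detach each one from its parent.
    $nodes = iterator_to_array($xpath->query('//script | //link'));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize whatever is left inside the wrapper.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$contents = '<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
echo removeScriptAndLinkTags($contents), "\n";   // hfgkdhgjh
```

Compared with the regex, this is more code, but it does not depend on how the tags happen to be written (attributes, casing, line breaks).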
