May 6, 2012

RegEx for <title> with leading, trailing, linebreak

Question by user1377738

For most websites I can easily parse the title with a RegEx such as “<title>(.*)</title>” or “<title>\s*(.+?)\s*</title>”. However, some sites format it a bit differently, like http://www.youtube.com (see below), and the expressions above do not work. Any help catching this kind of format, and any other HTML formats?

Thanks
-Tim.

<title>
  YouTube - Broadcast Yourself.
</title>

Answer by Fèlix Galindo Allué

If you want the regular expression to match line breaks, in most cases you only need to use \n inside the expression. That said, which language/interpreter are you using? Some of them don't allow multiline expressions.

If they are permitted, something like (.|\n|\r)* would suffice.
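As a rough illustration of the multiline approach, here is a minimal Python sketch. The language is an assumption, since the question does not name one, and the sample markup and the `extract_title` helper are made up for this example. Rather than spelling out each line-break character in an alternation, it uses the `re.DOTALL` flag so that the dot also matches newlines.

```python
import re

# Sample markup (an assumption) mirroring the multi-line <title> from the question.
html = "<html><head><title>\n  YouTube - Broadcast Yourself.\n</title></head></html>"

# re.DOTALL lets "." match line breaks too; re.IGNORECASE also accepts <TITLE>.
# The \s* on both sides swallows leading and trailing whitespace, newlines included.
TITLE_RE = re.compile(r"<title>\s*(.*?)\s*</title>", re.IGNORECASE | re.DOTALL)


def extract_title(markup):
    """Return the trimmed title text, or None when no <title> element is found."""
    match = TITLE_RE.search(markup)
    return match.group(1) if match else None


print(extract_title(html))
```

The lazy `(.*?)` stops at the first closing tag, so trailing whitespace ends up in the final `\s*` instead of in the captured group.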

If your language or interpreter does not support multiline regular expressions, you can always replace the newline characters with spaces and then pass the resulting string to the regular expression parser. Again, that depends on your programming environment.
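The fallback described above can be sketched like this, again in Python purely for illustration (the question does not name a language, and `extract_title_flattened` is a hypothetical helper). Line breaks are replaced with spaces first, so the pattern needs no multiline or dot-all flag.

```python
import re


def extract_title_flattened(markup):
    """Flatten line breaks to spaces, then match with a single-line pattern."""
    # Step 1: replace every run of CR/LF characters with one space.
    flat = re.sub(r"[\r\n]+", " ", markup)
    # Step 2: an ordinary pattern is enough now; "." never has to cross a newline.
    match = re.search(r"<title>\s*(.+?)\s*</title>", flat, re.IGNORECASE)
    return match.group(1) if match else None


# Sample input (an assumption) with Windows-style line endings.
html = "<title>\r\n  YouTube - Broadcast Yourself.\r\n</title>"
print(extract_title_flattened(html))
```

One side effect of this approach is that a title that itself spans several lines comes back with spaces where the line breaks were.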

Hope this helps!

Answer by Starx

There are various ways to get this done. For just the title, the Simple HTML DOM parser is more than enough.

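include 'simple_html_dom.php'; // the Simple HTML DOM library file
$html = file_get_html('http://www.youtube.com/');
// find() with an index returns a single element; trim() drops the surrounding line breaks
$title = trim($html->find('title', 0)->innertext);
echo $title;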

Author: Nabin Nepal (Starx)
