PHP Buddies,
What I am trying to do is learn to build a simple web crawler.
So at first, I will feed it a URL to start with.
It will then fetch that page and extract all the links into a single array.
Then it will fetch each of those linked pages and likewise extract all their links into a single array. It will keep doing this until it reaches its maximum link-crawling depth.
Here is how I coded it:
PHP:
<?php
include('simple_html_dom.php');

$current_link_crawling_level = 0;
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    exit();
}
else
{
    $url = 'https://www.yahoo.com';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $html = curl_exec($curl);

    $current_link_crawling_level++;

    //to fetch all hyperlinks from the webpage
    $links = array();
    foreach($html->find('a') as $a)
    {
        $links[] = $a->href;
        echo "Value: $value<br />\n";
        print_r($links);

        $url = '$value';
        $curl = curl_init($value);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
        $html = curl_exec($curl);

        //to fetch all hyperlinks from the webpage
        $links = array();
        foreach($html->find('a') as $a)
        {
            $links[] = $a->href;
            echo "Value: $value<br />\n";
            print_r($links);
            $current_link_crawling_level++;
        }

        echo "Value: $value<br />\n";
        print_r($links);
    }
}
?>
I have a feeling I got confused and messed things up in the foreach loops. Nested too much. Is that the case? Hint at where I went wrong.
Unable to test the script yet, as I first have to sort out this error:
Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\h
After that, I will be able to test it. Anyway, just looking at the script, do you think I got it right or not?
Thanks
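For comparison, here is a minimal sketch of the crawl loop described at the top of this post, written with a queue of URLs instead of nested foreach loops, so each level of links is fetched in turn until the maximum depth is reached. It assumes the simple_html_dom library is available and uses str_get_html() to turn each cURL response into an object that ->find() can be called on; the variable names are just illustrative.
PHP:
<?php
include('simple_html_dom.php');

$start_url = 'https://www.yahoo.com';
$link_crawling_level_max = 2;

$current_level_urls = array($start_url); // pages to fetch at the current depth
$all_links = array();                    // every link collected so far

for($level = 0; $level < $link_crawling_level_max; $level++)
{
    $next_level_urls = array();

    foreach($current_level_urls as $url)
    {
        // Fetch the page with cURL.
        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
        $response = curl_exec($curl);
        curl_close($curl);

        if($response === false)
        {
            continue; // skip pages that could not be fetched
        }

        // Parse the HTML string so ->find() is called on an object, not a string.
        $html = str_get_html($response);
        if($html === false)
        {
            continue;
        }

        // Collect every hyperlink on the page.
        // Note: relative links would need to be turned into absolute URLs
        // before they could be fetched on the next level.
        foreach($html->find('a') as $a)
        {
            $all_links[] = $a->href;
            $next_level_urls[] = $a->href;
        }
    }

    // The links just found become the pages to crawl at the next depth.
    $current_level_urls = $next_level_urls;
}

print_r($all_links);
?>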
I just replaced:
PHP:
//$html = file_get_html('http://example.com');
with:
PHP:
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$html = curl_exec($curl);
That is all!
That should not result in that error!
UPDATE:
I have been given this sample code just now. Gonna experiment with it.
Possible solution with str_get_html:
PHP:
// requires the simple_html_dom library for str_get_html()
include('simple_html_dom.php');

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
Just sharing it here for other future newbies!
I am told:
"file_get_html is a special function from simple_html_dom library. If you open source code for simple_html_dom you will see that file_get_html() does a lot of things that your curl replacement does not. That's why you get your error."
Anyway, folks, I really don't want to be using this limited-capacity file_get_html(), so let's replace it with cURL. I tried my best at giving cURL a shot here. What about you? Care to show how to fix this thing?
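One way to mirror what file_get_html() gives you while still using cURL, as a minimal sketch: fetch the page into a string and then hand that string to str_get_html(), so you still end up with a simple_html_dom object. The helper name curl_get_html() below is just an illustration, not part of the library.
PHP:
<?php
include('simple_html_dom.php');

// Hypothetical helper: a cURL-based stand-in for file_get_html().
function curl_get_html($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $response = curl_exec($curl);
    curl_close($curl);

    if($response === false)
    {
        return false; // network / cURL error
    }

    // str_get_html() turns the HTML string into a simple_html_dom object
    // (it returns false if the string cannot be parsed).
    return str_get_html($response);
}

$html = curl_get_html('https://www.yahoo.com');
if($html !== false)
{
    foreach($html->find('a') as $a)
    {
        echo $a->href, "<br />\n";
    }
}
?>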
I did a search in the PHP manual for str_get_html to be sure of what the function does, but I am shown no results.
And so I ask: just what does it do?
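As far as I can tell, str_get_html() is not in the PHP manual because it is not a core PHP function: it comes from the same simple_html_dom library as file_get_html(). It parses HTML that is already held in a string (rather than fetching a URL or file) and returns a simple_html_dom object, or false if the string cannot be parsed. Roughly:
PHP:
<?php
include('simple_html_dom.php');

// Parse raw HTML that is already held in a string.
$html = str_get_html('<p><a href="https://example.com">Example</a></p>');

foreach($html->find('a') as $a) {
    echo $a->href, "\n"; // prints: https://example.com
}
?>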
PHP Buddies,
Look at these 2 updates. They both succeed in fetching the PHP manual page but fail to fetch the Yahoo homepage. Why is that?
The 2nd script is like the 1st one except for a small change. Look at the commented-out part in script 2 to see the difference; the added code comes after the commented-out part.
SCRIPT 1
PHP:
<?php
//HALF WORKING
include('simple_html_dom.php');
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
?>
SCRIPT 2
PHP:
<?php
//HALF WORKING
include('simple_html_dom.php');

$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

$html = str_get_html($response_string);

/*
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
*/

// Hide HTML warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument;
if($dom->loadHTML($html, LIBXML_NOWARNING)){
    // echo Links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href,"\t",$anchor,"\n";
    }
    echo '</pre>';
} else {
    echo "Failed to load html.";
}
?>
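For what it is worth, both scripts assume that curl_exec() and str_get_html() succeeded. A plausible reason the Yahoo homepage fails while the much smaller php.net page works is that str_get_html() returns false when it cannot parse the input; in particular, simple_html_dom refuses strings larger than its MAX_FILE_SIZE limit (around 600 KB by default), and a large homepage can exceed that. A minimal diagnostic sketch, using the same cURL setup as above:
PHP:
<?php
include('simple_html_dom.php');

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);

if($response_string === false)
{
    die('cURL error: ' . curl_error($curl));
}

echo 'Fetched ' . strlen($response_string) . " bytes<br />\n";

$html = str_get_html($response_string);
if($html === false)
{
    // Parsing failed. With simple_html_dom this often means the page is
    // bigger than its MAX_FILE_SIZE limit (around 600 KB by default).
    die('str_get_html() could not parse the response.');
}

foreach($html->find('a') as $a)
{
    echo $a->href, "<br />\n";
}
?>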
Don't forget my previous post!
Cheers!