PHP folks,
I have a few questions about my crawler script. The main ones, detailed below with the code, are how to weed out empty array values, how to get rid of an "Undefined index" notice, and how to strip HTML tags from the extracted keywords. The array_filter example below, taken from the Stack Overflow link, did not work for me.
Also, I don't know why the "url_indexing_date" column is showing zero values. I have another table that shows the dates in such a column.
How do I weed out empty array values?
PHP:
print_r(array_filter($keywords_array, 'strlen'));
https://stackoverflow.com/questions/3654295/remove-empty-array-elements
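One likely reason the call above appears not to work is that array_filter() returns a new array instead of changing $keywords_array, so printing its result never affects what is used later. A minimal sketch, assuming a small hypothetical sample array in place of the real page content, and assuming whitespace-only entries should be dropped as well:

```php
<?php
// Hypothetical sample standing in for the exploded page content.
$keywords_array = array("web", "", "crawler", " ", "", "keywords");

// Trim each entry first so that whitespace-only strings become empty.
$trimmed = array_map('trim', $keywords_array);

// array_filter() returns a NEW array; the original is left untouched,
// so the result has to be assigned before it is used.
// array_values() re-indexes the result from 0.
$filtered = array_values(array_filter($trimmed, 'strlen'));

print_r($filtered);
```

Using 'strlen' as the callback keeps the string "0", which a bare array_filter($array) call would discard. Any later loop would then have to iterate over $filtered rather than the original array.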
Here is my code so far. I am building a web crawler that crawls a page, notes the keywords and links, and counts them. It is not fully finished.
Look at the attached image and you will notice blank values in the "keywords" column. That is because some of the array values are empty.
Therefore, I need to weed out the empty values from the array before inserting the array values into the MySQL table.
PHP:
<?php
//Required PHP files.
include 'config.php';
include 'header.php';
//1). Set banned words.
$banned_words = array("asshole", "nut", "bullshit");
$url = 'https://www.york.ac.uk/teaching/cws/wws/webpage1.html';
// 2). $curl is going to be data type curl resource.
$curl = curl_init();
// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// 4). Run cURL (execute http request).
$result = curl_exec($curl);
if (curl_errno($curl))
{
    echo 'Error:' . curl_error($curl);
}
$response = curl_getinfo($curl);
//If page is fetched then replace banned words found on page.
if ($response['http_code'] == '200')
{
    $regex = '/\b';
    $regex .= implode('\b|\b', $banned_words);
    $regex .= '\b/i';
    $substitute = 'BANNED WORD REPLACED';
    $clean_result = preg_replace($regex, $substitute, $result);
    //Present the banned words filtered webpage.
    echo $clean_result;
}
else
{
    //Show error if page fetching fails.
    echo "Page fetching problem!";
    echo "$response[http_code]";
    exit();
}
curl_close($curl);
//Define variables.
$keywords_number = "0";
$keywords_count = "0";
$links_count = "0";
$keywords_links_count = "0";
$images_count = "0";
$keywords_images_count = "0";
$keywords_internal_links_count = "0";
$keywords_external_links_count = "0";
//Link extractor starts here. It will extract all links present on the page.
function linkExtractor($clean_result)
{
    $linkArray = array();
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $clean_result, $link_matches, PREG_SET_ORDER))
    {
        foreach ($link_matches as $link_match)
        {
            global $url, $links_count, $keywords_links_count, $images_count, $keywords_images_count, $keywords_internal_links_count, $keywords_external_links_count;
            echo "url: $url<br>";
            echo "link_match: $link_match[links_count]<br>";
            $links_count++;
            echo "links_count: $links_count++<br>";
            $keywords_links_count++;
            echo "keywords_links_count: $keywords_links_count++<br>";
            $images_count++;
            echo "images_count: $images_count++<br>";
            $keywords_images_count++;
            echo "keywords_images_count: $keywords_images_count++<br>";
            $keywords_internal_links_count++;
            echo "keywords_internal_links_count: $keywords_internal_links_count++<br>";
            $keywords_external_links_count++;
            echo "keywords_external_links_count: $keywords_external_links_count++<br>";
        }
    }
    return $linkArray;
}
echo '<pre>' . print_r(linkExtractor($clean_result), true) . '</pre>';
//Content filter starts here to check for banned words present on the page.
$keywords_array = explode(" ", $clean_result);
$keywords_count = "0";
foreach ($keywords_array as $keyword)
{
    echo $keyword."\n";
    echo "keyword: $keyword<br>";
    $keywords_count++;
    echo "Keywords_count: $keywords_count++<br>";
    print_r(array_filter($keywords_array, 'strlen'));
}
foreach ($keywords_array as $keyword)
{
    $keywords_number++;
    //Insert the user's inputs into the MySQL database using PHP's SQL injection prevention method, "Prepared Statements".
    $stmt = mysqli_prepare($conn, "INSERT INTO searchengine_index(url,keywords,keywords_number,keywords_count,links,links_count,keywords_links_count,images_count,keywords_images_count,keywords_internal_links_count,keywords_external_links_count) VALUES (?,?,?,?,?,?,?,?,?,?,?)");
    global $url, $keywords_number, $links_count, $keywords_links_count, $images_count, $keywords_images_count, $keywords_internal_links_count, $keywords_external_links_count;
    mysqli_stmt_bind_param($stmt, 'ssisiiiiiii', $url, $keyword, $keywords_number, $keywords_count, $link_match[$keywords_links_count], $links_count, $keywords_links_count, $images_count, $keywords_images_count, $keywords_internal_links_count, $keywords_external_links_count);
    mysqli_stmt_execute($stmt);
    //Check if data was successfully submitted or not.
    if (!$stmt)
    {
        echo "Sorry! Our system is currently experiencing a problem indexing your website. We will try some other time!";
        exit();
    }
}
?>
And I get this error:
Notice: Undefined index: links_count in C:\xampp\htdocs\test\crawler.php on line 71
How do I get rid of this error? I want to echo each array value in the foreach loop.
Line 71:
PHP:
echo "link_match: $link_match[links_count]<br>";
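The notice most likely arises because, with PREG_SET_ORDER, each $link_match is a numerically indexed array (0 is the whole match, 1 and 2 are the two capture groups), so it has no "links_count" key. A minimal sketch of reading the groups by number, assuming a short hypothetical HTML snippet in place of the fetched page and a hypothetical $links array for the results:

```php
<?php
// Hypothetical HTML snippet used only for illustration.
$html = '<p><a href="https://example.com/a">First</a> and <a href="/b">Second</a></p>';
$pattern = '/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i';

$links = array();
if (preg_match_all($pattern, $html, $link_matches, PREG_SET_ORDER))
{
    foreach ($link_matches as $link_match)
    {
        // Index 1 is the href capture group, index 2 is the anchor text.
        $links[] = array('href' => $link_match[1], 'text' => $link_match[2]);
        echo "href: {$link_match[1]}, text: {$link_match[2]}\n";
    }
}

// Count once after the loop instead of incrementing globals inside it.
$links_count = count($links);
echo "links_count: $links_count\n";
```

Note also that inside a double-quoted string, "$links_count++" prints the value followed by a literal "++"; the increment operator is not evaluated there.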
I will need to find a regex to weed out the HTML tags so that they don't get dumped into the "keywords" column of the table. Only the keywords extracted from the page content that the visitor sees should go there.
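As an alternative to a hand-written regex for every tag, PHP's built-in strip_tags() could do most of this work. The following is only a sketch under stated assumptions: the $html fragment is hypothetical, script and style blocks are assumed to be unwanted, and keywords are assumed to be whitespace-separated.

```php
<?php
// Hypothetical page fragment used only for illustration.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><h1>Web  Crawler</h1><p>Counts <b>keywords</b> &amp; links.</p>'
      . '<script>var x = 1;</script></body></html>';

// Drop <script> and <style> blocks first so that their code or CSS
// cannot end up among the keywords.
$text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

// Put a space before each tag so adjacent words do not get glued together.
$text = str_replace('<', ' <', $text);

// Remove the remaining tags, then turn entities such as &amp; into characters.
$text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

// Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces,
// so no blank keywords are produced.
$keywords_array = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($keywords_array);
```

Punctuation stays attached to the words in this sketch (for example "links."), so a further cleanup step may still be wanted before inserting into the table.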