function feeds_imagegrabber_webpage_scraper in Feeds Image Grabber 6
Same name and namespace in other branches
- 7 feeds_imagegrabber.module \feeds_imagegrabber_webpage_scraper()
Scrapes a web page for a tag identified by its ID or CSS class and returns that tag, with its content, serialized as XML.
Parameters
$page_url: A string specifying the URL of the page to scrape. It is passed by reference: if the request is redirected (301, 302 or 307), it is set to the redirect URL.
$itype: A non-negative integer representing the identifier type for the tag:
- 0 : selects content between <body> </body>.
- 1 : selects content between the tag identified by an ID.
- 2 : selects content between the first tag identified by a CSS class.
$ivalue: A string specifying the ID or the CSS class.
$timeout: A float representing the maximum number of seconds the function call may take. The default is 15 seconds. If a timeout occurs, the return code is set to the FIG_HTTP_REQUEST_TIMEOUT constant.
$max_redirects: An integer representing how many times a redirect may be followed. Defaults to 3.
$error_log: An array, passed by reference, that receives the error code ('code') and message ('error') if the function fails.
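The three $itype modes can be illustrated with a minimal, self-contained sketch that runs against an inline HTML string instead of a fetched page. The helper name fig_demo_select() and the sample markup are assumptions made for this illustration; they are not part of the module.

```php
<?php
// Hypothetical helper (not part of feeds_imagegrabber): returns the node
// selected by the given identifier type, or NULL if nothing matches.
function fig_demo_select(DOMDocument $doc, $itype, $ivalue = '') {
  if ($itype == 0) {
    // Type 0: the <body> element.
    return $doc->getElementsByTagName('body')->item(0);
  }
  if ($itype == 1) {
    // Type 1: the element with the given ID.
    return $doc->getElementById($ivalue);
  }
  if ($itype == 2) {
    // Type 2: the first element carrying the given CSS class.
    $xpath = new DOMXPath($doc);
    $items = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' {$ivalue} ')]");
    return $items->length > 0 ? $items->item(0) : NULL;
  }
  // Any other type is unsupported.
  return NULL;
}

// Sample markup, invented for this example.
$html = '<html><body><div id="main"><p class="lead photo">A</p><p class="photo">B</p></div></body></html>';
$doc = new DOMDocument();
@$doc->loadHTML($html);

$body = fig_demo_select($doc, 0);
$by_id = fig_demo_select($doc, 1, 'main');
$by_class = fig_demo_select($doc, 2, 'photo');

print $body->nodeName . "\n";        // body
print $by_id->nodeName . "\n";       // div
print $by_class->textContent . "\n"; // A (the first match only)
```

Note that type 2 returns only the first matching element, which is why the second paragraph ("B") is never selected.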
Return value
FALSE on failure; on success, the matched tag and its content as an XML string.
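Because the result comes from DOMDocument::saveXML() on the matched node, the returned string includes the matched tag itself and is serialized as XML rather than HTML. The following sketch shows this on an inline string; the sample markup is an assumption for illustration only.

```php
<?php
// Sample markup, invented for this example.
$doc = new DOMDocument();
@$doc->loadHTML('<html><body><div id="pic"><img src="a.png"><br></div></body></html>');

// Select the node as the ID-based mode would.
$node = $doc->getElementById('pic');

// saveXML() with a node argument serializes just that subtree,
// including the node's own opening and closing tags.
$xml = $doc->saveXML($node);

// Void elements such as <img> and <br> typically come out in
// self-closed XML form in the result.
print $xml . "\n";
```

Callers that need only the inner markup would have to strip the outer tag themselves or serialize the child nodes one by one.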
1 call to feeds_imagegrabber_webpage_scraper()
- feeds_imagegrabber_feeds_set_target in ./feeds_imagegrabber.module: Implementation of hook_feeds_set_target().
File
- ./feeds_imagegrabber.module, line 420: Grabs an image for each feed item from its web page and stores it in an image field. Requires the Feeds module.
Code
function feeds_imagegrabber_webpage_scraper(&$page_url, $itype, $ivalue = '', $timeout = 15, $max_redirects = 3, &$error_log = array()) {
  $options = array(
    'headers' => array(),
    'method' => 'GET',
    'data' => NULL,
    'max_redirects' => $max_redirects,
    'timeout' => $timeout,
  );
  $result = feeds_imagegrabber_http_request($page_url, $options);

  // If the request was redirected, report the new URL back to the caller.
  if (isset($result->redirect_code) && in_array($result->redirect_code, array(301, 302, 307))) {
    $page_url = $result->redirect_url;
  }

  if ($result->code != 200) {
    $error_log['code'] = $result->code;
    $error_log['error'] = "unable to retrieve content from web page";
    return FALSE;
  }
  if (empty($result->data) || drupal_strlen($result->data) <= 0) {
    $error_log['code'] = -1;
    $error_log['error'] = "no data available on url";
    return FALSE;
  }

  // Parse the page; parser warnings about malformed HTML are suppressed.
  $doc = new DOMDocument();
  if (@$doc->loadHTML($result->data) === FALSE) {
    $error_log['code'] = -2;
    $error_log['error'] = "unable to parse the html content";
    return FALSE;
  }

  if ($itype == 0) {
    // Type 0: select the <body> element.
    $items = @$doc->getElementsByTagName("body");
    if ($items != NULL && $items->length > 0) {
      $dist = $items->item(0);
    }
    else {
      $dist = NULL;
    }
  }
  elseif ($itype == 1) {
    // Type 1: select the element with the given ID.
    $dist = @$doc->getElementById($ivalue);
  }
  elseif ($itype == 2) {
    // Type 2: select the first element carrying the given CSS class.
    $xpath = new DOMXPath($doc);
    // Collapse runs of whitespace in the class name.
    $ivalue = preg_replace('/\\s\\s+/', ' ', trim($ivalue));
    // Pad the class attribute with spaces so only whole class names match.
    $items = $xpath->query("//*[@class and contains(concat(' ',normalize-space(@class),' '), ' {$ivalue} ')]");
    if ($items != NULL && $items->length > 0) {
      $dist = $items->item(0);
    }
    else {
      $dist = NULL;
    }
  }
  else {
    // Other identifier types are not supported yet.
    $dist = NULL;
  }

  if ($dist == NULL) {
    $error_log['code'] = -3;
    $error_log['error'] = "tag not found";
    return FALSE;
  }

  // Serialize the matched element, including its own tag, as XML.
  $content = @$dist->ownerDocument->saveXML($dist);
  if ($content === FALSE) {
    $error_log['code'] = -4;
    $error_log['error'] = "error converting content to XML";
    return FALSE;
  }
  return $content;
}
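The XPath expression used for the CSS class case pads the normalized class attribute with spaces, so that the class name is matched as a whole token rather than as a substring. The sketch below contrasts this with a plain contains() test. The sample markup and variable names are assumptions made for illustration.

```php
<?php
// Sample markup, invented for this example: one element whose class merely
// contains the text "photo", and one that really has the class "photo".
$doc = new DOMDocument();
@$doc->loadHTML('<html><body><p class="photo-credit">X</p><p class="  lead   photo ">Y</p></body></html>');
$xpath = new DOMXPath($doc);

// Substring test: also matches "photo-credit".
$naive = $xpath->query("//*[contains(@class, 'photo')]");

// Token test, as in the function above: normalize-space() trims and
// collapses whitespace, and the surrounding spaces force a whole-word match.
$token = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]");

print $naive->length . "\n";               // 2
print $token->length . "\n";               // 1
print $token->item(0)->textContent . "\n"; // Y
```

The value of $ivalue is interpolated directly into the query string, so a class name containing a quote character would break the expression; callers may want to validate it first.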