function feeds_imagegrabber_webpage_scraper in Feeds Image Grabber 7

Same name and namespace in other branches
  1. 6 feeds_imagegrabber.module \feeds_imagegrabber_webpage_scraper()

Scrapes a web page, selects a tag by its ID or CSS class, and returns that tag (with its contents) serialized as XML.

Parameters

$page_url: A string specifying the URL of the page to scrape, passed by reference. If the request is redirected (301, 302, or 307), it is updated to the redirect URL.

$itype: A non-negative integer representing the identifier type for the tag:

  • 0 : selects content between <body> </body>.
  • 1 : selects the tag identified by an ID.
  • 2 : selects the first tag identified by a CSS class.

$ivalue: A string specifying the ID or the CSS class. Not used when $itype is 0.

$timeout: A float representing the maximum number of seconds the request may take. The default is 15 seconds. If a timeout occurs, the error code reported in $error_log is the HTTP_REQUEST_TIMEOUT constant.

$max_redirects: An integer representing how many times a redirect may be followed. Defaults to 3.

$error_log: An array, passed by reference, that receives the error code ('code') and message ('error') if the function fails.
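
The three $itype modes described above can be tried without Drupal on an inline HTML string. This is only a minimal sketch: select_element() is a hypothetical standalone helper written for illustration, not a function of the module, and the sample markup is made up.

```php
<?php
// Hypothetical helper mirroring the three $itype modes; not part of the module.
function select_element(DOMDocument $doc, $itype, $ivalue = '') {
  if ($itype == 0) {
    // Mode 0: the <body> element.
    return $doc->getElementsByTagName('body')->item(0);
  }
  if ($itype == 1) {
    // Mode 1: the element whose id attribute equals $ivalue.
    return $doc->getElementById($ivalue);
  }
  if ($itype == 2) {
    // Mode 2: the first element whose class list contains $ivalue as a whole token.
    $xpath = new DOMXPath($doc);
    $query = "//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' {$ivalue} ')]";
    $items = $xpath->query($query);
    return ($items !== FALSE && $items->length > 0) ? $items->item(0) : NULL;
  }
  // Any other identifier type is unsupported.
  return NULL;
}

// Made-up sample page.
$html = '<html><body><div id="main"><p class="lead intro">Hi</p></div><p class="intro">Bye</p></body></html>';
$doc = new DOMDocument();
@$doc->loadHTML($html);

$by_id = select_element($doc, 1, 'main');
$by_class = select_element($doc, 2, 'intro');

// Serializing a node includes the tag itself, not only its children.
echo $doc->saveXML($by_class), "\n"; // prints <p class="lead intro">Hi</p>
```

Note that mode 2 returns the first match in document order, so the paragraph inside the div wins over the later one.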

Return value

FALSE on failure, or the selected tag and its contents serialized as XML on success.
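
A caller that receives FALSE can inspect $error_log to find out which step failed. The sketch below is an illustration only: describe_scraper_error() is a hypothetical helper, and its wording of each case is an assumption. The codes -1 to -4 and the message string are taken from the function body listed on this page. Because Drupal 7 defines HTTP_REQUEST_TIMEOUT as -1, the same value used for an empty response, the sketch checks the message before the code.

```php
<?php
// Hypothetical helper, not part of the module: summarizes an $error_log array.
function describe_scraper_error(array $error_log) {
  if (!isset($error_log['code'])) {
    return 'no error recorded';
  }
  $code = $error_log['code'];
  $message = isset($error_log['error']) ? $error_log['error'] : '';

  // Set when the HTTP response code is not 200; 'code' then holds the
  // code reported by the HTTP request, which may itself be negative.
  if ($message === 'unable to retrieve content from web page') {
    return "request failed with code {$code}";
  }

  // Negative codes assigned by the scraper itself.
  $internal = array(
    -1 => 'empty response body',
    -2 => 'HTML could not be parsed',
    -3 => 'requested tag not found',
    -4 => 'element could not be serialized to XML',
  );
  if (isset($internal[$code])) {
    return $internal[$code];
  }
  return "unknown error code {$code}";
}

// Example with a made-up failure.
echo describe_scraper_error(array('code' => -3, 'error' => 'tag not found')), "\n";
```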

1 call to feeds_imagegrabber_webpage_scraper()
feeds_imagegrabber_feeds_set_target in ./feeds_imagegrabber.module
Callback for mapping. Here is where the actual mapping happens.

File

./feeds_imagegrabber.module, line 427
Grabs images for items imported using the feeds module.

Code

function feeds_imagegrabber_webpage_scraper(&$page_url, $itype, $ivalue = '', $timeout = 15, $max_redirects = 3, &$error_log = array()) {
  $options = array(
    'headers' => array(),
    'method' => 'GET',
    'data' => NULL,
    'max_redirects' => $max_redirects,
    'timeout' => $timeout,
  );
  $result = drupal_http_request($page_url, $options);
  if (isset($result->redirect_code) && in_array($result->redirect_code, array(301, 302, 307))) {
    $page_url = $result->redirect_url;
  }
  if ($result->code != 200) {
    $error_log['code'] = $result->code;
    $error_log['error'] = "unable to retrieve content from web page";
    return FALSE;
  }
  if (empty($result->data) || drupal_strlen($result->data) <= 0) {
    $error_log['code'] = -1;
    $error_log['error'] = "no data available on url";
    return FALSE;
  }
  $doc = new DOMDocument();
  if (@$doc->loadHTML($result->data) === FALSE) {
    $error_log['code'] = -2;
    $error_log['error'] = "unable to parse the html content";
    return FALSE;
  }
  if ($itype == 0) {
    $items = @$doc->getElementsByTagName("body");
    if ($items != NULL && $items->length > 0) {
      $dist = $items->item(0);
    }
    else {
      $dist = NULL;
    }
  }
  elseif ($itype == 1) {
    $dist = @$doc->getElementById($ivalue);
  }
  elseif ($itype == 2) {
    $xpath = new DOMXPath($doc);

    // Normalize whitespace.
    $ivalue = preg_replace('/\\s\\s+/', ' ', trim($ivalue));
    $items = $xpath->query("//*[@class and contains(concat(' ',normalize-space(@class),' '), ' {$ivalue} ')]");
    if ($items != NULL && $items->length > 0) {
      $dist = $items->item(0);
    }
    else {
      $dist = NULL;
    }
  }
  else {
    // Unsupported identifier type.
    $dist = NULL;
  }
  if ($dist == NULL) {
    $error_log['code'] = -3;
    $error_log['error'] = "tag not found";
    return FALSE;
  }
  $content = '';
  if (($content = @$dist->ownerDocument->saveXML($dist)) === FALSE) {
    $error_log['code'] = -4;
    $error_log['error'] = "error converting content to XML";
    return FALSE;
  }
  return $content;
}
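
The XPath expression used for $itype 2 pads the normalized class attribute with spaces so that the class name is matched as a whole token rather than as a substring. The following standalone sketch compares the two behaviors; the markup and the class names are made up for illustration.

```php
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<html><body><div class="post-image-large">A</div><div class="  post   image ">B</div></body></html>');
$xpath = new DOMXPath($doc);

// Naive substring test: also matches "post-image-large".
$naive = $xpath->query("//*[contains(@class, 'image')]");

// Token test, same pattern as in the function above: matches only the
// whole class name "image", regardless of surrounding whitespace.
$token = $xpath->query("//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' image ')]");

echo $naive->length, "\n";               // 2
echo $token->length, "\n";               // 1
echo $token->item(0)->textContent, "\n"; // B
```

Since $ivalue is inserted into the query as-is, a value with several class names separated by spaces only matches when those names appear next to each other in the same order in the class attribute.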