protected function SearchApiAttachmentsAlterSettings::extractSolr in Search API attachments 7

Extract data using Solr.

Extraction is performed by Solr's ExtractingRequestHandler or by a remote Tika servlet.

Parameters

object $file: The file.

Return value

string|false The extracted file content, or FALSE if extraction failed.

Throws

SearchApiException

See also

http://wiki.apache.org/solr/ExtractingRequestHandler

http://wiki.apache.org/tika/TikaJAXRS

1 call to SearchApiAttachmentsAlterSettings::extractSolr()
SearchApiAttachmentsAlterSettings::getFileContent in includes/callback_attachments_settings.inc
Extracts the file content.

File

includes/callback_attachments_settings.inc, line 531
Search API data alteration callback.

Class

SearchApiAttachmentsAlterSettings
Indexes file content.

Code

protected function extractSolr($file) {
  $extraction = FALSE;
  $filepath = $this->getRealpath($file);
  try {
    $filename = basename($filepath);

    // Server name is stored in the index.
    $server_name = $this->index->server;
    $server = search_api_server_load($server_name, TRUE);

    // Make sure this is a solr server.
    $class_info = search_api_get_service_info($server->class);
    $classes = class_parents($class_info['class']);
    $classes[$class_info['class']] = $class_info['class'];
    if (!in_array('SearchApiSolrService', $classes)) {
      throw new SearchApiException(t('Server %server is not a Solr server, unable to extract file.', array(
        '%server' => $server_name,
      )));
    }

    // Open a connection to the server.
    $solr_connection = $server->getSolrConnection();

    // Path for our servlet request.
    $servlet_path = variable_get('search_api_attachments_extracting_servlet_path', 'update/extract');

    // Parameters for the extraction request.
    $params = array(
      'extractOnly' => 'true',
      'resource.name' => $filename,
      // Matches the -t command for the tika CLI app.
      'extractFormat' => 'text',
      'wt' => 'json',
      'hl' => 'on',
    );

    // Heavily inspired by apachesolr_file.
    // @see apachesolr_file_extract().
    // Construct a multi-part form-data POST body in $data.
    $boundary = '--' . md5(uniqid(REQUEST_TIME));
    $data = "--{$boundary}\r\n";

    // The 'filename' used here becomes the property name in the response.
    $data .= 'Content-Disposition: form-data; name="file"; filename="extracted"';
    $data .= "\r\nContent-Type: application/octet-stream\r\n\r\n";
    $data .= file_get_contents($filepath);
    $data .= "\r\n--{$boundary}--\r\n";
    $headers = array(
      'Content-Type' => 'multipart/form-data; boundary=' . $boundary,
    );
    $options = array(
      'method' => 'POST',
      'headers' => $headers,
      'data' => $data,
    );

    // Make a servlet request using the solr connection.
    $response = $solr_connection->makeServletRequest($servlet_path, $params, $options);

    // If we have an extracted response, all is well.
    if (isset($response->extracted)) {
      $extraction = $response->extracted;
    }
  } catch (Exception $e) {

    // Log the exception to watchdog. Exceptions from Solr may be transient,
    // or indicate a problem with a specific file.
    watchdog('search_api_attachments', 'Exception occurred sending %filepath to Solr.', array(
      '%filepath' => $file['uri'],
    ));
    watchdog_exception('search_api_attachments', $e);
  }
  return $extraction;
}
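
Example

The multipart/form-data body assembled inside extractSolr() can be tried outside Drupal. The following is a minimal sketch, not part of the module: the helper name example_build_multipart() and the sample boundary string are assumptions made for illustration. It shows how the boundary appears in the Content-Type header, in the opening delimiter, and in the closing delimiter.

```php
<?php

/**
 * Builds a single-file multipart/form-data request (hypothetical helper).
 *
 * Mirrors the body layout used in extractSolr(), without any Drupal calls.
 *
 * @param string $contents
 *   The raw file contents to send.
 * @param string $boundary
 *   The boundary string; it must not occur inside $contents.
 *
 * @return array
 *   An array with 'headers' and 'data' keys.
 */
function example_build_multipart($contents, $boundary) {
  // Each part opens with "--" followed by the boundary.
  $data = "--{$boundary}\r\n";
  // The 'filename' value becomes the property name in the Solr response.
  $data .= 'Content-Disposition: form-data; name="file"; filename="extracted"';
  $data .= "\r\nContent-Type: application/octet-stream\r\n\r\n";
  $data .= $contents;
  // The closing delimiter wraps the boundary in "--" on both sides.
  $data .= "\r\n--{$boundary}--\r\n";

  return array(
    'headers' => array(
      'Content-Type' => 'multipart/form-data; boundary=' . $boundary,
    ),
    'data' => $data,
  );
}

// Sample call with an arbitrary, illustrative boundary.
$request = example_build_multipart('Hello, Tika.', 'example-boundary-1234');
echo $request['headers']['Content-Type'], "\n";
echo strlen($request['data']), " bytes in body\n";
```

In extractSolr() the boundary is derived from md5(uniqid(REQUEST_TIME)), which makes a collision with the file contents unlikely; a fixed boundary is used here only to keep the sketch readable.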