INTRODUCTION
------------

Search API Attachments

This module is an add-on to the Search API which allows the indexing and
searching of attachments.

Extraction can be done using any one of the following:
- the Apache Tika library
- the Solr built-in extractor
- Acquia Search
- pdftotext (PDF files only)
- python pdf2txt (PDF files only)

REQUIREMENTS
------------

Requires the ability to run Java on your server and an installation of the
Apache Tika library if you don't want to use the Solr built-in extractor.

The PHP iconv extension is required to index txt files.

MODULE INSTALLATION
-------------------
Copy search_api_attachments into your modules folder.

Install the search_api_attachments module in your Drupal site.

Go to the configuration page: admin/config/search/search_api/attachments

Choose an extraction method and follow the instructions under the respective
heading below.


EXTRACTION CONFIGURATION (Tika)
-------------------------------

Install Java.

Download the Apache Tika library: http://tika.apache.org/download.html
The downloaded file is named something like tika-app-y.x.jar

In admin/config/search/search_api/attachments, enter the parent directory
and the file name of the .jar file.
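
Before entering these values in the form, it can help to confirm from a shell
that Java can run the jar. The following is a minimal sketch, not part of the
module: the directory /opt/tika and the function name tika_extract are
hypothetical, and tika-app-y.x.jar stands for whatever file you downloaded. It
relies on the --text option of the Tika application, which prints the extracted
plain text.

```shell
# Hypothetical locations: replace both with your own values.
TIKA_DIR="/opt/tika"          # parent directory entered in the form
TIKA_JAR="tika-app-y.x.jar"   # file name entered in the form

# Print the plain text Tika extracts from one file,
# or report which prerequisite is missing.
tika_extract() {
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found in PATH" >&2
    return 1
  fi
  if [ ! -f "$TIKA_DIR/$TIKA_JAR" ]; then
    echo "jar not found: $TIKA_DIR/$TIKA_JAR" >&2
    return 1
  fi
  java -jar "$TIKA_DIR/$TIKA_JAR" --text "$1"
}
```

Usage: tika_extract /path/to/document.pdf
If text is printed, the same directory and file name should work in the form.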

- Hidden settings

search_api_attachments_java:
  By changing this variable, you can set the path to your java executable. The
  default is 'java'.
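
Because this setting has no form field, it has to be changed another way. One
option is Drush, as sketched below. This assumes a Drupal 7 site and a Drush
version that provides the vset (variable-set) command; the function name
set_java_path is hypothetical and only wraps that command with a sanity check.

```shell
# Store the path to the Java executable in the Drupal variable
# search_api_attachments_java. Run from inside the Drupal site directory.
set_java_path() {
  # Refuse paths that are not executable files, to avoid saving a typo.
  if [ ! -x "$1" ]; then
    echo "not an executable file: $1" >&2
    return 1
  fi
  drush vset search_api_attachments_java "$1"
}
```

Usage: set_java_path /path/to/your/java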

EXTRACTION CONFIGURATION (Solr)
-------------------------------

This requires the Solr search (search_api_solr) module and the Solr config
files that come with it.

Please follow the Solr search module instructions for configuring a Search API
Solr server.

EXTRACTION CONFIGURATION (Pdftotext)
------------------------------------

Pdftotext is a command-line utility included by default on many Linux
distributions. See the Wikipedia page for more info:
https://en.wikipedia.org/wiki/Pdftotext
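
To check whether pdftotext is available on your server and produces usable
text, a small shell sketch like the one below can be used. The function name
pdf_to_text is hypothetical; passing "-" as the output file name makes
pdftotext write the text to standard output instead of a file.

```shell
# Print the text of one PDF, or report that pdftotext is missing.
pdf_to_text() {
  if ! command -v pdftotext >/dev/null 2>&1; then
    echo "pdftotext not found in PATH" >&2
    return 1
  fi
  # "-" as the second argument sends the extracted text to standard output.
  pdftotext "$1" -
}
```

Usage: pdf_to_text /path/to/document.pdf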

EXTRACTION CONFIGURATION (python Pdf2txt)
-----------------------------------------

Install Pdf2txt (tested with package version 20110515+dfsg-1 and Python 2.7.9):
> sudo apt-get install python-pdfminer

SUBMODULES
----------
For each of these, find more details in the contrib folder.

search_api_attachments_comment
search_api_attachments_commerce_product_reference
search_api_attachments_entityreference
search_api_attachments_field_collections
search_api_attachments_links
search_api_attachments_multifield
search_api_attachments_multiple_entities
search_api_attachments_paragraphs
search_api_attachments_references
search_api_attachments_user_content

CACHING
-------
Extracting file contents can take a long time, and it may not be necessary to
repeat it each time a node is reindexed.
search_api_attachments has a cache bin that stores all extracted file
contents: this is the cache_search_api_attachments table.
Cache keys have the form 'cached_extraction_[fid]', where [fid] is the file
id.
When a file is deleted or updated, its cached extraction is dropped.
When the sitewide cache is cleared (drush cc all, for example), all cached
extractions are dropped only if the 'Preserve cached extractions across cache
clears.' option is unchecked in the module's configuration form.
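
The key format above can be used to check whether a given file already has a
cached extraction. The sketch below is an illustration only: the function names
cache_key and show_cached_extraction are hypothetical, and the query assumes
Drush is available, the database uses no table prefix, and the table has the
usual Drupal 7 cache columns (cid, created).

```shell
# Build the cache key for a file id (fid); reject non-numeric input.
cache_key() {
  case "$1" in
    ''|*[!0-9]*)
      echo "fid must be a number: $1" >&2
      return 1
      ;;
  esac
  printf 'cached_extraction_%s\n' "$1"
}

# Show the cache row for one file, if any.
# Run from inside the Drupal site directory.
show_cached_extraction() {
  key=$(cache_key "$1") || return 1
  drush sql-query "SELECT cid, created FROM cache_search_api_attachments WHERE cid = '$key'"
}
```

Usage: show_cached_extraction 42
An empty result means nothing is cached yet for that file.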

DEVELOPMENT
-----------
On the admin form of Search API Attachments, you can enable the debug feature.
It adds a lot of information to the watchdog log while indexing.

HIDDEN FEATURES
---------------
This module offers a Views filter that lets you choose whether searches also
cover attachment files.

HOOKS
-----
This module provides hook_search_api_attachments_indexable.
See more details in search_api_attachments.api.php

  90. See more details in search_api_attachments.api.php