INTRODUCTION
------------
Search API Attachments

This module extracts the content of attached files using one of the following
methods:
 - the Tika App JAR
 - the Tika Server JAR
 - the built-in Solr extractor
 - the Pdftotext command line tool
 - the Python Pdf2txt extractor
 - the docconv extractor
and indexes it.
Search API Attachments can index many file formats.

REQUIREMENTS
------------
This module requires the search_api module to be enabled on your site.
Depending on the extraction method you choose, you may also need Java, Python
or Go on your server.

HOOKS
-----
This module provides hook_search_api_attachments_indexable.
See more details in search_api_attachments.api.php

MODULE INSTALLATION
-------------------
Copy search_api_attachments into your modules folder

Install the search_api_attachments module in your Drupal site

Go to the configuration: admin/config/search/search_api_attachments

Choose an extraction method and follow the instructions under the respective
heading below.

DEVELOPMENT
-----------
To generate a pareview.sh report, submit the form at https://bit.ly/2TmdFFz
To check the number of items in the search_api_attachments queue:
drush queue-list
Items are added to the queue table in the database.
To run the items in the queue: drush queue-run search_api_attachments

EXTRACTION CONFIGURATION: TIKA APP
----------------------------------
On Ubuntu 18.04

Install Java
> sudo apt-get install openjdk-8-jdk

Download Apache Tika App JAR: http://tika.apache.org/download.html
> wget http://mir2.ovh.net/ftp.apache.org/dist/tika/tika-app-1.18.jar

In the module configuration, enter the full path to the downloaded JAR on your
server, e.g. /var/apache-tika/tika-app-1.18.jar.
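Before pointing the module at the JAR, it can help to confirm that extraction
works from the command line. The following is a minimal sketch, not part of the
module: the tika_extract function and the TIKA_JAR variable are hypothetical
names, and the default path is the example path used above. It relies on the
--text option of the Tika App.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the Tika App JAR by hand on one file.
# TIKA_JAR is an assumed variable; it defaults to the example path above.
TIKA_JAR="${TIKA_JAR:-/var/apache-tika/tika-app-1.18.jar}"

tika_extract() {
  local jar="$1" file="$2"
  # Fail early with a clear message if the JAR path is wrong.
  if [ ! -f "$jar" ]; then
    echo "Tika JAR not found: $jar" >&2
    return 1
  fi
  # Java must be available in PATH for the JAR to run.
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found in PATH" >&2
    return 2
  fi
  # --text asks the Tika App to print the plain text content.
  java -jar "$jar" --text "$file"
}
```

Example call: tika_extract "$TIKA_JAR" /path/to/sample.pdf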

EXTRACTION CONFIGURATION: TIKA SERVER
-------------------------------------
On Ubuntu 18.04

Install Java
> sudo apt-get install openjdk-8-jdk

Download Apache Tika Server JAR: http://tika.apache.org/download.html
> wget https://www-eu.apache.org/dist/tika/tika-server-1.20.jar
OR
> wget https://www-us.apache.org/dist/tika/tika-server-1.20.jar

Launch Tika server
> java -jar tika-server-1.20.jar

Configure search_api_attachments to use it at the following path:
/admin/config/search/search_api_attachments

More info:
- https://wiki.apache.org/tika/TikaJAXRS
- https://github.com/apache/tika/tree/master/tika-server
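Once the server is running, a quick manual check is to send it a file over
HTTP. The sketch below assumes the server listens on localhost and on Tika
Server's default port 9998, and that curl is installed. The function names
tika_server_url and tika_server_extract are hypothetical and only serve this
example.

```shell
#!/usr/bin/env bash
# Build the URL of the /tika endpoint; host and port are assumptions
# (localhost and Tika Server's default port 9998).
tika_server_url() {
  local host="${1:-localhost}" port="${2:-9998}"
  printf 'http://%s:%s/tika\n' "$host" "$port"
}

# Upload a file with HTTP PUT and ask for the plain text back.
tika_server_extract() {
  local file="$1"
  curl -s -T "$file" -H "Accept: text/plain" "$(tika_server_url "$2" "$3")"
}
```

Example call: tika_server_extract /path/to/sample.pdf
If the server runs on another host or port, pass them as the second and third
arguments.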

EXTRACTION CONFIGURATION: SOLR
------------------------------
Install and configure the search_api_solr module
https://www.drupal.org/project/search_api_solr
Make sure to configure it as explained in its README.txt
Create at least one Solr server (/admin/config/search/search-api/add-server)
Now you can choose it from /admin/config/search/search_api_attachments


EXTRACTION CONFIGURATION: PDFTOTEXT
-----------------------------------
Pdftotext is a command line utility included by default on many Linux
distributions. See the Wikipedia page for more info:
https://en.wikipedia.org/wiki/Pdftotext
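
To check that pdftotext is installed and works on your server, you can run it
by hand. This is a minimal sketch with a hypothetical pdf_to_text function; it
passes "-" as the output file so that the text is printed to standard output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of a PDF using pdftotext.
pdf_to_text() {
  local pdf="$1"
  # Check the input first so that a wrong path gives a clear error.
  if [ ! -f "$pdf" ]; then
    echo "No such file: $pdf" >&2
    return 1
  fi
  if ! command -v pdftotext >/dev/null 2>&1; then
    echo "pdftotext is not installed" >&2
    return 127
  fi
  # "-" as the second argument sends the text to standard output.
  pdftotext "$pdf" -
}
```

Example call: pdf_to_text /path/to/sample.pdf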

EXTRACTION CONFIGURATION: PYTHON PDF2TXT
----------------------------------------
On Debian 8

Install Python or make sure you already have it
Get Pdf2txt (https://github.com/euske/pdfminer)
Install Pdf2txt as described in https://github.com/euske/pdfminer
or try the Debian package:
> sudo apt-get install python-pdfminer
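
Depending on how it was installed, the extractor may be available under a
different command name. The sketch below assumes the two names pdf2txt and
pdf2txt.py, which may not match your system; find_pdf2txt is a hypothetical
helper that prints the first one found in PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: locate the Pdf2txt command.
# The candidate names below are assumptions about common installs.
find_pdf2txt() {
  local name
  for name in pdf2txt pdf2txt.py; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "$name"
      return 0
    fi
  done
  echo "pdf2txt not found in PATH" >&2
  return 1
}
```

Example call: "$(find_pdf2txt)" /path/to/sample.pdf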

EXTRACTION CONFIGURATION: GO DOCCONV
------------------------------------
Install Go (golang) or make sure you already have it
Get docconv (https://github.com/sajari/docconv)
Install docconv as described in https://github.com/sajari/docconv


SIMPLE USAGE EXAMPLE 1: FILE FIELDS CONTENT: FILE ENTITIES
----------------------------------------------------------
0) This was tested with:
   drupal 8.8.x
   search_api 8.x-1.x
   search_api_attachments 8.x-1.x

1) Install Drupal, search_api, search_api_db and search_api_attachments.

2) Go to admin/structure/types/manage/article/fields/add-field and add a
   file field 'My pdfs' (field_my_pdfs).

3) Go to node/add/article and add an article node with a pdf.

4) Go to admin/config/search/search_api_attachments and configure the
   Tika extractor.

5) Go to admin/config/search/search-api/add-server and add server 'My server'
   (my_server) with the default Database Backend.

6) Go to admin/config/search/search-api/add-index and add a new index 'My index'
   (my_index) with 'Content' as Data source and 'My server' as Server.

7) Go to admin/config/search/search-api/index/my_index/processors and enable
   the File attachments processor.

8) Go to admin/config/search/search-api/index/my_index/fields/add/nojs and:
   - in the General section, add the "Search api attachments: My pdfs" field.
   - in the Content section, add the "Title".
   - in the Content section, add the "Body".

9) Go to /admin/config/search/search-api/index/my_index/fields to configure
   "Search api attachments: My pdfs" and "Title" to Fulltext.

10) Go to admin/structure/views/add and add a Page view:
    - View name: SAA
    - View settings: Show: Index My index
    - Page settings: Check Create a page with title and path 'saa' that
      displays "Rendered entity" format.
      (The "Search results" format does not seem to work for now.)

11) Add a filter to the view: the 'Fulltext search' with
    - Operator: Contains any of these words
    - Check the Expose checkbox

12) Go to admin/structure/views/view/saa and in the "Exposed Form" section (in
    the ADVANCED section), hit the 'Basic' link and choose 'Input required'
    so that the view doesn't display any default results.

13) Go to admin/config/search/search-api/index/my_index and Index items.

14) Go to /saa and search for any term in the title, body or in the pdf file :)
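
Steps 1 and 13 above can also be done from the command line with Drush. The
following is a sketch only: it assumes Drush is installed for the site, and
the DRY_RUN variable and the run and index_attachments functions are
hypothetical helpers for this example. By default it only prints the commands
it would run.

```shell
#!/usr/bin/env bash
# DRY_RUN is a hypothetical switch: "1" prints the commands, "0" runs them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

index_attachments() {
  # my_index is the index machine name used in the steps above.
  local index_id="${1:-my_index}"
  # Step 1: enable the modules.
  run drush pm:enable -y search_api search_api_db search_api_attachments
  # Step 13: index the items of the given index.
  run drush search-api:index "$index_id"
  # Process queued items, as described in the DEVELOPMENT section.
  run drush queue-run search_api_attachments
}
```

Example call: DRY_RUN=0 index_attachments my_index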



SIMPLE USAGE EXAMPLE 2: MEDIA FIELDS CONTENT : MEDIA ENTITIES OF TYPE FILE
--------------------------------------------------------------------------
0) This was tested with:
   drupal 8.8.x
   search_api 8.x-1.x
   search_api_attachments 8.x-1.x

1) Install Drupal, media, search_api, search_api_db and search_api_attachments.

2) Go to admin/structure/types/manage/article/fields/add-field and add a
   media field 'My medias' (field_my_medias).
   (choose File in the Media type settings)

3) Go to media/add/file and add a media entity with a pdf file.

4) Go to node/add/article and add an article node that references the media
   entity created at step 3

5) Configure the extractor at admin/config/search/search_api_attachments, then
   go to admin/config/search/search-api/add-server and add server 'My server'
   (my_server) with the default Database Backend.

6) Go to admin/config/search/search-api/add-index and add a new index 'My index'
   (my_index) with 'Content' as Data source and 'My server' as Server.

7) Go to admin/config/search/search-api/index/my_index/processors and enable
   the File attachments processor.

8) Go to admin/config/search/search-api/index/my_index/fields/add/nojs and:
   - in the General section, add the "Search api attachments: My medias" field.
   - in the Content section, add the "Title".
   - in the Content section, add the "Body".

9) Go to /admin/config/search/search-api/index/my_index/fields to configure
   "Search api attachments: My medias" and "Title" to Fulltext.

10) Go to admin/structure/views/add and add a Page view:
    - View name: SAA
    - View settings: Show: Index My index
    - Page settings: Check Create a page with title and path 'saa' that
      displays "Rendered entity" format.
      (The "Search results" format does not seem to work for now.)

11) Add a filter to the view: the 'Fulltext search' with
    - Operator: Contains any of these words
    - Check the Expose checkbox

12) Go to admin/structure/views/view/saa and in the "Exposed Form" section (in
    the ADVANCED section), hit the 'Basic' link and choose 'Input required'
    so that the view doesn't display any default results.

13) Go to admin/config/search/search-api/index/my_index and Index items.

14) Go to /saa and search for any term in the title, body or in the pdf file :)
