Apache Solr Attachments for 7.x

Requires either the ability to run Java and an installation of Tika 0.3 or
higher, or access to a Solr server set up for content extraction (e.g. a Solr
1.4 final release).  For Solr, there is a patch to apply to solrconfig.xml
that adds another request handler.

If Apache Solr Attachments was added after your content had already been
indexed, re-indexing is necessary. To check that it is working, use the Solr
search at <your-domain>:<your-port>/solr/admin/ and query for something like
url:*.pdf or url:*.doc. The results will show whether any files were indexed.
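
As a rough sketch of the same check from the command line, the helper below
builds a query URL that can be handed to curl. The function name, the host and
port values, and the /solr/select path (assumed here for a default single-core
Solr) are illustrative assumptions, not something this module provides.

```shell
# Hypothetical helper: print the select URL for a query against Solr.
# The /solr/select path is an assumption about a default single-core setup.
solr_query_url() {
  host="$1"    # e.g. localhost
  port="$2"    # e.g. 8983
  query="$3"   # e.g. 'url:*.pdf' (quote it so the shell does not expand *)
  printf 'http://%s:%s/solr/select?q=%s\n' "$host" "$port" "$query"
}

# Usage (needs curl and a running Solr server):
#   curl "$(solr_query_url localhost 8983 'url:*.pdf')"
```

If the response lists no documents, the attachments have most likely not been
indexed yet.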

See:
http://lucene.apache.org/tika/gettingstarted.html
http://lucene.apache.org/tika/formats.html

Tika can extract text from many file formats, including PDF and MS Office
(the 2003 formats as well as the newer docx format).  Java 6 or 7 may be
needed on some platforms to support all formats.  The formats page does not
seem to be fully up to date.  In particular,
https://issues.apache.org/jira/browse/TIKA-152 has been committed, so Tika
currently supports MS Office 2007 documents to a reasonable degree.

The easiest-to-find pre-built Tika app is available from the download page:
http://tika.apache.org/download.html

e.g. tika-app-1.1.jar

You can copy/move the jar to somewhere convenient, though it's probably a good idea
to keep it outside your docroot.
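
Before configuring the module, it can help to confirm that Java and the jar
work together on a sample file. The following is a minimal sketch: the
function name and the default jar location are assumptions, so adjust the
path to wherever you placed the jar.

```shell
# Assumed jar location; override by setting TIKA_JAR before calling.
TIKA_JAR="${TIKA_JAR:-/usr/local/share/tika/tika-app-1.1.jar}"

# Hypothetical helper: print the plain text Tika extracts from one file.
tika_text() {
  if [ ! -f "$TIKA_JAR" ]; then
    echo "Tika jar not found: $TIKA_JAR" >&2
    return 1
  fi
  # --text asks the Tika app to write plain text to stdout.
  java -jar "$TIKA_JAR" --text "$1"
}

# Usage:
#   tika_text /path/to/sample.pdf
```

If this prints the document text, the same Java and jar should be usable by
the module.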

Solr 1.4.1 uses Tika 0.4, 1.4.2-dev uses Tika 0.7, Solr 3.5.0 uses Tika 0.10,
and newer releases of Solr are likely to use 1.1+.  Note that the version
numbering changed, so 0.10 was followed by 1.0 and 1.1.

If you need to build Tika from source, use Maven (mvn).  Get the Tika
source from:
http://lucene.apache.org/tika/download.html

You may need to increase the memory for java/mvn using (for example):
export MAVEN_OPTS="-Xmx1024m -Xms512m"

mvn install

will build the full set of tika applications - it will build the app jar
in a location like tika-app/target/tika-app-1.1.jar

Copy tika-app-1.1.jar from there or point the module path to it.

See also build instructions at: http://drupal.org/node/540974#comment-1944082

If you are using Solr to extract your content, you need to copy (or symlink)
the contents of contrib/extraction/lib to a directory named lib under your
Solr home, or alter solrconfig.xml to add the original directory as a
lib directory.
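
The symlink option could be scripted roughly as follows. This is only a
sketch: the function name is made up, and both directory arguments are
assumptions about where your Solr distribution and Solr home live.

```shell
# Hypothetical helper: symlink the extraction jars into <solr home>/lib.
# Pass absolute paths so the symlinks resolve from inside the lib directory.
link_extraction_libs() {
  solr_dist="$1"   # unpacked Solr distribution (contains contrib/)
  solr_home="$2"   # your Solr home directory
  mkdir -p "$solr_home/lib" || return 1
  for jar in "$solr_dist"/contrib/extraction/lib/*.jar; do
    [ -e "$jar" ] || continue   # nothing matched the glob, skip
    ln -sf "$jar" "$solr_home/lib/"
  done
}

# Usage:
#   link_extraction_libs /opt/apache-solr /var/solr
```

Restart Solr afterwards so the newly linked jars are picked up.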
