Apache Tika powered updates to Alfresco (Part 2) [Jan. 30th, 2013|03:04 pm]
Nick

This post was originally written in September 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/09/24/apache-tika-powered-updates-to-alfresco-part-2/ - I have reposted it here as that blog is no longer online.

In Part 1, we learnt a little about what Apache Tika is and does. In this part, we’ll see what new features using Tika gives us in Cheetah.

On the metadata side, Tika delivers three important things for us. These are support for a wider range of formats (see below), enhanced ease of adding custom parsers (you can just spring in a bean with the class name of your parser and you’re done), and consistent metadata.

This last one is less of an issue for Alfresco users than for many other users, but is a real issue for extractor developers. Within Alfresco, we always map the raw properties onto ones in the content model, but this is handled at the extractor level. As such, it shouldn’t matter to the user if one document format has a “last author”, another “last editor” and a third “last edited by”; they’ll all turn up in the same property in Alfresco. However, the extractor writer has to know about this in order to provide the mapping, which makes writing an extractor harder and increases the chance of error.

Within Tika, there is a set list of common metadata keys, and each Tika parser internally maps its properties onto these. As such, when you receive your metadata back from Tika, it all looks the same no matter what file you got it from. If the metadata is a date, then Tika will also take care of converting it to a common format, so you don’t have to worry about parsing a dozen different date representations.
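To make the idea concrete, here is a minimal sketch of what that kind of normalisation looks like. It is not Tika's code: the class name, the raw key names, the common key names ("last-author", "created") and the date layouts are all made up for illustration, and it uses only the JDK.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of parser-side normalisation; not Tika's actual code.
class CommonMetadataSketch {
    // Assumed format-specific key names (lower-cased) mapped to assumed common keys.
    private static final Map<String, String> KEY_ALIASES = Map.of(
        "last author", "last-author",
        "last editor", "last-author",
        "last edited by", "last-author",
        "created", "created",
        "creation date", "created");

    // A few date layouts a raw format might use (assumed for the example).
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
        DateTimeFormatter.ISO_LOCAL_DATE,
        DateTimeFormatter.ofPattern("dd/MM/uuuu", Locale.ROOT),
        DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH));

    // Maps raw, format-specific metadata onto the common keys.
    static Map<String, String> normalise(Map<String, String> raw) {
        Map<String, String> common = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String key = KEY_ALIASES.get(entry.getKey().toLowerCase(Locale.ROOT));
            if (key == null) {
                continue; // unknown raw keys are simply dropped in this sketch
            }
            String value = key.equals("created") ? toIsoDate(entry.getValue()) : entry.getValue();
            common.put(key, value);
        }
        return common;
    }

    // Tries each known layout in turn and returns an ISO 8601 date (yyyy-MM-dd).
    static String toIsoDate(String text) {
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return LocalDate.parse(text.trim(), format).toString();
            } catch (DateTimeParseException ignored) {
                // not this layout, try the next one
            }
        }
        return text; // leave unrecognised values untouched
    }
}
```

With this in place, a document reporting "Last Editor" and another reporting "Last Edited By" both end up under the same key, and "24/09/2010" and "24 September 2010" both come back as 2010-09-24, so the code consuming the metadata only has to deal with one shape.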

Finally, because the metadata is in a common format, we can more easily map it to the content model. Thus, in Alfresco 3.4, we see most of the common extractors have a wider range of metadata mappings to the content model as standard. One big example of this is in the case of images – EXIF tags are now automatically extracted and mapped onto the content model, and if the image was geotagged, then the location of the image is also mapped onto the content model.

Give it a try – upload a geotagged image to Alfresco share in 3.4, and see all the new metadata that shows up such as the location, camera, focal length and more!

In the past, most text extractors used in Alfresco were only able to produce plain text. However, all the Tika parsers generate XHTML SAX events, and so are able to produce not only plain text, but also HTML and XHTML. Also, since the XHTML from Tika is a true XML document, we can make use of XSLT to chain transformations.
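As a rough illustration of chaining XSLT over XHTML, the sketch below runs two stylesheets back to back using only the JDK's javax.xml.transform classes. The class name, both stylesheets and the intermediate "toc" format are invented for the example; nothing here is Alfresco or Tika code, and any well-formed XHTML string would do as input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: two XSLT steps chained over an XHTML document.
class XsltChainSketch {
    // Step 1: pull the XHTML headings out into a small made-up <toc> document.
    static final String HEADINGS_TO_TOC =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:h='http://www.w3.org/1999/xhtml' exclude-result-prefixes='h'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'><toc>"
        + "<xsl:for-each select='//h:h1'><entry><xsl:value-of select='.'/></entry></xsl:for-each>"
        + "</toc></xsl:template>"
        + "</xsl:stylesheet>";

    // Step 2: render that <toc> document as numbered plain-text lines.
    static final String TOC_TO_TEXT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/toc'><xsl:for-each select='entry'>"
        + "<xsl:value-of select='position()'/><xsl:text>. </xsl:text>"
        + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each></xsl:template>"
        + "</xsl:stylesheet>";

    // Applies one stylesheet to one XML string and returns the result as a string.
    static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT step failed", e);
        }
    }

    // The chain: the output of the first transformation is the input of the second.
    static String outline(String xhtml) {
        return transform(transform(xhtml, HEADINGS_TO_TOC), TOC_TO_TEXT);
    }
}
```

Given an XHTML document whose body contains the headings "Intro" and "Summary", outline(...) should produce the two lines "1. Intro" and "2. Summary". The chain only works because the input is well-formed XML in a known namespace, which is the property the paragraph above is pointing at.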

The immediate benefit then is that all plain text content transformers that are powered by Tika can deliver an HTML version at no effort. Thus, HTML versions of PDFs, Word Documents etc can now be requested.
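The reason this comes almost for free is that the output format is decided by whichever SAX handler receives the events, not by the parser. The sketch below shows the principle with JDK classes only: fakeParse is a stand-in for a real parser (it just fires a fixed set of XHTML events), and the class and method names are invented for the example rather than taken from Tika or Alfresco.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: one stream of XHTML SAX events, two different outputs.
class HandlerChoiceSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stand-in for a parser: fires XHTML SAX events for a tiny fixed document.
    static void fakeParse(ContentHandler h) throws SAXException {
        h.startDocument();
        h.startPrefixMapping("", XHTML);
        open(h, "html");
        open(h, "head"); open(h, "title"); text(h, "Report"); close(h, "title"); close(h, "head");
        open(h, "body"); open(h, "p"); text(h, "Hello world"); close(h, "p"); close(h, "body");
        close(h, "html");
        h.endPrefixMapping("");
        h.endDocument();
    }

    static void open(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    static void close(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    static void text(ContentHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // Handler 1: ignore the markup, keep the characters, one line per title or paragraph.
    static String asPlainText() {
        final StringBuilder out = new StringBuilder();
        DefaultHandler textOnly = new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (local.equals("p") || local.equals("title")) {
                    out.append('\n');
                }
            }
        };
        try {
            fakeParse(textOnly);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    // Handler 2: the JDK's identity TransformerHandler serialises the same events as markup.
    static String asXhtml() {
        try {
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serialiser = factory.newTransformerHandler();
            serialiser.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serialiser.setResult(new StreamResult(out)); // must be set before the first event
            fakeParse(serialiser);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both methods call the same fakeParse; the only difference is the handler passed in. That is the design that lets a plain text transformer grow an HTML variant without touching the parsing code.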

(At the moment, the HTML generated is very clean, but not always all that complex. The Tika community is gradually improving the markup generated to include more meaning, especially semantic information, and Alfresco is pleased to be involved in this effort)

Does being able to generate XHTML help that much? I’d say yes! With the forthcoming WCM Quick Start, we’ll shortly be adding some features around HTML versions of some kinds of uploaded documents. Using Tika, we were able to implement this feature very quickly, allowing us to concentrate the developer time on enhancing Tika. Next up, in some cases we wanted a whole XHTML document, and in others we only wanted the body content. Using Tika and the SAX handlers, it’s a one line change to toggle between the whole document and just the body contents, by picking a different transform handler. Finally, the output is XHTML, so for demos we’ve been able to use XSLT and E4X (from within a script action) to effortlessly manipulate the content.
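The whole-document versus body-only toggle can be sketched with a small wrapping handler. This is a JDK-only stand-in, not the handlers Tika or Alfresco actually use: BodyOnly, TagEcho and render are invented names, and TagEcho is deliberately simplistic (no attributes, no escaping) so that the difference in output is easy to see.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: choose between whole-document and body-only output.
class BodyOnlySketch {
    // Forwards only the events that occur inside <body>; the wrapper and <head> are dropped.
    static class BodyOnly extends DefaultHandler {
        private final ContentHandler target;
        private boolean inBody;

        BodyOnly(ContentHandler target) {
            this.target = target;
        }

        @Override public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (inBody) {
                target.startElement(uri, local, qName, atts);
            } else if (local.equals("body")) {
                inBody = true;
            }
        }

        @Override public void endElement(String uri, String local, String qName) throws SAXException {
            if (local.equals("body")) {
                inBody = false;
            } else if (inBody) {
                target.endElement(uri, local, qName);
            }
        }

        @Override public void characters(char[] ch, int start, int length) throws SAXException {
            if (inBody) {
                target.characters(ch, start, length);
            }
        }
    }

    // Very simplistic serialiser: echoes tags and text, ignoring attributes and escaping.
    static class TagEcho extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(local).append('>');
        }

        @Override public void endElement(String uri, String local, String qName) {
            out.append("</").append(local).append('>');
        }

        @Override public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static String render(String xhtml, boolean bodyOnly) {
        TagEcho echo = new TagEcho();
        // The "one line change": wrap the handler, or don't.
        ContentHandler handler = bodyOnly ? new BodyOnly(echo) : echo;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local names are reported
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return echo.out.toString();
    }
}
```

For a document with a title in the head and a single paragraph in the body, render(doc, false) echoes every element from html downwards, while render(doc, true) gives back just the paragraph. Everything else in the pipeline stays the same, which is why switching between the two is so cheap.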

Finally, as mentioned, using Tika delivers us support for a large number of new file formats. The current list of files supported via Tika is:

  • Audio – wav, riff, midi
  • DWG (CAD files)
  • Epub
  • RSS and ATOM feeds
  • TrueType Fonts
  • HTML
  • Image – JPEG, PNG, GIF, TIFF and Bitmap (including EXIF information where found)
  • iWork (Keynote, Pages etc)
  • Mbox mail
  • Microsoft Office – Word, PowerPoint, Excel, Visio, Outlook, Publisher, Works
  • Microsoft Office OOXML – Word (docx), PowerPoint (pptx), Excel (xlsx)
  • MP3 (ID3 v1 and v2)
  • CDF (scientific data)
  • Open Document Format
  • PDF
  • Zip and Tar archives
  • RDF
  • Plain Text
  • FLV Video
  • XML
  • Java class files

What’s more, generally it’s just a case of dropping new Tika jars into Alfresco with little/no configuration changes, so we can look forward to easy addition of new formats with each new Alfresco release as the Tika support grows!

In Part 3, we will look at mapping between Tika’s common metadata, and the Alfresco content model.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.
