Nick (gagravarr) wrote,

Apache Tika and Alfresco – Part 1

This post was originally written in September 2010 for a blog that is no longer online, so I have reposted it here.

For the forthcoming Project Cheetah release, there have been a number of improvements to Metadata Extraction and Content Transformations. These improvements have been delivered by using Apache Tika to power many of the standard extractors and transformers.

In this series of blog posts, we’ll be looking at what Apache Tika is and what it does, how it fits into Alfresco, what new features it has delivered, how you can customise how Tika works, and how you can add new Tika parsers to easily support new formats.

The idea for Apache Tika was hatched in 2006, largely by people involved in Apache Lucene, who were struggling to sensibly index all of their documents. The project went through the Apache Incubator and, after a period of time as a Lucene sub-project, became its own top-level Apache project in 2010. Tika is used by people indexing content, spidering the web, doing NLP and text processing, as well as with content repositories.

For all these use cases, the problems are largely the same. You start with a number of documents in a variety of formats. You wish to know what they are, and hence which libraries may be useful in processing them. You then want to get some consistent metadata out of them, and possibly a rich textual representation of the content. You also probably wanted all of this yesterday!

(As a side note, Alfresco users have historically been in a more fortunate position than most when faced with these challenges, as the Metadata Extractor and Content Transformation services have handled most of these for you.)

What services does Tika provide then?

Firstly, Tika offers content and language detection. Through this, you can pass Tika a piece of unknown content, and get back information on what kind of file it is (e.g. PDF, docx), along with the character encoding and language of its text (e.g. UTF-8, English). Within Alfresco we tend to already know this information, so as yet we don’t make much use of detection.
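To give a feel for what content detection involves, here is a minimal Java sketch that guesses a type from the first few bytes of a file. It is not Tika's implementation or API: the class name MagicSniffer and its tiny signature table are assumptions made for illustration. A real detector knows many more signatures and also looks at filenames and inside containers (a docx is a zip file, so this sketch would only report it as a zip).

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not part of Tika.
final class MagicSniffer {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};

    private MagicSniffer() {}

    // Returns a MIME type guessed from the leading bytes of the content.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, PNG)) return "image/png";
        return "application/octet-stream"; // unknown content
    }

    // True if 'data' begins with every byte of 'prefix'.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Calling MagicSniffer.detect with the opening bytes of a PDF gives back "application/pdf", while anything unrecognised falls through to the generic "application/octet-stream".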

Secondly, through the parser system, Tika provides access to the metadata of the document. You can use Tika to find out the last author of a Word file, the title of an HTML page, or even the location where a geo-tagged image was taken. In addition, Tika provides a consistent view across the different formats’ metadata, mapping internally from document-specific to general metadata entries. As such, you don’t need to know whether a format uses “last author”, “last editor” or “last edited by”; Tika always provides the same information in the same way. We’ll see more on using Tika for metadata in part 2.
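The mapping from document-specific to general metadata entries can be pictured with a small Java sketch. This is only an illustration of the idea, not how Tika does it internally: the class MetadataNormaliser, the alias table and the common key "last-author" are all hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical names, not Tika's metadata keys or API.
final class MetadataNormaliser {
    // Format-specific key (lower case) -> assumed common key.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("last author", "last-author");
        ALIASES.put("last editor", "last-author");
        ALIASES.put("last edited by", "last-author");
    }

    private MetadataNormaliser() {}

    // Returns a copy of 'raw' with known aliases renamed to the common key.
    // Unknown keys are kept, lower-cased and trimmed.
    static Map<String, String> normalise(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String key = entry.getKey().trim().toLowerCase(Locale.ROOT);
            out.put(ALIASES.getOrDefault(key, key), entry.getValue());
        }
        return out;
    }
}
```

Whichever of the three spellings a format happens to use, the caller only ever has to look up "last-author" in the normalised map.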

Thirdly, through the parsers, Tika provides access to the textual content of files. The text is available as plain text, HTML and XHTML, with the latter offering options for onward transformations through SAX and XSLT to additional representations. This can be used for full text indexing, for web previews, and much more. Again, in part 2 we’ll see how this is being used in Alfresco.
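As a rough idea of what an onward transformation through SAX can look like, the Java sketch below takes a piece of XHTML and reduces it to plain text using only the JDK's own SAX parser. The XHTML is supplied by hand here rather than produced by Tika, and the class name XhtmlToText and its choice of block elements are assumptions for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a hypothetical SAX consumer of XHTML, not a Tika class.
final class XhtmlToText {
    private XhtmlToText() {}

    // Collects the text inside <body>, ending each paragraph or heading with a newline.
    static String toPlainText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inBody = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("body")) inBody = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("body")) {
                    inBody = false;
                } else if (inBody && (qName.equals("p") || qName.matches("h[1-6]"))) {
                    text.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inBody) text.append(ch, start, length);
            }
        };
        try {
            // The default factory is not namespace-aware, so qName is the plain element name.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return text.toString();
    }
}
```

The same handler pattern could just as easily write out a different markup or feed an indexer; the point is that once the content is a stream of SAX events, the consumer does not need to know what the original file format was.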

Finally, Tika provides access to the embedded resources within files. This could be two images embedded in a Word document, an Excel spreadsheet held within a PowerPoint file, or even half a dozen PDFs contained within a zip file. This is quite a new Tika feature, and we’ll hopefully be making more use of it in the future. For now, it offers the adventurous a consistent way to get at resources inside other files.
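For the simplest of those cases, files held inside a zip, the idea of walking embedded resources can be sketched with nothing but the JDK. This is not Tika's embedded-resource API, which handles many container formats in one consistent way; the class ZipEmbeddedLister and its sampleZip helper are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative only: a hypothetical helper, not part of Tika.
final class ZipEmbeddedLister {
    private ZipEmbeddedLister() {}

    // Returns the names of the files embedded in the given zip, in stored order.
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!entry.isDirectory()) names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a small in-memory zip with one placeholder entry per name, for trying the lister out.
    static byte[] sampleZip(String... names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8)); // placeholder content
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

A caller could hand each entry's bytes on to a detector and parser in turn, which is roughly the shape of processing that a consistent embedded-resource mechanism makes possible across other container formats too.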

In Part 2, we’ll look at the new features and support that Tika delivers to Cheetah.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.
Tags: alfresco, tika
