January 30th, 2013

Apache Tika and Alfresco – Part 1

This post was originally written in September 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/09/24/apache-tika-and-alfresco-part-1/ - I have reposted it here as that blog is no longer online.

For the forthcoming Project Cheetah release, there have been a number of improvements to Metadata Extraction and Content Transformations. These improvements have been delivered by using Apache Tika to power many of the standard extractors and transformers.

In this series of blog posts, we’ll be looking at what Apache Tika is and what it does, how it fits into Alfresco, what new features it has delivered, how you can customise how Tika works, and how you can add new Tika parsers to easily support new formats.

The idea for Apache Tika was hatched in 2006, largely by people involved in Apache Lucene who were struggling to sensibly index all of their documents. The project went through the Apache Incubator, and after a period of time as a Lucene sub-project, in 2010 became its own top-level Apache project. Tika is used by people indexing content, spidering the web, doing NLP and text processing, as well as with content repositories.

For all these use cases, the problems are largely the same. You start with a number of documents in a variety of formats. You wish to know what they are, and hence which libraries may be useful in processing them. You then want to get some consistent metadata out of them, and possibly a rich textual representation of the content. You also probably wanted all of this yesterday!

(As a side note, Alfresco users have historically been in a more fortunate position than most when faced with these challenges, as the Metadata Extractor and Content Transformation services have handled most of these for you.)

What services does Tika provide then?

Firstly, Tika offers content and language detection. Through this, you can pass Tika a piece of unknown content, and get back information on what kind of file it is (eg PDF, DOCX), along with what language the text is written in (eg UTF-8 encoded English). Within Alfresco we tend to already know this information, so as yet we don’t make much use of detection.
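
If you’re curious what that looks like in raw Tika, here’s a minimal sketch using the Tika facade class and the LanguageIdentifier from the Tika releases of this era (the file name is made up, and newer Tika versions have moved language detection elsewhere):

import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

public class DetectionExample {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();
    File file = new File("mystery-file.bin"); // hypothetical input

    // Detect the content type from magic bytes plus the file name hint
    String mimeType = tika.detect(file);

    // Language detection works on extracted text, not the raw bytes
    String text = tika.parseToString(file);
    String language = new LanguageIdentifier(text).getLanguage();

    System.out.println(mimeType + ", " + language);
  }
}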

Secondly, through the parser system, Tika provides access to the metadata of the document. You can use Tika to find out the last author of a Word file, the title of an HTML page, or even the location where a geo-tagged image was taken. In addition, Tika provides a consistent view across the different formats’ metadata, mapping internally from document-specific to general metadata entries. As such, you don’t need to know whether a format uses “last author”, “last editor” or “last edited by”; Tika always provides the same information. We’ll see more on using Tika for metadata in part 2.
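
To give a flavour of the underlying API, here’s a minimal sketch (the InputStream is whatever your content source provides):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataExample {
  public static void printMetadata(InputStream stream) throws Exception {
    Metadata metadata = new Metadata();
    // AutoDetectParser works out the format and hands off to the right
    // parser, which fills in the shared metadata keys as it goes
    new AutoDetectParser().parse(stream, new BodyContentHandler(),
        metadata, new ParseContext());
    for (String name : metadata.names()) {
      System.out.println(name + " = " + metadata.get(name));
    }
  }
}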

Thirdly, through the parsers, Tika provides access to the textual content of files. The text is available as plain text, HTML and XHTML, with the latter offering options for onward transformations through SAX and XSLT to additional representations. This can be used for full text indexing, for web previews, and much more. Again in part 2 we’ll see how this is being used in Alfresco.
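
Getting at the text is much the same, as in this sketch (BodyContentHandler is the tika-core class that flattens the body of the generated XHTML down to plain text):

import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TextExample {
  public static String extractText(InputStream stream) throws Exception {
    // The parser emits XHTML SAX events; this handler collects just the
    // body of that XHTML as plain text into the writer
    StringWriter writer = new StringWriter();
    new AutoDetectParser().parse(stream, new BodyContentHandler(writer),
        new Metadata(), new ParseContext());
    return writer.toString();
  }
}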

Finally, Tika provides access to the embedded resources within files. This could be two images embedded in a Word document, an Excel spreadsheet held within a PowerPoint file, or even half a dozen PDFs contained within a zip file. This is quite a new Tika feature, and we’ll hopefully be making more use of it in the future. For now, it offers the adventurous a consistent way to get at resources inside other files.
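
For the adventurous, one hedged sketch of the recursive variant: by convention, registering a Parser in the ParseContext is what tells container parsers (zip files, documents with embedded images and so on) to descend into their children, so their text appears in the output too. Getting at the raw embedded bytes instead goes through Tika’s embedded document extraction hooks, which are beyond this sketch:

import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveTextExample {
  public static String extractAll(InputStream stream) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // Without this line, container formats list their children but don't
    // parse them; with it, they recurse into each embedded document
    context.set(Parser.class, parser);
    StringWriter writer = new StringWriter();
    parser.parse(stream, new BodyContentHandler(writer),
        new Metadata(), context);
    return writer.toString();
  }
}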


In Part 2, we’ll look at the new features and support that Tika delivers to Cheetah.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.

Apache Tika powered updates to Alfresco (Part 2)

This post was originally written in September 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/09/24/apache-tika-powered-updates-to-alfresco-part-2/ - I have reposted it here as that blog is no longer online.

In Part 1, we learnt a little about what Apache Tika is and does. In this part, we’ll see what new features using Tika gives us in Cheetah.

On the metadata side, Tika delivers three important things for us. These are support for a wider range of formats (see below), enhanced ease of adding custom parsers (you can just spring in a bean with the class name of your parser and you’re done), and consistent metadata.

This last one is less of an issue for Alfresco users than for many other users, but is a real issue for extractor developers. Within Alfresco, we always map the raw properties onto ones in the content model, but this is handled at the extractor level. As such, it shouldn’t matter to the user whether one document format has a “last author”, another a “last editor” and a third a “last edited by”; they’ll all turn up in the same property in Alfresco. However, the extractor writer has to know about all of these to provide the mapping, which makes writing an extractor harder and increases the chance of error.

Within Tika, there is a set list of common metadata keys, and each Tika parser internally maps its properties onto these. As such, when you receive your metadata back from Tika, it all looks the same no matter what file you got it from. If the metadata is a date, then Tika will also take care of converting it to a common format, so you don’t have to worry about parsing a dozen different date representations.
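
As a sketch of what this means in practice (Metadata.LAST_AUTHOR was one of the shared keys in the Tika releases of this era; the exact constant names have moved around in later versions):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class LastAuthorExample {
  public static String lastAuthor(InputStream stream) throws Exception {
    Metadata metadata = new Metadata();
    new AutoDetectParser().parse(stream, new BodyContentHandler(),
        metadata, new ParseContext());
    // The same key works for .doc, .docx, .odt and friends - each parser
    // has already mapped its format-specific name onto the common one
    return metadata.get(Metadata.LAST_AUTHOR);
  }
}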

Finally, because the metadata is in a common format, we can more easily map it to the content model. Thus, in Alfresco 3.4, we see most of the common extractors have a wider range of metadata mappings to the content model as standard. One big example of this is in the case of images – EXIF tags are now automatically extracted and mapped onto the content model, and if the image was geotagged, then the location of the image is also mapped onto the content model.

Give it a try – upload a geotagged image to Alfresco Share in 3.4, and see all the new metadata that shows up, such as the location, camera, focal length and more!

In the past, most of the text extractors used in Alfresco were only able to produce plain text. However, all the Tika parsers generate XHTML SAX events, and so are able to produce not only plain text, but also HTML and XHTML. Also, since the XHTML Tika generates is a true XML document, we can make use of XSLT to chain transformations.

The immediate benefit then is that all plain text content transformers that are powered by Tika can deliver an HTML version with no extra effort. Thus, HTML versions of PDFs, Word documents etc can now be requested.

(At the moment, the HTML generated is very clean, but not always all that complex. The Tika community is gradually improving the markup generated to include more meaning, especially semantic information, and Alfresco is pleased to be involved in this effort.)

Does being able to generate XHTML help that much? I’d say yes! With the forthcoming WCM Quick Start, we’ll shortly be adding some features around HTML versions of some kinds of uploaded documents. Using Tika, we were able to implement this feature very quickly, allowing us to concentrate the developer time on enhancing Tika. Next up, for some cases we wanted a whole XHTML document, and for others we only wanted the body content. Using Tika and the SAX handlers, it’s a one line change to toggle between the whole document and just the body contents, by picking a different transform handler. Finally, the output is XHTML, so for demos we’ve been able to use XSLT and E4X (from within a script action) to effortlessly manipulate the content.
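
That one line toggle looks roughly like this (a sketch using the tika-core handler classes; exact class names have varied a little between Tika versions):

import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

public class HandlerChoiceExample {
  public static ContentHandler handlerFor(boolean wholeDocument) {
    // ToXMLContentHandler serialises every SAX event back out as XML,
    // capturing the complete XHTML document
    ContentHandler xml = new ToXMLContentHandler();
    // Wrapping it in BodyContentHandler filters the event stream down to
    // just what is inside <body> - that's the whole "one line change"
    return wholeDocument ? xml : new BodyContentHandler(xml);
  }
}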

Finally, as mentioned, using Tika delivers us support for a large number of new file formats. The current list of formats supported via Tika is:

  • Audio – wav, riff, midi
  • DWG (CAD files)
  • Epub
  • RSS and ATOM feeds
  • TrueType fonts
  • HTML
  • Image – JPEG, PNG, Gif, TIFF and Bitmap (including EXIF information where found)
  • iWork (Keynote, Pages etc)
  • Mbox mail
  • Microsoft Office – Word, PowerPoint, Excel, Visio, Outlook, Publisher, Works
  • Microsoft Office OOXML – Word (docx), PowerPoint (pptx), Excel (xlsx)
  • MP3 (ID3 v1 and v2)
  • CDF (scientific data)
  • Open Document Format
  • PDF
  • Zip and Tar archives
  • RDF
  • Plain Text
  • FLV Video
  • XML
  • Java class files

What’s more, generally it’s just a case of dropping new Tika jars into Alfresco with little/no configuration changes, so we can look forward to easy addition of new formats with each new Alfresco release as the Tika support grows!

In Part 3, we will look at mapping between Tika’s common metadata and the Alfresco content model.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.

Adding new Apache Tika Plugins to Alfresco (Part 3)

This post was originally written in October 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/10/27/adding-new-apache-tika-plugins-to-alfresco-part-3/ - I have reposted it here as that blog is no longer online.

In Part 1 we saw what Apache Tika is and does, and in Part 2 we saw what it has brought to Alfresco. Now it’s time to look at adding new Tika Parsers, to support new file formats.

Firstly, why might you want to add a new parser? The most common reason is licensing – all the parsers that ship as standard with Apache Tika are Apache Licensed or similar, along with their dependencies, and so can be freely distributed and included in other projects. However, some file formats only have libraries that are available under GPL or proprietary licenses, and so these can’t be included in the standard Tika distribution.

There is a list of available 3rd party parsers on the Tika 3rd Party Plugins wiki page, currently made up of GPL licensed parsers + dependencies. If your format isn’t listed there, and you want to add it to Tika within Alfresco, then what to do?

Firstly, you need to write / acquire a Tika Parser. Writing a Tika Parser is quite easy, as the 5 minute parser guide explains. There are basically two methods to implement:

Set<MediaType> getSupportedTypes(ParseContext context);
void parse(InputStream stream, ContentHandler handler, Metadata metadata,
           ParseContext context) throws IOException, SAXException, TikaException;

The first allows you to indicate the file types your parser can handle. This is needed when registering the parser with the AutoDetectParser and similar, but isn’t needed if you select the parser explicitly. The second method is the one where you do the real work of outputting the contents and populating the metadata object.

To see this in action, let’s take a look at a simple “Hello World” Tika Parser:

package example;

import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser implements Parser {
  public Set<MediaType> getSupportedTypes(ParseContext context) {
    // We only claim to handle our special fake mimetype
    Set<MediaType> types = new HashSet<MediaType>();
    types.add(MediaType.parse("hello/world"));
    return types;
  }
  public void parse(InputStream stream, ContentHandler handler,
         Metadata metadata, ParseContext context) throws SAXException {
    // Output the (fixed) content as XHTML SAX events
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.startElement("h1");
    xhtml.characters("Hello, World!");
    xhtml.endElement("h1");
    xhtml.endDocument();

    // Populate both standard and custom metadata keys
    metadata.set("hello", "world");
    metadata.set("title", "Hello World!");
    metadata.set("custom1", "Hello, Custom Metadata 1!");
    metadata.set("custom2", "Hello, Custom Metadata 2!");
  }
}

Before we can use this in Alfresco, we need to compile it against tika-core.jar (note – you may need to implement the parse method without a ParseContext object if you’re using an older version of Tika), and then wrap our class file up in a jar. Once our jar is deployed into our application container (eg the shared lib of Tomcat), we’re ready to configure it.

For 3rd party parsers which provide the Tika service metadata files, if we don’t want to control the registration in Alfresco then we can simply allow the default Tika-Auto metadata and transformer classes to handle it. In our case, we want to register it explicitly. To do that, we’ll create a new extension spring context file, and populate it:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <bean id="extracter.MyCustomTika"
          class="org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter"
          parent="baseMetadataExtracter" >
        <!-- This is the name of our example parser compiled above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>

         <!-- Use the default mappings from TikaSpringConfiguredMetadataExtracter.properties -->
        <property name="inheritDefaultMapping">
            <value>true</value>
        </property>
        <!-- Map our extra keys to the content model -->
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
                <prop key="custom1">cm:description</prop>
                <prop key="custom2">cm:author</prop>
            </props>
        </property>
    </bean>

    <bean id="transformer.MyCustomTika"
          class="org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer"
          parent="baseContentTransformer">
        <!-- Same as above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>
    </bean>
</beans>

To test this, we’ll need a node with the special fake mimetype of “hello/world”, which is what our Tika Parser is configured to handle. We can do that with a snippet of JavaScript like this:

var doc = userhome.createFile("hello.world");
doc.content = "This text will largely be ignored";
doc.mimetype = "hello/world";

If we run the above JavaScript, we’ll get a node called “hello.world”. If we run the “extract common metadata fields” action on it, we’ll then see the metadata properties showing through. Then, if we transform it to text/html, we see a heading of “Hello, World!”. Thus we have verified that our custom Tika parser has been wired into Alfresco, is available for text transformation, and can do metadata extraction including custom keys.

More information on Tika and Alfresco is available on the Alfresco wiki.

Spring in a QName

This post was originally written in September 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/09/28/spring-in-a-qname/ - I have reposted it here as that blog is no longer online.

Within Alfresco, we make a lot of use of Qualified Names (QNames) for addressing and naming things. Generally, when configuring Alfresco through Spring or properties files, we can use the short form, eg

<bean id="coreBean" class="org.alfresco.some.thing.core">
  <property name="typeQName">
    <value>cm:description</value>
  </property>
</bean>

Within the bean, the NamespaceResolver is used to turn the friendly, short form (eg cm:description) into the full form (eg {http://www.alfresco.org/model/content/1.0}description).
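
For reference, the equivalent resolution in Java is a single call (a sketch, assuming you have a NamespacePrefixResolver, such as the repository’s namespace service, to hand):

import org.alfresco.service.namespace.NamespacePrefixResolver;
import org.alfresco.service.namespace.QName;

public class QNameExample {
  public static QName descriptionQName(NamespacePrefixResolver resolver) {
    // Expands the cm: prefix via the resolver, giving
    // {http://www.alfresco.org/model/content/1.0}description
    return QName.createQName("cm:description", resolver);
  }
}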

However, every so often you may find yourself trying to configure something with Spring that no-one ever expected you to be trying to do… In this situation, the string form isn’t accepted by the class, and only a real QName object may be sprung in.

As it turns out, creating a real QName object from within Spring isn’t actually too hard to do. So, in case you ever find yourself needing to do it, the definition will look something like this:

<bean id="coreBean" class="org.alfresco.some.thing.core">
  <property name="typeQName">
    <value>
      <bean class="org.alfresco.service.namespace.QName"
              factory-method="createQName">
        <constructor-arg value="http://www.alfresco.org/model/content/1.0" />
        <constructor-arg value="description"/>
      </bean>
    </value>
  </property>
</bean>

Overriding built-in Java backed WebScripts in Alfresco

Generally, Alfresco makes it fairly easy for you to override the built-in core repository Services. There are a couple of wiki pages to get you started, then once you know about Spring you can largely make it happen.

The Alfresco spring context file load order for services etc is controlled by application-context.xml. The last bean defined with a given name wins, so if Alfresco ships with a bean definition for fooService and you define your own one later with the same name, yours will be used. Handily, as shown in that context file, spring context files in /extensions/ and /modules/ are loaded late, so your modules can include override services.


The picture is not quite so rosy for overriding the built-in webscripts. Here, the key context file is web-application-context.xml. This loads application-context.xml which we saw above, and then it loads the web scripts. This means that the built-in webscripts are always loaded after modules, so your module can never override a built-in webscript.

However, at the bottom of that file we see the answer - we need to provide a single alfresco/extension/custom-web-context.xml which pulls in our module webscripts. That gets loaded last, so will win!

For my current project, we have added the following file as alfresco/extension/custom-web-context.xml:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <import resource="classpath*:alfresco/module/*/module-webscripts-context.xml" />
</beans>

All of our Repository modules already had a module-context.xml file in alfresco/module/<name>/. Those that need to override webscripts can now also provide a second file in that directory, module-webscripts-context.xml, and any webscripts defined in there are guaranteed to win and be used.

(The new custom-web-context.xml file can be bundled in a jar, shipped in an existing AMP, packaged as a brand new AMP, or dropped in your classes override directory similar to your repository properties file - it all depends what'll be cleaner for your setup)

Overriding Alfresco webscript library ftl files

On the whole, Alfresco is pretty good for customising and overriding, and there's been a lot of work done in 4.2 to make it even easier to customise Share and Surf. Overriding the spring beans for Java backed webscripts can be done with a bit of fiddling. One problem we have hit though is with overriding FTL files, especially library ones.

I'll use the Workflow REST API as an example here, as it's one we needed to change, but it applies to a lot of the webscripts too. A common JSON-emitting Alfresco REST API has an FTL file somewhat like:

<#import "workflow.lib.ftl" as workflowLib />
{
   "data":
   <@workflowLib.taskJSON task=workflowTask detailed=true/>
}

It imports a library file, then calls one of the functions in there. If you want to inject a few extra bits into the JSON structure, then your only option is to edit the library file, or clone it. For simple libraries with a single macro, you could just clone the whole thing and customise, but for a library with many macros that gets rather nasty, especially when upgrading. FreeMarker macros get replaced when a new one is defined with the same name, much like Spring beans, so for our hypothetical macro file:

<#macro taskJSON task detailed=false>
<#escape x as jsonUtils.encodeJSONString(x)>
      {
         "id": "${task.id}",
          ....
      }
</#escape>
</#macro>

<#macro propertiesJSON properties>
<#escape x as jsonUtils.encodeJSONString(x)>
{
<#list properties?keys as key>
   "${key}":
   ...
</#list>
}
</#escape>
</#macro>

To change taskJSON to inject some extra bits from the model, we'd have to copy the whole file and customise. However, there is a way that makes things a little less evil, which relies on macros overwriting each other.

We take the original file, and copy it to the same directory structure in our module, but with a new name. Instead of workflow.lib.ftl we might instead go for workflow-original.lib.ftl. Next, add a new comment to the top, documenting where it originally came from, what version it is taken from etc.

Now, create a new workflow.lib.ftl, and copy into that the whole of the macro we want to override. Add in the extra few JSON clauses. At the top of the file, add some documentation saying what version you copied and pasted from, what file it came from, and what changes you made. Trust me, you'll be thankful you did this when you come to upgrade...

Finally, at the top of the file, before your customised macro, add in an include (not import, include) of the renamed original. Your new lib, with the name of the old one, now looks something like:

<#-- Customised File -->
<#-- Original: remote-api/config/alfresco/templates/webscripts/org/alfresco/repository/workflow/workflow.lib.ftl -->
<#-- Alfresco Version: 4.1.1.3 -->
<#-- Modification: Added custom1 and custom2 properties to the JSON -->

<#-- This pulls in all the macros we haven't needed to change -->
<#include "workflow-original.lib.ftl">

<#-- Override and customise the one we do need to change -->
<#macro taskJSON task detailed=false>
<#escape x as jsonUtils.encodeJSONString(x)>
      {
         "id": "${task.id}",
          ....
         "myCustom1": ${task.custom1},
         "myCustom1": ${task.custom2}
      }
</#escape>
</#macro>

(It's probably best to put your custom JSON at the start or end, to make upgrading easier.)