Adding new Apache Tika Plugins to Alfresco (Part 3)
Nick, Jan. 30th, 2013

This post was originally written in October 2010, and was available at http://blogs.alfresco.com/wp/nickb/2010/10/27/adding-new-apache-tika-plugins-to-alfresco-part-3/ - I have reposted it here as that blog is no longer online.

In Part 1 we saw what Apache Tika is and does, and in Part 2 we saw what it has brought to Alfresco. Now it’s time to look at adding new Tika Parsers, to support new file formats.

Firstly, why might you want to add a new parser? The most common reason is licensing – all the parsers that ship as standard with Apache Tika are Apache Licensed or similar, along with their dependencies, and so can be freely distributed and included in other projects. However, some file formats only have libraries that are available under GPL or proprietary licenses, and so these can’t be included in the standard Tika distribution.

There is a list of available 3rd party parsers on the Tika 3rd Party Plugins wiki page, currently made up of GPL licensed parsers and their dependencies. If your format isn't listed there, and you want to add it to Tika within Alfresco, what should you do?

Firstly, you need to write / acquire a Tika Parser. Writing a Tika Parser is quite easy, as the 5 minute parser guide explains. There are basically two methods to implement:

Set<MediaType> getSupportedTypes(ParseContext context);
void parse(InputStream stream, ContentHandler handler, Metadata metadata,
           ParseContext context) throws IOException, SAXException, TikaException;

The first allows you to indicate the file types your parser can handle. This is needed when registering the parser with the AutoDetectParser and similar, but isn’t needed if you select the parser explicitly. The second method is the one where you do the real work of outputting the contents and populating the metadata object.

To see this in action, let’s take a look at a simple “Hello World” Tika Parser:

package example;

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser implements Parser {
  public Set<MediaType> getSupportedTypes(ParseContext context) {
    // We handle just the one (made up) mime type
    Set<MediaType> types = new HashSet<MediaType>();
    types.add(MediaType.parse("hello/world"));
    return Collections.unmodifiableSet(types);
  }

  public void parse(InputStream stream, ContentHandler handler,
         Metadata metadata, ParseContext context)
         throws IOException, SAXException, TikaException {
    // Output a fixed XHTML body, ignoring the actual stream contents
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.startElement("h1");
    xhtml.characters("Hello, World!");
    xhtml.endElement("h1");
    xhtml.endDocument();

    // Populate some fixed metadata, including two custom keys
    metadata.set("hello", "world");
    metadata.set("title", "Hello World!");
    metadata.set("custom1", "Hello, Custom Metadata 1!");
    metadata.set("custom2", "Hello, Custom Metadata 2!");
  }
}
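Before wiring the parser into Alfresco, you can check it behaves as expected from a small standalone program. This is a sketch: the class name `HelloWorldParserCheck` is made up for illustration, and it assumes tika-core.jar is on the classpath:

```java
package example;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class HelloWorldParserCheck {
  public static void main(String[] args) throws Exception {
    HelloWorldParser parser = new HelloWorldParser();
    Metadata metadata = new Metadata();
    // The parser ignores the stream contents, so anything will do
    InputStream stream = new ByteArrayInputStream("ignored".getBytes("UTF-8"));

    // BodyContentHandler captures the plain text of the XHTML body
    BodyContentHandler handler = new BodyContentHandler();
    parser.parse(stream, handler, metadata, new ParseContext());

    System.out.println(handler.toString());      // the "Hello, World!" heading text
    System.out.println(metadata.get("custom1")); // one of our custom keys
  }
}
```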

Before we can use this in Alfresco, we need to compile it against tika-core.jar (note: you may need to implement the parse method without a ParseContext argument if you're using an older version of Tika), and then wrap our class file up in a jar. Once that jar is deployed into our application container (e.g. the shared lib of Tomcat), we're ready to configure it.
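The compile-and-package step might look something like this (a sketch only: the tika-core version and the Tomcat paths shown are examples, and will vary with your installation):

```shell
# Compile against tika-core (match the version your Alfresco ships with)
javac -cp tika-core-0.8.jar example/HelloWorldParser.java

# Wrap the class file up in a jar
jar cf hello-world-parser.jar example/HelloWorldParser.class

# Deploy into the application container, e.g. Tomcat's shared lib
cp hello-world-parser.jar $TOMCAT_HOME/shared/lib/
```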

For 3rd party parsers that provide the Tika service metadata files, if we don't want to control the registration in Alfresco ourselves, we can simply let the default Tika-Auto metadata extractor and transformer classes handle them. In our case, though, we want to register the parser explicitly. To do that, we'll create a new extension Spring context file and populate it:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <bean id="extracter.MyCustomTika"
          class="org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter"
          parent="baseMetadataExtracter" >
        <!-- This is the name of our example parser compiled above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>

         <!-- Use the default mappings from TikaSpringConfiguredMetadataExtracter.properties -->
        <property name="inheritDefaultMapping">
            <value>true</value>
        </property>
        <!-- Map our extra keys to the content model -->
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
                <prop key="custom1">cm:description</prop>
                <prop key="custom2">cm:author</prop>
            </props>
        </property>
    </bean>

    <bean id="transformer.MyCustomTika"
          class="org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer"
          parent="baseContentTransformer">
        <!-- Same as above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>
    </bean>
</beans>
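As an aside, the service-file registration mentioned above works by Tika scanning the classpath for a services file; for our parser that would be a file inside the jar named META-INF/services/org.apache.tika.parser.Parser, containing the fully qualified class name:

```
example.HelloWorldParser
```

With that file present, the Tika-Auto extractor and transformer would pick the parser up without any Spring configuration.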

To test this, we’ll need a node with the special fake mimetype of “hello/world”, which is what our Tika Parser is configured to handle. We can do that with a snippet of JavaScript like this:

var doc = userhome.createFile("hello.world");
doc.content = "This text will largely be ignored";
doc.mimetype = "hello/world";

If we run the above JavaScript, we'll get a node called "hello.world". If we run the "extract common metadata fields" action on it, we'll see the metadata properties showing through. If we then transform it to text/html, we see a heading of "Hello, World!". Thus we have verified that our custom Tika parser has been wired into Alfresco, is available for text transformation, and can do metadata extraction, including custom keys.
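Those verification steps can themselves be scripted, e.g. from a web script or the JavaScript console. A sketch, assuming `doc` is the node created above and the metadata extract action has already run; `transformDocument` returns the transformed copy as a new node:

```javascript
// Transform our hello/world node to HTML; the result is a new node
var html = doc.transformDocument("text/html");
logger.log(html.content); // should contain the "Hello, World!" heading

// The mapped custom keys show up as content model properties
logger.log(doc.properties["cm:description"]); // mapped from "custom1"
logger.log(doc.properties["cm:author"]);      // mapped from "custom2"
```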

More information on Tika and Alfresco is available on the Alfresco wiki.
