Text analytics in BlueMix using UIMA

In this post, I want to explain how to create a text analytics application in BlueMix using UIMA, and share sample code to show how to get started.

First, some background if you’re unfamiliar with the jargon.

What is UIMA?

UIMA (Unstructured Information Management Architecture) is an Apache framework for building analytics applications for unstructured information and the OASIS standard for content analytics.

I’ve written about it before, having used it on a few projects when I was in ETS, and on side projects since, such as building a conversational interface to web pages.

It’s perhaps better known for providing the architecture for the question answering system IBM Watson.

What is BlueMix?

BlueMix is IBM’s new Platform-as-a-Service (PaaS) offering, built on top of Cloud Foundry to provide a cloud development platform.

It’s in open beta at the moment, so you can sign up and have a play.

I’ve never used BlueMix before, or Cloud Foundry for that matter, so this was a chance for me to write my first app for it.

A UIMA “Hello World” for BlueMix

I’ve written a small sample to show how UIMA and BlueMix can work together. It provides a REST API that you can submit text to, and get back a JSON response with some attributes found in the text (long words, capitalised words, and strings that look like email addresses).

The “analytics” that the app is doing is trivial at best, but this is just a Hello World. For now my aim isn’t to produce a useful analytics solution, but to walk through the configuration needed to define a UIMA analytics pipeline, wrap it in a REST API using Wink, and deploy it as a BlueMix application.

When I get a chance, I’ll write a follow-up post on making something more useful.

You can try out the sample on BlueMix, as it’s deployed to bluemix.net.

The source is on GitHub at github.com/dalelane/bluemixuima.

In the rest of this post, I’ll walk through some of the implementation details.

Runtimes and services

Creating an application in BlueMix is already well documented so I won’t reiterate those steps, other than to say that as Apache UIMA is a Java SDK and framework, I use the Liberty for Java runtime.

I’m not using any of the services in this simple sample.

Manifest

The app is bundled up in a war file, which is what we deploy. This is specified in manifest.yml.

Building

The war file is built by an Ant task which has to include the UIMA jar in the classpath, and copy my UIMA descriptor XML files into the war.

I’m developing in Eclipse, so I set up an Ant builder to run the build, and configured the project to do it automatically.

I’m deploying from Eclipse, too, using the Cloud Foundry plugins for Eclipse.

XML descriptors

The type system is defined in an XML descriptor file and specifies the different annotations that can be created by this pipeline, and the attributes that they have.

Running JCasGen in Eclipse on that descriptor generates Java classes representing those types.

The pipeline is also defined in XML descriptors: one overall aggregate descriptor which imports three primitive descriptors, one for each of the three annotators in my sample pipeline: one to find email addresses, one to find capitalised words, and one to find long words.

Note that the imports in the aggregate descriptor need to be relative so that they keep working once you deploy to BlueMix.

These XML descriptor files are all added to the war file by being included in the build.xml with a fileset include.

Annotators

Each of the primitive descriptor files specifies the fully qualified class name for the Java implementation of the annotator.

There are three annotators in this sample (their descriptors are the XML files with names starting “primitiveAeDescriptor”).

Each one is implemented by a Java class that extends JCasAnnotator_ImplBase.

Each uses a regular expression to find things to annotate in the text. This isn’t intended to be an indication that this is how things should be done, just that it makes for a simple and stateless demonstration without any additional dependencies.

The simplest is the regex used to find capitalised words in WordCaseAnnotator and the most complex is the ridiculously painful one used to find email addresses in EmailAnnotator.

Note that the regexes are prepared in the annotator initializer, and reused for each new CAS to process, to improve performance.
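
As a rough illustration of that compile-once pattern, here is a minimal sketch using only the JDK. It is not the code from the repo: the class name LongWordFinderSketch, the MIN_LENGTH threshold and the regex itself are all assumptions made for the example, and a real annotator would extend JCasAnnotator_ImplBase and add annotations to the CAS rather than return strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK stand-in for an annotator: no UIMA types are used here.
class LongWordFinderSketch {

    // Hypothetical threshold; the real sample may define "long" differently.
    static final int MIN_LENGTH = 8;

    // Compiled once and reused. Pattern instances are immutable and thread-safe.
    private final Pattern longWord;

    // Plays the role of the annotator's initialisation step.
    LongWordFinderSketch() {
        longWord = Pattern.compile("\\b\\p{L}{" + MIN_LENGTH + ",}\\b");
    }

    // Plays the role of processing one document: a new Matcher each time,
    // but always the same precompiled Pattern.
    List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = longWord.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

The point of the split is that Pattern.compile is the expensive part, while creating a Matcher per document is cheap, so the compile happens once no matter how many documents pass through.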

UIMA pipeline

The UIMA pipeline is defined in a single Java class.

It finds the XML descriptor for the pipeline by looking in the location where BlueMix will unpack the war.

It creates a CAS pool to make it easier to handle multiple concurrent requests, and avoid the overhead of creating a CAS for every request.
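
To show the idea behind pooling, here is a minimal JDK-only sketch of a fixed-size pool. UIMA provides its own CasPool utility class for this purpose, so treat the SimplePoolSketch class below (a made-up name, with made-up borrow and release methods) as an illustration of the concept rather than what the sample actually does.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Generic fixed-size pool. In the real app the pooled objects would be CAS instances.
class SimplePoolSketch<T> {

    private final BlockingQueue<T> idle;

    // Create every instance up front, so requests never pay the creation cost.
    SimplePoolSketch(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until an instance is free, which also caps concurrent use at 'size'.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
    }

    // Hands the instance back for the next request to use.
    void release(T item) {
        idle.add(item);
    }

    // Number of instances currently waiting to be borrowed.
    int available() {
        return idle.size();
    }
}
```

A blocking queue gives two things at once: instances are reused instead of rebuilt, and a burst of requests simply waits for a free instance instead of allocating without limit.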

Once the pipeline is initialised, it is ready to handle incoming analysis requests.

Once the CAS has passed through the pipeline, the annotations are immediately copied out of the CAS into a POJO, so that the CAS can be returned to the pool.
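
The copy-out step can be sketched like this, again with only the JDK. The Workspace class is a hypothetical stand-in for a CAS, Found is a hypothetical result POJO, and the capitalised-word regex is an assumption standing in for the real pipeline. What matters is the shape: copy the results into objects that do not reference the workspace, then reset it in a finally block so it is always safe to reuse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CopyOutSketch {

    // Immutable result that stays valid after the workspace has been reused.
    static final class Found {
        final String type;
        final String text;

        Found(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Stand-in for a CAS: a reusable, mutable holder for a document
    // and the {begin, end} offsets of whatever was found in it.
    static final class Workspace {
        String document = "";
        final List<int[]> spans = new ArrayList<>();

        void reset() {
            document = "";
            spans.clear();
        }
    }

    // Assumed regex, standing in for the real annotators.
    private static final Pattern CAPITALISED = Pattern.compile("\\b\\p{Lu}\\p{L}*\\b");

    static List<Found> analyse(Workspace ws, String text) {
        try {
            ws.document = text;

            // Stand-in for running the pipeline: record offsets in the workspace.
            Matcher m = CAPITALISED.matcher(text);
            while (m.find()) {
                ws.spans.add(new int[] { m.start(), m.end() });
            }

            // Copy out: each result holds its own String, not a view of the workspace.
            List<Found> results = new ArrayList<>();
            for (int[] span : ws.spans) {
                results.add(new Found("capitalised", ws.document.substring(span[0], span[1])));
            }
            return results;
        } finally {
            // Always clear the workspace, even if analysis threw, so it can be reused.
            ws.reset();
        }
    }
}
```

With a real pool, the finally block is also where the CAS would be released back to the pool, which keeps each CAS checked out for as short a time as possible.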

REST API

The war file deployed to BlueMix contains a web.xml which specifies the servlet that implements the REST API.

I’m using Wink to implement the API. The servlet definition in the web.xml specifies where to find the list of API endpoints and the URL where the API should be.

The list of API endpoints is a list of classes that Wink uses. There is only one API endpoint, so only one class is listed.

The API implementation is a very thin wrapper around the Pipeline class.

Everything is defined using annotations, and Wink handles turning the response into a JSON payload.

That’s it

I think that’s pretty much it.

I’ve added a simple front-end webpage, with a script to submit API requests for people who don’t want to do it with something like curl.

It’s live at uimahelloworld.mybluemix.net.

Like I said, it’s very simple. The Java itself isn’t particularly complex. My reason for sharing it was to provide a boilerplate config for defining a UIMA analytics pipeline, wrapping it in a REST API, and deploying it to BlueMix.

Once you’ve got that working, you can do text analytics in BlueMix as complex as whatever you can dream up for your annotators.

When I get time, I’ll write a follow-up post sharing what that could look like.