Text analytics in BlueMix using UIMA

In this post, I want to explain how to create a text analytics application in BlueMix using UIMA, and share sample code to show how to get started.

First, some background if you’re unfamiliar with the jargon.

What is UIMA?

UIMA (Unstructured Information Management Architecture) is an Apache framework for building analytics applications for unstructured information and the OASIS standard for content analytics.

I’ve written about it before, having used it on a few projects when I was in ETS, and on side projects since, such as building a conversational interface to web pages.

It’s perhaps better known for providing the architecture for the question answering system IBM Watson.

What is BlueMix?

BlueMix is IBM’s new Platform-as-a-Service (PaaS) offering, built on top of Cloud Foundry to provide a cloud development platform.

It’s in open beta at the moment, so you can sign up and have a play.

I’ve never used BlueMix before, or Cloud Foundry for that matter, so this was a chance for me to write my first app for it.

A UIMA “Hello World” for BlueMix

I’ve written a small sample to show how UIMA and BlueMix can work together. It provides a REST API that you can submit text to, and get back a JSON response with some attributes found in the text (long words, capitalised words, and strings that look like email addresses).

The “analytics” that the app is doing is trivial at best, but this is just a Hello World. For now my aim isn’t to produce a useful analytics solution, but to walk through the configuration needed to define a UIMA analytics pipeline, wrap it in a REST API using Wink, and deploy it as a BlueMix application.

When I get a chance, I’ll write a follow-up post on making something more useful.

You can try out the sample, as it’s deployed to BlueMix at bluemix.net.

The source is on GitHub at github.com/dalelane/bluemixuima.

In the rest of this post, I’ll walk through some of the implementation details.

Runtimes and services

Creating an application in BlueMix is already well documented, so I won’t repeat those steps. The only thing to note is that, as Apache UIMA is a Java SDK and framework, I’m using the Liberty for Java runtime.

I’m not using any of the services in this simple sample.

Manifest

The app is bundled up in a war file, which is what we deploy. This is specified in manifest.yml.

Building

The war file is built by an Ant task, which has to include the UIMA jar in the classpath and copy my UIMA descriptor XML files into the war.

I’m developing in Eclipse, so I set up an Ant builder to run the build, and configured the project to do it automatically.

I’m deploying from Eclipse, too, using the Cloud Foundry plugins for Eclipse.

XML descriptors

The type system is defined in an XML descriptor file and specifies the different annotations that can be created by this pipeline, and the attributes that they have.

Running JCasGen in Eclipse on that descriptor generates Java classes representing those types.

The pipeline is also defined in XML descriptors: one overall aggregate descriptor, which imports three primitive descriptors, one for each of the three annotators in my sample pipeline. There is one annotator to find email addresses, one to find capitalised words and one to find long words.

Note that the imports in the aggregate descriptor need to be relative so that they keep working once you deploy to BlueMix.

These XML descriptor files are all added to the war file by being included in the build.xml with a fileset include.

Annotators

Each of the primitive descriptor files specifies the fully qualified class name for the Java implementation of the annotator.

There are three annotators in this sample, one for each of the XML files with names starting “primitiveAeDescriptor”.

Each one is implemented by a Java class that extends JCasAnnotator_ImplBase.

Each uses a regular expression to find things to annotate in the text. This isn’t meant to suggest that this is how things should be done; it just makes for a simple and stateless demonstration without any additional dependencies.

The simplest is the regex used to find capitalised words in WordCaseAnnotator and the most complex is the ridiculously painful one used to find email addresses in EmailAnnotator.

Note that the regexes are prepared in the annotator initializer, and reused for each new CAS to process, to improve performance.
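As a rough illustration of that "compile once, reuse for every document" idea, here is a minimal sketch in plain Java. It is not the code from the sample: the UIMA classes are left out so that it runs on its own, and the names Span and CapitalisedWordFinder, as well as the regex itself, are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A matched span of text: a plain stand-in for a UIMA annotation (hypothetical). */
final class Span {
    final int begin;   // offset of the first matched character
    final int end;     // offset just past the last matched character
    final String text; // the matched text itself

    Span(int begin, int end, String text) {
        this.begin = begin;
        this.end = end;
        this.text = text;
    }
}

/** Hypothetical finder for capitalised words; not the WordCaseAnnotator from the sample. */
final class CapitalisedWordFinder {
    // Compiled once at construction and reused for every document,
    // mirroring the "prepare in the initializer" approach described above.
    // The pattern is an assumption: an upper-case letter followed by lower-case letters.
    private final Pattern capitalised = Pattern.compile("\\b[A-Z][a-z]+\\b");

    /** Returns one Span per capitalised word found in the given text. */
    List<Span> find(String documentText) {
        List<Span> spans = new ArrayList<>();
        // A Matcher is cheap to create and is not shared between calls.
        Matcher matcher = capitalised.matcher(documentText);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end(), matcher.group()));
        }
        return spans;
    }
}
```

A compiled Pattern is safe to share between threads, while each Matcher is created per call, which is what keeps a finder like this stateless between documents.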

UIMA pipeline

The UIMA pipeline is defined in a single Java class.

It finds the XML descriptor for the pipeline by looking in the location where BlueMix will unpack the war.

It creates a CAS pool to make it easier to handle multiple concurrent requests, and avoid the overhead of creating a CAS for every request.

Once the pipeline is initialised, it is ready to handle incoming analysis requests.

Once the CAS has passed through the pipeline, the annotations are immediately copied out of the CAS into a POJO, so that the CAS can be returned to the pool.
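The borrow, copy out, and return pattern described above can be sketched with a small generic pool. This is only an illustration under stated assumptions, not the Pipeline class from the sample: SimplePool and PoolDemo are hypothetical names, and a StringBuilder stands in for the CAS so that the example has no UIMA dependency.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Minimal fixed-size pool of reusable objects; a stand-in for a CAS pool (hypothetical). */
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the creation cost once, up front
        }
    }

    /** Borrows an instance, runs the work, and always hands the instance back. */
    <R> R borrow(Function<T, R> work) {
        T instance;
        try {
            instance = idle.take(); // waits if every instance is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
        try {
            return work.apply(instance);
        } finally {
            idle.add(instance); // returned even if the work throws
        }
    }

    /** Number of instances currently sitting idle in the pool. */
    int available() {
        return idle.size();
    }
}

/** Usage sketch: the "analysis" here just reverses the text. */
final class PoolDemo {
    static String analyse(SimplePool<StringBuilder> pool, String text) {
        return pool.borrow(workspace -> {
            workspace.setLength(0);            // reset the reused workspace
            workspace.append(text).reverse();  // stand-in for running the pipeline
            return workspace.toString();       // copy the result out before the workspace is returned
        });
    }
}
```

The result is copied into an independent String before the workspace goes back to the pool, so a later request that reuses the same workspace cannot change a response that has already been built.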

REST API

The war file deployed to BlueMix contains a web.xml which specifies the servlet that implements the REST API.

I’m using Wink to implement the API. The servlet definition in the web.xml specifies where to find the list of API endpoints and the URL where the API should be.

The list of API endpoints is a list of classes that Wink uses. There is only one API endpoint, so there is only one class listed.

The API implementation is a very thin wrapper around the Pipeline class.

Everything is defined using annotations, and Wink handles turning the response into a JSON payload.

That’s it

I think that’s pretty much it.

I’ve added a simple front-end webpage, with a script to submit API requests for people who don’t want to do it with something like curl.

It’s live at uimahelloworld.mybluemix.net.

Like I said, it’s very simple. The Java itself isn’t particularly complex. My reason for sharing it was to provide a boilerplate config for defining a UIMA analytics pipeline, wrapping it in a REST API, and deploying it to BlueMix.

Once you’ve got that working, you can do text analytics in BlueMix as complex as whatever you can dream up for your annotators.

When I get time, I’ll write a follow-up post sharing what that could look like.


Conversational Internet

tl;dr

We’ve built a prototype to show how we could interact with the Internet using a command-driven approach.

  • A screen reader, but one that uses machine learning and natural language processing, in order to better understand both what the user wants to do, and what the web page says.
  • One that can offer a conversational interface instead of just reading out everything on the page.

It’s a proof-of-concept, but it’s an exciting idea with a lot of potential and we’ve got a demo that shows it in action.

The problem : screen readers today

I’ve written about this before but here is a recap.

Visually impaired people can interact with the web using screen readers. These read out every element on a page.

The user has to make a mental model of the structure of the page as it’s read out, and keep this in their head as they arrow-key around the page.

For example, on a news site’s front page, once the screen reader has read out the page, you have to remember if the story you want is the fifth or sixth story in the list so you can tab the right number of times to get to it.

Imagine an automated telephone menu:
“for blah-blah-blah, press 1, for blather-blather-blather, press 2, for something-or-other, press 3 … for something-else-vague, press 9 …”

Imagine this menu was so long it took 15 minutes or more to read.

Imagine none of the options are an exact match for what you want. But by the time you get to the end, you can’t remember whether the closest match was the third or fourth, or fiftieth option.

The vision : a Conversational Internet

Software could be smarter.

If it understood more about the web page, it could describe it at a higher, task-oriented level. It could read out the relevant bits, instead of everything.

If it understood more about what the user wants to do, the user could just say that, instead of working out the manual navigation steps themselves.

The vision is software that can interpret web pages and offer a conversational interface to web browsing.
