Speech to Text

Apologies to the tl;dr brigade, this is going to be a long one... 

For a number of years I've been quietly working away with IBM Research on our speech to text programme. That is, working with a set of algorithms that ultimately produce a system capable of listening to human speech and transcribing it into text. The concept is simple: train a system for speech to text - speech goes in, text comes out. However, the process and algorithms to do this are extremely complicated from just about every angle – computationally, mathematically, operationally, and in terms of evaluation, time and cost. This is a completely separate topic and area of research from the similar-sounding text to speech systems that take text (such as this blog) and read it aloud in a computerised voice.

Whenever I talk to people about it they always appear fascinated and want to know more. The same questions often come up, so I'm going to address some of them here in a generic way, leaving out the ones I'm unable to talk about. I should also point out that I'm by no means a speech expert or linguist, but I have developed enough of an understanding of the subject matter to be dangerous and (I hope) to explain things in a way that others not familiar with the field are able to understand. I'm deliberately not linking out to the various research topics that come into play during this post as the list would become lengthy very quickly and this isn't a formal paper after all; Internet searches are your friend if you want to know more.

I didn't know IBM did that?
OK so not strictly a question but the answer is yes, we do. We happen to be pretty good at it as well. However, we typically use a company called Nuance as our preferred partner.

People have often heard of IBM's former product in this area, ViaVoice, which was available for desktop PCs until the early 2000s. This sort of technology allowed a single user to speak to their computer for various different purposes and required the user to spend some time training the software before it would understand their particular voice. Today's speech software has progressed beyond this to systems that don't require any training by the user before they use it. Current systems are trained in advance in order to attempt to understand any voice.

What's required?
Assuming you have the appropriate software and the hardware required to run it on, you need three more things to build a speech to text system: audio, transcripts and a phonetic dictionary of pronunciations. This sounds quite simple, but when you dig under the covers a little you realise it's much more complicated (not to mention expensive) and the devil is very much in the detail.

On the audio side you'll need a set of speech recordings. If you want to evaluate your system after it has been trained then a small sample of these should be kept to one side and not used during the training process. This set of audio used for evaluation is usually termed the held out set. It's considered cheating if you later evaluate the system using audio that was included in the training process – since the system has already “heard” this audio before, it would have a higher chance of accurately reproducing it later. Creating the held out set leaves you with two sets of audio files: the held out set itself and the majority of the audio that remains, which is called the training set.
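To make that concrete, here's a minimal sketch (in Python, with invented file names) of carving out a held-out set before training begins:

import random

# A minimal sketch (file names invented) of setting aside a held-out set
# before training begins.
audio_files = ["session_%03d.wav" % i for i in range(1, 501)]

random.seed(42)              # make the split reproducible
random.shuffle(audio_files)

cut = len(audio_files) // 20           # keep back roughly 5% for evaluation
held_out = audio_files[:cut]           # never shown to the trainer
training = audio_files[cut:]           # everything else goes into training

print(len(training), "training files,", len(held_out), "held-out files")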

The audio can be in any format your training software is compatible with, but wave files are commonly used. The quality of the audio, both in terms of the digital quality (e.g. sample rate) and the quality of the speaker(s) and the equipment used for the recordings, will have a direct bearing on the resulting accuracy of the system being trained. Simply put, the better quality you can make the input, the more accurate the output will be. This leads to another bunch of questions, such as (but not limited to) “What quality is optimal?”, “What should I get the speakers to say?” and “How should I capture the recordings?” – all of which are research topics in their own right and for which there is no one-size-fits-all answer.

Capturing the audio is one half of the battle. The next piece in the puzzle is obtaining well transcribed textual copies of that audio. The transcripts should consist of a set of text representing what was said in the audio, as well as some sort of indication of when during the audio a speaker starts speaking and when they stop. This is usually done on a sentence-by-sentence basis, or rather for each utterance, as these segments are known. The transcripts may have a certain amount of subjectivity associated with them, in terms of where the sentence boundaries are and potentially exactly what was said if the audio wasn't clear or slang terms were used. They can be formatted in a variety of different ways and there are various standard formats for this purpose, ranging from an XML DTD through to CSV.
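As an illustration only (real formats vary and this layout is invented), an utterance-level transcript in CSV form might carry a start time, end time, speaker label and the text for each utterance:

import csv
import io

# An invented utterance-level transcript in a simple CSV layout:
# start time (s), end time (s), speaker label, text.
sample = """start,end,speaker,text
0.00,3.42,spk1,good morning and welcome to the programme
3.42,7.91,spk2,thanks very much it's great to be here
"""

for row in csv.DictReader(io.StringIO(sample)):
    duration = float(row["end"]) - float(row["start"])
    print(f'{row["speaker"]} ({duration:.2f}s): {row["text"]}')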

If it has not already become clear, creating the transcription files can be quite a skilled and time consuming job. A typical industry expectation is that it takes approximately 10 man-hours for a skilled transcriber to produce a well formatted transcription of 1 hour of audio. This time, plus the cost of collecting the audio in the first place, is one of the factors making speech to text a long, hard and expensive process. This is particularly the case when you consider that most current commercial speech systems are trained on at least 2000+ hours of audio, with the minimum recommended amount being somewhere in the region of 500+ hours.

Finally, a phonetic dictionary must either be obtained or produced that contains at least one pronunciation variant for each word said across the entire corpus of audio input. Even for a minimal system this will run into tens of thousands of words. There are, of course, phonetic dictionaries already available, such as the Oxford English Dictionary, which gives a pronunciation for each word it contains. However, such a dictionary would only be appropriate for one regional accent or dialect without variation. Hence, producing the dictionary can also be a long and skilled manual task.
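To give a flavour, pronunciation dictionaries in the style of the freely available CMU dictionary map each word to one or more phone sequences; the entries below are illustrative rather than copied from any real dictionary:

# A sketch of a pronunciation dictionary: each word maps to one or more
# phone sequences. The entries below are illustrative only.
pronunciations = {
    "tomato": [["T", "AH", "M", "EY", "T", "OW"],   # one regional variant
               ["T", "AH", "M", "AA", "T", "OW"]],  # another
    "hello":  [["HH", "AH", "L", "OW"],
               ["HH", "EH", "L", "OW"]],
}

def phones_for(word):
    """Return every pronunciation variant recorded for a word."""
    return pronunciations.get(word.lower(), [])

for variant in phones_for("tomato"):
    print(" ".join(variant))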

What does the software do?
The simple answer is that it takes audio and transcript files and passes them through a set of really rather complicated mathematical algorithms to produce a model that is particular to the input received. This is the training process. Once the system has been trained, the model it generates can be used to take speech input and produce text output. This is the decoding process. The training process requires lots of data and is computationally expensive, but the model it produces is very small and computationally much less expensive to run. Today's models are typically able to perform real-time (or faster) speech to text conversion on a single core of a modern CPU. It is the model, and the software surrounding it, that is exposed to users of the system.

Various different steps are used during the training process to iterate through the different modelling techniques across the entire set of training audio provided to the trainer. When the process first starts, the software knows nothing of the audio; there are no clever bootstrapping techniques used to kick-start the system in a certain direction or pre-load it in any way. This allows the software to be entirely generic and to work for all sorts of different languages and qualities of material. Starting in this way is known as a flat start or context independent training. The software simply chops up the audio into regular segments to start with and then performs several iterations where these boundaries are shifted slightly to match the boundaries of the speech in the audio more closely.

The next phase is context dependent training. This phase starts to make the model a little more specific and tailored to the input being given to the trainer. The pronunciation dictionary is used to refine the model to produce an initial system that could be used to decode speech into text in its own right at this early stage. Typically, context dependent training, while an iterative process in itself, can also be run multiple times in order to hone the model still further.

Another optimisation that can be made to the model after context dependent training is to apply vocal tract length normalisation. This works on the theory that the audibility of human speech correlates to the pitch of the voice, and the pitch of the voice correlates to the vocal tract length of the speaker. Put simply, it's a theory that says men have low voices and women have high voices and if we normalise the wave form for all voices in the training material to have the same pitch (i.e. same vocal tract length) then audibility improves. To do this an estimation of the vocal tract length must first be made for each speaker in the training data such that a normalisation factor can be applied to that material and the model updated to reflect the change.
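In its simplest, linear form the normalisation amounts to scaling each speaker's frequency axis by a warp factor; the toy sketch below (made-up filterbank centres) is only meant to illustrate the idea, not how a real trainer estimates the factor:

import numpy as np

# A toy illustration only: linear vocal tract length normalisation scales
# each speaker's frequency axis by a per-speaker warp factor (typically
# somewhere around 0.8-1.2). Real trainers estimate the factor by searching
# for the value that best fits the current model.
def warp_frequencies(centre_freqs_hz, alpha):
    """Linearly warp a set of filterbank centre frequencies by factor alpha."""
    return np.asarray(centre_freqs_hz) * alpha

centre_freqs = np.linspace(100, 4000, 10)     # made-up filterbank centres
print(warp_frequencies(centre_freqs, 0.9))    # one speaker's warp factor
print(warp_frequencies(centre_freqs, 1.1))    # another speaker's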

The model can be thought of as a tree, although it's actually a large multi-dimensional matrix. By reducing the number of dimensions in the matrix and applying various other mathematical operations to reduce the search space, the model can be further improved in terms of accuracy, speed and size. This is generally done after vocal tract length normalisation has taken place.

Another tweak that can be made to improve the model is to apply what we call discriminative training. For this step, all of the training material is decoded using the current best model produced from the previous step. This produces a set of text files. These text files can be compared with those produced by the human transcribers and given to the system as training material. The comparison can be used to inform where the model can be improved, and these improvements are then applied to the model. It's a step that can probably best be summarised as learning from its mistakes – clever!

Finally, once the model has been completed it can be used with a decoder that knows how to understand that model to produce text given an audio input. In reality, decoders tend to operate on two different models: the audio model, whose creation has just been roughly explained, and a language model. The language model is simply a description of how language is used in the specific context of the training material. It would, for example, attempt to provide insight into which words typically follow which other words, via the use of what natural language processing experts call n-grams. Obtaining information to produce the language model is much easier and does not necessarily have to come entirely from the transcripts used during the training process. Any text data that is considered representative of the speech being decoded could be useful. For example, in an application targeted at decoding BBC newsreaders, articles from the BBC News website would likely prove a useful addition to the language model.
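To make the n-gram idea concrete, here's a toy sketch (invented sentences) of counting which words follow which:

from collections import Counter

# A toy sketch of the n-gram idea: count which words tend to follow which
# other words in text that is representative of the speech being decoded.
# The sentences here are invented.
corpus = [
    "the cat sat on the mat",
    "the cat slept on the sofa",
]

bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    for first, second in zip(words, words[1:]):
        bigrams[(first, second)] += 1

# "the" is followed by "cat" twice but by "mat" and "sofa" only once each,
# so a decoder hearing something ambiguous after "the" can weight "cat" higher.
for (first, second), count in bigrams.most_common(5):
    print(first, second, count)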

How accurate is it?
This is probably the most common question about these systems and one of the most complex to answer. As with most things in the world of high technology it's not simple, so the answer is the infamous “it depends”. The short answer is that in ideal circumstances the software can perform at near-human levels of accuracy, which equates to accuracy in excess of 90%. Pretty good you'd think. It has been shown that human performance is somewhere in excess of 90% and is almost never 100%. The test for this is quite simple: you get two (or more) people to independently transcribe some speech and compare the results from each transcriber; almost always there will be a disagreement about some part of the speech (if there's enough speech, that is).
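For what it's worth, accuracy in this field is normally scored as word error rate (WER), an edit distance between the reference transcript and the system's output; a rough sketch of the calculation:

# A rough sketch of word error rate: an edit distance, counted in words,
# between the reference transcript and the system output. Accuracy figures
# like "in excess of 90%" correspond loosely to a WER below 10%.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # one error in six words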

It's not often that ideal circumstances are present or can even realistically be achieved. Ideal would be transcribing a speaker with a voice and accent similar to those trained into the model, speaking at the right speed (not too fast and not too slowly), using a directional microphone that doesn't do any fancy noise cancellation, etc. What people are generally interested in is the real-world situation, something along the lines of “if I speak to my phone, will it understand me?”. This sort of real-world environment often includes background noise and a very wide variety of speakers, potentially speaking into a non-optimal recording device. Even this doesn't have a simple answer when it comes to accuracy. We're talking about free, conversational-style speech in this blog post, and there's a huge difference between recognising any and all words versus recognising a small set of command and control words for when you want your phone to perform a specific action. In conclusion then, we can only really speak about the art of the possible and what has been achieved before. If you want to know about accuracy for your particular situation and your particular voice on your particular device then you'd have to test it!

What words can it understand? What about slang?
The range of understanding of a speech to text system is dependent on the training material. At present, state of the art systems are based on dictionaries of words and don't generally attempt to recognise new words for which no entry in the dictionary can be found (although these types of systems are available separately and could be combined into a speech to text solution if necessary). So the number and range of words understood by a speech to text system is currently (and I'm generalising here) a function of the number and range of words used in the training material. It doesn't really matter what these words are, whether they're conversational and slang terms or proper dictionary terms; so long as the system was trained on them, it should be able to recognise them again during a decode.

Updates and Maintenance
The more discerning reader will have realised by now that there's a fundamental flaw in the plan laid out thus far. Language changes over time, people use new words and the meaning of words changes within the language we use. Text-speak is one of the new kids on the block in this area. It would be extremely cumbersome to need to train an entire new model each time you wished to update your previous one in order to include some set of new language capability. Fortunately, the models produced can be modified and updated with these changes without the need to go back to a standing start and train from scratch all over again. It's possible to take your existing model, built from the set of data you had available at a particular point in time, and use this to bootstrap the creation of a new model which will be enhanced with the new materials that you've gathered since training the first model. Of course, you'll want to test and compare both models to check that you have in fact enhanced performance as you were expecting. This type of maintenance and updating will be required for any and all of these types of systems as they're currently designed, as the structure and usage of our languages evolve.

Conclusion
OK, so this was never necessarily a blog post designed to draw a conclusion, but I wanted to wrap up by saying that this is an area of technology that is still very much in active research and development, and has been for at least 40-50 years or more! There's a really interesting observation I've seen in the field: if you ask a range of people involved in this topic “when will speech to text become a reality?”, the answer generally comes out as “in ten years' time”. This question has been asked consistently over time and the answer has remained the same. It seems then, that either this is a really hard nut to crack or that our expectations of such a system move on over time. Either way, it seems there will always be something new just around the corner to advance us to the next stage of speech technologies.

Going Back to University



A couple of weeks ago I had the enormous pleasure of returning to Exeter University where I studied for my degree more years ago than seems possible.  Getting involved with the uni again is something I've long wanted to do, in an attempt to give something back to the institution to which I owe so much: it's where I got good qualifications and, not least, where I met my wife too!  Early on in a career I don't think I would have been particularly useful in this role, since in age, mentality and a bunch of other factors I'm sure, I was closer to the university than to my working life.  Getting a bit older, though, makes me feel readier to provide something tangibly useful to both the university and its current students.  Having been back there recently with work, I hope it's a relationship I can start to build up.

I should probably steer clear of saying exactly why we were there, but there was a small team from work, some of whom I knew well, such as @madieq and @andysc, and one or two I hadn't come across before.  Our job was to work with some academic staff for a couple of days, so it was a bit of a departure from my normal work with corporate customers.  It was fantastic to see the university from the other side of the fence (i.e. not as a student), to hear about some of the things going on there, and to find a university every bit as vibrant and ambitious as the one I left in 2000. Of course, there was the obligatory wining and dining in the evening, which just went to make the experience all the more pleasurable.

I really hope to be able to talk a lot more about things we're doing with the university in the future.  Until then, I'm looking forward to going back a little more often and potentially imparting some words (of wisdom?) to some students too.

The Ambient Kettle

Back in 2007, my Mum and I got a pair of Internet-connected Nabaztag bunnies. Aside from all the online content we could subscribe to using the bunnies, the most fun thing for me was that we could ‘pair’ our bunnies so that they would talk to each other. If I moved the ears on my bunny, the ears on my Mum’s bunny would move to match, and vice versa. The 250 physical miles disappear for a few seconds when you see the ears move and know that it’s because Mum is physically moving the ears of her bunny. I know exactly what she’s doing at that particular point in time, as if we’re briefly in the same room. The technical term for this is, apparently, ambient awareness.

My Nabaztag bunny

The bunny ears experience of ambient awareness inspired my first (and, so far, only) Arduino project: Monitoring electricity using Christmas lights. The red/orange lights indicated the current electricity usage of my house and the blue/green lights indicated the current electricity usage of Mum and Dad’s house. The more electricity currently being used, the faster the lights flashed. Again, it was just that tiny tiny insight into what was happening 250 miles away. Just the mundanity of everyday life shared.

So I was curious about the Kickstarter project for the Good Night Lamp. The Good Night Lamp is a really nice and simple concept. One person has a Big Lamp (shaped like a  house) and they give Little Lamps, associated with the Big Lamp, to friends and/or family anywhere in the world. When the owner switches off the Big Lamp (when they go out or go to bed), the associated Little Lamps also switch off. An appealing part of it is that you can collect a Little Lamp from each of your family or group of friends and arrange them on a shelf so that before you go to bed at night, you can see each of them ‘say goodnight’ as their respective lights go out.

Good Night Lamp

The problem I see with the Good Night Lamp is similar to the one with the Nabaztag. While I think it’s great having simple devices that do just one thing well, it doesn’t half clutter up the place. These kinds of devices need shelf-space. And it has to be shelf-space you can see easily in a place you’ll often be or they don’t work. Maybe as people replace all their books with the more easily stored ebooks, living-room bookcases will become filled with ambient devices instead. I got to chatting with Ambient Orb fan Andy Stanford-Clark about it.

While my and my Mum’s Nabaztags have now died or gone into hibernation and the Christmas lights never made it as far as the tree, our more lasting providers of ambient awareness don’t even have their own physical forms. Instead, they’re software on our smartphones and tablets, devices that we have around anyway, wherever we are. In particular, SMS updates of my Mum and Dad’s Tweets.

Every morning, my Mum wakes up, has a coffee with my Dad, and reads interesting articles on her iPad. I know this from when I’ve visited them and because when she reads an interesting article, she tweets or retweets it and I receive about half-a-dozen txts in quick succession. Later in the afternoon, after they’ve got home from wherever they’ve been that day (or have found free wifi somewhere while they’re out) and are drinking another cup of coffee or tea, I receive another half-a-dozen txts pointing to interesting articles online. Just receiving the txts gives me an awareness of them waking up or sitting down to read the paper. Clicking the links to the articles gives me an insight into what they’re reading and how they’re probably feeling about the topics of the articles. The fairly mundane, everyday things that we wouldn’t remember, or bother, to talk about on the phone a week or so later.

As drinking coffee or tea seems to play a regular, if side, part in the activities I’m notified about, Andy and I came up with the idea of the Ambient Kettle. In my house, we have a whole house Current Cost monitor that is connected to a server out on the Internet. It was the feed from this server that we used in my Christmas Lights project. Since then, though, I’ve added individual appliance monitors (IAMs) to a few of the appliances around the house, including the kettle. The feeds from these IAMs also go to the server and so can be used by applications that know which data to request.
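As a hedged sketch of what such an application might look like (the broker address, topic name and payload format here are all invented; the real server and feed layout aren't described in this post):

import paho.mqtt.client as mqtt

# A sketch of an application consuming the kettle's IAM feed, assuming the
# feed is exposed over MQTT. Broker, topic and payload format are invented.
BOIL_THRESHOLD_WATTS = 2000   # a kettle draws a couple of kilowatts when on

def on_message(client, userdata, message):
    watts = float(message.payload.decode())
    if watts > BOIL_THRESHOLD_WATTS:
        print("Kettle looks like it's on:", watts, "W")

client = mqtt.Client()
client.on_message = on_message
client.connect("house.example.org", 1883)          # hypothetical broker
client.subscribe("house/currentcost/iam/kettle")   # hypothetical topic
client.loop_forever()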

So Andy hacked up a (private) Twitter account, @ambientkettle, which my Mum follows. Each time the kettle boils in my house, the @ambientkettle account tweets to my Mum:

@ambientkettle tweets

Without being physically present or explicitly letting her know that I am making a cup of tea, she can get a sense of what I’m doing. The messages in the tweets that @ambientkettle sends are pre-canned and chosen at random but made to be chatty enough that it seems a bit like the start of a conversation. Indeed, Mum sometimes tweets back to it to say that she and Dad are also having a cup of tea or are looking forward to one when they get home, or whatever. As I say, it’s mundane but it’s those kinds of mundane things that make everyday life.
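The tweeting side might look something like the sketch below; the credentials, messages and use of the tweepy library are placeholders for illustration rather than the actual hack:

import random
import tweepy   # library choice assumed for illustration; not the actual hack

# Pick one of a few canned, chatty messages and post it from the kettle's
# account whenever a boil is detected. Credentials and messages are placeholders.
CANNED_MESSAGES = [
    "Just putting the kettle on - fancy a cuppa?",
    "Tea's brewing over here!",
    "Kettle's boiling again...",
]

def tweet_kettle_boil(api):
    api.update_status(random.choice(CANNED_MESSAGES))

auth = tweepy.OAuthHandler("consumer_key", "consumer_secret")
auth.set_access_token("access_token", "access_token_secret")
tweet_kettle_boil(tweepy.API(auth))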

I’ll be interested to see how the Good Night Lamp gets taken up. It was featured in the very mainstream Daily Mail yesterday and its founding team has a good record of startups, product design, interaction design, and Internet of Things creativity. And there’s something very appealing about having ambient awareness of friends and family when we’re geographically spread apart.


Monkigras 2013: Scaling craft

The work of William Morris, my GCSE history teacher said, was a bit of a moral dilemma. Morris was a British designer born during the Industrial Revolution. British (and then world) industry was moving rapidly towards mass production by replacing traditional, cottage-industry production processes with the more efficient, and therefore more profitable, machines. One casualty of this move to mass production was decoration and quality, which lost out to function and quantity. Morris reacted against this by designing and producing decorations like wallpaper and textiles using the traditional craft techniques of skilled craftspeople. My history teacher’s point was that although Morris, a passionate socialist, was able to create high quality goods by using smaller-scale production methods, only wealthy people could afford to buy his designs, which was hardly equality in action. On the other hand, the skills of craftspeople were being retained, quality goods were being produced, and the craftspeople were getting paid for the quality of their work.

My pretty, handcrafted latte

Monkigras 2013, in London last week, took on this theme of ‘scaling craft’ in the context of beer, coffee, and software. All parts of this trinity of software development can benefit hugely from a focus on quality over quantity. Before I went to Monkigras, I wasn’t really sure what to expect from a tech event advertised as having a lot of beer. It did have a lot of beer (and coffee) available but if you didn’t want it you could avoid it (several people I talked to said they didn’t usually drink beer). And no one seemed to get ridiculously drunk. And there were a lot of very cool talks.

The beer was also a fun analogy to apply to software development. Despite pubs in the UK closing hand over fist at the moment, microbreweries are on the rise. Microbrewing is about producing beer in small quantities on a commercial basis so that quality can be maintained whilst remaining viable as a business. One of the things we learnt from a brewer at Monkigras is that the taste of water varies according to where it comes from. Water is a major component of beer, so if the taste of your water supply changes, the taste of your beer changes. To maintain the quality of the beer you brew, you must work within the natural resources available to you and not over-expand. Similarly, quality comes from skilled and knowledgeable people who need to be paid for their skill. If you take on cheaper staff and train them less so that you can make more profit, you will end up with a poorer quality product. You get the idea.

Handcrafting a wooden spoon.

This principle applies to all areas of craft. Whether it’s producing quality coffee, a quality wooden spoon, or quality conference food, or organising a quality conference, you have to focus on quality and ensure that if you scale what you do, so that it’s more readily available to more people, you don’t sacrifice quality at the same time. And, importantly, you have to know when to stop. Bigger doesn’t necessarily mean better.

Software is misleadingly easy to produce. Unlike making physical objects, there is very little initial cost to producing software; you can make copies and distribute them to customers over the Internet at very little cost. Initially, at least, it’s all in the skill of the craftspeople and their ability to identify their target users and market. If they can’t make what people will buy, they will go out of business very quickly. As software development companies get larger, the people who make the software inside the company become further removed from the selling of that software to their customers. So they become more focused on what they are close to: the technology, not the people who will use it.

Phil Gilbert on IBM Design Thinking

Phil Gilbert, IBM’s new General Manager of Design, comes from a 30-year career in startups, most recently Lombardi, where design was core to their culture. IBM has a portfolio of 3000 software products so, when Lombardi was acquired by IBM, Phil set about simplifying the IBM Business Process Management portfolio of products, reducing 21 different products to just four and kicking off a cultural change to bring design and thinking about users to the centre of product development. Whilst praising IBM’s history of design and a recent server product design award, he also acknowledged at Monkigras: “We are rethinking everything at IBM. Our portfolio is a mess today and we need to get better”. Changing a culture like IBM’s isn’t easy but I’ve seen and experienced a big difference already. Phil’s challenge is to scale the high-quality user-focused design values of a startup to a century-old global corporation.

One of the things that struck me most at Monkigras, and appealed to me most as a social scientist, was the focus on the human side. Despite it being a developer conference, I remember seeing only one slide that contained code. The overriding theme was about people and culture, not technology; how to maintain quality by maintaining a culture that respects its craftspeople and how to retain both even if the organisation gets bigger, even if that naturally limits how much the organisation can grow. Personal analogy was also a big thing…

Laser-scanned model of the engine

Cyndi Mitchell from Logspace talked about her family’s hog farm and working within the available resources. Shanley Kane from Basho used Dante’s spheres to describe best product management practices. Steve Citron-Pousty from RedHat used his background as an ecologist to manage communities and ‘developer ecosystems’ (don’t just call it an ecosystem; treat it like one). Diane Mueller from ActiveState talked about her 20%-time project to build a crowdsourced database of totem poles and the challenges of understanding what gets people to want to contribute to such projects. Elco Jacobs talked about his BrewPi project: automatically managing the temperature of his homebrewing fridge using a RaspberryPi-based controller, and how he has open-sourced it to build a community and kick-start it as a potential small business. Rafe Colburn from Etsy more directly made the link between craft and software engineering in his slides.

3D printer making a spoon

I don’t know much about William Morris so I don’t know which presentations he would have enjoyed or disagreed with. Morris was a preservationist and started the Society for the Protection of Ancient Buildings to ensure that old buildings get repaired and not restored to an arbitrary point in the past. So maybe he would have found laser-scanning and 3D printing interesting. Chris Thorpe is a model train geek and likes to hand-make his own models of real-life objects. He too is interested in alternatives to mass manufacturing and has started to look at how to make model kits. He uses a laser to scan the objects and a 3D printer to prototype the models. He can then send the model to a commercial company who can make it into kits for him to sell. He has recently used his laser-scanning technique to scan a rediscovered old Welsh railway engine to preserve it, virtually at least, in the state in which it was found.

I had a great time with lots of cool and fun people. Well done to @monkchips for scaling a conference to just the right level of intimacy and buzz. The last thing I saw before I left was the craftsman making a wooden spoon pitted in competition against the 3D printer making a plastic spoon.

You can find many of the slide presentations and more about the conference on Lanyrd.


MQTT Joggler

Spurred on by the success of getting Mosquitto working on a Raspberry Pi, I recently had a play with MQTT on the Joggler. The O2 Joggler is still a great device for hacking and I currently have SqueezePlay OS running on it.

The reason I wanted to try and get MQTT on the Joggler was to make use of its light sensor, and publish light levels over MQTT. It all turned out to be pretty simple since most of the work had already been done by other people!

First thing to do was read the light sensor and get that working with an MQTT client. I had to skip some of Andy's instructions and just build the client code rather than attempting to get doxygen working. Once I'd mashed up the light sensor code and publish example I could compile the world's most pointless MQTT publisher:

gcc -Wall publightsensor.c -L../bin/linux_ia32 -I../src -lmqttv3c -lpthread -o publightsensor
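In spirit, the mash-up boils down to something like the following, sketched here in Python with the Paho client rather than the C library above; the sysfs path for the light sensor, the topic name and the broker are assumptions for illustration:

import paho.mqtt.publish as publish

# Read a light level and publish it - a rough Python equivalent of the C
# mash-up. Sensor path, topic and broker are invented for illustration.
with open("/sys/devices/platform/joggler-light-sensor/level") as sensor:  # hypothetical path
    light_level = sensor.read().strip()

publish.single("joggler/lightlevel", light_level, hostname="test.mosquitto.org")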

Next it was time to check the results. This too was quick and easy thanks to the MQTT sandbox server, which has a handy HTTP bridge. And the final result... was a completely unscientific and slightly dingy light level 4! Now I'll be able to turn on a lamp using an unreliable RF controlled socket and see whether it worked or not!

Update: the code really is all in the existing examples but I've created a Github Gist in case it's any help: mqttjogglermashup.c (11 February 2013)


Everybody Technology

This afternoon I went to Everybody Technology, an event to discuss the need for technology to be inclusive and made in a way that is “so smart, so simple and so powerful it works for everybody”.

A highlight of the afternoon was Stephen Hawking – perhaps one of the best examples of the power of technology to enable someone to reach their potential. He also supported the event by lending his voice to a promotional video which explains the idea better than I can.


“Who is Technology Made For?” (YouTube)

There were several speakers. I won’t do them justice, but I did jot a few notes…

Panel discussion with Rupert Goodwins (ZDNet UK) & Damon Rose (BBC)

They talked of the stigma of using “special” equipment created especially for the blind. There were examples where even when technology or tools exist that can help, people don’t always want to use them. Maybe because they feel embarrassed, or they don’t want to be different, or even that they’re struggling with feeling forced to join a group of people they don’t feel a part of.

They discussed how it was more acceptable to use technologies when they are “standard” and how some felt more comfortable using technology that doesn’t single them out as being different.

Someone noted how people can be embarrassed wearing a hearing aid to help them hear, whilst few people would be embarrassed to wear glasses to help them see. Why are some assistive technologies more culturally acceptable than others?

There was a lot of mention of iDevices and appreciation of assistive technology being delivered as iPhone apps. To everyone else, it’s an iPhone and doesn’t stand out as being different. In addition, the fact that it’s mass-manufactured has meant that an expensive collection of advanced sensors and processing capability can be made affordable. An equivalent device produced purely as an assistive technology would be prohibitively expensive. The iPhone sparked a smartphone revolution that made this technology affordable in a way that it wasn’t before.

There was also discussion about how the app culture removed barriers between potential users and developers. Affordable sensors and technology made widely available, combined with a low-cost delivery mechanism for software innovations, make possible innovations in assistive technology that would have been impossible a few years ago.

Presentation on accessible architecture by Paul Kalkhoven

This looked at parallels between buildings and software. Disability became accepted as important in architecture and you can’t build a new building without considering accessibility. This isn’t yet true of technology.

He talked of the conflicting interests of design and utility. When designing a building, you want it to be unique and different. However, you want it to be obvious. If you want to find a toilet or fire exit, you want to understand the layout immediately. The same applies to technology: we want to make something new and exciting. But there is an expectation that it should be usable without a manual. It needs to be accessible.

One observation I hadn’t really recognised: transport buildings lead the way for accessible architecture, often abiding by a common, albeit unwritten, set of standards.

He challenged us to consider what technologists could learn from their experience.

Presentation on talking TVs by Mark Vasey (Panasonic)

Voice guidance is included as standard in most new Panasonic TVs, offering text-to-speech guidance for complex TV menus.

Perhaps more interesting was how they made it happen. He talked about challenges such as the cost of development, licensing and royalties for a feature they include “for free”. There were challenges in marketing to a minority, without wanting to classify it as a specialist product, and without making sighted users think that they were paying for a feature they didn’t need or want.

Similar to the discussion of the iPhone’s impact, he explained how the only way they could do this and make it affordable was to make it standard. Making a specialist TV with accessibility features for the visually impaired would not have been affordable. Spreading the cost across their entire product line is what made it possible.


“Introduction to Voice Guidance on Panasonic talking TVs” (YouTube)

Presentation on Threedom Phone – Antony Ribot (Ribot)

Antony gave a thought-provoking presentation about their project to make the world’s simplest smartphone.

The smartphone revolution has been great for many, but isn’t suitable for everyone. For some, the controls are too small, or too fiddly, or just too complicated. What if we made a smartphone that had only three buttons? Could we provide the essential functions that people need on a device with three large, easy to press, easy to understand, buttons?

He had an example with him and made a convincing case that there is a need for a device like this, in a market where devices are racing to get more complicated.

Everybody Technology : rlsb.org.uk/everybody

A year ago, I wrote about RLSB’s event which brought together a handful of representatives from tech companies, consumer-facing businesses, universities, and charities for the blind. We talked about a vision of a Conversational Internet.

A year later, and RLSB got together a couple of hundred people to talk about projects that had happened – both by them, such as the Conversational Internet prototype that I presented, and by others such as Panasonic’s collaboration with RNIB to produce Voice Guidance.

They talked about what comes next, establishing a new group to bring together technologists and designers with people who understand disabilities, to make real their vision where everyone is taken into consideration.

If you think this is something you can help with, either as a developer, designer, or someone who understands a disability, then why not join them.


developerWorks Days Zurich 2012

This week I had a day out of the office to go to Zurich to talk at this year's IBM developerWorks Days. I had two sessions back to back in the mobile stream, the first an introduction to Android Development and the second on MQTT.

The slots were only 35 mins long (well, 45 mins, but we had to leave 5 mins at each end to let people move round) so there was a limit to how much detail I could go into. With this in mind I decided the best way to give people an introduction to Android Development in that amount of time was to quickly walk through writing a reasonably simple application. The application had to be at least somewhat practical, but also very simple, so after a little bit of thinking I settled on an app to download the latest image from the web comic XKCD. There are a number of apps on Google Play that already do this (and a lot better) but it does show a little Activity GUI design. I got through about 95% of the app live on stage and only had to copy & paste the details of the onPostExecute method (which clears the progress dialog and updates the image) in the last minute to get it to the point where I could run it in the emulator.

Here are the slides for this session

And here is the Eclipse project for the Application I created live on stage:
http://www.hardill.me.uk/XKCD-demo-android-app.zip

The MQTT pitch was a little easier to set up: there is loads of great content on MQTT.org to use as a source, and of course I remembered to include the section on the MQTT-enabled mouse traps and twittering ferries from Andy Stanford-Clark.

Here are the slides for the MQTT session:

For the demo I used the JavaScript d3 topic tree viewer I blogged about last week and my Raspberry Pi running a Mosquitto broker, along with a little script to publish the Pi's core temperature, load and uptime values. The broker was also bridged to my home broker to show the feed from my weather centre and some other sensors.
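For anyone curious, the "little script" idea amounts to something like this sketch (topic names invented, and this isn't the actual script from the demo):

import os
import time
import paho.mqtt.client as mqtt

# Publish the Pi's core temperature, load average and uptime to the local
# Mosquitto broker every 30 seconds. Topic names are invented.
client = mqtt.Client()
client.connect("localhost", 1883)
client.loop_start()

while True:
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        temp_c = int(f.read().strip()) / 1000.0      # reported in millidegrees
    load_1min = os.getloadavg()[0]
    with open("/proc/uptime") as f:
        uptime_s = float(f.read().split()[0])

    client.publish("pi/temperature", str(temp_c))
    client.publish("pi/load", str(load_1min))
    client.publish("pi/uptime", str(uptime_s))
    time.sleep(30)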

Recent hacktivity

This time of year seems to be hacking season and over the last few days I’ve been along to two hackdays!

Friday was IBM’s internal Social Business Hackday. There was some MQTT hacking, a z/OS hack, hacks with Lotus Connections, hacks that could be the future of Lotus Connections, and I was attempting to hack a workaround for a Jazz work item. And that was just at the Hursley local event! We were able to link up with a few other labs, but over two days there were IBMers hacking around the globe. There are going to be a lot of amazing projects to choose from when it comes to voting.

(There are a few more photos from HackDay X, and previous hackdays, on the IBM hackday group on flickr.)

For round two, today was the soutHACKton hack day. By the time I arrived the soldering and drilling had already begun!! Unfortunately I wasn’t able to stay long so I’m hoping there’ll be more of these in the future. I did just about have time to try out an idea I had to hack an old doorbell to sense people using the door knocker. A while ago I had accidentally created a touch sensor with a 555 timer while attempting to build another circuit. So my cunning plan was to deliberately create a 555 touch switch and connect it to the bolt on the inside of the front door. Unfortunately the best I could manage today was a two-wire touch sensor, which isn’t going to work. At least not without leaving a wire hanging out of the letter box with some instructions attached! Unless someone who knows more about electronics can suggest a plan B, I may just resort to a boring doorbell button instead!!




Conversational Internet

tl;dr

We’ve built a prototype to show how we could interact with the Internet using a command-driven approach.

  • A screen reader, but one that uses machine learning and natural language processing, in order to better understand both what the user wants to do, and what the web page says.
  • One that can offer a conversational interface instead of just reading out everything on the page.

It’s a proof-of-concept, but it’s an exciting idea with a lot of potential and we’ve got a demo that shows it in action.

The problem : screen readers today

I’ve written about this before but here is a recap.

Visually impaired people can interact with the web using screen readers. These read out every element on a page.

The user has to make a mental model of the structure of the page as it’s read out, and keep this in their head as they arrow-key around the page.

For example, on a news site’s front page, once the screen reader has read out the page, you have to remember if the story you want is the fifth or sixth story in the list so you can tab the right number of times to get to it.

Imagine an automated telephone menu:
“for blah-blah-blah, press 1, for blather-blather-blather, press 2, for something-or-other, press 3 … for something-else-vague, press 9 …”

Imagine this menu was so long it took 15 minutes or more to read.

Imagine none of the options are an exact match for what you want. But by the time you get to the end, you can’t remember whether the closest match was the third or fourth, or fiftieth option.

The vision : a Conversational Internet

Software could be smarter.

If it understood more about the web page, it could describe it at a higher, task-oriented level. It could read out the relevant bits, instead of everything.

If it understood more about what the user wants to do, the user could just say that, instead of working out the manual navigation steps themselves.

The vision is software that can interpret web pages and offer a conversational interface to web browsing.
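Purely as a toy illustration of the idea (not the prototype itself), here's what describing a page at a task level rather than element-by-element might look like for some invented news-site markup:

from bs4 import BeautifulSoup

# A toy illustration only: pull the headline links out of some invented
# news-site markup and offer them as a short, task-level summary the user
# could ask about, instead of reading out every element on the page.
html = """
<div class="story"><h2><a href="/news/1">Rain expected at the weekend</a></h2></div>
<div class="story"><h2><a href="/news/2">Local team wins cup final</a></h2></div>
"""

soup = BeautifulSoup(html, "html.parser")
headlines = [a.get_text() for a in soup.select("div.story h2 a")]

print("This page has", len(headlines), "stories:")
for number, title in enumerate(headlines, start=1):
    print(f"  {number}. {title}")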

Continue reading