Voice recognition still has retailers jumping through hoops

Editor's Note: The following is a guest post from Sam Cinquegrani, CEO and founder of ObjectWave, a digital commerce solutions provider.

Walmart’s just-announced partnership with Google Assistant is bound to give Alexa a run for its money. But adopting any new technology raises the same question: how do you work within its limitations? Bottom line, you need to take the constraints into consideration in order to give your shopper an optimal user experience.

An obvious example of constraints is the smartphone: With its relatively small screen, you don't have the real estate to always be able to deliver an optimal user experience. Similarly, voice recognition software has constraints.

How can retailers get around the fact that Voice technology is not yet perfect?

For an interaction to work without the customer having to repeat it over and over again, whether it’s a command or a voice prompt, a huge behind-the-scenes data bank needs to be in place. Simply put, as with any search, the term or item being searched for, or an equivalent, has to exist in that data bank. If it’s not found, the customer hits a wall.

It’s therefore very easy for Voice to become counterproductive — the software can, and often does, fail to recognize a word, annoying the customer instead of providing a new way of engaging her. With Voice, the maxim "know your customer" is almost too glib — there are many more layers to the Voice interaction than any other digital touchpoint.

So how can retailers make Voice work?

Let's use the analogy of passing through a number of doors until you get through the door behind which lies your product or service. Voice is a collaborative effort between retailer and customer. The voice command needs to open the first door, then subsequent doors. At each door, the command needs to be understood. So if at any point, you can’t get further and the door doesn’t open, the request or sale doesn’t happen.

Not only has that lack of recognition or understanding lost the retailer a sale; worse, it could have lost the retailer that customer. It’s akin to having a retail website that takes forever to load: it’s well documented that shoppers won't wait even 5 seconds, and disturbingly slow load times drive them to another site that loads faster. Add to that, voice recognition is so much more emotionally loaded than visual apps are. How many people have you observed yelling back at the disembodied voices they encounter on phone answering systems because their request is not being understood? Getting to the point where Voice becomes an asset that improves the user experience for the customer requires some real work.

Some retailers have already gotten to a point with voice recognition software where it can discern requests correctly and come back with a correct, if broad, answer. To an extent, retailers can buy the technology, as it’s already out there and available. However, to get to a level of improved user experience, it becomes a matter of needing to match the basic, off-the-shelf technology to a specific application or use.

With that in place, let’s take a closer look at the door analogy.

First, the customer enters the application; that's door No. 1. Through door No. 2, the customer engages with Voice and the retailer understands who he or she is and where to send the order. Door No. 3 is a biggie: the customer orders.

Now we get into the smaller and more specialized doors that are equally important to customer satisfaction and user experience. Getting through door No. 4 involves Voice recognizing that the customer is ordering something, in this case food; door No. 5 is fresh produce; and to open door No. 6 (the kicker!) it must understand what the customer wants when she orders jicama. The customer will say "HIH-cah-muh," the (more or less) correct pronunciation of this desert tuber sometimes called Mexican turnip. Will the retailer’s voice software be able to pick up that the 'j' in jicama is pronounced as an 'h,' and will it know how the word could be variously accented? So along with synonyms, the voice vocabulary needs intensive training on “pronunciation-nyms.”
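The door sequence can be read as a chain of checks in which a single failure ends the order. The sketch below is only an illustration of that idea: the door names, the `Order` fields and the tiny catalog are all hypothetical and do not come from any real voice platform.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical catalog: category -> items the voice vocabulary already knows.
CATALOG = {"fresh produce": {"apple", "carrot"}}


@dataclass
class Order:
    customer_id: Optional[str]  # None if the customer was not identified
    category: str
    item: str


# Each "door" is a named check that must pass before the next one is tried.
DOORS: List[Tuple[str, Callable[[Order], bool]]] = [
    ("customer identified", lambda o: o.customer_id is not None),
    ("category understood", lambda o: o.category in CATALOG),
    ("item understood", lambda o: o.item in CATALOG.get(o.category, set())),
]


def first_closed_door(order: Order) -> Optional[str]:
    """Return the name of the first door that fails, or None if all open."""
    for name, check in DOORS:
        if not check(order):
            return name
    return None
```

With this toy catalog, `first_closed_door(Order("c-1", "fresh produce", "jicama"))` returns `"item understood"`: every earlier door opened, but the sale still stops at the last one.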

To achieve a level of engagement through Voice, all this will have to work the first time out, otherwise it will be instantly rejected by the customer.

What does Voice need to do that it’s not doing now?

Retailers will need to build in different ways of saying or searching for things, including different names for the same product and different pronunciations of words. For example, a British friend of mine has trouble being understood by voice software when it comes to his pronunciation of words like "process": to him it’s "PROH-sess," while to most American English speakers it’s "PRAH-sess." In some cases retailers may need to make Voice smarter by developing a set of exceptions. This may require some manual intervention early in the game to analyze what the machine fails to understand but a human would grasp easily.
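One way to picture such a set of exceptions, along with the manual review step, is a lookup table plus a log of everything that missed. This is a minimal sketch under assumptions: the transcriptions in the table are invented examples of what a recognizer might produce, and `resolve` is a hypothetical helper, not part of any vendor's API.

```python
from typing import Dict, List, Optional

# Hypothetical exception table: how the recognizer might transcribe a
# pronunciation -> the catalog term the customer most likely meant.
PRONUNCIATION_EXCEPTIONS: Dict[str, str] = {
    "hickama": "jicama",
    "hee kama": "jicama",
    "jickama": "jicama",
}

KNOWN_TERMS = {"jicama", "carrot"}

# Transcripts that matched nothing, kept for a person to review later.
unrecognized: List[str] = []


def resolve(transcript: str) -> Optional[str]:
    """Map a raw transcript to a catalog term, or log it for manual review."""
    heard = " ".join(transcript.lower().split())  # normalize case and spacing
    if heard in KNOWN_TERMS:
        return heard
    if heard in PRONUNCIATION_EXCEPTIONS:
        return PRONUNCIATION_EXCEPTIONS[heard]
    unrecognized.append(heard)
    return None
```

Anything that lands in `unrecognized` is a candidate for a new exception once a human has worked out what the customer meant.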

People often can’t remember what an item or product is called — they get tongue-tied. Natural language developers need to design and build a broad swath of vocabulary and also a thesaurus to cover this, and for colloquial expressions, too. They need to train voice software to filter out or disregard colloquial fillers, such as "um," "like," "you know…"
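Filtering out fillers can be sketched as a simple text-cleaning pass over the transcript. The filler list here is an assumption for illustration; a real list would be tuned from actual customer transcripts, and a word such as "like" is sometimes meaningful ("I'd like a sponge"), which this naive version does not handle.

```python
import re

# Assumed filler list; multi-word fillers are matched as a unit.
FILLERS = ["you know", "um", "uh", "like"]

_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b",
    flags=re.IGNORECASE,
)


def strip_fillers(utterance: str) -> str:
    """Remove filler words, then tidy leftover punctuation and spaces."""
    cleaned = _FILLER_RE.sub(" ", utterance)
    cleaned = re.sub(r"[,.…]+", " ", cleaned)  # drop commas, periods, ellipses
    return " ".join(cleaned.split())
```

For instance, `strip_fillers("Um, I want, like, a dish sponge, you know...")` returns `"I want a dish sponge"`.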

For the actual items, let’s take household cleaning products as an example. There’s a kitchen cleanup aid called Loofah RubBits; it’s multipurpose, biodegradable, non-toxic and comes in several colors. Another similar product is called Twist. But can you imagine trying to get voice recognition software to understand what you’re asking for? Suppose you can’t remember the exact name of the product; maybe you can only remember "dish sponge, natural" but not the brand name RubBits, or loofah, or the brand name Twist. Currently, it’s all too easy to hit a wall with voice recognition software if you can’t come up with a word or phrase it recognizes.

In more general terms, a large part of what retailers need voice recognition to do is translate what the customer tells it into what’s available. If the customer can’t think of the brand name, or can’t come up with a descriptor specific enough for the voice system to recognize, both the customer and the retailer will be out of luck.
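Translating a vague request into what’s available could, in its simplest form, mean scoring products by the descriptor words they share with the request. The sketch below assumes a hand-built tag set per product; the tags and the "Steel Scrubber" entry are hypothetical, added only so that there is something that should not match.

```python
from typing import Dict, List, Set

# Hypothetical product tags; the descriptor words are assumptions.
PRODUCT_TAGS: Dict[str, Set[str]] = {
    "Loofah RubBits": {"dish", "sponge", "natural", "loofah", "kitchen"},
    "Twist": {"dish", "sponge", "natural", "kitchen"},
    "Steel Scrubber": {"dish", "scrubber", "metal"},
}


def match_products(request: str, min_overlap: int = 2) -> List[str]:
    """Rank products by how many descriptor words the request shares."""
    words = {w.strip(",.").lower() for w in request.split()}
    scored = [(len(words & tags), name) for name, tags in PRODUCT_TAGS.items()]
    hits = [(score, name) for score, name in scored if score >= min_overlap]
    # Highest overlap first; ties broken alphabetically for stable output.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in hits]
```

Here `match_products("dish sponge, natural")` returns `["Loofah RubBits", "Twist"]`, so the customer gets a short list to choose from instead of hitting a wall.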

As developers of natural language capabilities have to work on a wide swath of vocabulary, it’s the specialty areas and specifics of language patterns that most often get ignored. When a customer wants to search for specific items, these can get lost — and the customer can get lost, too — because the language is missing.

Going wide, going deep

When working on online and mobile applications for retailers, I always start with the question, "Who is the customer, and how do they buy?" With voice recognition, it’s of vital importance to know what they buy — and specifically, what brands, varieties of fruits or vegetables, unique functions and so on.

While acquiring breadth of vocabulary is already in the works, Voice currently lacks the ability to go deep. It needs to move toward understanding and appealing to the specialty within markets, as well as to the connoisseur. As an example, let’s take a chocolate bar made by Dagoba. The flavor is Lemonberry Zest. Both the brand name and the name of this flavor hint at the multitude of creative new brands that are hitting the grocery shelves. So what if your customer wanted to order a box of these bars using Voice, and couldn’t pronounce Dagoba? Or let’s say, more realistically, that they pronounce it in some way that the voice software doesn't recognize. Multiply that by the several hundred new brand names that are popping up in specialty retail stores.

The question is, how many different possible ways of pronouncing Dagoba — leaving aside the question of right and wrong pronunciation entirely — would be enough for Voice to get it every single time?
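Rather than listing every possible variant by hand, one option is approximate matching against the brand vocabulary. The sketch below uses Python's standard `difflib` to compare the spelling of a transcript with known brand names. That is a stand-in: it compares letters, not sounds, and a production system would more likely compare phonetic representations. The brand table and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Optional

# Hypothetical brand vocabulary, keyed by a lowercase form for comparison.
BRANDS = {"dagoba": "Dagoba", "twist": "Twist"}


def closest_brand(transcript: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the brand whose spelling is closest to the transcript, if any."""
    matches = difflib.get_close_matches(
        transcript.lower().strip(), list(BRANDS), n=1, cutoff=cutoff
    )
    return BRANDS[matches[0]] if matches else None
```

A transcript such as "Dagobah" or "dogoba" resolves to Dagoba under these settings, while an unrelated word like "chocolate" returns `None`. Lowering the cutoff catches more variants but also raises the risk of matching the wrong brand.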

Because there’s the rub. As Voice stands today, it can often be so deafly obtuse that anyone using voice recognition software ends up trying different ways of saying the same thing in order to be understood.

So the artificial intelligence (AI) component that needs to be built in has to include a broader vocabulary, a deeper vocabulary as required by specific retailers and industries, and a spectrum of pronunciations. If you’re a retailer, you not only have to configure voice recognition software to your online/mobile presence, you also need to make it understand and bend to your specialty, your particular offerings.

Voice plus mobile: An ideal pairing, but not quite there yet

Ideally, voice recognition is a perfect match for mobile, but retailers will have to get all these other factors right first. While the high-level goal is to configure their voice recognition software to become so smart that it can predict what the customer wants and needs before the thought is even spoken, they first have to get to first base, which means successfully passing through the doors described above.

A voice system starts with different ways of saying things, and different ways of pronouncing them. Then the specifics of what customers buy must be added to the "expert dictionary" of the software. Essentially, it’s the same as loading a term into a database and making it match up with a search. You load in as many possible terms as you can. Similarly with voice: you’re working with sound bites stored in a database that you’re trying to match to your customer’s order. This range of search terms needs to be huge, pliable, elastic and also absorptive. Not unlike that loofah sponge.
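The "load in as many terms as you can" idea can be sketched with a small alias table in SQLite, where every spoken variant points at one product. This is a toy illustration under assumptions: the table name, the `build_dictionary` and `look_up` helpers, and the sample variants are hypothetical, and it matches transcribed text rather than actual audio.

```python
import sqlite3
from typing import Optional


def build_dictionary() -> sqlite3.Connection:
    """Create an in-memory table mapping spoken variants to products."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE aliases (spoken TEXT PRIMARY KEY, product TEXT)")
    conn.executemany(
        "INSERT INTO aliases VALUES (?, ?)",
        [
            ("jicama", "jicama"),
            ("hickama", "jicama"),         # assumed transcription of HIH-cah-muh
            ("mexican turnip", "jicama"),  # alternate name for the same item
            ("loofah sponge", "Loofah RubBits"),
        ],
    )
    return conn


def look_up(conn: sqlite3.Connection, spoken: str) -> Optional[str]:
    """Return the product for a spoken term, or None if the term is missing."""
    row = conn.execute(
        "SELECT product FROM aliases WHERE spoken = ?",
        (spoken.lower().strip(),),
    ).fetchone()
    return row[0] if row else None
```

With this table, `look_up(conn, "Mexican turnip")` finds jicama, while a term nobody loaded, such as "yam bean", comes back as `None`. That missing row is the wall the customer hits.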