Challenges to overcome in Voice UIs
Nov 24, 2017
5 minutes read

Voice as a consumer interface is still in its infancy, with some issues to iron out before it truly hits prime-time

Google Home Mini

With Black Friday upon us and Christmas around the corner, home voice assistants such as Google Home and Alexa are expected to sell better than ever. While voice has been used as a complementary tool in other interfaces (e.g. Google voice search on a mobile phone), it hasn’t really stood on its own two feet in many consumer devices, meaning with no screen to complement it.

During my short time developing Voice UIs for these home assistants I’ve realised that the design patterns in Voice UIs feel very underdeveloped, which leads to issues with usability and retention. I don’t think this is terminal. I see it in a similar vein to when mobile apps first became popular: a lot of messing around and experimentation, eventually settling on some general principles common to most apps. Below I’ve listed a few of the main problems I’ve experienced in developing for voice which aren’t really problems when creating visual UIs. Whether these are problems that can be solved, or just constraints that we have to live with and work around, I’m not sure.

Feature Discoverability

Give me a mobile app, tell me nothing about it, and I’ll figure it out through prompts and experimentation. Put a Google Home in front of me and it’s hard to figure out what to do. Google have currently dealt with this by having a “Discover” tab in their Google Home mobile app (visual), but users don’t want to have to read an instruction book.

A potential solution could be for the assistant to teach you over time based on what you currently use it for (“Timer is set. Did you know you can also ask X?”), though it will be tricky to do this without being intrusive and spammy. Another could be some kind of on-boarding, similar to how many visual apps do it now, although with the lower bandwidth of audio it may take minutes to explain, and users

Flat Attention Hierarchy

With visual apps you can scan and then focus on different things. Scroll, scroll, scroll, then tap into the thing that catches your eye. Components can be different sizes to convey importance. In voice, everything is given the same “attention priority”: it’s all conveyed to you in a sequential, ordered fashion that you tend not to skip through, because you’re never entirely sure what you’re skipping over or skipping to.

One consequence worth thinking about is that this has huge ramifications for online shopping. Visually you can scan items very quickly, looking at tens of options in 30 seconds and getting a pretty good idea of what’s on offer. This currently isn’t the case in voice interfaces. It takes a painful amount of time to get through each item, and you can’t really know whether you should skip until you’ve heard most of the spiel.

Side note: Listening speeds vs Reading speeds

According to Wikipedia:

The average adult reads text at 250 to 300 words per minute. 150 to 160 words per minute is the recommended talking speed that allows for comprehension and understanding.

On the surface this doesn’t seem like a huge divide: reading is at most ~2x faster than listening. The one advantage text has, though, is that you can skim it, not picking up 100% of the details but getting a pretty good idea of the content. Skim reading can be done significantly faster:

Skimming is usually seen more in adults than in children. It is conducted at a higher rate (700 words per minute and above)

Now we’re looking at skimming being almost 5x faster than listening (700 versus roughly 150 words per minute), albeit with lower comprehension.
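For anyone who wants to check the arithmetic, here is a rough sketch in Python using only the Wikipedia figures quoted above. The function names and the 300-word product description are made up for illustration, and 700 words per minute is treated as a floor for skimming since the quote says “and above”.

```python
# Words-per-minute figures quoted above, kept as (low, high) ranges.
READING_WPM = (250, 300)
SPEAKING_WPM = (150, 160)
SKIMMING_WPM = (700, 700)  # "700 words per minute and above": a lower bound


def speedup_range(fast, slow):
    """Return the (smallest, largest) ratio of a fast rate range to a slow one."""
    fast_lo, fast_hi = fast
    slow_lo, slow_hi = slow
    return fast_lo / slow_hi, fast_hi / slow_lo


def minutes_to_consume(words, wpm):
    """Minutes needed to get through `words` words at `wpm` words per minute."""
    return words / wpm


reading_vs_listening = speedup_range(READING_WPM, SPEAKING_WPM)
skimming_vs_listening = speedup_range(SKIMMING_WPM, SPEAKING_WPM)

print("reading vs listening:  %.1fx to %.1fx" % reading_vs_listening)
print("skimming vs listening: %.1fx to %.1fx" % skimming_vs_listening)

# Hypothetical example: a 300-word product description.
DESCRIPTION_WORDS = 300
print("listening: %.1f min" % minutes_to_consume(DESCRIPTION_WORDS, SPEAKING_WPM[0]))
print("skimming:  %.1f min" % minutes_to_consume(DESCRIPTION_WORDS, SKIMMING_WPM[0]))
```

With these figures, reading comes out somewhere between about 1.6x and 2x the speed of listening, and skimming between about 4.4x and 4.7x, which is where the “~2x” and “almost 5x” above come from.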

Requires Active Intent

Voice UIs currently require conscious and active intent, outside of a few common tools that can become habits (such as setting timers whilst cooking). Mobile apps, on the flip side, can be picked up and browsed in a more relaxed, less directed fashion. I’m not going to wade into whether this is a good feature of mobile UIs or not, but the fact is that it leads to mobile being much stickier than voice in day-to-day usage. Like I said at the start, this may not be a challenge to overcome but merely a constraint: Voice UIs may simply be better suited to active tools than to passive consumption.


All of the above leads to a common pattern: the Google Home or Alexa product is used for a week or two after purchase, but is then relegated to an ornamental role outside of a few key uses that people turn into habits (I currently use my Google Home to make white noise and tell me the weather). As was reported of VoiceLabs’ findings earlier this year: “When customers use a voice app on Alexa or on Google Assistant, there is only a 3% chance that they will become an active user in the second week, according to the report.”

Bringing it all together

Voice UIs haven’t yet worked out how to become an interface that users naturally engage with on a regular basis, nor how to encourage users to learn more and expand their use of these interfaces over time (at the moment the opposite seems to happen). On the other hand, voice as a tool integrated into other UIs, usually as a convenient input method, seems to be taking off. An increasing proportion of Google searches are now done by voice (coming from the phone).

Whether voice can stand on its own two feet or remains a tool integrated into visual interfaces, only time will tell.
