Finding the Right Voice Interactions for Your App (Google I/O '17)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[MUSIC PLAYING] DANIEL PADGETT: Hello, everybody. How's everybody doing? It's day two of I/O. Good-- a resounding "good" from everybody. I had this idea. I didn't go with it. I was going to grab my guitar from my house and bring it on stage. I wasn't going to play you a whole song. I was just going to play like three chords, and then I could tell all my friends that I played Shoreline. I didn't go with it, though.

So, the title of the talk today: "Finding the Right Voice Interactions for Your App." By way of introduction, my name is Daniel Padgett, and I'm a conversation design lead here at Google. I have the very great pleasure of working on Google Assistant, with a focus on voice interactions across the services where we have the Assistant. I work with an amazing cross-functional team that's tackling really difficult problems. Actually, I've been working in the language technology space for about 15 years, and it's kind of like everything's coming together in a way that's really exciting for me personally. I've just been waiting for this stuff. This is the stuff we thought about-- the sci-fi promise, the stuff from "Star Trek" and "Star Wars"-- things I was really excited about as a kid. I know there's a ton of work to do, and we're nowhere near where those bots happen to be. But this is a really exciting time to be in this space and working on Assistant.

So, a question I get a lot: what is conversation design? And thank you for asking. For me, conversation design is a carefully curated back and forth between human and artificially intelligent machine. It generates a personalized and context-sensitive experience around tasks and content. And it's content that we experience in a very visceral way-- an immersive experience filled with things that we can see, that we can hear, that we can touch. It's content that we can interact with and control by speaking and scrolling and swiping and tapping and even typing, as we announced yesterday in the keynote. In that way, it's a design practice that really puts humans at the center. It's one of the things I really love about it. It lets us be us. And in the end, when people ask what I do: I teach robots and talk to humans. That's my gig.

But we're here today to talk specifically about voice-only interactions-- voice in, voice out, no screens, no tap targets, no navigation bars, just voice. Now, I'm assuming that most of you are new to this space and just starting down the path here. Show of hands, has anyone worked with voice in the past? OK, a decent amount. Before we get too far into specifics, I just want to say: designing for voice is different-- different than for mobile and for websites. Please take that at face value. I'm making no claims about complexity here. Design can be a tough thing. It's just different. Voice is different. And that's what I'm hoping to unpack a bit for you today.

So here's a quick look at what we're going to cover. I want to cover the value of voice-- again, voice only, no screens. We'll cover some considerations, things that you should be thinking about as you're creating voice experiences. And some opportunities-- really, some ideas around use cases. All right? Sounds good? Good.

So let's start with the value of voice. The question here really is, why? Why voice? And for me, it boils down to three things: speed, simplicity, and ubiquity. Now let's start with speed. Voice interactions can be extremely, extremely fast.
Done well, they're even faster than pulling your mobile phone out of your back pocket. Getting to the app you want on a mobile phone takes a tap or two, maybe even a swipe. And when you get to the app, you're another tap and maybe even some typing away from a result. With voice, you get to bundle all of that. It's like the ultimate shortcut. That's what I think is really cool.

Then there's just the simplicity of use. You have users who already know what to do. This is language. It's conversation. And they've been doing this forever-- well, since they were born. There's really nothing for them to learn. Don't get me wrong, I'm the first to acknowledge the challenges around design and technology and things like that. But the promise here is: say what you want, get what you want. And it doesn't really get much simpler than that.

And finally, there's just the sheer number of entry points-- a number that continues to grow. You can already reach Google Assistant and our Actions on Google partners on millions of devices. And it feels like there are new surfaces and new opportunities popping up all the time. All you need, really, is a microphone and a speaker. Well, that and a network connection to the Actions on Google platform, but you get the point. The reality here is that the numbers are significant. The potential here is massive. By the way, I don't know if anybody saw this thing that was launched a couple weeks ago. Literally cardboard, a microphone and a speaker, a Raspberry Pi, and you're connected to the Assistant. That's really cool stuff. Did anyone get one? No? Oh, one guy. Get one. I have one sitting on my kitchen table. It's the next project with my 11-year-old. I'm looking forward to it.

Anyway, let's do a quick thought experiment. Let's play "how many taps would it take?" Simple equation: what's 15 times 24? You don't have to take out your phones or anything. But just think about it. If you took out your mobile phone, how many taps would it take? You open your phone. You navigate to the app. You tap the app. You type the equation, get your answer. I think I did it in eight on my Pixel earlier today. Now how about this one: play the latest album by Gorillaz on Spotify-- one of my favorites. I think on my device I did it in 12 taps, and I think there were a couple swipes in there to go through my pretty lengthy library. Last one: any direct flights to Denver next Sunday? Anyone? Right. I actually stopped counting. So long story short, with voice, we can get our answers and actions pretty much instantaneously. So why voice? Ultimate convenience. It's really, really powerful stuff when done the right way.

Now let's talk through some considerations. It seemed like a lot of you didn't raise your hand, so you're new to the space. The first thing I want to talk to you about is the nature of the signal. I think it's a good thing to acknowledge. The fact is, voice is linear, and it's always moving forward. It's ephemeral, too, which means it's always fading behind you. If you can imagine, it's like the critical elements of your mobile app disappearing pretty much as soon as they appeared-- the Back button and the hamburger menu just fading away, content appearing once and only for a fleeting moment. You've got to keep up with it. Those anchors are critical to visual and touch interfaces. They're persistent. They're available.

So with voice, we rely really heavily on the users' knowledge of conversation and what they can recall. So, some things to think through based on the nature of the signal. Keep people comfortable, helping them stay in that space and that thing that they know-- the metaphor of conversation. Use everyday language that users can relate to. That's a really important one. Ask questions that are easy to answer. You don't want them thinking too much-- you're going to time out anyway with the current technologies. And structure information in a way that supports easy recall. Make it easy for them to remember the stuff you're presenting. For deeper strategies around this-- and I'm not going to go deep here-- I would actually encourage you: tomorrow, there's a talk by a gentleman named James Giangola, a friend and colleague of mine. He's giving a great talk on what we call the cooperative principle. I'll show you some details later. But that's going to be some good stuff.
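To make that "easy recall" point concrete, here's a minimal sketch of one way to chunk spoken output so users are never asked to hold a long list in working memory. This is plain TypeScript, not an Assistant or Actions on Google API; the names (`speakList`, `CHUNK_SIZE`) are hypothetical.

```typescript
// Hypothetical helper: present a long result set in recall-friendly chunks.
const CHUNK_SIZE = 3; // small enough to hold in working memory

function speakList(items: string[], offset = 0): string {
  const chunk = items.slice(offset, offset + CHUNK_SIZE);
  const remaining = items.length - (offset + chunk.length);
  let prompt = chunk.join(", ") + ".";
  if (remaining > 0) {
    // Invite the user to pull more, instead of pushing everything at once.
    prompt += ` There are ${remaining} more. Want to hear them?`;
  }
  return prompt;
}

// Usage: speakList(flights) reads three results, then offers the rest;
// a "yes" would call speakList(flights, 3), and so on.
```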
All right, moving on. The other thing that you want to consider is the capabilities of the technology. Recognizing the words people say is much different than understanding what they mean when they say them. Now, with respect to recognition, Google already had a low word-error rate, but we continue to improve. I saw in Sundar's keynote yesterday that after applying deep learning to the problem, we slashed our word-error rate again significantly, and now we're down to something like 4.9%. That's an absolutely amazing result. So to some extent, recognition is a, quote unquote, "solved problem." Yes, there's still room to improve, but it's more or less a solved problem. Language understanding, though, is a more difficult nut to crack.

A couple of examples to illustrate the point. What's the weather in Springfield? Which Springfield are we talking about-- Missouri, Massachusetts, some other Springfield? I did a Google search, and I think there's one in every state. I think it's the most popular city name in the nation. Maybe the best solution to answering this question for the person who posed it is to apply some context. We know that the user lives in Springfield, Missouri; therefore, we provide them with the weather in Springfield, Missouri. But maybe that's not the right thing. Maybe it's weird that they're asking what's the weather in Springfield when they live in Springfield. Wouldn't they ask, what's the weather here, or what's the weather like today-- that sort of thing? I don't know. It's possible they mean one of the other Springfields.

How about another one? Play "Yesterday." Is that the song? Is that the movie? There is a movie. The playlist? An audiobook? Maybe it's a game. Who knows? Now again, applying context and the things a company like Google would know about users, statistically they're probably looking for the song. Right? But that opens up other complexities. While I have no idea why anyone would want to hear anything other than the original version by the Beatles-- that's my personal preference-- perhaps they want to hear the dulcet tones of Boyz II Men singing the R&B rendition they did in the early '90s, or some other version. I really don't know. It's something to consider as you deploy your voice applications. I have no doubt that you're going to be running into this.

Some strategies here: acknowledge ambiguity when you don't know the answer and can't apply the appropriate context, and let users clarify. Don't be afraid to engage. A lot of people say, well, it's an extra step. Believe me, taking that extra step is way better than the correction users are going to have to make if you get it wrong. So just keep that in mind. And remember their choices, and leverage them next time, if you can. This is the learning part. Keep a record, and apply it the next time they come into your app. It's going to be super useful for streamlining interactions in the future.
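As a sketch of those three strategies-- apply context, let the user clarify, and remember the choice-- here's what a disambiguation step for the Springfield example might look like. This is plain TypeScript with assumed names (`UserContext`, `resolveCity`), not a real Assistant API.

```typescript
// Hypothetical disambiguation flow for "What's the weather in Springfield?"
interface UserContext {
  homeCity?: string;                    // e.g. "Springfield, MO"
  resolvedCities: Map<string, string>;  // remembered clarifications
}

function resolveCity(
  spoken: string,
  matches: string[],
  ctx: UserContext
): string | null {
  if (matches.length === 1) return matches[0];

  // Leverage a choice the user made last time, if we recorded one.
  const remembered = ctx.resolvedCities.get(spoken);
  if (remembered !== undefined) return remembered;

  // Apply context: prefer the user's home city when it's a candidate.
  if (ctx.homeCity !== undefined && matches.includes(ctx.homeCity)) {
    return ctx.homeCity;
  }

  // Acknowledge the ambiguity and let the user clarify. The caller
  // would ask: "Which Springfield -- Missouri or Massachusetts?"
  return null;
}

// After the user clarifies, record the choice to streamline next time.
function recordChoice(spoken: string, chosen: string, ctx: UserContext): void {
  ctx.resolvedCities.set(spoken, chosen);
}
```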
Last, but certainly not least, we have our users. Always, always, always, we need to consider our users and their context. We've always talked about voice-only interactions being the hero for people whose hands are busy, whose eyes are busy, who are doing some sort of multitasking, typically in private or at least in a familiar space-- something they're sharing with family, for example. It's voice interactions while they're driving in the car, cooking in the kitchen, that sort of thing. And that's absolutely still the case. But there's definitely movement outside of that-- people leveraging voice while they're on the move in public spaces, for instance, on their bikes. I see kids using this stuff on the street a lot.

But what I really want to cover here is something that we touched on earlier-- the fact that users are instant experts. There is nothing to teach, or at least there shouldn't be. Again, this is language. It's conversation. It's something they've been doing forever. And because of that, they have high expectations and very, very, very low tolerance for error. This stuff should just work, right?

Now let's think about that for a second-- the difference between touch and text input for a visual UI and a voice UI. With a touch UI-- with visual, tappable UIs-- the reasons for errors tend to be transparent. You know when you've made a typo or hit the wrong tap target. You know when you need to course correct. And you're in control: you know what to do, you take the step, and you make the correction. On the voice side, though, errors are typically the system's fault, not the user's-- that is, as long as the user is truly cooperating and not speaking nonsense. And when there's an error, they have to rely heavily on the system to get them back on track.

That said, some strategies here. Make sure you're really, really, really spending time developing a strategy for exceptions. I just can't stress this enough. This is the thing that is going to make or break your application. You want to make it really easy for users to get back on track, and you want to leverage techniques that we use in everyday conversation.

Maybe I can illustrate this one. Say you're collecting a date from somebody. You ask them, what date do you want to travel? And they say, April 14th. Now let's say you need the year as well-- for some reason, they may be traveling a year from now. If you only accept a complete date, that's going to create a loop. You're going to throw them an error. You're going to say, sorry, I didn't get that, and ask them to repeat the date. And you've frustrated a user. So instead, your strategy here should be: accept the April 14th, move on, and just ask, what year? Use the kinds of things that we do in everyday conversation. Or if you have some sort of recognition error-- if they say a date and you don't catch it-- an easy strategy is simply: sorry, what's the date? Just ask them to repeat it in a very natural way.
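Here's a minimal sketch of that slot-filling strategy: keep the partial answer and prompt only for the missing piece, with a short, natural reprompt on a recognition error. Again, this is plain TypeScript with hypothetical names, not an Actions on Google API.

```typescript
// Hypothetical date slot for "What date do you want to travel?"
interface TravelDate {
  month?: string;
  day?: number;
  year?: number;
}

// Returns the next question to ask, or null when the date is complete.
function nextPrompt(date: TravelDate, heardSomething: boolean): string | null {
  if (date.month === undefined || date.day === undefined) {
    // A recognition miss gets a short, natural reprompt -- not an
    // error loop that makes the user start over.
    return heardSomething
      ? "Sorry, what's the date?"
      : "What date do you want to travel?";
  }
  if (date.year === undefined) {
    // Keep the "April 14th" we heard; ask only for the missing piece.
    return "Got it. What year?";
  }
  return null;
}

// nextPrompt({ month: "April", day: 14 }, true) -> "Got it. What year?"
```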
Now, lastly, we're going to touch on some opportunities-- opportunities around use cases. From a user perspective, what they're asking here-- or what they're telling us, I should say-- is: answer my questions, help me get things done, keep me on track, and give me something to do. From a developer perspective, I think that means supplying quick answers, frictionless transactions, and relevant suggestions.

So, some use cases taken from some of the verticals that are out there. If you're in financial services, some things you might want to consider: what's my current balance, how much do I owe, what was my last transaction-- things that should be quick and fast. I need to make a payment, transfer $100 to Joe's account, or when does my policy expire? Maybe some of you are in retail. Is my order on the way, what's your return policy, do you do same-day delivery? Maybe it's a stock check. Do you carry milk, are headphones on sale, I need to reorder cat food. Health care-- actually a space that I came from: is my prescription ready, can I get a flu shot, do you have any appointments available? Maybe it's a clinic: I need to schedule a follow-up, I need to leave a note for my doctor. Refill my prescription is a really fast one. I wonder if there's anybody from Walgreens here. Anyone? Hey, how's it going? And lastly, just another vertical that I thought through-- fun and wellness. What's my horoscope, got any good dad jokes? Actually, I think every morning my son asks the Google Home that we have in the kitchen to tell him a joke, and then he runs into the bathroom where my wife is getting ready and tries to pass it off like it's his. There are other opportunities around reading stories, playing a game, or meditation.

In the end, the thing that you really need to make sure of is that it's easier than the alternative. If it's not more convenient for your users to do this with voice, it probably isn't worth pursuing. I can't stress this enough: make sure it's easier than the alternative.

Now, I mentioned some related talks. There's one that was done earlier today on multi-modal design by my colleagues Jared Strotterman and Adriana Olmos, "One Size Doesn't Fit All-- Defining Multi-Modal Interactions," and I would encourage you to go watch that online. There's one tonight at 6:30 by Nandini Stocker called "In Conversation, There Are No Errors," and that gets at the strategies we were discussing a little earlier. And finally, the one that I mentioned by James, "Applying Built-in Hacks of Conversation to Your Voice UI," is really all about the cooperative principle and making sure that you're leveraging the rules of discourse and conversation to make your users very comfortable in your speech application and your design.

I feel like there was a massive chunk of that that has disappeared from my presentation somehow. And I am at the end, unfortunately. Here's a URL for you to go to and check out some stuff by the Actions on Google design and developer teams. Make sure you take that down and go check it out. [MUSIC PLAYING]
Info
Channel: Google Developers
Views: 9,040
Rating: 4.8 out of 5
Length: 22min 11sec (1331 seconds)
Published: Thu May 18 2017