[MUSIC PLAYING] DANIEL PADGETT:
Hello, everybody. How's everybody doing? It's day two of I/O. Good-- a
rounding "good" from everybody. I had this idea. I didn't go with it. I was going to grab my
guitar from my house and bring it on stage. I wasn't going to
play you a whole song. I was just going to
play like three chords, and then I could
tell all my friends that I played Shoreline. I didn't go with it, though. So the title of the talk today is "Finding the Right Voice Interactions for Your App." By way of introduction,
my name is Daniel Padgett. And I'm a conversation
design lead here at Google. I have the very great pleasure
of working on Google Assistant, with a focus on
voice interactions across the services
that we have. I work with an amazing
cross-functional team that's tackling really
difficult problems. Actually, I've been working
in the language technology space for about 15 years. And it's kind of
like everything's coming together in a way
that's really exciting for me personally. I've just been kind of
waiting for this stuff. This is that stuff
we thought about, like the sci-fi promise-- the stuff that "Star Trek" and
"Star Wars" and all that kind of stuff-- things I was really
excited about as a kid. And I know there's
a ton of work to do. And we're nowhere near where
those bots happen to be. But this is a really exciting
time to be in this space and working on Assistant. So a question I get a lot,
what is conversation design? And thank you for asking. For me, conversation design
is a carefully curated back and forth between human
and artificially intelligent machine. It generates a personalized and
context-sensitive experience around tasks and content. And it's content
that we experience in a very visceral way. It's like an immersive
experience filled with things that we can see, that we
can hear, that we can touch. And it's content that we can
interact with and control by speaking and scrolling and
swiping and tapping and even typing, as we announced
yesterday in the keynote. In that way, it's
a design practice that really puts
humans at the center. It's one of the things
I really love about it. It lets us be us. And in the end, when
people ask, what do you do, I teach robots to
talk to humans. That's my gig. But we're here today to talk
specifically about voice-only interactions-- voice-in, voice-out, no
screens, no tap targets, no navigation bars, just voice. Now I'm assuming that most
of you are new to this space, and you're just starting
down the path here. Show of hands, has anyone
worked with voice in the past? OK, a decent amount. Before we get too
far into specifics, I just want to say,
designing for voice is different-- different than
for mobile and for websites. Please, take that at face value. I'm making no claims
about complexity here. Designing can be a tough thing. It's just different. Voice is different. And that's what I'm hoping to
unpack a bit for you today. So here's a quick look at
what we're going to cover. I want to cover off
the value of voice-- again, voice only, no screens. We'll cover some
considerations, things that you should be
thinking about as you're creating voice experiences. And some opportunities-- really
some ideas around use cases. All right? Sounds good? Good. So let's start with
the value of voice. The question here
really is, why? Why voice? And for me, it boils
down to three things-- speed, simplicity, and ubiquity. All right. Now let's start with speed. Now voice interactions can
be extremely, extremely fast. Done well, they're even faster
than pulling your mobile phone out of your back pocket. It's like, getting to the app
you want on a mobile phone takes a tap or two,
maybe even a swipe. And when you get
to the app, you're another tap and maybe even
some typing away from a result. With voice, you get
to bundle all that. And it's like the
ultimate shortcut. That's what I think
is really cool. Then there's just the
simplicity of use. You have users who
already know what to do. This is language. It's conversation. And they've been
doing this forever-- well, since they were born. And there's really
nothing for them to learn. Don't get me wrong, I'm
the first to acknowledge the challenges around design
and technologies and things like that. But the promise here
is, say what you want, get what you want. And it doesn't really get
much simpler than that. And finally, there's just the
sheer number of entry points-- a number that continues to grow. You can already reach Google
Assistant and our Actions on Google partners on
millions of devices. And it feels like there are new
surfaces and new opportunities popping up all the time. All you need really is a
microphone and a speaker. Well, that and a network
connection to the Actions on Google platform,
but you get the point. The reality here is that
the numbers are significant. The potential here is massive. By the way, I don't
know if anybody saw this thing that was
launched a couple weeks ago. Literally, cardboard, a
microphone, a speaker, a Raspberry Pi, and you're
connected to Assistant. That's really cool stuff. Did anyone get one? No? Oh, one guy. Get one. I have one sitting
on my kitchen table. It's the next project
with my 11-year-old. I'm looking forward to it. Anyway, let's do a quick
thought experiment. Let's play, "how many
taps would it take?" All right. Simple equation--
what's 15 times 24? You don't have to take out
your phones or anything. But just think about it. If you took out
your mobile phone, how many taps would it take? You open your phone. You navigate to the app. You tap the app. You type the equation,
get your answer. I think I did it in eight
on my Pixel earlier today. Now how about this one-- play the latest album by
Gorillaz on Spotify, one of my favorites. I think on my device
I did it in 12 taps. I think there are a
couple swipes in there to go through my
pretty lengthy library. Last one-- any direct flights
to Denver next Sunday? Anyone? Right. I actually stopped counting. So long story short,
with voice, we can get our answers and actions
pretty much instantaneously. So why voice? Ultimate convenience. It's really, really powerful
stuff, when done the right way. Now let's talk through
some considerations. It seemed like a lot of
you didn't raise your hand, so you're new to the space. The first thing I
wanted to talk to you about is the nature
of the signal. I think it's a good
thing to acknowledge. Fact is, voice is linear. And it's always moving forward. It's ephemeral, too, which means
it's always fading behind you. If you can imagine, it's
like the critical elements of your mobile app disappearing
pretty much as soon as they appeared. The Back button, the Hamburger
menu just kind of fading away-- content appearing once and
only for a fleeting moment. It's kind of like you
got to catch up with it. Those anchors are critical to
visual and touch interfaces. They're persistent. They're available. So with voice, we
rely really heavily on the users' knowledge
of conversation and what they can recall. So some things to
think through, based on the nature of the signal-- keeping people
comfortable, helping them stay in that space and
that thing that they know-- the metaphor of conversation. Use everyday language
that users can relate to. It's a really important one. Ask questions that
are easy to answer. You don't want them
thinking too much. You're going to time out anyway
with the current technologies. And structure information in a
way that supports easy recall. Make it easier for
them to remember the stuff you're presenting.
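To make that concrete, here's a minimal TypeScript sketch of structuring information for easy recall: present a few options at a time and offer to continue, instead of reading a long list in one breath. The function and prompt wording are hypothetical, not part of any real SDK.

```typescript
// Hypothetical sketch: chunk options into small, memorable groups
// so users aren't asked to recall a long spoken list.
function buildOptionsPrompt(options: string[], chunkSize = 3): string {
  const chunk = options.slice(0, chunkSize);
  const spoken = chunk.join(", ");
  if (options.length > chunkSize) {
    // Offer a natural way to continue rather than overloading recall.
    return `I found ${options.length} matches. The first ${chunkSize} are ${spoken}. Want to hear more?`;
  }
  return `Your options are ${spoken}. Which one would you like?`;
}
```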
There are deeper strategies around this, and I'm not going to go deep here. I would actually encourage
you, tomorrow, there's a talk by a gentleman
named James Giangola, a friend of mine and colleague. He's giving a great
talk on what we call the cooperative principle. I'll show you some
details later. But that's going to
be some good stuff. All right, moving on. The other thing that
you want to consider is the capabilities
of the technology. Recognizing the
words people say is much different than
understanding what they mean when they say them. Now with respect to
recognition, Google already had a low word-error rate. But we continue to
improve, apparently. I saw in Sundar's
keynote yesterday that after applying deep
learning to the problem, we slashed our word-error
rate again significantly. And now we're down to
something like 4.9%. That's an absolutely
amazing result. So to some extent, recognition
is a, quote unquote, "solved problem." Yes, there's still
room to improve. But it's more or less
a solved problem. But language understanding is
a more difficult nut to crack. So just a couple examples
maybe to illustrate the point-- what's the weather
in Springfield? So which Springfield
are we talking about? Missouri, Massachusetts,
some other Springfield? I did a Google
search, and there's I think one in every state. I think it's the most popular
name for a city in the nation. Maybe the best
solution to answering this question for
the person who posed it is to apply some context. Like, we know that the user
lives in Springfield, Missouri. Therefore, we provide
them with the weather in Springfield, Missouri. But maybe that's
not the right thing. Maybe it's weird that they're
asking the question, what's the weather in Springfield,
when they live in Springfield. Wouldn't they ask,
what's the weather here, what's the weather like
today-- that sort of thing? I don't know. It's possible they mean one
of the other Springfields. How about another one? Play "Yesterday." Is that the song? Is that the movie? There is a movie. The playlist, an audio book? Maybe it's a game. Who knows? Now again, applying
context and the things a company like Google
would know about users, statistically they're
probably looking for the song. Right? But that opens up
other complexities. While I have no idea why anyone
would want to hear anything other than the original
version by the Beatles-- that's my personal preference-- perhaps they want to hear
the dulcet tones of Boyz II Men singing their R&B
rendition, the cover that they did in the early
'90s or some other version. I really don't know. Anyway, it's something
to consider as you deploy your voice applications. I have no doubt that you're
going to be running into this. Some strategies here--
acknowledge ambiguity. When you don't know
the answer, you can't apply the
appropriate context. And let users clarify. Don't be afraid to engage. A lot of people, I hear them
say, well, it's an extra step. Believe me, taking
that extra step is way better than
the correction they're going to have to do
if you get it wrong. So just keep that in mind. And remember, there are choices. And leverage them
next time, if you can. This is the learning part. Keep a record, and
apply that the next time that they come into your app. It's going to be super useful
for streamlining interactions in the future.
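As a sketch of those strategies together, here's what the Springfield case might look like in TypeScript: use a remembered choice when one exists, otherwise acknowledge the ambiguity and let the user clarify, then record the answer for next time. The types and function names are invented for illustration.

```typescript
// Hypothetical disambiguation sketch: remembered context first, then clarify, then record.
interface UserContext {
  lastChoices: Map<string, string>; // e.g. "Springfield" -> "Springfield, Missouri"
}

function resolveCity(query: string, matches: string[], ctx: UserContext): string | null {
  if (matches.length === 1) return matches[0];
  // Leverage the choice the user made last time, if we kept a record of it.
  const remembered = ctx.lastChoices.get(query);
  if (remembered !== undefined && matches.includes(remembered)) return remembered;
  // Acknowledge ambiguity: the caller should ask, e.g. "Which Springfield: Missouri or Massachusetts?"
  return null;
}

function recordChoice(query: string, choice: string, ctx: UserContext): void {
  // Keep a record and apply it next time to streamline the interaction.
  ctx.lastChoices.set(query, choice);
}
```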
Last, but certainly not least, we have our users. Always, always, always,
we need to consider our users and their context. We've always kind of talked
about voice-only interactions being the hero for people
whose hands are busy, whose eyes are busy,
who are doing some sort of multitasking, typically
in private or at least in a familiar space-- something they're sharing with
family or something like that. It's voice interactions while
they're driving in their car, cooking in the kitchen-- that sort of thing. And that's absolutely
still the case. But there's definitely
movement outside of that-- people leveraging
voice while they're on the move in public
spaces, for instance, like on their bike. I see kids a lot using
this stuff on the street. But what I really
want to cover here is something that we
touched on earlier-- the fact that users
are instant experts. There is nothing to teach, or
at least there shouldn't be. Again, this is language. It's conversation. It's something that
they've been doing forever. And because of that, they
have high expectations and very, very, very
low tolerance for error. This stuff should
just work, right? Now let's think about that for
a second-- the difference here between touch and text input
for a visual UI and voice input for a voice UI. With a touch UI-- with visual, tappable UIs-- reasons for errors
tend to be transparent. You know when you had
a typo or when you've hit the wrong tap target. You know when you need
to course correct. And you're in control. You know what to do,
you take the step, and you make the correction. On the voice side,
though, errors are typically the system's
fault, not the user's. That is, if the user
is truly cooperating, if they're not
speaking nonsense. It really then becomes
the system's fault. They have to rely heavily,
when there's an error, on the system to get
them back on track. That said, some
strategies here-- make sure you're really, really,
really spending time developing a strategy for exceptions. I just can't stress this enough. This is the thing that
is going to make or break your application. You want to make it really easy
for them to get back on track. And you want to
leverage techniques that we use in
everyday conversation. Maybe I could try to
illustrate this one. So you're collecting
a date from somebody. You ask them, what date
do you want to travel? And somebody says, April 14th. But let's say you need the year as well. For some
reason, they may be traveling a year from now. If you force them to
give you the year, that's going to
create this loop. You're going to
throw them an error. You're going to say,
sorry, I didn't get that. You're going to ask
them to repeat the date. And you've frustrated the user. So instead, your
strategy here should be, accept the
April 14th and just move on and say, what year? Just use the type of things that
we do in everyday conversation. Or if you have some sort
of recognition error, an easy strategy is simply, if
they say a date incorrectly, sorry, what's the date? And just ask them to repeat it in a very natural way.
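Here's a rough TypeScript sketch of that slot-filling idea for the date example: keep whatever the user already gave you and ask only for the missing piece, rather than looping on the full date. The types and prompts are invented for illustration.

```typescript
// Hypothetical slot-filling sketch: accept the partial date, ask only for what's missing.
interface TravelDate {
  month?: string;
  day?: number;
  year?: number;
}

function nextPrompt(date: TravelDate): string | null {
  if (date.month === undefined || date.day === undefined) {
    // Recognition error or nothing captured yet: reprompt in a natural way.
    return "Sorry, what's the date?";
  }
  if (date.year === undefined) {
    // Don't re-collect the whole date; just ask for the missing piece.
    return "What year?";
  }
  return null; // all slots filled; move on with the conversation
}

// The user said "April 14th": we keep it and ask only for the year.
console.log(nextPrompt({ month: "April", day: 14 })); // "What year?"
```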
Now lastly, we're going to touch on some opportunities, some use cases. From a user perspective,
what they're asking here really, or what
they're telling us, I should say, in the space
is, answer my questions, help me get things
done, keep me on track, and give me something to do. From a developer
perspective, I think that means supply quick answers,
frictionless transactions, and give relevant suggestions. So some use cases, just
taken from some verticals that are out there. Like if you're in financial, some things that you might want
some things that you might want to consider-- what's my current
balance, how much do I owe, what was my last transaction-- some things that should
be quick and fast. I need to make a payment,
transfer $100 to Joe's account, or when does my policy expire? Maybe some of you are in retail. Is my order on the way, what's
your return policy, do you do same-day delivery? Maybe it's a stock check. Do you carry milk, are
headphones on sale, I need to reorder cat food? Health care-- actually a
space that I came from-- is my prescription
ready, can I get a flu shot, do you have
any appointments available? Maybe there's a clinic. I need to schedule
a follow-up, I need to leave a
note for my doctor. Refill my prescription
is a really fast one. Wonder if there's anybody
from Walgreens here. Anyone? Hey, how's it going? And lastly, just
another vertical that I thought through-- fun and wellness. What's my horoscope,
got any good dad jokes? Actually, I think
every morning, my son asks the Google Home that
we have in the kitchen to tell him a joke. And then he runs into the bathroom
where my wife is getting ready and tries to pass it
off like it's his. Some other opportunities around
reading stories or playing a game or meditation.
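On the developer side, a minimal sketch of routing quick-answer intents like these, with an exception strategy that gets users back on track, might look like this in TypeScript. The intent names, handlers, and canned responses are all hypothetical.

```typescript
// Hypothetical intent-routing sketch for quick-answer use cases.
type Handler = () => string;

const handlers: Record<string, Handler> = {
  // Financial: should be quick and fast.
  "check.balance": () => "Your current balance is $1,250.",
  // Retail: order status in one turn.
  "order.status": () => "Your order shipped this morning.",
  // Health care: a really fast one.
  "prescription.refill": () => "OK, I've requested your refill.",
};

function handleIntent(intent: string): string {
  const handler = handlers[intent];
  // Exception strategy: acknowledge the miss and suggest what the user can say.
  return handler
    ? handler()
    : "Sorry, I can't help with that yet. You can ask about your balance, an order, or a refill.";
}

console.log(handleIntent("check.balance")); // "Your current balance is $1,250."
```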
In the end, I think the thing that you really need to make sure is that it's
easier than the alternative. If it's not more convenient for
your users to do this in voice, it really probably
isn't worth pursuing. And so I can't
stress this enough. Make sure it's easier than
the alternative. Now I mentioned
some related talks. There is one that was
done earlier today on multi-modal design. And that was done
by my colleagues Jared Strawderman
and Adriana Olmos. "One Size Doesn't Fit All-- Defining Multi-Modal
Interactions," and I would encourage you
to go watch that online. There is one tonight at
6:30 by Nandini Stocker called "In Conversation
There Are No Errors." and that gets at the
strategies that we were discussing a little earlier. And then finally, the one
that I mentioned by James, "Applying Built-in
Hacks of Conversation to Your Voice UI" is really
all about that cooperative principle and making sure
that you are leveraging the rules of discourse
and conversation to make your users very
comfortable in your speech application and your design. I feel like there
was a massive chunk of that that has disappeared
from my presentation somehow. And I am at the
end, unfortunately. Here's a URL for you to
go to and to check out some stuff by the
Actions on Google Design team and Developer team. Make sure you take that
down and go check it out. [MUSIC PLAYING]