Attribution Modelling with Google BigQuery and R

Video Statistics and Information

Captions
Hello, I'm Scott. Thank you to Grid Dynamics and Microsoft for hosting us. I'm here to change gears a bit toward the cloud and more data-engineering-type material — tonight I'll be talking about attribution modeling: how we got set up with it, how we deploy it, and some interesting insights and observations along the way. A quick agenda: I'll give you a brief introduction to what attribution modeling is (a show of hands — how many people have heard of attribution modeling or roughly know what it is? Okay, good, lots of opportunity for learning tonight). Then we'll walk through the end-to-end modeling approach in broad brushstrokes: how we handle big data in the cloud — in this case Google BigQuery — how we do a little bit of work in R, and how we take that data and surface it in a visualization layer like Tableau or Power BI.

A little bit about me first. My background comes from the world of academia — physics and astrophysics. I did my undergrad up at Western, a little bit north of here, and my graduate studies abroad in England in astrophysics, which led me into data science, where all the astrophysicists are going these days. I worked for a bit right across the street for the Visual Studio Team Services team at Microsoft, and most recently at Tableau. I've also done some fun hobby things like writing an introductory book on machine learning with R, which you can find on Amazon or the O'Reilly website. This fantastic photo of me, courtesy of Ted Wolfe Photography, is from the Redmond Derby Days criterium bike race downtown — it's a great experience; if you can make it, be sure to check it out, and they have lots of other things to do beyond the bike racing. All of tonight's slides are available on my blog, so if you don't want to take photos on your phone and squint at them later, feel free to go to the site — they're already posted.

So, attribution learning time. The goal of attribution modeling, in a nutshell, is to make sure we're spreading our marketing budget effectively. What does that mean? In a lot of cases you have users performing some sort of interaction and converting in your customer funnel — reaching the top layer of lead status, or some other conversion activity. Sometimes people look at what's called a last-touch scenario: whatever the last thing someone interacted with gets all the credit. This is very easy to implement, and it makes sense to some degree — someone came to our download page, downloaded a free trial of our software, and converted to a lead that way, so let's give all the credit to the trial download page. But at the same time we're losing a lot of fidelity about the steps before that. Maybe they saw a really cool YouTube video, or a really good white paper, something that piqued their interest and drove engagement far more than the last interaction they had. That's the overall motivation for attribution: making sure everything gets as much credit as it deserves. Here is an example of the difference between last-touch and multi-touch attribution.
Say someone comes to our company's website through some marketing channel — an email, for instance — and clicks around a bit, going from one page to another: a first touch, then a second touch. They leave, then come back in a new session through a different marketing channel — maybe this time it's direct traffic, because they already know about the site and the product. Then they convert. Last touch says: they converted on the trial download page, so let's give a hundred percent of the credit to those guys. The trial download people are ecstatic — great, we got a hundred percent of the budget — when in reality the actual user journey is weighted much more toward the channels they had been interacting with all along. Attribution modeling helps make sure everyone gets the credit they deserve, and what we'll see tonight are various methods for building the foundation for that.

The modeling process overview: we start by developing user journeys — seeing the whole landscape of what people can possibly interact with. We do this with Google Analytics data, hosted in Google BigQuery, to get the lay of the land. Once we have those user journeys, we can tie conversion rates to them and see which journeys are better at converting users than others. With that path data in hand, we can feed it into a more meaningful statistical model — in this case a Markov chain model, which we'll dive into a little — to make attribution a more robust statistical process instead of just "the last thing they interacted with". For that we use R. Once we have the model output, we push it back up into the cloud (you could also keep it local, but in our case we push it back up) and then pull it down into a visualization layer.

Google BigQuery: the cloud is good for some things and not so good for others, but BigQuery worked for our implementation because it does two things well. One is handling huge amounts of data — huge in relative terms; for Tableau, huge means about 250,000 rows a day, and if we want to look at a whole year's worth of data for one kind of model, and then year over year, that gets really big very fast. The other advantage is that we can take all the raw, fine-grained Google Analytics data we have and dump it natively right into BigQuery. GBQ has some extra bells and whistles that we'll see in a bit, but one thing you can see from the standard view — go to console.cloud.google.com/bigquery for the new UI — is that you can play around with some example datasets. In this case I've been playing with one from the National Oceanic and Atmospheric Administration (NOAA), which has a bunch of tables of weather data, one table per year, going on seemingly forever. We'll see a way to handle data like that, where there are just tons and tons of tables.

So the first step in building out our user journeys has to deal with this huge number of tables.
Each of these tables may have zillions of rows, but maybe we're only interested in a select subset of the tables. BigQuery has some fun techniques for this. One is called table wildcarding: if you have tables that share basically the same schema but end in a date-formatted string, you can query a single day just by using the suffix for, say, March 6th. But say you actually want all the data for March. Instead of putting a date range in your SELECT statement — which has a little bit of nuance around how pricey it can get — you can ask for everything matching 201903*, and what the star wildcard does is append all the tables with that prefix to each other, like one big UNION. You can take that one step further and wildcard all of 2019 — but be warned, if you do that across all of your data it can get really, really big, so make sure you have a good understanding of how much data you're dealing with first. Likewise, if you want ranges of dates but want to be more specific against the table schema itself, you can put something like _TABLE_SUFFIX BETWEEN one date AND another date in your WHERE clause. What's fun is that you can set dates in the future for which you don't have data yet; when you run those queries and the data eventually arrives, it just auto-appends without giving you any error. Make sure you're using that appropriately, but it's a way to give yourself an unbounded future window while still keeping the query time-bound.
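As an illustration, here is a minimal sketch of the kind of wildcard query described above, run from R with the bigrquery package. The project name and dataset name are hypothetical placeholders; the ga_sessions_YYYYMMDD naming follows the standard Google Analytics daily export.

    # Sketch only: "my-project" / "my_ga_dataset" are hypothetical names.
    library(bigrquery)

    sql <- "
      SELECT date, SUM(totals.visits) AS sessions
      FROM `my-project.my_ga_dataset.ga_sessions_2019*`  -- wildcard unions all 2019 daily tables
      WHERE _TABLE_SUFFIX BETWEEN '0301' AND '0331'       -- but only the March tables get scanned
      GROUP BY date
      ORDER BY date
    "

    job <- bq_project_query("my-project", sql)  # the query itself runs inside BigQuery
    march_sessions <- bq_table_download(job)    # pull the small result set down into R

Note that _TABLE_SUFFIX matches only the part of the table name replaced by the star, so with the prefix ga_sessions_2019 the suffix for March 1st is just '0301'.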
Okay, this is still laying the foundation, but what we've seen so far is that Google Analytics data in BigQuery can be big: there can be a lot of tables, and each of those tables can have a lot of data in it. With Google Analytics data specifically — how many people here are familiar with web data or Google Analytics data? Okay, a good number of hands. Most people are familiar with sessions: you come to a site, you click around a bit, maybe you leave, and that defines a session. The things you do within that session are what we call hits. If we want to establish a user journey at the hit level rather than the session level, we have to do what's called unnesting the data. There is actually a Google Analytics sample dataset in the BigQuery sandbox that you can play around with — I'd been trying to get a screenshot of it for demo purposes, but it's been down, of course; once it's back up you'll be able to take a lot of data from there and experiment in the sandbox. In BigQuery you'll see session data — one of those date tables per day — and for each session a bunch of hits. If you're trying to build a user journey, you want to work at the hit level instead of the session level, for various reasons, and the crux of building the lowest-level user journey is the hard-to-see line on this slide that says UNNEST(hits), so that each row is an individual hit. Google Analytics data contains everything the user is doing — scrolling on the page, clicking on things, how long they look at things — but in this case we're really only interested in what pages they're looking at, because we want to see where they go: page A to page B, maybe back to page A, then page C. You could extend this to what users are interacting with and clicking on, but that gets even bigger — this kind of project spans definite ranges of big data.

To describe the data we have at this point: it has been unnested, so we have a session timestamp, all the individual hit timestamps related to that session, someone's unique identifier, and the sequential page journey they took before converting to a lead. That's what a user's journey looks like from a row-level perspective, but to do something useful with it for attribution modeling we need to pivot it out: here's this user and their journey, here's another user and their journey, and then look at all those individual journeys themselves. There's a really useful function in BigQuery that does that pivoting for us, called STRING_AGG. We can go over an individual user, aggregate all of the pages they visit sequentially — ordered by hit timestamp — and pivot that out into their unique journey. Then, for each unique sequential journey, we can group the journeys together and compute statistics about them: how many touches were in the journey path (these are all the unique journeys possible in this contrived example), how many of the channels within the journey were unique (in this example every channel happens to be unique, but that's another interesting thing to look at), the total number of conversions for each journey path, and the total unique visitors. This is the dataset we want to be working with for pushing into more statistical models in R. It's all obfuscated data here, of course — and if you're doing work where you're looking at impressions and how users see ads within their journey, you have to pump the brakes a little: you're free to do that, but depending on the different types of ads you include, it can balloon your data up quite substantially. There are a lot of pitfalls and high-wire balancing acts in doing this, but generally speaking this is the kind of dataset we want to work with for modeling in R.
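As a rough sketch of that unnest-and-pivot step — again with hypothetical project and dataset names, and using the page path as the journey step — the query below flattens sessions to one row per pageview hit and then uses STRING_AGG to roll each visitor's pageviews up into a single ordered journey string:

    # Sketch only: field names follow the standard GA BigQuery export schema.
    library(bigrquery)

    sql <- "
      WITH pageviews AS (
        SELECT
          fullVisitorId,
          visitStartTime,
          hit.hitNumber AS hit_number,
          hit.page.pagePath AS page_path
        FROM `my-project.my_ga_dataset.ga_sessions_2019*`,
             UNNEST(hits) AS hit              -- one row per hit instead of one per session
        WHERE hit.type = 'PAGE'               -- keep pageviews; drop events, scrolls, etc.
      )
      SELECT
        fullVisitorId,
        STRING_AGG(page_path, '>' ORDER BY visitStartTime, hit_number) AS journey
      FROM pageviews
      GROUP BY fullVisitorId
    "

    journeys <- bq_table_download(bq_project_query("my-project", sql))

Grouping those journey strings and counting conversions and unique visitors per path then gives the path-level table described above.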
When we do attribution modeling, there are a number of rule-based or heuristic approaches, a lot of which are easily programmable in SQL logic: last touch, which we've seen; first touch, which assigns all the credit to the first thing in the user's journey; first-and-last, which assigns half the credit to each; linear, which spreads it across all the touch points; time decay, which puts more weight on events closer to the conversion; position-based; or something more custom. But we were more interested in statistical modeling approaches and in seeing how they differed from the other techniques we were using. All of these models exist in the Google Analytics UI, which is great to use — I'm a big fan of it. However, if senior leadership asks, "That's great, you're using a time decay model — how does that work, exactly?"... well, a number of us have probably encountered Google documentation at some point or another: it can be very good, and in some cases it can be quite sparse, and for the statistical models it's quite sparse. That's what drove us to say: we can use the UI approach in GA to do our attribution modeling, but if we really want to understand how it works down at the code level, we can do the model shaping we've been looking at over the past couple of slides, bring in R or other statistical programs and packages, and do it ourselves. That way we have code ownership — we know exactly how all the code works at the lowest level — and we can tweak the approach as appropriate. Here's a view of what the Google Analytics UI looks like — it's a little washed out, but on the far left there's a dropdown for Conversions; click there and you'll find Attribution at the very bottom, and under that the Model Comparison Tool, which is where you land on all these different attribution models. This is another thing you can play with in the Google Analytics sandbox, so if you're just dying to jump in and start playing with knobs and dials, by all means — but if you're trying to do analytics with the more statistical models, you may run into issues of explainability.

So that brings us to this: we have our data, we have unique channel paths, and we have conversion rates tied to them. We then do two things. We leverage a package called bigrquery, which handles input and output between R and Google BigQuery, so we can take the data we've already prepared in GBQ and pull it down into something like a VM on Google Cloud running an instance of R. Our second package is ChannelAttribution: a single line of code — markov_model, plus some tuning parameters we can throw in — gives us an output (which we'll see in a second) of the number of conversions expected for the various marketing channels: email, direct, your paid channels, that kind of thing. Once we have that table, we push it back up into Google BigQuery (we could save it locally too, but we'll see the value of pushing it back into the data warehouse stack). This pipeline is really useful because it gives us a lot of flexibility in tuning the data — maybe we want to look over specific date ranges, account for seasonality, or use look-back windows — as well as flexibility in model tuning: with Markov chain modeling, maybe we want to look at a chain length of two, three, or four and see how those vary. It's highly experimental.
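Here is a minimal sketch of that R-side step, assuming the journey paths and conversion counts have already been aggregated in BigQuery into a hypothetical table of channel-level paths (one row per unique path, channels separated by '>', built in the same spirit as the pageview journeys above but from the session's channel grouping):

    # Sketch only: table and column names are placeholders.
    library(bigrquery)
    library(ChannelAttribution)

    paths <- bq_table_download(bq_project_query(
      "my-project",
      "SELECT path, conversions FROM `my-project.attribution.channel_paths`"
    ))

    # Rule-based baselines: first touch, last touch, linear.
    heur <- heuristic_models(paths, var_path = "path", var_conv = "conversions", sep = ">")

    # Markov chain model; `order` sets the chain length (2, 3, 4, ...).
    mark <- markov_model(paths, var_path = "path", var_conv = "conversions",
                         order = 2, sep = ">")

    results <- merge(heur, mark, by = "channel_name")

    # Push the per-channel output back up into BigQuery for the visualization layer.
    bq_table_upload(bq_table("my-project", "attribution", "channel_results"),
                    values = results,
                    create_disposition = "CREATE_IF_NEEDED",
                    write_disposition  = "WRITE_TRUNCATE")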
The main output we get from the ChannelAttribution package in R is that, for each of our marketing channels, we have the various heuristic models — last touch, linear, those very rule-based types — and we can join that data with the Markov data. This is all example data, but the Markov numbers might be something like 100.3 — more decimal-oriented as opposed to whole numbers. This top-level output is very useful because it shows the differences between pure last touch and the other statistical models — you could plug Shapley or other approaches in here as well — and you can see how the distribution changes per model selection. One thing that's very valuable is taking some of these models, looking at their fractional credit per hit, and putting that back onto the hit-level data in the data warehouse. What that means: from our output, say our first three channels have expected lead values of 100, 200, and 300, respectively. In this example the user has four hits to the site — four touches, their whole journey before converting — and the channels they interact with are channel 1, then channel 2, then back to channel 1, then channel 3. We take the expected lead conversion value for each of those channels, fill them in as appropriate, and divide by the sum total to get the fraction. What this gives us, for this example journey of four touches, is the first touch, the last touch, and the associated credit we should shift appropriately: the first touch gets all the first-touch credit, the last touch gets all the last-touch credit, linear (as expected) gets 25% for each of the four hits, and the Markov model gives us a decimal breakout for each touch. Then we can take that data and push it down into a visualization layer like Tableau and ask: what does today's attribution look like compared with a month-over-month or year-over-year view, and how do all the models compare? That gives us a lot of flexibility and power to show how the data is working and evolving as fast as we can get it.
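As a tiny worked example of that fractional-credit step, using the contrived numbers above — expected lead values of 100, 200, and 300 for channels 1 through 3, and a four-touch journey of channel 1 > channel 2 > channel 1 > channel 3:

    # Sketch only, reproducing the example numbers from the talk.
    expected <- c(channel_1 = 100, channel_2 = 200, channel_3 = 300)
    journey  <- c("channel_1", "channel_2", "channel_1", "channel_3")  # four touches

    raw_credit  <- expected[journey]               # 100, 200, 100, 300
    markov_frac <- raw_credit / sum(raw_credit)    # 0.14, 0.29, 0.14, 0.43

    data.frame(
      hit         = seq_along(journey),
      channel     = journey,
      first_touch = c(1, 0, 0, 0),                 # all credit to the first hit
      last_touch  = c(0, 0, 0, 1),                 # all credit to the last hit
      linear      = rep(0.25, 4),                  # 25% to each of the four hits
      markov      = round(markov_frac, 2)
    )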
To sum up some high-level findings: our multi-touch approach, with a Markov model specifically, showed pretty good results for journeys with two or more touches. We focus on two-plus touches for two reasons. First, if you're looking at single, solitary touch points, that falls back into first touch and last touch — they're basically the same thing. Second, you have to worry about things like cookie abandonment: maybe someone comes to the site on their phone and then returns two days later on their desktop computer, and you have a break in the linkage of their journey. Looking at two-plus journeys gives you a more substantial dataset to play with, and you don't really lose much by doing it. What we saw was that credit was shifted away from our direct channels and into our paid channels, which is something we were looking for: if you have traffic that just comes directly to the site and then converts, that can mask the paid channels you're pouring a lot of money into, and this shows what kind of shift you can expect from those direct channels to those paid channels. I've talked a little about impression overload: impressions from one major source are probably fine, and impressions people see from Facebook or from LinkedIn are generally fine to add, but if you start adding more and more different types of impressions, it fractures out the number of channels you have and shifts the channel credit accordingly — you run the risk of having more and more credit shifted away from various channels, which may obscure the picture you're looking for. Another big thing with Google data is to make sure you're working with viewable and measurable impressions — there's actually some quite good documentation on that one — because the data can change wildly depending on which field you're working with. And lastly, with Markov chain approaches, we saw some decent improvements with a chain length of two versus one, but we got diminishing returns after three, and you run into resource issues after that as well.

This is my last slide — just links to the things I've been talking about. If you want to go play around with the GA UI, here's a link to do that. If you want to get started with Google BigQuery and start clicking around, the example datasets are quite fun — they have a lot of cryptocurrency datasets in there, which I think are very interesting (I don't know anything about them, but they're fun to play with) — and there's the link for that one. If you want more info on how to use the R package ChannelAttribution specifically, there's a link there, and of course you can find all of this on my blog. I hope I haven't been talking too fast — happy to entertain questions. Thank you.
[Applause]
Info
Channel: Scott Burger
Views: 1,682
Rating: 5 out of 5
Keywords: bigquery, attribution, r programming, machine learning
Id: E0ObToWagzk
Length: 27min 3sec (1623 seconds)
Published: Thu Oct 10 2019