Basic Data Engineering Project - End-To-End From Web Scraping to Tableau

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey there guys welcome back to another video with me Ben Rogan AKA dos y'all data guy today we're going to focus on building a basic data engineering project and we're going to do it from start to finish because I have a bad habit of not finishing projects so that's why this video is so long it's broken down into several sections including um how we get the data looking over the data um kind of parsing and working with the data and then using said data and we're going to consider this basic because I realized that all the other projects that generally exist and the ones that I've kind of started put together have so many components involved everything from airflow to SQL to python to you know clouds services to dashboard tools so it's just too much so let's really break this down and make this simple okay what we're going to do is we're gonna take a few very basic things we're going to take some raw data that we're going to scrape using something called bright data I'll get into that in a moment we're then going to take that data and just load it from S3 directly into snowflake no airflow no jobs we're keeping this simple we're just going to create some tables be a SQL and we'll show you how to do that in the creating table section and then from there load those tables into Tableau and create a basic dashboard something super simple so that way if you're not familiar necessarily with airflow and all these other components in the cloud you don't have to worry as much I'm going to use S3 so that is the one Cloud component but we're not going to try to integrate too much in here so I'm really hoping this is friendly for everyone and if you're more advanced you can feel free to take this project and then add in airflow or something in between to make sure you're automating your work but other than that if you're just getting started this should be pretty straightforward so first we're going to start with actually kind of scraping the data and so we're going to go into uh past Ben and see what he was doing back then to pull that data Ben So today we're going to scrape a lot of information from G2 on a lot of these data engineering and analytics engineering tools since I have done my fair share of building web scrapers I wanted to find a better solution and that's to say a long time ago I decided to build a bot myself I basically scraped and like liked tweets and followed tweets and follow people like on Twitter and things of that nature and I built it all myself uh one it uh gave me a really weird ratio here on Twitter so if you look at my Twitter it's why I have a weird ratio and I've been trying to slowly unfollow things manually taking forever um but two you learn very quickly that these sites are kind of smart and they know how to make sure they cause problems for people that scrape even if they allow you to scrape information um they do things like change uh IDs in terms of like this underlying HTML so you can't always know where information is they also do things like slow you down and all of these various things so even though scraping is hard enough as it is because you've got to parse all this you know HTML or Json or whatever information exists there is the struggle of just having the infrastructure so all the proxies and all that to set up to do it really easily well for today we're gonna be using bright data which if you haven't heard of bright Data before bright data is a web scraping platform that provides everything from proxy networks so you can set up web scrapers so they don't get detected they also provide the ability to like automate a lot of these web scrapers yourself coding in multiple different languages to actually scrape all these various websites sites so you don't even have to you know create a system that manages all that as well as offering ready-to-use data sets I almost actually used it for a client who needed to get information um on kind of where people are based on the companies that they live on by scraping LinkedIn so for this case again bright data has already put together a templated scraper of G2 so this is great that's what I wanted to use so we're going to use their scraper that they've already created and then after that we're going to take all of that information we scrape put in Snowflake parse it out and build some dashboards we can answer questions like which of these products are most popular who are their biggest competitors all in a pre-dashboard so let's go into it shall we so diving into bright data again we're going to use this as our source of data for the larger project basically what we're going to do is if you go through their various Services the one I'm going to use is their collectors so here you can essentially set up specific web scrapers to pull data I've already pulled like set up the one that I'm going to use but we're going to at least walk through setting up a basic one again but you can kind of just set this up and then once you get that you can start it running and push it to like an S3 bucket or somewhere similar as you can see they also offer several other options everything from a proxy manager to API Integrations but for here we're going to keep it real simple and we're just going to develop a self-managed collector so they already have a few that they've built and so I'm going to look at the G2 one but you can see they've got a few other options I was actually poking at the Zillow one at some point because I am looking at hopefully buying a house here in Colorado so I was honestly just curious what I could pull so I'm gonna hit use template and it's going to build me a template all right so this is the Baseline template so what you see here so there's this G1 uh or sorry G2 data collector and you'll see there's two different kind of sections of code so the first part is the interaction code so that's essentially the code that we'll be interacting with the website and you can change this as you'd like I'm not going to change anything but if you want to make it do other things and you're familiar with JavaScript you can do that so that's where you could write it test it out and then below that is partial code so all this is whatever is parsing the various kind of paths in the CSS to kind of clean up and pick different sets of data and we're going to kind of look at the data that pulls here in a second what's really nice and in some of the other collectors is you'll notice that there's this plus you can actually set up multiple steps so once you do this you can actually have it set up another step so I'm not going to make it super fancy I'm just going to use their template which is going to force me to kind of just mainly put in inputs and so what I mean by that is if I scroll down here the inputs for products are essentially like you have to provide it right like I don't it's not going to go and look at G2 site and you know pick like the top top 10 data products in a certain category obviously these are not data products this is just their templated version but I'm gonna have to add them in and you know that's very manual and that's not what you want to do you know if you're building a web scraper you want it to be automated you don't want it to sit here and manually add in all of these various hard-coded options so in the future what you'd want to do is maybe have this G2 collector instead of going uh to the specific review page it'd probably go to a category of pages on G2 for example you might have specific categories uh like this one that you're seeing here I don't know why canva is on here for data visualization but you can kind of see the different options here and then you essentially have the scraper click on this and then you get to the review page so that's the optimal situation that you want to set up I'm going to keep it simple but that's again optimal situation go to like these Pages like the data visualization page um I think they've got data storage too I'd look up like a data warehouse you see they've got a category for data warehouse and it's the same thing where you're gonna have it essentially go to this page click per thing go to this review and then go from there but for me I'm gonna keep this real simple so this is the collector it already works like I can just click finish editing here and it's going to go for a run um what you're going to see is below it's actually kind of running both a log and some various consoles so you can actually see it run line by line so you can see it running these various lines here and they correlate with lines up here so country us Etc and so if you're looking to debug your code you can kind of tell where things go wrong which is nice because it is um which is nice because whenever you're using something like an abstraction layer in this case being bright data the more you can kind of have access to the logs especially if you're a technical the better it is you can deal with whatever issues you run into faster so it's gonna run and I'll probably speed through this part so as you can see it's running on my preview version I'm basically using this preview to test how it's essentially running through the code now it's going to go through here and parse through this information you can see it's hit parse so that's gonna go through the partial code yep you can see it's even going towards specific links like um competitors Alternatives again this is what I mean when you can set up like multiple steps um in this case for the parser they've added multiple steps or you can actually look up the Alternatives and competitors for a specific product so that's essentially the preview it's it'll run I'm going to run the main version that I've actually put together here in a second um because this is just an example so you can see it's kind of gone through the pages from there scrape specific information and from there you can either decide to go to a CSV or S3 so let's go back to my initial one that I've built and I'm actually running it right now so basically I've already built this um you can actually see so if I go to here um you'll see what I'm already scraping for so five fan air byte pre-cog Informatica palantir nexla ascend.io service analytics just a ton of different options here so we can just get some information about that so that's basically the input that I'll be getting so let's go back you can see it's almost finished uh from there you can actually decide where it goes so delivery preferences I haven't integrated anything just yet but you can obviously do a few different things you can one pick Json CSV Etc you can also pick things like batch versus Real Time how you're going to download it is it gonna be an S3 thing is it a web hook um where is it actually going Google Cloud Storage Microsoft Azure storage SFTP so this is great because most data pipelines if you set them up it's usually like to an S3 bucket and then from that S3 bucket you've got a trigger and from that trigger you know then you load it into something like snowflake or data rich or something like that so this is kind of just fitting into that naturally so you don't have to think about setting up the scraper code you can have this build most of your scraper code and then from there just build the actual valuable code which is the Analytics um so for now I guess if I can just set up email since I'm gonna just be using it myself but for now maybe I should set it up to S3 which then you need to give it some information here you can also see that they've already set up a decent amount of schema here um gotta scroll through it while I'm waiting um you can see there's a lot of arrays in here so I'm gonna have to do a lot of parsing in order to get all of this information out and you can obviously clearly turn it off and on as you see fit for the specific items so that's kind of also very handy so before we dive into more of the data warehouse side let's kind of dive through this uh data just to kind of get an understanding of what it is so for now they've kind of just ordered it the way that the data came in um I think one thing I would add in the future is probably just the actual label for what the company is I have the order so I know but that's just something I keep forgetting about so what you'll see is they've got everything from kind of industries that are represented categories essentially where you could find it on um this is essentially five grand to be clear uh it's competitors so things like talent and how many like it kind of compares to Star wise and review wise I'm so we can look at Talent uh let's see who else is here so what you can kind of see here is um this is the data that's been parsed um you'll see that down here is the input um so this is for Informatica it gives you kind of the biggest competitor in this case which is Talent if we keep scrolling down I think we'll get a little bit of a better setup here so let's see what's this one this one has more competitors which is great let's see where the input is just kind of scrolling down okay kibula so uh this one's kibula it's got a much nicer kind of setup so let me go back to the top so we kind of parse through it so what you'll see here is it's rating it's reviews so it can eventually when we do is set up a comparison on Tableau where it's like you know looking at its other top 10 competitors where does it fit essentially in terms of like average number of reviews and then the actual number of stars as well as like how is it kind of just fitting in to these various uh Industries also if there is pricing maybe even just having like is there pricing for the solution because so many of these things don't really set up pricing you always have to call them so you can see you can compare it to zapier you can compare it to mulesoft compared to Super metrics which I'm confused why it's comparing to that integrally apple pie I've never used apple pie Clio integration clown you know didn't kind of compare it all to this and there's other information oh it's legal yeah that's an interesting competitor I don't really think it did the same but yes five Trend makes a lot of sense to sell them a degree um so yeah that's that's really what we're going to show we're gonna show kind of the comparison compared to everyone else is competing with I mean as well as just giving a general feel for it so that's what we're gonna be doing on the Tableau side when we get there um for now we're gonna take this data pull it out of S3 into snowflake from Snowflake into a dashboard so before going too much further in this I kind of already finished this but let's talk about what we're gonna do so we're going to build a few different tables here that we're going to use for the dashboard um the first table is kind of a header or yeah kind of a header table it's basically going to have a company name the reviews star rating and then the category list we're going to create another table that's basically going to be the competitor rating so this is going to basically give us the company name so again company is kind of the the overarching theme the competitor so whoever its competitor is the number of views and rating for that competitor so that way we can in the future you know look at Informatica and be like hey what's you know how is it doing compared to its competitor and then finally we're going to do a category comparison so for this one what we've done is we've taken all of the different categories or all of the different competitors and vendors in one category figured out the average a star rating specifically by taking the total number of reviews multiplying it by the average number of stars per you know specific vendor so we basically take it there's four point six average reviews in terms of stars and 100 total reviews you know that then the larger number that you got used to get that average was 460. and then you can add those all up and then divide them by the total number of reviews and get the accurate average rather than taking the average of average so we're going to use those three tables and then we're gonna take that and build a dashboard that basically just has like you know the number of uh reviews number average rating and then we're going to kind of compare that versus category and also versus products or for different vendors so that's kind of the goal all right so going from all of that raw data that we just saw a second ago you know that raw Json now we need to take that and actually process it so we can eventually view it on some form of dashboard so what I've done first is you know I'm just in Snowflake right now that's kind of where we're going to build this all on so if you're unfamiliar uh with snowflake you can use another data system although one thing you're not going to do at least not in all of them is create a stage so a stage in Snowflake essentially lets me point to a raw file location or basically some sort of storage location like S3 or Azure or gcp's version but same same so you see I'm pointing actually directly to this file you don't have to point directly to files just to be clear usually the way this will be set up is this will be some sort of partition so you might have something like whatever maybe this is data Lake and then here you might uh have like the entity so customer and then you might partition it down by dates so it would depend actually in front of a customer you probably have something like where that data came from whatever your Erp is like Salesforce or something but that's essentially the gist of how you set it up and actually I'll put up an image of how you can think about this but for now I'm just going to keep it here very simple and just point directly to the file so what this lets us do so I can actually query this oh I'm on the query side I can run this and we can see the data Now you kind of can tell that wait this data is just basically one giant Json right look at this is the field it's one giant NC and an array so we need to parse that out so step one obviously is we need to parse out all this information that exists here so let's do that so in order to do that we're just going to use this I'm going to try to give some formatting here actually I'm just going to show this in poor SQL on all right so what we're going to do is we're going to use this query here to essentially flatten the data and that's what you see here where there's this lateral flatten this will let you essentially parse out if you saw actually let's go back to this if you saw right like this is an array and so we want to flatten that array out right like by flatten I mean pull out every entity that exists inside of it so in order to do that we're going to use lateral flatten and then we need to give it an input which in this case G2 is just this sub query that I'm referencing Json data is right there it's this column so this dollar sign one is referencing the very first column in the stage data if there were more columns in it column where's dollar sign two Etc and you keep going but there's only one so we just need the one so basically we're saying to unroll it to flatten it out so we're gonna query that so you can kind of see what that looks like so you can kind of see what this essentially looks like so this value is what we're working with and you can kind of tell so somewhere in here input the company name is let's see next left you can kind of see which company each one of these lines represent we're a little bit closer to actually getting some information here and from here you can actually start specifying like you would in Json specific values inside of this value so let me show you what I mean probably the most straightforward one to go for is input and company name this will just tell us which company we're looking at okay so you can kind of see which companies we're looking at so we're now going to create a few different tables the problem here is unlike usual if we were doing like actual some form of data modeling where you'd have fact tables and dimensions is we already have very aggregated data here so we're kind of aggregated to their review level in terms of like the aggregation of reviews if you were to actually like model this your reviews would likely be your fact table your companies would be the dimensions as well as maybe some of the users but uh this data is already aggregated to like the average number of stars rating like total number of uh reviews that you know got to those Stars so I do not have the ability to kind of create fact tables so we're going to just be creating like one or two tables here more for building the dashboard than doing data modeling so like I said we're going to pull out very specific information in this table this table is going to kind of represent the vendor rating so there's gonna be vendor rating there's gonna be a vendor category table and then we might have to do something with competitors so we can like pull in how like who who they're competing with every time every time I do do poor SQL I always forget that oh it doesn't like that okay so now we've again we've pulled out even more information here we pulled out is it Informatica we pulled out what company is it number of reviews star rating we're actually going to figure out the average rating per category here in the future which we're gonna do something wild there where we luckily take the number of views and multiply it by the star rating so we don't get weirdly weighted averages but this is kind of just the initial table and we're gonna add one more part to this table so the thing that you can't tell that exists is there's actually a nested value inside this Json data let's see if I can remember what's called top 10 competitors yes I I had to go find it okay all right so if we look in top 10 competitors and you scroll down what you'll see is you actually we actually get some more information on other possible companies that I haven't looked into and they also give us number of views and number of stars so that's kind of nice so we're going to actually also add that in as well as product category so we can basically get all the same information that we have here but from this nested information so I'm gonna I'm gonna show you how to do that here really fast it's gonna be uh like magic basically so we need it all all right so this should work let's remove these basically what we've kind of done here oh nice the stage is different I try to name it a little more to what it is basically what we've done here is we've flattened it twice now so if you recall you look at these competitors there could be a list of competitors here right there there could be more than one in a lot of cases no this one's got multiple so there could be more than one so we need to flatten it again so we're actually gonna be doing another flatten and you can totally do that so I've got one lateral flatten here I've got another lateral flatten here we're going to keep doing this we'll just use we'll get similar information but these are now on basically vendors that I didn't pull in so there was like all this data that I pulled in if you recall earlier and this stuff isn't all necessarily there some of it is like 5tran but like I didn't pull in super metrics I didn't pull in uh Clio or Microsoft SQL server or zapier appy pie I didn't pull any of those in that was that's kind of nice and then slower it again here too lower then add in his company name category list all right so we have essentially a table that we can build here I just remember seeing some issues when I actually got into Tableau create table Hangouts vendor let's just put I'm just gonna put this in wrap up like Dot vendor uh rating it's kind of underrating that's where it's kind of aggregated too you know this is the rating so we're going to create this table and this is going to act as basically us being able to say like hey this table is going to be where we can kind of you know take number of views figure out the rating and then maybe get a category list on top of this we're going to want to create a table I think for competitors to see like what are different companies competitors ratings so that's actually pretty simple so we already have this we can pretty much just take this and then the only thing I think we need to add in maybe get rid of this we don't need this I don't think I'm going to use number of reviews I might I'll keep it there and then company name so this will give us like basically in the future I want to be able to like filter by like let's say kibula and get all of its competitors in like how those competitors are rated what kind of you'll kind of see what I mean here in the future dashboard but I basically want to have two different comparisons so one is like well how are all the other people rated um that compete with uh with said company and then also uh what is the rating in comparison like if you take the average rating for that specific product what is it compared to all the product like product types or categories it's in compared to one thing I don't really like about G2 data is like it's all really tight like 4.3 4.7 it's also high like it's like okay someone's got to be a bad product and there just never seems to be a bad product I mean there's a 4.0 there's a four here um I feel bad there's a four here like you're you're definitely the first four I've seen most of these scores are arguably pretty high so it's like hard to say like who's really doing a bad job it's like so this table will be just basically competitors and the reason I'm kind of building these in different tables is the different granularities and it's going to be wonky to try to mesh them all together there's obviously complexity added by adding and building more tables but there's weird complexity added if you know this is at the level of company and competitor because you're gonna have multiple instances of that company because you're gonna have multiple Peddlers whereas the other table we just built this is at company level so it's only one company per one rating you know it's all one there's not multiple instances so if you try to combine all this it gets weird on the Tableau side you can obviously do things on the Tableau side too remove duplication and it just gets weird and I like trying to clean up so vendor competitor rating okay basically now we're gonna take all this and we're gonna create a table that essentially has the average rating for each basically category all right for this one we again have to kind of do some flattening because there's a multiple categories for for each product including competitors so let me just lower this one kind of done the same thing where we've had to flatten now three times so we've flattened and the initial Json data to get us each of the actual products unrolled each of the competitors inside of each one of those products and then from there there are multiple categories for each competitor so we've also to enrolled that as well a lot of flattening here as well and I'm just I just want to get like all the different categories per company name so now we're going to go and make a company name per category because what we're going to do is we're going to take that data and then find the average for each category in terms of rating so that way we can compare the product eventually against each of those so let's take this create this table so that table is created all right so for this next part basically what we're going to do is we're going to take the vendor rating table that we first created and then the vendor category table in the category table and we're going to join them and basically our goal here is to figure out the average rating per category whatever that category is we gotta figure out the average rating which one thing you'll notice that I've done is I've done this extra layer where I'm taking the average number of reviews uh times the number of star rating not number but the star rating and then divided by the sum of the number of reviews so the reason that I'm doing this instead of just doing you know I could in theory just do average star rating and then let's select technically this does work and annoyingly it's not even that different but in theory if you've got weird weightings you could have very drastically different answers and the problem is is because it is often so close you know whatever 80 of the time there's no real math to how often it's closed I'm right you're like oh what why does it matter right 4.5 4.54 4.633 4.62 they're the same right um but they're not so it is important that you understand that like weird weightings like if one company had a 3.2 in terms of their rating but they only had let's say two reviews but all the other companies you know their average reviews they maybe had thousands of reviews and the average was 4.8 that three-point and but instead of like using the thousands essentially reviews you then just take the 10 numbers 3.2 4.8 406 whatever and then average that you're gonna get a drastically in that case a drastically different average than you would in in this case the problem here is the numbers are so close to begin with that you know it doesn't make a huge difference but in general you don't want to average averages so just just a quick point so this table will basically give us the category rating the average category rating which it already exists so I can just create or ear Okay so we've created that table and then we're gonna do one more thing just to make this easier on the Tableau side again for me I'm always about trying to do things on the SQL side and not the Tableau side I know Tableau loves Tableau prep I I dislike Tableau prep sorry sorry Tableau prep so finally we're going to just basically take all of this and join all of it together so basically the average Stars per category probably not a great place to put that company name category list company category and star rating so what this is going to give us when you were if you were to actually look at this every instance of a company and the category that that company is in so you're gonna have multiple instances of a company just to be clear so you should know that yeah you know you're gonna have Informatica category a the average rating for come category a 4.3 and then the actual rating for Informatica which we'll leave that blank because I don't want anyone think I have preferences and then same thing here where it's like category B so you can have multiple instances of Informatica so we're going to use one table where it's just the individual instance of the vendor so that's going to be the the vendor table and we're gonna use that as kind of the high level information but below that we're going to do all this other information like we're gonna say like we're going to provide um all these different competitors that these products have and show kind of how they're rated versus you know the main product and then same thing with each category we'll compare each category compared to how the actual company is which just reminded me of something so if there's a competitor table I just built okay oh no I did put to put the vendor rating okay I did put the vendor rating so yeah we're gonna need that and basically the overarching filter or all of this will be company name so that will be the overarching filter let me make sure this is actually got data in it because you'll notice that most of my joins are on company name because I don't have an ID could I add one sure did I add one no quick pause because I've kind of gone through this whole video and realized I didn't point this out so you'll notice that a lot of my joins are on option company name and actually I guess you don't see that until towards the end that's just because I don't have an ID I could add an ID something I could do I didn't do it obviously you should try to have IDs of some kind versus company name but the day is so small here it doesn't make any big difference on my side cool so now that we've done all that we can kind of jump onto the Tableau side take all this data basically we're gonna take three different tables so we're gonna take the vendor competitor table so if you don't recall just as a quick this is basically the table that has the product company name competitor name and how that competitor was rated so that's going to let us compare to the main company actually no that's okay we're gonna keep it because yeah I'm not gonna do a comparison with that one I'm only going to do a comparison with the category which I'm me then we also have the comparison table this is kind of what we've been building the comparison table that's another thing we've basically aimed to build this one just has uh oh I need to lower company name hold on okay so now everything's a little more consistent and so this table again has average star rating per each category so if I go down to company category you can see there's different ones and then also the category or the actual rating for the company so that's another one and then we're just gonna have the hierarchy table or not hierarchy like header table so this table just essentially has each specific vendor and their rating actually I need to lower it here too hold on I thought I did this already all right so now we've got this vendor rating table for Monica Etc and his category list so we're gonna use those three tables to actually build our dashboard all right so this is our kind of end dashboard that we're going to be building towards yes it's not the fanciest thing in the world but it gives you some baseline metrics one the stuff that you would get just on the Baseline page which is average Stars total reviews that's what you'd normally get we've also added in category comparison as well as product comparison and so those are kind of the ways we've like aggregated all this information from multiple pages that the scraper from Bright data has pulled in and with that we can now compare all of these various metrics so we can compare it against categories as well as specific products so right now I'm looking at five Tran you look at Snowflake and you have it all in one place rather than having to go to multiple web pages so I'm going to show you how to build this in the next 10 to 15 minutes let's get into it we've already kind of scraped all the data we've kind of changed it all now it's time to actually work on it obviously there's a ton of places we could add more things like I think another metric here would be nice it's like just crying for this empty space and then if I had grape some maybe description information I'd maybe would put it here about 5tran and what it it is also if you want to go like that extra level if you scraped the logo and and put that somewhere that would look amazing I didn't do that but those are things that would definitely take this to that final level um if I was like giving it to a client the other thing I'd probably add in is uh reviews at the bottom like some sort of granular information I would try to get more granular so the first part is not very granular very high level second part a little more granular we've dived a little deeper and the third part would be like let's start giving direct reviews to a client so this is a baseline dashboard let's dive into how to make it all right now let's move over to Tableau where we're just going to create a pretty straightforward dashboard here again we only have three tables if you recall those three tables go over the general rating so the vendors rating competitors rating so we can actually look at how it compares to the competitor's rating and the category comparison how it being the actual vendor we're looking at is competing in its said category so I can kind of pull this up we're just going to create an initial dashboard something's broken oh I haven't put anything to the data source okay let's just since this is vendor rating let's just put this in oh nope didn't mean to do that twice remove all right and yeah there's there's some other ones that I've had but it'll probably remove those because I'm not gonna be using those or maybe I can build another one I should delete these usually when I build a dashboard in Tableau I kind of follow a pretty straightforward principle where you usually have some high level metrics on top um actually I saw a great one from code Basics recently where they're doing metrics on I think it was qriket but yeah some some way like you put metrics on top and then after that you put the interesting information below that will often tell you a little more about the metrics that you're looking at above what we can do here and we don't have a ton of metrics to go here so we're going to focus this on one specific vendor at a time usually I might have a larger thing where it's like oh let's look at the entire swath of the category but for this since we don't have a ton of information I don't think it'll be super useful so instead we're gonna do it one vendor at a time like the general rating for that vendor and then also right here we'll probably put like the number of reviews so that'll be pretty straightforward and then from there we'll put how that vendor compares to other vendors into space and then how does that vendor like kind of Stack Up in terms of the category so we're going to category and individual vendor metrics so let's do that let's first just quickly build the number of views in star rating so let's get start rating okay we have to pick a specific company I'm trying to remember which one's a good idea to use I wonder if snowflake is now you know what you guys probably think I like snowflake too much so let's do data breaks oh great there's data breaks Lakehouse platform okay good it's 4.66 let's clean up some of these a little bit I don't like and I like semi-bold or bold personally it's just a weird thing that I like let's name this average actually I forgot to do where's the uh let's just do 4.6 don't want it too messy I think that's a little cleaner average Stars actually 4.6 seems a little small I'm gonna do this a little bit that's a little better then we're gonna duplicate this and then instead of average Stars we're gonna do number of views and I'm gonna remove star rating number of reviews and these two will basically just go right so whatever Stars we're gonna make this floating I'm gonna add a new title in I've never liked that one we're gonna make this floating as well and then so this is a little less chaotic and actually this is kind of small I might make this bigger later we will make this horizontal so for anyone who doesn't whose number used a container it should help you make things a little more organized so say it's floating and so basically I can stick this inside of it actually you know I just realized I actually want vertical first sorry so let's do vertical first it'll just make things a little more organized let me show you what I mean so okay we've got this container I'm gonna take this shift it put it inside so it's inside and put some text on top uh average stars and we're gonna just change this there's 20 and then into my bold oh let me Center it we're gonna do the same thing um to the hundred nope don't want there and then I'm just gonna say total reviews now we can get this horizontal container so what you're noticing is the horizontal container or at least the vertical container Stacks things vertically and then the horizontal container Stacks things horizontally because it's not Stacks but puts things or holds things horizontally so if you're going to say select and oh it is selecting the entire container so I can shift and put it inside and then now you're going to select the entire container because it's not currently what we basically have created is this massive container so we're going to select this entire contains container vertical select container horizontal so now we've picked this horizontal container and it you'll notice that it fits everything inside of it so that's why I like doing this cool so now we've got this container that holds our two key metrics I mean they're not super exciting metrics but you can oops you can essentially imagine you could have some more exciting metrics here so we're gonna put this just right here probably we're gonna put a title up eventually so we're just gonna put this here and below that we're gonna actually put some more interesting things so we're going to take this and make this Avenger dashboard so this is the vendors of our rating we should probably have a title here that says Informatica or maybe somewhere over here and we'll get to that but first let's put in two more dashboards okay now we're gonna do to the vendor category first so if you recall this data specifically comparing like category category to the rating the average rating for that specific category so what I'm going to do with that I'm going to take this category list first of all we're going to so we're going to take this company name okay now we have the right amount of data rows I was like that's not the right my data Rose cool so we're looking at again the uh rating per category per vendor so actually I don't need company name we're gonna actually do company name over here nothing shows up great perfect oh that's right that actually makes sense so we're gonna do complete category so we see which categories it's in then we're going to do the average rating for the category versus the average rating for the actual company so what we're gonna do for that is we're going to take this company rating so that's 4.6 cool we don't need to average it and then we're going to change this to bars cool it's really boring right now let's see if we can make this a little more interesting let's take this average star rating so we're going to take this and use a reference line if I go to analytics there's these different things and I want reference line so we can do this and we're going to take the average average star rating actually sorry average start for category we're going to chain an average is good we're going to change the label to be custom we're gonna make it the average and then we can just take this value and that should be good let's see oh that's not great I would like this to be separated [Music] okay so I think I finally figured out this cool okay so I finally kind of figured out this although still not as like it's not the same as what I wanted I had to change something to discrete and basically want to change it to discrete it kind of tells you what I want because it's all the same I think this is something where it's like because this is all the same value 4.6 it's a little less exciting I'm trying to think if it would look any different the other direction but at the very least it has kind of broken this out so we finally figured out how to get rid of the header um there was a remove thing I've been poking at it for a little bit uh always these little things with dashboards okay so basically what we've built here it's a little less exciting because they're all the same 4.6 but the interesting part here is like you can kind of say like oh well how do they compare two big data processing and Distribution Systems and there are 4.7 is the average but they are 4.6 so arguably better in fact the Eerie beat all the other things in this category minus AI machine learning but I'm actually curious if I were to put number of reviews actually I don't have a number of reviews great I should have added a number of reviews because I I could tell you how much that weight was but you know you can actually they're technically doing better let's see how snowflakes doing all right so we're doing Snowflake and we do have big data processing here and it's still technically above average same thing with everyone pretty much almost everything else even database management system and for now just average things because I realized again there's duplicate data somewhere and I kind of knew that was going to come through but future problems for future bed to deal with so cool so 4.5 is is its rating and overall for some reason again it's somehow always better or not worse vertica okay mmm it looks like vertica has is losing this game you know let me make sure all this so big data processing distribution should be 4.7 4.2 let's check out snowflake again now we're doing some data QA live here so we need big data processing yeah okay so that's the same sorry vertica this is not me this is G2 data there are companies that are technically doing worse so that that's not good but it's good because at least it means the data underlying is accurate or at least more likely accurate so this is one of the things we'll kind of compare to let's clean up some of this let's go to format I really like getting rid of I don't know why I picked up this habit I like getting rid of pretty much all of the lines I'm like all the lines I hate all of you I don't know why I like doing this I just I hate all the lines and I I never know which ones are which so I just go through all of them yeah I just feel like that's cleaner I feel like all the other lines and try and tell if there's a oh that looks like there's no line there just just get really annoying I don't no just just add more noise so I was trying to clean up all those I'm sure some people like it some people don't cool and then we're gonna call this a category comparison Paris and cool so then let's guess this you can put this in make this floating foreign the wrong thing where's this Dash oh I see what happened it got into the container pull this out and we're gonna make this a little nicer here in a second but let me just loading by second I mean probably in a little bit it's loading um no make this floating I'm there's a container here and I don't want to select the container I really want to just select this I apologize if I move fast I just I don't know how to slow down sometimes okay we're gonna use this later so this is actually going to I wonder if I can do it already apply to worksheets all using the data swords selected worksheets I might have to connect things can I can I go this far okay so we're gonna have to go here we have to make this connection here somehow why can't oh my company name and let's remove this cool and then same thing here let's go here let's make connection let's write this for some reason it's too high the average 4.3 we can double check to make sure that that's right that is right again I'm averaging because there's duplicates in theory I should have that fixed in the data source I clearly did not okay so now we can start filtering this and say like palantir and it'll compare oh look now I can compare everything in palantir and it automatically oh there's only two reviews it can compare everything so 4.5 somehow everyone's got a 4.5 Sigma again compared to everyone now we're going to add in on top of this all of the other competitors in in for each company so that's another data source so we're just going to connect it all via the company name 191 rows it's not a lot of rows I think I'm only gonna have competitors for the original companies that I looked at so it's not great but we're gonna put them in so for them let's add a new sheet and again we're going to do company name as the filter again I think I'm only going to have the companies that I've I've pulled in so as much as I have all those other companies I don't think we're gonna get everything in uh competitor name cool we can see that these are yeah these are reasonable competitors and then we're just gonna put their star rating and number of reviews star rating number of reviews okay cool and um let's format some of this all right just uh custom nope nothing and then here well we can leave the let's just make sure it's actually where I want it nope there's probably a secret I don't know about about how to make sure you format Nicely versus just like doing what I'm doing currently okay now I do want to make this again a little bigger I just don't want to do this nope I'm doing the wrong thing I might I might do something here so it's a little more interesting but trying to think don't want to do something a little more interesting here so I want to just do a bar chart like would that be more interesting [Music] see if I put this on I'm just gonna be curious what this looks like basically we should put this right next to it we're gonna we're gonna still polish this up or we're gonna have to make this wider too okay so it's a little bit better you know the other thing we want to do we have to rename things and clean up things but uh the other thing we need to do is go here I apply this to worksheet and then select worksheets we're gonna rename she 10 to not sheet 10 and we're gonna go to issue 10. and then again we're going to connect this company name and we're going to move this other company name so we can see all of its competitors how they're doing yeah I'm not a huge fan of these numbers I do like that again we can kind of compare this to competitors let me just quickly start formatting things and maybe I'll see something that I like one thing I like doing is adding a border it just seems to tighten things up in general like it just makes things a little tighter it feels like it's a little more organized it's like container yeah you should probably make these sure these are the same size which you can do oh the other thing I might do is put a background on the on the header so I select this might add a little little background here I like doing that sometimes if you do that though Then You're Gonna Wanna [Music] I feel like I've done that maybe I need to do darker gray okay that's at least a little uh better you can pick a better color obviously if you have a certain scheming like I just tend to do gray just again it looks a little bit cleaner than just what it was before then same thing here I might just keep the center for now and change this to White and then let's remove this let's put in the horizontal container again again the container all the container does kind of help you sort things so so let's do floating so this is kind of where we're going um you know in terms of like building a dashboard I think we're we're going in the right direction I I wish I had more information uh which I could scrape more using bright data I think I'd love it if here uh instead of to having to do this whole tricky thing I did I was able to put like what date Works was like a little description so I'd probably want to scrape that information I think the other thing is I could squeeze this down and then maybe put down the categories that this fits in but I think this kind of gives you a good idea of where you could go and how I've kind of used this information to consolidate a lot of information because you did have stars and reviews on one page what you didn't necessarily have was competitors in categories and how this product compared to those other categories so what we've kind of done by building this very quick dashboard is created something that now the end of the day you can actually compare those products to each other as well as the categories to each other so you can kind of see like oh how's this product doing in all of its categories not just the one and how is this product doing against all of it competitors not just again like one at a time so we've Consolidated that information again you could scrape even more to make this even more valuable I'd add another metric here because just like space again this is blank space here I had more description here but that's kind of end goal and that's kind of gone so I think this is where we're gonna wrap it up I'd love to see what you guys kind of come up with if you want I can actually maybe share this data set with you guys and I'd love to see what you guys come up with feel free to poke around at it and other than that thanks so much for watching thanks so much for you know everyone who's kind of joined in it's been a long video also thanks to Bright data for helping us get that data other than that I will see you guys next time and thanks and goodbye [Music] foreign [Music]
Info
Channel: Seattle Data Guy
Views: 49,100
Rating: undefined out of 5
Keywords: data engineering project, how to become a data engineer, basic data engineering project, basic snowflake project, basic tableau project, web scraping, basic sql project, sql project, data engineer, seattle data guy, end to end data engineering project, basic sql, snowflake sql, data analyst project, data engineering skills, what skills do data engineers need, learn to become a data engineer, data analyst vs data engineer, data analyst, sql for dat analysts
Id: vSgJ3bOyE0w
Channel Id: undefined
Length: 48min 52sec (2932 seconds)
Published: Wed Jan 04 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.