Investigating Application Performance at the Client Level with Aternity and Riverbed

Video Statistics and Information

Captions

So, thank you for letting me present today. I'm John Hodgson, I'm the head of product for Aternity. I have a long relationship with these products: I'd say about 18 years wearing some form of a Riverbed-related logo. I've been acquired multiple times. We spun off Aternity recently, but as Vince mentioned, we brought the band back together, and I'm very happy for that. I've been an IT professional for well over 20 years. I started as a systems administrator many moons ago, back when AOL Instant Messenger was a thing; my handle was SYN plus ACK, so I have a network background as well. But I've spent the majority of probably the last 15 to 20 years really focusing on application performance. So we're going to get into that sort of practical demo, but I want to spend a little bit of time just talking about what Aternity's about, because those of you who've been following Riverbed may not have been on top of us for the last couple of years. Really what we're about is looking at end-to-end visibility starting from the client side. We have an agent that gets deployed on end-user devices. That's one of the aspects of telemetry that we have, which is really looking at users, whether they're in the office or at home or mobile, and capturing information about their interactions with applications. So when we talk about these tellers or these desk personnel at the airports, when they're interacting with their terminals, interacting with this rental car app, we would be monitoring their interactions: what are their actual activities, what are they clicking on, understanding response time. And we're following that through to the back-end applications, because there's an APM component. Basically, if you look at all these little dots, the dots on the first row represent client-side agents
and then we also have server-side agents, which can monitor VMs and containers, services, full APM, and it's all fully stitched. There's no sampling; it's all high fidelity. Every click that a user does, we follow full stack through. We leverage both native bytecode instrumentation as well as OpenTelemetry, and we take all of that data and we feed that into our SaaS-based platform, where it's consumed by different roles and responsibilities to solve either performance issues or productivity issues. Obviously we integrate with the service delivery ecosystem in terms of ServiceNow and feeding into tools like Splunk. And from the network perspective that you see illustrated here, we can do a breakdown that says yes, the transaction is slow, and it's slow because of some breakdown of client, network, or server. But depending on the nature of the slowdown, we're going to want to drill into those respective areas, and the NPM side of the house is what allows us to drill into the network side. I'll make that clear in a second. But again, our core focus is not just about time-series telemetry, but really looking transactionally at what users are doing and how it traverses. I just want to put that in your mind: the time-series data is important, what's happening on the device, what's happening in terms of CPU and so on, but really it's kind of like, if a tree falls in the forest, does it make a sound? Well, if no one is doing anything with a laptop, who cares if its CPU is high or low, or if Outlook is crashing? It doesn't matter. So we're really always focused on the human. We focus on what they're doing, but we need that raw technical telemetry so we're able to solve problems. So, very quickly, to tie it all together: we start with that person. We have the context of who they are, what device, you
know, they're on, what type of device, and what they're accessing. They're accessing an application, they're doing it through the cloud, and that's an important piece of understanding that end-to-end response time. But if we determine that this response time is slow, then we have to look beyond the ingress point of that application and appreciate the fact that there are networking components and application components that all make up that overarching response time. So when a user makes a single click, of which millions could happen in a day, there are all these places where delay can occur: in the initial DNS phases, through WAN, LAN, application code, method, SQL. And at the end of the day, when this transaction is done, a human is going to have an impression of whether they're happy or sad with that. That's been a big deal with the pandemic, understanding people's reaction to things: what has their sentiment been? So these are all the telemetry points that we collectively focus on. The Aternity side is about this piece: understanding the device, understanding the activities end users are performing, and understanding how they feel about it from a sentiment perspective. Then the NPM side is handling the LAN and the WAN and all the network effects, understanding packet loss and jitter and all the things that you know us for, and what part that plays in the overall equation. And if it's neither of those but something in the application, that's where APM comes in, looking at the infrastructure but most particularly the transactions as they flow through JVMs and Node.js and Go and things like that, and stitching that all together. All of this is done with full fidelity; there's no sampling. The last piece of the puzzle is synthetic monitoring, because that's become increasingly important, especially with work from home and people accessing SaaS applications. Our
customers have really been wanting to make sure that they can continually keep an eye on applications that may be out of the purview of typical telemetry points. So we've added synthetic web monitoring, very robust, with client-side profiling and waterfalls and screenshots and things like that. And here's the really important thing: we bring all this together into that one data set, that one view. That's what Portal is about, which Phil was showing: taking all this and saying, let's make sense of it. It's not integrated in a side-by-side, co-location fashion; it's really intertwined so that we can solve problems. That's what you're going to see here, this ability for us to take information, hand it off from one person to another, and ultimately solve issues. Okay, so with that, let's roll up our sleeves. We'll get back into the technology and we'll continue that demo. John, quick question, if I could, on the synthetic side. Synthetics have come up a lot over this Field Day, so I'm wondering, when you say synthetics, are you just talking about ping, traceroute, looking at HTTP GET requests, that kind of thing, or are we also doing synthetic transactions, like a full synthetic transaction? Yeah, so it's all of the above. There is infrastructural synthetic, where we're doing things like pings and traceroutes and port checks and so on. There are very sophisticated web synthetics that are doing Selenium-based scripts, clicking through workflows, capturing not just the nav timing, which is the overall response time of the base page, but the resource timing, which is all the CSS and JavaScript, and capturing screenshots and film strips and even JavaScript profiling of the rendering time. More and more work is happening in the browser, so when you see the CPU go up when you open a web page, we have analysis that allows us to see that, and
we're in the process of extending that to network path synthetics as well. We have about three million endpoints deployed in our SaaS, and being able to do sort of triangulated triage, especially of access by people working from home to SaaS-based apps, and understanding the path that they're taking, so effectively think of it as distributed traceroute, is something we're in the process of adding right now. Hopefully that answered your question. Yeah, and just so I'm clear: are the synthetic web scripts running from the client side, or is it happening on the server side? The synthetic web scripts are currently running from either our global network of robots, or they could be deployed on premises as well. So we haven't yet... and again, we're not running synthetics on the server side; we're running them to the server, and then we can stitch through the server if needed. So those same transactions that we're capturing for real end users, we're also capturing for the robotic end users, and being able to stitch through to the call stacks and so on. Thank you. Thank you, John. Great. Do you guys have the concept for the agents to be able to do blue-green testing, so you can have a certain set that are doing blue-green testing for one set of services versus another? Yeah, so generally the answer is yes. We can capture method parameters and things like that, so it's very easy to extract tags to understand that sort of A/B testing or subgroups of users and so on. So that is a use case that people use us for. So, continuing with that demo: here we are in the Aternity SaaS-based UI. We have a million dashboards to solve all kinds of problems, like blue screens of death and Outlook crashes and so on, but I did want to talk about something we have called DXI, which is our Digital Experience Index. One of the challenges
that we have is that we have so much data coming from so many different data sources that it is inherently disjointed and not normalized. That's a challenge for customers, because there are people who are really good at solving CPU problems but don't really understand what to do with a Zoom problem and so on. So we built DXI as a way of normalizing all that data, organizing disparate data, and then collecting it into categorical buckets. You can see there's this center score, which is a sort of normalized continuous-improvement score that says, okay, overall the company's doing 89. I'll talk in a second about how that works, but it's based on the next ring of things, which are devices and collaboration, productivity, and even business applications. As you go further and further out on those rings, you get into the raw telemetry, where you're talking about Outlook crashes or blue screens of death or the response time of a particular activity. The way the scores are calculated is basically through benchmarking data. Our three million endpoints across all of our hundreds of customers give us an anonymized data set that helps our customers understand what is typical for a certain type of thing: for a certain version of Windows 10, what's the normal crash rate? It helps our customers understand whether what they're seeing is in line with everyone else, or whether they're lagging or ahead of folks. So what our customers do is specify their business goals in terms of what's important to them. Obviously, during the pandemic, collaboration was critical, and ultimately what they would say is: look, based on benchmarking data from other companies, we want to lead the pack there, and we don't want to do any worse in our business applications. So we want to judge ourselves relative to ourselves for business apps, but we
want to judge ourselves relative to our peers in our industry for collaboration. Long story short, the goals combined with the technical data turn into these scores, and the scores effectively allow someone to say, from a normalized perspective: green good, red bad, let's follow the red. That's effectively how they understand, of all the things we could address, which ones we should address first from a business perspective. So if we look here, again, when I talked about those rings, my devices score is 100 because I'm meeting my goals, and I can see the trend of that over time. But if we look, for example, at business applications, that score is lower, again because I've judged myself more harshly. I'm saying, okay, 60: this bad score is the reason why I'm at 89. And why are my business applications problematic? I can drill into that and say, now focusing on this ring, what is the reason why we're at 60? If we look at those constituent rings, you can see the rental app, the one we've been talking about in this demo, is the one that has a score of 50. And if I want to understand why it has a score of 50, clicking in further, I can see that it's because of this Deal transaction being particularly slow. Now, I'm showing you the manual click-through mechanism, but if you notice in the tabs, there's something called DXI Analytics. The product itself will just hand it to you on a silver platter and say: listen, if there's something you're going to fix, you should fix the Deal transaction in this application. The fact that we've colored it red means the product understands that this is the most important thing. But people tend to like to explore, and this visual view helps people understand the data model, which is why we usually show it first. And then here's a Book transaction, which is perfectly fine. Don't worry about it. It's good, it's
been good, it's staying good. So now if I drill back out and ask, okay, what should I look at next, let's pick up the scenario that Phil started, where a user from one of these locations is calling in. Cindy Johnson is saying: my app isn't working, I'm at JFK, help me out here. And now we've got a view of Cindy's device, but of Cindy specifically in terms of her activities. That first tab, Experience, is all the various things that she's doing, like viewing deals for customers. There are some transactions that are showing as unavailable, and all the different operations that you would do would be enumerated here. The trends that you see are colored based on dynamic baselining and how far away from normal they are: green means it's within band, yellow means it's slightly different, orange means it's significantly different, and red means it's not completing. So we can see that there was this period of time where basically everything started turning red. There were a lot of failures, and this aligned with what was observed in Portal. You can attack this in different ways. If I'm playing the role of a service desk person who's looking at a ticket that was filed by an employee, this might be the view that I look at, but in many cases our customers are solving problems right from the Portal side, and they don't even have to go to this level. Anyway, continuing along, we look at the different applications that she's accessing. We've got Teams, the rental app, and the rental manager. In the rental app we can see, as a relative percentage of all the transactions, that 22% of them are failing and 58% of them are really out of bounds. We could see counts of the transactions in the second column, but it just gives me a feel for the context of what the problem is. So while
packets are ultimately going to become important, understanding why those packets are important is what this is helping us do. I can filter in on just a particular application, and all the data filters accordingly. We can see here that Cindy's at JFK. So now we're going to want to drill in deeper. Usually we're going to look at a particular transaction and try to understand why it is the way that it is. Every one of those records has a transactional record behind it. At a minimum, it's looking at things from the device side: what was the transaction, when exactly did it occur. And we see that response time: a 38-second response time, of which 29 seconds, almost 30, is in the back end, and eight seconds is the client. This helps answer the question: is it something we want to flip over to the desktop team, to the network team, or to the server team? At first sight it looks mostly server side, so maybe I'd flip it to the server team, and if we had APM involved we could even drill in and follow this transaction into the back-end tiers. But at this stage I don't want to jump to conclusions, so we want to look at the device and check whether there is a device-side delay. So I clicked on Troubleshoot Device, which is now in the context of her device: more technical telemetry about what she's doing, CPU, events, and so on. We can see the rental app is the thing she's using the most. We can see the business activities were almost sub-second at this point, but then clearly we're seeing some device issues. We see the spike in CPU to 100%, which is kind of weird; it didn't seem to impact response time. Then later we see this big block of CPU pegging, and clearly that was correlating to some degree with when response time suffered. Remember, I'm a level-one
service desk person, so I'm not going to jump to too many advanced conclusions yet. Whenever I see CPU being a problem, the question usually is: what's causing that? So let's go look at what processes are running on the device. We drill in, and now we get a more detailed list of the different processes. One that stands at the top, at 99% CPU, is something called XMRig, which I don't recognize. It's not a corporate process. So okay, it's pegging the CPU; I probably want to look at this in greater detail. Switching back here to the summary, I can scroll down through the applications that we've discovered, and now we see one of the applications being used across the organization is the XMRig miner, which a little Googling tells us is a bitcoin miner. I don't know how it got on our systems in our airports, but it's there. We can see it's deployed on a series of devices, so I can go and say, show me the users who are leveraging this, to see whether this is isolated to Cindy. Did she get hacked, or did we get hacked? That's kind of the question. Now, taking a step out, we can ask the system: tell me where else this lives. I can see that I've got these three users. This is that initial view where we're seeing the usage of this and so on, but I really want to understand where they're located. So I can change my pivots and say, for that second column, let's change the column to device name, which works, given the way our naming scheme works. And here I can see, okay, I've got three different airports that are all running this miner that should not be there. My corporate policy at this point says we've got some sort of security event, so I'm going to presume that the miner is the root cause of the delay, and I'm going to flip this over to my security team to say, can you please investigate this
and we're going to let them handle it further. Depending on the nature of the problem, we might hand it to different groups, but in this case, at this stage, we're presuming it's a security issue that's causing the problem. So what have we learned so far? That this full-fidelity client visibility is allowing us to see that there is a problem and who is impacted, and at this stage of the game it appears to be a crypto miner that somehow got onto our systems that's impacting the user experience.
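
The triage walked through in the demo (splitting the 38-second response time into back-end and client portions, then spotting an unrecognized process pegging the CPU and pivoting by device name to see which airports are affected) can be sketched in a few lines of Python. This is only an illustration, not Aternity's API or data model. The record shape, the allow-list, the CPU threshold, the device naming scheme, and every sample value other than the 38 s / 29 s / 8 s split and the 99% XMRig reading on Cindy's device at JFK are assumptions invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One CPU reading for a process on a device (hypothetical record shape)."""
    user: str
    device: str  # assumed naming scheme: "<AIRPORT>-<ROLE>-<NN>"
    process: str
    cpu_percent: float


# Hypothetical allow-list of processes the company expects to see.
APPROVED = {"rentalapp.exe", "teams.exe", "outlook.exe"}

# Only the JFK/XMRig row mirrors the demo; the other rows are invented.
SAMPLES = [
    ProcessSample("cindy.johnson", "JFK-DESK-01", "xmrig.exe", 99.0),
    ProcessSample("cindy.johnson", "JFK-DESK-01", "rentalapp.exe", 1.0),
    ProcessSample("user.b", "LAX-DESK-04", "xmrig.exe", 97.0),
    ProcessSample("user.c", "ORD-DESK-02", "xmrig.exe", 98.0),
    ProcessSample("user.d", "ORD-DESK-07", "teams.exe", 12.0),
]


def split_response_time(total_s, backend_s, client_s):
    """Return each component's share of the total response time.

    Whatever is not back end or client is attributed to the network.
    """
    network_s = max(total_s - backend_s - client_s, 0.0)
    return {
        "backend": backend_s / total_s,
        "client": client_s / total_s,
        "network": network_s / total_s,
    }


def dominant_component(shares):
    """Name the component with the largest share of the response time."""
    return max(shares, key=shares.get)


def suspicious_processes(samples, cpu_threshold=90.0):
    """Map each unapproved, CPU-heavy process to the locations where it runs."""
    locations = defaultdict(set)
    for sample in samples:
        unapproved = sample.process.lower() not in APPROVED
        if unapproved and sample.cpu_percent >= cpu_threshold:
            # Under the assumed scheme, the location is the device-name prefix.
            locations[sample.process].add(sample.device.split("-")[0])
    return {process: sorted(sites) for process, sites in locations.items()}


if __name__ == "__main__":
    shares = split_response_time(total_s=38.0, backend_s=29.0, client_s=8.0)
    print("Largest share of response time:", dominant_component(shares))
    for process, sites in suspicious_processes(SAMPLES).items():
        print(f"{process} is pegging CPU at: {', '.join(sites)}")
```

The two helpers are kept separate on purpose: the response-time split alone points at the back end, and it is only the per-process view that surfaces the miner, which is why the demo checks the device before handing the ticket off.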

Info

Channel: Tech Field Day

Views: 90

Rating: 5 out of 5

Keywords: Tech Field Day, Gestalt IT

Id: 84QfVnefSZw

Length: 19min 11sec (1151 seconds)

Published: Fri Sep 17 2021