Day 1 - Intro to Machine Learning (ML) with Google Earth Engine (GEE)

Captions
Yeah, thanks everybody for joining. I know you've got a lot of priorities going on in West Africa, and we were really happy to hear, partly from the SERVIR West Africa visit to Huntsville very recently and some of the conversations at the land cover task force, about the interest, need, and excitement around machine learning, deep learning, and AI, and how SERVIR West Africa wants to use these approaches in service development and service refinement. Biplov and I, along with Emil and Jacob, put our heads together on a good approach to some initial training that could lead toward folks on this call participating in the broader TensorFlow working group, and we also kept the Geo for Good event in mind. There are a couple of big touchpoints: the Geo for Good event is the yearly way we engage with Google, especially on the deep learning and data science front; we're also having the technical exchange at the exact same time as the Geo for Good global exchange; and then there's the TensorFlow working group, a bi-weekly meeting where we work on capacity building and developing solutions to keep the train moving down the tracks, because there's a lot to learn and a lot of expertise. So we thought these two one-and-a-half-hour sessions could be really valuable: an opportunity for you all to ask questions, to get us rolling, and to build this slide deck, some code, and the recording as resources for the future.

I've successfully eaten up the five-minute mark, and I see a couple more people streaming in, so I'll start by introducing myself. I think most folks probably know me: I'm Tim Mayer, the regional science coordination lead for the Hindu Kush Himalaya hub, also a research scientist at UAH, and I get to work with Biplov a lot on the data science team. I'll hand over to Biplov.

Thank you, Tim. Hi everyone, my name is Biplov Bhandari, and as Tim mentioned I work on data science and AI/ML across various thematic areas here in SERVIR. I work as a data and research scientist for the team. I'll hand it back to Tim.

Perfect. As you all know, we're recording this session. We want to put it on the TKMS, or have it live as a resource at West Africa, so you can use it for future trainings — when you've got interns coming on, or you're interacting with some of the Ames fellows, or what have you. It's really geared toward a capacity-building deliverable and a resource for the future. With that said, take full advantage of it and ask questions. We have three hours over two days, one focused on machine learning and one on deep learning, and we'll get into the nuances of those, but we want this to be a really useful tool. And with that, I think it's time for a group photo — we have to do this virtually, since we normally do it in front of a building. Everybody open up your cameras and let's take a screenshot, or several. Let me minimize this a little bit — one, two, three — oh, that looks great. Let me save that before I lose track of it. Perfect. We've got our official documentation: we're all here, we're all learning together.
Perfect. For posterity, let me go back to sharing the screen. Okay, with the live recording setting, definitely mute your mic, and you can turn off your camera, especially for bandwidth. If you have any questions — and we definitely want to hear from you — just raise your hand, and we also have folks monitoring the chat. I know Gloria is putting some resources in there for filling out the pre-survey, so you can be doing that right now as I go through the early introduction, to save some time. When we get into the hands-on portion — we have a lab later on — feel free to share your screen. We're all pros at this now after the virtual environment around COVID, but feel free to raise your hand and ask any question. So let's move on.

The objectives for our hour and a half today — I know this was shared in the agenda, and I mentioned it a little earlier before a few more people joined — are to provide the theoretical background for both machine learning and deep learning. This first day is machine learning, the second day is deep learning, and it's really about getting everyone's gears turning on different workflows: how are you currently doing your methodologies, and how might machine learning and deep learning complement that? We also want to grow West Africa's interest, skill, and ability to participate in the TensorFlow working group. The initial idea of the working group, back in 2019, was to have a space that's open for dialogue, for people to ask questions, to develop, to really come up with ideas. We want everyone there to feel they can participate, and I think this three-hour session can be a great stepping stone in that direction. We also wanted folks attending the Geo for Good event to have some hands-on knowledge and experience. So the expected outcomes are some knowledge and expertise in machine learning and deep learning, and the ability and interest to participate in the TensorFlow working group.

As I mentioned: day one is all machine learning, day two is all deep learning. We're going to do theory — what is actually happening in a random forest. We're not going to cover all types of classical machine learning, because it's a really broad field, but we'll go into one example. We're going to have a test — that's the first warning — so put on your thinking caps and your listening ears; there will be a test, it'll be really hard, so pay attention and take notes, and we'll have a winner. We're setting the bar pretty high, and hopefully I'm scaring everyone just a little bit, in a fun way. Then we have a hands-on lab and a discussion space. So within an hour and a half we're going to try to do a lot, over two days.

Okay, this is a great time to check whether you have access to the resources. Let me share this repo really quick — it's publicly available, so everyone should have access to it.
It should be listed under your Reader, so when you click on that link, since it's open to everyone, it'll pull up as an available repo with my name and this West Africa machine learning training. There should be three scripts in there — this is an older screenshot that shows two. If you don't see it when you click, let it refresh a few times, and if you still don't have it, raise your hand and we can work through it while we're rolling through the theory.

As Gloria has kindly been sharing, there's a pre-training survey — definitely fill that out. We want to collect information on demographics, interests, and background. That's really helpful for us, not only for folks we know well in the hub but also more broadly: how does this type of machine learning training help, as it feeds into our global thinking on the usefulness of the TensorFlow working group? At the very end we'll have a summary report that helps underscore how useful this training was, so please fill it out, and I'll keep turning along.

At the very beginning, while we were waiting for people to join, I shared this Jamboard — here it is again. Feel free to fill it out; it's just a couple of blurbs about who you are. For the sake of time I'm going to zoom forward, and if we get time at the very end we'll come back to the introductions so we can meet people a little more — I know you all know each other, but I don't know everybody. With something like 42 people on the call we'd basically eat the whole hour, but we want to have that.

Now we're going to switch over to Menti. Very quickly, go to menti.com and join with this code, 6497 8281, in another tab. It's helpful to understand what people are coming in with — what knowledge they have and what their uses are. I see data science, deep learning, artificial intelligence, image processing — can I read these horizontally? — instance segmentation. As people are still adding these, my takeaway right off the bat is that there are a lot of different uses: some are very broad, and some are hyper-specific, like instance segmentation. Keep adding your thoughts. I see automation popping up as a touchstone — the larger the text, the higher the frequency of submissions. And "magic" — I love that one. Is that Emil? It was not, but that's hilarious. There is a bit of a magic feel to it; I'm surprised no one has put in "black box" either. That's the double-edged sword: there are some really fantastic things you can do with machine learning and deep learning, and then there are some caveats, and Biplov and I will be frank about good applications and the elements where you should be cautious. This isn't a panacea that will fix every problem, but it's another tool in your toolbox. Great, we've got a lot of responses — let's zoom to the second question: rank your skills and knowledge using machine learning from one to ten, from "I don't know" to "you're a wizard."
Oh, we've got an eight — this is fun. I love these Menti polls because they're live and you can really see people interacting. I'll give it a couple more responses; I know you're already inundated with surveys, but this is one more temperature check on where everyone's at. It's cool to see that it's not so much a normal distribution — more of a waveform — with quite a few folks at four, and a couple more experts, which is great to see. It's useful for how we tailor this training: we're going to start relatively basic and then progress, and on day two we go right into deep learning, so this is really focused on machine learning. It's great to see where everyone's coming in and self-classifying, which is always fun. Let me switch back to this tab and get rolling.

Okay — I've got to make sure I'm asking the specific questions for the test and don't leave you all hanging when it's test time. This is a relatively useful diagram for understanding the history of artificial intelligence, machine learning, and deep learning. Artificial intelligence, the way it was used until recent memory, referred to a broad field; generative AI folks are now using that terminology for a very specific type of application. But if you take a couple of steps back, AI is really a whole field, with machine learning a smaller field inside of it that mirrors statistics very closely — a computer science, applied-science mirror to statistical approaches — and within that a smaller field still. You can see on the timeline that around the 2010s deep learning started to come into vogue, and we'll talk about why. So think of these as subsets. Machine learning had existed before the 1980s, but more and more algorithms came out and it became more and more user-friendly — now you can just pull up SciPy and use machine learning, whereas before it was very complicated. That's the rough history, and day one is just machine learning.

Within that breakdown there are two camps I'd mention: unsupervised and supervised learning. Supervised learning means you're given data with labels. If you're trying to identify rice, you have existing labeled or annotated rice points; the model builds a relationship from them, and when you introduce new data it leverages that past knowledge to predict on new instances. That's supervised — it has information to begin with. Unsupervised doesn't: you're not providing the label. Think of clustering — you've got points on an x and y scatter plot and you use something like a nearest-neighbor clustering algorithm to look at the distances between points and the clusters they form. You don't necessarily need a label telling you what's rice and what's not rice, or what's galamsey and what's not galamsey. Those are just different approaches within a data space that you can use to segment your data.
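A minimal Earth Engine sketch of that unsupervised idea — k-means grouping pixels with no labels at all. This is not from the session's repo; composite and roi are placeholders for your own covariate image and study region.

// Sample the image, let k-means group the samples, then map the clusters.
var sample = composite.sample({region: roi, scale: 30, numPixels: 5000});
var clusterer = ee.Clusterer.wekaKMeans(5).train(sample);   // 5 clusters, no labels needed
var clustered = composite.cluster(clusterer);
Map.addLayer(clustered.randomVisualizer(), {}, 'k-means clusters');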
We've highlighted the top section here on machine learning. Many folks are probably familiar with regressions — probably the most common approach in modeling — and we'll talk more about random forest today, but there's SVM as well, and I'm sure folks are familiar with that too. We're really only going to talk about supervised learning, because that tends to be the area we interact with the most, especially in the EO world. I know this was shared with the prerequisites, but for everyone who's interested and hasn't had the opportunity, there's a 10-hour training from Google on machine learning and deep learning that I'd highly recommend. There's still time to listen to what you can between now and tomorrow. It's a really great resource — I come back to it every year or so, because as I read more papers I find it relevant and have a lot of aha moments. It's free, so check it out and keep it as something you can leverage in the future if you haven't done it already.

As I mentioned, machine learning is a broad field. Zooming back to this slide you can see the different algorithms — I like this as a sort of network diagram. Most folks, judging by the Menti, are interested in image classification, which makes sense in the Earth observation field: we're trying to identify features within an image, so an image classification algorithm can be really powerful. That falls way up in the top corner, under classification, which is a type of supervised learning, which sits within the field of machine learning. You can see these different camps where algorithms perform differently. Really, machine learning is building a model off of sample data — known as our training data — in order to make predictions. The beauty is that when you give it enough data and allow the model to build that relationship, you're not explicitly labeling every single pixel as galamsey or not galamsey; you're allowing the model to make the prediction. At scale this is really powerful: if you're dealing with millions of pixels, it's okay to have a little bit of error, because you just saved yourself an insane amount of time not labeling every single pixel. So: sample data or training data with labels, and the ability to make predictions.

Let's walk through a thought example. When I worked at Colorado State I had a postdoc who was a great teacher, and he said, "Think of everything as tables, Tim." That was an aha moment for me, because when you're doing all this geospatial processing you're dealing with rasters, and the operations are kind of like tables already — it's an array. And when you get into the data space, what you give the model is, essentially, a table.
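A quick sketch of what building that table looks like in Earth Engine — attaching the covariate bands to each labeled point so every row is one label plus one column per feature. The names points (a FeatureCollection with a numeric 'class' property) and covariates (a multi-band image) are placeholders, not the repo's variable names.

// Each labeled point becomes one row: its class plus the band values underneath it.
var training = covariates.sampleRegions({
  collection: points,      // labeled galamsey / non-galamsey points
  properties: ['class'],   // keep the label column
  scale: 30                // sampling scale in meters
});
print('One row of the training table:', training.first());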
You've got your points or pixels, or what have you, with your labeled instances — rice or non-rice, galamsey or non-galamsey for these binary classifications — and then you've got all these features: NDVI, SAVI, occurrence of water, distance to roads. Each one of those features is a column with an ID next to it. When you give this to an algorithm, it's really just synthesizing that table: it takes the information in, assimilates it, and draws conclusions from it. So on the data science side, take the time to know and think about the qualities of the data going into the model.

We've got an example here with tabular data on cats and dogs. For entry number one in this hypothetical table, we may or may not know whether it's a cat or a dog, but we know its size, weight, color, ear length, nose length — all these attributes about that individual entry. Given that information — this huge table with all these fields — how can we make a binary prediction about whether the next entry is a cat or a dog? I'm a very visual learner, so let's walk through it. We've got two entries: entry one is the bottom one and entry two is the top one. For nose length, one has a relatively small nose and the other a large one, and the sizes are vastly different. With only two entries, the machine learning algorithm would essentially try to draw a line — a hyperplane or decision boundary, really just a fold in the data. It's saying: within this data space, I want to draw a line where these two groups are different. In a very straightforward way, that is the classification — we've made a decision between these two populations, and any time I get a new entry I'll make that prediction.

But we want to look at the whole table, so let's add more information. Here's an instance with a much smaller nose but a much larger size, and you can see how the hyperplane changes a little with new data. Do that again — add a couple of smaller animals — and you get a different hyperplane. And finally there's a case where it might be really confusing: a dog about the same size as a cat, maybe the same nose length, where the ears might be the distinguishing feature — but ears aren't listed. So this hyperplane, this approach that divides the entries you're interested in — your focal entries — is a really key factor in understanding the quality of your data and which algorithm you'll use to find the solution. In a very short, concise way, this is the majority of how most machine learning algorithms make decisions: they look at the features and develop a relationship based on the input data.
The key takeaways in that blue box are, first, training data — your galamsey points, labeled as galamsey at that location. Second, prediction, also called inference — say you've got a point at location A and no point at location B; how do you know whether B is a galamsey location or not? That's where you rely on the model to predict to that new pixel. When I say inference, that's a statistical term; folks in the machine learning world tend to use the word prediction, but they often mean roughly the same thing, with slight nuances we can talk about on day two if anyone's interested. The next bullet is input features, also called model covariates — what I mentioned before with nose length, ear length, size, and weight, those attributes of the individual. You can think of galamsey the same way: we've got a location where we know galamsey is happening, and then all these other attributes about it — the change in NDVI, the forest cover type, the distance to roads, the distance from waterways. They don't necessarily have to be EO-derived; they can be whatever is relevant. And then classification: all my examples so far have been binary, like rice versus non-rice or forest canopy versus non-canopy, but you can also do multi-class — many different types of entries used to predict, say, ten classes — and the model handles that as well. There's also the ability to do regression: in the canopy cover example, if you had entries at 10 percent canopy and 40 percent canopy, you could use a machine learning algorithm to estimate canopy at a given percentage. So you can use not only categorical information but continuous data as well.
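In Earth Engine those two flavors live on the same classifier; you just switch the output mode. A sketch, reusing the hypothetical training table and covariates image from above — the 'canopy_pct' column is made up for illustration.

// Categorical labels -> classification (the default output mode).
var rfClass = ee.Classifier.smileRandomForest(100)
  .train({features: training, classProperty: 'class',
          inputProperties: covariates.bandNames()});

// Continuous labels (e.g. percent canopy cover) -> regression mode.
var rfRegress = ee.Classifier.smileRandomForest(100)
  .setOutputMode('REGRESSION')
  .train({features: training, classProperty: 'canopy_pct',
          inputProperties: covariates.bandNames()});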
So again, think of it as table data. We have this database of galamsey and non-galamsey GPS points — blue being galamsey and red being non-galamsey (and my cursor is also red, so it might be a little confusing) — and we've got this set of Landsat images. How can we use this information to start answering where galamsey might be? Let's walk through another example and leverage some information we can derive from EO. These might not be the best features — and that's where you start blending your environmental management or ecosystem knowledge with choosing the right variables to understand the phenomenon. Understanding the geophysical properties and coming up with proxies for them, or direct measurements of them, from Earth observations relevant to that feature is critical to creating a valuable model. In this example we've selected SAVI and NDVI, just median composites from May to October 2019 — an arbitrary time with an arbitrary index. You can see a pretty stark separation between the classes, non-galamsey versus galamsey, at the top, and in this graphic the number of trees running through the calculation to identify the hyperplane. Essentially, that line ends up being the model, as I mentioned before.

So how do we take in new information? Imagine there are only a hundred points in this data space — that's all the points from your field campaign. There's no geographic information here; it's only plotted by NDVI and SAVI. If we add a new location that has SAVI and NDVI values, where does that point fall? Say the new location has an NDVI of about -0.3 and a SAVI of 0: it's most likely not galamsey, because it falls within that space. That's really how you should start thinking about the data space and machine learning algorithms: they take the relevant geographic information, but when they step into the data space they shed that geographic proximity. There are ways to include it in the modeling space, but in this random forest example it's not included — the model is only comparing these features, SAVI and NDVI. That's essentially what's happening here, and as I mentioned before, the hyperplane or decision boundary is the key factor in separating the two classes in this binary classification example.
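For reference, a hedged sketch of building those two features — May to October median composites of NDVI and SAVI — from Landsat 8 Collection 2 surface reflectance. The dates are the ones mentioned on the slide; roi, the L factor of 0.5, and the collection choice are illustrative assumptions, not the lab script itself.

// Landsat 8 Collection 2 Level-2 surface reflectance, May-Oct 2019.
var l8 = ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
  .filterDate('2019-05-01', '2019-10-31')
  .filterBounds(roi)
  .map(function (img) {
    // Apply the Collection 2 scale and offset to the optical bands.
    var sr = img.select('SR_B.').multiply(0.0000275).add(-0.2);
    var ndvi = sr.normalizedDifference(['SR_B5', 'SR_B4']).rename('NDVI');
    var savi = sr.expression(
      '((NIR - RED) / (NIR + RED + L)) * (1 + L)',
      {NIR: sr.select('SR_B5'), RED: sr.select('SR_B4'), L: 0.5}
    ).rename('SAVI');
    return ndvi.addBands(savi);
  });

// Median composite: one NDVI value and one SAVI value per pixel.
var covariates = l8.median();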
So how will we actually leverage machine learning to answer these questions? There are questions you should be asking yourself going in. What's your input data? For galamsey we had known locations of galamsey and non-galamsey points, but oftentimes, as technical implementers — all of us in the room — you might just be handed a data set. You're probably not the person going out to collect the information and running with it all the way to the final product; you might be right in the middle — someone hands you the data and you're charged with producing something. I love being in that middle space, but I always take a full stop to understand the input data: what's the data type, the quality, the quantity, the evenness between classes? If I have ten million galamsey points and only four non-galamsey points and I try to build a model from that, I'll have biases and imbalances from the start that will really influence the model, so I try to develop a relatively even distribution between classes. Another factor is spatial distribution: if you collect all your points down by the ocean in Ghana — a high concentration in one location — and then try to model the same phenomenon thousands of miles away, that's going to be really challenging. A widely and evenly distributed point collection for the phenomenon you're trying to identify is critical, and you want to take advantage of the variability, both geographic and environmental. Think about salt: a dried-up lake with a salt bed is bright white, maybe surrounded by desert, so from Earth observation it's pretty easy to identify. But salt can show up in other environmental cases — mine tailings, things like that — and if you're not capturing that spatial variability, or even the human-intervention variability, when you collect your points, your model won't know how to handle those new inputs. So really think about the quality of the data handed to you. It will also change your approach when you do the implementation — you may have to scale your geographic prediction way down if all your points are concentrated in one location. Those are some of the caveats to think about and rules for your toolbox when you apply this. Okay, I see a couple of questions — yeah, go ahead, Emil.

Thank you, Tim. Maybe something else to consider, for that slide on key factors with input data: we often think about input data in terms of the field points we're collecting, but something else to consider is the appropriateness of the imagery we're using. Say we're doing a classical supervised classification, maybe with Sentinel-2 imagery, maybe Landsat 9 — also consider the intricacies of that data itself. Folks in West Africa, like folks in other parts of the world, are now also exposed to data from Planet and other sources — I think you brought up hyperspectral imagery a little earlier, Tim — so the number of bands available comes into play: whether the data have shortwave infrared, or additional spectral information that some types of imagery might not have. I also bring up the imagery because of how we process it: are we collecting imagery from an appropriate time of year, and are we doing additional processing on it to match it to what we're looking at? So there are additional things that go into it. As Tim highlighted earlier — he didn't put it in these specific words, but the traditional term I was exposed to in grad school is "garbage in, garbage out." Lots of times we think the model didn't work properly, or we're not getting the outputs we want, when really the input data isn't right. And there are intricacies not just in your field data but in your satellite data, the algorithm you're using, how you tune it, et cetera — Tim and Biplov are going to get into some of those as well. And I know other people have questions in there too.
Yeah, Emil — fantastic point. The other element is being really mindful of the variable space, those covariates you're adding, and you said the key words: garbage in, garbage out. That is so critical to producing anything of value — and I think most everyone on this call knows it already — but the model will only produce what you give it. It's going to make decisions off of potentially poor data, so being really restrictive, and thinking deeply before just applying these methods, is important.

I have one question here: how do you evaluate whether training points are evenly and widely distributed, and whether they're suitable? Great question. I would say it starts with expert knowledge. We're going to use galamsey as the example, and I'm not an expert on galamsey — I should probably be disqualified from validating it to begin with — but people who have that knowledge are capturing unique instances of galamsey mining: high-impact and low-impact instances, large-scale and small-scale. That kind of variability is critical for the model to see. Then there's spatial distribution: if the phenomenon is widespread, you want the field campaign to capture that in the same way. And to Emil's point, you want Earth observation that also captures that same phenomenon — variables that capture it for that time period, or for the optics and physics in the atmosphere. So think along those lines, and apply more scrutiny than you ever think you need. Keep coming back to the question, because ultimately when you run these algorithms you come back to evaluating the quality of the model. It's an iterative process: you bring in your data and your products, you run the algorithm — it runs super fast — and then you have to understand the quality of it: is this even useful or not? And then you go back to the drawing board, over and over. Taking the time to think about how widely distributed the points are, how apt they are for the phenomenon, and how useful the input data is will really improve your model, in addition to the tuning of the model itself — the different parameters inside the random forest algorithm. There's a lot to talk about there, and we'll save some of it for the discussion.

And then the qualities and quantities of the input data: does it comprehensively identify the geophysical attribute? If you're trying to identify a specific thing, do the indices or proxy data you're giving it really tell the tale? How much data do you really need, and what data gaps do you already have? Take that time in advance, and we can talk about it a little more later.
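One quick, practical way to interrogate a point set you've been handed is just to count the labels and look at where they sit before you ever train anything. A sketch, again assuming a points collection with a 'class' property (placeholder names).

// How many points per class? A badly imbalanced table will bias the model.
print('Points per class:', points.aggregate_histogram('class'));

// Where are they? Comparing the footprint to your study area flags a
// clustered field campaign before you try to predict far beyond it.
print('Bounding box of the points:', points.geometry().bounds());
Map.addLayer(points, {color: 'red'}, 'training points');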
One more thing is getting familiar with different algorithms, especially in the classical machine learning camp. There are a lot of different algorithms, and they all do fairly similar things, especially within this image classification bubble, but the data you have might be better suited to a particular algorithm. For instance, random forest can run with a relatively small sample size, whereas something like MaxEnt might need a lot of points, and SVM might need something else. So when you're making your decision, understand your data first and you'll probably come to a conclusion about the best algorithm — and of course that's where you can come to the TensorFlow working group and ask those questions when we get into more detail, because everyone goes down the same sort of checklist when they're doing applied science.

When I say think about your input data, this is a good way to visualize it — again, I'm a visual person. Does your data have a linear quality? If you think about galamsey versus non-galamsey — we don't know what these variables might be — you can see it's relatively easy to draw a hyperplane, so a regression algorithm, an SVM, or a random forest could easily pull this data apart; that's what they're designed to do. But then you might have something where, within the data space, there's a tight cluster that doesn't really interact with the rest of the space — maybe median house prices in a particular census tract, something already geographically tied, but when you look at it in the data space, uncoupled from the geography, you see this tight cluster. That's where a clustering algorithm like k-nearest neighbors would probably be really successful, because it doesn't have to draw an awkward hyperplane; it can just cluster those points as their own instance. And then there's the case with really inseparable data — highly mixed, tangled data — where we can't really use either of those approaches. On day two we'll talk more about deep learning, which helps draw really unique, potentially overfit decision boundaries, and which also lets you apply more knowledge about your training data, like looking at a patch of information as opposed to a single point. That's really exciting on the deep learning side. Yeah, Emil, go ahead.
Just to say quickly, I really like the example with the graphic on the bottom center, because it illustrates two things. A lot of people have seen this type of graphic, or some version of it, in remote sensing books — it provides an idealized view of what spectral signatures look like. In the classical part of machine learning, ideally you'd see big differences in spectral signatures between the things you're looking at. Imagining folks from different parts of West Africa, a problem set could be differentiating two types of crops — millet from sorghum, or maize from millet. Ideally there would be spectral differences there, but of course it's not until we actually do the testing that we realize that in many cases it's not so idealized — that's one. The second part is what you have highlighted there, Tim, regarding the visible bands: let's say we're only using data from Planet, or one of the others like QuickBird or IKONOS, that only have visible and near-infrared data. If the features we're trying to differentiate are all really close in the parts of the spectrum we have data for, and we don't have the shortwave infrared where the big differentiation between the classes is, that creates a challenge for teasing things apart. But as we're going to see in the second part tomorrow with Tim and Biplov, that's where newer techniques like deep learning have come in with different ways of separating these things. So I really like the example — thank you.

Yeah, and as Emil mentioned, this figure is such a staple — we have an ongoing joke in our office because we see it all the time. It's so telling of what you're driving toward: trying to identify the key physical or optical property that really helps distinguish the classes in your mind. You're applying a data science lens and a geophysical or applied science lens and trying to merge them. Another good example is elevation. We're doing a project in Bhutan looking at low-lying rice fields versus terraced rice at elevation, and it doesn't matter how effectively you can identify the greenness of the rice or the wetness of the fields — the biggest telling factor was elevation. It was one of the best features from the start, just because of the way the rice is cultivated. Knowing that kind of thing, and taking advantage of it, can make a really informative input variable.

Okay, so we've talked a little about the different algorithms, and one of the points I had here is to get familiar with them. In classical machine learning this is by no means all of them — just a few you probably interact with, like CART, BRTs or GBRTs, or random forest. I like this figure a lot because, for the same kind of data as the last example, it shows three or four algorithms — decision trees, SVM, Bayes, and nearest neighbor — and how they perform on the data, theoretically. You can see they're all basically doing the same thing: they might do it slightly differently, but they're still segmenting the data. So you can take advantage of any of these algorithms — some have pros, some have cons, but they do pretty similar things. I'd say be open to learning new algorithms, but also learn the caveats associated with them.
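Because the classical classifiers in Earth Engine share one interface, trying a few on the same table is cheap, which makes that kind of comparison easy. A sketch, reusing the hypothetical training table and covariates image from earlier.

// Same training table, different algorithms: CART, SVM, naive Bayes, random forest.
var bands = covariates.bandNames();
var cart  = ee.Classifier.smileCart().train(training, 'class', bands);
var svm   = ee.Classifier.libsvm().train(training, 'class', bands);
var bayes = ee.Classifier.smileNaiveBayes().train(training, 'class', bands);
var rf    = ee.Classifier.smileRandomForest(100).train(training, 'class', bands);

// Each one can classify the same image; compare the maps and the error metrics.
Map.addLayer(covariates.classify(rf), {min: 0, max: 1}, 'RF classification');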
And this is a good example of random forest — what's actually happening when we develop these hyperplanes. It segments the data space: I'm going to make a split here, then segment the data here, and then here. With the three lines it draws to segment the data, we end up with a pretty good classification — only one misclassified location. But if we wanted to capture that one point and really push the algorithm, we'd end up creating six total splits — three times the partitioning — just to capture that one location. That's a good demonstration of an overfit algorithm, because what happens when a new red point shows up right there? It's going to be misclassified. So you want to balance the specificity of the model against its flexibility to handle new data points. This is really about good modeling practice, but I love this representation of what you're interacting with during the iterative modeling stage.

Okay, now let's go hyper-basic. I gave this training to folks in Bhutan in May and a lot of people said it made perfect sense, so I wanted to include it — it might be too basic, but let's keep it anyway. We've got an input of data with two classes, ones and zeros, and within these classes there are some features: some are colors, red or blue, and another feature is whether they're underlined, yes or no. Looking at the top line, we want to identify a new entry that just popped in — another zero, and it happens to be a red zero. How do we separate this? It's kind of complicated: two classes, a couple of features, a new input. We can use a decision tree. The first question might be: is this new entry red, yes or no? That gives us an immediate split right off the bat. The second step: is it underlined? We split again. Within these two rules we can identify the entry that is red but not underlined. So with a very simple decision tree you can find something relatively nuanced within your data. Now we're just going to scale that way up, with way more features and way more complexity, but in a sense that's what we're driving at — going from one tree to many trees, which is where we get into the world of random forest.
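Just to make that toy tree concrete, here it is written out as plain JavaScript — two questions, two splits. This is not Earth Engine code, and the attributes are the made-up ones from the slide.

// Is it red? Is it underlined? Two splits isolate the new entry.
function classifyEntry(entry) {
  if (entry.color === 'red') {
    if (entry.underlined) {
      return 'red and underlined';
    }
    return 'red, not underlined';   // the new red zero lands here
  }
  return 'not red';
}
print(classifyEntry({color: 'red', underlined: false}));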
I always like to talk about random forest because it's probably the most applied version of machine learning: it's relatively straightforward to use, it's available in Google Earth Engine, and it lends itself to a lot of the EO classifications we already do. The power of random forest is the ability to ensemble. In the earlier example we had a single decision tree that came to a conclusion — just one tree. Where we make a forest is where we have multiple trees: tree number one comes to a conclusion, tree number two comes to a conclusion, and so on for N trees — you could specify doing this a hundred times over — and then it takes a majority vote across all of them to identify the best class. It's using a large set of computations to come up with a final solution, so that single tree has become an entire forest.

The random factor is on the next slide. Starting from population P with, say, three classes — a blue class, an orange class, a red class — we take a random subsample of the data to build our first tree of the random forest: grab these four points, for instance, then replace the data and sample again. That's random sampling with replacement — we may get the same point again, we may not — and you can see we're subsampling the population to create that randomness, so each tree builds its relationship from a different subpopulation. On top of that, we also randomize the features. In the last example, the single decision tree with the underlined ones and blue zeros looked at all the features simultaneously to make its final decision; in this random forest example a tree might make its decisions from only two features — over here, feature one and feature four are the only ones used in this very short (not very deep) tree. So within each tree, not only does it get a random subset of entries for its subpopulation, it also randomizes the features it uses to draw its conclusions — and the forest still uses the majority rule across all of them. You're mixing a lot of variability into the model across all these trees, which is advantageous for dealing with biased data and the nuances of data, and for drawing relationships you couldn't have thought of to begin with — you're allowing the algorithm to draw those connections. The randomness is really powerful.

Something else I'll mention here is bagging. That random sampling with replacement is actually bootstrap aggregating — that's what bagging is — and it randomly selects samples and features in order to construct a collection of decision trees with controlled variance. That controlled variance is really important: the forest still looks at all of your features, but in these random subsets. Another factor: when you pull a subpopulation down from population P, you can measure the out-of-bag error — essentially how wrong each tree is on the data it didn't see — so there are metrics you can pull out about the value of this modeling framework. A couple of other key factors, shown in gray (or tan) here: the number of trees — you can do 100 trees or a thousand, adding more and more controlled variability — and the depth of the trees — in principle, with a million features a tree could be very deep, but you don't necessarily want that if it's computationally expensive. These are parameters you give to the model, and then it does the final ensemble.
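Those knobs map directly onto the arguments of Earth Engine's random forest classifier. A sketch with illustrative values — these are not tuned recommendations for the galamsey lab.

// Trees, features per split, bag fraction, depth control and seed are all arguments.
var rf = ee.Classifier.smileRandomForest({
  numberOfTrees: 200,     // how many trees vote in the ensemble
  variablesPerSplit: 2,   // random subset of features tried at each split
  minLeafPopulation: 1,   // minimum samples per leaf (one way to limit depth)
  bagFraction: 0.632,     // fraction of the sample bagged for each tree
  seed: 42                // fix the randomness so runs are repeatable
}).train({
  features: training,
  classProperty: 'class',
  inputProperties: covariates.bandNames()
});
var classified = covariates.classify(rf);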
That's a lot to consume, but this is what's happening, and where the value comes from, from a random forest perspective — and we'll get into it in the remaining 30 minutes. The pros and cons, just for random forest: it improves on the single decision tree from that first example because it's basically a large committee — a huge set of trees that it ensembles across to reach a majority solution, which is really valuable. It works to reduce overfitting through bagging, and it gives you the out-of-bag error as a metric, so you can deal with overfitting as long as you're aware of it. It's easy to implement, it performs well with a low number of points, and it's computationally inexpensive — all it's really doing is drawing statistical trees, so it's low effort. The cons: it can't handle tangled data — remember the really tangled data a couple of slides back; it will make a really overfit mess of that. And the one key bullet I'll highlight: if the problem is super simple, you may not need random forest. You may not need all of this "magic" — you might just need a simple linear regression, or maybe you're just looking at a spectral peak. I'll get on my soapbox for a second: once you learn these skills, they become a hammer and everything becomes a nail. I'm guilty of it — I'll use machine learning for things I don't need it for. So be aware of that and don't overcomplicate it. And this algorithm is still prone to overfitting; you can take care of that by monitoring your out-of-bag error, but it's all about the application.

Emil said it already, but there are two major takeaways. The first is garbage in, garbage out: you're going to get a bad model if you don't spend the time understanding the qualities of your input data and input features, and it's going to be on you to understand why. The last quote is from George E. P. Box — when I worked at the USGS this hung in the modeling room above the computer: "All models are wrong, but some are useful." I love this quote because this famous statistician is saying, straight up, that we have all these tools at our disposal and they're all going to be wrong: they're going to get pixels incorrect, they're going to misclassify things, every model will be wrong. But it's up to you, the stakeholders, and the technical teams to understand the value of what's been created. So always keep that in the back of your mind: all models are wrong, but some are useful. That's a great takeaway.
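One common way to put a number on "useful" is to hold back part of the labeled table, classify it, and look at the confusion matrix. A sketch with the same hypothetical names as above; the 70/30 split and the seed are arbitrary choices.

// Split the labeled table roughly 70/30 before training.
var withRandom = training.randomColumn('random', 42);
var trainSet = withRandom.filter(ee.Filter.lt('random', 0.7));
var testSet  = withRandom.filter(ee.Filter.gte('random', 0.7));

var rf = ee.Classifier.smileRandomForest(200)
  .train(trainSet, 'class', covariates.bandNames());

// Classify the held-out rows and compare label versus prediction.
var tested = testSet.classify(rf);
var matrix = tested.errorMatrix('class', 'classification');
print('Confusion matrix:', matrix);
print('Overall accuracy:', matrix.accuracy());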
So now we've got our hands-on lab. We have about 25 minutes left, and we might be able to squeeze in a little discussion as well. We're going to talk about galamsey, and this is where I'll switch over to Google Earth Engine — just raise your hand if you can't get access to it; I shared it before and it should be in your Reader. Let me switch to that tab. Go right ahead — I think it's Frederick.

Frederick: From what you just explained on random forest, it gives me the idea that if, for example, ten different individuals ran that model with the same input parameters, they're likely going to get different results — is that correct?

That's actually not correct, but there's something we'll talk about called setting the seed value. What we're taking advantage of with these machine learning algorithms — this randomness — is a stochastic property: we want to take advantage of the randomness of a population, sample from population P, make these subpopulations, and build a relationship off of them. The challenge is that if you came back in an hour and ran it again, you might get a different result, or someone else might get a different result. So you have to set the random state — that's where you set the seed, and we'll do that in the random forest script. It lets you set the same properties for your stochastic behaviors, so you get the same repeatable output while still using a random property. Someone much smarter than I am worked that out a long time ago — just make sure your random seed is set.
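That seed answer, in code: every place randomness enters — the train/test split, any random sampling, and the forest itself — accepts a seed, so ten people running the same script get the same map. The values here are arbitrary; the point is that everyone fixes the same ones.

var SEED = 42;  // any fixed integer, as long as it stays fixed

// The random split and the random forest both take the seed explicitly.
var split = training.randomColumn('random', SEED);
var rf = ee.Classifier.smileRandomForest({numberOfTrees: 200, seed: SEED})
  .train(split.filter(ee.Filter.lt('random', 0.7)), 'class',
         covariates.bandNames());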
And then for the spatial locations, I would say it's really tied to the variability on the landscape. If you're seeing a particular phenomenon happening in just one location, and it's hyper-specific, you may not need points from all around the Sahel region, right, you might only need points for that location to model that behavior. But if you then project that prediction far beyond the bounds of where you have data, you're really extrapolating, you're just projecting what the model thinks should be there, and the confidence in those much-further-away locations should be much lower. And random forest doesn't actually give you a metric to understand that extrapolation. Other algorithms do; MaxEnt is one example that actually tells you when your model is starting to project into a space where it's giving you a good statistical value but a low-quality output. So different algorithms handle this differently. I think that maybe answers your question, and I'll just keep moving because we've got... okay, some people can't find the script. So when you open up the Scripts panel on the left-hand side, it should be within Reader, right over here in Reader, and it should be under tjm, probably way down in this location, and I'm just going to... yeah, go ahead Emil. Yeah, I was just going to say, sometimes what happens is, even though it adds the script, you have to use the little refresh button at the top to refresh what scripts you have access to, that same button Tim is showing there. After you do that it should show the tjm0042 slash West Africa ml training repository. So click the link that Tim sent, and if it doesn't show up just hit refresh and it should be there; if you still don't see it, let us know in the chat. Thank you, yeah, that's a great point, sometimes it has a weird lag. So I'll just keep moving along, because this is kind of the parting gift, you can use it in the future as an example, and the scripts are relatively straightforward. This first script, let me zoom back up here, under Owner, sorry, is the SAR processing one. I won't go into it in detail because it's all remote sensing rather than machine learning, but I wanted to highlight that we have this very detailed Sentinel-1 pre-processing script that produces a radiometrically terrain-corrected, Lee speckle-filtered Sentinel-1 image for the ascending orbit. The reason we do this is what Emil brought up before: understanding the quality of your input data, especially on the covariate side. You want to do a lot of pre-processing, and understand that pre-processing, to get a product that's valuable, and the GRD data that's available in Google Earth Engine out of the box is not great. So we added this script to produce a pre-processed image collection, which is then available in the next script. It's essentially just a pre-processing script, but I didn't want to skip over it; I want you to have it and think about it.
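If you just want a feel for pulling ascending Sentinel-1 in the Code Editor, here is a crude stand-in only (a simple focal median for smoothing; the actual script one in the repository does the radiometric terrain correction and Lee speckle filtering, which is considerably more involved, and the dates and area below are placeholders):

var aoi = ee.Geometry.Point([-1.6, 6.7]).buffer(20000);      // placeholder area in Ghana
var s1Asc = ee.ImageCollection('COPERNICUS/S1_GRD')
  .filterBounds(aoi)
  .filterDate('2017-01-01', '2017-12-31')
  .filter(ee.Filter.eq('instrumentMode', 'IW'))
  .filter(ee.Filter.eq('orbitProperties_pass', 'ASCENDING'))
  .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
  .select('VV')
  .map(function (img) {
    return img.focal_median(50, 'circle', 'meters');         // crude speckle smoothing
  });
print('Ascending Sentinel-1 scenes:', s1Asc.size());
Map.addLayer(s1Asc.median().clip(aoi), {min: -25, max: 0}, 'S1 VV, smoothed');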
So we'll jump down to the second script, which is the meat of what we're going to be talking about, and the final image collection from script number one is available in here as an open asset that anyone can use. Feel free to read through it; it has nine distinct steps, and all of it is already commented out, so we'll go quickly through the code and uncomment it together, and then we can ask questions, feel free to stop me at any point. So the first step is defining our date and our region of interest. I've arbitrarily selected 2017, just looking at some of the galamsey data that was available in the West Africa Google Earth Engine asset space; that seemed like a good input for the galamsey data and also for the EO data, thinking of SAR, since it's not really available until 2017, so we wanted a later time step with more covariates available. Then we've got Ghana, we selected that country, and we've added a geometry, because we don't want to look at the entire country, we just want a sub-region. That's this geometry right here; let me scroll in, we've got this smaller location near the ocean, and then we've intersected that location with the full country. So if I hit run with just these few lines included, we can see we've got this ROI, which is that box intersected with the country bounds. Now we want to add some input variables, so I'll go to section two and uncomment all of it. We've got Landsat 8 surface reflectance and we've got Sentinel-2, and I pulled this Sentinel-2 example in from some of the other galamsey scripts that Jacob had shared with me, some of the other approaches that are being used. So I'm just grabbing different datasets; by no means is this the way it should be done, definitely take more time to understand the nuances of each of the input EO datasets, but I wanted a lot of examples so we can run the algorithm. So we've got Landsat 8 and Sentinel-2, and we've created these NDVI composites with Sentinel-2 in particular, taken from some of the ongoing work at West Africa. If we uncomment this and hit run, step two just adds these to the map, so we've got some data, which is great, let that load, perfect. Now one thing we should be thinking about, let's uncomment all of this, is this single function here called optical indices. It takes an image collection and a region of interest; we've got a region of interest already, and we've got a couple of image collections defined up above. What it's going to do is remove all the clouds: it goes into the image collection like a deck of cards, pulls out a single image, goes across every pixel, identifies where a cloud might be, and masks that pixel. That's the first thing. Then it takes that image and performs operations on it, it might compute NDVI, NDWI, all of these different indices, it does all those calculations, puts that card back in the deck, and returns a new image collection with all of those bands added. So it's a pretty mighty set of functions that produces all of these different indices automatically. So we've got this optical indices function, and here's a trick question: when we hit run here, what happens? Nobody's answering; this is kind of hard, and I told everyone it's a trick question: nothing will happen. That's the trick, because we're not calling this function on anything; in Google Earth Engine it's not actually going to run until we apply it. We've just defined a function, we haven't run it on anything yet.
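As a rough sketch of what a function like that looks like for Landsat 8 Collection 2 Level 2 (the band math and the names here are illustrative; the training script's optical indices function computes more indices than this):

// Mask clouds and cloud shadows with the QA_PIXEL band, then add two indices.
function maskAndAddIndices(img) {
  var qa = img.select('QA_PIXEL');
  var clear = qa.bitwiseAnd(1 << 3).eq(0)        // bit 3: cloud
      .and(qa.bitwiseAnd(1 << 4).eq(0));         // bit 4: cloud shadow
  var ndvi = img.normalizedDifference(['SR_B5', 'SR_B4']).rename('NDVI');
  var ndwi = img.normalizedDifference(['SR_B3', 'SR_B5']).rename('NDWI');
  return img.addBands(ndvi).addBands(ndwi).updateMask(clear);
}

var roi = ee.Geometry.Rectangle([-2.3, 5.8, -1.3, 6.6]);     // placeholder sub-region
var l8WithIndices = ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
  .filterBounds(roi)
  .filterDate('2017-01-01', '2017-12-31')
  .map(maskAndAddIndices);                       // nothing is computed yet

Map.addLayer(l8WithIndices.select('NDVI').median().clip(roi),
    {min: 0, max: 1}, 'Landsat 8 NDVI median');

The last line is also why the trick-question answer is "nothing": map only describes the computation, and Earth Engine doesn't evaluate it until something like Map.addLayer or print asks for the result.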
Okay, and then we add in our Sentinel-1, the pre-processed ascending indices that are available within the assets space, everyone should have it as well. So now we want to apply the function I mentioned before: we've got this variable, the optical indices output, taking in our Landsat collection and our region of interest, and producing all of the indices we mentioned. Now in addition to all of that, once I hit run we've added a couple more datasets. We added the JRC dataset, from Pekel et al. at JRC, which is total water occurrence over several decades using all of Landsat; obviously, because galamsey is highly tied to water, to changes in water, this could be a great variable. And then we've also taken in this urban population layer and looked at where the urban centers are, because obviously galamsey is not happening in city centers, and those phenomena, urban and galamsey, can look quite close to each other, so we wanted a variable that helps distinguish them. So we've done this aspect right here: if I scroll over, we've got this variable called band prep, which adds the Shuttle Radar Topography Mission elevation, all of those indices, the JRC data, the urban population, as well as the Sentinel-1 and the final Sentinel-2 NDVI. So we've got a ton of bands we've added together, and essentially this creates a single stack, one multi-band image that has all that information for us. So when I hit run, in the console we can see this band stack has 20 elements: the normalized difference ratio for Sentinel-1, the settlements from the urban information, the maximum extent of water from JRC, and so on. There's a lot of nuance here that we could go into in further detail, I know we're running out of time, but you can see all these different bands, and there's some logic as to why we would include each of them.
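A stripped-down sketch of that kind of band stacking (real dataset IDs, but the settlements layer and the pre-processed Sentinel-1 asset from the script are left out here, and the variable names are illustrative, not the script's):

var roi = ee.Geometry.Rectangle([-2.3, 5.8, -1.3, 6.6]);            // placeholder sub-region
var srtm = ee.Image('USGS/SRTMGL1_003');                            // elevation
var slope = ee.Terrain.slope(srtm).rename('slope');
var jrcWater = ee.Image('JRC/GSW1_4/GlobalSurfaceWater')
    .select(['occurrence', 'max_extent']);                          // Pekel et al. water history
var ndviMedian = ee.ImageCollection('COPERNICUS/S2_HARMONIZED')     // TOA, just to keep the sketch short
    .filterBounds(roi).filterDate('2017-01-01', '2017-12-31')
    .map(function (img) {
      return img.normalizedDifference(['B8', 'B4']).rename('NDVI');
    })
    .median();

// One multi-band image; this stands in for the script's band prep stack.
var bandPrep = ndviMedian
    .addBands(srtm.rename('elevation'))
    .addBands(slope)
    .addBands(jrcWater)
    .clip(roi);
print('Predictor bands:', bandPrep.bandNames());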
Okay, so now let's get into the data prep aspects, and this is where you can get engaged and do it on your own. If you read through here, this is where you add a marker for each of the locations. If I scroll up, we can see I've added a user-added galamsey layer and a user-added non-galamsey layer, and if you click on one of these points, the CID is one, so those are the galamsey locations. You can scroll around the map, let me zoom in, find locations where there is galamsey, and add more points. I wouldn't necessarily recommend this for a real study design, right, you want good field data, you may be using Collect Earth Online, but for the sake of running this algorithm you can add more data this way, so go ahead and add more locations if you want. And then for your non-galamsey, we've got intact forest, I can just add some more non-galamsey, and those points are already set up to have a zero, so those are the non-galamsey locations. So that makes sense, right, we're adding more and more training points that are spatially distributed; again, you'd want to apply more restrictions and good rules of thumb in the future, but this is just about getting the algorithm to run. The script walks you through adding more locations, you can do 20 or 30 of each, and I've already included some to begin with. Then let's open up step 5.3. If we hit run on step 5.3, let me make this a little bit bigger, we can see these user-added locations: it's telling me I've got 28 user-added points for galamsey, plus the 100 that were already there to begin with, I just added a couple more, and at the same time we're also adding the 2017 points that are available. Then we want to merge all this data together, and this is where, any time the CID is one, we convert it into a galamsey point. We're just doing a little bit of data handling to make sure the input data is zeros and ones for galamsey and non-galamsey, and we're selecting that year. The final output, these galamsey points, if we look in the console, the first feature has properties CID 1, year 2017; it does all the handling already, and that's what we want to give to the algorithm. So now let's go into step six, as I race against time. We've got this training sample: essentially, across the entire study area we've got all these pixels, and we want to go to the locations where we've got input information and sample that region. We go to that geography, to that feature collection, and sample the galamsey points and the label information, so this creates our training sample, thinking back to the theoretical part where we did all that work to create good training points. Then we add a random column, where essentially we split the data 70/30: 70 percent of the data is used for training and 30 percent for testing. It's a pretty standard approach, and that's what we're using here. When we print this we can see the dynamics of the data, so this can also be a good check on what I talked about earlier, the balance of your data. Let's let this run and see how many points we have for testing and training. It's running, I think, okay, it's thinking; as it does the sample regions step it's extracting all of the features in there, so I'll minimize that while it's still thinking.
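A minimal sketch of that sampling and 70/30 split, assuming a bandPrep predictor image like the one above and a galamseyPoints feature collection with a numeric cid property of zero or one (both names are stand-ins for whatever the training script actually calls them):

// Extract the predictor values under each labelled point.
var samples = bandPrep.sampleRegions({
  collection: galamseyPoints,    // points labelled cid = 1 (galamsey) or 0 (not)
  properties: ['cid'],
  scale: 30,
  geometries: true
});

// Add a random column with a fixed seed so the split is repeatable, then split 70/30.
var withRandom = samples.randomColumn('random', 7);
var training = withRandom.filter(ee.Filter.lt('random', 0.7));
var testing  = withRandom.filter(ee.Filter.gte('random', 0.7));
print('Training size:', training.size(), 'Testing size:', testing.size());

// A quick look at class balance in the training set.
print('Training class counts:', training.aggregate_histogram('cid'));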
So then step seven is where we define the random forest. If you read the documentation for the random forest classifier in Google Earth Engine, it needs a feature collection, a list, and an image, so we've created a function here that does all of the operations you need in a relatively straightforward way. Essentially it pulls in the sample we made, our list, which is our band names, that's this right here, and the CID as our label: we treat CID as presence, zero or one, one being galamsey, and that's within each feature. Our band list is up above, so we've got all the information we need for the random forest right here. And I'm just going to go ahead and add step eight, since we're running out of time, I apologize, and hit run so we have a product to work with. Let me make this a little bit bigger, okay. So now we've added all the layers and the random forest. Step eight is where we actually apply the random forest model: the training data, our bands, which are these 20 bands right here, and our band prep, which is that image, well, up here, band prep, right here, yep, the image with all of the different bands. That's what we give to the algorithm, it runs in this function, and it provides all of these outputs. It's going to give a chart of the variable importance: by performing the random forest, it uses dot explain to take advantage of the decision trees that are produced, and it then plots that information in this histogram. So if we open it up we can see that, of the variables we included, settlements, MNDWI, NDWI, so those water indices, settlements, and the normalized difference ratio from Sentinel-1 seem to be important, while these other ones are not. Again, this is just a tool to be thinking about when you're in that iterative modeling stage. And then we've got all of these different metrics, which we could go into in more detail if you want, but we're essentially out of time, and you can see the model output here. So let's zoom in: this isn't the world's best classification, but it could be useful, right, all models are wrong, some are useful. With just some indices and about 30 seconds of run time, we've got a classification that identifies some of the galamsey locations relatively well; it's also misclassifying pretty significantly in places, like right over here is a great example. So this is exciting, definitely check it out. And then I'll end with this final script over here, script number three. It's a bit more detailed, it's the entire script for the complete lab with far more training points, you've got 400 non-galamsey points, so the model output is much better, and you can go ahead and run through that yourself. So let me zoom back to the correct tab here. All right, we've got two minutes left; it's hard to do a training virtually in an hour and a half. So this is a walkthrough of all the steps we just did to produce this galamsey classification, and you can use these models, really understand how they performed, and take advantage of the statistics I very briefly mentioned to understand the accuracy. If I zoom back to the tab over here, it's a bit sluggish, the training accuracy was 99 percent, yet we know it got things wrong, we saw locations that were incorrect. So just because it's giving you a good statistic doesn't mean the model is good; be thinking about that, and add a caveat to some of these statistics when you're reporting them, because you might get a really high value while the map doesn't make perfect sense. You might want to use a different statistic, like Cohen's kappa coefficient, to help tease out the value of your final product.
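Putting those last steps together, here is a hedged sketch of training, applying, and checking the classifier, again assuming the training, testing, and bandPrep names from the sketches above; the parameter values are illustrative, not necessarily the script's exact settings:

var bands = bandPrep.bandNames();

// Step seven: define and train the random forest; the seed keeps reruns repeatable.
var rf = ee.Classifier.smileRandomForest({
  numberOfTrees: 100,
  bagFraction: 0.5,
  seed: 7
}).train({
  features: training,
  classProperty: 'cid',
  inputProperties: bands
});

// Step eight: classify the predictor stack and look at what the model learned.
var classified = bandPrep.classify(rf);
Map.addLayer(classified, {min: 0, max: 1, palette: ['green', 'red']}, 'RF classification');

var explained = ee.Dictionary(rf.explain());
print('Variable importance:', explained.get('importance'));
print('Out-of-bag error estimate:', explained.get('outOfBagErrorEstimate'));

// Accuracy on the held-out 30 percent: overall accuracy can look optimistic,
// so the kappa coefficient is worth reporting alongside it.
var testMatrix = testing.classify(rf).errorMatrix('cid', 'classification');
print('Test error matrix:', testMatrix);
print('Test overall accuracy:', testMatrix.accuracy());
print('Test kappa:', testMatrix.kappa());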
Okay, so we've got a minute left, and I think if a couple of people want to run a little long we still have time to do the Kahoot, the final quiz to see who the winner is. So let's put that up, because I know we're already over time, but I feel like we'd be losing out on a good opportunity to do a fun test. Let me share this tab. I know we're over time, but if everyone wants to participate you just need to go to kahoot.it and enter this PIN; that will let you join the game, and there's about five or six questions and we'll crown a winner, keep the excitement high for when we jump into deep learning, where hopefully we'll have a lot of discussion space. Also, I know I ran over on the discussion tonight, so SERVIR folks, feel free to add everybody else, and I'll give it a couple of minutes. While we're adding folks, let me read through the comments, because I know I ran long, in case there's anything else: "can't find the script", "me too". All right, if that didn't work, okay, here's the script again. So, as people are joining, any questions, since I zoomed through the random forest? Well, one thing I'll mention while people are still joining in: if you look at script number two, within the random forest step, step number seven, there are properties like number of trees, variables per split, minimum leaf population, bag fraction, max nodes, and the seed, which we set to seven; that's the stochastic behavior I mentioned before. Those are all model parameters that are just set, so if you want you can change those numbers. If you add more trees, that's just more models being run laterally, and it ends up being more computational effort, so think about that as one of the downsides of adding more trees; the value is that you get to control that variability. So you can play with those parameters to improve the model, and again, that's that iterative component. But when you actually implement the random forest we just ran through, you can see how simple it was: it's just making decisions, drawing those boundaries, based off of the 20 variables we gave it. So I see we've got 19 people in, with 43, 45 people on the call excluding myself; I'll give it a few more minutes if people are still switching over, it's just kahoot.it. All right, we're at 24 people, we're going to roll. Okay, for everyone who hasn't used Kahoot, the way this game works is we start the quiz and you have to respond, and you get points for being accurate: if you know the answer and answer it correctly you get points, and you also get points for how fast you answer. At the very end it tabulates not only how correct but also how fast you were, so you get more points for being correct and additional points for being fast. I know we're up against network issues, but this is all just for fun anyway, so let's go. Okay: machine learning algorithms build a model based on blank in order to make predictions without being explicitly programmed to do so; they build models on blank. Is it garbage in garbage out, training data slash sample data, cats and dogs, or inference slash prediction? Nice, okay, some good answers. So you can see how fast it goes, right, not only do you have to get it right, you have to do it quickly, because there's only about a 30-second timer. All right, next question. Oh, we can see the scoreboard; Foster is winning right now, just barely though. Okay, and then let's go to the next one. True or false: an input feature is an individual measurable property or characteristic of a phenomenon. True or false? Getting a lot of responses here, okay, a few seconds, one second, nice, a lot of good answers, let's see how the scores shake out. Mary zooming up the leaderboard, very nice. Okay, next question. True or false: a decision boundary is a hypersurface that partitions the underlying vector space into two sets, one for each class. True or false? Nice, a lot of good answers. Okay, let's see where everyone shakes out, that's awesome, look at all these people zooming up the leaderboard, highest streak three answers in a row. All right, next question. True or false: random forest uses bagging to randomly select features in order to construct a collection of decision trees with uncontrolled variance. Twenty-one answers, ooh,
a split camp, right down the middle. All right, that means we've got to go back to that one. Okay, let's see where we're at, people zooming up; I think we maybe have two more questions. True or false: diversity of solutions is a valuable trait. True or false? Nice, yep, that's great. Okay, maybe this is the last one; it's a close race, everyone's in the same sort of space up here. All right: random forest is a blank learning method for classification that operates by constructing a multitude of decision trees. Is it a ubiquitous, a single, an empty, or an ensemble learning method? Nice, a lot of people put ensemble, that's wonderful. Okay, next, this might be the last one; let's see the scoreboard, the leader is well out in front now. Okay, true or false: George E. P. Box said all models are wrong and some are useful. True or false? Nice, okay, let's see who's on the podium: Mary in third, Jacob takes the lead at the very last, oh, that's awesome. Well, a round of applause to everybody for having a good time at the very end. I know we ran long by 10 minutes; this is day one of two, and I know I just zoomed through the code, which is probably what everyone wanted more of, but we'll have time tomorrow with some discussion space as well. Yeah, and Foster, go ahead. Yes, thank you very much for the GEE script, and I just wanted to find out, from your experience, whether the random forest algorithm will do as well, given that it's simple to implement and not too expensive to use; I wanted your take on that in relation to adopting a deep learning approach. Do you think we can go for a machine learning or a deep learning approach? Yeah, great question. I feel like I'm constantly thinking about which one to use; it's like shopping for a car or a house, which one should I buy, what do I really need? With something like random forest, it's very simple to implement, and it's relatively easy to understand how you got from where you started to your final product, although there is some space within the random forest that's decoupled from what's easy to understand, that black-box nature of it. And because with galamsey the phenomenon you're looking at, impacted riverways, is relatively easy to distinguish from forest and urban, right, those classes aren't very mixed, you can identify it pretty easily from space, which we just did with that random forest example, in that sense it's pretty straightforward. I think you could get away with a random forest model and really fine-tune it. For the other element of your question, though, deep learning: I think it's really valuable to learn deep learning, to actually have that skill set and leverage it, because the field as we know it is really advancing in that direction, because deep learning takes advantage of more and more data. With random forest you're going to have to go and conduct these field samples every year, it's really data intensive in that way; deep learning is even more data intensive, but it can rely on more and more data that you may not even have to collect yourself. So the field is shifting in a way where you'll be able to take advantage of more and more datasets and get a conclusion that's really valuable, all in theory. And the other element
that's really valuable about deep learning is that it's not tied to a single pixel. In the example we just did, dropping points in the code, random forest builds the model as a per-pixel classification, but with deep learning you can actually use a full area, a neighborhood of pixels: you take a patch of pixels and use the feature space within that neighborhood to develop a model, so it can look at a wider phenomenon, a bigger aperture of space. And that's super important when you have a really complex challenge, really tangled data in that sense. So there are pros and cons on both sides, and Foster, to your point, I would always start with the simplest approach first: can you identify galamsey in this example with just a spectral analysis, say looking at NDVI change, is that enough? If that's the case, it's so simple that you can course-correct and develop a service that leverages something that simple. If you can't, then maybe try a regression or a random forest, and if that still doesn't net value, then you can go to deep learning and really start tackling that tangled data problem with an even more complex set of tools. That's what I would recommend, because those simpler solutions are still super valuable. Any other... oh, go ahead, Glory. No, no, I'm fine, thanks, someone else put up their hand before. Okay, maybe we should start wrapping up now. Yeah, I think if there are no other questions, and if you do have questions, write them down and we'll have some space at the beginning of the day tomorrow to address them. Take a little bit of time to look at those scripts that I shared, because I know I just zoomed through them, but we'll have some space to talk, and this is all about getting everyone excited about machine learning and deep learning, asking a lot of questions, and participating in the TensorFlow working group. So definitely be prepared for tomorrow, and thank you all. Bako, I see your hand up, over to you. Yes, thank you, Tim, so I have one question or concern: for those who are new to this, or almost beginners, do you think it is easy for them to follow? Thank you. Yeah, it's a good question. It's hard, I would say super challenging, to see how well a training like this lands, just because it's virtual, right, you're not really able to walk around the classroom and see how people are progressing. So that's where I really rely on everyone to let us know, on the training side, what doesn't make sense, and you add in the factor that it's a short amount of time, so those are the things we're up against. Definitely let us know what would be valuable. And the beauty of this, Bako, is that this is just the very beginning, so if folks are still interested, this is just scratching that initial itch: we have a global group that meets every two weeks that starts asking these questions and gets folks more skilled in that direction, so we've got a couple of mechanisms in place to make it valuable. And I would maybe put the onus, wearing my teacher hat here, on the students: it's for everyone participating to find the value of the training and then come back with application questions, how do I use this, I don't understand these four
components but I want to understand these four components, how do I get to that end product. And I think we've had great success in the past working with a lot of the different hubs to get to that operational space, so hopefully that answers your question and spurs folks on as well. I see a couple of other questions. To add on to what Tim is saying, yeah, I think it also comes down to how hands-on you get with your own problem; again, this is just a pointer for you to get started, and there are plenty of materials out on the internet that you can always go back and refer to, but it comes down to the problem you're trying to tackle and how heavily you want to get your hands dirty. It's always going to be learning by doing, so I'll just add that out there. Yes, and to also add to this, Tim, the idea for this training was that we use it as a launchpad in West Africa, so that this actually starts a regular meeting where we can all ask our questions and try to solve problems together in West Africa. In the process of that, if we have any need to call you or somebody else to come and talk to us, we will do that. So I want to encourage everyone in this meeting to be ready to join a group; I will send out a call, and if you are interested please join us so that we take this step by step. Like Bako said, some of us are at the very beginning, so we can grow together and then bring in Tim and Biplov and Emil and Jacob as we go. I will send out a call and we shall start that. Yes, I just wanted to say that I think it's really interesting, Glory; we need to think of mainstreaming this kind of training in a hackathon mode, where we run some kind of online hackathons, because I think the kind of audience that we have, and some of the newcomers on the team, would be very comfortable running these kinds of distributed online hackathons for particular tasks. I know we've discussed it in some cases in the context of new activities we could add to what Foster's team is already doing, but I think this needs to be mainstreamed; it goes really well with the kind of new approach we discussed, Glory, on capacity building. So that's a matter for our thoughts over the course of next year, and as we prepare for the annual work planning meeting next week, those are the kinds of activities I would like to see, because I think they will work very well. Over, and thanks to Tim. So thank you very much, Tim, I think we should start wrapping up. Thank you very much, Tim and Biplov, and I want to also thank everyone who has joined. Please, it is important that you fill out the pre-test form; by tomorrow I'm going to also send a post-test form, and it's very important that you fill it out. If you need me to send it again, I can do that. It's also important that you fill out the attendance sheet; I'm going to send the attendance sheet out after this. So we meet again tomorrow at the same time, 3 p.m. in Nigeria, 2 p.m. in Ghana, and, what time is it, 9 a.m. in the US. Thank you very much. I just want to say thank you all for attending, asking good questions, and being willing to stretch and go into new territories; it's always a fun experience. We're going to
have the great opportunity to listen to Biplov tomorrow talk about deep learning, which is super exciting; you should get your gears turning, like, wow, I can't believe this stuff is possible. So thank you all, write down your questions, because we'll hopefully have some space, I know I just zoomed past it all. Thank you to Glory for setting all this up and being such a great team player, it's fantastic to work with him, and I'll see you all tomorrow. Thank you, everybody, and don't forget to fill out the form.
Info
Channel: TheGeoICT
Views: 2,167
Id: D3WJNmrK4dc
Length: 105min 22sec (6322 seconds)
Published: Wed Aug 16 2023