FOSS4G 2021 - Streaming IoT sensor data with LocationTech GeoMesa, Apache Kafka, and NiFi

Captions
Man, that was awesome. So now I'm going to pretend to put up the "starting soon" banner, because that's not going to work, and I'm going to add Mr. James Hughes in.

Pretty good. How are you doing this morning, Randall?

Oh, pretty good. We're live broadcasting again. I can't seem to get the "starting soon" banner set up right, or mute myself and not talk to people. Anyway, we're going to ad-lib for like three minutes.

That's perfect. I did exactly what you're doing, for this room, for another conference for the past two years; there's a geospatial track at ApacheCon, and it was great fun. One time one of our speakers didn't show up, so the co-organizer and I just talked for 40 minutes to fill the time.

Oh, that's great. I keep trying to put up the "starting soon" banner and it keeps dropping, like, every one second. I don't know what I'm doing. I tried to train myself and that didn't work, and people are over in the chat window like, "No, do this," and I'm like, whatever. How's it going, man? I guess the last time I saw you was in San Diego; I sat in on the GeoServer talks you and Jody were doing, and one of your GeoMesa talks yesterday.

Yeah, that GeoServer talk was probably one of the most interesting and fun conference talks I've given, because I'm not a core GeoTools and GeoServer committer like Jody and Andrea and some of the other folks, but I'm part of the community, and Jody was like, "Hey, do you want to give this talk?" It was a feature-frenzy kind of talk, and I remember at one point a slide came up and Jody and I looked at each other like, "We have no idea what this is about." It was pretty fun.

That was one of the funnest talks I'd watched. Everybody's now suggesting maybe I have to keep the button pressed down. It's 2021; I should be able to push a button and walk away from it. Anyway, awesome, man. Where are you located?

I'm in the middle of Virginia.

Oh, you're not too far up the road from me, then.

And I grew up in southeast Tennessee; a little place called Charleston, Tennessee, next to Cleveland, Tennessee.

Get out of here!

I didn't get out of there, man; we made it eight hours away, to Charlottesville.

Holy crap, I didn't know you're up in Charlottesville. Because you're in northern Georgia, right?

Chattanooga.

Oh, wow. I've gotten lost many a time going through Charleston. You've got that one red light in front of the Hardee's.

Yeah, and I grew up right between the two, so I didn't hang out much in Charleston, the town, but there's nothing there: it's a Hardee's and a Piggly Wiggly and whatever's left of the paper mill that's changed hands several times. Bowater? Bowater, yeah.

Excellent. Dude, you're like a neighbor; I didn't know that. And here we are chatting like it's international; everybody's looking to ask, "What did we walk into?" That's awesome. Today's talk is funded by RC Cola and MoonPie; the MoonPie factory is five miles to my north. I'll ship you some. Okay, so I guess we have now killed enough time to be at eight o'clock. Your slides are up. We've babbled, man. Charleston! That warms my heart. So, next up on our talks; Joshua Campbell's jumping in, too. Oh, as somebody said in here, "I thought Piggly Wiggly was a made-up chain in That '70s Show." No, it's not a made-up chain from That '70s Show; it's a legit thing here. Spreading the truth in the international FOSS4G 2021 media. So, for everybody, let's see, we made it.
We hit 8:01 babbling. So everyone, welcome to the next talk. We're going to talk about streaming Internet of Things (IoT) sensor data with LocationTech GeoMesa, Apache Kafka, and NiFi, and we've got Mr. James Hughes to come in and talk; we just found out we were almost neighbors, if you were here for the diatribe between the last talk and this one. James is a core committer for GeoMesa, which leverages HBase, Accumulo, and other distributed systems to provide distributed computation and query capabilities. He's also a committer for LocationTech JTS and SFCurve. So without further ado, Mr. Hughes, take it away.

Okay, great. Let me just check to make sure the slides are coming through. Thanks, Randall, for the introduction. I'll be talking about GeoMesa. In case you haven't heard of it, I want to give the five-second overview of what GeoMesa is, and that'll also help orient us for what we're talking about relative to GeoMesa this morning. These are the slides we usually use to chat through the overview: GeoMesa is a LocationTech project that is a suite of tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale. We'll take each of those in turn, talk about some of the technologies involved, and then drill into the streaming part we're focusing on today.

To stream volumes of data, we use Apache Kafka for transport. On the analysis side, and this is where a lot of the data science CCRi has done historically isn't open source, unfortunately, we've integrated with Storm historically, we've tried adding geospatial to Siddhi, and we're starting to look at Kafka Streams and ksqlDB to see what we can do with geospatial there.

The persistence side is where we got started with GeoMesa in the beginning, because if you have IoT data streaming in with positional records, that's where you can get to billions and trillions of data points, and you really need that size to justify setting up Accumulo or HBase or any of these distributed databases; they're a real pain to manage, and they're expensive. Yesterday Emilio and I talked about the things on this slide: the databases on the first row that you could be using to store data. On the bottom row there's more of the data lake approach, where we could be using GeoMesa's integration with some of the Hadoop ecosystem's big data file formats, and that includes Avro, Arrow, ORC, and Parquet. The great thing with those file formats is that if you just need to store your terabytes of data, and things aren't changing that much, you can write them once, store them in S3, and be pretty happy with things.

I'll talk a little bit about what I mean by managing data, and that's the data management tool Apache NiFi. NiFi lets you build up really complex data flows, get some idea of how to manage them, see data provenance, and drill into what's going on. If you only needed to get one streaming data source into your enterprise, going to one place, you might just write a really quick application. By the time you need to integrate 100 data sources, if you gave those to a bunch of different developers, they'd all go write little applications, some in this language, others in that language; no one would build in any traceability or metrics to show how things were going, and it would be really hard to do that consistently. NiFi gives you a cool visual programming interface to design those flows.

The last pillar of GeoMesa that we call out in terms of integrations is Spark. That lets folks pull data from the persistence layers like HBase or S3 and run SQL queries on it, or write their own analytics based on RDDs. Being able to do that in a notebook environment is really cool; unfortunately, that's not what this talk is about. We're talking about streaming.

If you had to put together the five-second overview of it: data comes in on the left, there's the GeoMesa converter library, and this is where NiFi would be using some of those tools, to break things out into a speed layer for live processing and live views, and to write data out to HDFS or our distributed databases on the bottom. All of this integrates with GeoServer, which lets us do a lot of the really cool interactions you'd expect from a modern web client, and I want to go ahead and demo that real quick.

One of the systems that we at CCRi help host is for a Canadian outfit called exactEarth. They fly satellite receiver payloads on the Iridium NEXT constellation that receive AIS messages, even from near orbit. If you're not familiar with it, every maritime vessel in the world broadcasts a little ping, every couple of seconds to every minute or so, that says: here I am, here's my status, here's who I am. Before I heard about satellite AIS, I wouldn't have thought those signals would make it to near space, but they do, and once the satellite sends them back down to earth, they can enter a system like the one we're talking about here. All the little green dots we're seeing all over the world are each some shipping vessel, or a boat large enough to be interesting; fishing vessels, for instance.

The thing that happens here is we get a live view of everything we've got. This view is provided by the GeoMesa Kafka data store, which we'll talk about. Just to show some of what you can do with this, we can drill in, pick on individual vessels, and get back information about them. This vessel happens to be called Bluefin, and it's an oil and chemical tanker. Using this UI, one of the things we can do is go back and forth between the live view that we're talking about, streaming with Kafka, and data that we've written to HBase, which was put there by NiFi. This particular tanker, over the last month, has just been hanging out a little off the coast of Buenos Aires. So that's an example; all of that's going out through GeoServer, and we can go back and forth there.

One of the other things we've done a fair bit with is building up live analytics, and I'm just going to clean up the view a little bit. We've built analytics to figure out when, in a stream of data, we see vessels leaving a port or coming into a port, and when we think they might be meeting up to rendezvous. We can also use this same technology to store recent data in GeoServer, backed by Kafka, and so we've got an analytic that tries to figure out when someone may have turned their receiver off. That would be either we've lost their signal, or the box that broadcasts "here's where I am" has been disconnected or turned off, maybe because they're doing something nefarious; who knows.

Anyhow, that's a quick view of what we're talking about when we're building some of these things out. What we saw: a live view of all the maritime vessels in the world; an activities layer, where we saw where a boat had maybe turned its AIS off, plus other analytics that say when the destination for a ship has changed; and historical track data, where we could drill into one particular vessel and see what it did over the past 20 or 30 days. And just like most GIS systems, there's contextual data: we happen to have a CartoDB basemap, you could have had another one, and we could have turned on shapefiles for where ports are, or EEZs, exclusive economic zones. All of that's the pretty typical stuff you see. Like I said, for the live data we use the GeoMesa Kafka data store; the activities are where some of these analytic frameworks come in, and then we use the data store again to see them; and the contextual data is hosted by GeoMesa HBase in this case.

Okay, so let's go through how GeoMesa does what it does with Kafka. If you're brand new to streaming and aren't familiar with Apache Kafka, the thing it's doing is being a distributed write-ahead log. The benefit of any write-ahead log is that, since it's append-only, writes are really fast; and since it's distributed, you can turn up the parallelism. If one consumer, or one machine writing to the write-ahead log, isn't fast enough, you just add more and partition the data on a Kafka topic. Okay, that's the five seconds of what Kafka is.

So how do we get moving dots on the map? We've got producers, and since the target is GeoServer and the OGC standards, we're trying to create simple features, things in that OGC standard. The producers have to get data from somewhere; in this particular case it was exactEarth. We need to create simple features and put them on a Kafka topic. For that topic we've written our own Kryo format, and we can also write features out as Avro; Avro is a binary, hierarchical, sort-of-JSON format, if you want to think of it that way. There are some particular details about how we use that Kafka topic, relative to things like being able to clear the entire view, but that's a little specific. Once we've got data flowing through that Kafka topic, the thing that reads it needs to hold an in-memory representation of everything on that topic.
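As a rough sketch of that pattern (not GeoMesa's actual code), you can model a Kafka topic as an append-only log of keyed messages; the reader's live view is then just the newest message per key. All names below are invented for illustration:

```python
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log of keyed messages."""

    def __init__(self):
        self.log = []                      # append-only, so writes are cheap

    def produce(self, key, value):
        self.log.append((key, value))

    def consume(self):
        yield from self.log                # replay messages in offset order

def latest_state(topic):
    """Build the 'live view': replay the log; the newest message per key wins."""
    state = {}
    for key, value in topic.consume():
        state[key] = value
    return state
```

Replaying from the beginning of the topic is how a freshly started consumer rebuilds its picture of the world; Kafka's log compaction exists to keep exactly this "latest value per key" workload cheap.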
That way, when a new message for a given ship comes in, it can replace the old one. The other thing that in-memory database needs to do is actually answer spatial queries; that's one of the requirements we have. At a conference like this, someone in the audience is thinking, "Well, I know about R-trees and quadtrees; did you try those?" The problem is that updates for some of those would be really slow. And since I said "in-memory," people should be thinking, "Hey, H2 is one of your favorite in-memory SQL databases, and H2 has great spatial support." When we tried it back in 2016, it was slow. I always want to frame any benchmarking study I share as "I couldn't make it work, and the last time I tried was five years ago," to avoid stepping on anyone's toes in case someone came through in 2017 and made H2's spatial support amazing, for instance.

So we made our own little in-memory database. The trick is this: since we know we have an identifier for the thing moving through space and time, we keep a hash map from those IDs to the actual records. The second thing, to answer spatial queries very quickly, is a bucket index of spatial grid cells. Imagine splitting the world up into a grid; when you figure out which bucket a point should go into, you throw it in there willy-nilly with all the rest, and if your view window ever touches that bucket, you just check all the records in it. It's fast enough, at least for our use cases. Updates do the obvious thing: you find the record via the hash map, remove it from the bucket index, and insert the new one. This works really well if you just need to look up a particular vessel or do spatial queries. We also integrated with an optional library called CQEngine, which helps you add indices on other attribute columns.
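A toy version of that hash-map-plus-bucket-grid structure, assuming a fixed 10-degree grid (the real data store's cell scheme, record types, and concurrency handling all differ):

```python
from collections import defaultdict

CELL = 10.0  # grid cell size in degrees; an assumption for this sketch

class BucketIndex:
    """Hash map from feature id to record, plus a grid of spatial buckets."""

    def __init__(self):
        self.by_id = {}                  # feature id -> (lon, lat)
        self.buckets = defaultdict(set)  # grid cell -> set of feature ids

    @staticmethod
    def _cell(lon, lat):
        return (int(lon // CELL), int(lat // CELL))

    def upsert(self, fid, lon, lat):
        old = self.by_id.get(fid)
        if old is not None:              # updates remove the stale entry first
            self.buckets[self._cell(*old)].discard(fid)
        self.by_id[fid] = (lon, lat)
        self.buckets[self._cell(lon, lat)].add(fid)

    def query_bbox(self, min_lon, min_lat, max_lon, max_lat):
        """Check every record in every bucket the view window touches."""
        results = []
        for cx in range(int(min_lon // CELL), int(max_lon // CELL) + 1):
            for cy in range(int(min_lat // CELL), int(max_lat // CELL) + 1):
                for fid in self.buckets[(cx, cy)]:
                    lon, lat = self.by_id[fid]
                    if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                        results.append(fid)
        return results
```

Lookups by id are O(1), and an update touches at most two buckets, which is the property that R-trees and quadtrees give up in exchange for better query selectivity.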
That CQEngine piece doesn't come up too often, but we did write a paper about it; it's fun stuff. And here are the obligatory results from the paper, which I'm going to skip over.

In terms of tools, if you're trying to get started with either Kafka or GeoMesa's Kafka support: Kafka has command-line tools that help you manage topics, like creating and deleting them, and you can also send quick messages and listen to them. Since GeoMesa is wrapping things up in our own message formats on top of Kafka, we also have tools that help with creating the right simple feature types for topics, sending messages, and listening on topics. So instead of seeing binary blobs go by with the general Kafka command-line tools, you can get a nice little CSV printout of "hey, this message came through." Those are great things to know about if you're getting started.

Okay, let's talk about Apache NiFi for a few minutes. I said a little about what it is; ripped straight from the project description on the front page, Apache NiFi supports directed graphs of data routing and transformation, and it lets you do a lot of really cool ETL. Some key things to know about it: it has a web-based user interface, so if you like visual programming, it's got that canvas drag-and-drop thing going, which is really cool; it lowers the barrier to entry and makes things pretty intuitive. There are parts of being web-based like that which are frustrating, but it lets a lot of different people see what's happening with the data flow. There are things it does around back pressure and data provenance that are really cool. I really want to focus on the fact that it's designed for extension: all of the bricks that you put onto the canvas and wire up for how data flows around, you can make new ones, and that's where we have a project called geomesa-nifi that adds those processors.

You typically use NiFi for managing data flows in ETL. As a really concrete example without geospatial, just to get us warmed up: you might use a GetHTTP or ListenTCP processor to get data from a source. Say you got an XML document; there's a processor called TransformXml that will apply an XSLT document to it and transform the whole file. Other times you can treat the files as records, if you have CSV or JSON, and transform them into another file format: you could change your CSV into JSON or vice versa, and remap fields. Once you've transformed those files, you can load the data into another system with processors like the JDBC-based ones or PutS3Object, and that helps you get your data wherever it needs to go in your enterprise.

Okay, so let's do this for simple features. We've got a few different things to roll through as example processors. Since we're already using Kafka in the enterprise, we know we've got Kafka topics with GeoMesa records on them, and we've got a processor, GetGeoMesaKafkaRecord, that will read from those topics and turn them into simple features. It works with the NiFi record reader API, which means we can read from that Kafka topic and write out CSV or Avro or JSON. In terms of transformations, GeoMesa has a converter library that helps you map whatever file format you've got initially into simple features. Since we've integrated with Avro and worked out the details of how Apache Avro can do geospatial (we store some simple feature type information, and we store geometries as well-known binary in Avro), we've got a processor that will just let us use that converter and write out what we call GeoAvro.
In terms of loading, that converter, since it creates simple features, is good enough to just say: I've got input, I've got a converter, I want to write it to one of these GeoMesa-managed backends like HBase. The processors we've got for that are all the same; we've got them for HBase, Accumulo, Cassandra, Redis, and the FileSystem data store. The other processor on this page takes the GeoAvro we might have created in another situation and loads it into HBase. Similarly, we can map NiFi records into HBase, and we've also got a processor that lets us do record updates, so that's a cool little addition there.

As you start to do this, and I'm not kidding about the volume here, if you actually had 50 different data flows writing to the HBase or Accumulo in your enterprise, at some point you really want to centralize the configuration. We've got pieces in NiFi that help you configure your connection to the database once and then reuse it. And we've implemented NiFi's record API for these GeoAvro records, so if there's a processor that writes out NiFi records, you can configure the record set writer to write them out as GeoAvro; that's a really cool option.

We can put this all together in a high-level flow where you're reading from TCP on the left-hand side, the input side. If the first thing you do is convert that to Avro, it gives you a pretty good setup, because now that Avro file can go to S3 to store the source data, and the same file can be loaded into HBase and into Kafka. That's a typical pattern for us, because then we have an archive in case, heaven forbid, anything happens to the database and you need to recover it; and if someone asks, "Hey, what files did you receive today?" you can just say, "Okay, I've got it in S3." The benefit of HBase, that distributed database, is it's got lots of power for answering complicated queries, and Kafka serves the live view.
I'll say a few things about streaming analytics. Like I said before, we've built a lot with Storm, and we're looking at Siddhi and ksqlDB. One thing I want to call out for people who might be trying things in Kafka Streams: I want to give a shout-out to Will LaForest at Confluent. He's added some spatial UDFs to ksqlDB in a project he's calling ksql-geo, and he's got a quick little demo of it. He picked a metro area that publishes information about its buses and built a little web demo: we can gather up all the observations for a given geohash, keep counts, and do things like that. I want to keep working with Will to make his spatial extensions to ksqlDB look a little more like PostGIS; he got started on his own, focused on geohashes, and there's some really cool stuff going on there.

With that, I'll go ahead and say thanks and take some questions. CCRi is hiring, and here are all the ways to get in touch with us; feel free to email me. geomesa.org has excellent documentation, and we're very active on Gitter, so if you've got questions, feel free to reach out.

Excellent, yeah, that was good. So, what is CCRi?

Yeah, let's get into that. Historically, CCRi has been a small data science company here in central Virginia. We've recently been acquired by General Atomics, so that's where the logo is "General Atomics CCRi." We're still an affiliate company; there are about 130 of us, mostly based here in Charlottesville, Virginia, who on a good day get to work on really hard problems, like indexing spatial data.

Wow. Yeah, man. So you were doing ships as an example, and I heard buses. What else could you feed into this? Anything with a spatial twist that's pumping out location, I guess?

Pretty much.
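Going back to that bus demo for a moment: the geohash bucketing it describes is easy to sketch, encoding each observation's position to a short geohash prefix and counting per cell. This is a from-scratch illustration of the standard geohash encoding, not Will's code:

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision=5):
    """Standard geohash: alternately bisect longitude and latitude,
    emitting one base-32 character per 5 bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, ch, bit, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1   # point is in the upper half
            rng[0] = mid
        else:
            ch <<= 1             # point is in the lower half
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:             # 5 bits -> one base-32 character
            chars.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(chars)

def counts_per_cell(observations, precision=5):
    """observations: iterable of (lat, lon) -> Counter keyed by geohash cell."""
    return Counter(geohash(lat, lon, precision) for lat, lon in observations)
```

Shorter prefixes give coarser cells, so the same stream can be aggregated at several zoom levels just by varying `precision`.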
I think, from my point of view, the biggest thing I want to highlight is when things are going to be a really good fit for GeoMesa. In some of my other talks, one of the things I like to focus on is: how do you get a billion records in the first place? Most people don't have big data problems if they don't have IoT. The two examples I give: there's a project from Texas researchers called GDELT, the Global Database of Events, Language, and Tone, where they applied NLP to every news article they could get their hands on going back to 1979, and it had about 250 million records last I checked, which has been a few years. So everything that's happened since 1979 is not a billion records. The reason it's important for us in GIS is that they try to infer a point: if someone in DC did something, they've got a point for Washington, DC, and if the article is about the US or France, they pick someplace in Kansas, or the middle of nowhere in France. Centroids. The other data set is OSM, and the OSM PBF you can download is 100 to 200 gigabytes; it's not that big. So if everything that's happened isn't a billion records, and every place on the planet you'd want to go fits on a thumb drive, you haven't made it to big data.

So how do you get a billion records? You really need something moving through space and time saying "I'm here now," and as soon as you have a data source like that, whether it's planes flying through the air, people on cell phones, whatever it is, you can start to use some of the big data approaches. You also get into a duality of two questions, where people either want to know "where was everything?" or "where is it right now?" This streaming side is really focused on "where is it right now," and the analytics ask questions about a recent window.

Cool, man. My head hurts now. Thanks.

Well, it's also pretty early here in the Eastern time zone.

That's true. Well, excellent, man, I appreciate it. We didn't have any questions, but we had a lot of people chipping in saying thanks and nice job. You did good. So, hey, with that, we appreciate it, and good luck.

Great. Well, I'll let you go ahead and get the next guy set up. Good to see you; thanks for hosting.

Thank you for speaking. See ya.

Okay, and in my completely patented way, I'm not even going to
Info
Channel: FOSS4G
Views: 39
Id: 2CxJw7iZs-8
Length: 32min 54sec (1974 seconds)
Published: Mon Nov 08 2021