Scraping Maine Secretary of State, technical.

Captions
Hello. There we go. All right, um, hello, I'm Jordan Hansen from Cobalt Intelligence. We are working on the Secretary of State API today, and today we're going to the state of Maine. You can see these are the states we support right now. We're adding a lot all the time. Wyoming's up close, Vermont's close. Indiana and Louisiana I skipped because they have captchas. We've solved those before, so it shouldn't be difficult, but I just have been dragging my feet on them. Maine looks like a nice place to go, so we're gonna go to Maine today. And so this is gonna be a technical video going over the details of Maine, typically about 40 minutes.

Okay, so today I want to see if we can do this first with, um, direct requests, like fetch. That's what I'm going to start with: node, import fetch from, um, and then I'm going to start the search for the business and say, uh, export async function searchBusinesses, that's my business name here, string. Then we're going to come over here to this request, right, search here. I'm going to grab this whole thing. Let's grab this one and then we'll go back. Wow, look at that, that's kind of slow, huh? Okay, and then we're gonna search here for this business name, and I'm gonna search and see if I can reproduce it. So we got this right here, this is the POST request. I'm going to copy it right here, copy as Node.js fetch, bam, and we're going to say const response equals await, bam, like that, RequestInit, as unknown.

Now we look at what's different here. We're also going to need cheerio to parse our, import cheerio from cheerio. I'm going to make this a little bigger, even, like that. So cheerio, we're gonna use it to parse the HTML. We're gonna come over here and we're gonna say const html equals await response.text(), like that. And we say const, equals. Look at this request, and this is the thing I'm worried about, with this cookie. And we also have something, oh, that's it. Let's see if this session ID is important or not, because that may, we'll change it around. So right here.
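The replayed POST request described above can be sketched roughly as follows. This is a minimal sketch and not the code from the video: the URL, the form field name, and the cookie value are placeholders, since the captions do not show them, and `buildSearchRequest` is a hypothetical helper. It only builds the arguments for `fetch`, so the cookie header can be dropped or reused to check whether the session ID matters.

```typescript
// Placeholder values: the real endpoint and form field are not shown in the captions.
const SEARCH_URL = "https://example.invalid/corporate-search";
const NAME_FIELD = "businessName";

interface SearchRequest {
  url: string;
  init: {
    method: string;
    headers: { [name: string]: string };
    body: string;
  };
}

// Build the POST request for a name search. Leaving out the cookie makes it
// easy to check whether the site rejects requests that carry no session.
function buildSearchRequest(businessName: string, sessionCookie?: string): SearchRequest {
  const headers: { [name: string]: string } = {
    "content-type": "application/x-www-form-urlencoded",
  };
  if (sessionCookie) {
    headers["cookie"] = sessionCookie;
  }
  return {
    url: SEARCH_URL,
    init: {
      method: "POST",
      headers,
      // encodeURIComponent keeps characters such as "&" from splitting the form body.
      body: NAME_FIELD + "=" + encodeURIComponent(businessName),
    },
  };
}

// Hypothetical cookie value, used only for illustration.
const withCookie = buildSearchRequest("pizza & subs", "JSESSIONID=abc123");
const withoutCookie = buildSearchRequest("pizza & subs");
```

In use, the result would be passed along as `fetch(request.url, request.init)` and the body read with `response.text()`, as in the walkthrough.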
We're like this, and we're going to say, um, like that. That number there, we'll probably use that, like this, and we'll say business name. And then we'll just try to see: click here to search. Oh, we got our list right here, uh, right there. Actually, we're just gonna log out the whole HTML and we'll see what it looks like. And then we go like this: we're gonna call it from here so we can test it locally. Here we're hoping that we can do this, and we're going to go const business equals await searchBusinesses. Now please do it for Maine. Maine, none of these, yes, there. All right, and then we'll copy this business name right there, like that. console.log business, which right now shouldn't return anything, but let me go like this: npm run maine local, like that. Oh gosh, how about that, Maine spelled correctly, with an e at the end.

Okay, now we can search here. I found out, think of this, there we go, right there. Now what happens if I get rid of that, get rid of this cookie? Let's see what happens then. If this works, then we are in good shape here. Yeah, fail. Okay. So we are cooking now.

Now the question is: is this session ID associated with this query? That's what we're going to test next. We're going to go like this, we're going to say we're looking for another one, look for another one right here, and then we're going to type in pizza, uh, like that. One looks good. Where's my new search button? Click search. No, I want a new one. There we go, right here, corporate search. Okay, now I searched that guy, I believe. Okay, that looks good. Now we're using the same session ID for a different query, and we hope it works the same. This is what we hope. That looks good right there. Okay, perfect. So the session ID, now, there's always a concern, we'll put a note even here to see if this will expire. Always a concern that it's gonna expire. For now I think we're good.

We're like this. Now we're gonna get our list of, and we're gonna, now we're gonna start pulling stuff from elsewhere. Where, though? What's a good one? Arkansas. I don't
like the extra space there. Okay, come down here. I know it's so zoomed in, is that kind of hard for you guys? I know it's hard for me, but I'm trying to make it easier for you guys to read. It's like barely any code, though. Can I get one with smaller? Can you read that? Hopefully. It's so zoomed in I couldn't see anything.

Okay, so what now? The bad thing is information is summary. What if we had the, looks like we do. Okay, so we can get the information summary from here, or like this, like that. We're gonna import this, add all missing imports, there we go. And we're gonna do this kind of thing: we're gonna come over here and get our list of stuff. This is a table and a table and a table, I see. That always makes it less fun. Let's see if I go table, oh, I have to search here, table, yeah, this file, table table, nope. What if we go tbody tr? The first ones are headers, look at that. Oh gosh. You can just skip the first, like, five, six, as the first one. That's what we're gonna do: if row is less than six, those are header rows. Hopefully that'll work. We go like this. Now we have our rows.

And then we come down here. Our title is going to be, once we're in here, it's going to be td nth-of-type(2) font. Yeah, whatever that right there. So it'll be this one, there we go, like that. And we will trim it, that's good practice, trim. And there's no status here, so that's not gonna work, no status there. And we're at the business link, that's what I want. I don't really care about the business ID, but I do want the SOS ID: let sosId, string. This is how we're going to do it. We're going to come down here. I don't really care about the e, do I? I don't think so. I think I come over here and I say sosId equals. Now I've got this thing, so it's going to be a little different than that. That's going to be the nth-of-type, what, four? Not font, though. It's gonna be a one, two, three, four, and then it's gonna be font and then a. That's not gonna be, it's gonna be text. No it's not, it's gonna be attribute, and it's gonna be href. And then we want to split on corp sum, like
that, and then we take that one after that. Perfect, look at that, looks so good. And I'm not gonna worry about this, because I'm not going to. And this: if sosId, else this, return alternative business names. Otherwise we're going to return await, with businesses, getBusinessDetails. Right here we're going to say async function getBusinessDetails, sosId string, and that's what I want right there, sosId right there. Yeah, I like that. Testing title, type test, there we go. And we're getting the business details, we'll get that with the SOS ID, and we say yes, this right here. Now this, I think we'll go just like this: const response equals await fetch, url, it's like this, right, like that, sosId. Now what did it look like over here? Yeah, it had it kind of encoded there, so I think we're fine. And then we go html equals await response.text(), like this, console.log html, the HTML from the details page.

That is, there we go, and testing title over here. We see everyone. No, I don't want that. Oh gosh, how many rows are there total? I'm not even doing that, I'm just checking them now. There's probably some after that. What's gonna happen here? Well, we'll see, I guess. So we're going to see, am I still logging out? I don't want that one anymore. Come over here and say found match, and I want that SOS ID, I think, as well, sosId, like that. There we go. Run it, run it, run it, run it. Oh, we're barely gonna see this, though. Okay, that was my concern. Oh, we'll go look at this. What about this? Haha, I'm not gonna do it unless it's ahead of the set. There's a bunch of rows in this table that don't have anything we want, that may not have titles like that. Well, that's not good. Oh, I see, same thing here, homies, same thing here. I don't really want this. Yeah, like that. Okay, but why none of them? Ah, whatever, I'll just get rid of this. Okay, you got it. I go here, look, I found it. Let's go over here now and see what we got. We're searching for, uh, like, David. Perfect. We're in business, my friends, we are in business. Yep.
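The two small pieces of result parsing described above, skipping the leading header rows and pulling the SOS ID out of the detail link, can be sketched as follows. This is a sketch under assumptions rather than the code from the video: the marker string `CorpSumm=` is a guess at what "split on corp sum" refers to, the sample link is made up, and both helper names are hypothetical.

```typescript
// Assumed marker in the detail link; the captions only say "split on corp sum".
const ID_MARKER = "CorpSumm=";

// Return whatever follows the marker in the href, or undefined if it is absent.
function extractSosId(href: string | undefined): string | undefined {
  if (!href) {
    return undefined;
  }
  const parts = href.split(ID_MARKER);
  return parts.length > 1 && parts[1] !== "" ? parts[1] : undefined;
}

// Drop the leading rows that the walkthrough treats as headers (indexes below six).
function dataRows<T>(rows: T[], headerRowCount: number = 6): T[] {
  return rows.slice(headerRowCount);
}

// Made-up link, used only for illustration.
const exampleId = extractSosId("ICRS?CorpSumm=20150001+D");
const exampleRows = dataRows(["h0", "h1", "h2", "h3", "h4", "h5", "first", "second"]);
```

With cheerio, the `href` argument would come from something like `.attr("href")` on the link inside the fourth cell, which is why the helper accepts `undefined`.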
Just like that. I'll come over here and do this. Get there, get down to work. Let's say dollar sign equals cheerio.load(html). We know the drill, right? How good are these things going to be? Not good. Okay, what if it's different? It doesn't look like one that would be different, I hope. So we're gonna use a bunch of exact selectors, which is just okay. All right, here we go. What's this look like? BRs. Uh, we've done this a lot of times too. In fact, we probably can copy this one. Nope, I don't like that one. Recently, what did I do right before this one? Hawaii, Illinois. Now Hawaii, probably. Yeah, that's what I want: this thing right here, a nice parse address function to help me out.

All right, now we're going to this. Okay, we're ready. We're gonna say const business equals, it's gonna be from IBusiness, yeah, like that, and then we're gonna return business right here. I'm gonna import this guy. And now we're going to say title is what we, dollar sign, this, and SOS ID. Yeah, we're gonna need stuff here. Now I may as well do state of SOS registration. That's the easiest one because I know exactly what it's going to be. It's going to be this: MA? No. What is it, MI? ME? It's embarrassing, I don't know what that is. Okay.

All right, now we'll go to this. We have to get an exact selector for this, which is going to be the hard part. What do we got here? Table, table, table. Okay, so it's in the second table, and then we go like, what, tr nth-of-type, one, three, four, five, and then we go td nth-of-type(1). There, there's my first one, .text().trim(), assuming we find text.

Now let's do that again, but this time we're gonna do it with SOS ID, called charter number, and this one is going to be this one right here, two, right? I'm sure. Yeah, I like where this is going. We can do this. Then we have entity type, that's gonna be type three, or, yep, right there. And then we have, what do we have, status. Okay, that's type four. Now filing date is gonna go down two rows, so it's gonna be seven and then one. Filing date right there, and
state of formation is going to be still seven, but it's going to be three. Is that right? Let's see. So I got seven and three. Yeah, that's big. Okay, but we gotta abbreviate state, like that. And what else? Um, agent information we'll get. There's no address, but other than that we just get agent information and we should be good.

Like this, we'll go like this, and parse. That's all within that same thing, though, so I think I'm gonna go like this. No, it's not like this. I gotta think. We're going to say interface IAddressAndName, equals this, we're going to say extends IAddress, and we're just going to add a nice little name there, like that. Okay, now we're going to go IAddressAndName, there we go. Now we're going to say name is this, right there. Now if there's an address, then we parse it through. We're going to split on BRs. The first one is always going to be the name. And if there's a suite, we're gonna have, how many we're gonna have? So we have this and we split on BRs. Let's take it and try. Ah, just, yeah, yeah, I know you got line breaks or whatever, I don't even need them in there. Okay, but they've got this, and we say address split on BR: three. So if it's greater than three, we're assuming a suite. And now I'll just always do the second one, street, we'll always do the second one, like that. Now if there's more than three, then we're assuming this one. Right now there's three; there's more than three. And then the last one is just going to be city, state, and zip, which comes from the end, which would be, it's going to be minus one. Split on comma, and the city will be the first, and then we'll do this on space. Yeah, that's good. This is all good, that all looks good.

Okay, now we've got this: const agent, how about that, equals parseAddressAndName. It's going to be something, I gotta find the proper selector here. Okay, what row is this? Probably 8, 9, 9, 10, 11, 12. 12.
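The agent parsing just described (split the cell HTML on BR tags, take the first piece as the name, the second as the street, treat an extra piece as a suite, and split the last piece on a comma and then on spaces) can be sketched as follows. This is a sketch under assumptions, not the code from the video: the interface names follow the ones spoken in the captions, but the field names and the sample addresses are illustrative.

```typescript
// Field names are assumptions; the captions only name the interfaces.
interface IAddress {
  street?: string;
  suite?: string;
  city?: string;
  state?: string;
  zip?: string;
}

interface IAddressAndName extends IAddress {
  name?: string;
}

function parseAddressAndName(html: string): IAddressAndName {
  // Split on <br>, <br/> or <br />, strip any other tags, and drop empty pieces.
  const parts = html
    .split(/<br\s*\/?>/i)
    .map((part) => part.replace(/<[^>]*>/g, "").trim())
    .filter((part) => part.length > 0);

  const result: IAddressAndName = {};
  if (parts.length === 0) {
    return result;
  }
  result.name = parts[0]; // the first piece is always the name
  if (parts.length < 3) {
    return result; // a name with no full address
  }
  result.street = parts[1]; // the second piece is always the street
  if (parts.length > 3) {
    result.suite = parts[2]; // more than three pieces: assume a suite line
  }
  // The last piece is "CITY, ST ZIP": split on the comma, then on whitespace.
  const cityAndRest = (parts[parts.length - 1] || "").split(",");
  result.city = (cityAndRest[0] || "").trim();
  const stateAndZip = (cityAndRest[1] || "").trim().split(/\s+/);
  if (stateAndZip[0]) {
    result.state = stateAndZip[0];
  }
  if (stateAndZip[1]) {
    result.zip = stateAndZip[1];
  }
  return result;
}

// Made-up sample cells, used only for illustration.
const agentWithSuite = parseAddressAndName("JOHN DOE<br>123 MAIN ST<br>SUITE 4<br>PORTLAND, ME 04101");
const agentNoSuite = parseAddressAndName("JANE ROE<br/>1 ELM ST<br/>AUGUSTA, ME 04330");
```

Only one extra line is treated as a suite here; a cell with more lines than that would need a closer look at real data, which the walkthrough also notes it had few samples of.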
And then we just got td, right? That's only one, like this. But it has to be the HTML of this, this time, .html(), like that. And then we say business.agentName equals agentInfo.name, street, agent street address is street, I'm gonna go city, and we go state, and we go zip, and we go city, state. How we doing? 21 minutes. I don't want to get optimistic here, but I think we're cruising. Now we've got this abbreviated. There's no physical or mailing address. That's interesting, isn't it? I think it's interesting. Now let's see what we got. Oh, it's beautiful. Gosh, that makes me happy, look at that, that's great. Let's get rid of this stuff, I don't really want this as much. And then we go over here like this and we're going to try this one. Yeah, that works well too. Dang, that's good.

Okay, now we're gonna push it up and see if it's really as good as I hope. Um, we're gonna make a new function here, a new Lambda one. We're gonna say over here, it's going to be maine search. We have to follow the convention so everything works. Come over here, we say Lambda, we configure it. We should need more than that, we should need, I'm not going to do more than 29 seconds. After that our thing times out, we stop. Asynchronous invocation, we set the retry attempts to zero. What else? We've gotta add a layer. There we go, layer 18.
We're just moving up and up on these. And then we have to do exports. Yes, we do have to do the exports part, that should be easy. Everything else looks good. We'll come over here, ah, we need Arkansas. I'll come up to that, this part right here, exports.handler, this is what we need. This is so, uh, Lambda can call it. Yeah, like that. We search for everything that's Arkansas and we replace it with Maine. Oh, now I do SOS ID, which is fine, that should be easy. Like this, we go let businessResponse equal this, it's going to be this, like that. And we say if there's an SOS ID, we're just going to call directly: businessResponse equals await getBusinessDetails, sosId, like that. Else we're going to call searchBusinesses, like that.

Okay, run, send state to Lambda, Maine. That baby's going up. Search query, test, pizza, search query, and right there, save changes, send it. Oh my gosh, it just worked. That feels so good. I wish I could have some, I want to see if I can find some other suite, though, to make sure I'm handling that scenario correctly. But let's search by SOS ID here, sosId. Oh yeah. Okay, sad for a second, thought I wasn't gonna have the perfect day. Oh baby, yes, I love it.

Okay, see if I can find some other businesses here. My sample size is not huge yet. Right here, I don't have any in this list. Okay, all right, hold on, let's see if I can find another one. How about this one? Okay, this one. Then we're gonna search for randos. That appears to be none, no. Okay. Is the abbreviation ME for sure? Yeah. Okay, well, let's search for something else here. What about, uh, fire? A crackling fire production, yeah baby, let's do that. Test fire, search query, there we go, like that. I'm gonna go over here, get this thing. That worked great. So now this one is put together, let's test the SOS ID. That's great, looks really good.

Okay, now we're going to publish this thing. We're done here, friends, that did it. Supports SOS ID and search query. That was a fast one, 27 minutes. It's a record, all-time record. I
feel like I've had faster, but I don't know. Query, bam, right there. Wait, I thought I changed you. I do it for the prod, for the alias as well. That's not good. Okay, now we come over here, we say index, and we're going to update my functions, because I am not patient. We go over here, update states available. We do this while we're doing this: we git add and commit, add support for Maine, push origin master. I'll go over here, I should refresh, and then Maine should show up on our list. Maine, yes. Like, no test on it yet. Yeah, I'll get that in there, we'll do that right after this. Maine. Oh no, I'm too small, not responsive, it's just okay on this site, I guess. Bam, look at that. And what if I search by SOS ID? Yes. Okay, we are in business now. Oh, I don't want to put master, up, git push origin. I mean, we're done. I'm going to add some tests and then I think we're good to go. All right, thanks everyone, that's all.
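The handler dispatch described in the walkthrough, where a supplied SOS ID goes straight to the details lookup and anything else goes through the name search, can be sketched as follows. This is a sketch under assumptions and not the code from the video: the event field names `sosId` and `searchQuery`, the `routeLookup` helper, and the error for an empty event are all illustrative. The lookups are passed in as parameters so the routing can be tried without network access; with the real async functions, the return type would simply be a promise.

```typescript
// Assumed event shape; the captions mention an SOS ID and a search query as inputs.
interface LookupEvent {
  sosId?: string;
  searchQuery?: string;
}

interface Lookups<T> {
  getBusinessDetails: (sosId: string) => T;
  searchBusinesses: (businessName: string) => T;
}

// Prefer the direct details lookup when an SOS ID is present, else search by name.
function routeLookup<T>(event: LookupEvent, lookups: Lookups<T>): T {
  if (event.sosId) {
    return lookups.getBusinessDetails(event.sosId);
  }
  if (event.searchQuery) {
    return lookups.searchBusinesses(event.searchQuery);
  }
  // Not covered in the walkthrough: reject events that carry neither field.
  throw new Error("Expected either sosId or searchQuery");
}

// Stub lookups and made-up inputs, used only for illustration.
const stubLookups: Lookups<string> = {
  getBusinessDetails: (sosId) => "details:" + sosId,
  searchBusinesses: (businessName) => "search:" + businessName,
};
const byId = routeLookup({ sosId: "20150001 D" }, stubLookups);
const byName = routeLookup({ searchQuery: "pizza" }, stubLookups);
```

Checking the SOS ID first mirrors the order spoken in the video, so an event that carries both fields skips the search step.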
Info
Channel: Cobalt Intelligence
Views: 29
Keywords: business data, secretary of state, api, maine, web scraping
Id: YPci12tYLPs
Length: 29min 17sec (1757 seconds)
Published: Thu Sep 16 2021