The Impact of Artificial Intelligence on Genealogy

In this episode we're going to tackle a few small, kind of geeky tech questions about artificial intelligence, better known as AI, that may in reality have a pretty big impact on your life: on your genealogy life if you do family history, and on your life in general or the lives of your loved ones, even if you aren't a genealogist like many of us watching today. We're going to talk about questions like: Is artificial intelligence the same thing as machine learning? Maybe you've had that question. And if not, how are they related? Am I using AI without even being aware of it? What impact is AI really having on our lives? Is it all good, or are there some pitfalls that we need to be aware of? We're going to take these on with a focus on family history, but pretty quickly I think you're going to discover that it's a much more far-reaching subject than just family history, and that means this episode is for everyone. I hope that if you find it insightful, you will share it with your friends and your family as well. Now, while I've done my own homework on this subject, and of course I've written about artificial intelligence and its impact, particularly on Google search, in my book The Genealogist's Google Toolbox, I'm smart enough to call in an expert when an expert is needed. So my special guest today is Benjamin Lee. He is the developer of Newspaper Navigator. It's that free tool that we talked about in a past episode, episode 26 of Elevenses with Lisa. It helps you search for, retrieve, and get access to the images that are in historic newspapers. Newspaper Navigator is available, of course, as part of Chronicling America at the Library of Congress website. Ben is a 2020 Innovator in Residence at the Library of Congress, as well as a third-year PhD student in the Paul G. Allen School for Computer Science and Engineering at the University of Washington, where he studies human-AI interaction with his advisor, Professor Daniel Weld. He
graduated from Harvard College in 2017 and has served as the inaugural Digital Humanities Associate Fellow at the United States Holocaust Memorial Museum, as well as a visiting fellow in Harvard's History Department. Currently he is a National Science Foundation Graduate Research Fellow. So without further ado, let's jump into a conversation about artificial intelligence with Benjamin Lee.

[Music]

Hi Ben, nice to see you via Zoom, versus just in chat in our last episode. It's so good to see you here.

So nice to see you too. And once again, thank you so much. It was such a wonderful surprise for Jamie and me to be able to pop in and see Newspaper Navigator being demoed. You did such an incredible job with it, and we're so excited that you were able to reach such a wide audience with it as well.

Oh, thank you. Well, I was in there thinking, how did they figure it out? I mean, it's hard enough for most of my viewers just to make it there live at the right time. How did you find us?

I believe one of the other LC Labs team members found it on Twitter right around when it was launching, so we quickly got on, saw the YouTube video recording was live, and popped on. I'm really glad we were able to catch it, and then I was able to re-watch the portion that I had missed at the beginning on the archived version on YouTube, which was a lot of fun. I think one of the most enjoyable parts of Newspaper Navigator has been seeing other people use it, and your demo was so in-depth and covered all the bases. We were just so glad that you were able to communicate it so effectively and so well, so we're so appreciative of that.

Oh, well, you had my favorite stuff. You had newspapers, you had photographs, you had the search capability. I mean, it was just everything that catches my eye, the sparkling objects out there, and I knew that everybody watching was going to be excited to try and use it,
so we did our best to get in depth into it. I'd be interested to have you share with everybody: what was the inspiration behind creating that project to begin with?

Yeah, definitely. I can give a little bit of context on the project itself and a bit more about my own trajectory as well, if that would be helpful.

That'd be great.

So Newspaper Navigator really comes out of some questions that I've been thinking about in terms of how we can apply computational techniques, machine learning being the buzzword, to various archival or library collections, and how we can improve information access, or what I might call our ability to search over this content. I had been fortunate enough to have seen a lot of the work with Chronicling America at the Library and the National Endowment for the Humanities, and also the crowdsourcing initiative Beyond Words, where volunteers were asked to effectively draw bounding boxes around various visual content in the World War I-era pages in Chronicling America. That was really the main inspiration for Newspaper Navigator: thinking through, if we have all of these volunteers who have produced this incredible dataset, or collection, of photos and illustrations and whatnot, could we perhaps use that as some sort of training data for a machine learning algorithm to then try to do it over all of Chronicling America? So the project really had two central parts. The first part was running machine learning over all of the pages to extract all the photos and whatnot. The second phase resulted in the search app, which you demoed, which really thought about, now that we have all this visual content, how can we try to navigate this sea, this wealth of information, in new ways? So that's been the arc of the project. In terms of my own background, I'm continuing
to work on Newspaper Navigator in my own dissertation work, and there I'm thinking about the ideas of searching over information, what I might call exploratory search, and how we can provide those kinds of affordances or new approaches for all sorts of people, whether it's genealogists or historians or educators, who are trying to make this kind of rich historical content available to the public.

Well, that's what caught my eye again and fascinated me. I've always been very much into data visualization and helping genealogists think: okay, you have these stacks of papers, you've got databases, but have you looked at it in a new visual way, in a visual format? I use mapping a lot for that, but I noticed that data visualization was something that you've spoken on in the past as well, so we have a lot of intersections in our interest in all this. I know that you have a background in this. You were working with LC Labs, the Library of Congress Labs, which was actually new to me. I didn't realize they had kind of a formalized labs going on, and I think that was new to many of my viewers. It said that in your work interning with them you'd also worked with the Holocaust Museum. Tell us a little bit about that.

Definitely. My own personal background, and I guess my entry into thinking about these kinds of questions to begin with, is that my grandmother is a survivor of the Auschwitz-Birkenau concentration camp. Throughout my life I've had an interest in Holocaust studies and taking coursework related to the Holocaust. The year after I graduated college, I was a fellow at the United States Holocaust Memorial Museum doing digital humanities work. A lot of my work there was thinking through an enormous collection, or database, that they have called the International Tracing Service archive, which has about 200 million pages of digitized documents related to the victims of the Holocaust. My own
grandmother is actually in the collection itself. I had encountered it maybe about a decade ago when I went and visited the museum with my grandmother, so it was almost, in a sense, a very full-circle story, where I started with my grandmother doing genealogical work in the archive and then eventually ended up on the other side. So a lot of my own interest in the space really comes from thinking about the exact kinds of questions that your viewers are thinking about in terms of finding information on family, and using those kinds of questions that I was encountering as an opportunity to then inform my own work and research as well. So I would say that I particularly appreciate the opportunity to speak to other genealogists, because that particular drive and interest has been a real motivating factor in the arc of my own career.

And I really do think it's the future of genealogical research. People often ask me, "Okay, you've been at this for a long time. What's coming? What's in the future?" And I say it really is the application of technology to the information we already have, because just when we think we've gleaned everything out of it, you realize, oh, if I visualize it this way, or if I search it in this way, then new things come to light. So that's why I think your work is absolutely fascinating. You mentioned machine learning, and people hear "AI" and "artificial intelligence." I know some of those kinds of things, and I've talked about it in my book, but you know so much more. Help us separate that out, tease it out a little bit. When we hear "machine learning" and we hear "artificial intelligence," what are these two components and how do they interplay?

Yeah, definitely. Certainly it's the case that machine learning and artificial intelligence are buzzwords, and so
I think they're pretty oversaturated in terms of the media and whatnot. But generally speaking, the idea behind artificial intelligence is how we can use computers to try to replicate some form of the tasks that humans are able to do from an intellectual or cognitive standpoint. This takes the form of a number of different things. Some that I think we all might be familiar with are image classification, for example feeding a computer an image and figuring out what's in it: does this image have a dog or a cat? Another example would be question answering: if you're using Siri or your Echo or something like that and you ask it a question, the system effectively learns how to process that information and return an answer. The distinction between machine learning and artificial intelligence, I know, is often unclear in terms of how it's framed. Machine learning effectively refers to one kind of approach toward artificial intelligence, so you can think about artificial intelligence as the larger set containing machine learning and some other types of approaches. It just so happens that a lot of the techniques that people are pushing on these days are within the scope of machine learning, but really, for the general public, I think they can be used pretty interchangeably. In my case, a lot of what I've been working on in the space with machine learning is how we can apply it to visual content, whether it's a document or a newspaper page or a photograph, and effectively use these algorithms to try to infer different things about what's present. In Newspaper Navigator, for example, that means differentiating, say, portraits of people from photos of buildings. So I'm trying to use these ideas that have emerged a lot from the technology space and move them back into libraries or
archives and stuff like that.

Well, I know with a lot of the content through Chronicling America at the Library of Congress, as I understand it, like you said, they drew the boxes so they could say, "This is a picture, pay attention to this." But then there's also tagging and metadata and things going on. I found myself, as I was working with it, thinking: okay, I'm getting essentially the general direction I'm trying to go, but I would love to be able to then add in not only keywords or tags but even operators that might help further explain to the system what I'm trying to find. I'm sure you had a limited amount of time to work on it, and you said you're continuing to work on it. What kinds of things do you see in the future that can be added to it?

Yeah, that's a good question. Precisely what you're talking about are things that I'm very interested in: in addition to letting people effectively just click on images and say what they're interested in, giving people the ability to then specify, as you said, by applying tags or typing in their own keywords. Another angle that I'm really excited about, and pushing ahead with in my own research, is thinking about how we can share these different navigators that we've created. In effect, I might have one that I'm interested in, but I'm really interested in the idea of doing this as a community; if we start sharing them, that might really open up doors for how we think about the information. So really both improving the kinds of ways that we're interacting with or providing feedback to the system, and also how we can do that in a collective manner.

Yeah, visual content is so ripe for this. I teach a lot about using Google Books, and yes, you can keyword search, but I just did this last week, I was saying some of the really amazing things that you find, though, are
these one-of-a-kind maps that were drawn just for this book, that delineate this situation at a moment in history. Wouldn't it be nice to also be able to grab images as well as just the text? I haven't seen that functionality so much yet in Google Books, but I could see that as a potential.

No, definitely. I would say that I rather naively always thought about newspapers as a very textual material, right? But seeing Beyond Words really challenged me on that, and I was just so excited by all of the kinds of visual content coming back out. I definitely agree with you. I think there's so much interesting work to be done surrounding all sorts of collections, in terms of not just thinking about the text there but thinking about the visual aspect, and both how we can extract it and how we can compare things and draw broader comparisons, whether it's for the kinds of questions that you're asking with your viewers or the kinds of questions that a historian might ask about the history of printing or something like that.

Has that got you thinking about, you know, you've had a lot of exposure to different kinds of materials, and you certainly have been in the genealogical space. Are there other opportunities? I mean, okay, we're talking about pictures in documents or books, but what other kinds of things have we not yet fully tapped into that are sitting right there?

Yeah, that's a great question. I think the world is our oyster, so to speak. One of the real treats about having a chance to work with LC Labs and the Library of Congress is just getting a sense of the kinds of collections that the Library has. I'm always overwhelmed just exploring, whether it's loc.gov or more generally poking around at the really incredible kinds of material that they have. And I think this goes even beyond
physical materials that are then digitized. Even thinking about the web, which the Library is archiving: how can we try to process all these web pages from the 1990s, for example, which 40 or 50 years down the road will be, I think, part of our collective past in an entirely different way? So I think there are so many opportunities there, and it seems like there's a lot of momentum within this broader community in terms of trying to apply these techniques, so I'm really excited to see what happens over the coming years in the field.

Now, with all the great stuff that can come out of it, there's always the potential for pitfalls. I'm thinking about how everybody, whether they're conscious of it or not, has experienced that they're on Facebook and pretty soon it starts to become this echo chamber of people and ideas, and you start to think, "Where's everybody else? I'm not getting this in my feed." I'm sure you touched on some of that as you worked with Newspaper Navigator, discovering that in a sense you could be given a false sense of what exists and what does not exist within the collection. I showed an example of this in a recent presentation. Politics is the big thing that's obviously on everybody's mind right now, and I was just doing a simple "anti-Trump," "anti-Biden," "anti-Hillary," "anti-Bernie" search, and the variations in the impressions that Google gave for each one were really, not surprising, but eye-opening, and Bing delivered something very, very different. So what we know is that, in a sense, with what you get prompted to find more of, and here is the machine learning trying to find more of what it thinks you want, you also run into the potential of missing whole swaths of information. Give me some of your thoughts on that, because I think this is on people's minds.

Definitely. I would say that thinking through the ways in which machine learning is not going
to be a cure-all, or going to give us only perfect answers, is a really important part of the field, and I've tried to incorporate it into my work as well. Machine learning, for all the great things it provides, does have a really long history of perpetuating bias and marginalization. Machine learning, at the end of the day, is dependent on the kinds of training data you use for the system, and the training data is ultimately labeled by humans, so when you're training on it you never escape the human biases in these systems. Moreover, because people tend to trust technology a lot, these things can go unchecked and can be really harmful. One aspect of this that crept in with Newspaper Navigator, actually, is that these kinds of machine learning models for recognizing visual similarity tend not to perform well when the image quality is lower. One example of this is in the microfilm digitization process for newspaper pages. I'm sure you've noticed this effect as well, but people with darker skin tones tend to be washed out, because the color tends to be stretched pretty strongly to either completely white or completely black. So there are cases where people have darker skin tones and their features are entirely washed out, and the underlying machine learning algorithm then can't really even recognize that there's a person there to compare them. So I ended up, as part of Newspaper Navigator, writing what I call a data archaeology. You can find it on the search application site, where I try to dive into these questions in more detail. Ultimately, I think it's really important that both the people who create these systems and the people who are using them are really aware of the ways in which these systems can go wrong, and also try to think through the ways in which it's, you
know, not just that it's creating this problem, but how we can actively try to push back against it or incorporate this knowledge into how we use the tools. So it's a very welcome point that you've made, and something that we all really need to keep in mind whenever we see machine learning appearing.

Absolutely. And this last year of being essentially stuck at home, doing our research from home, has certainly expanded our horizons in terms of what's available online. It's brought more online quicker. But at the same time, people will say, "Oh, Lisa, don't you know you can go in person?" Well, absolutely, but a lot of people can't, for whatever reason, whether it's COVID or just physical situation or proximity to the archive itself. But we do run into challenges when we are working with online data, because again, that is also controlled and can be manipulated. I was thinking back, I was watching a documentary, and in Communist China they destroyed people's genealogies. That's really something on people's minds: how do I protect it when I put my genealogy, my family tree, whatever my research is, into a digital format? I'm backing it up, okay, that's great, I'm trying to protect it, but in a sense I am still putting it out there and it can be manipulated. So what are some of your thoughts about preservation and what people can do to make sure that they can preserve their own history?

Yeah, I would say I fall victim to this as well, where I will think that if I digitize something, it means it's safe, with my grandmother's documents and everything. But oftentimes it's the case that you mislabel a file, or you don't have a backup, or whatever, and it's just as easy for something that's digitized to disappear, or maybe even easier, if you have a computer and it updates and all of a sudden some file types no longer work. So I think being very
disciplined about keeping regular backups of your files, and also keeping physical copies of things when you can, is really important. In terms of the sharing component, what you're saying is definitely an important point as well, which is that oftentimes when we upload our files to these proprietary systems, they're existing somewhere in the cloud, on someone else's servers, so we don't necessarily have as much control over our data as we might like. One example of this where we should all be a bit more vigilant is that if we upload something into the cloud, not to pick on any specific company, but into, say, Google Drive or something like that, there is no real guarantee that 20 years from now everything in Google Drive will be there. So I think having local copies is really important as well.

Yes, I wholeheartedly agree. One of the things I keep pounding away at is: if you want to put your family tree online on Ancestry or MyHeritage, awesome, but you've got to have your own database, and you've got to have that backed up. And even then, I periodically print out a report just so it's all printed on paper, and I put that in a fire safe, because gosh, media and file types have changed so much just in our lifetime. It's unbelievable. Going back to AI, where are you seeing this touching our lives, whether it's on websites or the search engines we're using or the databases for historical collections? Give us a little more of your thoughts about where you see the future of all this going.

Yeah, I think it will continue to be an ever-present part of our lives in various facets. We see it appearing in new places every day. One place where, for better or for worse, it's appearing is the idea of the Internet of Things. So an example might be,
instead of just your phone, you might have your Fitbit or your smartwatch or all of these things, which are also giving you a lot of feedback on your life, but the trade-off, of course, is that there's also data being collected. So I think we'll start to see AI or machine learning being deployed on these kinds of devices and systems to try to gather or process different kinds of data that we might not currently be getting just with the cell phone. In terms of other concrete examples, I would say it already is present, from when we need directions on our phone to healthcare applications, and I think it will just increasingly become a part of the fabric of our lives. I don't know if I have necessarily one concrete place where I think it will really show up more and more, but it is definitely the case that there is a lot of interest and a lot of really cool research going on in this space. So I think the kinds of questions that we've been discussing here on the show are the important ones: trying to weigh the trade-off between the really immense benefits that we get in our lives and the real downsides, whether it's through bias or through data protection, and how we can all be vigilant about those kinds of trade-offs.

Yeah, absolutely. And just one last question on AI. I know that many people are wondering, "Am I interacting with that?" You're by far the expert on this: what kinds of motions in our day, what things do we do during the day, that in a sense are data that can be collected and then used for machine learning and artificial intelligence and other things? Obviously when we're on the web we're clicking things and we're
looking at things and we're pausing at things. What kinds of things do we do on a regular basis that contribute to the data?

Yes, so definitely whenever we're online and we're clicking things or going to new tabs or typing in search queries, that typically is being recorded. If you do a Google search, or if you start browsing on Amazon and you get new products recommended to you, those are all machine learning algorithms on the back end doing that. But really, whenever we're carrying our smartphone with us, there's data there, in that different applications might have preferences set to be able to access your location. Then there are the kinds of smart appliances that you might see, for example a smart TV. Those too are also communicating data on what kinds of things you're watching. If they're plugged into the internet, typically there is some sort of data collection being done. Really, a surprising amount of data is being collected on us on a daily basis, and it's just something for us to be aware of and understand that it is being collected, even though, certainly when I'm clicking around on the internet, it doesn't always strike me that that's necessarily being recorded. But somewhere out there it probably is, by the various companies or tracking services or cookies or whatnot.

Yeah, absolutely. Well, you said you were continuing to work on Newspaper Navigator, and this was one of the big questions I got after the show that I did: are they going to add features? We touched on some of the types of things that you might be able to do. How much time do you have in your busy day? I know you're going for your PhD and you're involved in a lot of projects. What kinds of things, and how soon, might we see added to Newspaper Navigator?

Yeah, great question. With Newspaper Navigator, moving
forward, I will definitely be incorporating new features, and this will be deployed primarily through the University of Washington now, which is where I'm doing my PhD. The version that you're seeing on the Library of Congress site, on labs.loc.gov, will be maintained for about two years in its current state. But in terms of my own research, we'll definitely be incorporating new functionality, and when I do, I can definitely reach out to you and let you know when we have new features, if you'd be interested. I'd say those are probably on the time scale of months, not years, in terms of when I'll be rolling them out. I will definitely be iterating on different kinds of new features and functions and seeing how people are interacting with them. But certainly feedback on the application is always welcome, from very specific things, like being able to apply AND or OR operators in the keyword search functionality, to pie-in-the-sky ideas about how you want visual content. All of that is very useful for me. Ultimately, I do want my work to be very user-centric, in the sense that I really do believe the ultimate goal of these tools is to provide features that make them more useful to the viewers or whoever is using the tool. So I'm really trying to serve them first, and that kind of feedback is very valuable.

Awesome. Well, I'd love to see you be able to give it, not give it, sell it to Google Books. I would love to see it in those kinds of collections. It's just amazing how much is out there.

And one thing I will add here, sorry, is that all of the code for the project, and the dataset itself, is in the public domain, so it all belongs to the American people. On the site, if you click around, you'll be able to get to the code, but
the goal of that really was to try to make it as open and accessible as possible, and so we're really hoping that other cultural heritage institutions or other groups leverage the code or try to build on it to produce their own versions of the search application. So it's not only something that is available; we're actively hoping to see engagement with the code and to track longitudinal use, in terms of hopefully seeing other tools like Newspaper Navigator spring up from the code that we've made available.

Oh, that'd be fantastic. The Library of Congress has the Labs. Are you going to be involved in any other future projects there, or was that kind of a one-time event?

Yeah, great question. I love working with the Labs team. They're all incredible, and I would highly recommend everyone check out some of the other projects that they've been working on. While I was there, the other Innovator in Residence, Brian Foo, did an incredible project where he effectively grabbed audio clips from throughout the various audiovisual collections of the Library and made a hip-hop sampler, so you can put drum beats to clips, and it's a really, really cool way of searching the audio content at the Library. But I think it's just one example of many different projects that people are pursuing with Labs to really push the envelope on how we access materials and think about searching them. I would certainly love opportunities moving forward to continue working with Labs, and luckily, with Newspaper Navigator, I have an excuse to be in contact with them well into the future. But I would definitely recommend that everyone take a look at their website, which is labs.loc.gov, and poke around at some of the really cool projects that they have going on.

Yeah, I did a future technology and genealogy presentation a week or so ago, and one of the things
i was talking about was that now we can search for um words in podcasts which is huge it's not just the title and the description but we can actually search and see that you know this episode has somebody talking about something we're interested in would love to see this kind of uh search coming into audio and video so it doesn't rely just on the text associated with it but it's actually digging into the the audio itself that'd be amazing well anything else uh that you are up to that we should look for or would you like to share your website and have us visit you oh sure yeah that would be great um you can find my website it is bcglee.com um you can follow me on twitter i believe my handle is lee underscore bcg um i'm pretty active there but definitely you know anyone feel free to reach out by email or over twitter um you know it's always a real privilege to be able to talk to people who are using newspaper navigator who are just curious um but anyway i'm just so appreciative of the response that uh you know from you and your viewers with newspaper navigator and so glad that it's helped people you know find photos that they've been interested in and whatnot and so definitely looking forward to you know hearing from people moving forward we want to thank you for the kind of work that you're doing with this because um it's exciting to see and we we love when there's new opportunities in places that are tried and true that we've already visited once like chronicling america and then to see all new opportunities there is really cool ben thank you so much such a pleasure talking with you you're from uh my neck of the woods i lived in tacoma for many many years so you're you're a little further north but uh yeah you know moved out to seattle a couple of years ago now and i've loved it i'm actually currently on the east coast right now i'm just uh staying with my parents due to covid but um i love it out in seattle and i've been i'm really glad that i've had a chance to 
explore the pacific northwest it's lovely bring an umbrella right [Music] all right thanks so much ben great talking to you thank you i really appreciate it [Music] thank you so much to ben for a really interesting conversation um and for newspaper navigator gosh if you haven't checked it out yet you got to do it go watch episode 26 you'll get the the full rundown on it and it's a wonderful tool i am really looking forward to hearing from him about his updates and his improvements that sound like they're going to be coming not too much in the distant future you know covering technology and its application to genealogy is always a bit of a double-edged sword and it can be really awesome but also a little problematic in its invasiveness i know some of you guys were talking about that in the live chat you know ben said a surprising amount of data is being collected on us which is probably even an understatement and another question to have and that we can talk about now is how our data and specifically our genealogy data is being used with all this collection what's being done with it and who's using it i mean tools like website hints like the ancestry and myheritage and findmypast hints that we get those are great the newspaper navigator google lens they all have a lot to offer and they're all using in some form artificial intelligence but you know on a personal level i'm actually a bit of a skeptical person and i don't tend to take things at face value i'm actually really concerned about the long reaching effects of artificial intelligence on the future and most importantly on our descendants my descendants and my grandkids ai can seriously impact our privacy our security and even our freedom and we're currently experiencing kind of a cancel culture today that i think everybody is is kind of deeply concerned about i would never have imagined in my lifetime that we would see some of the censorship that we're seeing today what this really tells us is that big 
tech in particular has a lot of power and they among others are using our data and want to use our data so i think this is definitely a conversation worth having i did a lot of research for this episode and i discovered a few things i wanted to kind of share with you artificial intelligence is really from everything that i'm seeing having the same kind of massive and disrupting impact that dna has had on our genealogy but with almost none of the publicity it's really interesting you know what we found was we were doing dna testing as genealogists kind of on the forefront of that and um looking at all our matches and then we come to discover that you know we put them into a public database that we think only genealogists are using something like gedmatch come to find out that criminal investigators figured out wow there's some value there in that data and that's not surprising but it wasn't necessarily on the forefront of most of our minds well there's been a lot of debate about that and um you know the fact that dna results really are a valuable commodity for lots of different purposes not even just criminal investigations um so there's been a lot of hoopla about that but that got me wondering too you know who else might be interested in other kinds of data that we generate because like it or not genealogy is a fairly personal thing to be doing right so what kind of data do we have well in addition to our dna results we have family tree data we actually have the data that's generated and all the activity from all of our genealogy research well who else might be interested in that well academic researchers they have a very keen interest in it and i can only assume that our data also has value in the marketplace as well so i want to give you one example um it's the record linking lab at brigham young university now this came up a lot in the research that i was doing so this record linking lab of course at a university it's run by a byu economics professor and 
they just they published a research paper it was co-authored with uh not a byu professor it was an economics and women's studies professor out of notre dame so in them looking at record linking and genealogy data it was not about genealogy at all it was about economics and social issues and maybe even political issues you know who knows so i wanted to read to you some of the highlights from this published paper that kind of came up in my research because i think it gives us a pretty um eye-opening and close-up look at the value that those outside the genealogy community are placing on all of this massive genealogy data that we're collecting as genealogists and creating about our ancestors and of course just like with dna when we find information when we collect it when we put it out there we're actually putting out information about ourselves as well because we're related to them so and we're doing this publicly we're putting out there you know publicly available we don't tend to think of people who are not genealogists as looking at that data or being interested in it but they are so keep in mind as i read some of the excerpts from this paper which i don't normally read to you but i think this is worth doing um that there's kind of the story behind this is that if you think about it this way historical records birth records death records military information the census they've always been available to let's talk about the academic realm because that's where this paper comes from they've always been available to academic researchers but really on an individual basis each record was kind of standalone and they really couldn't follow those records for individual people and families and draw and really generations of families to see trends right they really couldn't look at these records and see these trends through generations of time and draw conclusions from them they were really stuck with individual one by one records you know the 1920 census and the family in 
that census but for the first time artificial intelligence is making it possible to link these records for real living people and past living people and their families because again we're all connected for social science and economic uses so let me just read you a couple of highlights because i think this is really interesting it definitely gives us a peek into perhaps ways in which people are interested in using the data that you and i are generating so the paper is called combining family history and machine learning to link historical records and they start out by saying that the ability to confidentially or excuse me not confidentially confidently link individuals across u.s census records opens up opportunities for important social science research for many of the most pressing questions in the social sciences empirical analysis relies on access to data that allow the researcher to observe people at different points in their life or across generations in this paper we propose a new approach that can be used in conjunction with other methods to link individuals across census records so they started with census records we focus specifically on a novel way to create large training sets for machine learning that can be used with supervised machine learning algorithms so they say our approach makes use of linkages created by individuals conducting family history research that's me and you the key feature we exploit is that when the profile for a deceased individual on one of these websites has multiple sources attached to it each pair of these sources can potentially be used to train the data to make new matches now keep in mind they're not talking about hinting matches they're talking about matches to create this greater larger interconnected database that can then also be used to further accelerate machine learning the genealogy platform we use for this study is family search we use a sample of individuals from this family tree that are attached to at least two 
census records between 1900 and 1920. so this is interesting it's talking about you know when we use these different sites whether they're free or subscription you know we read the terms of service or we try to it's kind of a lot of reading but um i haven't heard about any of this kind of stuff in terms of coming from these sites we don't often hear in what other ways our data as we add it to the site might be used so they talk about this in this paper family search provided us a file with a personal identification number it's called a pid for the individual profile on the family tree and for the census record that allows us to observe these matches this process produces a large detailed and highly representative training set training data which is training the machine learning plays a key role in supervised machine learning algorithms and lack of training data has been one of the main barriers to using these methods to link historical records this is the problem they've had in the past a key contribution of our paper is the insight that the decisions that are made in the process of doing family history research that you and i are doing on various genealogical websites can provide an additional source of training data this can provide a relatively low cost way to create very large training sets the family tree is a public wiki style resource and as you know the family search tree is one big global tree we don't each have our own individual tree so it's possible for outside researchers to use the family search api to obtain similar training data directly from the family tree there are also other websites that have public trees for which data could potentially be gathered using automated approaches you know automating pulling that data off the trees the public member trees on ancestry.com could also provide a larger training set one way to achieve a high level of accuracy in the training data would be to focus just on those public member trees created by 
professional genealogists and we're currently working with ancestry.com to provide a training set that could be shared with academics through the same system process at the university of minnesota that provides access to the census data files so they may be working with these different companies and sharing even further we'll be working with family search to create a specific email campaign in which we share predicted links that have been identified with experienced users and alert them to the fact that they should be extra careful about these record hints because they might not be correct the precision threshold that family search uses normally for record hints is 95 percent once the email campaign is sent out we'll be able to observe the decisions that these users made with regards to our record hints in terms of whether they decide to attach the record or indicate that it was not a match so that got me wondering do we ever get emails that prompt us for different activities and we don't realize necessarily that there's another organization involved or that they will be as they said here observing or monitoring the activity that we do after we interact with it it's a good question so when it gets back down to their conclusion area they said there's a huge research potential from being able to link large samples of individuals across historical records clearly there is our paper provides a unique contribution by focusing on a source of training data that was largely untapped by the research the academic research community what we propose in this paper is that these pairs of records attached to the same person be used as training data and combined with supervised machine learning algorithms to link individuals across historical records one advantage of the source of training data that we use is that the same process could be applied to any two types of records that are available on various genealogical platforms military information birth and death 
information that kind of thing so i hope you appreciate in this when they talk about you know what this creates it's something that's really never existed before it's and who knew that our genealogy research each day would have this kind of impact um and it's a lot to think about in fact um i kind of came up with a few ideas here that i wanted to share with you about this um so we can see that there's a real application for them in using um the information and feeding it into machine learning and for them it's it's about the research the economics the social issues it's not genealogically driven and as you heard genealogy companies are working with them and probably other groups as well this is just one example of one published paper that i found on the web working with them working with the various genealogy website companies and sharing the information that we provide with these other interested parties in a couple of different ways now if you might recall in using family search several years ago you didn't have to have a family search account right you could just go and use the free website and now you do so they've kind of gotten over one hurdle which is we now have a unique identifier as a researcher we've uniquely identified ourselves with every interaction that we do with each website now one of the problems that these researchers had in the past when they talked about the census was that the census did not indicate social security number so while you found individuals you could never be totally sure you didn't have a unique identifier for each person so now as a researcher we have one with the websites and family search now as they said in the paper assigns a personal identification number or pid to each person in the global family tree so this supports the use of our information outside of genealogy research for sure this solves a lot of problems for them and it makes this data even much more shareable and much more valuable there's definitely 
reasons within genealogy that that pid would be interesting too because you're trying to say we're trying to identify this one person but it's interesting to think about that like all other technology sometimes we see a tool come along or we see a functionality come along and we think it's for our benefit but who knows it could be for a variety of different reasons and benefits uh the the record linking lab is just one partner with family search and now they're working with ancestry or we're talking about working with them but i think this just gives you a little glimpse into this world of artificial intelligence how it can accelerate the use of our data how it can actually turn it into new pieces of information that were never available in the past and that there's a lot of interest in it so it's a lot to think about i think it's really important that we're aware of and understand how our data is being used well outside of our own genealogy sandbox i know it was news to me as i did this and i and i don't have time to cover everything that i came across but uh it's easy to research yourself and i encourage you to do your own homework learn more about it learn more about that and that will help you continue to make educated decisions about what you do with your data what goes public and what doesn't and you can always change your mind as well so we don't have to continue we can pull data down we can move it we can do whatever we want to do we're in the driver's seat and that's my message to you i want us all to be aware and eyes open and getting the most out of our family history research for our own families i hope that you have found today's discussion thought provoking it's been thought provoking for me it's certainly been an interesting week of research i love the conversation with ben uh check out the show notes for this um we have a web page of show notes uh just for this particular episode which is number 32. 
you can find it at genealogygems.com slash elevenses get lots more information i'll have the name of this paper so that you can go out and find it yourself and read the complete paper my plan is also i'd like to put together a quick survey i want to get your ideas about ai and your genealogy data and how you feel about these things and what you'd like to see and what your what your concerns are so look for that survey on the web page and take that i'll be sharing the results here in an upcoming episode my thanks to ben for sharing his thoughts and expertise and this hour has just flown by i'm gonna be checking out the chat and answering any questions that you have there as best i can in the show notes for this episode so again you'll find those on the website in about a day or so from this live event and i do want to hear your thoughts so if you have a question or comment uh genealogygems.com sorry genealogygemspodcast at gmail.com i've only been using it for 14 years i should know it by now you can also call and leave a voicemail on the voicemail line 925-272-4021 and most importantly leave a comment here and or a question here and we'll take a look at it whether you're watching live today as soon as this video is over there'll be a comment section below the video here on youtube or a comment section on the show notes page so with that uh hey get out there have a wonderful week tell your friends about the show give some thought to the role that ai is playing in your genealogy life and my friend find some wonderful genealogy gems this week okay thanks so much for joining me i'll talk to you soon [Music]
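Editor's sketch: the record-linking idea read from the paper in this episode is that every pair of census records attached to the same family tree profile can serve as a training example, and that users' decisions on predicted links (attach, or mark as not a match) can then be scored. A minimal Python illustration might look like the following. The PIDs, record IDs, decision labels, and function names are all hypothetical, and reading the quoted "95" as 95 percent is an assumption; this is not the paper's or FamilySearch's actual code.

```python
from itertools import combinations

# Hypothetical data: tree profile PID -> (census year, record id) pairs that
# family history researchers attached to that profile.
attachments = {
    "PID-A": [("1900", "rec-1"), ("1910", "rec-2"), ("1920", "rec-3")],
    "PID-B": [("1900", "rec-4"), ("1920", "rec-5")],
    "PID-C": [("1910", "rec-6")],  # a single source yields no pairs
}


def training_pairs(attachments):
    """Return (pid, record_a, record_b) for every pair of records on one profile."""
    pairs = []
    for pid, records in attachments.items():
        # Each pair of sources on the same profile is treated as a known match.
        for first, second in combinations(sorted(records), 2):
            pairs.append((pid, first[1], second[1]))
    return pairs


def precision(decisions):
    """Share of predicted links that users confirmed by attaching the record."""
    if not decisions:
        return 0.0
    confirmed = sum(1 for decision in decisions if decision == "attach")
    return confirmed / len(decisions)


# Assumed reading of the threshold quoted in the episode: 95 percent.
PRECISION_THRESHOLD = 0.95

if __name__ == "__main__":
    for pair in training_pairs(attachments):
        print(pair)
    observed = ["attach", "attach", "not_a_match", "attach"]
    print(precision(observed) >= PRECISION_THRESHOLD)
```

In this toy data, a profile with three attached census records contributes three training pairs and a profile with one record contributes none, which is why the paper's sample is limited to profiles with at least two census records attached.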
Info
Channel: Lisa Louise Cooke's Genealogy Gems
Views: 2,087
Rating: 4.9757576 out of 5
Keywords: lisa louise cooke, elevenses with Lisa, artificial intelligence, privacy, AI, machine learning, AI vs. Machine Learning, data collection, genealogy gems
Id: 2WeEYQWFiUM
Length: 54min 7sec (3247 seconds)
Published: Thu Nov 05 2020