Future directions in biodiversity monitoring using citizen science data

Hello everyone, and welcome to this RSS meeting organised by the Environmental Statistics Section of the Royal Statistical Society. The plan for the meeting is what you can see on your screens right now: we have three talks, roughly half an hour each, including time for questions. The meeting is being recorded, in the hope that we will be able to release the recording. I don't want to take up too much more time from our first speaker for today, who is Ali Johnston. Ali is the Assistant Director of the Center for Avian Population Studies and the Ecological Data Science program leader at the Cornell Lab of Ornithology in the US. If you want to ask questions, please post them in the chat and I can read them out, or you can raise your hand and I will invite you to unmute yourself at the end of each talk. So I will stop here. Ali, if you want to share your screen, start whenever you're ready.

Hi, good morning everyone, or good afternoon, depending on where in the world you are. It's a great pleasure to be here today. I'm going to be talking about some work I've been thinking about recently, along with Eleni and Emily, who are both here today, on outstanding challenges and future directions in the analysis of citizen science data: the things we think could make the biggest difference, if they were solved, to our ability to leverage the knowledge and information within community science data.

You'll notice I'm calling it community science data rather than citizen science. Both terms are in use now, but in the States at least there is an increasing use of community science, because some participants feel that the term citizen is exclusionary. So we're moving towards that term, but I do sometimes use the two interchangeably; they mean the same thing in how I'm using them. And I work at the Cornell Lab of Ornithology, so as you may guess, all the examples I'm going to show are birds. I have heard there are other species in the world, and we can survey other things with citizen science data too, so apologies that it's pretty bird-focused.

As many of us in this room are already aware, there is huge growth in the amount of community science data in the world. This is a chart of the number of records within GBIF, which collates data from many different projects around the world, and right now it holds over 1,500 million records of different plants and animals. That represents a potentially huge amount of knowledge we could gain about ecology and about the world. Alongside this increase in the amount of data available, we also see an increase in how these data are used in science. This is a Web of Science search for papers with citizen science or community science as a topic, and over the last 10 years there has been a rapid increase in the use of these terms in scientific papers. So not only do these data exist in the world; they are making a massive difference to our understanding of the world around us.

To give one example of what we can learn, I'll show a few slides of analysis we've done with eBird data. I'm sure many of you have heard of eBird: a global bird community science project where people choose to go birding where and how they want, and we collate all those lists of species observations together. Right now the database contains 1 billion bird observations from 70 million checklists, contributed by over 700,000 observers, so there is an immense amount of information here, and we can learn a huge amount about birds from it.

To show one example, this is the Spotted Towhee, a type of sparrow that occurs in the west of the US. This is its range map: the breeding and non-breeding areas the species migrates between, and an area where it is resident all year round. That's what we know about the species' distribution from a range map, but what more can we learn when we analyse eBird data? It turns out, quite a lot. This is an estimate of the relative abundance of Spotted Towhee across its range in the first week of January 2019. The areas of dark purple are where the species occurs at highest density, and the yellow areas are where it occurs at lowest density, and we can see how the species moves through its annual cycle: the eastern population migrates northwards to its breeding range, although some individuals remain in the south, and then migrates south again for winter, while the west coast population is present all year round. So there are two pieces of information here, over and above the traditional range map: we learn about the relative abundance of the species in different places, and we understand its annual movements at a very fine temporal scale, this weekly scale showing how the population is moving. Both can be really useful in helping us learn about these birds and think about how to conserve them.

We can also summarise relative abundance across the breeding season for this species. We can think of that as similar to the range map, but again with information on relative abundance, and at a much finer spatial scale, really showing in detail across the landscape where this species occurs and where it doesn't. As well as looking at how the density of the species changes in space, we can look at how it changes in time. This next slide is some in-progress work on trends for this species during the breeding season. On this map, the size of each dot represents the density of the population during the breeding season, and the colour of the dot represents the population change over a 12-year period. On the west coast, the resident population occurs at high density, because those dots are large, but it is also declining rapidly, at three to four percent per year in many of those locations, while the migratory population in the east occurs at lower density but is increasing. So, using this amazing resource of community science data, we can learn a great deal about this species at fine spatial and temporal scales, understanding its dynamics within and across years in ways that were not possible without this kind of data. That's a brief example of the kinds of things we can do, to motivate why we care about finding better ways to analyse community science data: it can teach us so much, if we can analyse it robustly.

Okay, that was a little intro. Now I'll take a step back and ask: what do we mean when we say community science data? What are the different types? As we're aware, there is a spectrum. Imagine we have a community of birds, seven birds in this community, and ask what kind of information we get when people record these species. One type is someone recording an individual observation: "Hey, I saw a Barn Owl." Alongside that observation we might have the location and time at which it was recorded. The next level up is a list of species: maybe someone reports "I saw a Barn Owl and a Redshank", and we might have metadata on location, time, and perhaps duration and number of observers; so we're starting to combine observations together into a list. The next level up is similar, but this time all species and all individuals seen are reported, so the sparrows missing from the second list are reported too. And the last type is where observers have conducted a predetermined type of survey, with a predefined protocol, location, and time; they're not choosing where to go or how to birdwatch, but following a protocol to contribute to a more structured survey.

Some of the names we give these categories: on the left is presence-only, because we only know there was a Barn Owl there, not what else was absent; we just have this point record of a Barn Owl. The lists in the middle we can think of as incomplete and complete, complete because all species and individuals the observer could detect and identify were reported, so we know no observer preference comes in; the data in the middle are also sometimes called semi-structured. So we see a spectrum from unstructured on the left to structured on the right, and of course many projects don't fit neatly into one of these boxes, but it gives an impression of roughly the kinds of data we're working with. Traditionally we think there is a trade-off between the amount of data and its structure: the more observers we have, the less structure, and a very structured survey won't have many observers contributing to it. But actually I think we have a situation more like this: if we compare (blocking out the middle for a moment) a survey on the left, with high structure and few observers, all data coming from a predefined protocol, against the right-hand side, the right is not all one thing; there is a spectrum of types of information contributed by the projects that contribute the most data. So it's not as simple as more observers meaning lower quality; more observers means variable quality, and that means we have to think more carefully about how to analyse the data in order to make the most of it.

When we were thinking about the outstanding challenges in analysis, this is the top-ten list the three of us came up with, and we'd love your feedback: feel free to put something in the chat or contact us afterwards. These are the things we think could make the biggest difference, if they were solved, to our ability to leverage the power within community science data. I don't have time to go into all of them today, you'll be relieved to hear, so I'll focus on a couple that I've spent some time thinking about over the past few years: number three, reporting preferences, and number four, observers. We'll start with observers.

With community science data we often have a lot of variation, both within observers and between observers: in how they survey, in what they survey, and in their motivation. One of the analyses we conducted to characterise the differences between observers looked at how many species they record. This is data for a single observer: on the x-axis, the duration of their checklist; on the y-axis, the number of species they report. We can think of this as a kind of individual species accumulation curve. And this is the data for a second individual, who reports, on average, many more species in a single hour than person A. This simple analysis helps characterise some of the differences between observers, and when we do it for every observer within a region, we see huge variation: for a given one-hour checklist, we expect different observers to record very different numbers of species on average. We called this estimate of the expected number of species detected in one hour the checklist calibration index (CCI). It helps us understand and account for many of the differences between observers by calibrating them against each other, and it means we can use everybody's data: even someone like me, who's not a very good birdwatcher, has usable data, because it is associated with a checklist calibration index.

When we account for these differences between observers, we get stronger results: our models fit better and give better estimates of where species occur. This is an example of the difference between a model that included CCI and a model that didn't. Where it's pink, the model with CCI estimates a higher occupancy rate of chickadee; where it's blue, the model with CCI estimates a lower rate. You can see that if we don't account for the differences between observers, we get an estimated species distribution that is spatially biased, because we're not accounting for important heterogeneity in the observation process. And this is with an occupancy model, which already does a pretty good job of accounting for some of these differences; if we're not using an occupancy model, the difference between estimated distributions with and without observer differences is much, much greater. So this can be a really important part of estimating species distributions.
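As a rough illustration of the checklist-calibration idea just described, here is a minimal sketch of how a CCI-style score could be computed; the Poisson-GLM form, the fixed-effect treatment of observers, and all column names are assumptions for illustration, not the published eBird method.

```python
# Sketch: a checklist-calibration-index-style score per observer.
# Model form and column names are illustrative, not the published method.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
observers = [f"obs{i:02d}" for i in range(50)]
checklists = pd.DataFrame({
    "observer": np.repeat(observers, 40),            # 40 checklists each
    "duration_hrs": rng.uniform(0.25, 4.0, 2000),
})
skill = dict(zip(observers, rng.normal(0.0, 0.4, 50)))   # latent observer skill
mu = np.exp(1.8 + 0.5 * np.log(checklists["duration_hrs"])
            + checklists["observer"].map(skill))
checklists["n_species"] = rng.poisson(mu)

# Poisson GLM: richness grows with checklist duration; the observer term
# absorbs between-observer differences (a random effect would be the natural
# refinement when many observers contribute only a few checklists).
fit = smf.glm("n_species ~ np.log(duration_hrs) + C(observer)",
              data=checklists, family=sm.families.Poisson()).fit()

# CCI-style score: each observer's expected species count on a 1-hour list.
grid = pd.DataFrame({"observer": observers, "duration_hrs": 1.0})
grid["cci"] = fit.predict(grid)
print(grid.sort_values("cci", ascending=False).head())
```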
But we also know there are differences not only between observers but within observers: we found in a previous analysis that observers learn as they contribute. This is an estimated individual species accumulation curve for a single observer, and as they contribute more checklists, their expected number of species within an hour, or within three hours, goes up. So individuals themselves are learning, which is often a goal of community science: by participating, people engage more with nature and learn, and we're showing that here. But it also means that when we use these data to understand species, we need to account for this additional element of heterogeneity that arises as observers learn.

Another way to visualise this within-individual change is to look across time at how the expected number of species recorded per hour increases, now not by checklist but by year. An interesting result appears when we look at the expected number of species per hour across the whole population of everyone contributing to eBird within this region: we actually see a decrease in the number of species reported. How does this apparent paradox occur? It's because the types of people contributing to eBird change over time. If we take an individual random effect for the expected number of species an observer will see per hour, and plot how it changes across time, we see that the people who joined early, in 2002 when eBird started, tended to have a fairly high observer random effect, while the people who joined later tended to have a lower one. As the project expands and grows, the people joining as new observers tend to be a little less experienced than those who joined at the beginning, who were probably already enthusiastic and experienced.
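To make the apparent paradox concrete, here is a small simulation under invented numbers: every individual observer improves with experience, yet the population average declines because later cohorts start from a lower baseline.

```python
# Simulating the within- vs between-observer paradox: every individual
# improves with experience, yet the population mean declines because
# later-joining cohorts start from a lower baseline. Numbers are invented.
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(2002, 2021)
rows = []
for join_year in years:
    n_new = 50 + 20 * (join_year - 2002)        # the project grows over time
    baseline = 20.0 - 0.4 * (join_year - 2002)  # later cohorts start less skilled
    for _ in range(n_new):
        skill0 = baseline + rng.normal(0, 3)
        for year in years[years >= join_year]:
            # individual learning: +0.5 expected species/hour per active year
            rows.append((year, skill0 + 0.5 * (year - join_year)))

rows = np.array(rows)
for year in (2002, 2010, 2020):
    vals = rows[rows[:, 0] == year, 1]
    # every individual trajectory rises, yet this population mean drifts down
    print(year, round(vals.mean(), 2))
```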
So we have these complex differences, both within and between observers, across the whole pool of people contributing data. And this is just one project: in a dataset like GBIF, which collates information from many different projects, there is even more complexity in understanding how the population of observers is changing and how individual observers are changing. To make the most of community science data, it's really important that we consider these things: the people who are contributing, and what information they are telling us. One additional thing to be careful of is that people freely contribute this information because they enjoy taking part, and many don't want to feel graded. So we need to be very sensitive, when approaching this analytically, that we don't lose the trust of the community of people contributing these data. We want to be robust analytically, but also sensitive to the fact that this is a community, and these are individuals contributing their data to science who may not want to feel that a score is attached to their name personally. That adds an extra degree of complexity when thinking about how best to analyse these data.

The other of our top-ten challenges I want to talk about today is reporting preferences, and again this relates to observers, because we have differences in reporting preferences both within and between observers. Many of us have species we really love to see or get excited by, and maybe other species we're a bit less excited by. So if we go out to record species, we might report that we've seen a puffin and kind of forget to report that we saw a hundred sparrows; probably more like 400 here, I guess. This is another aspect of community science data we need to consider when thinking about the analysis. Going back to the four types of data I showed you earlier: these reporting preferences should mostly affect the data in the first two categories; the second two categories should show less influence of reporting preferences, because people are reporting all the species they detect and identify.

I'm going to take a little tangent here and talk about reporting preferences assessed not from community science data, but from another source of information on which species people might be more interested in: Google. We all know those secret questions we type into Google that we don't want other people to see; Google knows all our secrets. It knows whether we search for a pigeon or a puffin. So we used information from Google on the bird species people were searching for to try to understand preferences. This is for two species across the US, showing the density of their Google searches by state, which we can compare to the encounter rate of these species in eBird data. For both, we see high congruence between where people are searching for them online and where they are encountering them in eBird. But this didn't hold for all species: for Barn Owl and for Forster's Tern we saw large differences between where people were searching for these species and where they were encountering them on eBird.

So we combined these data and came up with two axes that describe how people relate to these species. On the x-axis is interest relative to species distribution: how aligned, or congruent, the two maps are, the Google search map and the eBird map. On the y-axis is interest relative to species abundance: was the species searched for more, on average, than it was encountered, or less? Each dot represents a different species. A lot of the species near the top were large, iconic, easily identifiable species, whereas many near the bottom were what some birders call "little brown jobs": small brown birds that are apparently of less interest to people typing species names into Google.
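A minimal sketch of how the two axes could be computed from state-level search and encounter tables follows; the congruence and relative-interest measures below are plausible stand-ins for the ones used in the analysis, not the published definitions, and the data are invented.

```python
# Sketch of the two "interest" axes: x = congruence between where a species
# is searched for and where it is encountered (correlation across states);
# y = how much it is searched relative to how often it is encountered.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
states = [f"S{i}" for i in range(48)]
species = ["bald_eagle", "barn_owl", "song_sparrow"]   # illustrative names

search = pd.DataFrame(rng.random((48, 3)), index=states, columns=species)
encounter = pd.DataFrame(rng.random((48, 3)), index=states, columns=species)

axes = pd.DataFrame({
    # interest relative to distribution: spatial congruence of the two maps
    "congruence": [search[s].corr(encounter[s]) for s in species],
    # interest relative to abundance: each species' share of all searches
    # versus its share of all encounters (log ratio)
    "rel_interest": np.log(search.sum() / search.values.sum())
                  - np.log(encounter.sum() / encounter.values.sum()),
}, index=species)
print(axes)
```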
We also decided to model this, to see whether we could detect differences analytically. We used seven traits and looked at whether they significantly affected where species fall on this plane; this is an average passerine species, as a reference point. Three variables had a significant effect on where species occurred. One was body mass: larger species tended to be near the top of the graph, with smaller ones near the bottom. The second was federal protection status: federally protected species were more likely to have high interest relative to their abundance, and to have their interest more aligned with their actual distribution, so a lot of local searches online. And the third was whether the species is a professional sports team mascot: ravens, Baltimore Orioles, cardinals. Some of these species are used by professional sports teams as their logo, and they tended to have high interest relative to their abundance, and interest unrelated to their distribution.

Bringing this back to the community science world, one way we thought we could assess reporting preference is by comparing incomplete to complete checklists, because incomplete checklists should be influenced by species reporting preference, whereas complete ones should not. On the x-axis here we have species prevalence on complete checklists: how often the species is encountered by people participating in community science. On the y-axis is the ratio of incomplete to complete checklists: where it's higher, the species is being reported more on incomplete checklists than would be expected, and where it's lower, less than would be expected by chance. If we were in a room, I'd ask you now to predict what this shape will look like, so imagine in your head what you think the relationship is going to be. What we found was a negative relationship, where each point is a species: rarer species are reported more often on incomplete checklists than would be expected, and commoner species less often than expected, on average. So rare species are maybe more interesting to people participating in eBird, which we detect through this high ratio. And looking at where different species fall, we again see some of the large iconic species with positive residuals, and other, maybe smaller brown species, or gulls, which might be less interesting, with negative residuals. So as well as an effect of species prevalence, we again see some of the traits that affected the Google searches coming out here, and we do see a weak but positive correlation between these residuals and our Google popularity measure.

So these are the ten things we came up with. I've only very briefly talked about some of the challenges we have with reporting preferences and observers in analysis, but we think these ten things can be tackled in three different ways: first, by thinking about how we collect data; second, by using existing methods on more, and more varied, community science datasets; and third, by developing new analytical methods that can help us address them. Thank you very much; happy to take questions.
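A minimal sketch of the incomplete-versus-complete comparison described in the talk: for each species, its reporting rate on incomplete checklists relative to complete ones, alongside its prevalence. Column names and the toy data are illustrative.

```python
# For each species: prevalence on complete checklists, and the ratio of its
# reporting rate on incomplete vs complete checklists. Ratio > 1 means the
# species is over-reported on incomplete lists (a reporting preference).
import pandas as pd

def reporting_ratio(obs: pd.DataFrame) -> pd.DataFrame:
    """obs: one row per (checklist_id, species) detection, with a boolean
    'complete' flag that is constant within a checklist."""
    n_lists = obs.groupby("complete")["checklist_id"].nunique()
    rate = (obs.groupby(["species", "complete"])["checklist_id"]
               .nunique()
               .unstack(fill_value=0)
               .div(n_lists, axis=1))
    return pd.DataFrame({
        "prevalence": rate[True],           # reporting rate on complete lists
        "ratio": rate[False] / rate[True],  # incomplete : complete
    }).sort_values("prevalence")

# Tiny invented example; expect charismatic rarities to sit above ratio = 1.
obs = pd.DataFrame({
    "checklist_id": [1, 1, 2, 2, 3, 3, 4],
    "complete":     [True, True, True, True, False, False, False],
    "species":      ["sparrow", "puffin", "sparrow", "owl",
                     "puffin", "owl", "puffin"],
})
print(reporting_ratio(obs))   # sparrow under-reported, puffin over-reported
```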
Thank you very much, Ali, for an excellent talk. I can see people already giving you virtual rounds of applause; please keep doing that. If you have any questions, pop them in the chat or raise your hand. I can see at least one raised hand already. Karine, is that the name? Yes? Okay, go ahead.

Thank you. Thanks, Ali, that was a very interesting talk. I had a question regarding the reporting preferences: I was wondering how you could account for them in your species distribution models, or models like that, using eBird. So far you presented the bias that could come from those preferences, but I was wondering how you expect to account for it.

Yeah, that's a great question. Partly we're listing those ten as challenges because they're not solved yet. The main way we deal with it in eBird analyses is by only using complete checklists: the analysis I showed you used only the complete ones, which reduces, and hopefully almost eliminates, the impact of the reporting-preference bias on the data we use. But I think if we could, as a community, think about better ways to analyse incomplete data and to build in reporting preferences, that would open up a lot more types of data we could use, and it might help us learn about species in parts of the world where we don't have many complete checklists. I have a few ideas, and I'm sure you do as well, and I think tackling some of these challenges as a community is really going to open up possibilities.

Okay, thanks for the question, Karine. I suggest we move on, but there will be an opportunity after the end of the last talk for people to ask more questions; and I see more questions appearing in the chat, which Ali can try to answer there. So we'll move on to our next speaker for this afternoon, Michael Pocock, a senior ecologist at the UK Centre for Ecology & Hydrology. Please, everyone else, mute your mics, and pop your questions for Ali or Michael in the chat, or ask them at the end of the talk.

Okay, and you can hear me, can you, Eleni? (I can hear you perfectly, and I can see your slides.) Brilliant. Thank you so much. It's always a joy to give a talk following Ali, because she introduces the subject so very well, and I knew she would do a good job. In terms of the philosophy of this talk, I wanted to take a slightly different, bigger-picture angle on the role of statistical thinking within this citizen or community science approach. I'm part of the Biological Records Centre, based at the UK Centre for Ecology & Hydrology, where I work with a wonderful bunch of colleagues, and some of what I'll talk about draws on their work. Within the Biological Records Centre we have a long history of recording: the Centre is about 60 years old, and recording in the UK goes back a lot further than that. You can see here some of the advances that have occurred with plant monitoring in the UK, some of them really quite fundamental: for example, the Atlas of the British Flora in 1962 was doing really innovative work at the time it was produced, all the way through to apps and the like nowadays.

Now, one of the challenges, as Ali talked about, is the issue of what we might call spatial bias within datasets. Here's an example of a dataset from a study run out of the Biological Records Centre. If we zoom in and look at the patterns of distribution of those records, shown in black in that bit of southeast England on the right, you can see that it almost perfectly correlates with human population density. So there seems to be some sort of problem there. I then worked with a whole bunch of interesting colleagues, some of whom I know are in this meeting, and got data on 20 projects from England and France, looking at the spatial distribution of the records. We used a Poisson point process model, relating the distribution of records to four different things: human population, the accessibility of a location, the lack of deprivation (based on a human deprivation score, so, I suppose, a measure of relative poverty), and the presence of nature reserves.
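A common way to fit a Poisson point process model of this kind is to discretise space into grid cells and fit a Poisson regression of record counts on cell-level covariates. This sketch assumes that discretised form, with invented data and the four drivers named in the talk; it is an illustration, not the study's actual code.

```python
# Discretised Poisson point-process sketch: count records per grid cell and
# relate the intensity to the four covariates from the talk. Invented data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_cells = 1000
cells = pd.DataFrame({
    "pop_density": rng.lognormal(0, 1, n_cells),
    "accessibility": rng.random(n_cells),
    "lack_of_deprivation": rng.random(n_cells),
    "nature_reserve": rng.integers(0, 2, n_cells),
})
lam = np.exp(-1 + 0.6 * np.log1p(cells["pop_density"])
             + 1.0 * cells["accessibility"]
             + 0.8 * cells["lack_of_deprivation"]
             + 1.2 * cells["nature_reserve"])
cells["n_records"] = rng.poisson(lam)

fit = smf.glm(
    "n_records ~ np.log1p(pop_density) + accessibility"
    " + lack_of_deprivation + nature_reserve",
    data=cells, family=sm.families.Poisson()).fit()
print(fit.summary().tables[1])   # positive coefficients = recording bias
```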
We looked at four broad types of projects: wildlife recording and mass participation, where anyone can record wherever they choose, garden-based projects, and structured sampling. You can see here, broadly, the sorts of relationships we found. Many projects had positive relationships with human population density and accessibility, as you might expect: people tend to record where people are, or where they can get to. And in many cases people tended to record in nature reserves, which makes sense: people go to nature reserves to watch and record wildlife, so that's where so many of the records come from. One really striking thing is that, across England and France, these datasets showed a positive association with the lack of deprivation: people tend to record, or at least records are made, more in areas that are more affluent. That holds both in the UK, where areas of deprivation tend to be concentrated in cities, and in France, where a slightly different measure of human deprivation is spread between a sort of rural deprivation and city centres. These sorts of things are really important, because as statisticians you might be thinking, "Aha, these are problems we need to overcome," but if we put this in a slightly bigger perspective, we're touching on some quite important societal issues: issues about representativeness, certainly in terms of locations, but potentially also in terms of the people involved in this community science. Chris Schell recently talked about this in terms of systemic racism, and these are really important issues which we need to be thinking about.

The thing is, when dealing with the sorts of datasets Ali and I have talked about, it's quite easy to sit in front of a computer and say: "Ah yes, unstructured data, biased recording, low recorder effort; what they ought to be doing is..." And those last few words are probably the most miserable words I can sometimes hear: people sat in front of computers saying those people out there ought to be recording absences, or recording effort better, or whatever it might be. I do hear it occasionally. But of course the recorders out there, exactly as Ali was saying, are people thinking, "I'm doing this entirely for my own enjoyment"; they submit records for a wide range of different motivations, and how these two communities interact is, I think, really important. For me it's vital to take this whole-system approach: not thinking just in terms of reporting biases, but in terms of an unevenness that comes with the data. The data just are; it's not that the data are biased, it's what we might do with them that might lead to biases. And we need to think about observers and reporters within this, potentially to try to reverse-engineer the observation process to make accurate ecological inferences.

One of the ways in which we do this: Nick Isaac leads a group of people at the Centre for Ecology & Hydrology using occupancy models. I'll skip through this quite quickly, but broadly what they are trying to do is estimate the occupancy of sites while taking detection into account, in ways similar to those Ali was talking about, using things like list length. For many of the taxa we deal with, it's not quite as neat as birds, where there's a really distinct community of taxa being recorded. These occupancy models are really valuable: you can see here the numbers of species that occur in Britain, and the darker bars are the proportions of the species within each taxon for which we can produce occupancy models, which is incredible, really. So thinking about the observers and reporters within this whole process, and having a much greater interaction between us, as analysts and scientists dealing with the data, and where the data have come from, helps inform those models, as Ali demonstrated; it also helps us think about informing design.

Once again, I'm covering ground that Ali has covered, but from a slightly different direction, and with a quote from a great set of scientists there at the bottom: people are different. They are different in the way they record, in their expertise, and various things like this. In some work led by Tom August, we looked at the data submitted to the iRecord Butterflies app, within which all the records are centralised, using three or four years' worth of data from over 5,000 recorders. We classified the thousand recorders who were active on more than 10 days across the project, aiming to describe their patterns of recording. We looked at this temporally, with a range of different measures; spatially, in terms of the area in which they were recording and how concentrated, clustered, or fragmented that spatial range was; and in terms of the information content: the numbers and types of species recorded by those people. When we took all the data from these thousand people and classified them with a principal components analysis, we ended up with four distinct and relatively independent axes of variation. People differed in their recording intensity, meaning how frequently and regularly they record; their spatial extent, so whether they are patch-workers who stay in their little patch, or people who dash, or at least travel, around the country looking for different butterflies; their recording potential, which is a function of how many records they provide, particularly including rarities; and, fourth, rarity recording, which is the twitching of rare species. The recording potential axis is interesting: I think it captures the balance between incidental recorders and those who are more complete naturalists, if you like, who go out making notes of all the things they see, and travel a fair bit as well.
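A minimal sketch of that recorder-classification step: standardise per-recorder behavioural metrics and run a PCA, reading the axes off the loadings. The metric names below are illustrative stand-ins for the temporal, spatial, and information-content measures described above, and the data are invented.

```python
# Sketch of classifying recorders by their recording patterns with PCA,
# along the lines of the iRecord Butterflies analysis described above.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
recorders = pd.DataFrame({
    "active_days_per_year": rng.poisson(25, 1000),
    "periodicity": rng.random(1000),           # regularity of activity
    "spatial_extent_km2": rng.lognormal(3, 1, 1000),
    "records_per_day": rng.lognormal(1, 0.5, 1000),
    "rarity_share": rng.beta(1, 10, 1000),     # fraction of rare species
})

X = StandardScaler().fit_transform(recorders)  # put metrics on a common scale
pca = PCA(n_components=4)
scores = pca.fit_transform(X)                  # each recorder's axis scores
print(pca.explained_variance_ratio_.round(2))

loadings = pd.DataFrame(pca.components_, columns=recorders.columns,
                        index=["axis1", "axis2", "axis3", "axis4"])
print(loadings.round(2))  # interpret axes: intensity, extent, potential, rarity
```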
So that's all very interesting, but so what? Some work I've been doing just recently, fairly hot off the press, has been thinking about what impact this actually has. We took a dataset of butterfly records over 20 years and ran occupancy models on it. We went through the slightly challenging process of cleaning up the dataset, in terms of the names, and classifying all those different recorders, and then we created smaller subsets of the data which were either random or biased; the biased subsets were created as comprising recorders with high or low values on each of the four axes I talked about a moment ago. We ran the models for several species of butterfly, and we were interested in three outputs: the slope, so the biodiversity trend; the variance of the slope, or more accurately how precisely we can estimate that trend; and the average difference in occupancy. On the next few graphs you'll see the four recorder behaviours, with the subsets selecting high values in gold and low values in blue, and the z-score on the y-axis; where the z-score falls outside the range of plus or minus 2, you can say that the effect matters.

When we modelled these subsets of data and looked at the slope, what we found was that, relatively speaking, the biased datasets didn't actually have a substantial impact: the slope, the biodiversity trend, is quite robust to variations in recorder behaviour. And that's really encouraging, because it suggests that we don't end up with bias when we compare trends from datasets that have different recorder behaviours. When we looked at the variance of the slope, and also at occupancy, we found that there were some sensitivities to recorder behaviour, in particular to recording potential: this measure of how incidental versus how much of a complete naturalist people were in their recording behaviour. That actually had quite a big impact on the variance of the slope and on the estimates of occupancy in particular.
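A sketch of the z-score comparison as just described: each biased subset's estimate is standardised against the distribution of estimates from random subsets of the same size, with |z| greater than 2 flagging a meaningful effect of recorder behaviour. The numbers are invented.

```python
# Robustness check: does a biased subset of recorders move the output more
# than random subsampling alone would?
import numpy as np

def bias_z_score(biased_estimate: float,
                 random_estimates: np.ndarray) -> float:
    """|z| > 2 flags a biased subset whose output differs meaningfully
    from the spread produced by random subsets of the same size."""
    return ((biased_estimate - random_estimates.mean())
            / random_estimates.std(ddof=1))

# e.g. slopes from 100 random subsets vs one "high recording potential" subset
rng = np.random.default_rng(9)
random_slopes = rng.normal(-0.02, 0.005, 100)   # invented trend estimates
print(round(bias_z_score(-0.021, random_slopes), 2))  # well within +/-2
```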
So from this fairly hot-off-the-press piece of work, the remaining questions are: has recorder behaviour changed over time, and how do any changes in recorder behaviour over time affect the outputs? We found that the slope is quite robust, but if occupancy changes systematically over time, that could be problematic. And I think it's really interesting to think about recording and the way people record. Not only have smartphone apps and all the rest of it made it much easier for people to record, but the way in which people record, and the types of data they record, are changing. I've got three examples there: Pl@ntNet with image recognition, bioacoustics, and eDNA. All of these sorts of things actually lead to things like probabilistic data, and I think the use of probabilistic data, rather than the sort of certainty of "I definitely saw whatever it was", is going to be important; and combining different types of data, I think, is also going to be important.

A brief bit of work led by Francesca Mancini has been looking at some of these integrated models, to think about how we can best combine structured data with unstructured data. There's a high overhead to collecting structured data, and not so many people want to do it; with unstructured data we've got loads and loads more records, but we can end up with these issues of uneven coverage in space, and possibly time, which can be challenging statistically to cope with. I've labelled this preliminary work, because I'm aware she's still doing a fair bit of work on this and may well have updated some of the findings, but what she found was that combining structured and unstructured data together increased the precision of the trends she was able to get out of those models. So that's all really good, and one of the interesting questions is what new statistical approaches like this integrated modelling can add, in terms of value, to the datasets we have. In essence, is it worth the statistical effort? Are the results good enough already? We might always say, "Ah, but we can always get better," but when we're thinking about influencing things like policy and management, maybe we don't need any better; maybe we just need to influence what actually happens on the ground a bit better. The second question is how this knowledge can inform the design of projects and the engagement with recorders. We might say that a small amount of structured data is really helpful: we don't need to collect a whole dataset of structured data, but a small amount might be really valuable.

Thinking, then, about some of these issues with data, one of the important things is to think about how much is enough, and this can be quite a challenging question. I was thinking about this with colleagues a year or so ago: it would be useful to have rules of thumb to assess how much data, in particular unstructured data, collected by people as and when they want, is good enough to get good models. These are two examples of species, both bees in this case, with 95% credible intervals from the occupancy model. The model on the left doesn't look like an especially good model; the one on the right looks much more believable, much more credible in terms of some decisions we might want to make off the back of it. So what we did was show a few experts these sorts of graphs and ask which ones they thought were good enough to deal with, and we were able to come up with a threshold in terms of a measure of precision: the outputs with higher levels of precision, people said, are good for us to use, while the ones below it we're not so sure about. I should say that in all cases this doesn't necessarily mean the models, the outputs, are true, and it doesn't necessarily mean they're useful or unbiased, but it does mean they're probably good enough to consider further.
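A minimal sketch of a rule-of-thumb precision screen of this kind, assuming the precision measure is the width of the 95% credible interval of the trend; the threshold value here is invented, standing in for the expert-elicited one.

```python
# Rule-of-thumb screen: keep only species whose occupancy trend is estimated
# precisely enough to consider further. Threshold value is invented; in the
# work described it was elicited from experts looking at example outputs.
import numpy as np

def precise_enough(posterior_samples: np.ndarray,
                   max_ci_width: float = 0.2) -> bool:
    """True if the 95% credible interval of the trend is narrow enough."""
    lo, hi = np.percentile(posterior_samples, [2.5, 97.5])
    return (hi - lo) <= max_ci_width

rng = np.random.default_rng(2)
print(precise_enough(rng.normal(-0.01, 0.02, 4000)))   # width ~0.08 -> True
print(precise_enough(rng.normal(-0.01, 0.10, 4000)))   # width ~0.39 -> False
```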
Within UKCEH we've got trends for over 10,000 species, mainly, indeed entirely, from other people's work. What we did was look at the structure of the data, to say which of these species sat above that precision threshold and which below. There's a huge amount of complex stuff there, which you can boil down to this: within the context of these multi-species datasets, we only needed about 30 records per species per year to get trends that looked reasonable, which does seem remarkably low. Of course, the more data we have, the greater the precision, which is a good thing, but it gives some sort of indication. So when colleagues of mine went to places like Wales and were asked, "How much data is enough, and have we got enough to run some of these models?", you can start to say, and this is shown for Great Britain here, how many species within these different taxonomic groups have enough data to lead to models with reasonable levels of precision, models we might want to do something with. And you can see that for a wide range of these groups, we're able to get trends with precision above that threshold for a reasonable number of species.

One thing to note on that slide: within all of these datasets, only sites that were visited in more than one year were included in the analysis. That led us to think: if that's the case, there's a whole load of sites out there which have data in only one year and so aren't in our datasets, and if we were able to target revisits to those squares, the records of people going there would effectively count twice, because both the historic records and the current records would then enter our datasets, which presumably is valuable. So we did some work developing a tool, piloted last year and running for a few different species groups this year, and the idea is basically: can you change these pink squares to green squares? Pink squares have been visited in only one year in the past; green squares have records in more than one year. You can see there's a square not far from where I work which has records from a range of different grasshopper taxa in one year only; if I went there this summer and visited, that site would become added to our dataset. Of course we need to think about whether that creates biases, and all sorts of things like this, or challenges with the analysis, but it seems to be really valuable, and it seems to be well received by recorders: at least, the feedback is that it boosts their visits, the time they spend out in the field recording, and the benefits and joy they get from doing that. And then, of course, you look at a map like this from North Yorkshire, and there are not many records there at all, so suddenly it opens up this big question: where and when should I be going out doing my recording for best effect? Many people are happy to go out and say, "Yes, I'm willing to go to places which really need records; tell me where to go."
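The "pink square" logic can be sketched in a few lines: find the grid squares whose records all come from a single year, since one revisit makes both the historic and the new records usable. Column names are illustrative.

```python
# Find grid squares whose records all come from a single year ("pink
# squares"): one new visit turns both the historic and the new records
# into usable multi-year data.
import pandas as pd

def one_year_squares(records: pd.DataFrame) -> pd.Index:
    """records needs 'grid_square' and 'year' columns (names illustrative)."""
    years_per_square = records.groupby("grid_square")["year"].nunique()
    return years_per_square[years_per_square == 1].index

records = pd.DataFrame({
    "grid_square": ["SU48", "SU48", "SU49", "SP50", "SP50"],
    "year":        [2012,   2012,   2015,   2011,   2019],
})
print(list(one_year_squares(records)))   # ['SU48', 'SU49']
```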
And this leads on to the final example I'd like to draw on: a new project that I've got, called DECIDE, funded by the Natural Environment Research Council. What we understand is that spatial biodiversity information is really valuable. A lot of what I've talked about so far has been trends, but spatial information is really important: it supports strategic planning, things like natural capital accounting, and assessment of biodiversity net gain. Clearly, when we're thinking about fine-scale information, we need to use models; we can't ask people to go out and record butterflies in every single hectare in England, let alone across the rest of the world. And we know that recording by volunteers is spatially uneven, because people live in different parts of the country and go to the places they want, so it's unlikely to be optimal for spatial modelling. But could the recording, and the patterns of recording, actually be influenced? There are a couple of studies suggesting that yes, it could: a really great, quite statistically complicated paper on avicaching indicates that, with some information about where records would be valuable, 19% of effort was shifted to undersampled locations in one particular pilot project, and Corey Callaghan has recently been doing some other work on optimising sampling by citizen scientists.

Within this project, which has been going for a few months now, we've intentionally taken a very interdisciplinary approach, especially to think about this link between data, recorders, and people. We're drawing on ecology, social science, visualisation, co-design, and computer science, a whole range of things, to make this big package which hopefully will go on to influence people's recording. Our aims within the project are to create high-resolution, near-real-time species distribution models, so that we can map the areas where we have deficits of data, and then to develop adaptive sampling: the whole idea of delivering recommendations so that we can reduce uncertainty in these models, informed by simulation modelling and various other things, in order that we can do it in a truly informed way.
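A deliberately simple sketch of an adaptive-sampling recommendation in this spirit: rank grid cells by predicted-occupancy uncertainty, masked by accessibility. This is an illustration of the idea, not the DECIDE project's actual scoring rule; all names and numbers are invented.

```python
# Adaptive-sampling sketch: prioritise cells where a visit would most reduce
# model uncertainty, proxied here by the posterior SD of predicted occupancy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
cells = pd.DataFrame({
    "cell_id": range(500),
    "occupancy_sd": rng.random(500),        # model uncertainty per cell
    "accessible": rng.integers(0, 2, 500),  # recorders must be able to get there
})
cells["priority"] = cells["occupancy_sd"] * cells["accessible"]
print(cells.nlargest(5, "priority")[["cell_id", "priority"]])
```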
From all of this, working through this process of co-design, we're creating a tool, a prototype of which is shown on the right there, for providing these nudges and feedback to people. People are open to going to places where records are most needed, but they need to know where those places are first of all. So this is something I think is going to be really exciting, as we think about the interaction between the information that we, or other people, need, the tools and the statistics to get there, and the ways we interact with recorders. On the right of this figure are six keywords which came out of a set of workshops I ran earlier this year, when we were thinking about the future of citizen science. There are many aspects to this, in terms of delivering precision citizen science, or making it personalised, or localised, or even actionable, but all of them, I think, have bearings on the way we use statistics within our citizen and community science. And I think one of the big questions for you as a community is: what is the role of statistical thinking in this, and what is the vision for statistical thinking in this? What Ali was talking about is really important, because she was asking not only what statistics we can do with existing data, but also how we can bring that statistical thinking to bear on the way that we design, influence, encourage, and motivate recording by others in the future. Thank you very much.

Thank you very much, Michael. I can see a lot of applauding; obviously a very interesting talk. We have time for a quick question, if anyone wants to ask one. I don't see anything in the chat. Is there anyone who...

Hello. Yes, just one question that's been bugging me for a while. There's a lot of emphasis now on occupancy models in trend analysis, but for a lot of the species I work with, in mammals and things, I think the link between occupancy and abundance is quite weak, whereas the link between detectability and abundance may actually be stronger. I just worry that we're chucking away some important information on abundance when we use occupancy modelling.

Yes, and I think another example is plants, where that link can be quite challenging. Certainly, in my opinion, there is a less strong link between abundance and detectability in plants, for example, which is slightly different from what you're saying but raises a similar sort of challenge. So yes, I think there are a lot of opportunities, and I certainly don't think occupancy modelling is the sole way of analysing or addressing these sorts of questions, but it's certainly useful for many taxa and many aspects of the sorts of data we currently have. And maybe some of these insights, yours and others', can actually help to inform, encourage, and motivate those volunteers who are going out, many of whom are doing it for their own interest, enjoyment, well-being, and joy, but who are also happy to do it for a bigger purpose. Sometimes there are small asks which can be made, and small nudges which can be given, to make recording in one particular way, or at least submitting records in one particularly beneficial way, much easier for the recorder, and I think some of these subtle nudges could also be very valuable.

I totally agree with that. In our datasets you see some frustrating instances where there are lots of sites that were surveyed in the past and haven't been revisited.

Okay, that's a really good question, and obviously Michael answered it really well. As I said before, please feel free to write questions in the chat for either Ali or Michael, and now I suggest we move on to the third speaker for this afternoon, Alex Diana, a postdoctoral researcher at the University of Kent. Whenever you're ready, Alex.

Yes, I think you can all see the screen. Thanks for having me. As Eleni said, I work at Kent with Eleni, and I'm going to talk about a model that we developed with Eleni and Byron to perform Bayesian inference on very big occupancy datasets. (If you attended the NCMC, this is going to be a very similar talk.) One problem we're going to deal with that comes with citizen science data is that the data can be really large, so if you just apply standard methods, the methods can be really, really slow. We've tried to develop a slightly faster method that can fit the model in maybe less than a day, to get a quick answer. Michael and Alison have surely done a better job than I could of introducing citizen science data, because this is the only slide I have on it, and then I'm just going to talk about modelling.

I guess you all know why it is worth building models for citizen science data: these data are becoming more and more available, and sometimes they're the only way to get an answer. If you want to estimate spatial variation, for example, standardised protocols sometimes don't give you the scale that you have with citizen science data. What we're trying to do is build a Bayesian model which can estimate spatial and temporal variation, but in a reasonable amount of time.

Just briefly, the existing approaches. One of the early approaches is the dynamic occupancy model of Royle and Dorazio. That model is mainly designed to follow sites over time, with parameters like colonisation and extinction, so it's really designed for studies where sites are visited each year, whereas with citizen science data, as you all know, some sites are visited just once and then never again. Another approach is the classical, maximum-likelihood approach of Emily Dennis and colleagues, which allows the model to be fitted quite quickly, because it's a classical approach, but works for each year separately. Another very good Bayesian approach is that of Outhwaite, Richard Chandler, and Nick Isaac, who built a Bayesian model and, to account for temporal correlation, used a random-walk prior; they also implemented this model in the package sparta, which to date is probably the standard way to fit occupancy models, the one Michael was talking about. And a more recent approach is that of Rushing, using a spatial spline with time-dependent coefficients, in a kind of GAM framework, which allows the spatial variation to be modelled as well as the temporal variation.

The starting point of the occupancy model is that a site is either occupied or not occupied in a given year, and then you have an observation process and a presence process. In the observation process, you have the detection probability, modelled as a combination of a year-specific intercept and some potential covariates, conditional on the site being occupied. Then there is the presence process, which is the one we're really interested in, where the occupancy probability of a site in a year depends on temporal effects, spatial random effects, and some covariates. Our idea, to gain a little computational advantage over the existing methods, was to view these two equations, essentially the two main equations in an occupancy model, linked by one further equation, in a logistic regression framework. Once we're in the regression framework, we can use some very efficient methods for Bayesian inference which are considerably faster; the Pólya-Gamma scheme, for example, is the one we use. The other addition in our model is that we use Gaussian processes, Gaussian fields if you like, for the temporal effect and for the spatial random effect in the occupancy probability.
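Putting the verbal description into symbols, the two processes and their logistic-regression form might be written as follows; the notation is assumed, reconstructed from the talk rather than copied from the paper.

```latex
% Occupancy model sketch, reconstructed from the verbal description.
% z_{s,t}: occupancy state of site s in year t; y_{s,t,j}: detection on visit j.
\begin{align*}
  z_{s,t} &\sim \mathrm{Bernoulli}(\psi_{s,t})
      && \text{presence process} \\
  y_{s,t,j} \mid z_{s,t} &\sim \mathrm{Bernoulli}(z_{s,t}\, p_{s,t,j})
      && \text{observation process} \\
  \mathrm{logit}(p_{s,t,j}) &= \alpha_t + \mathbf{w}_{s,t,j}^{\top}\boldsymbol{\gamma}
      && \text{year intercept + covariates} \\
  \mathrm{logit}(\psi_{s,t}) &= a_t + b_s + \varepsilon_s
      + \mathbf{x}_{s,t}^{\top}\boldsymbol{\beta}
      && \text{temporal GP, spatial GP, site effect}
\end{align*}
% Conditional on z, both lines are Bernoulli regressions with logit links,
% which is what makes Polya-Gamma data augmentation applicable.
```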
Now, very briefly about Gaussian processes, without being too mathematical. A Gaussian process is a really useful tool because it defines a non-parametric prior over a function, by defining a prior on the values that the function takes at any collection of points. The idea is really simple: the values at a collection of points follow a multivariate normal distribution whose covariance depends on the distance between the points, so closer points will be more correlated, and the correlation of more distant points decays with the squared distance. There are two tuning parameters for the Gaussian process: one is an overall variance parameter of the process, and the other is a length-scale parameter which determines how quickly the correlation between nearby points decays.

This might look a bit like overkill in our model, since we don't actually have an infinite number of points, just a finite number. So what we actually do is simply assume a multivariate normal distribution, with this covariance matrix, on the collection of points at which we want to evaluate the Gaussian process; in our case these are the temporal random effects and the spatial random effects (see the code sketch below). We still think it is useful to frame this as a Gaussian process, because Gaussian processes are a very well studied topic, so if we want to use approximations that have already been developed, there is already a lot to draw on.

As I said, in the model we keep the same occupancy probability that is used in the other models, for example in sparta, and we put a Gaussian process on the temporal effect, so the correlation between adjacent years is tuned by a length-scale parameter l_T, with variance sigma_T. Similarly, we assume another Gaussian process over the spatial random effects, one random effect per site, whose variance and correlation are tuned by l_S and sigma_S. By assuming additional priors on these parameters, we can infer how much occupancy varies over time and what the extent of the spatial autocorrelation is. We also assume independent site-specific random effects, through the random variable epsilon. This feels a bit inefficient, because we have essentially as many independent random effects as there are sites, so each of them has to be estimated separately. What I tried at the beginning was to use clustering to estimate these site-specific random effects, but I found that the model was not behaving well: it was not mixing, and rather than settling on fixed clusters it just kept jumping between clusters. So I concluded that with this model it is probably not possible (I don't know if it is a feature of the data or of the model) to cluster sites using just the leftover spatial variation.

To compare briefly the random walk approach of Outhwaite et al. with ours: the random walk approach essentially models each year's random effect conditionally on the previous year's, in random walk style. If we visualize the covariance matrices in the two cases, the difference is that our model assumes the same correlation and covariance between any two years at the same distance, while their model has a variance that increases across time.
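As a concrete illustration of these Gaussian process priors (not the package's actual code), here is a minimal NumPy sketch: a squared-exponential covariance, one plausible kernel consistent with the description above, with the two tuning parameters, used to draw a temporal and a spatial random effect. All names, coordinates and parameter values are hypothetical.

```python
import numpy as np

def sq_exp_cov(points, sigma, length_scale, jitter=1e-8):
    """Squared-exponential covariance matrix on a finite set of points.

    sigma        -- overall standard deviation of the process
    length_scale -- controls how quickly correlation decays with distance
    """
    pts = np.atleast_2d(np.asarray(points, dtype=float))
    if pts.shape[0] == 1:            # allow a plain 1-D vector (e.g. years)
        pts = pts.T
    # pairwise squared Euclidean distances between all points
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * np.exp(-d2 / (2.0 * length_scale**2))
    return K + jitter * np.eye(len(pts))   # jitter keeps K well conditioned

rng = np.random.default_rng(0)

# temporal random effect b_t: one value per year, correlated across years
years = np.arange(2000, 2015)
K_T = sq_exp_cov(years, sigma=1.0, length_scale=3.0)       # sigma_T, l_T
b_t = rng.multivariate_normal(np.zeros(len(years)), K_T)

# spatial random effect u_s: one value per site, correlated across space
sites = rng.uniform(0, 100, size=(200, 2))                 # site coords (km)
K_S = sq_exp_cov(sites, sigma=1.0, length_scale=20.0)      # sigma_S, l_S
u_s = rng.multivariate_normal(np.zeros(len(sites)), K_S)
```

The jitter term is the standard numerical fix for the conditioning issue mentioned later in the talk; sampling u_s this way costs on the order of S cubed operations, which is exactly why the grid approximation described below is needed.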
In practice, since we are dealing with citizen science data, the size of the data is huge, so this difference doesn't really matter: the likelihood almost always overcomes the prior. It is more of a mathematical detail, but something to keep in mind if you are planning to use the model on very small data sets, where the prior does have an effect.

So this is what our final model looks like. Here we have the observation process, conditional on the detection probability, and the presence process; then we have the occupancy probability, where the temporal random effect and the spatial random effect are given the two Gaussian process distributions, and then we have some covariates and the additional site-specific random effects.

What are the main assumptions we have made? The first is that each site is either occupied or not occupied for the whole of a given year. This assumption might seem really strong, because if you look at species like butterflies, which is actually what we are going to analyze, then clearly a site is not simply occupied or unoccupied all year: for some period of the year butterflies do not fly and so it is impossible to detect them. But this can actually be handled by assuming time-varying detection probabilities, as we will see later. Another assumption is that the occupancy state in each year is conditionally independent of the states in the other years: the fact that a site is occupied in a given year, conditionally on all the model parameters, gives you no information on its state in the next year. This is a difference from the dynamic occupancy model, which has persistence parameters for each site, but there is nothing we can do about this assumption, because we are in a regression framework. And obviously we assume that all the observations at a site in a given year are independent.

As described so far, the model is actually quite inefficient, because we have as many random effects as there are sites, and we model them with a multivariate normal. Since the number of sites we deal with is in the hundreds of thousands, this is clearly not practical: sampling from the posterior distribution requires that many operations, which is impossible to do at each MCMC iteration. So we are forced to approximate the original Gaussian process on the S site locations with another Gaussian process on a smaller number of locations, say M, where M is much smaller than S. If S is in the hundreds of thousands, and we choose M by, say, covering the UK with squares of side 20 kilometres, then we end up with around a thousand locations for M, which is still large but much more practical. What we do then is approximate each of the original spatial effects with the spatial effect at the closest location of the approximation. In the Gaussian process literature this is known as the subset-of-data approximation, and since we are in the Gaussian process framework we could also look at other approximations, for example the subset-of-regressors approximation, which is slightly smoother; this gives you an idea of where to look if you want to investigate other approximations. Another advantage of using a uniformly spaced grid is that the condition number of the covariance matrix is much lower.
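Here is a minimal sketch of that subset-of-data idea, assuming it amounts to snapping each site to the centre of the grid cell it falls in, so the Gaussian process (and its M by M covariance matrix) lives on the occupied cell centres only. The 20 km cell size comes from the talk; the coordinates, extent and helper name are my own illustration.

```python
import numpy as np

def snap_to_grid(site_coords, cell_size=20.0):
    """Map each site to the centre of its grid cell.

    site_coords -- (S, 2) array of easting/northing coordinates in km
    cell_size   -- grid resolution in km (20 km in the talk)

    Returns the (M, 2) unique cell centres and, for each site, the index
    of its centre, so that u_site[i] is approximated by u_grid[idx[i]].
    """
    cells = np.floor(site_coords / cell_size).astype(int)
    centres_all = (cells + 0.5) * cell_size
    centres, idx = np.unique(centres_all, axis=0, return_inverse=True)
    return centres, idx

rng = np.random.default_rng(1)
sites = rng.uniform(0, 700, size=(100_000, 2))  # hypothetical UK-scale coords
centres, idx = snap_to_grid(sites, cell_size=20.0)
print(len(sites), "sites reduced to", len(centres), "grid locations")
```

With S in the hundreds of thousands, this reduces the expensive covariance matrix from S by S to roughly a thousand by a thousand, matching the numbers quoted in the talk.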
Ill-conditioning is a known problem with Gaussian processes: if you have two sites that are really close in location, the determinant of the covariance matrix goes to zero and the inference becomes very unstable. So this is another reason to approximate the original site locations with sites on a uniform grid.

The last thing I am going to talk about in the model is the Polya-Gamma scheme that we use to perform the inference. As I said, we can view both the presence process and the observation process as two logistic regressions, conditional on the site-specific random effects. The advantage is that, since we see this as a logistic regression, we can sample all the model parameters together: the baseline occupancy probability, the temporal effects and the spatial effects can be sampled jointly from the model, and so the inference becomes really efficient. How does the Polya-Gamma scheme work? Without going too much into detail, the idea is that you augment the space with some new variables, which are Polya-Gamma distributed random variables. The Polya-Gamma is quite an exotic distribution, but Polson, Scott and Windle provided a fast method to simulate from it, which was really the main contribution of their paper, so this step is no longer a bottleneck. Then, conditional on these Polya-Gamma random variables, the posterior on all the coefficients is multivariate normal, so we are back in the linear regression framework even though we are dealing with a logistic regression (there is a short code sketch of this step at the end of this passage). And keep in mind that this normal posterior on the coefficients could also be used later for doing other things with the model.

We implemented this method in a package; we just called it FastOccupancy, because we didn't have much creativity, and we put it on GitHub. We then compared this model against sparta, which is a very good model; the only problem is that when the number of sites is really large it can take weeks, and this was part of the motivation for trying to come up with a faster solution. The first check I did was on simulated data with 15 years and around two visits per site per year, varying the number of sites; the runtimes are in minutes, and as you can see, as the number of sites grows, the difference between our model and sparta becomes larger and larger. I ought to run this on more sites: 5,000 is not what you would have in practice, you would probably have much more, but I limited it to 5,000 just to have results in a reasonable amount of time.

We also wanted to check whether it really makes sense to add a spatial random effect to the model: is it really possible, by adding this spatial random effect, to estimate a spatial pattern in the data? So we simulated data with a spatial pattern, which is shown in the centre column, where the white areas have higher occupancy probability and the black areas lower occupancy probability. On one side we have our estimates, and you can see we are pretty much able to recover the occupancy probabilities of the true model. On the other side we have a model with just independent site-specific random effects, and as you can see, because there is no correlation, that model essentially has to estimate each site independently and is not able to recover the pattern; there is some hint of it, but you would not be able to draw any conclusions.
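To make the Polya-Gamma step flagged above concrete, here is a minimal Gibbs sampler for a plain Bayesian logistic regression under Polya-Gamma augmentation (Polson, Scott and Windle, 2013), the building block the talk describes. It uses the third-party pypolyagamma package for the draws, which is an assumption about tooling on my part and not necessarily what FastOccupancy uses, and a simple N(0, prior_var * I) prior; all names are hypothetical.

```python
import numpy as np
from pypolyagamma import PyPolyaGamma  # assumed third-party PG sampler

def pg_gibbs_logistic(X, y, n_iter=1000, prior_var=10.0, seed=0):
    """Gibbs sampler for Bayesian logistic regression: conditional on the
    Polya-Gamma draws, the coefficient posterior is exactly Gaussian."""
    rng = np.random.default_rng(seed)
    pg = PyPolyaGamma(seed)
    n, d = X.shape
    B_inv = np.eye(d) / prior_var   # precision of the N(0, prior_var I) prior
    kappa = y - 0.5                 # "centred" response from the PG identity
    beta = np.zeros(d)
    omega = np.empty(n)
    draws = np.empty((n_iter, d))
    for it in range(n_iter):
        psi = X @ beta
        for i in range(n):                      # 1. omega_i ~ PG(1, x_i' beta)
            omega[i] = pg.pgdraw(1.0, psi[i])
        # 2. beta | omega ~ N(m, V): back in the linear regression framework
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B_inv)
        m = V @ (X.T @ kappa)
        beta = rng.multivariate_normal(m, V)
        draws[it] = beta
    return draws

# hypothetical usage on simulated detection/non-detection data
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([-1.0, 2.0]))))
samples = pg_gibbs_logistic(X, y, n_iter=500)
print(samples[250:].mean(axis=0))   # posterior mean should sit near (-1, 2)
```

In the occupancy model the same update would be applied to both logistic regressions, with the Gaussian process covariances folded into the prior precision in place of B_inv.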
To be clear, that simulation was in a sparse setting: we simulated a thousand sites and did not visit every site in every year, which is the kind of pattern you would expect when dealing with citizen science data.

Then we also have some results on real data. As I said before, we are going to deal with butterflies, and specifically we will look at the ringlet, which is a species that is quite abundant, and the Duke of Burgundy, which is a species that is threatened and actually declining. The data were collected as part of a UK butterfly recording scheme which has run for more than 40 years and holds more than 11 million records. For the ringlet we have more than 2 million records from 140,000 sites, and the species was recorded more than 200,000 times; for the Duke of Burgundy we have 1.5 million records, but it was recorded only 6,000 times. So these are two pretty big data sets, and on our machine the model took around a day for the ringlet and around half a day for the Duke of Burgundy. Again, this is not the time it takes for a cup of coffee, but if you consider that we are fitting Bayesian models, estimating spatial autocorrelation as well, on very large data sets, then it is surely an improvement. The most expensive part of the model is estimating the spatial autocorrelation, because the number M of spatial coefficients can be really large: the temporal effects are in the order of 40, while the spatial ones can be in the order of thousands, so without the spatial autocorrelation the model would be much, much faster.

So let's look at some results. First of all, we want an idea of the average occupancy rate of each species and how it evolves over time. To estimate that, we looked at the occupancy index, defined simply as the average estimated occupancy probability in each year across all sites. We see that the ringlet is a species that is actually increasing, which is something that was already known, while the Duke of Burgundy is a species that is declining, even though it seems to have stabilized over recent years. Then, as I said, we are also able to estimate the spatial pattern of the occupancy probability: the darker areas have a higher estimated occupancy probability and the areas towards white a lower one. As you would expect, the ringlet is actually expanding across the UK, which again was already known. You can see the maps are quite patchy, because I used squares of side 20 kilometres; I ran some extra simulations with different cell sizes and the results were pretty similar, so it was good to check that the model is not too sensitive to this choice. Obviously, if you end up choosing a very large cell size then it is going to be sensitive, because if you think of the limit of just one square, you are back to a model without spatial autocorrelation. We also have some estimates of the spatial trend of the Duke of Burgundy, even though we don't really have much data to estimate the spatial correlation, since the species is essentially recorded in just two areas; across the years we can see the species going down, which is not really surprising. Finally, as I said, to account for the fact that flight times differ across the year, we use as covariates the first three powers of the Julian date.
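In symbols (my notation, matching the sketch earlier rather than the paper's): the yearly occupancy index averages the estimated occupancy probabilities across sites, and the seasonal detection model is a cubic in the Julian date d_v of each visit:

```latex
\begin{align*}
  I_t &= \frac{1}{S} \sum_{s=1}^{S} \hat{\psi}_{s,t}
      && \text{(occupancy index for year } t\text{)} \\
  \operatorname{logit}(p_{s,t,v}) &= a_t + \gamma_1 d_v + \gamma_2 d_v^{2} + \gamma_3 d_v^{3}
      && \text{(detection with the Julian-date cubic)}
\end{align*}
```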
These are the estimated patterns of the detection probability for the ringlet and for the Duke of Burgundy. My idea was to try to link the detection probability back to the flight period, because these estimates are actually pretty similar to what is reported by Butterfly Conservation, which is that the flight period of the ringlet tends to be around July and August, and for the Duke of Burgundy around June and July. Starting from the very well-known formula where the detection probability is a function of the number of individuals available at time t and the probability of detecting a single animal, if you assume that the probability of detecting a single animal is constant across time and rearrange, you can read these plots as an estimate, up to a constant of proportionality, of the number of individuals flying (I give the rearrangement below, after this passage). I wasn't able to find a better way to link the detection probability to the flight period or to the number of individuals flying, so if anyone has a better idea, it is welcome.

Also, to check that we are not forgetting anything in the model, we performed some goodness-of-fit checks. As test statistics we used, first, the number of recorded detections across all sites in each year, and second, the number of detections in each 20 kilometre by 20 kilometre region in every year. We simulated these statistics from the model and compared them with the observed statistics. For the yearly goodness of fit it looks like we do a pretty good job, since we are able to capture the true value, which is the square in the plot. Although I should say that at the beginning, when we were assuming that the detection probability did not vary within the year, so we just had one intercept, the model was actually biased; this was a good indication that if you don't allow for time-varying detection probabilities, you essentially get bias in your estimates. Then we have the spatial goodness of fit, for the ringlet and for the Duke of Burgundy. For the ringlet there are many colours, which distinguish whether the observed value falls outside the 95% credible interval or the 99% credible interval, and as we can see there is still some spatial autocorrelation that we are missing somehow. But we still didn't think it was worth also assuming spatially varying detection probabilities, because otherwise that spatial variation would be almost unidentifiable from the spatial variation in the occupancy probability. Instead, we think it is probably worth adding covariates such as temperature or rainfall, which might potentially help address this issue.

So, to wrap up this part: goodness-of-fit checks help considerably in understanding where the biases in the data are, and also in understanding the information that is in the data. For example, without performing these checks we wouldn't have known that there is this spatial bias, which is something that Ali and Michael already talked about; actually, after hearing Ali's talk I would not have been surprised, but beforehand I was. And we can see that there is variation across space and across time in the detection probability, and one thing that could potentially help is adding information on the individual observer: we already use the list length, but that is not sufficient.
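For the record, here is a hedged reconstruction of that rearrangement, assuming each of the N_t individuals flying at time t is detected independently with a constant per-individual probability r; this is my reading of the argument, not a quoted formula:

```latex
\begin{align*}
  p_t = 1 - (1 - r)^{N_t}
  \quad\Longrightarrow\quad
  N_t = \frac{\log(1 - p_t)}{\log(1 - r)} \;\propto\; -\log(1 - p_t),
\end{align*}
```

so the estimated seasonal detection curve is, up to a constant, a monotone transform of the number of individuals on the wing; and for small r it is approximately proportional to N_t itself, since then p_t is roughly r times N_t.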
Coming back to the observer point: information on the observer, such as, I don't know, their age, might potentially help address this bias.

So, a quick recap. We implemented a method to fit Bayesian occupancy models on large data sets relatively fast, although it still takes quite a bit of time on these very large data sets. Some potential extensions: first, the introduction of more covariates, as I said. Then we could potentially use the Polya-Gamma scheme to also do variable selection, which I didn't touch on because we didn't have many covariates, but, related to the first point, if we had more covariates this could be useful. Then we could investigate different Gaussian process approximations. And another interesting point is that I assumed the spatial and temporal trends are additive in the random effects, whereas it might make sense to model some kind of interaction between the two; one idea, again from the Gaussian process framework, is a joint Gaussian process across space and time, but obviously this needs some thought, because it considerably increases the number of variables and so potentially makes everything slower, so it would have to be done in a clever way. And this is the link for the package, if you want to have a look. That's it; I can take any questions at this time.

That's very good, thanks Alex, an excellent talk, and very well timed with one minute to go, so let's give a round of applause to Alex, everyone, and of course to all of the speakers, for what I think was a very interesting set of talks, quite varied but also complementary. Are there any questions for Alex? It's really good to see the discussion in the chat. Are there any questions for Alex in the chat? Okay, Emily has a question for Alex, and we also have a request for you to post your GitHub link, so I guess, Alex, you can do that.

Yeah, thanks for the talk, Alex. I was just thinking, given Ali saying earlier that a lot of the eBird data is a mixture of incomplete and complete lists, whether within this framework we can see ways to analyze a mixture of these data. At the moment it's just based on opportunistic list data, but if we had a subset of complete lists, could we bring it in somehow?

Yeah, I think so. It would probably be a joint model to link the two, and I think it would help to keep more information about some sites. I'll have to think about that, but yes, I think it's a good idea.

Okay, thanks Emily for the question. Are there any other questions for Alex, or for any of the other speakers? Remi, I can see your hand; sorry, my laptop takes a while.

Hi, hello. I have a question for the three of you. You've been talking a lot about the species-level approach, like occupancy of a species and its distribution. Have you also worked on community-level approaches, for example estimating species richness from this kind of data? Any other thoughts about what we can tell beyond the species level from community data?

That's a very interesting question. Which of the speakers would like to attempt to answer this quite tricky one? Any volunteers?

I'm thinking of species richness estimation, for example using these rarefaction curves we have seen earlier today.

I didn't get that last point, Remi, can you repeat it?

Yes, for example using this kind of rarefaction curves or
species accumulation curves, to estimate species richness in a multi-species framework, where you look at species richness rather than at one species at a time.

Right, okay. Michael, I'm going to guess that you have an answer to this.

Well, I think my simple answer would be that most of the work that I've done, and that I'm aware my colleagues have done, has tended to be species focused, although obviously each is one species within the context of others, in terms of that inferred non-detection data, and people have then looked at things like species richness simply by summing or aggregating over the top of those. I think there obviously would be value in extending some of these sorts of approaches, and one of the things I've been thinking about recently is co-occurrence, where the presence of one species informs our knowledge about the presence or the absence of another. So these sorts of things could potentially be important, and it remains to be seen what added value there is from doing it in a much more integrated way versus doing it, in a sense, one species at a time and summing across species.

Okay. Would any of the other speakers like to add something to that?

I'm afraid I missed the question, I was distracted; could someone repeat the key question?

Yes. My understanding of Remi's question was whether we've thought about modelling species richness, that is, looking at the community level rather than looking at species separately. Michael was saying that you can do that per species and then aggregate across species, but there are different approaches; I guess people think of joint species distribution models, which have been built for this. Did you want to add something else?

Yeah. I think there are some good ways we can think about modelling community metrics from these data, but sometimes there might be other data sources that could actually do better for those kinds of goals. Thinking about why we want to know species richness or species diversity or something like that, and then taking a step back and asking which are the best sources of data to help us estimate it, can be a good step in those situations. But in theory, absolutely, we can think about modelling diversity metrics and community metrics from these types of data. One of the key things we need to consider is that detectability varies so much across species, particularly in community science data, so you can't just look at the number of species that were recorded; that's not going to help. Many of these methods, like multi-species models and similar methods, are good ways to account for some of those differences in detectability across species, and they help us get robust estimates of the community.

Thank you. Sorry, Ali, I was just going to ask whether you want to answer the question from Emily in the chat.

Yeah, that was exactly what I was going to do. Those maps do have estimates of uncertainty; it's just obviously hard to show those in a map format, but there's an R package where you can download the estimates that underlie them, and those also have the uncertainty built in, so I'll post a link to that in the chat.

Great, thank you. The discussion is on fire at the moment, which is really
good. So maybe I'll say that that's the last question, because I'm aware we said we would finish at 3:30 and people might have other plans at 3:30; and thank you, Ali, for that. So, okay, Victoria, and someone's put a thumbs up, so I guess you'd like me to read this question out. Victoria says she's been thinking about associated species in particular: when recorders record something, it would be great to have a tool to easily record the presence or absence of species associated with it. I see; is that a general question about a tool, in the sense of software, an app? Sorry, I probably didn't understand the question very well. I don't know if someone can answer it in the chat, or if Victoria wants to clarify.

Sorry, yes, it's more from a recorder's perspective. I just think it's really interesting how sometimes you get one species and there are things associated with it, while in other parts of the country there aren't, and it would be really nice to have an easy way of saying: I've found this species, what other things can I look for? I think recorders would quite like a tool like that; it's just something I ponder when I'm out recording. Not really a question.

Okay, thanks very much. So I think that's the end of the questions, and that's the end of the meeting; it flew by for me, and I hope everyone enjoyed it. Regarding the video, this is up to the RSS now. The session has been recorded, so keep an eye on the RSS website, or email me in a couple of weeks' time if you can't find anything and I'll let you know whether something has been posted in a usable format; hopefully we'll manage. It flew by for me, it was really interesting, all three talks were excellent, and I hope everyone else enjoyed it too. I hope we've helped people think about open questions or different directions, or given them new tools to collect their data, to think about how to collect their data, or to analyze their data; that was the aim of the meeting, to stimulate discussion and give people some ideas for the future. There's a lot of very positive feedback in the chat; thank you very much everyone, and please feel free to use the chat to post more. I know all three speakers have worked really hard on these talks, and they wrote them for this meeting, so thank you very much everyone; I really appreciate your time in what I know is a very busy period, and I think it was worth it to have this meeting, especially right now with a lot going on. Thank you very much everyone, bye from me.
Info
Channel: RoyalStatSoc
Views: 82
Rating: 5 out of 5
Id: 8TjO8r7GPY0
Length: 101min 3sec (6063 seconds)
Published: Mon Jul 26 2021