Explore changes in art over time with tidymodels

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to use this week's TidyTuesday dataset on the artwork in the Tate collection, and we're going to train a regularized regression model using text data to understand the relationship between the media the artworks are created with, like oil paint or graphite on paper or canvas, and when the artwork was created. Let's get started!

OK, let's learn something about this art collection from the Tate galleries. There are actually two datasets in this TidyTuesday week: one is the artwork, which is what I'm using here, and the other is the artists. You can join them together and learn more, or spend time looking at the artists, but I'm particularly interested in the artwork, because what caught my eye when I looked at this week's data is the medium column. It's a text column, but really short text. If we read the data in as artwork and count medium with sort = TRUE, first of all there's missing data, and then there are a lot of different ways these media are described: "Graphite on paper" is the most common, but we also have oil paintings, screenprints, lithographs, and then combinations like "Graphite and watercolour on paper". So this is text, but very short text, and I think it's an interesting dataset for exploring what we can learn from text data (not a little dataset, mind you; we have quite a number of rows). One thing that would be interesting to look at is how the art media that artists used changed over time, at least for the pieces that ended up in the Tate collection.

So let's look at the distribution over time. I already looked at this earlier in the week, so I know what it's going to look like, and boy, look at that: a pretty interesting bimodal distribution. If you look at the info on what's in this dataset, there are two different dates: one is the year the artwork was acquired (the acquisition year), and the other, which we're plotting, is the year it was created. Again there are a lot of NAs, but there's also this very interesting shape. What could explain it? I don't pretend to have a definitive answer for why the Tate collection has a huge number of pieces from one period and then a lot of much more modern art, but there's the Tate Modern in London and other Tate galleries with different focuses, so the ways the galleries chose to build the collection, and when they had money to be able to acquire art, could all have contributed.
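A minimal sketch of the exploration described so far, assuming the artwork.csv file from the 2021-01-12 TidyTuesday week (the URL is my assumption based on that week's repository layout):

```r
library(tidyverse)

# Read this week's TidyTuesday artwork data (URL assumed from the
# 2021-01-12 TidyTuesday repository)
artwork <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv")

# Many different ways of describing the media, plus missing values
artwork %>%
  count(medium, sort = TRUE)

# The bimodal distribution of the year each piece was created
ggplot(artwork, aes(year)) +
  geom_histogram(alpha = 0.8, fill = "midnightblue")
```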
If I want to understand the relationship between the media people used and the year of creation, then the year is basically my outcome, the thing I want to understand and predict, and that distribution is pretty wacky. I asked on Twitter earlier this week what people think is the best thing to do with a distribution like this that they want to predict. Would they put a line in the middle and predict old versus new? Would they say it doesn't matter whether the outcome is normal, since there are papers about how what matters is the distribution of the residuals, not of the outcome? What we're going to do is take one of the simpler options: treat the year as our outcome and the text in the medium column as our features, see how that goes, and then actually look at the residuals and ask whether that was a reasonable thing to do.

Let's build a little dataset to use for this. We take artwork, and to give ourselves at least a fighting chance with this distribution, we filter out the really old pieces, keeping only things after 1750. We take just year and medium, so we're only dealing with those two columns, keep only the artworks where we know both, and arrange by year so everything is in order. Let's call this tate_df, because we should keep in mind that this is a very specific dataset with very particular biases; that becomes very clear when you look at it. This is not a distribution of art overall, it's a distribution of art created and acquired with certain biases. "Oil paint on canvas" and "Etching and engraving on paper" are the oldest things we have, and if we look at the newest, we see rubber inner tubes, steel, hosepipes and ribbon, and 21 aluminium bricks — very modern materials. So I'm guessing there will be very different media near the beginning and near the end.

Just to get an idea of the most common art media, let's unnest_tokens() into word from medium and then count the words with sort = TRUE: "paper", "on", "graphite", "watercolour", "paint", "oil", "canvas", "screenprint", "lithograph". Those are the most common words in this medium column.
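A sketch of the filtering and tokenizing steps just described (the object name tate_df follows the screencast; exact row counts may differ):

```r
library(tidytext)

# A focused dataset: creation year and medium, post-1750, complete cases
tate_df <- artwork %>%
  filter(year > 1750) %>%
  select(year, medium) %>%
  na.omit() %>%
  arrange(year)

# Most common individual words used to describe the art media
tate_df %>%
  unnest_tokens(word, medium) %>%
  count(word, sort = TRUE)
```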
OK, let's go straight to building our model, using tidymodels, so let's load tidymodels. We have a lot of data here, so we're going to split into testing and training sets. Let's call it art_split, an initial split of tate_df, and we can stratify by year, which means the quantiles of year are used to stratify. Then art_train is the training portion of art_split and art_test is the testing portion, so now we have training and testing data. We're also going to make some resamples to use for tuning and evaluating the performance of our model: let's call them art_folds, using v-fold cross-validation. We resample the training data, because that's what we use for tuning and estimating performance. So now we have training data, testing data, and resampled training data for tuning and choosing models.

Notice that we still have quite a lot of data. In the resampled data we have 10 cross-validation folds, and within each fold there's the portion we use for fitting, what we call the analysis set, and the portion that's analogous to testing, what we call the assessment set. We have 10 folds with that much data each, which is nice; we are not hurting for data, so that's certainly not going to be the issue.

Next, let's get started pre-processing the data. We'll load textrecipes, which has recipe steps for pre-processing text. In our recipe we say: we're going to explain year (year is our outcome), our features are coming from medium (the art medium each artwork is created with), and our training data is art_train. What this tells the recipe is which columns we're dealing with, and now we can start building up a feature engineering recipe that is learned from the training data and then applied to other data, like testing data or new data. The first thing we do is tokenize the text. Then let's remove stop words; there aren't actually a ton, but I'm not that interested in "on" and "and" and whatnot. Then let's put a token filter on medium. What this does is say I don't want to keep every single word in here, because that would be a lot of features: if I kept everything, I'd have 1,463 features, and I don't want to take that many into training. Instead I'll keep just the top 500 words after removing stop words, which is a pretty good default for a lot of situations. Now, as I transform this into a matrix to use for modeling, I could weight by the counts, and a good default is often to weight by tf-idf instead — but these documents are really short, and tf-idf doesn't do as much for you with very short documents, so let's just use the counts here, via a term frequency step on medium.
After that, because I'm about to use a regularized regression, I have to normalize, so I'll normalize all predictors; this puts the counts all on the same scale. So that's the feature pre-processing recipe for this text data: all the steps we need so we can take something like the raw medium descriptions and convert it into something the model can train on.

Now let's make a model. A good default model to go with for text is regularized regression, and I'm going to do lasso: I'll tune the penalty and keep the mixture at 1, so that some of these word coefficients can be set to zero and thrown out entirely. I'll set the engine to glmnet, however you prefer to pronounce that. Then I'll set up a workflow: first I add the recipe, which is called art_rec, and then I add the model, which is called lasso_spec, and now it is a workflow.

Actually, I can use this as an opportunity to talk about the blueprint: we recently added support for sparse data here. I can make a sparse blueprint with hardhat's default recipe blueprint, setting the composition to a sparse matrix type, and then pass that blueprint when I add the recipe. I need something that can actually be sparse, i.e. full of zeros, which these counts are. What this does is pass a sparse matrix into glmnet, and it stays sparse throughout the whole fitting, which trains a lot faster; glmnet is something that trains faster with sparse data, so it's great to be able to do that.

All right, let's set up my parallel backend, and let's set up a grid of possible penalties to try, say 20 values. The default range for the penalty goes really small, and I don't think we're going to need that, so instead of 10^-10 to 10^0 let's try 10^-3 to 10^0; the way to change that is to change the range. Now let's tune on the grid we made: we pass in the workflow, we send in the resamples (art_folds), and we send in the grid, and let's call the results our lasso results. Is this ready to go? I believe it is; glmnet is now training. What it's doing is, for every fold, training every one of those 20 possible values of lambda. Is it done? Yes, it's done! That went quite fast, both because it's a linear model and because we used the sparse representation. Everything up to this point is sketched in the code below.
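Putting the whole modeling setup together — splitting, resampling, the feature engineering recipe, the lasso specification with a sparse blueprint, and the tuning grid — might look roughly like this (seeds and object names such as lasso_rs are my assumptions):

```r
library(tidymodels)
library(textrecipes)

set.seed(123)
art_split <- initial_split(tate_df, strata = year)
art_train <- training(art_split)
art_test <- testing(art_split)

set.seed(234)
art_folds <- vfold_cv(art_train, strata = year)

# Feature engineering: tokenize the medium text, drop stop words,
# keep only the 500 most common tokens, use raw counts, normalize
art_rec <- recipe(year ~ medium, data = art_train) %>%
  step_tokenize(medium) %>%
  step_stopwords(medium) %>%
  step_tokenfilter(medium, max_tokens = 500) %>%
  step_tf(medium) %>%
  step_normalize(all_predictors())

# Lasso regression: tune the penalty, keep mixture = 1
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Sparse blueprint so glmnet receives (and keeps) a sparse matrix
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix")

art_wf <- workflow() %>%
  add_recipe(art_rec, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# 20 penalty values from 10^-3 to 10^0, not as tiny as the default
lambda_grid <- grid_regular(penalty(range = c(-3, 0)), levels = 20)

doParallel::registerDoParallel()
set.seed(1234)
lasso_rs <- tune_grid(
  art_wf,
  resamples = art_folds,
  grid = lambda_grid
)
```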
This is a linear model, and that's important to keep in mind as we interpret what we just learned. We're basically about to see and understand: when we have that kind of very unusual distribution for the outcome and we just push forward with regression, was it OK, and what can we learn?

So let's look at the results. This looks pretty good: across the penalty values we have RMSE and R-squared, the curves are mostly flat and then go up a smidge, so the best value is right in that flat range. We can use show_best() on the lasso results to find the best RMSE. The best R-squared values are around 0.77; that's not great, and it tells you how well the model fits the data, i.e. how well we can predict the year of an artwork based on the art media that were used. The RMSE is in the units of the original outcome, so here it's in years: the root mean squared error is 34-ish years, and that's roughly how closely we can predict the year.

Let's get that best value out: call it best_rmse, from select_best() on the lasso results for RMSE. Just so you know, you don't have to select the very best one; with a model like the lasso you might want to select by one standard error instead, to pick a slightly simpler model. Now let's finalize the workflow: remember it had penalty = tune(), and instead of tune() we now plug in the value we chose as best. So we have chosen the penalty we're going to use, and we can fit one more time, sending in the finalized workflow and the split; let's call this art_final. What this does is fit one time to the training data and evaluate one time on the testing data, so when we collect_metrics() on it, the metrics we see are for the testing data, not the training data. We see about the same RMSE and about the same R-squared, so we have not overfit, which is good.

Now we can start to do some quite interesting things as we evaluate: is this model good or bad, and can we learn something from it even if it's bad? The first thing I want to look at is variable importance. For a linear model, variable importance is just the coefficients, but the vip package has some very convenient functions for getting them out. I wondered whether we'd have to fit one more time, but no: the workflow has already been fit inside last_fit(), so we can pull the workflow fit out of art_final and call vi() on it. Let's call the result art_vip.
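A sketch of choosing the penalty, finalizing the workflow, fitting to the testing data, and extracting the coefficients; pull_workflow_fit() was current at the time of this screencast (its newer equivalent is extract_fit_parsnip()):

```r
autoplot(lasso_rs)
show_best(lasso_rs, metric = "rmse")

best_rmse <- select_best(lasso_rs, metric = "rmse")
# Or, for a somewhat simpler lasso model within one standard error:
# select_by_one_std_err(lasso_rs, desc(penalty), metric = "rmse")

final_lasso <- finalize_workflow(art_wf, best_rmse)

# Fit once on the training data, evaluate once on the testing data
art_final <- last_fit(final_lasso, art_split)
collect_metrics(art_final)

# The workflow was already fit inside last_fit(), so pull it out
library(vip)
art_vip <- pull_workflow_fit(art_final$.workflow[[1]]) %>%
  vi()
```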
So art_vip is the variable importance for those 500 features we made — the ones we kept when we said max_tokens = 500 in our recipe. If we arrange art_vip by importance, the zeros are the things that got taken out: "beans", "boxes" — words that the regularization set to zero because they couldn't buy their way into the model. The things with the most importance are "graphite", "paper", "ballpoint", "watercolour", "plates", and whatnot.

Let's take this and group by the sign, because some coefficients are positive and some are negative: some words push toward newer art and some toward older art. Then slice_max(), ordering by the absolute value of importance and keeping 20 on either side, so we have 40 features instead of 500 and can make a nice little graph. Let's mutate a few things for the plot: remove the prefix from the variable names; make importance the absolute value of importance, because there are negative numbers in there and I want to plot them all by magnitude; reorder the variable as a factor by importance (fct_reorder, not fct_recode — and this needs to come after removing the prefix); and clarify what the sign means. We can use if_else, because there are only two values: if the sign is positive, it's "more in later art", else "more in earlier art". Let's make sure that runs — great — and now we can save that, or just pipe it straight to a plot: importance on the x-axis, the variable on the y-axis, fill by sign, bars, and faceted by sign. We need scales = "free", we don't need the legend, and we do not need the y-axis label. Very nice; a sketch of the wrangling and plot follows below.

So let's look at this. In earlier art, people were more likely to use graphite, paper, pen, oil paint, watercolours, engraving, intaglio, mahogany. In later art, people were more likely to use dung, photographs, coffee, and glitter. I'd say this is a success. Gosh, I just love text analysis — honestly, I could not be happier about this. Modern art is amazing, I need to see more art with glitter, and I haven't been in a museum in so long, which makes me sad. This is fantastic.
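One way to wrangle art_vip into that faceted importance plot; the tf_medium_ prefix and the "POS"/"NEG" labels are what step_tf() and vi() produce for a glmnet model, with the tidyverse already loaded:

```r
art_vip %>%
  group_by(Sign) %>%
  slice_max(abs(Importance), n = 20) %>%   # top 20 on each side
  ungroup() %>%
  mutate(
    Variable = str_remove(Variable, "tf_medium_"),
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance),
    Sign = if_else(Sign == "POS", "More in later art", "More in earlier art")
  ) %>%
  ggplot(aes(Importance, Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free") +
  labs(y = NULL)
```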
That was which features push the predictions more in one direction or the other, but we can also measure how well the predictions do. Let's use collect_predictions() on art_final: if we run this, what it gives us — again, for the testing set — is the predicted date and the real date. So we can make a plot of the real date against the predicted date. Let's put our friend the slope-equals-one line on there, gray and a little bit thick, and then add the observations as points, made probably a lot transparent (there are a lot of points on here) and colorful; coord_fixed() is going to be helpful here too, and making the points a little bit bigger helps, I think.

OK, this is super interesting. What's happening is that we've got clumps. Up here in the new art, we have things that are predicted well — close to the line — and these are the things made with glitter and photographs and coffee and dung. We also have things that are predicted well that are really old, but notice that the cloud spreads upward: some newer art also uses those media that are largely used in the old art (although we get some of that going both ways). But look at this horizontal line basically in the middle: something is being predicted at the middle of the whole distribution all the time, maybe literally the mean or median of the dataset, and it's just wrong all the time. Notice that predicted value occurs at any true year; it's probably the most common media — something like oil paint, something like paper — that can show up anywhere, so it isn't very predictive and just goes to the middle. Those predictions are wrong all the time.

So we can look at the most "misclassified" observations. Take collect_predictions() on art_final and bind on the medium column from the testing data — all we want is that one column — so now we've stuck them together. Then let's filter for the most misclassified, where the absolute value of the year minus the prediction is greater than 100, i.e. more than a hundred years off. There are about 400 of those. Here are the new ones: oil paint, mezzotint on paper, etching — old techniques used in recent years, which were therefore predicted to be old, or at least to be things that existed a long time ago. And here are the things that are actually old but were predicted to be in the middle: oil paint on canvas, oil paint on canvas, oil paint... "Oil paint on canvas" is always predicted to be 1898.
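A sketch of the predicted-versus-true plot and the filter for badly missed predictions; binding on medium works here because collect_predictions() returns the testing rows in their original order:

```r
collect_predictions(art_final) %>%
  ggplot(aes(year, .pred)) +
  geom_abline(lty = 2, color = "gray50", size = 1.2) +  # slope-one line
  geom_point(alpha = 0.3, color = "midnightblue") +
  coord_fixed()

# Artworks whose predicted year is more than a century off
misclassified <- collect_predictions(art_final) %>%
  bind_cols(art_test %>% select(medium)) %>%
  filter(abs(year - .pred) > 100)

misclassified %>% arrange(year)        # old works predicted to be newer
misclassified %>% arrange(desc(year))  # new works predicted to be older
```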
This is how a linear model works: a linear text model just adds the contributions together — how much is "oil" worth, how much is "paint" worth, how much is "canvas" worth — and that's where it lands you. "Oil paint on canvas" was one of the most common media, so it lands you right in the middle of the whole distribution. I'm actually loving how this is turning out, because it really explains how linear models work with text; often they're a great option, but here we can really see how they do and don't work well. And in these misclassified artworks, the most common words are pretty much the same as overall: paper, graphite, etching, canvas, watercolour — about the same things.

The last thing I want to do is look at the residuals. We can get the residuals from art_final via the augment() function. Remember that art_final is the output of last_fit(), so this is the testing data. We get the true year, the predicted year, and the residual, and we also have the feature we used, so we could do lots of different things here to understand the model. Let's just make a pretty straightforward residual plot: predicted value against residual, with a horizontal line at y = 0 in the same style as before, the same kind of points as we had before, and then a smoother on top to see how this goes.

OK, residual plot — there we go. We definitely see heteroskedasticity (I don't know, how do you think I did on that? I'm not going to try to say it again). There is more variance in the residuals at medium-to-low years and not very much at high years. In the modern art, when we predict a recent year, it's likely to really be a recent year: when we have things like glitter or dung, it's likely to be new. But if we have something like oil paint on canvas, that's not very specific or predictive, so we have high variance in the residuals there. This plot helps us understand when the residuals are tight and when they're broad, and also that the predictions in the middle are wrong both high and low, while the residuals at early years go only in one direction (at least with the way we filtered our data) and the residuals at recent years are much tighter.
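A sketch of that residual plot, assuming the augment() method that tune provides for last_fit objects (which, per the screencast, returns the testing data together with the prediction and the residual; the .pred and .resid column names are my assumption):

```r
augment(art_final) %>%
  ggplot(aes(.pred, .resid)) +
  geom_hline(yintercept = 0, lty = 2, color = "gray50", size = 1.2) +
  geom_point(alpha = 0.3, color = "midnightblue") +
  geom_smooth(color = "red")  # smoother to show the variance pattern
```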
So: we created a model using this dataset of artwork from the Tate collection, and then we evaluated that model using model diagnostics like the residuals. Is this model good or bad — is it a terrible model? I would probably say that for most purposes this model did not turn out great; when we look at the graph of predicted versus true values of the year, or at the residual plot, those are not the signs of a really well-performing model. But it is interesting to think about how much we learned about the media used to create these artworks, and how that changed over time, even with a model that did not have the best properties. Thanks so much, and I'll see you next time!
Info
Channel: Julia Silge
Views: 2,126
Rating: 4.9166665 out of 5
Id: -ErHI3MJyDQ
Length: 41min 43sec (2503 seconds)
Published: Fri Jan 15 2021