SAS Tutorial | 5 Ways to Make Your SAS Code Run Faster

Video Statistics and Information

Captions
Don't you just hate waiting for results after you submit your SAS programs? I sure do. In this video I'm going to share five easy ways to make your code run faster. We'll compare the results from different techniques using both SAS data sets and Oracle tables as our source, so we'll get a good idea of what runs fastest in most circumstances. Let's get to it. I'll be running my code in SAS Viya 2021.1; the techniques shown here will work on your compute server whether you're running SAS 9.4 or another release of Viya, and you should get similar results.

First, let's check the difference between using a KEEP statement and a KEEP= data set option on the data set we're reading. If you recall, the KEEP statement makes sure we don't write out variables we don't need, but it doesn't prevent any of the variables from being read in. In very wide data sets this can make a difference in how long the read takes. When we run both techniques on SAS data sets, the difference is minimal: the real times are almost the same, with the KEEP= version faster by a very narrow margin. If we use this on database data, though, the picture changes. The database doesn't live on the same machine as my SAS compute server, so there's some network latency involved. Let's run both techniques against the same database table and see what happens. With the KEEP statement, we selected everything from the database table, read it all over the network, and it took almost 10 seconds of real time to execute. With the KEEP= data set option keeping only the ID variable, only the ID variable had to be passed over the network into SAS, and that cut the read time down to 0.1 seconds. That's a huge gain. So: minimal gains with SAS data sets, but very much faster with databases. The KEEP= on the SET statement wins, so here's our first rule of thumb: if a KEEP= data set option on the SET statement and a KEEP statement in the DATA step will produce the same results, use the KEEP= data set option.

One of the next choices is how we subset our data. We could use a subsetting IF or a WHERE statement to subset our data by ID. Once again we'll start with the SAS data set; this one is a narrow data set with about 5 million records. Let's see whether the subsetting IF or the WHERE statement is faster. Interestingly enough, the first one, IF ID = 4, had to read in all 5 million observations, and that took about 0.33 seconds. Pretty fast, but the WHERE statement only read in one observation, and that took 0.25 seconds. That's a good savings for a small change in our code. Now let's look at the impact of IF versus WHERE when we're using database data. These same tables are replicated in my Oracle database, so I'll run the DATA step with IF ID = 4 and with WHERE ID = 4 against the same table. The subsetting IF now takes 8 seconds, a very long time compared to only 0.2 seconds when we let the database do the subsetting. Once again, significant gains on SAS data sets and really awesome gains in speed on database tables. That gives us our second rule of thumb: if a subsetting IF and a WHERE will yield the same result, use the WHERE statement.
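To make those first two rules concrete, here is a minimal sketch of the four patterns being compared. The librefs and data set names (MYDB, MYLIB, BIG_TABLE, NARROW) are illustrative placeholders, not the actual names from the demo.

    /* KEEP statement: every column is read (and, for a database table,
       fetched over the network); only ID is written out */
    data work.ids_v1;
       set mydb.big_table;       /* hypothetical database libref and table */
       keep id;
    run;

    /* KEEP= data set option: only ID is read, so only ID crosses the network */
    data work.ids_v2;
       set mydb.big_table(keep=id);
    run;

    /* Subsetting IF: every observation is read, then filtered in the DATA step */
    data work.one_id_v1;
       set mylib.narrow;         /* hypothetical 5-million-row SAS data set */
       if id = 4;
    run;

    /* WHERE: the filter is applied as the data are read,
       or passed through so the database does the subsetting */
    data work.one_id_v2;
       set mylib.narrow;
       where id = 4;
    run;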
Speaking of WHERE statements, you can write a WHERE expression a number of different ways. For example, we could say WHERE SCAN(customer_name, 1) = 'Silvestro', and that will give us all rows from this database table where the first word in the customer name is Silvestro. We could have written it differently, using an ANSI-standard operator rather than the SCAN function. Now, I love the SCAN function; I was excited when I discovered it in SAS because it made it so easy to take words apart from phrases. But SCAN is not ANSI standard, and most databases don't understand it, particularly when we're using SAS/ACCESS libname engines. The LIKE operator, on the other hand, is ANSI standard: not only does SAS understand it, all databases do too, so it should be fairly easy to get the database to do the subsetting. Let's test those two, the SAS function versus the ANSI-standard operator. The WHERE with SCAN again read all the rows in from the database, and that took about 0.43 seconds, but when we use LIKE instead, it took 0.03 seconds. Once again, an enormous gain in speed for a small difference in how we write our code. This applies whether we're using a DATA step or SQL. Here I have the SCAN function in one WHERE clause and the LIKE expression in the other, and generally SQL does a pretty good job when working with databases; we still get the one record back that we requested, but the query with SCAN took 0.16 seconds and the run using the LIKE operator took only 0.026 seconds, roughly an order of magnitude faster. So we have a clear winner: when we write WHERE expressions, we should keep them as ANSI as possible. Of course, there will be times when the function we need is only available in SAS. In that case, put it in the WHERE clause and don't worry about it; just realize that it might take a little longer to process than if we could write it in an ANSI-standard way.

Speaking of DATA step versus SQL, which one should you use? Let's do something simple like a summarization. I'm going to summarize that narrow data set in SAS once with a DATA step and once with PROC SQL, and check which one does the better job. Using SAS data sets as input, the times are similar, with a slight edge to the DATA step: it processed those five million rows in about 0.4 seconds, and the SQL procedure took just a smidge longer. When we're using SAS data sets as input, it's probably a wash. What if the source is a database? SQL is a language that's much closer to the native language of the database itself, so it should be able to do that summarization a little faster than a DATA step; in other words, we should be able to get the work done in the database itself. Let's see what happens when we use a database table as input instead. This summarizes a pretty big chunk of data out in the database, and this time the DATA step summarization ran seven and a half seconds. By doing the summarization in SQL we managed to get all of the work done inside the database, and the database has an enormous amount of CPU, RAM, and disk space available to it, so it processed the data far faster: only a third of a second for the same summarization.
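Two sketches to go with that discussion. First, the two WHERE expressions: the Oracle libref, table, and column names (ORALIB, CUSTOMERS, CUSTOMER_NAME) are placeholders, and the LIKE pattern is a close, though not perfectly equivalent, rewrite of the SCAN test.

    /* SCAN() is SAS-only: the libname engine cannot pass it to the database,
       so every row comes back to SAS to be tested */
    data work.hit_v1;
       set oralib.customers;
       where scan(customer_name, 1) = 'Silvestro';
    run;

    /* LIKE is ANSI standard, so the database can do the subsetting itself */
    data work.hit_v2;
       set oralib.customers;
       where customer_name like 'Silvestro %';
    run;

Second, the summarization comparison, simplified to a single overall total of a hypothetical numeric column X.

    /* DATA step summary: every row is read into SAS, even from a database */
    data work.sum_ds;
       set mylib.narrow end=eof;
       total + x;                /* accumulate across all rows */
       if eof then output;
       keep total;
    run;

    /* PROC SQL summary: against a database table, the SUM() can be
       passed through and executed inside the database */
    proc sql;
       create table work.sum_sql as
       select sum(x) as total
       from mylib.narrow;
    quit;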
Because it was pretty much a wash when we compared SQL and the DATA step on SAS data sets, and it made such a huge difference when we ran in the database, I think we can make another rule of thumb: if a DATA step and SQL are going to produce the same result set, use SQL. This assumes, of course, that we're only producing one result from that processing. If I'm using a DATA step to produce multiple data sets at once, the answer might change, because SQL can only produce one result set at a time; we would need to run that SQL query over and over again and read that big data set multiple times to get the same output. So let's test that. We'll run a DATA step that breaks up a big table, NARROW, into smaller tables, groups 1 through 10, and we'll do the same process using SQL, which requires an individual query for each one of the tables we want out. Once again, this is a very large table and SQL reads it over and over again, so this might take a little longer to run than we would like. And we have our answer: looking at the log for the DATA step, it created all 10 groups and the total time was about 4.6 seconds. The PROC SQL bottom line also created all 10 groups, but it took 5.1 seconds, longer than the DATA step. From this we can conclude a new rule of thumb: if you need to make multiple result sets by reading from a single large table, the most efficient way to do that is probably a DATA step, because SQL requires multiple passes over the big data.
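A minimal sketch of that split, reduced to three groups to keep it short (the video used ten). The grouping variable GRP is a placeholder.

    /* One pass over the big table produces several output data sets at once */
    data work.g1 work.g2 work.g3;
       set mylib.narrow;
       select (grp);
          when (1) output work.g1;
          when (2) output work.g2;
          when (3) output work.g3;
          otherwise;
       end;
    run;

    /* The SQL version needs a separate query, and a separate full read
       of the source table, for every output table */
    proc sql;
       create table work.g1 as select * from mylib.narrow where grp = 1;
       create table work.g2 as select * from mylib.narrow where grp = 2;
       create table work.g3 as select * from mylib.narrow where grp = 3;
    quit;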
All right, and finally, our last thing to test: CPU-bound processes. Everything we've talked about so far assumed that the size of your data was the problem, that I/O or network latency was the problem. But what if what you're doing is really complicated, so it takes a lot of CPU time? A CPU-bound process is something we need to think about differently. Let's take a look at a scoring algorithm that we run against that narrow data set and see how long it takes to score the data. Remember, this is a DATA step, so the scoring process runs single-threaded on 5 million records. We wait, and we wait, and we have some coffee. Finally we've got results, and notice that the real time and the CPU time are very close to each other. That's a great indication that the process is CPU bound, and a CPU-bound process is not going to benefit much from any of the other techniques we looked at; we need to run it multi-threaded so that it will go faster.

Fortunately for me, I'm running in a SAS Viya environment, which means I have access to CAS, and CAS is a native massively parallel processing environment. I can load that SAS data set right up into my caslib, so that's what I'm going to do. In CAS we have to preload the data first, and then I process it with that same DATA step with a couple of very small modifications: instead of writing out to the WORK library I write out to my caslib, and instead of reading the data from the SAS library I read it from the caslib. Otherwise I'm doing nothing different. Remember, that step ran 40 seconds last time. I run it again in CAS, and whoa: instead of 40 seconds it took almost six seconds. That's an enormous gain in processing speed, and it's because my CAS server has 16 nodes and it processed some of the observations on each one of those nodes. So I'm going to say that if you're doing something CPU-bound and you have access to CAS, you should be running it in CAS.

"But Mark," you say, "I'm not in a Viya environment; I'm running SAS 9.4 and I don't have access to a CAS server." That's okay: Base SAS also includes a language called DS2. DS2 is a lot like the DATA step, but it's built from the ground up to allow multi-threading. In DS2 you create a thread that holds the program you want to run against each row of data, including reading in that big SAS data set. The thread is nothing but a stored program; creating it runs really fast because it just stores the program, and we haven't actually processed any data yet. Once we have a thread program, we declare a copy of that thread in our DS2 data program. I write out to the WORK library just like I did with the Base SAS DATA step, and I use a special SET FROM statement to read from that thread program with as many threads as I have CPUs available on my SAS system. I run this, and remember this is straight-up Base SAS, no CAS involved, and the original DATA step took 40 seconds. I don't have 16 nodes on my compute server, but I do have four CPUs, and DS2 ran the whole thing in about 8.8 seconds. That beats the heck out of 40, so it's a good deal: even if you're running straight Base SAS, it is possible to multi-thread a CPU-bound process and dramatically improve its speed. So for this rule of thumb: if your process is CPU bound and you have access to CAS, run it there; if your process is CPU bound and you don't have access to CAS, consider rewriting it as a DS2 program so it can multi-thread on the compute server.

Let's quickly review the rules of thumb. If a KEEP= data set option and a KEEP statement give you the same results, use the KEEP= on the SET statement. If a subsetting IF and a WHERE clause give you the same results, use the WHERE. If SQL and a DATA step produce the same result, particularly when you're using database data, use SQL instead of the DATA step; the caveat is that if you intend to produce multiple result sets from one pass of the data, use the DATA step instead of SQL, because SQL will require multiple passes. And finally, for CPU-bound processes: if your process is CPU bound and you have access to CAS in Viya, go for it, run it in CAS; if that's not an option, try rewriting it in DS2, which is available to anyone with Base SAS.

For more information you can look at the links below; there should be a link to the code I used to produce this video. Feel free to leave any questions in the comments, and for more SAS tips like this, subscribe to the SAS Users channel. Thanks for watching, and may the SAS speed be with you.
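To make the last two techniques concrete, here are minimal sketches of the CAS and DS2 patterns described in the captions above. The session name, caslib, table names, and the one-line "scoring" computation are all placeholders; the real scoring logic is in the code linked from the video, not reproduced here.

    /* CAS: load the SAS data set into a caslib, then point the same DATA step
       at CAS tables so it runs distributed across the CAS nodes (Viya only) */
    cas mysess;                        /* start a CAS session */
    caslib _all_ assign;               /* assign librefs such as CASUSER */

    proc casutil;
       load data=mylib.narrow outcaslib="casuser" casout="narrow";
    quit;

    data casuser.scored;               /* write to the caslib instead of WORK */
       set casuser.narrow;             /* read from the caslib instead of a SAS library */
       score = x * 2;                  /* placeholder for the real scoring statements */
    run;

    cas mysess terminate;

    /* DS2: put the row-level logic in a thread program, then run several
       copies of the thread in parallel from a DS2 data program (Base SAS) */
    proc ds2;
       thread score_th / overwrite=yes;
          dcl double score;
          method run();
             set mylib.narrow;         /* each thread reads a slice of the data */
             score = x * 2;            /* placeholder for the real scoring statements */
          end;
       endthread;
       run;

       data work.scored2 / overwrite=yes;
          dcl thread score_th t;       /* declare an instance of the thread */
          method run();
             set from t threads=4;     /* the video used as many threads as available CPUs */
          end;
       enddata;
       run;
    quit;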
Info
Channel: SAS Users
Views: 1,631
Rating: 5 out of 5
Keywords: sas tutorial, sas code, sas how to tutorial, sas data sets, sql vs data step, sas, sas tips, run sas code, optimize sas code, sas code performance, sas code performance optimization, speed up sas code, make sas code run faster, sas sql, run sas code more efficiently
Id: he9ecikOPCE
Length: 17min 57sec (1077 seconds)
Published: Mon Sep 27 2021