SAS Tutorial | Running SAS 9 Programs in SAS Viya

Captions
Hello. I'm Mark Jordan, your instructor for Running SAS 9 Programs in SAS Viya. As a SAS programmer, you may have heard about the massively parallel processing capabilities available in SAS Viya. Perhaps you've gotten access to SAS Viya at work. Maybe you're taking a university class using SAS Viya for Learners. Or it could be you're just wondering what SAS Viya is all about. In any case, I'm really looking forward to spending a few hours showing you how easy it is to get your SAS 9 code running in our high-performance SAS Viya environment. In lesson one, we'll learn a little bit about what SAS Viya actually is and how it's architected, and then we'll talk a little bit about how to work with data in CAS libraries. So let's dive right in. SAS Viya is a great addition to the SAS platform. The part of SAS Viya that excites me most is Cloud Analytic Services, or CAS, which provides an in-memory processing engine. That means that you can lift data into memory and leave it there while you conduct several subsequent processes on it, without having to write it back to disk in between. And you can do this safely because of the way the system redundancy is set up. This allows for extremely fast processing when you're doing things like manipulating data and then doing some analysis on the results. In addition, the way CAS loads data into memory allows it to handle data of any size, even if the data is larger than the physical memory available to you. And it handles it all with lightning speed. As we mentioned before, SAS Viya is an addition to the SAS platform, and the SAS 9 engines still exist within this platform. This course focuses on those of us who have been programming in SAS 9 for years and are wondering how we can best leverage the new additions to the SAS platform. SAS Viya is an open platform. You can run SAS Viya in a variety of environments in a way that makes it very flexible to meet your needs. The main question on our minds is, in this new massively parallel processing, or MPP, environment, can we still use the traditional SAS coding methods that we've become accustomed to? And the answer is yes. As long as you are aware of how SAS Viya works, and of some of the differences that makes in the way things are processed, you can certainly use SAS code that you're familiar with in the Viya environment. So in this course, we're going to learn how to connect to SAS Viya from a SAS Compute Server session. That could be any SAS session, like SAS 9, or the Compute Server associated with the Viya platform itself. Then we're going to write code that will allow us to access data in CAS. And we'll run SAS 9 programs and see how we might modify them to better adapt them to the SAS Viya environment. SAS Viya is exciting because of the way it accesses data. You can access data from any source. You can access data that lives on on-premises hardware or data that you read in from the cloud. You can process relational data. You can process unstructured data. And your data can be saved to a variety of formats with ease. There's a new format for saving CAS tables called SASHDAT. This is an exceptionally cool format for this environment. Basically, it's memory mapped, so it looks on disk pretty much like it looks in memory. And it is stored distributed across the nodes in your environment. This data can be loaded in parallel into RAM at lightning speed, and it is the preferred format for storing tables from CAS when you write them to disk.
In this class, we're going to use SAS Studio on a pure Viya installation. You're going to see that the code runs there exactly like it does in a traditional SAS 9 installation. With a modern version of SAS 9, you can also use SAS Studio or Enterprise Guide, or even the SAS windowing environment, to connect to SAS Viya and CAS and execute code up there. But it's more open than that. You don't really even need a SAS client to run code in Viya. You could do it from a Jupyter notebook. You could do it from RStudio. You can do it from a variety of different client interfaces, depending on what you need to build your application. In this course, we're focused on running in CAS from SAS. The architecture for SAS Viya is different from our traditional SAS Compute Server installation. We're going to focus pretty much on this area here, Cloud Analytic Services. Cloud Analytic Services is designed to provide a distributed processing model in which pieces of the data are loaded across several machines and are processed in parallel. The data is lifted into memory, processed once, processed again. Subsequent analysis can happen all without writing the data back out to disk. Only when you wish to save the data and not work with it anymore do you explicitly save the data back to the offline storage. Working with CAS is made significantly easier in SAS Studio by the presence of snippets. Snippets in the SAS Studio interface are accessed by this icon you see over here, and they're broken up into folders for snippets designed for different purposes. In the SAS Viya Cloud Analytic Services folder, there are snippets that act like templates. They allow you to connect to CAS, to load data into the CAS memory, to save data back out to the offline storage, or just to drop it out of memory if you like, and to disconnect from your CAS session. So before we can do any processing in CAS from our SAS 9 environment, we first have to establish a connection to a CAS session. So let's start working with SAS Viya. The first thing we need to do, of course, is get into SAS Studio. You can access SAS Studio by starting the Chrome browser down here on the taskbar. When Chrome opens up, there's a folder up here with bookmarks. We'll just click on SAS Studio. The user ID and password are student and Metadata0: capital M, e-t-a-d-a-t-a, with a numeric zero on the end. We won't need administrator authority for anything we're doing in this class, so when we log in, we're going to choose the default, No, when asked about assuming SAS administrator authority. Once SAS Studio is open, you can see that it's very light. I am an old submarine sailor, and I am very fond of the new dark themes that are coming so much into vogue. SAS Studio has a dark theme that I can set. So I'm going to go up to my profile here and click on Settings. And here, we have a chance to choose a theme. You'll notice that this light theme is called Illuminate. If you ever want to get back to it, that's what you use. The dark theme is called Ignite, and I'm going to go ahead and choose that to make things a little easier on my eyes. Now remember, you're in a browser, right? So if you hold down the Ctrl key and scroll your mouse, you can make things bigger. And if you scroll the other way, you can make things smaller. So it's quite easy to adjust this environment to work as you would like it to work with your eyes. Now let's work with snippets to get access to CAS. Down here on the Snippets bar, you'll see that there's a whole section on SAS Viya Cloud Analytic Services.
And New CAS Session is the one that we'll be needing. Now, you can right-click and say Open as a Program, or you can just double-click it and have it open in a new tab. This is a template for starting a session. If we leave it alone, our session will be called mySession, but you can name your session anything you'd like, as long as it's a valid SAS name. And then in the session options, you can set some differences from the defaults if you want. This determines how national language formats and things are interpreted. And this is the active CASLIB that we want to use when we start up our CAS session. Now, we could change any of these things, but we don't need to. We can just run this code. And in the log, you'll see that mySession has been connected to CAS and that CASUSER is the active CASLIB. Now that we have a CAS session established, eventually we're going to need to turn it off. So to terminate our CAS session, we have another snippet. I'll just double-click Terminate a CAS Session. And you'll notice that because we accepted the default session name, this is all set already to terminate my CAS session. So I'm just going to go ahead and run this code. And we can see that mySession was successfully terminated. Now, to access files during the course, you'll need to be able to find those files and determine the path to get to them. Here we're going to go into the server, then Home, then the workshop folder, and then sgf20s9v. And you can see that these are the folders that have our data in them. For example, if we wanted to write a SAS program that would access this Excel file, we could right-click on it, choose Properties, and there's the name of the file and the location. Now you notice, we had to click an awful lot to get down in here. We're going to be using these files a lot, so it'd be really helpful to have a shortcut to these files to make it quicker and easier to get to our course data. We can do this by taking the folder that we want to make a shortcut to, right-clicking on it, and choosing Add a Shortcut. Give it a reasonable name. Now, I'm going to click OK. And when I do that, up here in the folder shortcuts, you'll now see a folder shortcut for sgf20s9v. You'll see that it's a quick way of accessing those same folders and files that we saw in the workshop folder. So from now on, I'm going to use the shortcut to get at the files for this particular class. Now that we've connected to a CAS session, the next step would be to access some data up in the CAS library. In a traditional SAS environment, we use a LIBNAME to point to collections of SAS data files that are stored somewhere in a network location we have access to from our SAS Compute Server. So for example, in Linux your path might look like this, to a folder named MySAS, which contains several SAS data sets. We'd like to refer to this aggregate location for SAS data sets as MySAS, and so we use a LIBNAME statement to access that folder location using the Base SAS engine to interpret the data sets that are stored in there. Our CASLIBs work a little differently. A CASLIB consists of both an in-memory space and the physical data storage space that's offline. When you're working with CAS data, only the tables that are loaded in memory are available for processing. The data source can store many different types of data that can be loaded into memory quickly and easily in the CASLIB. Those sources are not available for processing in CAS until they're loaded into the memory space.
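For reference, the two session snippets we just ran boil down to something like this minimal sketch. The session name and options shown are the snippet defaults described above; adjust them for your own environment.

   /* Start a CAS session; CASUSER becomes the active caslib */
   cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");

   /* ... work with data in CAS here ... */

   /* Terminate the CAS session when you're finished */
   cas mySession terminate;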
The data source can store physical files, like Excel files, SAS data sets, or SASHDAT files. But it can also store connection information that allows you to make high-speed connections to relational databases and other data sources. Now, CASLIBs have two different scopes. Session scope means that when you assign that CASLIB, it will be active only for as long as the session remains active. As soon as you terminate the CAS session, the session-scoped CAS library is terminated also, and its tables disappear from memory. This works much like the WORK library, for instance, in traditional Base SAS. Conversely, a library can be assigned in CAS with global scope. Now, you have to have permission to create a globally scoped library. You may either belong to a group that has been granted such permission by the SAS administrator, or perhaps you're the SAS administrator yourself. CAS libraries created with global scope are visible to all CAS sessions that have permission. You can set permissions on these. They would be visible to all CAS sessions on that particular CAS server, and they persist in RAM even when you disconnect from CAS. So when you reconnect to a CAS session, those global-scope tables are still loaded in memory for you to continue processing in your next session. Remember, again, that all of this talk of data persisting is about persisting in memory. Data that you saved to an offline storage location, of course, persists, but it can't be processed until you load it back into the memory space. So what happens to a session-scope CASLIB when a CAS session ends? The answer is that it's not available next time you log in, because session-scope libraries are visible only to the person who created them and only for the duration of the CAS session in which they were created. Now, when you connect to CAS, there is always an active CASLIB, and there can only be one CASLIB active at any one time. Generally, if you connect to CAS and don't specify otherwise, CAS will automatically make your CASUSER library active for you at the start of your CAS session. It's a place for you to store your own private files, and you're the only one that has access to the CASUSER library. At the invocation of CAS, or at any time while you're in a CAS session, you can always change the active CASLIB. So if you're starting a CAS session with a CAS statement, you can use the CASLIB= session option to specify a different CASLIB than CASUSER. Conversely, if the session is already in progress and you want to switch active CASLIBs, you can use a CASLIB statement to do that. This kind of breaks out into three different types of CASLIBs. There's your personal CASLIB, which is global. Your personal CASLIB will always be available to you when you log back in and restart a CAS session. If you don't specify a CASLIB when you start your CAS session, it will automatically be the active CASLIB. Then there are predefined CASLIBs. These are CASLIBs that have been defined by someone with administrative privileges. They are generally global in scope, and they have access controls on them to determine which users can tap into them during their CAS sessions. They are generally created for popular data sources that many different people require access to frequently. And finally, there are manually added CASLIBs. Now, you have to have permission to do this. Not everybody will have permission to manually add CASLIBs, but your user in this class has this permission.
And when you add a CASLIB manually, you can create it as a session CASLIB, which will only exist for the duration of your session, or you can create it as a global CASLIB, which will persist after your session disconnects. The administrator can control who has access to that CASLIB. We usually use these for ad hoc data access, so we frequently do have authorization to manually add CASLIBs. So now that I have a CASLIB active in my CAS session, how can I see what's in there on the SAS 9 side of things? We can connect to the CAS session with a LIBNAME statement. We provide the libref that we want to use on the SAS side to refer to that data. In this case, we're using CASUSER. And we specify the CAS engine to connect to it. Now, you can stop there, put a semicolon, and the CASLIB that's active at the time will be connected to the CASUSER libref that you see here. Or you can go a little further and explicitly specify which CASLIB you would like that LIBNAME to connect to. When you've connected to a CASLIB with a LIBNAME statement, only the tables that have already been loaded into memory are visible. Remember, we said that in order to process data in CAS, the data has to be loaded into the memory space. So it is quite possible that you'll have several offline pieces of data stored in this CASLIB, but when you connect with the LIBNAME, you'll see nothing, because none of them have yet been lifted into memory. Now, rather than assigning CASLIBs one at a time with a LIBNAME statement, you can use a CASLIB statement to assign librefs for all of the CASLIBs available to you in your CAS session. You do this with the _ALL_ naming convention and the keyword ASSIGN. If you don't want them all, but you have three or four of them that you'd like to connect at once, rather than running three or four separate LIBNAME statements, you can just say CASLIB, list the CAS library names that you're interested in, and add the ASSIGN keyword. This makes all those CAS libraries available using librefs that are the same as the CASLIB names, and they'll be visible in your SAS session. So let's take CAS for a test spin. In SAS Studio, I'll find the demo program. In this program, we're setting a path macro variable that points to the base location for all of the course files here. Then we have a LIBNAME statement that assigns a libref to point to our SAS files that are located in the MySAS folder. And then we're going to start up a CAS session named MySession. We're really using all the default settings. And we're going to list the CASLIBs. Let's go ahead and run that code. You'll notice in the SAS log that we are connected to CAS, that our CASUSER library is now the active CASLIB, and then a list of all the CASLIBs that are available to me in my session. Even though CASUSER is the default, there are several other CASLIBs available out there. So we have connected to CAS, and we've actually assigned a LIBNAME to our Base SAS data. So here in my libraries I see MYSAS. But I don't see anything about CAS, no CAS libraries. Well, early on we said that we could use a LIBNAME statement to assign a libref of our choosing to a CAS library. Because CASUSER is my personal library in CAS, I'm going to call the libref CASUSER also. I'm going to go ahead and assign that libref here. And now my libraries show the CASUSER library, but there are no data files in memory at the moment. Now, if I wanted to assign librefs to all of my CASLIBs at once, I could use that CASLIB statement instead.
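Here is a minimal sketch of those two approaches; the libref name is just the one used in this course.

   /* Assign a single libref through the CAS engine, tied to the CASUSER caslib */
   libname casuser cas caslib=casuser;

   /* Or assign librefs for every caslib available to your CAS session at once */
   caslib _all_ assign;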
Remember, CASLIB _ALL_ ASSIGN. And when I do this, you'll notice that there are a lot of CASLIBs that did get assigned, but there are also a couple of notes in here that say, basically, that the CASLIB name was not valid for use as a libref. You remember that there's a restriction on librefs that says we can only have eight characters in them. The CASLIB named ModelPerformanceData was too long to make a libref out of, and so a SAS libref was not automatically assigned. We'd have to do that one by hand and give it some shortcut name of our own if we'd like. But it did expose all of these other CASLIBs to me. You'll notice that most of those don't have any data loaded in them right now either. Well, I wonder if there's anything in my CASLIB, CASUSER. You'll notice that there's no content showing. Well, I can list the files in the offline storage with a LIST FILES statement in PROC CASUTIL. PROC CASUTIL is a very useful procedure for fiddling around with stuff in the CAS environment. We can list the files and the tables. The tables would be things that are in memory. I'm going to go ahead and run that list. And we notice that it gives us some great information on the path to the actual physical folder that is associated with the CASUSER library. And then we have a boatload of files out there. But when we did the list for the tables, nothing showed up. So no tables are loaded in memory. Let's have a look at the Public library. We can change the CASLIB to Public by restarting our CAS session with a new CASLIB= specification. Then we can list the things that are in the Public CASLIB. You'll notice that that does set Public as my active CASLIB. Now, this one says Local equals No. That means that this is a global library. We could just as easily list the files and tables in Public by changing this, if we were curious. We find that there are no tables available in Public, but there are a couple of offline files out there that we could load into the memory space if we wanted. Now, we can change the CASLIB without having to reset the CAS session by using a CASLIB statement. So we're going to make a new CASLIB called MyPath. This is an ad hoc CASLIB that is local in scope. This is the data folder. If we look at our folders out here in data, there's a folder called MyCAS. We're going to point our CASLIB at that and use it as the offline storage for the CASLIB called MyPath. And when we create a CASLIB with a CASLIB statement, it automatically becomes the active CASLIB for my CAS session. With this SAS function, GETSESSOPT, you can see that the active CASLIB is MyPath. Now what happens when I clear it? It disappears from my CAS session, and you'll notice that the CAS session never leaves you without an active CASLIB. As soon as we cleared the CASLIB MyPath, CAS automatically set CASUSER, our personal CAS library, as the active CASLIB again. And that's our first dive into using CAS libraries. Hi there. It's Mark Jordan again. In lesson two of Programming for SAS Viya, we'll take a look at how to load data into a CASLIB from the offline storage associated with the CASLIB, and also how to load data into the CASLIB from files and data sets available to us in the SAS session. So now that we've learned a little bit about accessing data in CASLIBs, you'll probably want to load some files into the memory space and do some analysis in CAS. Now, before we can do any work with tables in CAS, of course, we have to have the table in memory.
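For reference, a rough sketch of the listing and ad hoc CASLIB statements from that demo. The path shown is only a placeholder for the MyCAS folder on your own server.

   /* List server-side files and in-memory tables for the active caslib */
   proc casutil;
      list files;
      list tables;
   quit;

   /* Add an ad hoc, session-scoped caslib; it becomes the active caslib */
   caslib mypath path="/home/student/data/MyCAS";   /* placeholder path */

   /* Clear it; CAS makes CASUSER the active caslib again */
   caslib mypath clear;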
And as you can see, in this CASLIB there are no in-memory tables loaded. But we have plenty of files on the server that we can use to load tables from. The server-side files reside in that offline storage space that belongs to the CASLIB, and they can take many different forms. In order to do analysis with any of those files, we need to load them into memory. Once the files are in memory, we no longer call them files, we call them tables. And tables are what CAS works with. Just as CASLIBs can have session scope or global scope, your tables can also have session scope or global scope. The default is session scope. What this means is that you'll see Promoted equals No when you look at the contents of the table. That table will be visible only to you, or to whoever it was that created the table in their CAS session. And when they disconnect from CAS, that table is automatically offloaded from memory. What if I need that table in subsequent sessions? What if others in my group need to share the data with me? In that case, I'll need a global table. A table can be promoted to global as you load it, or it can be promoted from an existing session-scope table. You'll know that a table is global in scope because in the contents, it will show Promoted equals Yes. This table will be visible across CAS sessions. You can disconnect and reconnect, and still see the table in memory. It will also be visible to any other users to whom the administrator has granted permission. And it will have all the necessary controls, like row-level locking, in order to allow that table to be used simultaneously by more than one user. And as we said before, these tables are not dropped from memory when you disconnect from CAS, but remain in memory so that next time you connect, there they are. So just as a quick refresher check: if we create a session-scope table, is it only for us, and are we the only ones that can see it? And will it go away when I disconnect from CAS? The answer to that is yes, of course. Session-scope tables are only visible to the user who created them, and they're automatically unloaded from memory as soon as the CAS session disconnects. Global tables, of course, are visible to multiple users and will persist in memory even if you disconnect your CAS session. Now, maybe the data you want to analyze doesn't live in CAS. It's not in the CAS offline storage space or the data sources, but we have access to it in our SAS session. You can think of the SAS session here as the client that's talking to CAS, and so we refer to these types of files as client-side files. We can load these client-side files into CAS tables with the PROC CASUTIL LOAD statement, with DATA= and your SAS data set name. Hi there. In this demonstration, we're going to show you how to set up your autoexec file in SAS Studio so that your CAS session and CAS libraries automatically reconnect themselves every time you start SAS. Then we're going to load a client-side file, a SAS data set, into CAS and investigate the contents of the in-memory table that we've created. In the previous demo, we showed you how to use the path macro variable, LIBNAME statements, and CAS and CASLIB statements to assign librefs, point to local SAS data, and point to CAS tables. We're going to copy this code, and in the options we're going to choose Autoexec File. We're just going to paste that code in there, and then we're going to save it. Now, to prove this works, I'm going to go ahead and log out of SAS Studio and then sign back in.
Now you'll notice that all of my CAS libraries that showed up in the previous demo are also showing up here, as well as the MySAS library that contains the SAS data sets for this class. So let's load a SAS data set into CAS. Now remember, when we do this, it's going to take data from an existing SAS library and load it up into the CAS memory space as Employees. So that means that when we do this, it should show up in the CASUSER library here. The load went well. And now we can actually see the table in CASUSER that has the employee information. It matches the information that was in our original Employees table here in the SAS session. Now, before, when we listed the tables in our CASUSER library, we got nothing from the listing, if you remember. But when we run the list of the tables now, we can see that the Employees table is indeed loaded up in CAS. And you can see the path to the folder associated with that CAS library. Now, you can use PROC CONTENTS on that table too, because CASUSER is also a libref here on the SAS side. So we've looked at it from the CAS side using PROC CASUTIL. Take a quick look at the same information on the SAS side using PROC CONTENTS. You'll see that we can access that data set either way, from the CAS side or from the SAS side. Now, let's compare PROC CASUTIL and PROC CONTENTS. Notice that the information that we get from PROC CASUTIL is fairly extensive, and PROC CONTENTS generates the standard information. Now, we're not going to turn off the CAS session, but we can remove the Employees table from our CAS library memory space with a DROPTABLE statement. You see that it is no longer listed. And just for good measure, we can reload the Employees table into the CAS library and manipulate that table with a standard Base SAS tool, PROC DELETE. Notice that either way we do it, using PROC CASUTIL or standard SAS procedures, we can manipulate, access, and describe data on a CAS server from our SAS session. Now, perhaps you want to load files from the client side that are not SAS data sets. We can do that too. We use the LOAD FILE= version of the LOAD statement. Don't forget that in the file name, you'll need to use a fully qualified path to the file in your local SAS session. And then you're going to be required to specify the name of the table that you're going to create. What about those external files that are stored in the CASLIB data source? Things like Excel spreadsheets and text files. Well, you'll also use a LOAD statement, but this one uses the CASDATA= option, indicating that the file can be found in the CASLIB's offline storage space. If it's not in the active CASLIB, you can specify INCASLIB= to indicate which CASLIB's offline storage has the file you're interested in. And, again, if you want to load that table into the memory space of a different CASLIB, you can specify the OUTCASLIB= option. You can specify a table name that's not the same as the data source name, if you wish. And, of course, you can promote that table on load. By default, the name of the data source will be the same as the name of the table in the in-memory space. But tables are not replaced by default, so if a table exists already, you must specify REPLACE or you'll get an error. In a previous demo, we took a SAS data set from our SAS session and loaded it up as a CAS table. This is referred to as loading a client-side table into CAS.
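Pulling those load options together, here is a minimal PROC CASUTIL sketch. Only the Employees example comes straight from the demo; the file path, CSV file name, and the output table names for the other two loads are placeholders.

   proc casutil;
      /* Client-side SAS data set -> in-memory CAS table */
      load data=mysas.employees casout="employees" outcaslib="casuser" replace;

      /* Client-side external file: use a fully qualified path and name the table */
      load file="/home/student/sgf20s9v/data/customers.csv"    /* placeholder path */
           casout="customers_csv" outcaslib="casuser" replace;

      /* Server-side file already in a caslib's data source, promoted to global scope on load */
      load casdata="sales.xlsx" incaslib="casuser"
           casout="salesxlsx" outcaslib="casuser" promote;
   quit;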
In this demo, we're going to take a look at loading a file from the CAS server's local offline storage up into memory for processing. The file we're going to use is an Excel file. You're going to see how easy it is to load an Excel file as a CAS table. And once we've reassured ourselves that everything's well with that table in our personal CASUSER library, we'll promote that table into the Public space so that others can access the data too. The CASUSER CASLIB offline storage space includes an Excel file named sales.xlsx. We'll go ahead and use PROC CASUTIL to load that file up into the CASUSER memory space as a table called SALESXLSX. Then we're going to take a quick contents of that to see what the data looks like. In our results, we can see that the columns came out pretty much as we'd expect. Notice that all the character columns have come out as VARCHAR. Nice space-saving measure. And this is a local-scoped table that currently resides in our CASUSER library. Having satisfied ourselves that the data looks good, we would like to share this data with the rest of the people on this CAS server. Now, there's a CASLIB out there named Public that contains a bunch of global-scoped tables that remain in memory for others to use. What we're going to do is promote the table from our CASUSER library to the Public CAS library using the same name. Now, when we do this, the actual table is moved, if you will. It will appear in Public and no longer appear in CASUSER. Because I've already run this demo before, that table already exists in the Public library. And one of the attributes of a global table is that you can't replace it. In order to replace the table in the Public library, I'm going to have to drop it first. Now you can see that the table has been replaced with the table that we used to have in our CASUSER CASLIB. And just to verify that everything went as planned, we'll take a list of the tables in both CASUSER and the Public library. From the output, we can see that in the Public CASLIB, the SALESXLSX table exists. But if we scroll back and look at the CASUSER CASLIB, there are no longer any tables listed there. So when we promote a table from one CASLIB to another, the table is moved and then promoted to a global table. Now, in this class, you're using a virtual lab. You'll be logging in as student. This user ID is a data manager who has permission to load data into CAS and set access permissions for others. You may need to identify which files need to be loaded, whether they're client-side or server-side, and determine the CAS library or libraries where the in-memory tables should be stored. Finally, you'll want to take a look at table scope and decide whether the table scope should be local or global. After you have manipulated and processed the data, you may wish to save the modified data off to the offline storage. And if you do that, you may want to determine whether you want to save that data in its original format or in SASHDAT format for rapid parallel loading when needed again. So after processing the tables, we might want to save them, and we'll show you how that's done. Now, there are a lot of tables loaded up in the memory space for this particular CASLIB, and we may want to drop them from memory, or want to know when they are dropped from memory automatically. If the CAS server is restarted, of course, all in-memory tables will be dropped from memory. But you can also use PROC CASUTIL with a DROPTABLE statement to deliberately or explicitly remove tables from the memory space.
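For reference, a rough sketch of the PROC CASUTIL statements behind this workflow: promoting a table to the Public caslib, saving it back to a data source as SASHDAT, and dropping it from memory. The table and caslib names follow the demo; adjust them for your own data.

   proc casutil;
      /* Move and promote the session table from CASUSER to Public (global scope) */
      promote casdata="salesxlsx" incaslib="casuser" outcaslib="public" casout="salesxlsx";

      /* Save the in-memory table to the CASUSER data source as a SASHDAT file */
      save casdata="salesxlsx" incaslib="public"
           casout="salesxls.sashdat" outcaslib="casuser" replace;

      /* Explicitly drop an in-memory table when you're done with it */
      droptable casdata="salesxlsx" incaslib="public";
   quit;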
Before you drop a table, you may want to save it off into the data source area for later reload. An excellent format for storing your data is SASHDAT. This is a format that allows the data to be quickly loaded back into memory in parallel just by reading the header. It's typically much, much faster than transferring data from the client side to the server side. Now, let's have a look at how we might save a table that is in the memory space of a CAS library out to the permanent storage space of the CASLIB in SASHDAT format. To do this, we're going to use a code snippet to help us out. First, let's have a look at the files that already exist in the CASLIBs we've been dealing with, CASUSER and Public. We can see that the CASUSER library has the expected files, but there is no SALESXLSX SASHDAT file. Now, you remember that the SASHDAT format was nice because it made for very, very fast loading of large data sets. SALES isn't a particularly large data set, but it will do for our demonstration. In the Public memory space, you'll see that the SALESXLSX table exists. So we want to take this table and store it back as a SASHDAT file. I'm going to put my cursor right here in the code window, and I'm going to find a code snippet that will help me save a table to the CASLIB. Now, if I double-click it, it opens up in a separate window. But if I right-click and choose Insert, it will insert the code into my current program. So the table name that we want to save is SALESXLSX, and the CASLIB that it's in is Public. The CASLIB where I want to store it is my CASUSER, and we're going to call it salesxls.sashdat. Now that file should be available in my CASUSER library, and a quick run of PROC CASUTIL should prove our point. And now we see the salesxls.sashdat file is stored in our CASUSER library. And then, rather than loading from an Excel spreadsheet, next time we can load it in parallel using the SASHDAT file. Hi. Mark Jordan again. In lesson three of Programming for SAS Viya, we'll take a look at our traditional SAS code, DATA step and SQL, and see how we might have to modify it to run in SAS Viya. Now, if you're like me, you have extensive experience programming in the DATA step and Base SAS. You may be wondering, have my skills become obsolete? Does everything I know go down the tubes? That's a good question, and the answer is no. You can use much of the same syntax that you're very much used to using in the DATA step and run it in CAS at much faster speeds, because of the parallel processing capabilities we find in CAS. Now, the very thing that makes CAS fast also changes the behavior of programs that we write and run in CAS a little bit. So we'll need to be aware of those changes and figure out how we want to strategize to mitigate the effects that we don't want, while keeping all those high-speed effects that we do. To understand what causes the difference: a DATA step itself is a single-threaded process. It reads the data sequentially, one row at a time, and the instructions that you write in a DATA step are really instructions on what to do to each row of data. So this concept of sequential single-threaded processing shouldn't come as a surprise. But in CAS, we have more than one worker that can be executing that same code at once, and so the data needs to be distributed amongst the workers so that we can maximize our throughput. CAS and Viya take care of all that stuff automatically for us.
And this means that our DATA step will now run in multiple threads instead of in a single thread. Now, the data is partitioned out in blocks, and this means that not every thread is going to receive exactly the same number of rows of data. Another interesting thing is that when you have multiple workers processing the same thing in parallel, some will finish quicker than others, so the order that you get things back in from the processors is not always the same every time you run the process. So we'll look at some examples and see how, in some cases, the parallel processing is completely transparent to us, and other situations where we need to take some action because of the way that threads produce their results. Let's start with a simple Base SAS DATA step program using a single thread, and then we'll run the same thing straight up in CAS in Viya and check it out. The DATA step is used to manage or manipulate tables in preparation for further analysis, generally with some type of analytical procedure in SAS. We can modify the values that already exist in columns. We can compute new columns, and we can conditionally process and produce extra rows or only the rows that we desire. And we can combine tables in a DATA step. Here's a relatively simple DATA step. It reads in a data set called MYCUSTOMERS, and it writes out a data set called DEPARTMENTS. All we're doing is checking the value of CONTINENT to set the appropriate value for DEPARTMENT. If both the output table and the input table in a DATA step are CAS tables, then this DATA step will run in CAS automatically, without any further need for me to do anything to my code. Now, let's take a look at how to take an existing DATA step program and modify it so that it can run in CAS. The big thing that we're going to have to remember is that the DATA step has to read from a CAS in-memory table before it can run in CAS. And ideally, any data that it writes would also be written out to a CAS table. So first, let's take a look at the DATA step program and see how it works. The program creates an output data set called WORK.DEPARTMENTS by reading in the customer data set from our SAS library. The SELECT group sets the value of DEPARTMENT depending on the CONTINENT in the particular record. The output data set will only contain the variables CITY, CONTINENT, and DEPARTMENT. And writing the values of _THREADID_ and _N_ to the log will let us know how many rows were processed by each thread. Let's run the traditional DATA step just to see how it works. You can see that the _THREADID_ was 1. In other words, there's only one thread associated with a traditional DATA step. And we've processed almost 200,000 records. Now we'd like to run this program in CAS. The first thing we're going to have to do is make sure that the data set that we want to read is in CAS. You can't run it in CAS unless the data you're reading is in CAS. So I'm going to go ahead and load the customer data set up into my CASUSER library, and I'm going to promote it to a global table. This means that every time I connect to CAS, my customer table will already be ready for me to use in my personal CAS library. And that seems to have run quite well. So the next thing that I need to do is change my DATA step to read from the CAS library. Now, we have assigned a libref to the CASUSER library called CASUSER, and my table should be available there as MYCUSTOMERS. So I'm going to change this to say CASUSER, and I'm going to change this to say MYCUSTOMERS.
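For reference, the fully converted version of this step ends up looking roughly like the sketch below. The DEPARTMENT assignments and the end-of-data logging are illustrative; the course program uses its own mapping of continents to departments.

   /* Both tables live in CAS, so this DATA step runs there in multiple threads */
   data casuser.departments;
      set casuser.mycustomers end=done;
      keep city continent department;
      length department $20;
      select (continent);
         when ('North America') department='NA Sales';   /* illustrative values */
         when ('Europe')        department='EU Sales';
         otherwise              department='International';
      end;
      /* each thread reports how many rows it processed */
      if done then put _threadid_= _n_=;
   run;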
We also said that in order to get this to run in CAS all the way, it would be good if we wrote the result out to a CAS library. So I'm going to write this out to CASUSER. And that should be all the modifications I need to get this thing to run in CAS. Wow. A lot of processing going on there. You can see that because we were reading from a CAS table and writing to a CAS table, the entire process ran in CAS. It's noted in the log. And there were 16 threads involved in processing this data. But what about the output? Has anything changed? Well, in order to compare them side by side, I've made a copy of the original DATA step program over here, and I've executed it. We can see that the output data contains Leinster in its first row of data. And if I rerun this code, you'll note that the output data always comes out in a consistent order. Leinster is always the first city listed. When I ran the identical code in CAS, the first city was Hanover. If I rerun that code, the order of the rows coming out of the DATA step in CAS is different each time, due to the fact that rows are returned as soon as a thread has finished processing them. And our log confirms that we've got 16 threads working on this problem. In the original DATA step, we had only one thread working on the problem. This is just an artifact of multi-threaded processing. All right, so we've seen that in some cases the DATA step running sequentially in a single-threaded Base SAS environment and running in the SAS Viya environment produce pretty much the same results. Maybe in a different order, but the same results. Let's talk about conditions that may cause the actual calculated results to be different. Let's take a look at an example that uses a SUM statement to create an accumulator variable. When we use a SUM statement in a DATA step, the expression on the right-hand side is added to the accumulator variable on the left. And the value of that variable is retained automatically, so it doesn't get set back to missing every time the DATA step iterates. Now here, we're using the SUM statement conditionally, so the expression is only added to the accumulator variable if the condition is true. This DATA step creates the accumulator variable NAcustomers. It counts the number of customers from North America. Both the input and output data sets are in SAS, so this code is going to process sequentially in a single thread on the SAS Compute Server. When we're all finished, only one row will be output, containing the total value for the number of North American customers. If we change this so that the tables we're reading from and writing to reside in CAS, then this will run in CAS, and so it will also execute in multiple threads. You can think of many copies of this program, each running independently in multiple threads on their own blocks of the data. And when you look at the result, you can see that each copy of the DATA step ran on its own thread, and when the thread had completed processing its data, it wrote a record out. So rather than having one record with the total for the entire MYCUSTOMERS data set, we have 16 different rows of data output, one for each thread that was doing the processing. So this is an artifact of threaded processing in general when SAS programs are processed in CAS. The data and the program are distributed across all of the nodes in the system, and after each node finishes its processing, it returns the results to the system for writing out to the output table.
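A minimal sketch of that accumulator step as it would look running in CAS. Only NAcustomers and MYCUSTOMERS come from the demo; the output table name and the literal continent value are assumptions.

   /* Conditional SUM statement: count North American customers.
      Because input and output are CAS tables, this runs in multiple threads,
      and each thread writes out its own "final" row. */
   data casuser.nacount;                     /* output table name is illustrative */
      set casuser.mycustomers end=done;
      if continent = 'North America' then NAcustomers + 1;
      if done then output;                   /* one row per thread, not one total */
      keep NAcustomers;
   run;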
Now, when we divide and conquer the work like this, the process is much, much faster. But in this particular case, it's fast, and yet we don't have the final result that we were really looking for. So what do we do? Well, if you think about it, even if this was a giant table, you now have only 16 rows of pre-summarized data that you can further summarize to get the answer. So it might be as simple as writing another DATA step and summarizing those results. But how can we make sure that DATA step doesn't run in multiple threads and exacerbate the problem? We want a way of making the code run single-threaded. With the SINGLE=YES option, we can force the processing to occur in a single thread, even in CAS. Now, there are still some DATA step statements that cannot run in CAS. If you're connecting to CAS from a SAS 9.4 installation, then those DATA step programs will just run on SAS 9.4. If you're operating in a pure SAS Viya environment, SAS Viya includes the SAS Compute Server, and the SAS Compute Server can run your code just as if it were running in SAS 9.4. Things that run on the Compute Server run single-threaded. So even though your DATA step, in this case, is running on Viya, it's not running in CAS, it's running on the Compute Server, and it will run in a single thread. Previously, we've seen how running a DATA step in multiple threads in CAS affects the row order of the output. Now we're going to turn our attention to a different problem. When you're summarizing an entire data set with a SUM statement and you run that process in 16 threads, you get 16 individual answers for subgroups, not the one single answer for the total data set as you had anticipated. We'll see this in action, and then we'll see a couple of ways we can address this problem. Now, if this DATA step were running on the SAS Compute Server, it would all run in a single thread, and as a result, we would only get one output row when the entire data set had been processed. But if this DATA step is running in CAS, it will run in multiple threads, and each thread will detect when it runs out of data and produce an individual row of output. If we look at the output data, we can see that rather than one row with a sum for the entire data set, we have a series of 16 rows, one produced by each thread as the code was executing in CAS. So what can we do to resolve this problem? Well, when you're running a DATA step in CAS, you have the ability to add an option to the DATA statement that specifies you require the process to run in a single thread. Now, if we did this, that would also have produced a single output row, but there would have been an impact on performance. So let's take a quick look at the total cost of running both of these DATA steps and compare it to what it would cost to run the entire process in a single thread all at once. In the two-DATA-step process, we used 0.03 seconds and another 0.01 seconds, for a total of 0.04 seconds to do the processing. By running the entire process in a single single-threaded DATA step, we did get the answer without writing any additional code, but it required 0.09 seconds to run. That's more than twice as much time. And this is not a very large data set. So you can see that there are benefits to running distributed in CAS. And even if it does require a second DATA step, frequently the overall savings are well worth the effort. For a quick comparison, we'll look at the two-step process, which produces a single row of output, and the single-threaded version of the same process.
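One way to finish the job is a second, deliberately single-threaded step that rolls up the per-thread subtotals, roughly like the sketch below. The table names are illustrative, and the SINGLE=YES option is shown here as a slash option on the DATA statement, which is an assumption about where the option goes; check the DATA step documentation for your release.

   /* Roll the 16 per-thread subtotals up into one grand total.
      single=yes (placement assumed) keeps this small step on a single thread, even in CAS. */
   data casuser.nacount_total / single=yes;
      set casuser.nacount end=done;
      total + NAcustomers;
      if done then output;
      keep total;
   run;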
And notice that the answer returned is the same, but the total time required to get the answer with the two-DATA-step process was much less. You can use a DATA step to process data in groups, or to merge two data sets together based on the contents of one or more variables. But the data has to be sorted first. If the data is sorted, you can do FIRST. and LAST. processing to identify the first and last observation in each group. This can allow you to do grouped aggregations. Sorting is a pretty resource-intensive thing to do, and especially as your data gets large, this can bog down the process. In any case, after the sorting is completed, the data is processed one row at a time, single-threaded, from the data sets in question. Now, in CAS, the default behavior is to load your data in a distributed fashion across the nodes, and this has no regard for the values in any of the rows themselves. The DATA step is executed in different threads, so if you're thinking about BY groups, there could be members of those groups on each of those nodes. This is the default processing if there's no BY statement in your DATA step. In CAS, if you do have a BY statement in the DATA step, there is no need to presort the data. BY in CAS will ensure that all of the rows associated with a particular group wind up on a single node, so the group processing that you do will be accurate. This still allows the data to be distributed in groups across the multiple nodes, so parallel processing can happen, speeding up your throughput. It is no longer necessary for you to think about sorting your data before doing things like BY-group processing with FIRST. and LAST., or merges. If you apply a format to a variable in the BY group, the ordering is in accordance with the formatted value of that BY-group variable. Now, the DATA step with the BY statement can execute in each thread with the assurance that all of the values for a particular group will be on an individual node. And as each thread finishes processing, it will return its result. In this case, the yellow DATA step node had the fewest rows to process, so it returned its results first, and then thread 4, and then 3, and then thread 2, and so on. So the order that the values are returned in may be different each time, but the BY-group processing values are completely accurate. Previously, we've seen the effects on a DATA step that was doing an accumulation for the entire data set. Let's take a look at a similar process, but where we're accumulating in BY groups. Now, before we can use BY-group processing in the traditional DATA step program, we're going to need to sort the data. Because we want to do processing by continent, first we'll sort the customer data by continent. Then we'll read the sorted data in with a BY statement in the DATA step program and do FIRST. and LAST. processing. This basically will allow us to make a single output row with the summarized values for each unique value of CONTINENT in our original data. In order to be able to take a quick look at this data when it's done, I'm going to add a quick PROC PRINT step. Run our process and see what the output looks like. As you can see here, we've got one row of output for each unique value in the CONTINENT column. And because the data was nicely sorted and the process runs single-threaded, the observations came out in alphabetical order by continent. What would be different if we ran this program in CAS? Well, if you remember, in CAS we're not going to need to presort the data.
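A rough sketch of the traditional single-threaded version just described. The accumulator variable and data set names are illustrative; mysas.customers stands in for the course's customer data set.

   /* Traditional SAS 9 approach: sort first, then use BY with FIRST./LAST. */
   proc sort data=mysas.customers out=work.customers_sorted;
      by continent;
   run;

   data work.continent_counts;
      set work.customers_sorted;
      by continent;
      if first.continent then custcount = 0;   /* reset at the start of each group */
      custcount + 1;                           /* accumulate within the group */
      if last.continent then output;           /* one summarized row per continent */
      keep continent custcount;
   run;

   proc print data=work.continent_counts;
   run;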
So I won't need the PROC SORT step. I'm going to copy the DATA step out, and I'm going to change it so that it runs in CAS. Now, you remember we already have a copy of the customers table up in my CASUSER library. I'll just modify the code to read from that table. And then, because I want the whole process to run in CAS, I'm going to write the results out to a CAS table too. You might remember us saying earlier that when we do BY-group processing in CAS, we don't have to presort the data. The CAS engine will take care of that for us. All I have to do is run the program. And you'll notice that we got the same five rows of output, but they're not necessarily in the same order. We can switch back and forth and see that the values are the same, but the order of the rows is not. And once again, this is a function of parallel processing. Hi there. Mark Jordan again. In the fourth and final lesson of Programming for SAS Viya, we'll take a peek at some of our favorite Base SAS procedures, like PROC TRANSPOSE, PROC REPORT, and PROC MEANS, and see how they have been enabled to do much of their computation in CAS. And then we'll take a look at a new procedure, PROC MDSUMMARY, that is built specifically for creating summary statistics in the CAS environment. Beginning with SAS 9.4M5, many Base SAS procedures were enhanced to process data in CAS. The goal is to have CAS do the heavy lifting, summarizing and analyzing the larger data, and then download to the SAS Compute Server only the much smaller summarized data for post-processing. Post-processing might include things like computing additional statistics or displaying the results via the Output Delivery System. The documentation for each of the procedures identifies which procedure functions are supported in CAS. If your program using one of these procedures includes a statement that is not supported in CAS, then the data for that in-memory table must be transferred down to the SAS Compute Server, and the procedure will run on the Compute Server using the transferred data. A lot of the procedures that do data management things can run in CAS: procedures that read table metadata, like PROC CONTENTS and PROC DATASETS, procedures like PROC COPY, which makes copies of the data, and PROC DELETE, which can delete tables. But the key to getting them to run in CAS is to make sure that you're working with CAS tables through a CAS engine libref. So you can see here that for PROC CONTENTS, the DATA= option should have a CAS libref for the CONTENTS procedure to work on CAS. For PROC COPY, it's the IN= option, and so on. Many procedures that manipulate data, like PROC APPEND, DS2, FEDSQL, and PROC TRANSPOSE, are also CAS enabled, and they can take advantage of that massively parallel multi-threaded environment. Now, PROC APPEND allows you to append one table to another. The BASE= table gets the data in the DATA= table appended to it. But because the CAS LIBNAME engine does not support updating a table in place, you can't really append to an existing table in CAS. You can use PROC APPEND to create a whole new table if you specify the name of the BASE= CAS library table as a table that doesn't exist yet. Then the DATA= table is copied up to CAS with the new name assigned to it. As SAS programmers, we're used to using PROC APPEND to append more data to an existing data table. We're going to see that in CAS, appending to a table is not permitted. But we can use a DATA step with an APPEND=YES data set option to produce a very similar effect.
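A minimal sketch of that DATA step technique, using the table names from the CARS demo that follows:

   /* PROC APPEND can't update a CAS table in place, but a DATA step with the
      APPEND= data set option adds the new rows to the existing in-memory table */
   data casuser.cars1(append=yes);
      set casuser.cars2;
   run;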
First, we'll create a couple of smaller subsets to work with. Now, this DATA step program breaks up information from the CARS data set into output data sets CARS1 and CARS2, one containing only Honda and the other only Acura cars. We can see that one has 17 observations and the other has seven. Now, in order to append the data from the second table to the first in a traditional Base SAS environment, we would just run PROC APPEND. And when we were done, the seven observations would have been added to the existing observations in the first data set, for a total of 24 observations in the resulting data. Now, let's load those data sets up into CAS, and we'll try the same process, using PROC APPEND to append the data in the second table to the first. We quickly see that update access is not supported for tables in memory in CAS, and as a result, PROC APPEND fails. Now, one thing you can do with PROC APPEND is make a copy pretty easily. If you append an existing table to a table that doesn't exist yet, then PROC APPEND merely makes a copy of the original table for you. Let's go ahead and take a look at that. I'm going to use PROC APPEND to copy a data set up into CAS, and then compare it to the original table. Because there was not yet an existing table, PROC APPEND was successful in creating a new table named CARS. So how are we going to append the CARS2 data set to CARS1? Using a traditional DATA step, writing to the CASUSER CARS1 table, we're going to SET CARS2, but with the APPEND=YES data set option applied. The end result is that the final CAS table has the 24 observations in it that we wished for, and we have achieved the exact same results we would have been able to achieve with PROC APPEND. Now, I love SQL. But our old friend PROC SQL has not been enabled to run in CAS. What this means is that the SQL procedure can read CAS tables through a CAS engine libref, but the processing won't happen in CAS. Instead, that CAS table has to be brought down to the Compute Server and processed there. And because CAS tables can be pretty large, that's likely to have a pretty dismal effect on your performance. So what's an SQL lover to do? Well, you can execute SQL queries in CAS, but you must use FEDSQL instead. PROC FEDSQL is a SAS proprietary implementation written to the ANSI SQL:1999 (SQL3) standard. In SAS Viya 3.3 and above, besides the traditional SAS double-precision floating point and fixed-width character data, FEDSQL can also support data types like int64, int32, and varchar, so it has some advantages. Some of the other differences: PROC FEDSQL is much more ANSI compliant than PROC SQL was. So things that are very SAS-oriented, like the mnemonics we use instead of operator symbols, are SASsy in nature and they don't work in PROC FEDSQL. The CALCULATED keyword and the remerging of calculated statistics are not ANSI standard either, and so they don't work in PROC FEDSQL. Other than its lack of support for non-ANSI extensions, you'll find that FEDSQL has a really similar syntax to PROC SQL. However, not all the statements and functionality available in FEDSQL can run in CAS. The CREATE TABLE statement that creates a table from the results of a query expression is supported, and the DROP TABLE statement is available, along with the SELECT statement. So most of the things you use most often will execute just fine in CAS. Some of the things that don't work in CAS include set operators like UNION, and correlated subqueries, which are terrible for performance anyway.
A CREATE TABLE statement with an ORDER BY clause doesn't work in CAS either. Dictionary queries don't work on the CAS server, but you can query the dictionary using FEDSQL through a CAS engine LIBNAME statement. And the VIEW statement doesn't work in CAS to make views. Now, if you want to tell whether your PROC FEDSQL code is actually running in CAS, you can turn on the MSGLEVEL=I system option. The default value, N, prints only notes, warnings, and errors. MSGLEVEL=I adds additional informational messages, which include information on whether your code ran in CAS or not. We've learned that PROC SQL code runs single-threaded and will not run in CAS. We've also heard about this new PROC FEDSQL, and that FEDSQL code is more ANSI compliant and does run fully distributed in the CAS environment. So let's see what it takes to convert a simple PROC SQL query to run in CAS using PROC FEDSQL. Here's a simple PROC SQL query that takes distinct combinations of customer name and city and produces a report ordered by the city value. We're only going to include those rows where the continent is Africa. Let's have a look at the original program and how it runs. Now, we can see that the values do indeed appear to be unique. So how can we get this process to run in CAS? We would just take the query and convert it to PROC FEDSQL. We'll change the SQL to FEDSQL. Of course, I'm going to add the SESSREF=MYSESSION option to point that SQL out to the CAS session in which we wish to process. And then I'm going to run the query just as it is. I'm using the MSGLEVEL=I option to ask SAS to tell me whether the PROC FEDSQL ran successfully in CAS or not. Oh, no. When I ran my query directly as FEDSQL, I ran into a problem. And the problem is that the EQ mnemonic operator is not recognized by FEDSQL. Remember, we said that FEDSQL is very ANSI compliant, and the EQ, LT, GT type mnemonics are not ANSI at all. They are specifically valid only in SAS. But let's go back and change the FEDSQL code just a little bit to use the equal sign instead of the mnemonic operator. That really should resolve our issue. And here we can see that the process did indeed run in CAS, and that the results produced are identical to those produced by PROC SQL. So keeping your SQL code as ANSI compliant as possible will make conversions to FEDSQL much simpler in the future; there's a side-by-side sketch of this conversion below. Now, several other Base SAS procedures have been enabled to process in CAS. One of these is PROC TRANSPOSE. Running PROC TRANSPOSE in CAS has several advantages over processing on the SAS Compute platform: it reduces network traffic and it processes much faster. If your input table and output tables are on the CAS server, then usually the transpose is performed entirely in CAS. Two other procedures that are CAS enabled are PROC MEANS and PROC SUMMARY. When the input table is a CAS table, a lot of PROC MEANS processing can be performed right on the CAS server. If the results are directed back to CAS using an OUTPUT statement, sometimes there are some intermediate steps that need to be calculated on the SAS Compute Server, so some of this aggregate data may be passed back to SAS for processing before being written out to the CAS table. On the plus side, if you are using a BY statement in your PROC MEANS or PROC SUMMARY, you won't have to sort the data before you use the BY statement any longer. There are a few things that you should avoid when processing in CAS. The OBS=, FIRSTOBS=, and RENAME= data set options are not supported in CAS, and they will prevent processing there.
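Here is the side-by-side sketch of that PROC SQL to PROC FEDSQL conversion mentioned above. The column names and table references approximate the course data and are assumptions; only the session name and the Africa filter come from the demo.

   /* Original PROC SQL query: runs single-threaded on the Compute Server */
   proc sql;
      select distinct name, city
         from mysas.customers
         where continent eq 'Africa'      /* the EQ mnemonic is fine in PROC SQL */
         order by city;
   quit;

   /* FEDSQL version: SESSREF= sends it to the CAS session; note the ANSI = operator */
   options msglevel=i;                    /* log whether the query ran in CAS */
   proc fedsql sessref=mysession;
      select distinct name, city
         from casuser.mycustomers
         where continent = 'Africa'
         order by city;
   quit;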
Now, several other Base SAS procedures have been enabled to process in CAS. One of these is PROC TRANSPOSE. Running PROC TRANSPOSE in CAS has several advantages over processing on the SAS Compute platform: it reduces network traffic, and it processes much faster. If your input and output tables are both in the CAS server, then usually the transpose is performed entirely in CAS. Two other procedures that are CAS enabled are PROC MEANS and PROC SUMMARY. When the input table is a CAS table, a lot of the PROC MEANS processing can be performed right on the CAS server. If the results are directed back to CAS using the OUTPUT statement, sometimes there are intermediate steps that need to be calculated on the SAS Compute Server, so some of the aggregate data may be passed back to SAS for processing before being written out to the CAS table. On the plus side, if you are using a BY statement in your PROC MEANS or PROC SUMMARY, you no longer have to sort the data first. There are a few things you should avoid when processing in CAS: the OBS=, FIRSTOBS=, and RENAME= data set options are not supported in CAS, and they will prevent processing there. Not all statements for PROC MEANS and PROC SUMMARY are supported in CAS, either. If you use only the supported statements, then processing should happen completely in CAS. If a statement is not supported in CAS, the data has to be pulled to the SAS Compute Server in order for processing to occur; you will get the correct results, but performance may suffer. In addition to some unsupported statements, there are some unsupported statistics. PROC MEANS and PROC SUMMARY provide a number of statistics that are not supported in CAS, which means that if you use those statistics in your code, processing of the intermediate aggregates must be performed on the SAS Compute Server to get the answers you need. If you want all of the processing performed in CAS, you can invoke one of the CAS actions directly using PROC CAS or one of the many other CAS clients and languages available to you.

Another way of getting summary statistics with all of the processing in CAS is to use PROC MDSUMMARY instead of PROC MEANS or PROC SUMMARY. The MDSUMMARY procedure is designed specifically to work on CAS tables. It can compute a wide variety of basic descriptive statistics for variables across all the observations or within groups. The MDSUMMARY procedure can only read CAS tables, and it doesn't display output; it produces an output table, which you can use for further processing as you would any other table produced by a SAS procedure. Now let's take a look at how to use the MDSUMMARY procedure to produce summary statistics for tables in memory in CAS. At first glance, the syntax for PROC MDSUMMARY is so close to PROC SUMMARY that we're tempted to just change the word SUMMARY to MDSUMMARY in our code. But if we do that, we'll quickly find that MDSUMMARY requires OUT=; it does not produce a report, only an output data set. So we'll add an OUT= specification to write the results to CASUSER.PRICECOSTS, and then we'll run a quick little FedSQL query to show us the results. You'll remember that we need SESSREF= to get FedSQL to run on the CAS server. A quick look at the results shows that PROC MDSUMMARY has produced a wide variety of statistics for retail price and cost.

Starting in SAS 9.4M5, both PROC REPORT and PROC TABULATE summarization can be executed on the CAS server if your input data originates in CAS. Like PROC MEANS and PROC SUMMARY, PROC REPORT and PROC TABULATE both have a long list of statistics they can analyze to produce a report, and not all of those statistics are supported in CAS. If you stick to the supported statistics, the analysis can be performed in CAS; if you choose other statistics, the data may have to be pulled onto the SAS Compute platform in order to complete your report. If your PROC REPORT program contains variables with a usage of DISPLAY or ORDER, the data will probably have to be pulled back to the SAS Compute platform to finish creating your output. So if you intend to run your PROC REPORT code in CAS, you should avoid using DISPLAY or ORDER variables.
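To make that concrete, here's a small sketch using the CARS table we copied into the Casuser caslib earlier. The grouping variable, analysis variables, and output table name are just examples, and the GROUPBY / OUT= form shown for PROC MDSUMMARY is my reading of the documented syntax rather than the exact code from the course demo.

/* CAS-enabled PROC MEANS: with a CAS input table, the BY statement
   no longer requires the data to be sorted first.                  */
proc means data=casuser.cars mean min max;
   by origin;
   var msrp invoice;
run;

/* PROC MDSUMMARY keeps all of the processing in CAS. It reads only CAS
   tables and writes its statistics to an output CAS table instead of
   producing a report.                                                  */
proc mdsummary data=casuser.cars;
   var msrp invoice;
   groupby origin / out=casuser.pricecosts;
run;

/* MDSUMMARY displays nothing, so query the output table to see the results */
proc fedsql sessref=mysession;
   select * from casuser.pricecosts;
quit;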
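And here's a minimal sketch of a PROC REPORT step that should be able to keep its summarization in CAS, because it sticks to GROUP and ANALYSIS usages with a basic statistic; the columns chosen are just an illustration, not the course example.

proc report data=casuser.cars;
   column origin msrp invoice;
   define origin  / group;          /* GROUP usage; DISPLAY or ORDER would
                                       pull the data back to the Compute Server */
   define msrp    / analysis mean 'Mean MSRP';
   define invoice / analysis mean 'Mean Invoice';
run;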
Well, I've covered all the lectures, and now all that's left is for you to complete the practices. While you're doing the practices, I'd like you to keep a couple of things in mind. Number one, you should do the practices in order; sometimes the output of one practice is required as input for the next exercise. In addition, there are a couple of typos in the PDF course notes for this course, particularly in the exercises. Practice 3.1 tells you to open the starter program sv03e01.sas, but it should be sv03p -- as in papa -- 01.sas. The same issue appears in the second exercise: the program name sv03e02 should be sv03p02. So just make sure that you open the right program file; all of the programs on the computer image itself are fine. I hope you enjoy these exercises and that they help set in your mind all the things you've learned during the last few hours about programming in SAS Viya.
Info
Channel: SAS Users
Views: 2,433
Rating: 5 out of 5
Keywords: viya data step, viya programming, fedsql, running in cas, caslibs, cas-enabled, proc mdsummary, modify base sas programs, sas cloud analytic services, sas cas, what is sas viya, programming interfaces compatible with sas viya, snippets in cas, caslibs to access data, sasuser library, cas-enabled base sas procedures, sas tutorial, sas, sas global forum, sas global forum 2020, sas technology, virtual sas global forum, virtual sas global forum 2020
Id: YFp9HAqvC_A
Length: 76min 21sec (4581 seconds)
Published: Tue Apr 14 2020