SAS Tutorial | Combining Data in SAS: DATA Step Versus SQL

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Hi, my name is Dominique. I'm a technical trainer with SAS, I'm here to answer one of the questions that you ask most often. Combining data in SAS: DATA Step versus SQL, which do I use? Lucky for you, the answer is very simple: It depends. But truly, I'm here to help you out more than that. We're going to talk specifically about joins and merges. If you're looking to be an efficient, lazy programmer like me, sometimes you might choose SQL. But if you like to have a little control, you might use the DATA Step instead. I recommend you know both methods. I'm going to take you through a couple of programs. If you have any questions at all, make sure you leave them in the comments below so that we can get back to you. Let's get to it. First, let's talk about our data. Let's set the scene here. We're throwing a party. I have the invited data set. I have three columns with ten rows. We have the name of the person we want to invite, the method in which we want to invite them. That could be email, mail, hand delivery or it could be missing. And the plus column, that's like, do they get a plus one or do we not trust them to bring a friend to this party? Then we have the contacts table. You can think of this as my phone book. I have my friends' first names, their last names, their phone number, address, the city, state and email. And sometimes there are some missing information in there as well. As you can see, these datasets are not sorted. This is something we're going to have to consider when we're doing our manipulation, particularly with the DATA Step. Once the data is sorted, it's a little easier to see where we have matching rows. For example, Amy is invited to this party and Amy is also in my contacts data set. So I have some information on how I could contact her. It looks like I want to mail my invitation. So I have her mailing address in here, whereas Randy is invited, but I don't have any other contact information for Randy. Randy, if this data set were complete, if you could see all 30 rows here, you'd see that Randy isn't in the contacts table. So I'll have to find some other creative way to give Randy this invitation. Same thing for Tony. And then Aaron, Alex, Angelica, Asia, all these people that aren't highlighted on our right side. They're not invited to our party. All right, so that's our data. We'll do a quick overview of what we should know about joins before we get started with the syntax. There are many different types of joins, different ways to bring your data together in SAS. We're specifically going to look at the Inner Join, Left J oin, Full Join, and Right Join. The Inner Join would be us looking at only those people who are invited and we also have their contact information. So based on our matching criteria, which is that first name, who is in both tables? The Left Join would be to say, well, it's really only important that these people are invited. If I also have their contact information, that's great. But really, I just want to make sure that they are in that left table. A Full Join is I want to bring all of my data together, even if I'll have some missing information for their contact information or if they're not invited to this party. I want to have a data set that has it all, so I'll do a Full Join there. And then a Right Join includes all those rows that are in our right table or for us, I guess that's our contacts data set and then also the intersection. You can accomplish all of these join types with the DATA Step or PROC SQQL. It's just going to be a matter of syntax. There's a little bit of a difference when it comes to table relationships. There are different ways that our data can be related to each other. If our common column, our columns B and C, I can see that a One-to-One match would be there's a unique value and B that matches to one and only one value in C, and that is pretty consistent in this dataset. For One-to-Many, I have a unique value in our first dataset that matches to multiple values in our second dataset. Nonmatches means that I have some values that don't have a match in the opposite table. And you know what? That's OK, because they're complete on their own. They don't need a match to be successful. Sorry, pull it together, all right. Many-to-Many matching, this one gets a little bit more complex. So we have multiple values or repeated values in our first table that match to repeated values in the second table. And when it comes to these different relationships, they can truly be accomplished using either the DATA Step or PROC SQL. However, when it comes to Many-to-Many matching, I will 100% fully endorse using PROC SQL, that SQL procedure in SAS . You could try to use the DATA Step, but it's not going to look pretty. If you're looking to get a Cartesian product out of these values, use SQL. Now let's look at the actual programs. We'll start off with the most basic syntax and then add a few things as we move forward. With the most basic syntax for PROC SQL on your from clause, you specify that you want to do either a Full Join, Right Join, Left Join, or Inner Join between two data sets. And then you must include the on clause where you say that the name matches the first name. So the name column comes from the invited data set and the first name column comes from the contacts data set. I'll give this a run and we'll just check out these results. What we just did is a Full oin in SQL, meaning I'm going to return all of the columns and all of the rows from both tables, but we'll match the rows that have a matching name and first name. If you'd like to know more about this syntax, we actually have a tutorial for that. So check it out. Let's check out the DATA Step. With the DATA Step merge, before you can join data together, you have to make sure that the input data sets are sorted by the column that you want to match on. That wasn't required for PROC SQL because that all happens behind the scenes with SQL. Lower with my DATA Step, I'm creating a data set called ds merge, which I'll later PROC Print. For the basic merge syntax, there's a couple of things we want to be aware of. We'll have a merge statement as opposed to a set statement, which I know most of us are used to, and then we list off the data sets we want to include. For this merge, I had to include a rename equals data set option to be sure that the columns I want to match on have the same name in both tables. Because my matching criteria is defined on the by statement, I need to match by first name. In our first table, this column had a length of seven bytes . In our second table, the column had a length of nine bytes . To make sure that my data is not truncated, I have to include this length statement. And then when everything is said and done, I will print the results. And this is almost exactly the same as our SQL results, except for there's only one first name column. So if I go back to my code and just run all of it so we can see them side by side with the SQL Full Join results, notice that we have the Name column and then we have First_Name and sometimes the Name is missing and then sometimes First_Name is missing. But with the DATA Step, there's only one single column and it's always populated. Let's look at a couple more things we want to consider with PROC SQL. When it comes to any sort of data manipulation, a lot of people ask, "Well, which one's going to be faster? I'm working with a lot of data." The answer there again is it depends. But the way that you can see this is options fullstimer and check out your log in those notes that are produced. In the previous example, we opted to select all of the columns from both tables. Instead, I'm going to use the select clause to select specific columns and their order. Notice that the first column uses the coalesce function. This is to make sure that wherever there is a value in either the Name column or the First_Name column that I have my value in First_Name populated. This means that no matter what type of join I use, if I have a name involved, I will have it populated. Also, note that I can use other clauses like order by as well. These are the results for a Left Join. That means all of the people I've invited and if I have their contact information, that would be populated as well. Notice that in my invitations I've included the name Nina twice. I want to invite Nina by email and also mail. To change this join type, it's pretty simple. All I have to do is change the first keyword here so I can easily turn this into a Right Join. Also, notice that the default results for SQL is a report. If I wanted to create a dataset based off these results, I need to use a create table statement. Now let's look at some things to consider with the DATA Step merge. With the DATA Step merge, if I wanted to keep those exact same columns in that exact same order, I'll need to include a little bit of syntax for that. The retain statement is often used for different purposes, but here it works really well in ordering my columns. So I want to specify those statements first before I talk about which tables I'm actually bringing in. With my merge, if I wanted to do something like a Left Join, a Right Join, an Inner Join, or anything like that, I need to use the in= data set options. The in= data set option creates a temporary variable during execution time that I can use to say I only want those rows if they're in the first table or if they're in the second table. The variables that I'm creating are called inT1 and inT2. And then I use that in a subsetting if statement to say that if this row is in the first table, then include it in my results. This is our equivalent of doing a Left Join. We've seen these results before, I just wanted you to know that it is possible with the data step. If I want to change this Left Join into a Right Join, I'll simply change the variable that I'm using in my subsetting if statement. Pause. I know I'm showing you a lot here. If you have any questions, leave a comment down below. All right, let's get back to it. Now I want to specifically look at the DATA Step. What would be the benefits of including a DATA Step merge in my code? Well, some of the major benefits are that you have full control over the processing and you can also create multiple tables. I'll show you what I mean. This DATA Step is very similar to the previous DATA Step that we saw. However, we're creating three data sets. I want one data set for all those people that we're ready to contact. That means that we know that they're invited and we also have their contact information. I also want to create a table called need_info. I need their information. They're invited to my party, but I don't have their contact. And I'll create a third table. And those will be for the people who are not invited to the party. I'll use if-then-else logic to specify when to output to each table. I'll also use those in variables. So I have a couple of PROC Prints down here to make sure that we print all three of those data sets. There you have it! In a single data step, I created three data sets. I have the list of people who are ready to contact. I have that short list of people who I need their contact information. Obviously, I don't need all of these columns for this dataset. I could use the drop= or keep= data set options to specify that I don't need those columns in this table. I also have my not invited. So these are all the people who weren't invited to my party. The data step gives you full control over processing. You can see that processing in action with this window called the DATA Step debugger. If I click on the debug option in my editor, notice this little green guy pops up on the side of my screen. Once I click on that green bug, a window pops open where I can step through this DATA Step and watch execution happen one statement at a time. I can watch the values of my columns change as we're creating our rows. If you'd like to know more about the DATA Step debugger, leave a comment down below. So those are the benefits of using the DATA Step. Now, let's look at the SQL join and what it has to offer. SQL makes it really easy to join multiple tables together, to perform non-equijoins, and also Many-to-Many matching. Let's start with joining multiple tables and a non-equi join. I have another data set called SEATING. For this party, I want to create a seating chart. Back at my code, my first query includes those results from my Left Join with invited and contacts. But I also joined together a third table, so I included an additional join. I want to join those results with the party.Seating table on the basis that the last name in alphabetical order, it is greater than or equal to whatever is in the Name_Start column, but less than our Name_End column. This is called a non-equijoin. I'm not saying that Last_Name needs to be equal to anything. I'm saying that Last_Name needs to be between the two values. Based on that criteria, I can see how I've set up my tables. The table number is based on their last name, where I don't have a last name for Aaron, Randy or Tony. Now, what about the Many-to-Many matching? I also have a dataset called CONTACTS_MANY. Remember how I have two different Ninas in my invited table ? Well, in CONTACTS_MANY, I also have two different Ninas. I have Nina Harnet and I have Nina Wright. The DATA Step would not handle this very well. But with SQL, all I have to do is specify both of those tables that have those Many-to-Many values. For the name Nina, we actually produced a Cartesian product. So for each mail and email invite, I matched with Nina Wright And Nina Harnet. In conclusion, things you want to consider with either method, also understand that you tell PROC SQL what to do. You don't tell PROC SQLhow to do it. Behind the scenes, there's an optimizer that will actually choose how your join will be processed. So let's look at the benefits of either method. With the DATA Step merge, you can create multiple output tables in one step. You can have matches go to one table and nonmatches go to another. You also get full control over processing and that means that you can use the DATA Step debugger if you're using Enterprise Guide. The benefits of PROC SQL is you can join multiple tables in a single query. You can create a Cartesian product for many to many joins. You can perform non-equijoins and also SQL is a very common language among database management systems. So if you already know SQL, be sure to check out SAS' implementation with PROC SQL. As as you can see, each method has its own strengths. I recommend that you know a little bit of both. It'll probably make your job programming a lot easier. In case you saw something that was unfamiliar in this lesson, we're including some resources in the description box below. Don't forget to subscribe and turn on notifications so you get alerted when we come out with more content. Also, like in share with your colleagues so that we can all be more efficient, lazier programmers.
Info
Channel: SAS Users
Views: 11,291
Rating: 4.9386973 out of 5
Keywords: sas tutorial, combining data in sas, data step, sql, data step versus sql, sas how to tutorial, sql join, data step merge, data step programs, data step debugger, sas, how to combine data in sas, using data step in sas, sas datastep, sas data step, sas sql, proc sql, sql proceedure sas, sql join sas, sas data step merge, sas data step debugger
Id: ksZH_LoXnKk
Channel Id: undefined
Length: 16min 37sec (997 seconds)
Published: Mon Sep 14 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.