Hi, my name is Dominique. I'm a technical trainer with
SAS, I'm here to answer one of the questions that you
ask most often. Combining data in SAS: DATA Step
versus SQL, which do I use? Lucky for you, the answer
is very simple: It depends. But truly, I'm here to help
you out more than that. We're going to talk specifically
about joins and merges. If you're looking to be an
efficient, lazy programmer like me, sometimes you
might choose SQL. But if you like to have a little
control, you might use the DATA Step instead. I recommend you
know both methods. I'm going to take you
through a couple of programs. If you have any questions at
all, make sure you leave them in the comments below so that
we can get back to you. Let's get to it. First, let's talk about our data.
Let's set the scene here. We're throwing a party. I have the invited data set. I have three columns
with ten rows. We have the name of the
person we want to invite, the method in which we
want to invite them. That could be email, mail, hand
delivery or it could be missing. And the plus column,
that's like, do they get a plus one or do we not trust
them to bring a friend to this party? Then we have
the contacts table. You can think of this
as my phone book. I have my friends' first
names, their last names, their phone number, address, the
city, state and email. And sometimes there are some
missing information in there as well. As you can see,
these datasets are not sorted. This is something we're going
to have to consider when we're doing our manipulation,
particularly with the DATA Step. Once the data is sorted,
it's a little easier to see where we have matching rows. For example, Amy is invited to
this party and Amy is also in my contacts data set. So I have some information on
how I could contact her. It looks like I want
to mail my invitation. So I have her mailing address
in here, whereas Randy is invited, but I don't have any
other contact information for Randy. Randy, if this data set
were complete, if you could see all 30 rows here, you'd
see that Randy isn't in the contacts table. So I'll have
to find some other creative way to give Randy
this invitation. Same thing for Tony. And then Aaron, Alex, Angelica,
Asia, all these people that aren't highlighted on
our right side. They're not invited
to our party. All right, so that's our data. We'll do a quick overview of
what we should know about joins before we get
started with the syntax. There are many different types
of joins, different ways to bring your data
together in SAS. We're specifically going to
look at the Inner Join, Left J oin, Full Join,
and Right Join. The Inner Join would be us
looking at only those people who are invited and we also
have their contact information. So based on our matching
criteria, which is that first name, who is in both tables? The Left Join would be to
say, well, it's really only important that these
people are invited. If I also have their contact
information, that's great. But really, I just want to make
sure that they are in that left table. A Full Join is I want to
bring all of my data together, even if I'll have some
missing information for their contact information or if they're
not invited to this party. I want to have a data
set that has it all, so I'll do a Full Join there. And then a Right Join includes
all those rows that are in our right table or for us,
I guess that's our contacts data set and then
also the intersection. You can accomplish all of these
join types with the DATA Step or PROC SQQL. It's just going to be
a matter of syntax. There's a little bit of a
difference when it comes to table relationships. There are different
ways that our data can be related to each other. If our common column, our columns B
and C, I can see that a One-to-One match would be there's
a unique value and B that matches to one and only one value
in C, and that is pretty consistent in this dataset. For One-to-Many, I have a unique
value in our first dataset that matches to multiple values
in our second dataset. Nonmatches means that I have some
values that don't have a match in the opposite table. And you know what? That's
OK, because they're complete on their own. They don't need
a match to be successful. Sorry, pull it
together, all right. Many-to-Many matching, this one
gets a little bit more complex. So we have multiple
values or repeated values in our first table that match to
repeated values in the second table. And when it comes
to these different relationships, they can truly be accomplished
using either the DATA Step or PROC SQL. However, when it comes to
Many-to-Many matching, I will 100% fully endorse using PROC SQL,
that SQL procedure in SAS . You could try to use the
DATA Step, but it's not going to look pretty. If you're looking
to get a Cartesian product out of these values, use SQL. Now let's look at
the actual programs. We'll start off with the most
basic syntax and then add a few things as we move forward. With the most basic syntax for
PROC SQL on your from clause, you specify that you want to
do either a Full Join, Right Join, Left Join, or Inner
Join between two data sets. And then you must include the
on clause where you say that the name matches
the first name. So the name column comes from
the invited data set and the first name column comes from
the contacts data set. I'll give this a run and
we'll just check out these results. What we just did is a
Full oin in SQL, meaning I'm going to return all of the columns
and all of the rows from both tables, but we'll match the
rows that have a matching name and first name. If you'd like to know more
about this syntax, we actually have a tutorial for that. So check it out. Let's
check out the DATA Step. With the DATA Step merge,
before you can join data together, you have to make sure
that the input data sets are sorted by the column that
you want to match on. That wasn't required for PROC
SQL because that all happens behind the scenes with SQL. Lower with my DATA Step, I'm
creating a data set called ds merge, which I'll
later PROC Print. For the basic merge syntax,
there's a couple of things we want to be aware of. We'll
have a merge statement as opposed to a set statement,
which I know most of us are used to, and then we list off
the data sets we want to include. For this merge, I had to
include a rename equals data set option to be sure that the columns
I want to match on have the same name in both tables. Because my matching criteria is
defined on the by statement, I need to match by first name. In our first table, this column
had a length of seven bytes . In our second table, the column
had a length of nine bytes . To make sure that my data
is not truncated, I have to include this length statement. And then when everything is said
and done, I will print the results. And this is almost
exactly the same as our SQL results, except for there's only
one first name column. So if I go back to my code and
just run all of it so we can see them side by side with
the SQL Full Join results, notice that we have the Name
column and then we have First_Name and sometimes the
Name is missing and then sometimes First_Name
is missing. But with the DATA Step, there's
only one single column and it's always populated. Let's look at a couple more
things we want to consider with PROC SQL. When it comes to
any sort of data manipulation, a lot of people ask, "Well, which
one's going to be faster? I'm working with a lot of
data." The answer there again is it depends. But the way that
you can see this is options fullstimer and check out your
log in those notes that are produced. In the previous example,
we opted to select all of the columns
from both tables. Instead, I'm going to use
the select clause to select specific columns and
their order. Notice that the first column
uses the coalesce function. This is to make sure that
wherever there is a value in either the Name column or the
First_Name column that I have my value in
First_Name populated. This means that no matter what
type of join I use, if I have a name involved, I
will have it populated. Also, note that I can use
other clauses like order by as well. These are the results
for a Left Join. That means all of the people
I've invited and if I have their contact information, that
would be populated as well. Notice that in my invitations
I've included the name Nina twice. I want to invite Nina
by email and also mail. To change this join
type, it's pretty simple. All I have to do is change
the first keyword here so I can easily turn this into
a Right Join. Also, notice that the default
results for SQL is a report. If I wanted to create a
dataset based off these results, I need to use a
create table statement. Now let's look at some things
to consider with the DATA Step merge. With the DATA Step
merge, if I wanted to keep those exact same columns in
that exact same order, I'll need to include a little
bit of syntax for that. The retain statement is often
used for different purposes, but here it works really
well in ordering my columns. So I want to specify those
statements first before I talk about which tables I'm
actually bringing in. With my merge, if I wanted to
do something like a Left Join, a Right Join, an Inner Join,
or anything like that, I need to use the in=
data set options. The in= data set option creates
a temporary variable during execution time that I can use to
say I only want those rows if they're in the first table
or if they're in the second table. The variables that I'm
creating are called inT1 and inT2. And then I use that
in a subsetting if statement to say that if this row is in
the first table, then include it in my results. This is our equivalent of
doing a Left Join. We've seen these results before,
I just wanted you to know that it is possible
with the data step. If I want to change this Left
Join into a Right Join, I'll simply change the variable that
I'm using in my subsetting if statement. Pause. I know I'm showing
you a lot here. If you have any questions,
leave a comment down below. All right, let's get
back to it. Now I want to specifically
look at the DATA Step. What would be the benefits of
including a DATA Step merge in my code? Well, some of the major benefits
are that you have full control over the processing and
you can also create multiple tables. I'll show you what I mean.
This DATA Step is very similar to the previous DATA
Step that we saw. However, we're creating
three data sets. I want one data set for all
those people that we're ready to contact. That means that
we know that they're invited and we also have
their contact information. I also want to create
a table called need_info. I need their information. They're invited to my party,
but I don't have their contact. And I'll create
a third table. And those will be for the people
who are not invited to the party. I'll use if-then-else
logic to specify when to output to each table. I'll also use
those in variables. So I have a couple of PROC
Prints down here to make sure that we print all three
of those data sets. There you have it! In a single
data step, I created three data sets. I have the list
of people who are ready to contact. I have that short list
of people who I need their contact information. Obviously, I don't need all
of these columns for this dataset. I could use the drop=
or keep= data set options to specify that I don't need
those columns in this table. I also have my not invited. So these are all the people
who weren't invited to my party. The data step gives
you full control over processing. You can see that
processing in action with this window called the
DATA Step debugger. If I click on the debug option
in my editor, notice this little green guy pops up on
the side of my screen. Once I click on that green bug,
a window pops open where I can step through this DATA
Step and watch execution happen one statement at a time. I can watch the values of
my columns change as we're creating our rows. If you'd like to know more
about the DATA Step debugger, leave a comment down below. So those are the benefits
of using the DATA Step. Now, let's look at the SQL join
and what it has to offer. SQL makes it really easy
to join multiple tables together, to perform non-equijoins, and
also Many-to-Many matching. Let's start with joining multiple
tables and a non-equi join. I have another
data set called SEATING. For this party, I want
to create a seating chart. Back at my code, my first query
includes those results from my Left Join with
invited and contacts. But I also joined together a
third table, so I included an additional join. I want to join those results
with the party.Seating table on the basis that the last
name in alphabetical order, it is greater than or equal to
whatever is in the Name_Start column, but less than
our Name_End column. This is called a non-equijoin. I'm not saying that Last_Name
needs to be equal to anything. I'm saying that Last_Name
needs to be between the two values. Based on that criteria,
I can see how I've set up my tables. The table number
is based on their last name, where I don't have a last
name for Aaron, Randy or Tony. Now, what about
the Many-to-Many matching? I also have a
dataset called CONTACTS_MANY. Remember how I have two different
Ninas in my invited table ? Well, in CONTACTS_MANY, I
also have two different Ninas. I have Nina Harnet and
I have Nina Wright. The DATA Step would not
handle this very well. But with SQL, all I have to
do is specify both of those tables that have
those Many-to-Many values. For the name Nina, we
actually produced a Cartesian product. So for each mail and
email invite, I matched with Nina Wright And Nina Harnet. In conclusion, things you want
to consider with either method, also understand that you
tell PROC SQL what to do. You don't tell PROC
SQLhow to do it. Behind the scenes, there's an
optimizer that will actually choose how your join
will be processed. So let's look at the
benefits of either method. With the DATA Step merge,
you can create multiple output tables in one step. You can have matches go to
one table and nonmatches go to another. You also get full
control over processing and that means that you can use the
DATA Step debugger if you're using Enterprise Guide. The benefits of PROC SQL is you
can join multiple tables in a single query. You can create a Cartesian product
for many to many joins. You can perform non-equijoins
and also SQL is a very common language among database
management systems. So if you already know SQL,
be sure to check out SAS' implementation with PROC SQL. As as you can see, each
method has its own strengths. I recommend that you know
a little bit of both. It'll probably make your job
programming a lot easier. In case you saw something
that was unfamiliar in this lesson, we're including some
resources in the description box below. Don't forget to
subscribe and turn on notifications so you get alerted
when we come out with more content. Also, like in share
with your colleagues so that we can all be
more efficient, lazier programmers.