Combining Data Sets

This week Christine and I covered chapter 13, “Combining SAS Data Sets.” Most of this material was quite straightforward, and we did surprisingly well on the quiz at the end. But when we came together to discuss the chapter, we realized we both had had issues with one section: the dreaded temporary IN= variables. You would want to create a temporary IN= variable in order to match-merge two data sets to come up with a resulting data set that includes only observations that appear in both data sets (that sounds confusing even writing it).

The concept made very little sense to us when we read the certification guide, so we decided to come up with two data sets on our own that meant something to us. Oftentimes we have trouble understanding examples in programming books because the examples don’t mean anything to us personally. So we thought to ourselves, in what instance would it be important to have a data set that shows only observations that appear in both data sets we were combining?

After some trial and error, we came up with the example of a teacher who had a data set of all the students who had a score on record for two exams and a separate data set that contained all the students who had a score on record for two projects. We decided that this teacher needed a new data set that showed only students who had scores for all exams and all projects, because this was the only way a student would pass a class. Does this seem like a good example? It did to us, and we worked it out on our trusty white board (see photo [figure 13.1]); we of course didn’t erase our work afterward, because we wanted all our colleagues to see what technical work we’ve been doing to prepare for the exam).

But, as usual, we had a couple of questions. And who better to answer these questions than Michele Burlew, superstar author of Combining and Modifying SAS Data Sets: Examples, Second Edition, among other SAS Press titles. Michele, I hope you’ll chime in and help us out.

First, we were confused about what to call the temporary variables. In most of the examples in the certification guide, the temporary variables were the original data set names prefaced by “in.” The two data sets that were combined in the prep guide examples were called “demog” and “visit,” so the temporary variables were called “indemog” and “invisit” (see code example [figure 13.2]). Is this a requirement? Or can the temporary variables be called whatever we choose, as long as they are called out as such in the subsequent “if” statement?

Second, at the end of the page, the prep guide continues: “Thus, you can specify the subsetting IF statement from the previous example in either of the following ways. The first IF statement checks specifically for a value of 1. The second IF statement checks for a value that is neither missing nor 0 (which for IN=variables is always 1).” And then the two examples are written as such:

if indemog=1 and invisit=1;
if indemog and invisit;

Huh? Is the guide saying that either of these two statements will work interchangeably? We read the statement “The second IF statement checks for a value that is neither missing nor 0 (which for IN=variables is always 1” about a hundred times and we still didn’t get it. If they mean the same, in what case would you want to use the =1? Help, please?

5 Comments

Stacey Hamilton on June 25, 2010 4:33 pm

Thanks, Michele! Great example. I wish you could be our weekly study partner!
Prashant on June 24, 2010 5:28 pm

Hi Stacey,
Good to see your post after quite some time. Let me try to explain your queries.Michele's example above is Fantastic in explaining the boolean nature of IF Statement.
Answer to your questions are:
1) Temporary variables can be named in any way you want as long they follow rules for SAS variables.For eg i can name In=A for the first dataset and In=B for the second dataset.
2) Yes .You can Choose either of the IF statements.
Consider the following IF statement :
IF ;
It actually means that we want to restrict data where the IF Statement is TRUE.
IF statement is TRUE when the value of is either not equal to 0 and or not missing.The IF statement would be FALSE when the value of is either 0 or missing.
In our case since indemog is a temp variable when combining datasets ,it has only two possible values either 1 (when the dataset corresponding to IN variable (clinic.demog) contributes records to final dataset) or 0 (when the dataset corresponding to IN variable(clinic.demog) does not contribute records to final dataset).
So When you write :
IF indemog and invisit ;
It means that the IF statement is TRUE when both indemog and invisit have values which are either non zero or non missing.But in our case since we have only two possible values for these temp variables we can also write the above statement as follows:
IF indemog=1 and invisit=1;
Suppose i wanted a final dataset with only observations exclusive to Clinic.demog in the example given in the post above.
I can write the IF statement as
IF indemog=1 and invisit=0;
I can also write the above IF statement as :
IF indemog and invisit=0;
This would give me all records which are in clinic.demog dataset but not in clinic.visit for the by variables.
I guess by now you may got the point.
Michele Burlew on June 18, 2010 3:05 pm

I think Dave’s answers to your two questions are right on! With regard to the variable names, I usually use INsomething because that makes it easy for me to remember what I’m using the variable for. The two ways of writing your IF statements are interchangeable. Use the one that makes most sense to you.
To illustrate Dave’s second point, you could try the following DATA step. It reads in five values: 1, 5, -1, 0, and a missing value. The IF statement tests each value of X like you’re testing your IN= variables in your DATA step. When a value is nonmissing and non-zero, the THEN part of the statement writes “True” in the SAS log. When a value is missing or zero, the ELSE statement writes “False”.
data null;
input x @@;
if x then put "True: " x=;
else put "False: " x=;
datalines;
1 5 -1 0 .
;;;
Here’s the SAS log:
True: x=1
True: x=5
True: x=-1
False: x=0
False: x=.
I think it’s great that you’re trying these techniques out on data that means something to you. I think that’s the smartest way to figure out new concepts. Good luck, Stacey and Christine, with your studying!
Stacey Hamilton on June 18, 2010 10:40 am

Thank you, Dave! That makes total sense. I appreciate you taking the time to respond.
Dave Scocca on June 16, 2010 10:40 pm

Actually, when I do that kind of merge (keeping only obs in both datasets), I find PROC SQL much faster and more straightforward than the datastep.
The answers to your questions:
(1) you can call the in= variable anything. I often use a leading underscore (in=_inDemog) to guarantee I'm not going to conflict with any dataset variables. (When I'm using a match-merge as a lookup tool, I use (in=_exist) on the primary dataset so my IF statement is easy to interpret.)
(2) Yes, either statement works interchangeably. You can even cross-breed them:
if (_inDemog eq 1) and (_inVisit) ;
In an IF statement, a variable name with no comparison operator resolves to FALSE if the value of the variable is zero or missing, and TRUE otherwise. An IN= variable will never be missing--it will be either one or zero, depending on whether the current record came from the associated dataset (one) or not (zero).

Blogs

Blogs

Combining Data Sets

About Author

5 Comments