Most SAS programmers would agree that they use the SET statement without giving much thought to its syntax, because it is so widely used. We routinely name the expected data sets, possibly add a few options, and away we go. A visit to the documentation is usually saved for more complex topics such as arrays, hash tables, and regular expressions. Still, a review of how the Program Data Vector (PDV) stores and outputs variable values might interest even the most experienced SAS programmer.
Often in Technical Support, we hear from customers who reference more than one data set in the SET statement and find that their results “aren’t correct”. Once we see the code, it’s obvious that SAS is behaving as designed, just not as the customer expected. The following DATA steps and resulting output use missing values to illustrate one common situation where the PDV causes the DATA step to “misbehave”:
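A minimal sketch of this situation, under stated assumptions: the data set names A and B, the variables X, Y, and Z, and all data values are hypothetical, chosen so that A has five observations with Y but no Z, and B has three observations with Z but no Y.

```sas
/* Hypothetical data set A: contains X and Y, but not Z. */
data a;
   input x y @@;
   datalines;
1 10  2 20  3 30  4 40  5 50
;
run;

/* Hypothetical data set B: contains X and Z, but not Y. */
/* Z is missing in the first two observations only.      */
data b;
   input x z @@;
   datalines;
6 .  7 .  8 80
;
run;

/* Concatenate A and B, intending to flag rows where Z is missing. */
data c;
   set a b;
   if z = . then y = 999;   /* Z is also missing for every row read from A */
run;

proc print data=c;
run;
```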
Why the IN= data set option helps, but isn’t the total solution
Because Z doesn’t exist in data set A, Z is missing for every observation read from A. Each of those observations therefore meets the IF condition, and its Y value is set to 999. Most users likely want only the observations from data set B evaluated by the IF statement, so adding the IN= data set option seems like a reasonable fix.
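One way the IN= fix might look, again as a sketch with hypothetical names and values (A, B, X, Y, Z, and the IN= variable INB are all assumptions made for illustration):

```sas
/* Hypothetical inputs: A has X and Y; B has X and Z. */
data a;
   input x y @@;
   datalines;
1 10  2 20  3 30  4 40  5 50
;
run;

data b;
   input x z @@;
   datalines;
6 .  7 .  8 80
;
run;

/* INB is 1 when the current observation was read from B, 0 otherwise. */
/* It is a temporary variable and is not written to the output data.   */
data c;
   set a b(in=inb);
   if inb and z = . then y = 999;   /* only rows from B are tested */
run;

proc print data=c;
run;
```

With this change, the rows from A keep their original Y values, because the IF condition is false whenever INB is 0.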
Don’t forget how the SET statement changes PDV processing
Observations from data set A are no longer affected by the IF statement. But wait, the eighth observation has a Y value of 999 although Z isn’t missing. What happened? The main points to remember when reading data sets with a SET statement are these:
- The SET statement does not reset the values in the program data vector to missing, except for variables whose values are calculated or assigned during the DATA step.
- Variables that are created by the DATA step are set to missing at the beginning of each iteration of the DATA step.
- Variables that are read from a data set are not; they are automatically retained until a new value is read or assigned.
These differences are key to what happens when data sets are combined with a SET statement. The “Combining SAS Data Sets: Methods” section of SAS 9.4 Language Reference: Concepts is a very helpful reference.
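The difference between the two kinds of variables can be observed by writing the PDV contents to the log. The sketch below is illustrative only: the data sets, the variable FLAG, and the values are hypothetical. Y comes from data set A, so it is retained; FLAG is created by an assignment statement, so it is reset at the start of each iteration.

```sas
/* Hypothetical inputs: A has X and Y; B has X and Z. */
data a;
   input x y @@;
   datalines;
1 10  2 20  3 30  4 40  5 50
;
run;

data b;
   input x z @@;
   datalines;
6 .  7 .  8 80
;
run;

data _null_;
   set a b(in=inb);
   /* Show the PDV after the SET statement but before any assignment. */
   put 'Before assignments: ' _n_= x= z= y= flag=;
   if inb and z = . then do;
      y = 999;    /* Y is read from a data set, so its value is retained   */
      flag = 1;   /* FLAG is created in this step, so it is reset each time */
   end;
run;
```

On the rows read from B that follow an assignment, the log line should show Y still holding 999 while FLAG is back to missing, which is the behavior the list above describes.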
How to override the automatic retain in the PDV
To take the code sample one step further: observations six and seven correctly show Y values of 999, but because Z isn’t missing in the eighth observation, Y should be missing there.
You can manually set Y to missing at the top of the DATA step to override the automatic retain. Each iteration of the DATA step then begins by setting Y to missing, after which the SET and IF statements execute and populate the PDV accordingly. Now the resulting data set values are as expected.
Usage Note 48147, “Variables read using SET, MERGE, and UPDATE statements are automatically retained”, is an excellent note to bookmark for whenever you read data sets with these statements.
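A sketch of this override, using the same hypothetical data sets and variable names as assumptions (A, B, X, Y, Z, and INB are not taken from the original listing):

```sas
/* Hypothetical inputs: A has X and Y; B has X and Z. */
data a;
   input x y @@;
   datalines;
1 10  2 20  3 30  4 40  5 50
;
run;

data b;
   input x z @@;
   datalines;
6 .  7 .  8 80
;
run;

data c;
   y = .;                           /* clear the retained value each iteration */
   set a b(in=inb);                 /* rows from A overwrite Y with A's value  */
   if inb and z = . then y = 999;   /* rows from B get 999 only if Z is missing */
run;

proc print data=c;
run;
```

Because Y is now the first variable referenced in the step, it also becomes the first variable in the output data set. If column order matters, a RETAIN or LENGTH statement naming the variables before the assignment is one way to control it.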