Stick to the basics. Did you ever get that advice? Two of the papers at MidWest SAS Users Group 2013 used that most fundamental of SAS processing concepts—the Program Data Vector—to show why users might encounter unexpected errors in their DATA step programs.
In The Secret Life of DATA STEP, Swati Agarwal shared a refresher on how a SAS DATA step compiles and executes behind the scenes. She reminded listeners that each SAS DATA step functions as a self-contained mini-program that is compiled and then executed in an implied loop.
Agarwal explains the Program Data Vector this way: it’s a storage place in memory that contains all of the variables encountered by your DATA step. The PDV is where SAS builds the data set, one observation at a time. During processing, the DATA step also generates certain automatic variables that can be used for further processing. She says that when you want to do complex processing, you’ll want “concrete knowledge of what the PDV is holding and the rules SAS observes in manipulating that information.”
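One quick way to see what the PDV holds is to dump it to the log on every iteration. The sketch below is illustrative only and is not taken from Agarwal's paper; the data set name work.demo and the variables id, score and doubled are made up for the example.

```sas
/* Hypothetical example: names and values are invented for illustration. */
data work.demo;
    input id score;
    doubled = score * 2;
    /* PUT _ALL_ writes every variable in the PDV to the log,  */
    /* including the automatic variables _N_ and _ERROR_.      */
    put _all_;
    datalines;
1 10
2 20
;
run;
```

Each pass through the implied loop writes one line to the log, and _N_ goes up by one each time, which makes the observation-by-observation processing easy to follow.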
Agarwal emphasized that if SAS programmers understand three important aspects of DATA step processing (the compile phase, the execute phase and the PDV), they can exercise better control over how data are read and output.
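A small, well-known consequence of the compile phase shows why it matters: the length of a character variable is fixed the first time SAS encounters it during compilation, before any data are read. The example below is a sketch with made-up names (work.compile_demo, code, label) and is not drawn from the paper.

```sas
/* Hypothetical example: names and values are invented for illustration. */
data work.compile_demo;
    input code;
    /* At compile time SAS sees 'Low' first and gives LABEL a length of 3. */
    if code = 1 then label = 'Low';
    /* At execution time 'Medium' no longer fits and is stored as 'Med'.   */
    else label = 'Medium';
    datalines;
1
2
;
run;
```

Adding a LENGTH statement before the first assignment (for example, length label $ 6;) is the usual way to avoid the truncation.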
While the PDV is most commonly associated with reading raw data into a SAS data set, a PDV is also created whenever the DATA step contains a MERGE, SET, MODIFY or UPDATE statement. More importantly, default processing differs in those cases: values read from a SAS data set are not reset to missing at the start of each iteration.
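A common idiom makes this difference visible. In the sketch below, a one-row data set is read only on the first iteration, yet its value stays in the PDV for every later iteration. All names (work.totals, work.sales, work.share, total, amount, share) are hypothetical.

```sas
/* Hypothetical example: names and values are invented for illustration. */
data work.totals;
    total = 60;
run;

data work.sales;
    input amount;
    datalines;
10
20
30
;
run;

data work.share;
    /* Read the single row of TOTALS once, on the first iteration only. */
    if _n_ = 1 then set work.totals;
    set work.sales;
    /* TOTAL came from a SET statement, so it is not reset to missing;  */
    /* it is still 60 on the second and third iterations.               */
    share = amount / total;
run;
```

If total had instead been created by an assignment statement inside an IF _N_ = 1 block, it would be missing on every iteration after the first unless a RETAIN statement were added.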
In Anatomy of a Merge Gone Wrong, James Lew and Joshua Horstman shared one programming pitfall that requires concrete knowledge of the PDV. Errors with merging data sets can arise from a number of sources, including not understanding the inner workings of the DATA step.
Their paper explains how the automatic retain in DATA step processing can trip up even the most wary of SAS programmers. There is a common misconception that the values in the PDV are always reset to missing when processing returns to the top of the DATA step during the execution phase. However, when reading data with a SET, MERGE, MODIFY or UPDATE statement, variable values are automatically retained from one iteration to the next. Lew and Horstman suggest two ways to avoid errors caused by the automatic retain:
- merge data sets in a separate DATA step and then perform any additional processing in subsequent DATA steps
- rename one of the input variables
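The pitfall and both remedies can be sketched with a small many-to-one merge. This is an illustrative example under assumed data, not the code from Lew and Horstman's paper; every data set and variable name here (work.rates, work.orders, region, rate, base_rate and so on) is hypothetical.

```sas
/* Hypothetical example: names and values are invented for illustration. */
data work.rates;                 /* one row per region */
    input region $ rate;
    datalines;
East 100
;
run;

data work.orders;                /* several rows per region */
    input region $ order_id;
    datalines;
East 1
East 2
East 3
;
run;

/* The pitfall: RATE is read once per BY group and then retained,   */
/* so each iteration doubles the value left by the previous one.    */
/* RATE ends up as 200, 400 and 800 instead of 200 on every row.    */
data work.merged_wrong;
    merge work.orders work.rates;
    by region;
    rate = rate * 2;
run;

/* Remedy 1: merge in one step, then calculate in a later step.     */
/* SET re-reads RATE from each row, so every row starts at 100.     */
data work.joined;
    merge work.orders work.rates;
    by region;
run;

data work.fixed_two_steps;
    set work.joined;
    rate = rate * 2;
run;

/* Remedy 2: rename the input variable so the calculation never     */
/* overwrites the retained value read from the data set.            */
data work.fixed_rename;
    merge work.orders work.rates(rename=(rate=base_rate));
    by region;
    rate = base_rate * 2;
run;
```

In both corrected versions rate is 200 on all three rows, because the value read from work.rates is never modified in place while it sits in the PDV.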
Both papers emphasize why it’s worth learning how the DATA step really works. You’ll want to check these additional resources: