Here I share with you a data preparation approach and SAS coding technique that will significantly simplify, unclutter and streamline your SAS programming life by using data templates.
Dictionary.com defines template as “anything that determines or serves as a pattern; a model.” However, I was flabbergasted when my “prior art research” for the topic of this blog post ended rather abruptly: “No results found for data template.”
What do you mean “no results?!” (Yes, sometimes I talk to the Internet. Do you?) We have templates for everything in the world: MS Word templates, C++ templates, Photoshop templates, website templates, holiday templates, we even have our own PROC TEMPLATE. But no templates for data?
At that point I paused, struggling to accept reality, but then felt compelled to come up with my own definition:
A data template is a well-defined data structure containing a data descriptor but no data.
Therefore, a SAS data template is a SAS dataset (data table) containing the descriptor portion with all necessary attributes defined (variable types, labels, lengths, formats, and informats) and empty (zero observations) data portion.
Less clutter = greater efficiency
When you construct SAS data tables using SAS code or data management tools such as SAS Data Integration Studio, data descriptions which are quite mundane can overshadow the rest of your more sophisticated data processing logic. With tens, and sometimes hundreds, of variables to describe, your data preparation code or process can become unsightly, tedious and bulky.
The main benefit of using SAS data templates is that it allows you to unclutter voluminous data descriptions by separating them into modules apart from the rest of your SAS code. This allows your data processing modules to be lean and concise, and therefore easier to develop, debug, understand and work with.
Data templates also:
- Assure your data consistency across multiple instances.
- Provide greater coding and data management flexibility.
- Ensure your data is always well described.
- Save development time (create once, use multiple times.)
- Unclutter your SAS log, making it easier to read and analyze.
- Self-document your data.
- Can be used for SAS tables as well as other non-SAS databases.
In addition, from a project management perspective, data description and data processing tasks require people with quite different skill sets/levels. Having team members implement them in parallel speeds up the development process. Moreover, you can fully automate building data templates by using data design documentation as a feed to code-generating SAS program.
Unfortunately, despite all these benefits the data template concept is not explicitly and consistently employed and is noticeably absent from data development methodologies and practices.
Let’s try to change that!
How to create SAS data templates from scratch
It is very easy to create SAS data template. Here is an example:
libname PARMSDL 'c:\projects\datatemplates'; data PARMSDL.MYTEMPLATE; label newvar1 = 'Label for new variable 1' newvar2 = 'Label for new variable 2' /* ... */ newvarN = 'Label for new variable N' ; length newvar1 newvar2 $40 newvarN 8 ; format newvarN mmddyy10.; informat newvarN date9.; call missing(of _all_); stop; run;
First, you need to assign a permanent library (e.g. PARMSDL) where you are going to store your SAS dataset template. I usually do not store data templates in the same library as data. Nor do I store it in the same directory/folder where you store your SAS code. Ordinarily, I store data templates in a so-called parameter data library (that is why I use PARMSDL as a libref), along with other data defining SAS code structure.
In the data step, the very first statement LABEL defines all variables’ labels as well as variable position determined by the order in which they are listed.
Statement LENGTH defines variables’ types (numeric or character) and their length in bytes. Here you may group variables of the same length to shorten your code or define them individually to be more explicit. If you don't define a variable type and length here SAS will not place that variable into the data template (well, unless you implicitly define its type and length in optional format or informat statement.)
Statement FORMAT defines variables’ formats as needed. You don’t have to define formats for all the variables; define them only if necessary.
Statement INFORMAT (also optional) defines informats that come handy if you use this data template for creating SAS datasets by reading external raw files. With informats defined on the data template, you won’t have to specify informats in your INPUT statement while reading external file, as the informats will be inherently associated with the variable names. That is why SAS data sets have informat attribute for its variables in the first place (if you ever wondered why.)
Optional CALL MISSING(of _ALL_); statement eliminates a series of annoying NOTES in the SAS log about variables being uninitialized:
NOTE: Variable newvar1 is uninitialized. NOTE: Variable newvar2 is uninitialized. . . . NOTE: Variable newvarN is uninitialized.
Finally, don’t forget the STOP statement at the end of your data step, just before the RUN statement. Otherwise, instead of zero observations, you will end up with a data table that has a single observation with all missing variable values. Not what we want.
It is worth noting that obs=0 system option will not work instead of the STOP statement as it is applied only to the data being read, but we read no data here. For the same reason, (obs=0) data set option will not work either. Try it, and SAS log will dispel your doubts:
data PARMSDL.MYTEMPLATE (obs=0); --- 70 WARNING 70-63: The option OBS is not valid in this context. Option ignored.
Alternatively, you may use ATTRIB statement to assign LABEL, LENGTH, FORMAT and INFORMAT:
data PARMSDL.MYTEMPLATE; attrib newvar1 label='Label for new variable 1' length=$40 newvar2 label='Label for new variable 2' length=$40 /* ... */ newvarN label='Label for new variable N' length=8 format=mmddyy10. informat=date9. ; call missing(of _all_); stop; run;
How to create SAS data templates by inheritance
If you already have some data table with well-defined variable attributes, you may easily create a data template out of that data table by inheriting its descriptor portion:
data PARMSDL.MYTEMPLATE; set SASDL.MYDATA (obs=0); run;
Option (obs=0) does work here as it is applied to the dataset being read, and therefore STOP statement is not necessary.
You can also combine inheritance with defining new variables, as in the following example:
data MYTEMPLATE; set SASDL.MYDATA (obs=0); *<-- inherited template; * variables definition: ; label newvar1 = 'Label for new variable 1' newvarN = 'Label for new variable N' oldvar = 'New Label for OLD variable' ; length newvar1 $40 newvarN 8 oldvar $100 /* careful here, see notes below */ ; format newvarN mmddyy10.; informat newvarN date9.; run;
A word of warning
Be careful when your new variable definition type and length contradicts inherited definition.
You can overwrite/re-define inherited variable attributes such as labels, formats and informats with no problem, but you cannot overwrite type and in some cases length. If you do need to have a different variable type for a specific variable name on your data template, you should first drop that variable on the SET statement and then re-define it in the data step.
With the length attribute the picture is a bit different. If you try defining a different length for some variable, SAS will produce the following WARNING in the LOG:
WARNING: Length of character variable has already been set. Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.
You can either use the advice of the WARNING statement and place the LENGTH statement as the very first statement or at least before the SET statement. In this case, you will find that you can increase the length without a problem, but if you try to reduce the length relative to the one on the parent dataset SAS will produce the following WARNING in the LOG:
WARNING: Multiple lengths were specified for the variable by input data set(s). This can cause truncation of data.
In this case, a cleaner way will also be to drop that variable on the SET statement and redefine it with the LENGTH statement in the data step.
Keep in mind that when you drop these variables from the parent data set, besides losing their type and length attributes, you will obviously lose the rest of the attributes too. Therefore, you will need to re-define all the attributes (type, length, label, format, and informat) for the variables you drop. At least, this technique will allow you to selectively inherit some variables from a parent data set and explicitly define others.
How to use SAS data templates
One way to apply your data template to a newly created dataset is to: 1) Copy your data template in that new dataset; 2) Append your data table to that new data set. Here is an example:
/* copying data template into dataset */ data SASDL.MYNEWDATA set PARMSDL.MYTEMPLATE; run; /* append data to your dataset with descriptor */ proc append base=SASDL.MYNEWDATA data=WORK.MYDATA; run;
Your variable types and lengths should be the same on the BASE= and DATA= tables; labels, formats and informats will be carried over from the BASE= dataset/template so you can skip assigning them on the DATA= table.
It is simple, but could be simplified even more to reduce your code to just a single data step:
data SASDL.MYNEWDATA; if 0 then set PARMSDL.MYTEMPLATE; set WORK.MYDATA; /* other statements */ run;
Even though set PARMSDL.MYTEMPLATE; statement has never executed because of the explicitly FALSE condition (0 means FALSE) in the IF statement, the resulting dataset SASDL.MYNEWDATA gets all its variable attributes carried over from the PARMSDL.MYTEMPLATE data template during data step compilation. This effectively injects your data template descriptor into the data step without breaking its natural flow of iterations controlled by set WORK.MYDATA; statement.
This same coding technique can be used to implicitly apply data variable attributes from a well-defined data set by inheritance even though that data set is not technically a data template (has more than 0 observations.) Run the following code to make sure MYDATA table has all the variables and attributes of the SASHELP.CARS data table while data values come from the ABC data set:
data ABC; make='Toyota'; run; data MYDATA; if 0 then set SASHELP.CARS; set ABC; run;
Perhaps the benefits of SAS data templates are best demonstrated when you read external data into SAS data table. Here is an example for you to run (of course, in real life MYTEMPLATE should be a permanent data set and instead of datalines it should be an external file):
data MYTEMPLATE; label fdate = 'Flight Date' count = 'Flight Count' fdesc = 'Flight Description' reven = 'Revenue, $'; length fdate count reven 8 fdesc $22; format fdate date9. count comma12. reven dollar12.2; informat fdate mmddyy10. count comma8. fdesc $22. reven comma10.; stop; run; data FLIGHTS; if 0 then set MYTEMPLATE; input fdate count fdesc & reven; datalines; 12/05/2018 500 Flight from DCA to BOS 120,034 10/01/2018 1,200 Flight from BOS to DCA 90,534 09/15/2018 2,234 Flight from DCA to MCO 1,350 ;
Here is how the output data set looks:
Notice how simple the last data step is. No labels, no lengths, no formats, no informats – no clutter. Yet, the raw data is read in nicely, with proper informats applied, and the resulting data set has all the proper labels and variable formatting. And when you repeat this process for another sample of similar data you can still use the same data template, and your read-in data step stays the same – simple and concise.
Do you find SAS data templates useful? Do you use them in any shape or form in your SAS data development projects? Please share your thoughts.