Simplify data preparation using SAS data templates

20

Using data templates in data management, data development and data preparation
My New Year's resolution: “Unclutter your life” and I hope this post will help you do the same.

Here I share with you a data preparation approach and SAS coding technique that will significantly simplify, unclutter and streamline your SAS programming life by using data templates.

Dictionary.com defines template as “anything that determines or serves as a pattern; a model.” However, I was flabbergasted when my “prior art research” for the topic of this blog post ended rather abruptly: “No results found for data template.”

What do you mean “no results?!” (Yes, sometimes I talk to the Internet. Do you?) We have templates for everything in the world: MS Word templates, C++ templates, Photoshop templates, website templates, holiday templates, we even have our own PROC TEMPLATE. But no templates for data?

At that point I paused, struggling to accept reality, but then felt compelled to come up with my own definition:

A data template is a well-defined data structure containing a data descriptor but no data.

Therefore, a SAS data template is a SAS dataset (data table) containing the descriptor portion with all necessary attributes defined (variable types, labels, lengths, formats, and informats) and empty (zero observations) data portion.

Less clutter = greater efficiency

When you construct SAS data tables using SAS code or data management tools such as SAS Data Integration Studio, data descriptions which are quite mundane can overshadow the rest of your more sophisticated data processing logic. With tens, and sometimes hundreds, of variables to describe, your data preparation code or process can become unsightly, tedious and bulky.

The main benefit of using SAS data templates is that it allows you to unclutter voluminous data descriptions by separating them into modules apart from the rest of your SAS code. This allows your data processing modules to be lean and concise, and therefore easier to develop, debug, understand and work with.

Data templates also:

  1. Assure your data consistency across multiple instances.
  2. Provide greater coding and data management flexibility.
  3. Ensure your data is always well described.
  4. Save development time (create once, use multiple times.)
  5. Unclutter your SAS log, making it easier to read and analyze.
  6. Self-document your data.
  7. Can be used for SAS tables as well as other non-SAS databases.

In addition, from a project management perspective, data description and data processing tasks require people with quite different skill sets/levels. Having team members implement them in parallel speeds up the development process. Moreover, you can fully automate building data templates by using data design documentation as a feed to code-generating SAS program.

Unfortunately, despite all these benefits the data template concept is not explicitly and consistently employed and is noticeably absent from data development methodologies and practices.

Let’s try to change that!

How to create SAS data templates from scratch

It is very easy to create SAS data template. Here is an example:

 
libname PARMSDL 'c:\projects\datatemplates';
data PARMSDL.MYTEMPLATE;
   label
      newvar1 = 'Label for new variable 1'
      newvar2 = 'Label for new variable 2'
      /* ... */
      newvarN = 'Label for new variable N'
      ;
  length
      newvar1 newvar2 $40
      newvarN 8
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
   stop;
run;

First, you need to assign a permanent library (e.g. PARMSDL) where you are going to store your SAS dataset template. I usually do not store data templates in the same library as data. Nor do I store it in the same directory/folder where you store your SAS code. Ordinarily, I store data templates in a so-called parameter data library (that is why I use PARMSDL as a libref), along with other data defining SAS code structure.

In the data step, the very first statement LABEL defines all variables’ labels as well as variable position determined by the order in which they are listed.

Statement LENGTH defines variables’ types (numeric or character) and their length in bytes. Here you may group variables of the same length to shorten your code or define them individually to be more explicit. If you don't define a variable type and length here SAS will not place that variable into the data template (well, unless you implicitly define its type and length in optional format or informat statement.)

Statement FORMAT defines variables’ formats as needed. You don’t have to define formats for all the variables; define them only if necessary.

Statement INFORMAT (also optional) defines informats that come handy if you use this data template for creating SAS datasets by reading external raw files. With informats defined on the data template, you won’t have to specify informats in your INPUT statement while reading external file, as the informats will be inherently associated with the variable names. That is why SAS data sets have informat attribute for its variables in the first place (if you ever wondered why.)

Finally, don’t forget the STOP statement at the end of your data step, just before the RUN statement. Otherwise, instead of zero observations, you will end up with a data table that has a single observation with all missing variable values. Not what we want.

It is worth noting that obs=0 system option will not work instead of the STOP statement as it is applied only to the data being read, but we read no data here. For the same reason, (obs=0) data set option will not work either. Try it, and SAS log will dispel your doubts:

 
data PARMSDL.MYTEMPLATE (obs=0);
                        ---
                        70
WARNING 70-63: The option OBS is not valid in this context.  Option ignored.

Alternatively, you may use ATTRIB statement to assign LABEL, LENGTH, FORMAT and INFORMAT:

 
data PARMSDL.MYTEMPLATE;
   attrib
      newvar1  label='Label for new variable 1' length=$40
      newvar2  label='Label for new variable 2' length=$40
      /* ... */
      newvarN  label='Label for new variable N' length=8    format=mmddyy10.  informat=date9.
      ;
   stop;
run;

How to create SAS data templates by inheritance

If you already have some data table with well-defined variable attributes, you may easily create a data template out of that data table by inheriting its descriptor portion:

 
data PARMSDL.MYTEMPLATE;
   set SASDL.MYDATA (obs=0);
run;

Option (obs=0) does work here as it is applied to the dataset being read, and therefore STOP statement is not necessary.

You can also combine inheritance with defining new variables, as in the following example:

 
data MYTEMPLATE;
   set SASDL.MYDATA (obs=0); *<-- inherited template;
   * variables definition: ;
   label
      newvar1 = 'Label for new variable 1'
      newvarN = 'Label for new variable N'
      oldvar  = 'New Label for OLD variable' 
      ;
   length
      newvar1 $40
      newvarN 8
      oldvar  $100 /* careful here, see notes below */
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
run;

A word of warning

Be careful when your new variable definition type and length contradicts inherited definition.
You can overwrite/re-define inherited variable attributes such as labels, formats and informats with no problem, but you cannot overwrite type and in some cases length. If you do need to have a different variable type for a specific variable name on your data template, you should first drop that variable on the SET statement and then re-define it in the data step.

With the length attribute the picture is a bit different. If you try defining a different length for some variable, SAS will produce the following WARNING in the LOG:

 
WARNING: Length of character variable  has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

You can either use the advice of the WARNING statement and place the LENGTH statement as the very first statement or at least before the SET statement. In this case, you will find that you can increase the length without a problem, but if you try to reduce the length relative to the one on the parent dataset SAS will produce the following WARNING in the LOG:

 
WARNING: Multiple lengths were specified for the variable  by input data set(s). This can cause truncation of data.

In this case, a cleaner way will also be to drop that variable on the SET statement and redefine it with the LENGTH statement in the data step.

Keep in mind that when you drop these variables from the parent data set, besides losing their type and length attributes, you will obviously lose the rest of the attributes too. Therefore, you will need to re-define all the attributes (type, length, label, format, and informat) for the variables you drop. At least, this technique will allow you to selectively inherit some variables from a parent data set and explicitly define others.

How to use SAS data templates

One way to apply your data template to a newly created dataset is to: 1) Copy your data template in that new dataset; 2) Append your data table to that new data set. Here is an example:

 
/* copying data template into dataset */
data SASDL.MYNEWDATA
   set PARMSDL.MYTEMPLATE;
run;
 
/* append data to your dataset with descriptor */
proc append base=SASDL.MYNEWDATA data=WORK.MYDATA;
run;

Your variable types and lengths should be the same on the BASE= and DATA= tables; labels, formats and informats will be carried over from the BASE= dataset/template.

It is simple, but could be simplified even more to reduce your code to just a single data step:

 
 data SASDL.MYNEWDATA;
   if 0 then set PARMSDL.MYTEMPLATE;
   set WORK.MYDATA;
   /* other statements */
run;

Even though set PARMSDL.MYTEMPLATE; statement has never executed because of the explicitly FALSE condition (0 means FALSE) in the IF statement, the resulting dataset SASDL.MYNEWDATA gets all its variable attributes carried over from the PARMSDL.MYTEMPLATE data template during data step compilation. This effectively injects your data template descriptor into the data step without breaking its natural flow of iterations controlled by set WORK.MYDATA; statement.

This same coding technique can be used to implicitly apply data variable attributes from a well-defined data set by inheritance even though that data set is not technically a data template (has more than 0 observations.) Run the following code to make sure MYDATA table has all the variables and attributes of the SASHELP.CARS data table while data values come from the ABC data set:

 
data ABC;
  make='Toyota';
run;
 
data MYDATA;
   if 0 then set SASHELP.CARS;
   set ABC;
run;

Perhaps the benefits of SAS data templates are best demonstrated when you read external data into SAS data table. Here is an example for you to run (of course, in real life MYTEMPLATE should be a permanent data set and instead of datalines it should be an external file):

 
data MYTEMPLATE;
   label
      fdate = 'Flight Date'
      count = 'Flight Count'
      fdesc = 'Flight Description'
      reven = 'Revenue, $';
   length fdate count reven 8 fdesc $22;
   format fdate date9. count comma12. reven dollar12.2;
   informat fdate mmddyy10. count comma8. fdesc $22. reven comma10.;
   stop;
run;
 
data FLIGHTS;
   if 0 then set MYTEMPLATE;
   input fdate count fdesc & reven;
   datalines;
12/05/2018 500   Flight from DCA to BOS  120,034
10/01/2018 1,200 Flight from BOS to DCA  90,534
09/15/2018 2,234 Flight from DCA to MCO  1,350
;

Here is how the output data set looks:
Results of applying data templates
Notice how simple the last data step is. No labels, no lengths, no formats, no informats – no clutter. Yet, the raw data is read in nicely, with proper informats applied, and the resulting data set has all the proper labels and variable formatting. And when you repeat this process for another sample of similar data you can still use the same data template, and your read-in data step stays the same – simple and concise.

Your thoughts

Do you find SAS data templates useful? Do you use them in any shape or form in your SAS data development projects? Please share your thoughts.

Tags
Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

Related Posts

20 Comments

  1. Leonid,

    Another great blog post!

    Your readers might also be interested in this alternate way to specify inheritance:

    proc sql;
       create table PARMSDL.MYTEMPLATE
       like SASDL.MYDATA;
    quit;
    

    Looking forward to your next blog post!

    ----MMMMIIIIKKKKEEEE

    • Leonid Batkhan

      Thank you, Mike, for your nice comment and lovely SQL code for data template creation by inheritance. Without it, I would use a clunkier SQL version:

      proc sql;
         create table PARMSDL.MYTEMPLATE as select *
         from SASDL.MYDATA where 0;
      quit;

      but your code snippet is just so much more elegant! Does it come from one of your SAS books?

  2. Jonathan McCopp on

    Huge fan of the data templates. I wish I had heard about them 10 years ago. I can't believe I have been coding length statements for this use case like a chump. I feel like climbing on the roof and yelling this so all the SAS programmers can hear!

  3. I have used template datasets for many years (when it made sense). I used it most often when I was a pharma programmer; I would create the template from the data specs, knowing that my final datasets would have the right attributes. It's like a pre-compiled object that encapsulates my desired attributes.

    I'll point out a couple things - hopefully they're at least tangentially related to this post 🙂

    1) Remember, all "data set" variables (as opposed to "data step" variables) have an implied retain. This has bitten me in the past (sometimes badly!), esp. when I use the if 0 then set ... construct in conjunction with using hash objects.

    Example code:

    data template;
       set sashelp.class;
       stop;
    run;
    
    * this is ok ;
    data test;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then NewHeight=height*2;  * this is a "data step" variable ;
    run;
    
    * this is not ok ;
    data test2;
       if 0 then set template;
       set sashelp.class (rename=(height=_height) drop=weight);
       if sex='M' then height=_height*2;
       if sex='F' and name < 'M' then weight=_height*3;
    run;
    
    * but this works as expected ;
    data test3;
       if 0 then set template;
       * set all columns to missing AS OF THIS POINT IN THE DATA STEP COMPLIATION ;
       call missing(of _all_);  
       retain foo 'bar';  * is not part of the _all_ variable list and is not reset ;
       set sashelp.class (rename=(height=_height) drop=weight);
       if sex='M' then height=_height*2;
       if sex='F' and name < 'M' then weight=_height*3;
    run;
    

    These are just examples, and perhaps obvious in these short examples. But when you've got that 3,000 line data step (that someone else wrote) and you've got hard to track down data issues, remember this :). I encounter this most often when that program has multiple source data sets, i.e. set ds1 ds2 ds3, but the structure is not identical - say ds1 has the complete list of variables, but ds2 and ds3 only have a subset of the total variables.

    2) To set the desired PDV order, you'll usually read that you should use the retain statement. As above, this can cause problems if that column is conditionally set via your programming logic. For this reason, I prefer a "naked" format statement.

    Example code:

    * this is not ok ;
    data test4;
       retain foo bar height weight; * the rest of the order is set by statements as encountered in the data step ;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then foo=42;
       if sex='F' then bar='blah';
    run;
    
    * this is ok ;
    data test5;
       * set PDV order ;
       format foo bar height name weight; * the rest of the order is set by statements as encountered in the data step ;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then foo=42;
       if sex='F' then bar='blah';
       * and "last statement wins", so you can still set permanent formats ;
       format height z4. weight 8.3 name $upcase.;
    run;
    

    Hope this helps!

  4. Chris Hemedinger
    Chris Hemedinger on

    For those who are more comfortable with SQL, you can also use the DESCRIBE TABLE statement to generate an SQL version:

    proc sql;
     describe table flights;
    run;

    Generates this in the log:

    create table WORK.FLIGHTS( bufsize=8192 )
      (
       fdate num format=DATE9. informat=MMDDYY10. label='Flight Date',
       count num format=COMMA12. informat=COMMA8. label='Flight Count',
       fdesc char(22) informat=$22. label='Flight Description',
       reven num format=DOLLAR12.2 informat=COMMA10. label='Revenue, $'
      );
    

    • Leonid Batkhan

      Thank you, Chris, for this valuable addition. The resulting SQL code demonstrates how data templates can be created in SQL. Besides, I see this as a convenient way to grab a data template and generate code that can be then modified in order to update an existing or to create a new data template.
      I guess, I will need to follow up with a separate blog post on using data templates with PROC SQL.

    • Leonid Batkhan

      Thank you, Norman, for your feedback and a great question.
      It all depends on how you read your XLSX files. If you read a single sheet Excel file, you can convert it (save as) .CSV file. In this case, when you use PROC IMPORT it will generate and print in the SAS log a data step with all the INFORMAT, FORMAT, and INPUT statements. You can use this data step as a first approximation of your data template; add your labels and lengths to it. Then use it as described in this post to read in your .CSV file.
      If you use PROC IMPORT to read multi-sheet XLSX file, then it'll become more involved. Create a SAS data set first, then strip it's variables from the attributes you want to change (e.g. using PROC DATASETS) and then create new dataset using your data template and the modified dataset.

  5. Analogous to Chris' approach, I wrote a macro %attrib that either 1) generates attrib statements from a (template) dataset, or 2) write those attrib statements to the log (for either cut-and-paste or debugging). You may not like the whitespace I use in the log output - if so, change the macro :).

    I wrote this when I was having locking issues using a template dataset in PROC DS2. IIRC I was also trying to use hash objects in the PROC DS2 code. But the attrib statements worked (i.e. no locking issues). And sample invocations:

    options mprint mrecall;
    options ls=max;
    options sasautos=("\\path\to\your\autocall\macros" sasautos);  * %attrib should be in this path ;
    
    data template;
       %attrib(sashelp.cars);
       stop;
    run;
    
    %attrib(sashelp.cars,show=Y);
    
    

    • Leonid Batkhan

      Hi Scott, thank you again for sharing your implementation of the data templates usage. What are the benefits of using your macro compared to all the approaches described in this blog post and the comments above? I am sure there might be multiple ways of doing the same in SAS.

      • Hi Leonid, Chris's post shows the describe table functionality in SQL, which generates the code for a "template table" ***in the SAS log***. You'd still have to cut-and-paste the code to actually create the table. AFAIK there is no analogous functionality to generate the code for a data step given an existing data set. So I thought I'd share my macro in case someone found it useful. But sure, there are many ways to create a template data set. I guess folks will use the approach that best suits their needs.

Leave A Reply

Back to Top