Simplify data preparation using SAS data templates

33

Using data templates in data management, data development and data preparation
My New Year's resolution: “Unclutter your life” and I hope this post will help you do the same.

Here I share with you a data preparation approach and SAS coding technique that will significantly simplify, unclutter and streamline your SAS programming life by using data templates.

Dictionary.com defines template as “anything that determines or serves as a pattern; a model.” However, I was flabbergasted when my “prior art research” for the topic of this blog post ended rather abruptly: “No results found for data template.”

What do you mean “no results?!” (Yes, sometimes I talk to the Internet. Do you?) We have templates for everything in the world: MS Word templates, C++ templates, Photoshop templates, website templates, holiday templates, we even have our own PROC TEMPLATE. But no templates for data?

At that point I paused, struggling to accept reality, but then felt compelled to come up with my own definition:

A data template is a well-defined data structure containing a data descriptor but no data.

Therefore, a SAS data template is a SAS dataset (data table) containing the descriptor portion with all necessary attributes defined (variable types, labels, lengths, formats, and informats) and empty (zero observations) data portion.

Less clutter = greater efficiency

When you construct SAS data tables using SAS code or data management tools such as SAS Data Integration Studio, data descriptions which are quite mundane can overshadow the rest of your more sophisticated data processing logic. With tens, and sometimes hundreds, of variables to describe, your data preparation code or process can become unsightly, tedious and bulky.

The main benefit of using SAS data templates is that it allows you to unclutter voluminous data descriptions by separating them into modules apart from the rest of your SAS code. This allows your data processing modules to be lean and concise, and therefore easier to develop, debug, understand and work with.

Data templates also:

  1. Assure your data consistency across multiple instances.
  2. Provide greater coding and data management flexibility.
  3. Ensure your data is always well described.
  4. Save development time (create once, use multiple times.)
  5. Unclutter your SAS log, making it easier to read and analyze.
  6. Self-document your data.
  7. Can be used for SAS tables as well as other non-SAS databases.

In addition, from a project management perspective, data description and data processing tasks require people with quite different skill sets/levels. Having team members implement them in parallel speeds up the development process. Moreover, you can fully automate building data templates by using data design documentation as a feed to code-generating SAS program.

Unfortunately, despite all these benefits the data template concept is not explicitly and consistently employed and is noticeably absent from data development methodologies and practices.

Let’s try to change that!

How to create SAS data templates from scratch

It is very easy to create SAS data template. Here is an example:

 
libname PARMSDL 'c:\projects\datatemplates';
data PARMSDL.MYTEMPLATE;
   label
      newvar1 = 'Label for new variable 1'
      newvar2 = 'Label for new variable 2'
      /* ... */
      newvarN = 'Label for new variable N'
      ;
  length
      newvar1 newvar2 $40
      newvarN 8
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
   call missing(of _all_);
   stop;
run;

First, you need to assign a permanent library (e.g. PARMSDL) where you are going to store your SAS dataset template. I usually do not store data templates in the same library as data. Nor do I store it in the same directory/folder where you store your SAS code. Ordinarily, I store data templates in a so-called parameter data library (that is why I use PARMSDL as a libref), along with other data defining SAS code structure.

In the data step, the very first statement LABEL defines all variables’ labels as well as variable position determined by the order in which they are listed.

Statement LENGTH defines variables’ types (numeric or character) and their length in bytes. Here you may group variables of the same length to shorten your code or define them individually to be more explicit. If you don't define a variable type and length here SAS will not place that variable into the data template (well, unless you implicitly define its type and length in optional format or informat statement.)

Statement FORMAT defines variables’ formats as needed. You don’t have to define formats for all the variables; define them only if necessary.

Statement INFORMAT (also optional) defines informats that come handy if you use this data template for creating SAS datasets by reading external raw files. With informats defined on the data template, you won’t have to specify informats in your INPUT statement while reading external file, as the informats will be inherently associated with the variable names. That is why SAS data sets have informat attribute for its variables in the first place (if you ever wondered why.)

Optional CALL MISSING(of _ALL_); statement eliminates a series of annoying NOTES in the SAS log about variables being uninitialized:

NOTE: Variable newvar1 is uninitialized.
NOTE: Variable newvar2 is uninitialized.
. . .
NOTE: Variable newvarN is uninitialized.

Finally, don’t forget the STOP statement at the end of your data step, just before the RUN statement. Otherwise, instead of zero observations, you will end up with a data table that has a single observation with all missing variable values. Not what we want.

It is worth noting that obs=0 system option will not work instead of the STOP statement as it is applied only to the data being read, but we read no data here. For the same reason, (obs=0) data set option will not work either. Try it, and SAS log will dispel your doubts:

 
data PARMSDL.MYTEMPLATE (obs=0);
                        ---
                        70
WARNING 70-63: The option OBS is not valid in this context.  Option ignored.

Alternatively, you may use ATTRIB statement to assign LABEL, LENGTH, FORMAT and INFORMAT:

 
data PARMSDL.MYTEMPLATE;
   attrib
      newvar1  label='Label for new variable 1' length=$40
      newvar2  label='Label for new variable 2' length=$40
      /* ... */
      newvarN  label='Label for new variable N' length=8    format=mmddyy10.  informat=date9.
      ;
   call missing(of _all_);
   stop;
run;

How to create SAS data templates by inheritance

If you already have some data table with well-defined variable attributes, you may easily create a data template out of that data table by inheriting its descriptor portion:

 
data PARMSDL.MYTEMPLATE;
   set SASDL.MYDATA (obs=0);
run;

Option (obs=0) does work here as it is applied to the dataset being read, and therefore STOP statement is not necessary.

You can also combine inheritance with defining new variables, as in the following example:

 
data MYTEMPLATE;
   set SASDL.MYDATA (obs=0); *<-- inherited template;
   * variables definition: ;
   label
      newvar1 = 'Label for new variable 1'
      newvarN = 'Label for new variable N'
      oldvar  = 'New Label for OLD variable' 
      ;
   length
      newvar1 $40
      newvarN 8
      oldvar  $100 /* careful here, see notes below */
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
run;

A word of warning

Be careful when your new variable definition type and length contradicts inherited definition.
You can overwrite/re-define inherited variable attributes such as labels, formats and informats with no problem, but you cannot overwrite type and in some cases length. If you do need to have a different variable type for a specific variable name on your data template, you should first drop that variable on the SET statement and then re-define it in the data step.

With the length attribute the picture is a bit different. If you try defining a different length for some variable, SAS will produce the following WARNING in the LOG:

 
WARNING: Length of character variable  has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

You can either use the advice of the WARNING statement and place the LENGTH statement as the very first statement or at least before the SET statement. In this case, you will find that you can increase the length without a problem, but if you try to reduce the length relative to the one on the parent dataset SAS will produce the following WARNING in the LOG:

 
WARNING: Multiple lengths were specified for the variable  by input data set(s). This can cause truncation of data.

In this case, a cleaner way will also be to drop that variable on the SET statement and redefine it with the LENGTH statement in the data step.

Keep in mind that when you drop these variables from the parent data set, besides losing their type and length attributes, you will obviously lose the rest of the attributes too. Therefore, you will need to re-define all the attributes (type, length, label, format, and informat) for the variables you drop. At least, this technique will allow you to selectively inherit some variables from a parent data set and explicitly define others.

How to use SAS data templates

One way to apply your data template to a newly created dataset is to: 1) Copy your data template in that new dataset; 2) Append your data table to that new data set. Here is an example:

 
/* copying data template into dataset */
data SASDL.MYNEWDATA
   set PARMSDL.MYTEMPLATE;
run;
 
/* append data to your dataset with descriptor */
proc append base=SASDL.MYNEWDATA data=WORK.MYDATA;
run;

Your variable types and lengths should be the same on the BASE= and DATA= tables; labels, formats and informats will be carried over from the BASE= dataset/template so you can skip assigning them on the DATA= table.

It is simple, but could be simplified even more to reduce your code to just a single data step:

 
 data SASDL.MYNEWDATA;
   if 0 then set PARMSDL.MYTEMPLATE;
   set WORK.MYDATA;
   /* other statements */
run;

Even though set PARMSDL.MYTEMPLATE; statement has never executed because of the explicitly FALSE condition (0 means FALSE) in the IF statement, the resulting dataset SASDL.MYNEWDATA gets all its variable attributes carried over from the PARMSDL.MYTEMPLATE data template during data step compilation. This effectively injects your data template descriptor into the data step without breaking its natural flow of iterations controlled by set WORK.MYDATA; statement.

This same coding technique can be used to implicitly apply data variable attributes from a well-defined data set by inheritance even though that data set is not technically a data template (has more than 0 observations.) Run the following code to make sure MYDATA table has all the variables and attributes of the SASHELP.CARS data table while data values come from the ABC data set:

 
data ABC;
  make='Toyota';
run;
 
data MYDATA;
   if 0 then set SASHELP.CARS;
   set ABC;
run;

Perhaps the benefits of SAS data templates are best demonstrated when you read external data into SAS data table. Here is an example for you to run (of course, in real life MYTEMPLATE should be a permanent data set and instead of datalines it should be an external file):

 
data MYTEMPLATE;
   label
      fdate = 'Flight Date'
      count = 'Flight Count'
      fdesc = 'Flight Description'
      reven = 'Revenue, $';
   length fdate count reven 8 fdesc $22;
   format fdate date9. count comma12. reven dollar12.2;
   informat fdate mmddyy10. count comma8. fdesc $22. reven comma10.;
   stop;
run;
 
data FLIGHTS;
   if 0 then set MYTEMPLATE;
   input fdate count fdesc & reven;
   datalines;
12/05/2018 500   Flight from DCA to BOS  120,034
10/01/2018 1,200 Flight from BOS to DCA  90,534
09/15/2018 2,234 Flight from DCA to MCO  1,350
;

Here is how the output data set looks:
Results of applying data templates
Notice how simple the last data step is. No labels, no lengths, no formats, no informats – no clutter. Yet, the raw data is read in nicely, with proper informats applied, and the resulting data set has all the proper labels and variable formatting. And when you repeat this process for another sample of similar data you can still use the same data template, and your read-in data step stays the same – simple and concise.

Your thoughts

Do you find SAS data templates useful? Do you use them in any shape or form in your SAS data development projects? Please share your thoughts.

Share

About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. Currently, he is a Lead Applications Developer at F.N.B. Corporation. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

33 Comments

  1. Please check this new variable retain issue ;

    data template;
    set sashelp.class;
    stop;
    run;
    data class;
    set sashelp.class;
    drop height weight ;
    run;

    * this is NOT ok ;
    data test;
    if 0 then set template;
    set class;
    if sex='M' then height=2; /* it's not OK */
    run;

    • Leonid Batkhan

      Hi Lianbo,
      Thank you for your comment. What you experience here is an automatic, implicit RETAIN which is perfectly OK and in line with expectations. Let me explain why. Even though you dropped the HEIGHT variable when created dataset CLASS, this variable is NOT NEW in the last data step (data test;). It is not new because when you apply data template using "if 0 then set template;" it comes to the data step from that dataset TEMPLATE and therefore automatically retained. As explained in READY, SET, RETAIN, AND THEN MAYBE RESET SGF paper, see section CASES IN WHICH A VARIABLE IS AUTOMATICALLY RETAINED at the end of page 6: "there are several types of variables that are automatically retained. These include temporary array variables (variables from arrays named _TEMPORARY_), automatic variables (e.g., _N_, _ERROR_) and SAS data set variables (variables read from a SAS data set using SET, MERGE, UPDATE, and MODIFY). It would be redundant to use the RETAIN statement in these cases where variables are automatically retained or in other words are not initialized before each new observation is read."

      If you do not want this HEIGHT variable to retain its values, you would need to explicitly reset it to missing, e.g.:

      data test;
         if 0 then set template;
         set class;
         if sex='M' then height=2; else height=.;
      run;
      

      or

      data test;
         if 0 then set template;
         set class;
         height = ifn(sex='M',2,.);
      run;
      

      Another option would be to drop this variable from the TEMPLATE dataset; in that case this variable will indeed be NEW in the last data step and therefore not retained automatically by default.

      Hope this helps.
      Best regards,
      Leonid

  2. Great post. MMMMIIIIKKKKEEEE gave an excellent paper at SGF last year that described the use of data templates like this, with the addition of integrity constraints (I think integrity constraints are woefully underused by the SAS community, including by me : ). His paper introduces the "User Data File Template Validation Methodology (UDFTVM)". With that, you get all the benefits of a data template that defines data set attributes, and when you load data into the template structure, the data is automatically validated! Very cool. The only thing I don't like about Mike's talk is that I had to give a validation talk in the same room, following his. : P

    Mike's paper is Validating User-Submitted Data Files with Base SAS®.

  3. Another thing you can do to create uniformity in variables in multiple data sets within a project: Create a master template data set containing all of the variables in your project, then use the KEEP= data set option to pull in only the variables you need for a particular output data set,

    data somestuff;
        if 0 then set perm.mastervars (keep=ID name phone);   
    ...
    
    data morestuff;
        if 0 then set perm.mastervars (keep=ID department ordercount);
    ...
    

    This same technique works with PROC SQL:
    proc sql;
        create table morestuff like master (keep=id department ordercount);
    quit;
    

    Scott Bass makes some very good points. It should be straightforward to extend his %attrib macro to create code for either the data step or SQL.

    • Leonid Batkhan

      Thank you, Jack, for joining this discussion. While your proposed single data template per project approach ensures consistency of variables across different data tables, in my view, it can be practical only when your project uses just few variables. Otherwise, you would end up with having to list dozens of variables in the KEEP statement which by itself presents an “opportunity” for introducing errors (you may end up with correct variable attributes, but incorrect variable set.)

      Besides, this leaves us without a way of knowing which variables belong to each dataset.

      On the other hand, a “one data template per table” approach which I describe in this blog post, data template not only contains variable attributes, but it also defines a set of variables. That set of variables can be easily captured in a macro variable from the data template as in the following examples:

          proc sql noprint;
            select name into :varlist separated by ' '
            from SASHELP.VCOLUMN
            where libname='PARMSDL' and memname='DATATEMPLATE1';
         quit;
      

      or

         proc sql noprint;
            select name into :varlist separated by ' '
            from DICTIONARY.COLUMNS
            where libname='PARMSDL' and memname='DATATEMPLATE1';
         quit;
      

      Then you can use that macro variable in the KEEP statement along with data template injection:

         data NEWDATA (keep=&varlist);
            if 0 then set PARMSDL.DATATEMPLATE1;
            set DATA;
            /* ... */
         run;
      

      This will assure that only the variables from data template are kept while all other (interim) variables are dropped.

  4. At first skeptical as we typically have ddl to create the tables empty which is what yours is doing. However your last piece to reuse the template for multiple incoming tables is a good thing. I'll pass this along. Thank you for sharing.

    • Leonid Batkhan

      Thank you, Dee! You nailed that. Sure, the main benefit is in using and reusing the data templates - create once, use multiple times. And with dozens of variables to create (read in) we don't have to include and re-run their descriptions.
      I was too a bit skeptical of writing some obvious things about creating templates, but I felt compelled to do it for the sake of usage as I myself “sinned” for too long copy/pasting those length/label/format statements from one program to another.

  5. Analogous to Chris' approach, I wrote a macro %attrib that either 1) generates attrib statements from a (template) dataset, or 2) write those attrib statements to the log (for either cut-and-paste or debugging). You may not like the whitespace I use in the log output - if so, change the macro :).

    I wrote this when I was having locking issues using a template dataset in PROC DS2. IIRC I was also trying to use hash objects in the PROC DS2 code. But the attrib statements worked (i.e. no locking issues). And sample invocations:

    options mprint mrecall;
    options ls=max;
    options sasautos=("\\path\to\your\autocall\macros" sasautos);  * %attrib should be in this path ;
    
    data template;
       %attrib(sashelp.cars);
       stop;
    run;
    
    %attrib(sashelp.cars,show=Y);
    
    

    • Leonid Batkhan

      Hi Scott, thank you again for sharing your implementation of the data templates usage. What are the benefits of using your macro compared to all the approaches described in this blog post and the comments above? I am sure there might be multiple ways of doing the same in SAS.

      • Hi Leonid, Chris's post shows the describe table functionality in SQL, which generates the code for a "template table" ***in the SAS log***. You'd still have to cut-and-paste the code to actually create the table. AFAIK there is no analogous functionality to generate the code for a data step given an existing data set. So I thought I'd share my macro in case someone found it useful. But sure, there are many ways to create a template data set. I guess folks will use the approach that best suits their needs.

    • Leonid Batkhan

      Thank you, Norman, for your feedback and a great question.
      It all depends on how you read your XLSX files. If you read a single sheet Excel file, you can convert it (save as) .CSV file. In this case, when you use PROC IMPORT it will generate and print in the SAS log a data step with all the INFORMAT, FORMAT, and INPUT statements. You can use this data step as a first approximation of your data template; add your labels and lengths to it. Then use it as described in this post to read in your .CSV file.
      If you use PROC IMPORT to read multi-sheet XLSX file, then it'll become more involved. Create a SAS data set first, then strip it's variables from the attributes you want to change (e.g. using PROC DATASETS) and then create new dataset using your data template and the modified dataset.

  6. Chris Hemedinger
    Chris Hemedinger on

    For those who are more comfortable with SQL, you can also use the DESCRIBE TABLE statement to generate an SQL version:

    proc sql;
     describe table flights;
    run;

    Generates this in the log:

    create table WORK.FLIGHTS( bufsize=8192 )
      (
       fdate num format=DATE9. informat=MMDDYY10. label='Flight Date',
       count num format=COMMA12. informat=COMMA8. label='Flight Count',
       fdesc char(22) informat=$22. label='Flight Description',
       reven num format=DOLLAR12.2 informat=COMMA10. label='Revenue, $'
      );
    

    • Leonid Batkhan

      Thank you, Chris, for this valuable addition. The resulting SQL code demonstrates how data templates can be created in SQL. Besides, I see this as a convenient way to grab a data template and generate code that can be then modified in order to update an existing or to create a new data template.
      I guess, I will need to follow up with a separate blog post on using data templates with PROC SQL.

  7. I have used template datasets for many years (when it made sense). I used it most often when I was a pharma programmer; I would create the template from the data specs, knowing that my final datasets would have the right attributes. It's like a pre-compiled object that encapsulates my desired attributes.

    I'll point out a couple things - hopefully they're at least tangentially related to this post 🙂

    1) Remember, all "data set" variables (as opposed to "data step" variables) have an implied retain. This has bitten me in the past (sometimes badly!), esp. when I use the if 0 then set ... construct in conjunction with using hash objects.

    Example code:

    data template;
       set sashelp.class;
       stop;
    run;
    
    * this is ok ;
    data test;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then NewHeight=height*2;  * this is a "data step" variable ;
    run;
    
    * this is not ok ;
    data test2;
       if 0 then set template;
       set sashelp.class (rename=(height=_height) drop=weight);
       if sex='M' then height=_height*2;
       if sex='F' and name < 'M' then weight=_height*3;
    run;
    
    * but this works as expected ;
    data test3;
       if 0 then set template;
       * set all columns to missing AS OF THIS POINT IN THE DATA STEP COMPLIATION ;
       call missing(of _all_);  
       retain foo 'bar';  * is not part of the _all_ variable list and is not reset ;
       set sashelp.class (rename=(height=_height) drop=weight);
       if sex='M' then height=_height*2;
       if sex='F' and name < 'M' then weight=_height*3;
    run;
    

    These are just examples, and perhaps obvious in these short examples. But when you've got that 3,000 line data step (that someone else wrote) and you've got hard to track down data issues, remember this :). I encounter this most often when that program has multiple source data sets, i.e. set ds1 ds2 ds3, but the structure is not identical - say ds1 has the complete list of variables, but ds2 and ds3 only have a subset of the total variables.

    2) To set the desired PDV order, you'll usually read that you should use the retain statement. As above, this can cause problems if that column is conditionally set via your programming logic. For this reason, I prefer a "naked" format statement.

    Example code:

    * this is not ok ;
    data test4;
       retain foo bar height weight; * the rest of the order is set by statements as encountered in the data step ;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then foo=42;
       if sex='F' then bar='blah';
    run;
    
    * this is ok ;
    data test5;
       * set PDV order ;
       format foo bar height name weight; * the rest of the order is set by statements as encountered in the data step ;
       if 0 then set template;
       set sashelp.class;
       if sex='M' then foo=42;
       if sex='F' then bar='blah';
       * and "last statement wins", so you can still set permanent formats ;
       format height z4. weight 8.3 name $upcase.;
    run;
    

    Hope this helps!

  8. Jonathan McCopp on

    Huge fan of the data templates. I wish I had heard about them 10 years ago. I can't believe I have been coding length statements for this use case like a chump. I feel like climbing on the roof and yelling this so all the SAS programmers can hear!

  9. Leonid,

    Another great blog post!

    Your readers might also be interested in this alternate way to specify inheritance:

    proc sql;
       create table PARMSDL.MYTEMPLATE
       like SASDL.MYDATA;
    quit;
    

    Looking forward to your next blog post!

    ----MMMMIIIIKKKKEEEE

    • Leonid Batkhan

      Thank you, Mike, for your nice comment and lovely SQL code for data template creation by inheritance. Without it, I would use a clunkier SQL version:

      proc sql;
         create table PARMSDL.MYTEMPLATE as select *
         from SASDL.MYDATA where 0;
      quit;

      but your code snippet is just so much more elegant! Does it come from one of your SAS books?

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top