Automating SAS variable labels creation

12

Automating SAS applications development

SAS variable labels are unique features of SAS data tables (aka data sets) that allow SAS users to enhance reading and interpretation of tables and reports.

Whether you use SAS data table as a data source in any of the reporting procedures or interactive interface such as SAS Visual Analytics, you will benefit from pre-assigning meaningful labels during the data preparation process. Besides being more efficient, such an early label assignment secures consistency of the data elements descriptions (labels) across different developers.

The most direct way of creating column labels is by explicitly assigning them to the data variables. You can do it during the data table creation in a DATA step using either LABEL statement or ATTRIB statement. Alternatively, you can do it after your data table is already created by using PROC DATASETS’ MODIFY statement with the LABEL= option.

However, in many situations there are ways of automating this tedious and voluminous process of column labels creation. Let’s look into one of them that I found useful for bulk column labeling, especially in quick turnaround scenarios like developing an analytics proof of concept (PoC). Plus, we are going to explore SAS coding technique using _DATA_and_LAST_special data sets.

Deriving variable labels from variable names

This method is suitable when variable names are well-formed, for example CUSTOMER_ADDRESS, FIRST_NAME, LAST_NAME, COMPANY_NAME, PLACE_OF_BIRTH, etc. Kudos to data designer!

We can transform these names into labels by replacing underscores with space characters and converting words from upper case to proper case. These are the labels we will get: Customer Address, First Name, Last Name, Company Name, Place Of Birth.

Let’s say our original data table is DEMO:

 
data DEMO;
   input CUSTOMER_CITY $ 1-15 FIRST_NAME $ 16-26 LAST_NAME $27-37 COMPANY_NAME $38-50 COUNTRY_OF_BIRTH $51-65;
   datalines;
Washington     Peter      Birn       Citibank     USA
Denver         Lisa       Roth       IBM          UK
Cary           Antony     Bessen     SAS          Spain
;

Then the following macro will create variable labels out of the variable names as described above:

 
%macro ilabel (dataset);
   %local lbref dsname vname vlabel nvars;
 
   %if %index(&dataset,.) %then
   %do; /* 2-level dataset name */
      %let lbref  = %scan(&dataset,1,'.');
      %let dsname = %scan(&dataset,2,'.');
   %end;
   %else
   %do; /* 1-level dataset name */
      %let lbref  = WORK;
      %let dsname = &dataset;
   %end;
 
   /* get variable names */
   proc contents data=&dataset out=_data_(keep=name) noprint;
   run;
 
   /* create name/label pairs */
   data _null_;
      set _last_ end=eof nobs=n;
      call symput('vname'!!strip(put(_n_,best.)),name);
      lbl = propcase(translate(name,' ','_'));
      call symput('vlabel'!!strip(put(_n_,best.)),trim(lbl));
      if eof then call symputx('nvars',n);
   run;
 
   /* modify variable labels */
   proc datasets lib=&lbref nolist;
      modify &dsname;
         label
            %do i=1 %to &nvars;
               &&vname&i = "&&vlabel&i"
            %end;
         ;
   quit;
 
%mend ilabel;

You can invoke this macro by either one line of code:

%ilabel(DEMO)
or
%ilabel(WORK.DEMO)

Here are how our DEMO table looks before and after %ilabel macro modifies/assigns the labels based on the column names:

BEFORE:

Data table showing column names

AFTER:

Data table showing column labels

Macro code highlights

In this macro, we:

  1. Define local macro variables to make sure their names will not interfere with possible namesakes in the calling program.
  2. Determine libref and one-level data set name for the input data set.
  3. Create a table containing variable names in the input data set using PROC CONTENTS.
  4. Use DATA _NULL_ step to read through the variable names, and derive labels as

    
    lbl = propcase(translate(name,' ','_'));
    
    

    Here, transalate() function replaces underscores with blanks, then propcase() function converts every word in its argument to proper case (upper case for the first character and lower case for the remaining characters). We also create macro variables for each name/label pair (vname1, vlabel1, vname2, vlabel2, …) and macro variable nvars representing the number of such pairs.

  5. Use PROC DATASETS with MODIFY and LABEL statements to assign generated column labels to the source data set.

If some of the labels assigned by this macro are not what you need you may run another PROC DATASETS to individually adjust (re-assign) them according to your wishes or specification. But when you need to label data set columns on a large scale (many tables with dozens or hundreds of columns) this can be a good first draft that can save you time and efforts.

_DATA_ and _LAST_ special data sets

You might notice that I used _data_ data set name in the out= option of the PROC CONTENTS. This is not an explicit data set name; it is a keyword, a special data set that allows SAS to assign one of the available data set names dynamically. The created output data set will have a name that looks something like DATA1 or DATA2, etc. Try running this code:

 
data _data_;
   x=1;
run;

and look in the SAS LOG at what data set is created. I got:

NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.

Special data set name _data_ tells SAS to create a data set in the WORK library from a list of names DATA1, DATA2, … according to the DATAn naming convention. These names (as well as WORK library) are unique for a given SAS session. The first time you use _data_ within a SAS session it will create data set named WORK.DATA1, the second time you use _data_ it will create WORK.DATA2, and so on.

Consequently, I used special data set name _last_ in the SET statement of the DATA step following the PROC CONTENTS. Again, here _last_ is a keyword, not a data set name; it is a special data set that refers to the name of the last created data set during your SAS session. That causes SAS to use the latest data set created prior to the _last_ reference.

Special data sets _data_ and _last_ are reserved names (or SAS keywords) along with special data set _null_ that is used in the DATA _NULL_ statement and causes SAS to execute the DATA step without creating a data set. (By the way, using DATA _NULL_ can increase your code efficiency when you use the DATA step for custom report writing or creating macro variables or other processing for which the output data set is not needed as it does not consume computer resources for writing and storing the output data set.)

If I were using an explicit table name in this macro instead, and your calling program accidentally were using the same table name, then the macro would overwrite your table which would wreak havoc to your program. Using _data_ and _last_ special data sets protect your SAS program from a possibility of inadvertently overwriting your other data set with the same name by executing the %ilabel macro. It is similar to using %LOCAL for macro variable names for protecting from possible overwrites of your %GLOBAL macro variables with the same names.

A WORD OF CAUTION: Remember, that the _data_ keyword creates table names that are unique only within a SAS session, so it works perfectly for the WORK data library which itself is a unique instance for a SAS session. While it is syntactically correct to use special data set notation _data_ for creating permanent data sets such as libref._data_ (including SASUSER._data_), I have to warn you against using it as it will not guarantee the name uniqueness in the permanent data library, and you may end up overwriting data sets that already exist there.

Your thoughts?

Do you find this post useful? How do you handle the task of assigning variable labels on a mass scale? Do you use _data_ and _last_ special data sets in your SAS coding? Please share in the comments section below.

Share

About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. Currently, he is a Lead Applications Developer at F.N.B. Corporation. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

12 Comments

  1. I use the _DATA_ trick a lot, especially for temporary datasets that I create inside macros and then delete when the macro has been executed.

    I have found a problem with that though: it seems to me that the number in the DATAn convention rolls over at some point, and n starts over from 1. This causes a lot of problems, and my programs stop working. I don't see in the documentation when the rollover is.

    • Leonid Batkhan

      Thank you, Marcello, for your very useful feedback. I ran the following test:

      %macro test;
         %do i=1 %to 20000;
            data _DATA_;
               x=&i;
            run;
         %end;
      %mend test;
      %test;
      

      and found that indeed data sets have been created from DATA1 ... DATA9999, then numbering resets to 1 and the process continues thus overwriting the previously created data sets.
      To get to the bottom of it I opened a ticket with SAS Technical Support, and will let you know of any new information regarding this matter.
      Thank you again for reporting this unexpected and undocumented behavior.

  2. Dr. Muhammad Shoaib on

    Dear Professor
    I use the label statement to make the name of variable more descriptive. However, modified labels only appear in x and y axis of certain subsequent statistics like 'distribution' and 't tests' in GLM proc but not in the title of the these statistics as title used the pre-modified variable name . Plz guide me.

    • Leonid Batkhan

      Your question contains the answer: if you want to have a proper "title" you should use TITLE statement before your PROC GLM (or any other proc). Labels are used for variable descriptions. TITLE is a global SAS statement that controls the title of the output. You can have up to 10 lines of title.

  3. Hello Prof, i found the article helpful, as i was reading the article i kept on thinking if there is an easier way of writing the labels, when given a raw data, where one is mapping to sdtm or even adams, bearing in mind that the variable labels are in the IG, thanks

    • Leonid Batkhan

      Hi Daniel,
      For such highly standardized data models like SDTM and ADAM, you won't use the method described in this post. In your scenario, labels are defined in Implementation Guide (IG) and most likely already exist in previously created data sets. Then you can use those data sets with labels as data templates as described in Simplify data preparation using SAS data templates. Using SAS data templates you can easily "inject" existing labels (as well as column names, types, formats and informats) into a new data set, for example:

      data NEWDATA;
         if 0 then set DATA_WITH_LABELS;
         /* statements to define variable values */
      run;

      If you create your data sets from scratch, based solely on IG, there might be other ways of automating label creation. What document format is your IG?

      • Traditionally, these Implementation guides have provided data set and variable metadata in PDF files and Excel spreadsheet. Recently this metadata has been made available through the so-called CDISC Library, To quote their website: "CDISC Library uses linked data and a REST API to deliver CDISC standards metadata to software applications that automate standards-based processes. CDISC Library provides access to new relationships between standards as well as a substantially increased number of versioned CDISC standards and controlled terminology packages."
        With the CDISC Library users can now extract standard metadata in SAS using PROC HTTP, and then utilize this metadata to create data set templates, either as permanent SAS data sets or on the fly.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top