Changing variable type and variable length in SAS datasets

12
See also: Expanding lengths of all character variables in SAS data sets

There is a well-known Russian saying that goes “Если нельзя, но очень хочется, то можно.” The English translation of it can span anywhere from “If you can’t, but want it badly, then you can” to “If you shouldn’t, but want it badly, then you should” to “If you may not, but want it badly, then you may.” Depending on your situation, any possible combination of “may,” “can,” or “should” may apply. You can even replace “want” with “need” to get a slightly different flavor.

There are known means of modifying variable attributes with PROC DATASETS, but they are limited to variable name, format, informat, and label. But what if we want/need to modify a variable length, or change a variable type? And I am not talking about creating a new variable with a different length or converting a numeric variable value into a character value of another variable. We want to change variable type and/or variable length in place, without adding new variables. If you believe it can’t be done, read the first paragraph again.

Imagine that we have two data tables that we need to concatenate into one table. However, there is one common variable that is of different type in each table – in the first table it is numeric, but in the second table it is character.

Sample data

Let’s create some sample data to emulate our situation by running the following SAS code:

libname sasdl 'C:\PROJECTS\_BLOG_SAS\changing_variable_type_and_length_in_sas_datasets';
 
/* create study2016 data table */
data sasdl.study2016;
   length subjid dob 8 state $2 start_date end_date 8;
   infile datalines truncover;
   input subjid dob : mmddyy10. state start_date : mmddyy10. end_date : mmddyy10.;
   format dob start_date end_date mmddyy10.;
   datalines;
123456 08/15/1960 MD 01/02/2016
234567 11/13/1970 AL 05/12/2016 12/30/2016
;
 
/* create study2017 data table */
data sasdl.study2017;
   length subjid $6 dob 4 state $2 start_date end_date 4;
   infile datalines truncover;
   input subjid dob : mmddyy10. state start_date : mmddyy10. end_date : mmddyy10.;
   format dob start_date end_date mmddyy10.;
   datalines;
987654 03/15/1980 VA 02/13/2017
876543 11/13/1970 NC 01/11/2017 01/30/2017
765432 12/15/1990 NY 03/14/2017
;

The produced data tables will look as follows:

Table STUDY2016:
SAS data table 1

Table STUDY2017:
SAS data table 2

If we look at the tables’ variable properties, we will see that the subjid variable is of different type in these two data tables: it is of type Numeric (length of 8) in STUDY2016 and of type Character (length of 6) in STUDY2017:

SAS variables properties

Also, notice that variables dob, start_date, and end_date, although of the same Numeric type, have different length attributes - 8 in the STUDY2016 table, and 4 in the STUDY2017 table.

Data table concatenating problem

If we try to concatenate these two tables using PROC APPEND, SAS will generate an ERROR:

proc append base=sasdl.study2016 data=sasdl.study2017;
run;
 
NOTE: Appending SASDL.STUDY2017 to SASDL.STUDY2016.
WARNING: Variable subjid not appended because of type mismatch.
WARNING: Variable dob has different lengths on BASE and DATA
         files (BASE 8 DATA 4).
WARNING: Variable start_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
WARNING: Variable end_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
ERROR: No appending done because of anomalies listed above.
       Use FORCE option to append these files.
NOTE: 0 observations added.

Even if we do use the FORCE option as the ERROR message suggests, the result will be disappointing:

proc append base=sasdl.study2016 data=sasdl.study2017 force;
run;
 
NOTE: Appending SASDL.STUDY2017 to SASDL.STUDY2016.
WARNING: Variable subjid not appended because of type mismatch.
WARNING: Variable dob has different lengths on BASE and DATA
         files (BASE 8 DATA 4).
WARNING: Variable start_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
WARNING: Variable end_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
NOTE: FORCE is specified, so dropping/truncating will occur.

The resulting data table will have missing values for the appended subjid:

Missing values in SAS table after PROC APPEND

Solution

In order to concatenate these tables, we must make the mismatching variable subjid of the same type in both data tables, either Character or Numeric. Making them both of Character type seems more robust, since it would allow for the values to contain both digit and non-digit characters. But if you know for sure that the value contains only digits, making them both Numeric works just as well.

Let’s say we decide to make them of Character type. Also note that our numeric variables representing dates (dob, start_date and end_date) are of different lengths: they are length 8 in STUDY2016 and length 4 in STUDY2017. Let’s make them the same length as well. From the standpoint of numerical accuracy in SAS for the dates, a length of 4 seems to be quite adequate to represent them accurately.

Let’s apply all our modifications to the STUDY 2016 dataset. Even though we are going to re-build the dataset in order to modify variable type and length, we are going to preserve the variable order so it feels like we just modified those variable attributes.

Here is how it can be done.

/* create macrovariable varlist containing a list of variable names */
proc sql noprint;
   select name into :varlist separated by ' '
   from sashelp.vcolumn
   where upcase(libname) eq 'SASDL' and upcase(memname) eq 'STUDY2016';
quit;
 
/* modify variable type and length */
data sasdl.study2016 (drop=v1-v4);
   retain &varlist; *<-- preserve variable order ;
   length subjid $6 dob start_date end_date 4; *<-- define new types/lengths ;
   format dob start_date end_date mmddyy10.;   *<-- recreate formats ;
   set sasdl.study2016 (rename=(subjid=v1 dob=v2 start_date=v3 end_date=v4));
   subjid = put(v1,6.); *<-- redefine subjid variable ;
   dob = v2;            *<-- redefine dob variable ;
   start_date = v3;     *<-- redefine start_date variable ;
   end_date = v4;       *<-- redefine end_date variable ;
run;
 
/* make sure new concatenated file (study_all) does not exist */
proc datasets library=sasdl nolist;
   delete study_all;
quit;
 
/* append both (study2016 and study2017) to study_all */
proc append base=sasdl.study_all data=sasdl.study2016;
run;
proc append base=sasdl.study_all data=sasdl.study2017;
run;

In this code, first, using proc sql and SAS view sashelp.vcolumn, we create a macro variable varlist to hold the list of all the variable names in our table, sasdl.study2016.

Then in the data step, we use a retain statement to preserve the variable order. When we read the sasdl.study2016 dataset using the set statement, we rename our variables-to-be-modified to some temporary names (e.g. v1 – v4) which we eventually drop in the data statement.

Then we re-assign the values of those temporary variables to the original variable names, thereby essentially creating new variables with new type and length. Since these new variables are named exactly as the old ones, the effect is as if their type and length attributes where modified, while in fact the whole table was rebuilt and replaced. Problem solved.

When we concatenate the data tables we create a new table sasdl.study_all. Before concatenating our two tables using proc append twice, we use proc datasets to delete that new table first. Even if the table does not exist, proc datasets will at least attempt to delete it. With all the seeming redundancy of this step, you will definitely appreciate it when you try running this code more than one time.

See also: Expanding lengths of all character variables in SAS data sets
Share

About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. Currently, he is a Lead Applications Developer at F.N.B. Corporation. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

12 Comments

  1. So I have 12 different datasets that I am trying to add together, with about 40 observations each but 251 variables. Will I have to repeat this step multiple times?

    • Leonid Batkhan

      Hi Kavya,
      Thank you for your question. It's a good one. This blog post outlines the idea of dealing with data sets that have variables with different attributes. In case of multiple data sets I envision some iterative process that incrementally reconcile variables to accommodate their values.

  2. Anders Sköllermo on

    Nice solution and Very well written text! This is a common SAS problem.
    About performance: Study_2016 is read once. Proc Append is a procedure, that copies large blocks of data. It is much faster than SET.

    I do think that it is an advantage to split the solution up in different steps.
    If you have a much more complex situation, then you can use the solution above (perhaps with an extra data-step).
    Many thanks! / Anders Sköllermo (retired SAS Specialist. User since 1981)

  3. Hi Leonid.

    Nicely done! I especially like that you created a macro variable and then used it in a RETAIN statement to define variable ordering.

  4. Deborah Reese on

    Very well written Leonid! I had this exact issue a few months ago and couldn’t identify a solution. Fortunately, the datasets were small so I exported them, combined the two, and then imported the combined file back into SAS. Not at all efficient or desirable. I bookmarked the post for future reference.

    • Leonid Batkhan

      Thank you, Deborah, for your feedback. I am glad you found this post useful. Have to admit, your solution was quite creative, in mine I was not thinking outside of SAS box ☺.

  5. Great tip for appending large datasets with different variable attributes.

    Why can't you do it in one step though?

    data sasdl.study_all (drop=v1-v4);
       retain &varlist; *< -- preserve variable order ;
       length subjid $6 dob start_date end_date 4; *<-- define new types/lengths ;
       format dob start_date end_date mmddyy10.;   *<-- recreate formats ;
       set sasdl.study2016 (in=a rename=(subjid=v1 dob=v2 start_date=v3 end_date=v4)) sasdl.study2017 ;
       if a then do;
          subjid = put(v1,6.); *<-- redefine subjid variable ;
          dob = v2;            *<-- redefine dob variable ;
          start_date = v3;     *<-- redefine start_date variable ;
          end_date = v4;       *<-- redefine end_date variable ;
       end;
    run;
    

    • Leonid Batkhan

      Thank you, Tatiana, for your comment. Of course, it can be done in one step as you suggested.
      However, in that single step you read both data sets in their entirety. PROC APPEND, on the other hand, does not read the whole dataset and therefore runs much faster. You will see noticeable time savings when your sasdl.study2017 dataset is large.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top