How to compare SAS data tables for common/uncommon columns

14

Building on my last post, How to create checklist tables in SAS®, this one shows you how to compare SAS data Check mark and cross mark sets that include common and uncommon columns. You'll learn how to visualize side-by-side columns commonalities and differences in data tables.

As before, we're working with a comparison matrix (aka checklist table) where check-marks / x-marks indicate included / excluded columns.

Data tables will be comparable products while their columns (variables) will represent product features. We'll add background color to highlight which attributes are different in the common columns. Since there might be several different attributes for a given column, we will use a hierarchy typelengthlabel to indicate only the highest mismatched level of hierarchy. For example:

  • If same-named columns have different type (Numeric vs. Character), their corresponding check-mark will be shown on a light-red background, which indicates the highest degree of mismatch.
  • If same-named columns have the same type, a yellow background will indicate any difference in variables length.
  • When same-named variables type and length match, a light-blue background marks any difference in variables label.

SAS code to create color-enhanced comparison matrix

Let’s compare variable attributes in two data tables: one is SAS-supplied SASHELP.CARS, and another WORK.NEWCARS that I derive from the first one, slightly scrambling its column definitions:

data WORK.NEWCARS (drop=temp:);
   set SASHELP.CARS (rename=(Origin=Region EngineSize=temp1 Make=temp2));
   length EngineSize $3 Make $20;
   EngineSize = put(temp1,3.1);
   Make = temp2; 
   label Type='New Car Type';
run;

In this NEWCARS data table, I did the following:

  • Replaced column name Origin with Region
  • Changed type of column EngineSize from Numeric to Character
  • Changed length of column Make from $13 to $20
  • Changed label of column Type from blank to “New Car Type”

Now let’s build the comparison matrix:

proc contents data=SASHELP.CARS noprint out=DS1(keep=Name Type Length Label);
run;
 
proc contents data=WORK.NEWCARS noprint out=DS2(keep=Name Type Length Label);
run;
 
data comparison_matrix;
   merge
      DS1(in=in1 rename=(Type=Typ1 Length=Len1 Label=Lab1))
      DS2(in=in2 rename=(Type=Typ2 Length=Len2 Label=Lab2));
   by Name;
 
   /* set symbol shape: 1=V; 0=X */
   ds1 = 1; ds2 = 1;
   if in1 and not in2 then ds2 = 0; else
   if in2 and not in1 then ds1 = 0;
 
   /* add background color */
   if ds1=ds2=1 then
   select;
      when(Typ1^=Typ2) do; ds1=2; ds2=2; end;
      when(Len1^=Len2) do; ds1=3; ds2=3; end;
      when(Lab1^=Lab2) do; ds1=4; ds2=4; end;
      otherwise; 
   end;
 
   label
      Name = 'Column Name'
      ds1 = 'SASHELP.CARS'
      ds2 = 'WORK.NEWCARS'
      ;
run;
 
proc format;
   value chmark
      0   = '(*ESC*){unicode "2718"x}'
      1-4 = '(*ESC*){unicode "2714"x}'
      ;
   value chcolor
      0   = red
      1-4 = green
      ;
   value bgcolor
      2 = 'cxffccbb'
      3 = 'cxffe177'
      4 = 'cxd4f8d4' 
      ;
run;
 
ods html path='c:\temp' file='comp_marix.html' style=Seaside;
ods escapechar='^';
title 'Data set columns comparison matrix';
 
proc odstext;
   p '<div align="center">Mismatch Legend:'||
     '<span style="background-color:#ffccbb;margin-left:17px">^_^_^_^_</span> Type'||
     '<span style="background-color:#ffe177;margin-left:17px">^_^_^_^_</span> Length'||
     '<span style="background-color:#d4f8d4;margin-left:17px">^_^_^_^_</span> Label</div>'
   / style=[fontsize=9pt];
run;
 
title; 
proc print data=comparison_matrix label noobs;
   var Name / style={fontweight=bold width=100px};
   var ds1 ds2 / style={color=chcolor. backgroundcolor=bgcolor. just=center fontweight=bold width=120px};
   format ds1 ds2 chmark.;
run;
 
ods html close;

Here is a brief explanation of the code:

  1. Two PROC CONTENTS produce alphabetical lists (as datasets) of the data table column names, as well as their attributes (type, length, label)
  2. The DATA STEP merges these 2 lists and creates DS1 and DS2 variables indicating common name (values 1, 2, 3, 4) or uncommon name (value 0).
  3. PROC FORMAT creates 3 user-defined formats chmark, chcolor, bgcolor responsible for checkmark shape, checkmark color, and background color respectively. For checkmark shape, we use Unicode characters, and for colors we use both, color names (e.g. red, green) and hexadecimal RGB color notations (e.g. 'cxFFCCBB').
  4. PROC ODSTEXT’s P statement is used to display color legend for the comparison matrix.
  5. Finally, PROC PRINT with user-defined formats produces our color-enhanced comparison matrix.

Data tables comparison matrix – OUTPUT

The above code will generate the following HTML output with the comparison matrix for variables in two data sets:

Comparison matrix for common/uncommon variables in 2 datasets

Adding more detail to the comparison matrix chart

We can further enhance our output comparison matrix by adding detailed descriptive information about differences between variable attributes. For comprehensive view, we can add a COMMENTS column that spells out differences (attributes mismatches). In addition to the hierarchical logic defining only one mismatch of the highest degree indicated by color highlighting above, comments can include all found discrepancies. Simply add the following two pieces of SAS code:

1. Add the following group of statements to the above DATA Step (right after SELECT statement):

 length Comments $200;
   if ds1>1 then
   do;
      if Typ1^=Typ2 then Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
      if Len1^=Len2 then Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
      if Lab1^=Lab2 then Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');
   end;

Depending on your needs this Comments can be added unconditionally – you would just need to remove IF-THEN logic keeping only:

length Comments $200;
Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');

2. Add the following statement to the above PROC PRINT (right before the FORMAT statement):

var comments / style={width=250px};

Then your HTML output will look as follows:

Detailed comparison matrix for common/uncommon variables in 2 datasets

Conclusion

Comparison matrix charts are a convenient tool for data development and metadata validation when you're comparing a data table’s metadata against requirements descriptions.

It allows us to quickly identify tables’ common and uncommon variables, as well as common variable inconsistencies by type, length and other attributes, such as labels and formats.

We can easily add detailed descriptive information when needed.

On a related note

While this post focused on visualizing SAS data sets comparison vis-à-vis common and uncommon columns, it's worth noting SAS websites have plenty of info on finding common variables (or columns) in data sets. For example:

Your thoughts?

Do you find this material useful? What other usages of the checklist tables and color-enhanced comparison matrices can you suggest?

Related blog post: How to create checklist tables in SAS®

Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

14 Comments

  1. This is a brilliant use of SAS abilities, should be a great comparison tool! Thank you very much Leonid for sharing this with us, your contribution is always greatly appreciated.

  2. Very nice article..... Now could this be applied to actual data rather than the structure. For example check content of all variables between two datasets for the same unique key? Think Proc Compare but visually

    • Leonid Batkhan

      Thank you, Colm, for your great question.

      While this blog post addresses comparison of the metadata (variable attributes) the method can be applied to the data as well (as you suggested). If you look at the code in this post, reference to the metadata only exists at the PROC CONTENTS steps. After that, metadata becomes just "data" (DS1, DS2) and it can be any data regardless whether it relates to the metadata or not. For example, if you want to compare two datasets (DS1, DS2) for unique key values of variable KEY you can still use similar logic:

      data comparison_matrix;
         merge
            DS1(in=in1)
            DS2(in=in2);
         by KEY;
      . . .
      

      However, for large datasets, visualization of all rows may not be practical as you would have to visually browse too much data. For large datasets, more practical solution would be to only keep those rows with different KEY using the following subsetting if statement:

         if in1 ne in2;
      

      and then outputting them (and possibly displaying) in a table.

      And, of course, you can wrap this code in a macro so you can apply it whenever you need to compare any key variable between 2 datasets.

      • Hi Leonid

        Thanks for your reply, and yes I agree that this would not be practical for large sets of data, but think about narrowing down the search. Imagine you have two datasets, one of the production and one the qc dataset. A programmer runs a proc compare and there are differences within some values for some parameters.

        Consider the below (hopefully it formats okay). A programmer could call a macro based on imputed values to pull out a section of the results.

        SUBJECT     TESTCD    VISIT         RESULT_MAIN    RESULT_QC     STATUS
        101-001     HEIGHT    SCREENING     183            184           ✘
        101-001     WEIGHT    SCREENING     76.4           77            ✘
        101-001     PULSE     SCREENING     90             90            ✔
        101-001     BMA       SCREENING     18             18            ✔
        

        So for the above, imagine a macro call such as:

        %showme(ds=VITALS, param=TESTCD, where=IF VISIT='SCREENING', show=SUBJECT TESTCD VISIT, resultvar=RESULT);

        The code would use the SHOW and RESULTVAR variables as essentially the key/unique vars and produce the above. I do this now but it shows a count of the values per the keys rather than a check/cross. Ultimately this would be really just a wait to produce a fancy report for perhaps less technical people in departments, more visual.

        Best

        Colm

        • Leonid Batkhan

          Thank you, Colm, for sharing your example. Indeed, it would have been a convincing visual to compare values. All you need to do is to calculate STATUS as

          STATUS = (RESULT_MAIN=RESULT_QC);
          

          Then STATUS will be equal to 1 when the results are equal, and equal to 0 when the results are not equal. Applying formats chmark and chcolor will display STATUS as ✘ and ✔ .

          An alternative could be instead of adding STATUS column, just displaying equal results on a light green background, while displaying non-equal results on a light-red background using technique described in this blog post.

  3. Henrik Forsell on

    Very nice example, thanks. (I made it into a autocall macro!) I have changed it a bit, Mainly it was the format bgcolor that I changed.The colours does not work with me as I'm slightly colourblind, so now it is red, orange and yellow. I also added 0 = pink for missing variables as I did not think that they popped out as much as they should.

  4. This is genius on so many levels! This fits my need (with some modifications & add-on) so well and I also learnt a great many things!
    Million thanks to you!

      • I was rerunning some old codes with new data and needed to compare the old vs new datasets. But since the output from proc compare is a bit too wordy and difficult to read(I am a SAS novice), I was looking for a more visual approach and this was a perfect starter and gave an idea.

        Next, I started coding for certain summaries printed by proc compare by making them a little more concise (e.g. print descriptive stats instead of row wise comparison for numeric variable). I also color coded the output to show differences in the 2 datasets (e.g. green for higher base value, red for higher compare value and yellow for equal)

        I made everything into a macro variable in which I only need to feed the base and compare dataset names, just like the proc compare. I didn't do much, but it certainly satisfies my requirement!

        Overall, my intention is to write an upgraded proc compare!

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top