How to compare SAS data tables for common/uncommon columns

By Leonid Batkhan on SAS Users December 9, 2019 Topics | Data Visualization Learn SAS Programming Tips

Building on my last post, How to create checklist tables in SAS®, this one shows you how to compare SAS data Check mark and cross mark sets that include common and uncommon columns. You'll learn how to visualize side-by-side columns commonalities and differences in data tables.

As before, we're working with a comparison matrix (aka checklist table) where check-marks / x-marks indicate included / excluded columns.

Data tables will be comparable products while their columns (variables) will represent product features. We'll add background color to highlight which attributes are different in the common columns. Since there might be several different attributes for a given column, we will use a hierarchy type➜length➜label to indicate only the highest mismatched level of hierarchy. For example:

If same-named columns have different type (Numeric vs. Character), their corresponding check-mark will be shown on a light-red background, which indicates the highest degree of mismatch.
If same-named columns have the same type, a yellow background will indicate any difference in variables length.
When same-named variables type and length match, a light-blue background marks any difference in variables label.

SAS code to create color-enhanced comparison matrix

Let’s compare variable attributes in two data tables: one is SAS-supplied SASHELP.CARS, and another WORK.NEWCARS that I derive from the first one, slightly scrambling its column definitions:

data WORK.NEWCARS (drop=temp:);
   set SASHELP.CARS (rename=(Origin=Region EngineSize=temp1 Make=temp2));
   length EngineSize $3 Make $20;
   EngineSize = put(temp1,3.1);
   Make = temp2; 
   label Type='New Car Type';
run;

In this NEWCARS data table, I did the following:

Replaced column name Origin with Region
Changed type of column EngineSize from Numeric to Character
Changed length of column Make from $13 to $20
Changed label of column Type from blank to “New Car Type”

Now let’s build the comparison matrix:

proc contents data=SASHELP.CARS noprint out=DS1(keep=Name Type Length Label);
run;
 
proc contents data=WORK.NEWCARS noprint out=DS2(keep=Name Type Length Label);
run;
 
data comparison_matrix;
   merge
      DS1(in=in1 rename=(Type=Typ1 Length=Len1 Label=Lab1))
      DS2(in=in2 rename=(Type=Typ2 Length=Len2 Label=Lab2));
   by Name;
 
   /* set symbol shape: 1=V; 0=X */
   ds1 = 1; ds2 = 1;
   if in1 and not in2 then ds2 = 0; else
   if in2 and not in1 then ds1 = 0;
 
   /* add background color */
   if ds1=ds2=1 then
   select;
      when(Typ1^=Typ2) do; ds1=2; ds2=2; end;
      when(Len1^=Len2) do; ds1=3; ds2=3; end;
      when(Lab1^=Lab2) do; ds1=4; ds2=4; end;
      otherwise; 
   end;
 
   label
      Name = 'Column Name'
      ds1 = 'SASHELP.CARS'
      ds2 = 'WORK.NEWCARS'
      ;
run;
 
proc format;
   value chmark
      0   = '(*ESC*){unicode "2718"x}'
      1-4 = '(*ESC*){unicode "2714"x}'
      ;
   value chcolor
      0   = red
      1-4 = green
      ;
   value bgcolor
      2 = 'cxffccbb'
      3 = 'cxffe177'
      4 = 'cxd4f8d4' 
      ;
run;
 
ods html path='c:\temp' file='comp_marix.html' style=Seaside;
ods escapechar='^';
title 'Data set columns comparison matrix';
 
proc odstext;
   p '<div align="center">Mismatch Legend:'||
     '<span style="background-color:#ffccbb;margin-left:17px">^_^_^_^_</span> Type'||
     '<span style="background-color:#ffe177;margin-left:17px">^_^_^_^_</span> Length'||
     '<span style="background-color:#d4f8d4;margin-left:17px">^_^_^_^_</span> Label</div>'
   / style=[fontsize=9pt];
run;
 
title; 
proc print data=comparison_matrix label noobs;
   var Name / style={fontweight=bold width=100px};
   var ds1 ds2 / style={color=chcolor. backgroundcolor=bgcolor. just=center fontweight=bold width=120px};
   format ds1 ds2 chmark.;
run;
 
ods html close;

Here is a brief explanation of the code:

Two PROC CONTENTS produce alphabetical lists (as datasets) of the data table column names, as well as their attributes (type, length, label)
The DATA STEP merges these 2 lists and creates DS1 and DS2 variables indicating common name (values 1, 2, 3, 4) or uncommon name (value 0).
PROC FORMAT creates 3 user-defined formats chmark, chcolor, bgcolor responsible for checkmark shape, checkmark color, and background color respectively. For checkmark shape, we use Unicode characters, and for colors we use both, color names (e.g. red, green) and hexadecimal RGB color notations (e.g. 'cxFFCCBB').
PROC ODSTEXT’s P statement is used to display color legend for the comparison matrix.
Finally, PROC PRINT with user-defined formats produces our color-enhanced comparison matrix.

Data tables comparison matrix – OUTPUT

The above code will generate the following HTML output with the comparison matrix for variables in two data sets:

Adding more detail to the comparison matrix chart

We can further enhance our output comparison matrix by adding detailed descriptive information about differences between variable attributes. For comprehensive view, we can add a COMMENTS column that spells out differences (attributes mismatches). In addition to the hierarchical logic defining only one mismatch of the highest degree indicated by color highlighting above, comments can include all found discrepancies. Simply add the following two pieces of SAS code:

1. Add the following group of statements to the above DATA Step (right after SELECT statement):

 length Comments $200;
   if ds1>1 then
   do;
      if Typ1^=Typ2 then Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
      if Len1^=Len2 then Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
      if Lab1^=Lab2 then Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');
   end;

Depending on your needs this Comments can be added unconditionally – you would just need to remove IF-THEN logic keeping only:

length Comments $200;
Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');

2. Add the following statement to the above PROC PRINT (right before the FORMAT statement):
var comments / style={width=250px};

Then your HTML output will look as follows:

Conclusion

Comparison matrix chart is a convenient data visualization tool useful for data stewardship, development and metadata validation when you're comparing a data table’s metadata against requirements descriptions.

It allows us to quickly identify tables’ common and uncommon variables, as well as common variable inconsistencies by type, length and other attributes, such as labels and formats.

We can easily add detailed descriptive information when needed.

On a related note

While this post focused on visualizing SAS data sets comparison vis-à-vis common and uncommon columns, it's worth noting SAS websites have plenty of info on finding common variables (or columns) in data sets. For example:

Determining variables that data sets have in common (SAS Note)
An easy way to find variables in common to data sets (Blog post)
Find variables common to multiple data sets (Blog post)

Your thoughts?

Do you find this material useful? What other usages of the checklist tables and color-enhanced comparison matrices can you suggest?

About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

32 Comments

Hamed on April 17, 2024 7:52 am

Does your code account for uppercase variations? for example, if we have STATE in one dataset and State in another dataset. If not, how can I modify the code to consider these two the same column name?
Thanks!
- Leonid Batkhan on April 17, 2024 11:00 am
  
  Great question, Hamed. The way PROC CONTENTS creates its output depends on the SAS system option VALIDAVARNAME. If you want your code to treat columns STATE and State as the same name, you can have the following statement before PROC CONTENTS:
  options validvarname=upcase;
Doss on September 22, 2021 12:58 pm

Hi Leonid,

I want to do validation between 2 datasets based on every column values. It has more than 200 columns and millions of records. Kindly need ur suggestion code
- Leonid Batkhan on September 22, 2021 1:27 pm
  The quickest and simplest way to make sure your datasets are equal (I assume that is what you mean by "to do validation between 2 datasets") is to use PROC COMPARE, e.g.
```
proc compare base=DATASET1 compare=DATASET2;
run; 
```
- Poundoss on September 27, 2021 7:36 am
  
  Thanks Leonid. My concern is how to highlight a values of each variable in both tables that looks different. Eg id in ds1 is 1, in ds2 if it's different 01 then how to highlight the difference. Kindly suggest a code for this
  - Leonid Batkhan on September 27, 2021 11:27 am
    You can use PROC COMPARE with NOPRINT and OUT= options, for example
    
    proc compare base=DATASET1 compare=DATASET2 noprint out=DATADIFS; run;
    
    NOPRINT will suppress printed report and your variable value differences will be captured in the dataset DATADIFFS. Then you can use that dataset to conditionally apply color-coding technique described in this blog post.
    - Poundoss on October 1, 2021 9:25 am
      
      Thanks Leonid. Few concerns, columns more than 100 in 2 tables from Hadoop needs to be transposed and dynamically need to check and display sample values in output, for 3 scenarios. In table A, not in B, in B not in A, by macro. Based on key column, will join 2 tables and need to display the values column by column in output that are mismatch in both tables. How do I do this?
      - Leonid Batkhan on October 1, 2021 12:16 pm
        
        For comparing column values your best bet would be using PROC COMPARE. You can also place your problems/questions on SAS Communities at https://communities.sas.com/.

Shoaib shaikh on April 30, 2021 4:28 am

I want to compare two data sets using macro on some condition like first it will check names of columns, and 2nd it will check order of column in both, if matched then compare files. How do I do this?

Leonid Batkhan on April 30, 2021 9:57 am

Hi Shoaib,
For your scenario, column order (position) does not matter. PROC COMPARE will compare two data tables by variable names, not by their positions.

Shoaib shaikh on May 3, 2021 6:33 am

But for my task I am not allowed to use proc compare, instead I should use parametrized macro.

Leonid Batkhan on May 3, 2021 9:38 am

Hi Shoaib, regardless of what you are allowed to use, here is a code example showing how you can compare column names and their orders in two data tables SASLIB1.DATA1 and SASLIB2.DATA2:

proc sql noprint;
   select name into : vlist1 separated by ' '
   from SASHELP.VCOLUMN
   where libname='SASLIB1' and memname='DATA1';
 
   select name into : vlist2 separated by ' '
   from SASHELP.VCOLUMN
   where libname='SASLIB2' and memname='DATA2'; 
quit;
 
%put &=vlist1;
%put &=vlist2;
 
%if &vlist1=&vlist2 %then
%do;
   %put COLUMN ORDER MATCHED;
%end;
%else
%do;
   %put COLUMN ORDER DOES NOT MATCH;
%end;

You can adapt this to your program accordingly. Hope, this helps.

Harsheel Rathod on April 6, 2021 9:37 pm

Hi Leonid, This is a great example to learn. However, I am stuck with creating my report:

I need to create a similar Status check report, but I have to analyze the entire variable. So say for example the entire X variable is blank then I should have a Green tick mark in my table. Can you please help with how to do that?
- Leonid Batkhan on April 7, 2021 9:30 am
  
  Hi Harsheel,
  Thank you for reaching out. In order for me to be able helping you could you please clarify your task a little more? Do you need to flag each variable that has blank values across all observations in a table? Do you need to display all variables or only those with all blank values? How about numeric missing values? Do you need to cover them as well? Please provide a detail description of what and why you are trying to do as well as the context of this report.
Elena on April 2, 2021 11:36 am

Nice, thank you! I created some time ago code for comparing my data sets. The table was quite wide since included all info about old metadata and new one, plus comparing. I used proc sql (instead of merge in data step), my colorful table was produced by proc report, styles I used conditionally in macro (instead of proc format). I like your neat result! In my current project comparing data sets is the first stage before report about comparing tables by values (deleted, inserted or modified).
- Leonid Batkhan on April 2, 2021 11:50 am
  
  Great! Thank you for your feedback, Elena. I am glad that you find this useful, and even more so if you are going to use it in a real-life project.
  - Elena on April 2, 2021 11:59 am
    
    Yes, I've decided to use big part of your code - I definitely like merging. You manually feel what's going on when combining data, and can pull out needed variables which are "failed". 🙂
    - Leonid Batkhan on April 2, 2021 12:15 pm
      
      👍
      Please keep sharing good stories!
Sarthak M on July 26, 2020 12:43 pm

This is genius on so many levels! This fits my need (with some modifications & add-on) so well and I also learnt a great many things!
Million thanks to you!
- Leonid Batkhan on July 27, 2020 8:51 am
  
  Thank you, Sarthak, for your feedback. Would you mind sharing your use case, modifications and add-on?
  - Sarthak M on July 27, 2020 1:39 pm
    
    I was rerunning some old codes with new data and needed to compare the old vs new datasets. But since the output from proc compare is a bit too wordy and difficult to read(I am a SAS novice), I was looking for a more visual approach and this was a perfect starter and gave an idea.
    
    Next, I started coding for certain summaries printed by proc compare by making them a little more concise (e.g. print descriptive stats instead of row wise comparison for numeric variable). I also color coded the output to show differences in the 2 datasets (e.g. green for higher base value, red for higher compare value and yellow for equal)
    
    I made everything into a macro variable in which I only need to feed the base and compare dataset names, just like the proc compare. I didn't do much, but it certainly satisfies my requirement!
    
    Overall, my intention is to write an upgraded proc compare!
    - Leonid Batkhan on July 27, 2020 3:46 pm
      
      Great, I am glad this blog post sparked your creativity !
Henrik Forsell on January 7, 2020 7:48 pm

Very nice example, thanks. (I made it into a autocall macro!) I have changed it a bit, Mainly it was the format bgcolor that I changed.The colours does not work with me as I'm slightly colourblind, so now it is red, orange and yellow. I also added 0 = pink for missing variables as I did not think that they popped out as much as they should.
- Leonid Batkhan on January 7, 2020 11:38 pm
  
  Great! I am glad you adapted it for your purpose.
Colm Smyth on January 3, 2020 11:05 am

Very nice article..... Now could this be applied to actual data rather than the structure. For example check content of all variables between two datasets for the same unique key? Think Proc Compare but visually
- Leonid Batkhan on January 3, 2020 11:43 am
  Thank you, Colm, for your great question.
  
  While this blog post addresses comparison of the metadata (variable attributes) the method can be applied to the data as well (as you suggested). If you look at the code in this post, reference to the metadata only exists at the PROC CONTENTS steps. After that, metadata becomes just "data" (DS1, DS2) and it can be any data regardless whether it relates to the metadata or not. For example, if you want to compare two datasets (DS1, DS2) for unique key values of variable KEY you can still use similar logic:
```
data comparison_matrix;
   merge
      DS1(in=in1)
      DS2(in=in2);
   by KEY;
. . .
```
  However, for large datasets, visualization of all rows may not be practical as you would have to visually browse too much data. For large datasets, more practical solution would be to only keep those rows with different KEY using the following subsetting if statement:
```
   if in1 ne in2;
```
  and then outputting them (and possibly displaying) in a table.
  
  And, of course, you can wrap this code in a macro so you can apply it whenever you need to compare any key variable between 2 datasets.
  - Colm Smyth on January 5, 2020 6:31 am
    Hi Leonid
    
    Thanks for your reply, and yes I agree that this would not be practical for large sets of data, but think about narrowing down the search. Imagine you have two datasets, one of the production and one the qc dataset. A programmer runs a proc compare and there are differences within some values for some parameters.
    
    Consider the below (hopefully it formats okay). A programmer could call a macro based on imputed values to pull out a section of the results.
    
    SUBJECT TESTCD VISIT RESULT_MAIN RESULT_QC STATUS 101-001 HEIGHT SCREENING 183 184 ✘ 101-001 WEIGHT SCREENING 76.4 77 ✘ 101-001 PULSE SCREENING 90 90 ✔ 101-001 BMA SCREENING 18 18 ✔
    
    So for the above, imagine a macro call such as:
    
    %showme(ds=VITALS, param=TESTCD, where=IF VISIT='SCREENING', show=SUBJECT TESTCD VISIT, resultvar=RESULT);
    
    The code would use the SHOW and RESULTVAR variables as essentially the key/unique vars and produce the above. I do this now but it shows a count of the values per the keys rather than a check/cross. Ultimately this would be really just a wait to produce a fancy report for perhaps less technical people in departments, more visual.
    
    Best
    
    Colm
    - Leonid Batkhan on January 5, 2020 12:51 pm
      
      Thank you, Colm, for sharing your example. Indeed, it would have been a convincing visual to compare values. All you need to do is to calculate STATUS as
      
      STATUS = (RESULT_MAIN=RESULT_QC);
      
      Then STATUS will be equal to 1 when the results are equal, and equal to 0 when the results are not equal. Applying formats chmark and chcolor will display STATUS as ✘ and ✔ .
      
      An alternative could be instead of adding STATUS column, just displaying equal results on a light green background, while displaying non-equal results on a light-red background using technique described in this blog post.
Susan Slaughter on December 13, 2019 2:00 pm

This is a great example of a practical solution to a common problem. Thanks!
- Leonid Batkhan on December 13, 2019 3:33 pm
  
  Thank you, Susan! Maybe that "example of a practical solution to a common problem" will make it to your next edition of The Little SAS Book 🙂
Rafi Rahi on December 10, 2019 10:35 pm

This is a brilliant use of SAS abilities, should be a great comparison tool! Thank you very much Leonid for sharing this with us, your contribution is always greatly appreciated.
- Leonid Batkhan on December 11, 2019 10:25 am
  
  Thank you, Rafi, for your nice comment!

Blogs