Your data ain't good looking, but it has a nice cardinality

Data cardinality is an important topic for for data analysis. In mathematical terms, cardinality is simply the number of elements in a set. But in data applications, cardinality signifies the number of unique values for a given field of data. Related terms include number of levels (thank you, PROC FREQ and NLEVELS), percent unique, as well as discrete data -- data with a finite number of different values. (That's not to be confused with discreet data, which is when you're careful not to share too much personal information. For example, we can count the Kardashians in discrete increments. But that doesn't mean that they behave discreetly.)

But first, this

Since I've been working with cardinality recently, I've developed this persistent earworm that I just can't shake. I decided to write it down and present it to you. (You're welcome.)

(I've got) Cardinality (Apologies to Lloyd Price)
>> Click here for suggested tune.

Oh-oh-over and over
I'll iterate through each row
Over and over, tell me why it's so
Over and over, each value that's unique
But oh-oh-over and over
I wish I ran my PROC FREQ

'Cause I've got - (cardinality)
High - (cardinality) N - (cardinality)
Discrete - (cardinality) HUGE - (cardinality)
Unique - (cardinality)
'Cause I got a big data set
So over ... and over
There's a high percent unique
Well, well, well over and over
I wish my data were discrete

Cardinal importance of cardinality

SAS Visual Analytics -- which I've been using for recent projects -- makes a big deal out of cardinality. The displayed list of variables shows the cardinality values front-and-center (see the inset screenshot image). Knowing how many different values a column can take on should influence the type of visualization you design. For example, it might make sense to group a set of charts by GENDER, which typically has a cardinality of 2 (or at most 3, if "unknown" is permitted). However, it would make less sense to group your charts by CUSTOMER_ID when that field has a cardinality in the thousands.

Many SAS users already know how to calculate cardinality. sasCommunity.org hosts a useful reference on "Cardinality Ratio" (more-or-less curated by Ron Fehd), complete with sample programs, SAS macros, and explanations.

I have a variation of my own, developed mostly by my colleague Hans-Joachim Edert. It produces a report like this:

What can you learn from this report? In this example, we can see that Origin (where the model of car comes from) has just 3 possible values, making it less than 1% unique across the 428 records. That might make it a good grouping variable. Model, on the other hand, has a unique value for almost every record. The fact that it's so close to 100% unique might indicate a data quality problem. (Do we expect all Model names to be unique?) And the numeric variable Cylinders has only 8 different values, which might indicate that it's more of a category than a measure. (Would you want to SUM or AVERAGE the Cylinders across these records?)

Here's a link to the program code if you want to try it yourself. Here's a sample call to the macro in the program:

%getcardinality(SASHELP.CARS, WORK.CARDS, 1);

Coming up...

In a future article I'll present a SAS Enterprise Guide custom task that makes the "check cardinality" process easy -- no programming. (Here's a sneak peak.) But you do not have to wait to learn more about your data: you can already glean much of this (and more) by using the Characterize Data task.

7 Comments

Dave on October 16, 2013 4:10 pm

Cool! Thanks for sharing.
Pingback: How to store data about your data in your data - The SAS Dummy
Pingback: A custom task to check your data cardinality - The SAS Dummy
Ronald J. Fehd on October 22, 2013 11:13 am

Thanks for the name-cheque, Chris.

In my fall paper on this topic for the regionals I show code to create a variable using nlevels and cardinality ratio to identify the 3 major categories of variables: unique, discrete, and continuous.
Check this page which will soon have a calculation of the mean of CR,
which can be used for later list processing.

http://www.sascommunity.org/wiki/Cardinality_Ratio

Ron Fehd macro maven
- Carole Jesse on January 13, 2015 3:20 pm
  
  Hi Ron,
  Thanks for bringing light to this important concept. Both Chris and I picked up on your work. See 'props' (page 6) :-)
  http://minnsug.org/wrp/wp-content/uploads/2013/10/Beyond_NLEVELS_MinnSUG_20131017_CAJ_final.pdf
  Great work Ron.
  Carole
Carole Jesse on January 13, 2015 2:59 pm

Hi Chris,

Great post, albeit I am late to the party. Someone from MinnSUG recently sent me a link to your post from 10/16/2013, they had attended my presentation on the same subject on 10/17/2013.
They saw your post and made the connection a year later!

Interesting that we were both investigating this topic at the same time. Here is a link to my presentation from 10/17/2013:

http://minnsug.org/wrp/wp-content/uploads/2013/10/Beyond_NLEVELS_MinnSUG_20131017_CAJ_final.pdf

The biggest issue I run into is when FREQ fails due to data size and field granularity. Additionally, I like to track %missing in such a report, so I add that into the mix. Yes, it's great to know that there are missing values (misslevels=1), but how sparse is the field exactly (percentmissing) ?

Best regards, and great post!
Carole Jesse
Ronald J. Fehd on October 17, 2016 10:30 pm

Here is my 2016 update on cardinality ratio and cardinality-type in (few,many,unique)

tags= code book, data dictionary, extended attributes

Blogs