As far as numbers go, the number zero is rather mysterious for data. Is it something or is it nothing?
What happens when you have missing data but enter 0?
This topic triggered an intriguing discussion in my recent Programming 2: Data Manipulation Techniques class. In this post I’d like to compare various industries to show you their universal take on the 0.
To start off, see if you agree with the analysts in different industries who entered zero in place of missing values below:
Can you have a 0 for blood glucose levels?
A health analyst is missing blood glucose level records for several patients and decides to enter 0 as a value. But health policy makers know that a 0 blood glucose level is highly unlikely as the patient would go into insulin shock and need immediate medical attention.
Can you have a 0% interest rate?
A customer is currently negotiating interest rates and nothing is confirmed yet. Meanwhile a mortgage specialist enters 0% with the intention to come back later and change it to a valid value. But he goes away on extended leave. What we have now is a very happy customer who scans her mortgage statements with glee thanking her stars for the day she met this banker. From the bank’s perspective however, banks might need to declare bankruptcy if they started doling money out without charging any interest.
Can you have a 0 degree Celsius temperature recording in Toronto on July 10th?
A weather specialist doesn’t have records for several cities on July 10th. She goes ahead and enters 0 for all the missing values. What she’s effectively done by entering a 0 value is to possibly mislead the general public. Just picture all the summer Toronto visitors who take this at face value and come prepared with warm clothing for a cold July!
Can you have 0 as the date a customer signed up for a home phone account?
A Telecom Sales Rep fills out a phone order for Joe Blow, very happy about completing an excellent sale that’s going to make Q2 results look terrific, but forgot to input the order date. He later realizes his mistake, but doesn’t remember the transaction date and decides to enter 0. But you know for calculations dates are stored as numbers-0 is really January 1, 1960 for SAS. When order dates are queried for the last quarter, Joe Blow would definitely not qualify in this list, making for a skewed reporting that Execs are not going to be happy about either.
With the questions posed above, I’m sure you can all understand how data wrongly stored as 0 can seriously compromise your reporting. If I were to succinctly sum up the distinction between missing and zero it would be probably look like this:
A missing value occurs when no data is available-just like the missing puzzle piece.
A Zero value occurs when data is available and it’s a valid value like the 0 degree Celsius temperature.
How can you display missing values as 0?
Ok, let’s look at flight data for rows where freight value is missing.
Now let’s modify this with an options statement so that missing values are set to 0.
The second piece of output should confirm that the missing value is just displayed as a zero but stored as missing. That’s why you still get the exact same observation returned in the second query. So this trick is a superficial one, it does nothing to change the internally stored value of missing.
Why, in the world of data can you not substitute 0 for a missing value?
Consider a human resource specialist looking at employee quarterly contributions. John Doe’s contributions look like this: $20, $0, $40, $30 – he didn’t contribute in Q2. What about Joe’s Average donations =90/4=22.5. Is that correct? I’m sure your answer is an emphatic NO as this division by four quarters throws off the count. Since he only contributed for three quarters his average needs to be calculated across three quarters only. So you would calculate 90/3=30.
Count on SAS to help with calculation of missing values. The MEAN function would be handy here, MEAN(q1,q2,q3,q4) ignores missing values in calculating the mean.
I hope this post gave you somewhat of a better appreciation of the nuances between missing and zero. John Mayer sings “Something's Missing” but doesn’t know why. I hope you do at the end of this post! I’d love to hear your comments on treatment of missing values, or anything in this blog that sparked a thought for you.
8 Comments
Thank Charu, very nice and important blog article. This can potentially bring in another problem - if one enters 0 in place of missing values by mistake and later we have to correct it, if the original data genuinely had few valid 0s, then it'll be almost impossible to replace the missing values with something like a mean if that is decided as a replacer for data cleansing. We won't be able to differentiate between the valid 0s and the missing values now.
Pingback: 3 gifts to magically transform your data - The SAS Training Post
I am more distressed by datasets where true zeroes are coded as missing. For example, if someone doesn't work their earnings are zero, not missing. I understand that their wage rate is missing, but why code earnings as zero? Perhaps because it reflects the path through the quesionaire? Still, it bothers me.
Yes, Thats right .
But instead of missing we can put the average value upto constant mean .
suppose let me tell you one example
i have sample with size 5 and 1 is missing there
4,8, . ,2,6
first find the mean
mean =20/5 = 4
put the mean as missing
so sample become 4,8,4,2,6
find mean = 24/5 = 4.8
sample become 4,8,4.8,2,6
same way find the mean until we get the constant mean
which is a imputation method in statistics for missing value.
thanks for your comment Cody. Since this is such a large & complex topic, I focussed this blog only on missing values being stored as Zero. Yes analysis gets thrown off when zeros are stored as missing in your well documented example. Maybe a future blog post.
You are ignoring the more common case, when SAS treats a value as missing which is actually a zero.
For example, I work with hospital discharge data. Often, there will be zero hospitalizations for a specific cause (say HIV infection, which is fairly uncommon) in a small county. PROC TABULATE will treat this as missing, rather than zero. Even using options such as PRELOADFORMAT and MISSTEXT still create the problem that the output dataset contains missing values when zero observations did occur.
In practice, a far more common data entry for missing is a value such as 99. These often induce another set of analysis problems (analyze all observations; analyze all observations where this key variable is not missing; analyze all observations where all variables not missing, even if variable not included in this specific analysis, etc.).
Thanks Rick, you are correct. I should have indicated that HR didn't have values but entered them as 0 for missing values. As you rightly pointed out if he didn't contribute in Q2 then the value is 0, if there are no available values then the value is missing. Thanks again!
In your example of John/Joe's average donation, his contribution for Q2 really is zero, not missing. His average quarterly contribution really is $22.5. However, if the records were lost for Q2 and HR doesn't know what the contribution was, then you can use a missing value and compute the average contribution based on the three contributions that are known.