The other day I encountered a SAS Knowledge Base article that shows how to count the number of missing and nonmissing values for each variable in a data set. However, the code is a complicated macro that is difficult for a beginning SAS programmer to understand. (Well, it was hard for me to understand!) The code not only counts the number of missing values for each variable, but also creates a SAS data set with the complete results. That's a nice bonus feature, but it contributes to the complexity of the macro.
This article simplifies the process and shows an alternative way to count the number of missing and nonmissing values for each variable in a data set.
The easy case: Count missing values for numeric variables
If you are only interested in the number of missing values for numeric variables, then a single call to the MEANS procedure computes the answer:
/* create sample data */ data one; input a $ b $ c $ d e; cards; a . a 1 3 . b . 2 4 a a a . 5 . . b 3 5 a a a . 6 a a a . 7 a a a 2 8 ; run; proc means data=one NMISS N; run; |
In many SAS procedures, including PROC MEANS, you can omit the VAR statement in order to operate on all relevant variables. For the MEANS procedure, "relevant" means "numeric."
Count missing values for all variables
The MEANS procedure computes statistics for numeric variables, but other SAS procedures enable you to count the number of missing values for character and numeric variables.The FREQ procedure is a SAS workhorse that I use almost every day. To get the FREQ procedure to count missing values, use three tricks:
- Specify a format for the variables so that the missing values all have one value and the nonmissing values have another value. PROC FREQ groups a variable's values according to the formatted values.
- Specify the MISSING and MISSPRINT options on the TABLES statement.
- Use the _CHAR_ and _NUMERIC_ keywords on the TABLES statement to specify that the FREQ procedure should compute statistics for all character or all numeric variables.
The following statements count the number of missing and nonmissing values for every variable: first the character variables and then the numeric ones.
/* create a format to group missing and nonmissing */ proc format; value $missfmt ' '='Missing' other='Not Missing'; value missfmt . ='Missing' other='Not Missing'; run; proc freq data=one; format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */ tables _CHAR_ / missing missprint nocum nopercent; format _NUMERIC_ missfmt.; tables _NUMERIC_ / missing missprint nocum nopercent; run; |
Using the SAS/IML language to count missing values
In the SAS/IML Language, you can use the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22. Strictly speaking, you need to use only one of the functions, since the result of the other is determined by knowing the number of observations in the data set. For the sake of the example, I'll be inefficient and use both of the functions.
As is the case for the PROC FREQ example, the trick is to use the _CHAR_ and _NUM_ keywords to read in and operate on the character and numeric variables in separate steps:
proc iml; use one; read all var _NUM_ into x[colname=nNames]; n = countn(x,"col"); nmiss = countmiss(x,"col"); read all var _CHAR_ into x[colname=cNames]; close one; c = countn(x,"col"); cmiss = countmiss(x,"col"); /* combine results for num and char into a single table */ Names = cNames || nNames; rNames = {" Missing", "Not Missing"}; cnt = (cmiss // c) || (nmiss // n); print cnt[r=rNames c=Names label=""]; |
This is similar to the output produced by the macro in the SAS Knowledge Base article. You can also write the cnt matrix to a data set, if necessary:
create MissOut from cnt[r=rNames c=Names]; append from cnt[r=rNames]; close; |
WANT MORE GREAT INSIGHTS MONTHLY? | SUBSCRIBE TO THE SAS TECH REPORT
46 Comments
Nice tip. Could I suggest that for the numeric format, it's a good idea to treat Special Missing values as missing as well? So:
value missfmt
._ - .Z = 'Missing'
other='Not Missing'
;
I think special missing values are very under-utilized. Like to encourage their use as much as possible. For anyone interested, I wrote a NESUG paper a few years back.
Regards,
--Q.
Sometimes special missings (or a subset thereof) are used as vehicles for non-quantitative information, and should be counted with the non-missings.
Great post -- I really learned a lot from it. However, I feel that PROC IML may be memory- demanding for this job. I ran a test and the result is http://www.sasanalysis.com/2011/09/test-to-count-missing-values-for-large.html
Thanks for the subsequent analysis. Yes, I didn't write the SAS/IML program to be competitive for large data sets. PROC FREQ, which only passes through the data once, should always beat SAS/IML. Here's one way to rewrite the SAS/IML program so that it doesn't run out of memory for 1e8 observations. See if this code is more competitive with PROC FREQ:
Hi Rick:
if I want to use this program as a macro , how to avoid the log error once input dataset only has numeric variables or character variables, or dataset contains zero records?
The first step of your macro would be to count the number of character variables, the number of numeric variables, and the number of observations. You would then write macro code to handle the degenerate cases. For example:
If you have further questions, please post to the SAS Support Communities.
I think we should add "if 0 then" in front of the "set " , You can compare the result
data _NULL_;
if 0 then set have nobs=nobs;
array char[*} $ _CHAR_;
array num[*} _NUMERIC_;
call symputx("numChar", dim(char));
call symputx("numNumer", dim(num));
call symputx("numObs", nobs);
stop;
run;
Pingback: Count missing values in observations - The DO Loop
And now a few years later from when this post was written, SAS adds the CMISS function in SAS 9.2
http://support.sas.com/kb/36/480.html
;)
Hi Jared,
The CMISS function in the DATA step is an awesome function. It was developed before my blog posts. I wrote about it in an article about counting missing values in observations and linked to the KB article to which you refer. I also used it in my book on statistical programming (published 2010). The COUNTMISS function is useful because it returns a vector, whereas calling CMISS on a matrix returns a 0/1 matrix and you then need to sum up the 1's in that matrix.
Hi Rick,
Your proc iml code is great for getting all the missing information into one table - thanks!.
But is there a way to have the table turned around so that missing and not missing are columns and each variable is a row? When you have a lot of variables it would be much easier to read that way.
Liz
Sure. Use the T function, as shown in this blog post.
Hi, I am a beginner in SAS and need help in counting the number of missing and non-missing values at each column level. I am looking at your second solution but not sure how to create a output dataset with the proc freq result. I need these count details for further processing.. appreciate your help..
Welcome to SAS. That's a good question. The answer appears to be more complicated than I can easily describe here. Post questions like this to the SAS Support Communities. (I'd ask this question in the SAS procedure community.) Lots of very smart and very helpful people there, and a great resource for someone new to SAS.
Hi all,
I see different kind of problem like output format (e.g. program with format), kind of variables (char or num),.. so I decided to do a program that count all over the variables, all the missing value.
Enjoy:
proc contents data=&INPUT..&TAB.
out=cont_entity(keep=name)
noprint;
run;
proc sql noprint;
select count(*) into:NOBS
from cont_entity
;
quit;
%let dsId = %Sysfunc(open(cont_entity));
%let N_NOME_CAMPO = %Sysfunc(varnum(&dsId.,name));
%let I=1;
proc sql;
create table stat_missing as
select
%do %while (%Sysfunc(fetch(&dsId.)) = 0);
%let NOME_CAMPO = %Sysfunc(getvarc(&dsId.,&N_NOME_CAMPO));
%if &I ne &NOBS %then %do;
coalesce(sum(missing(&NOME_CAMPO.)),0) as &NOME_CAMPO. ,
%end;
%else %do;
coalesce(sum(missing(&NOME_CAMPO.)),0) as &NOME_CAMPO.
%end;
%let I= %eval(&I + 1);
%end;
from &INPUT..&TAB.
;
quit;
%let RC= %sysfunc(close(&DsId));
Data New;
input ID month;
cards;
001 3
002 5
001 .
001 10
001 .
001 .
003 12
005 11
006 1
005 .
005 .
;
proc print noobs;
run;
Hi,
Can any one help me to find out the count of number of missing values by variable ID in above dataset.
Many Thanks
Rabi Shaw
You can ask questions like this at the SAS Support Communities:
proc means data=new NMISS;
class ID;
var month;
run;
Hi, I wanted to output the matrix with missing and non missing counts to a dataset. I added an extra line to the code so that this could be done but my output dataset don't seem to have any variable values, only variable headers. I'm attaching my code here.
proc iml;
use Semi_cleaned nobs nobs;
read point 1 var _NUM_ into x[colname=nNames];
read point 1 var _CHAR_ into x[colname=cNames];
Names = nNames || cNames;
nmiss = j(1,ncol(Names));
n = nmiss;
do i = 1 to ncol(Names);
read all var (Names[i]) into x;
nmiss[i] = countmiss(x,"col");
n[i] = nobs - nmiss[i];
end;
close Semi_cleaned;
rNames = {" Missing", "Not Missing"};
cnt = nmiss // n;
create test from cnt[colname=Names rowname=rNames];
print cnt[r=rNames c=Names label=""];
quit;
Thanks
You can ask questions like this at the SAS/IML Support Community.
In your case, you need to use the APPEND statement to actually write the data:
append from cnt[rowname=rNames];
Thank you for this interesting post.
I want to add the rownames as a new variable in the sas dataset (one column with the two modalities, "Missing" and "not missing". However I don't understand it is possible with append ?
Thank you
Pingback: Examine patterns of missing data in SAS - The DO Loop
Hi, Rick:
It is a great idea to count missing number, especially using format in proc freq step.However, i have a little confused about the use of MISSING and MISSPRINT options after TABLES/. MISSING option may suppress MISSPRINT option.
After this format use, all missing and non-missing values are re-assigned with "Missing" or "Not Missing". Then why we have to keep options of MISSING and MISSPRINT.
Best Regards
Don
You are correct: If all (formatted) values are nonmussing, then you do not need to use the MISSING and MISSPRINT options. They do nothing. However, I kept them because they are useful options to know about when you want to tabulate the number of nonmissing categories AND the number of missing values.
Thank you!!!!!
Pingback: Use SAS formats to bin numerical variables - The DO Loop
HI, I am trying to find out the missing patterns in a dataset. But I could not find a working code for that. Following are the sample dataset which I used for my coding.
Area Alchohol_Consumbtion Gender Weather_Condition
1 0 1 1
1 0 1 1
1 0 1 1
2 0 1 0
2 1 1 1
I need to find out the missing patterns in each Area.
I used
proc mi data=AB nimpute=0;
var Segment_No Age Gender Weather_Condition;
run;
but it takes Segment_No as a variable. If I remove the Segment_No, it gives only missing patterns, but do not provide the missing patterns ion each segment.
If you can give me a solution, I would be more thankful to you.
Your data and syntax do not match. To ask questions about SAS programming, post your data and code at the SAS Statistical Support Community.
The variable that you have defined in the dataset doesn't match the variables that you have defined in VAR statement like (Segment_No Age) doesn't available in the defined table.
I don't understand your statement. In PROC FREQ, the _CHAR_ and _NUMERIC_ keywords mean "use all character variables" and "use all numeric variables." In PROC IML I use the _CHAR_ and _NUM_ keywords similarly.
Here is my problem and I am hoping somebody can answer this.
I have an Oracle table that has 127 million records and 200 variables. Wrote a proc freq for _char_and _num_ to count missing and non-missing. And then used some manipulation to create a nice report that has variables and the missing and non-missing information. One cut and paste into Excel and my report is ready. But, I did it on a random sample of 5 million records. Analyst wants all 127 million records in the report. Proc Freq is timing out.
Pass-thru SQL is the way out, me thinks. But I am not sure nmiss or cmiss will work in a pass-thru SQL. Any thoughts?
Thanks
You can ask questions like this at the SAS Support Communities for Base Procedures.
Can the proc iml method be used to generate nmissing stratified by a categorical variable? Thank you!
Yes. Use the UNIQUE-LOC technique or the UNIQUEBY technique to stratify the data by a categorical variable.
Hi
please someone helps me.
i ma analyzing a data for a research but i confronted a problem which is I have 258 observations but when i conducted the proc i found out that the observations are only 257. I tried many ways to figure out the reason.
so please help me
Best Regards
You can ask questions like this at the SAS support communities. The most likely reason is that one of the variables in your model contains a missing value. SAS regression procedures do not include observations that contain a missing value when fitting the model. Most regression procedures display an 'NObs' table that displays how many variables were read and used in the analysis. If this explanation does not make sense, post your question and code to the Support Communities.
My technique (for numeric variables!) is to use the CALCULATED feature of SAS Proc SQL.
In this case, I want to count the missing/non-missing values of the variable 'ratio_girls_to_boys' by country.
The data set is 'ratio_girls_to_boys_long', which contains values by country*year,
proc sql;
create table ratio_counts_by_country_raw as SELECT
country,
count(country) as N_Obs,
count(ratio_girls_to_boys) as N,
CALCULATED N_Obs - CALCULATED N as N_Miss
from ratio_girls_to_boys_long
group by country
order by country
;
quit;
Hello, is there a way to output all variables from the data into rows and then the missing and not missing values in columns. transposing output of this part?
proc iml;
use one;
read all var _NUM_ into x[colname=nNames];
n = countn(x,"col");
nmiss = countmiss(x,"col");
read all var _CHAR_ into x[colname=cNames];
close one;
c = countn(x,"col");
cmiss = countmiss(x,"col");
/* combine results for num and char into a single table */
Names = cNames || nNames;
rNames = {" Missing", "Not Missing"};
cnt = (cmiss // c) || (nmiss // n);
print cnt[r=rNames c=Names label=""];
cntTrans = T(cnt);
print cntTrans[c=rNames r=Names label=""];
Thank you!! It worked! :)
Hi Rick
I was going through the blog ..it's very informative. Thanks for all the answers..
In the above code if I have multiple id and i want to identify missing vs non missing within each id column wise for numeric and character variable
What would be the additional function i need to mention ?
Similar to class statement in proc mean is there any function that I can use for the same?
Sort the data by your ID variables, then list the ID variables on the BY statement in PROC FREQ. You'll get results for each distinct combination of your ID variables.
Hi Rick,
with the code provided, if _ALL_was used instead of _numeric_ or _char_ the percentage and counts of missing /non missing for each variable in the dataset are generated at once.
How would you save the outputs in a sas dataset without using an ODS statement. Output option is not working.
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt . ='Missing' other='Not Missing';
run;
proc freq data=one;
tables _all_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt. _CHAR_ $missfmt.;
run;
When you say "output option is not working," I assume that you mean the OUT= option on the TABLES statement does not write all the tables. Yes, that is why I switched to the PROC IML solution. To get the output from PROC FREQ into a SAS data set requires using multiple TABLES statements, which is what the KnowledgeBase article shows how to do. The link is in the first sentence of the article.
Thank you Rick, you have given so many helpful tips here in this blog. Would appreciate one more help:
With the code provided, I understand we use IML to output in dataset. However can we create a excel or pdf output using ODS with all the variables in row and 4 columns - variable name, missing count, non-missing count and percentage missing. and What if I need to order the output with the sequence of questions asked in the questionnaire?
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt . ='Missing' other='Not Missing';
run;
proc freq data=one;
tables _all_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt. _CHAR_ $missfmt.;
run;
I believe the answer to your question is yes. Post your data and program to the SAS Support Communities, which has many experts who can work with you to answer all your questions.