Count the number of missing values for each variable

30

The other day I encountered a SAS Knowledge Base article that shows how to count the number of missing and nonmissing values for each variable in a data set. However, the code is a complicated macro that is difficult for a beginning SAS programmer to understand. (Well, it was hard for me to understand!) The code not only counts the number of missing values for each variable, but also creates a SAS data set with the complete results. That's a nice bonus feature, but it contributes to the complexity of the macro.

This article simplifies the process and shows an alternative way to count the number of missing and nonmissing values for each variable in a data set.

The easy case: Count missing values for numeric variables

If you are only interested in the number of missing values for numeric variables, then a single call to the MEANS procedure computes the answer:

/* create sample data */
data one;
  input a $ b $ c $ d e;
cards;
a . a 1 3
. b . 2 4
a a a . 5
. . b 3 5
a a a . 6
a a a . 7
a a a 2 8
;
run;
 
proc means data=one NMISS N; run;

In many SAS procedures, including PROC MEANS, you can omit the VAR statement in order to operate on all relevant variables. For the MEANS procedure, "relevant" means "numeric."

Count missing values for all variables

The MEANS procedure computes statistics for numeric variables, but other SAS procedures enable you to count the number of missing values for character and numeric variables.

The FREQ procedure is a SAS workhorse that I use almost every day. To get the FREQ procedure to count missing values, use three tricks:

  1. Specify a format for the variables so that the missing values all have one value and the nonmissing values have another value. PROC FREQ groups a variable's values according to the formatted values.
  2. Specify the MISSING and MISSPRINT options on the TABLES statement.
  3. Use the _CHAR_ and _NUMERIC_ keywords on the TABLES statement to specify that the FREQ procedure should compute statistics for all character or all numeric variables.

The following statements count the number of missing and nonmissing values for every variable: first the character variables and then the numeric ones.

/* create a format to group missing and nonmissing */
proc format;
 value $missfmt ' '='Missing' other='Not Missing';
 value  missfmt  . ='Missing' other='Not Missing';
run;
 
proc freq data=one; 
format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */
tables _CHAR_ / missing missprint nocum nopercent;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / missing missprint nocum nopercent;
run;

Using the SAS/IML language to count missing values

In the SAS/IML Language, you can use the COUNTN and COUNTMISS functions that were introduced in SAS/IML 9.22. Strictly speaking, you need to use only one of the functions, since the result of the other is determined by knowing the number of observations in the data set. For the sake of the example, I'll be inefficient and use both of the functions.

As is the case for the PROC FREQ example, the trick is to use the _CHAR_ and _NUM_ keywords to read in and operate on the character and numeric variables in separate steps:

proc iml;
use one;
read all var _NUM_ into x[colname=nNames]; 
n = countn(x,"col");
nmiss = countmiss(x,"col");
 
read all var _CHAR_ into x[colname=cNames]; 
close one;
c = countn(x,"col");
cmiss = countmiss(x,"col");
 
/* combine results for num and char into a single table */
Names = cNames || nNames;
rNames = {"    Missing", "Not Missing"};
cnt = (cmiss // c) || (nmiss // n);
print cnt[r=rNames c=Names label=""];

This is similar to the output produced by the macro in the SAS Knowledge Base article. You can also write the cnt matrix to a data set, if necessary.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

30 Comments

  1. Quentin McMullen on

    Nice tip. Could I suggest that for the numeric format, it's a good idea to treat Special Missing values as missing as well? So:

    value missfmt
    ._ - .Z = 'Missing'
    other='Not Missing'
    ;

    I think special missing values are very under-utilized. Like to encourage their use as much as possible. For anyone interested, I wrote a NESUG paper a few years back.

    Regards,
    --Q.

    • Howard Schreier on

      Sometimes special missings (or a subset thereof) are used as vehicles for non-quantitative information, and should be counted with the non-missings.

    • Rick Wicklin

      Thanks for the subsequent analysis. Yes, I didn't write the SAS/IML program to be competitive for large data sets. PROC FREQ, which only passes through the data once, should always beat SAS/IML. Here's one way to rewrite the SAS/IML program so that it doesn't run out of memory for 1e8 observations. See if this code is more competitive with PROC FREQ:

      proc iml;
         use test nobs nobs;
         read point 1 var _NUM_ into x[colname=nNames]; 
         read point 1 var _CHAR_ into x[colname=cNames]; 
         Names = nNames || cNames;
         nmiss = j(1,ncol(Names));
         n = nmiss;
         do i = 1 to ncol(Names);
            read all var (Names[i]) into x;
            nmiss[i] = countmiss(x,"col");
            n[i] = nobs - nmiss[i];
         end;
         close test;
         rNames = {"    Missing", "Not Missing"};
         cnt = nmiss // n;
         print cnt[r=rNames c=Names label=""];
      quit;
      

      • Hi Rick:
        if I want to use this program as a macro , how to avoid the log error once input dataset only has numeric variables or character variables, or dataset contains zero records?

        • Rick Wicklin

          The first step of your macro would be to count the number of character variables, the number of numeric variables, and the number of observations. You would then write macro code to handle the degenerate cases. For example:

          data _NULL_;
          set have nobs=nobs;
          array char[*} $ _CHAR_;
          array num[*} _NUMERIC_;
          call symputx("numChar", dim(char));
          call symputx("numNumer", dim(num));
          call symputx("numObs", nobs);
          stop;
          run;

          If you have further questions, please post to the SAS Support Communities.

          • I think we should add "if 0 then" in front of the "set " , You can compare the result

            data _NULL_;
            if 0 then set have nobs=nobs;
            array char[*} $ _CHAR_;
            array num[*} _NUMERIC_;
            call symputx("numChar", dim(char));
            call symputx("numNumer", dim(num));
            call symputx("numObs", nobs);
            stop;
            run;

  2. Pingback: Count missing values in observations - The DO Loop

    • Rick Wicklin

      Hi Jared,
      The CMISS function in the DATA step is an awesome function. It was developed before my blog posts. I wrote about it in an article about counting missing values in observations and linked to the KB article to which you refer. I also used it in my book on statistical programming (published 2010). The COUNTMISS function is useful because it returns a vector, whereas calling CMISS on a matrix returns a 0/1 matrix and you then need to sum up the 1's in that matrix.

  3. Hi Rick,
    Your proc iml code is great for getting all the missing information into one table - thanks!.

    But is there a way to have the table turned around so that missing and not missing are columns and each variable is a row? When you have a lot of variables it would be much easier to read that way.
    Liz

  4. Hi, I am a beginner in SAS and need help in counting the number of missing and non-missing values at each column level. I am looking at your second solution but not sure how to create a output dataset with the proc freq result. I need these count details for further processing.. appreciate your help..

  5. Fabio Gangemi on

    Hi all,
    I see different kind of problem like output format (e.g. program with format), kind of variables (char or num),.. so I decided to do a program that count all over the variables, all the missing value.
    Enjoy:

    proc contents data=&INPUT..&TAB.
    out=cont_entity(keep=name)
    noprint;
    run;

    proc sql noprint;
    select count(*) into:NOBS
    from cont_entity
    ;
    quit;

    %let dsId = %Sysfunc(open(cont_entity));
    %let N_NOME_CAMPO = %Sysfunc(varnum(&dsId.,name));
    %let I=1;

    proc sql;
    create table stat_missing as
    select
    %do %while (%Sysfunc(fetch(&dsId.)) = 0);
    %let NOME_CAMPO = %Sysfunc(getvarc(&dsId.,&N_NOME_CAMPO));
    %if &I ne &NOBS %then %do;
    coalesce(sum(missing(&NOME_CAMPO.)),0) as &NOME_CAMPO. ,
    %end;
    %else %do;
    coalesce(sum(missing(&NOME_CAMPO.)),0) as &NOME_CAMPO.
    %end;

    %let I= %eval(&I + 1);
    %end;
    from &INPUT..&TAB.
    ;
    quit;

    %let RC= %sysfunc(close(&DsId));

  6. Data New;
    input ID month;
    cards;
    001 3
    002 5
    001 .
    001 10
    001 .
    001 .
    003 12
    005 11
    006 1
    005 .
    005 .
    ;
    proc print noobs;
    run;

    Hi,

    Can any one help me to find out the count of number of missing values by variable ID in above dataset.

    Many Thanks
    Rabi Shaw

  7. Sangeetha Gallagher on

    Hi, I wanted to output the matrix with missing and non missing counts to a dataset. I added an extra line to the code so that this could be done but my output dataset don't seem to have any variable values, only variable headers. I'm attaching my code here.

    proc iml;
    use Semi_cleaned nobs nobs;
    read point 1 var _NUM_ into x[colname=nNames];
    read point 1 var _CHAR_ into x[colname=cNames];
    Names = nNames || cNames;
    nmiss = j(1,ncol(Names));
    n = nmiss;
    do i = 1 to ncol(Names);
    read all var (Names[i]) into x;
    nmiss[i] = countmiss(x,"col");
    n[i] = nobs - nmiss[i];
    end;
    close Semi_cleaned;
    rNames = {" Missing", "Not Missing"};
    cnt = nmiss // n;
    create test from cnt[colname=Names rowname=rNames];
    print cnt[r=rNames c=Names label=""];
    quit;

    Thanks

  8. Pingback: Examine patterns of missing data in SAS - The DO Loop

  9. Donghua Zeng on

    Hi, Rick:
    It is a great idea to count missing number, especially using format in proc freq step.However, i have a little confused about the use of MISSING and MISSPRINT options after TABLES/. MISSING option may suppress MISSPRINT option.

    After this format use, all missing and non-missing values are re-assigned with "Missing" or "Not Missing". Then why we have to keep options of MISSING and MISSPRINT.

    Best Regards
    Don

    • Rick Wicklin

      You are correct: If all (formatted) values are nonmussing, then you do not need to use the MISSING and MISSPRINT options. They do nothing. However, I kept them because they are useful options to know about when you want to tabulate the number of nonmissing categories AND the number of missing values.

  10. Pingback: Use SAS formats to bin numerical variables - The DO Loop

  11. HI, I am trying to find out the missing patterns in a dataset. But I could not find a working code for that. Following are the sample dataset which I used for my coding.

    Area Alchohol_Consumbtion Gender Weather_Condition
    1 0 1 1
    1 0 1 1
    1 0 1 1
    2 0 1 0
    2 1 1 1

    I need to find out the missing patterns in each Area.
    I used
    proc mi data=AB nimpute=0;
    var Segment_No Age Gender Weather_Condition;
    run;

    but it takes Segment_No as a variable. If I remove the Segment_No, it gives only missing patterns, but do not provide the missing patterns ion each segment.
    If you can give me a solution, I would be more thankful to you.

  12. Rabindra Kumar on

    The variable that you have defined in the dataset doesn't match the variables that you have defined in VAR statement like (Segment_No Age) doesn't available in the defined table.

    • Rick Wicklin

      I don't understand your statement. In PROC FREQ, the _CHAR_ and _NUMERIC_ keywords mean "use all character variables" and "use all numeric variables." In PROC IML I use the _CHAR_ and _NUM_ keywords similarly.

  13. Here is my problem and I am hoping somebody can answer this.

    I have an Oracle table that has 127 million records and 200 variables. Wrote a proc freq for _char_and _num_ to count missing and non-missing. And then used some manipulation to create a nice report that has variables and the missing and non-missing information. One cut and paste into Excel and my report is ready. But, I did it on a random sample of 5 million records. Analyst wants all 127 million records in the report. Proc Freq is timing out.

    Pass-thru SQL is the way out, me thinks. But I am not sure nmiss or cmiss will work in a pass-thru SQL. Any thoughts?
    Thanks

Leave A Reply

Back to Top