Ensuring that key variables are numeric not character

5

One of the frustrating outcomes of the data import process is when a variable that you need to be numeric is imported as character. This often happens because the column of data contains non-numeric data, for example, where blanks in a database are exported as “NULL” instead of a true blank. This blog presents an efficient data cleaning solution for this problem.

There are various solutions for such problems in SAS, ranging from complex coding on the import side (e.g. in SQL) to post-import cleaning. Because many users do not have access to or knowledge of all SAS products, in many cases we need to do post-import cleaning. My solution is to use the following data snippet immediately after importing. The user simply fills in only the first three lines, the name of the data set, the currently existing target variables, and a list of new variable names. Note that the user does not need to change anything else.

%let Dataset= <Insert data set name here>;
%let TargetVariables = <Insert actual target variables here>;
%let NewNumericals = <Give new variable names – list must have same number as old variables>;
Data &Dataset;
	set &Dataset;
	Length &NewNumericals 8.;
	Format &NewNumericals Best12.;
	array TargetVariables (*) &TargetNumericals;
	array NewNumericals (*) &NewNumericals;
    do i = 1 to DIM(TargetVariables);
    	if 	CountC(TargetVariables {i},"0 1 2 3 4 5 6 7 8 9 .",'vt') = 0 then 
			NewNumericals{i} = TargetVariables {i};
		else NewNumericals{i} = .;
    end;
run;


The code creates the new variables, ensures that they are numeric, and removes all non-numeric data from the old target variables before copying their contents into the new variables.

It is wise for the user to run a PROC FREQ on the old variables to see what non-numeric entries were in there. For instance, my snippet will delete entries like “_3” or “44#” (because of the non-numeric characters), whereas you may want to retain the numeric portion of those entries. How to retain numeric portions is the topic for another blog.

Share

About Author

Gregory Lee

Research Director

Professor Gregory John Lee is currently the Research Director and an Associate Professor in Research Methodology and Decision Sciences at the AMBA-rated Wits Business School, Johannesburg, South Africa. He has authored books on HR Metrics, including his newest title, Business Statistics Made Easy in SAS. Lee focuses on issues in human resource management, notably HR metrics, in which he has established himself as a leading expert, and other areas such as training, employee turnover, and the employee-customer link. He has served in many capacities within the international academic field and has sat on the GMAC Advisory Council, the editorial board of the Journal of Organizational and Occupational Psychology, and he engages in frequent reviewing for many journals. In addition, Lee is a well-known consultant, writer, and speaker in the corporate and practical management arenas, notably in the area of HR metrics but extending to other areas such as human resources strategy and foresight.

5 Comments

  1. avoiding the proc import with all the guessing and use proper input-processing by a data step will also help a lot. Why let a proc do guessing?

  2. Actually I think there are a few much easier ways to do that. First, when we get data "pulled" from our SQL programmers I always request they convert all cells with the word NULL (the default in SQL) to blanks. Otherwise, also you can take a field and multiple it by 1. For example if you have a field named BUGS and it looks like numbers except the missing values coming in as the word NULL... create a new field like this:

    BUGS2=BUGS*1;

    That will automatically convert the cells in BUGS2 to a correct missing value and the field BUGS2 will be a numeric field with the same exact numbers.
    ~Jean

    • Hi Jean
      Thank you – yes I mentioned the SQL option in the blog. Certainly as per Bob’s comment it’s definitely true that input coding is a good idea. This blog is referring to post-import transformations. I tested your suggestion on the multiplying by 1: yes, that is more efficient, thanks!”
      Regards,
      Gregory Lee

  3. bob mcconnaughey on

    the input function, assuming the char. variables "look" like numbers is about as easy a way to do this as i can imagine? Or am i missing something?

    • Thanks Bob – you are quite right. I should have been a little more precise. I usually use this snippet when I’ve already got the data in SAS and want to transform it there, not generally during input – sometimes I get the already corrupted data in SAS format. Having said that, consider the common case of receiving Excel data from a client. Many users will import using the Wizard which does not do input transformations. Your suggested solution of using the infile and input statements is efficient, but requires the Excel file to be available for access (which generally means being open). This can be cumbersome in situations where the source file is large or on slow to work shared folders or the like. I also find that writing the DDE triplet can be tricky for some users, although the DDE triplet writer in SAS 9.4 (in the Solutions menu) makes it a lot easier now. For readers keen on exploring Bob’s solution, here is a sample of the input code:

      filename Original dde 'Excel|C:\$$!Sales.xlsx]Sheet1!R1C1:R6C2';
      data Orig;
      infile Original missover;
      input 8.2;
      run;

      So, the infile & input code works great, but has some cons. Finally, there is the SAS/ACCESS option, but not all users license this. So, yeah, plenty of options, the snippet in the blog is a last resort after import has given you data in the wrong format and you don’t want to re-import.

      Regards,
      Gregory Lee

Leave A Reply

Back to Top