Ensuring that key variables are numeric, not character

One of the frustrating outcomes of the data import process is when a variable that you need to be numeric is imported as character. This often happens because the column of data contains non-numeric data, for example, where blanks in a database are exported as “NULL” instead of a true blank. This blog presents an efficient data cleaning solution for this problem.

There are various solutions for such problems in SAS, ranging from complex coding on the import side (e.g. in SQL) to post-import cleaning. Because many users do not have access to, or knowledge of, all SAS products, post-import cleaning is often the only practical option. My solution is to run the following snippet immediately after importing. The user fills in only the first three lines: the name of the data set, the existing target variables, and a list of new variable names. Nothing else needs to change.

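%let Dataset = <Insert data set name here>;
%let TargetVariables = <Insert actual target variables here>;
%let NewNumericals = <Give new variable names – list must have same number as old variables>;
Data &Dataset;
	set &Dataset;
	Length &NewNumericals 8;
	Format &NewNumericals Best12.;
	/* Old character variables and the new numeric variables, listed in matching order */
	array TargetVariables (*) &TargetVariables;
	array NewNumericals (*) &NewNumericals;
	do i = 1 to DIM(TargetVariables);
		/* 'v' counts characters NOT in the list; 't' ignores trailing blanks */
		if CountC(TargetVariables{i}, "0 1 2 3 4 5 6 7 8 9 .", 'vt') = 0 then
			/* Explicit conversion; ?? gives missing without log notes for oddities like "1.2.3" */
			NewNumericals{i} = input(TargetVariables{i}, ?? 32.);
		else NewNumericals{i} = .;
	end;
	drop i; /* the loop counter is not needed in the output data set */
run;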

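The code creates the new variables, ensures that they are numeric, and copies the contents of the old target variables into them. Wherever an old value contains any character other than a digit, a blank or a period, the new value is set to missing. The old variables themselves are left unchanged. Note that a minus sign counts as a non-numeric character here, so negative values also become missing.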
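
As a hedged alternative to the character-counting test, the INPUT function with the ?? modifier converts whatever it can and quietly returns missing for the rest. The sketch below uses hypothetical names (Demo, Amount_c, Amount_n) and an assumed informat width of 32. Unlike the snippet above, it also accepts signed values and scientific notation such as -5 or 1E3, so check that this is the behaviour you want.

```sas
/* Hypothetical test data: Demo, Amount_c and Amount_n are illustrative names only */
data Demo;
	length Amount_c $8;
	input Amount_c $;
	datalines;
12.5
NULL
44#
7
;
run;

data Demo;
	set Demo;
	/* ?? suppresses invalid-data notes and leaves _ERROR_ at 0 */
	Amount_n = input(Amount_c, ?? 32.);
run;

proc print data=Demo;
run;
```

In this sketch Amount_n is 12.5 and 7 for the two clean rows and missing for "NULL" and "44#".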

It is wise for the user to run a PROC FREQ on the old variables to see what non-numeric entries were in there. For instance, my snippet will delete entries like “_3” or “44#” (because of the non-numeric characters), whereas you may want to retain the numeric portion of those entries. How to retain numeric portions is the topic for another blog.
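
One way to make that check concrete is to list only the entries that cannot be read as numbers before deciding how to treat them. This is a minimal sketch, not part of the original snippet: the data set names Raw and BadValues and the variable Amount_c are hypothetical, and the test for "cannot be read as a number" relies on the INPUT function with the ?? modifier.

```sas
/* Hypothetical data: Raw and Amount_c are illustrative names only */
data Raw;
	length Amount_c $8;
	input Amount_c $;
	datalines;
12.5
NULL
_3
44#
NULL
7
;
run;

/* Keep only non-blank values that fail numeric conversion */
data BadValues;
	set Raw;
	if not missing(Amount_c) and missing(input(Amount_c, ?? 32.));
run;

/* Frequency of each offending entry */
proc freq data=BadValues;
	tables Amount_c / nocum nopercent;
run;
```

With this sample data the table should show only NULL, _3 and 44#, which makes it easy to spot entries whose numeric portion you may want to keep.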

5 Comments

Ja Karman on December 6, 2015 8:47 am

avoiding the proc import with all the guessing and use proper input-processing by a data step will also help a lot. Why let a proc do guessing?
Jean Slosek on December 1, 2015 10:24 am

Actually I think there are a few much easier ways to do that. First, when we get data "pulled" from our SQL programmers I always request they convert all cells with the word NULL (the default in SQL) to blanks. Otherwise, you can also take a field and multiply it by 1. For example, if you have a field named BUGS and it looks like numbers except that the missing values come in as the word NULL, create a new field like this:

BUGS2=BUGS*1;

That will automatically convert the cells in BUGS2 to a correct missing value and the field BUGS2 will be a numeric field with the same exact numbers.
~Jean
- Gregory Lee on December 2, 2015 10:04 am
  
  Hi Jean
  Thank you – yes, I mentioned the SQL option in the blog. Certainly, as per Bob’s comment, it’s definitely true that input coding is a good idea. This blog is referring to post-import transformations. I tested your suggestion on multiplying by 1: yes, that is more efficient, thanks!
  Regards,
  Gregory Lee
bob mcconnaughey on November 27, 2015 12:39 am

the input function, assuming the char. variables "look" like numbers is about as easy a way to do this as i can imagine? Or am i missing something?
- Gregory Lee on December 2, 2015 10:04 am
  
  Thanks Bob – you are quite right. I should have been a little more precise. I usually use this snippet when I’ve already got the data in SAS and want to transform it there, not generally during input – sometimes I get the already corrupted data in SAS format. Having said that, consider the common case of receiving Excel data from a client. Many users will import using the Wizard, which does not do input transformations. Your suggested solution of using the infile and input statements is efficient, but requires the Excel file to be available for access (which generally means being open). This can be cumbersome when the source file is large or sits on a slow shared folder or the like. I also find that writing the DDE triplet can be tricky for some users, although the DDE triplet writer in SAS 9.4 (in the Solutions menu) makes it a lot easier now. For readers keen on exploring Bob’s solution, here is a sample of the input code:
  
  filename Original dde 'Excel|C:\$$!Sales.xlsx]Sheet1!R1C1:R6C2';
  data Orig;
  infile Original missover;
  input <variable name> 8.2;
  run;
  
  So, the infile & input code works great, but has some cons. Finally, there is the SAS/ACCESS option, but not all users license this. So, yeah, plenty of options, the snippet in the blog is a last resort after import has given you data in the wrong format and you don’t want to re-import.
  
  Regards,
  Gregory Lee
