For several releases, SAS has supported a cryptographic hash function called MD5, or "message digest". In SAS 9.4 Maintenance 1, the new SHA256 function can serve the same purpose with a better implementation.
UPDATE: Since first writing this in 2014, SAS has added a richer set of hashing functions/methods. They include MD5, SHA1, SHA256, SHA384, SHA512, CRC32 and HMAC methods. See "Hashing functions and hash-based message functions in SAS" in the documentation.
The job of a hash function is to take some input (of any type and of any size) and distill it to a fixed-length series of bytes that we believe should be unique to that input. As a practical example, systems use this to check the integrity of file downloads. You can verify that the published hash matches the actual hash after downloading.
Sometimes a hash is used to track changes in records within a database. You first calculate a hash value for each data record based on all of the fields. Periodically, you recheck those calculations. If a hash value changes for a data record, you know that some part of that record has changed since the last time you looked.
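As a rough illustration of the change-tracking idea, here is a minimal sketch that stores one SHA256 value per record. The table and column names (work.customers, customer_id, name, email) are hypothetical and exist only for this example. It assumes SAS 9.4 Maintenance 1 or later, and that joining the fields with CATX is an acceptable way to feed a whole record to the function.

```sas
/* Hypothetical sample data; table and column names are illustrative only */
data work.customers;
  infile datalines dsd;
  length customer_id 8 name $20 email $40;
  input customer_id name $ email $;
  datalines;
1,Ann,ann@example.com
2,Bob,bob@example.com
;
run;

/* One 32-byte SHA256 value per record, computed over all of the fields */
data work.customer_hashes;
  set work.customers;
  length row_hash $32;          /* SHA256 returns 32 bytes */
  format row_hash $hex64.;      /* display as 64 hex characters */
  row_hash = sha256(catx('|', customer_id, name, email));
  keep customer_id row_hash;
run;
```

Rerunning the second step later and comparing row_hash by customer_id would flag the records that changed. One caveat: CATX skips blank arguments, so two records that differ only in which field is blank can produce the same concatenated string and therefore the same hash.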
Here's another common use: storing passwords in a database. Because you can't (theoretically) reverse the hash process, you can use a hash function to verify that a supplied password is the same as a value you've stored, without having to store the original clear-text version of the password. It's not the same as encryption, because there is no decryption method that would compromise the original supplied password value.
MD5 has known vulnerabilities, especially with regard to uniqueness. An attacker with a relatively low-powered computer can construct two different inputs that produce the same MD5 hash (a collision), which compromises the algorithm's usefulness wherever you rely on the hash being unique to its input.
Enter the SHA256 algorithm. It's the same idea as MD5, but without the known vulnerabilities. Here's a program example:
data _null_;
  format hash $hex64.;
  hash = sha256("SHA256 is part of SAS 9.4m1!");
  put hash;
run;
Output (formatted as hexadecimal so as to be easier on the eyes than 256 ones-and-zeros):
876CF270E81BA3E6219F9518AD9CBE303D8EEC734D4B5966F8D4FD9E89449C6C
As the name implies, it produces a value that is 256 bits (32 bytes) in size, as compared to 128 bits from MD5. Here's a useful article that compares the effectiveness of hash algorithms.
If you've been wanting to hash your data in SAS, but you've been pooh-poohing the MD5 function -- well, now is your chance!
26 Comments
Thanks Chris. Wouldn't a length of $32 for the 'hash' variable in your last example be more appropriate?
Tom, you're correct! I've wasted 32 precious bytes (per record) in my output data set. 32 bytes are stored, but I formatted as $hex64 to represent the two-hex-characters-per-byte in the displayed output. The optimized version would be:
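The snippet that followed this reply is not reproduced here. A minimal sketch of what such an optimized version might look like, assuming the only change is an explicit LENGTH statement placed ahead of the FORMAT statement:

```sas
data _null_;
  length hash $32;        /* store only the 32 bytes that SHA256 returns */
  format hash $hex64.;    /* still display two hex characters per byte */
  hash = sha256("SHA256 is part of SAS 9.4m1!");
  put hash;
run;
```

Without the LENGTH statement, the FORMAT statement is the first reference to the variable, so SAS sizes it from the format width (64) rather than from the function's return value.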
Thought I'd chime in and let people know that MD5 is no longer considered secure and hasn't been for a while (http://www.zdnet.com/article/md5-password-scrambler-no-longer-safe/).
Most folks are using SHA256/512 and others.
So just be careful when you decide to use MD5 hash. There are perfectly legit cases to still use it, such as checking for tampering when downloading your favourite Linux ISO ;)
Thanks Jared! I'm glad to see this post was a bit ahead of the official curve...
Chris - any idea why the SHA256 implementation SAS uses is so slow, compared to its MD5 implementation? Various tests performed by SAS-L members showed that the SHA256 algorithm seems to be about 10 to 30 times slower than MD5. While it's expected for it to be a bit slower, 10 to 30 times is a lot more than would be expected (3 times is more reasonable). 30 times is with a 'small' message, 10 times is with a 'large' message, in this case.
Joe, I don't know the answer, but I can check with the development team. I have been watching the thread on SAS-L and I see the interest in using this to create a hash for every record in a data set. This function might not be optimized for that sort of volume. I haven't noticed on the thread: are folks trying the SHA256HEX and the SHA256HMACHEX functions too? These are new in 9.4m3.
Does the university edition have Sha256 encryption?
No, it does not -- the SAS/Secure components required are not part of the SAS University Edition. The MD5 function is available.
I guess that the cryptographic part of the algorithm eats some resources.
But if you wish to use hashing for observation comparisons, or just want to use it for building surrogate keys (instead of sequences), then there is no need for encryption.
Does SAS have any plans for implementing non-cryptographic hash functions (like MurMur3, Fowler-Noll-Vo)?
Not to my knowledge, Linus. SHA256 and others were added (I think) in part because other underlying parts of SAS use them, so adding them as user functions was easy. Rick Langston wrote a paper about how to code your own algorithms for hashing in SAS. Care to try your hand at one of the functions you mentioned?
Sounds like an opportunity for user-defined function demos:
SHA256 for SAS UE
For Linus, non-cryptographic hash functions (like MurMur3, Fowler-Noll-Vo)
Is there going to be functionalities like rowhash and colhash? The cats() implementation is slow and really limited when the dataset becomes even moderately sized. Rowhash function would do a sha256/512 for one row from all variables listed as arguments. Colhash on the other hand would give a sha256/512 hash from the columns given as arguments. That would help a lot in some basic integrity checks and stamping tasks.
Good feedback, Bob. I don't know if there are plans for a row-wise or column-wise hash, but I know that many users have done what you hint -- created their own row hash with concatenation followed by the hash function.
If the goal is checking integrity and differences, don't forget PROC COMPARE and its utility in that department.
SHA256 Is NOT meant for storing password!!!!!!!
SHA256 + a "salt" is still commonly used, as well as a multipass hashing technique that you can easily implement with the newer SHA256HMACHEX function.
Again, completely wrong. scrypt, bcrypt, PBKDF2, etc. are proven algorithms that can be used for password hashing.
SHA256 + a "salt" is wrong: you need a strong random salt in every hash, and you would have to write your own function to create the salt, and so on.
From a security point of view, it's always a bad idea to invent this yourself. That's why things like bcrypt exist.
https://security.stackexchange.com/questions/90064/how-secure-are-sha256-salt-hashes-for-password-storage
https://security.stackexchange.com/questions/211/how-to-securely-hash-passwords/31846
Hi Chris,
I am currently tasked with calculating an MD5 Checksum for an entire text file, so that a third party vendor can calculate that same value when the file is transmitted to them to ensure the file was not altered in any way during the process. So what I've been trying to do is use the MD5 function for an entire file path, rather than just one variable or set of variables. I thought it was working, but I think I just realized that it is only hashing the name of the file, not the file contents. I'm not sure if SHA256 is any different in that regard. Any thoughts on this and what could possibly help me in SAS? I have to believe there is a way.
Hi Luke, if you need the MD5 signature for an entire file, you might need to resort to an external tool. The MD5 and SHA256 functions in SAS are limited in that they can take only the maximum size of a SAS character value (32,767 bytes) -- many files would be larger than that. External checksum tools such as md5sum and sha256sum are built into Linux, and similar tools are available for install on Windows.
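One way to call such an external tool from within a SAS program is the FILENAME PIPE access method. The sketch below is an illustration only: the file path is hypothetical, and it assumes a Linux host with md5sum on the path and the XCMD system option enabled so that SAS is allowed to run operating system commands.

```sas
/* Hypothetical path; requires the XCMD system option to be enabled */
filename chk pipe 'md5sum /data/outbound/extract.txt';

data _null_;
  infile chk truncover;
  length md5_hex $32;
  /* md5sum prints the 32-character hex digest first, then the file name */
  input md5_hex $32.;
  put md5_hex=;
run;

filename chk clear;
```

Because md5sum reads the file contents rather than the path string, the value written to the log is the checksum of the file itself and should match what the receiving party computes with the same tool.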
Hi Chris-
Thanks for the engagement on this topic.
What informat is best for reading in the result of SHA256 among IB/PIB, PD, and RB?
I'm using MD5/SHA256 to anonymize data, and would prefer a 9-digit number to a hex string.
Thanks,
Jed
Maybe I should know what these are: "IB/PIB, PD, and RB" -- but I don't and I'm having trouble guessing. Can you provide a little more context?
Whoops, sorry.
IB, PIB, PD, and RB are informats that can be applied to binary data.
IB (integer binary) and PIB (positive integer binary) are similar and the high-level description reads, "Reads [positive] integer binary (fixed-point) values."
RB (real binary) "[r]eads numeric data that is stored in real binary (floating-point) notation."
PD is for reading an IBM packed-decimal format, so maybe that rules itself out here.
As I said, but did not elaborate on, I am using MD5 to anonymize identifiers, and was trying to create a numeric ID that I could use instead of a beastly hex. I know hex is prettier than binary, but that's sort of like saying a cow is prettier than a pig (individual tastes may vary).
So I'm experimenting with using these informats on the results of MD5 (and plan to try SHA256) and finding the results a bit unwieldy. Maybe trying to use the hash like this is a fundamentally flawed premise? I don't know.
I see, but I'm not sure you'll be able to get the "hash-like" experience this way. One advantage of a hash is that you can't reverse-engineer the original value, but you can verify that an original value is (or is not) the source for the hash (by reapplying the hash function + salt if needed). Numeric IDs can be used to replace a field, but I don't think you'll be able to come up with an integer that represents a unique hash that you can use in the same way. You'd have to keep a table of integer->value mappings, and if data privacy was the concern there would have to be a lot of process around that. Happy to be corrected on this though -- it's not my area of expertise.
OK, thanks Chris.
I was trying to create a unique ID number based on an input rather than create and maintain a crosswalk file, so the specifics of hash-verification aren't as important.
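For readers curious what that experiment might look like, here is a minimal sketch under stated assumptions, not a recommendation. It reads the first 6 bytes of the MD5 result with the PIB6. informat and reduces the result to at most 9 digits with MOD. The identifier value is hypothetical. The resulting number depends on the byte order of the platform, and squeezing a 128-bit hash into 9 digits makes collisions likely once there are more than a few tens of thousands of distinct inputs, so the result cannot be assumed unique.

```sas
data _null_;
  length id_text $20;
  id_text = 'A123-45-6789';                    /* hypothetical identifier */
  /* Interpret the first 6 bytes of the 16-byte MD5 value as a positive integer */
  raw = input(md5(strip(id_text)), pib6.);
  /* Keep the last 9 decimal digits; distinct inputs can collide here */
  numeric_id = mod(raw, 1e9);
  put numeric_id= z9.;
run;
```

A 6-byte integer is below 2**48, so it fits exactly in a SAS numeric value. The information discarded by the truncation is what makes duplicate IDs possible.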
Hi Chris
We have SAS EG, version 7.1 at our company. Any idea if this supports the SHA256 function? My log would suggest not but would appreciate any clarification you could provide...
23 data temp;
24 set sample_0001;
25 format hash_email $hex64.;
26 hash_email = SHA256(email);
______
68
ERROR 68-185: The function SHA256 is unknown, or cannot be accessed.
Thanks
Byron
Hi Byron, this function was added in SAS 9.4 Maint 1. You could be using SAS Enterprise Guide 7.1 with an earlier release of SAS -- you would need to check. You can run "proc product_status; run;" to check the version of SAS.