SHA256 and other hashing functions in SAS


For several releases, SAS has supported a cryptographic hash function called MD5 (the "message digest" algorithm). In SAS 9.4 Maintenance 1, the new SHA256 function serves the same purpose with a stronger algorithm.

UPDATE: Since first writing this in 2014, SAS has added a richer set of hashing functions/methods. They include MD5, SHA1, SHA256, SHA384, SHA512, CRC32 and HMAC methods. See "Hashing functions and hash-based message functions in SAS" in the documentation.

The job of a hash function is to take some input (of any type and of any size) and distill it to a fixed-length series of bytes that we believe should be unique to that input. As a practical example, systems use this to check the integrity of file downloads. You can verify that the published hash matches the actual hash after downloading.

Sometimes a hash is used to track changes in records within a database. You first calculate a hash value for each data record based on all of the fields. Periodically, you recheck those calculations. If a hash value changes for a data record, you know that some part of that record has changed since the last time you looked.
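As a rough sketch of that change-tracking idea, the program below hashes each record in two snapshots of a table and keeps the records whose hash differs. The table names, variables, and sample values are made up for illustration, and it assumes the SHA256 function is available in your SAS session.

```sas
/* Hypothetical sample data: two snapshots of one table, keyed by id */
data work.before;
  input id name :$10. amount;
  datalines;
1 Alice 100
2 Bob 250
3 Carol 75
;

data work.after;
  input id name :$10. amount;
  datalines;
1 Alice 100
2 Bob 300
3 Carol 75
;

/* Hash every record. The '|' delimiter keeps ("ab","c") distinct from ("a","bc"). */
data work.before_h;
  set work.before;
  length old_hash $ 32;
  old_hash = sha256(catx('|', name, amount));
  keep id old_hash;
run;

data work.after_h;
  set work.after;
  length new_hash $ 32;
  new_hash = sha256(catx('|', name, amount));
  keep id new_hash;
run;

/* Both tables are already sorted by id, so they can be merged directly */
data work.changed;
  merge work.before_h work.after_h;
  by id;
  if old_hash ne new_hash;   /* keep only the records whose hash changed */
  format old_hash new_hash $hex64.;
run;

proc print data=work.changed;
run;
```

With this sample data, only the record with id 2 differs between the snapshots, so it is the only one that should land in work.changed.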

Here's another common use: storing passwords in a database. Because you can't (theoretically) reverse the hash process, you can use a hash function to verify that a supplied password is the same as a value you've stored, without having to store the original clear-text version of the password. It's not the same as encryption, because there is no decryption method that would compromise the original supplied password value.
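Here is a minimal sketch of that verification step, not a complete password system. The salt value and the password are hypothetical, and prepending a salt before hashing is an assumption added for illustration; in practice both the salt and the stored hash would come from your own user table.

```sas
data _null_;
  length salt $ 8 stored_hash supplied_hash $ 32;

  /* Hypothetical values: in practice these come from your user table */
  salt        = 'x9Q2f7Lk';
  stored_hash = sha256(cats(salt, 'correct horse'));   /* computed once, at sign-up */

  /* At login, hash the supplied password the same way and compare the results */
  supplied_hash = sha256(cats(salt, 'correct horse'));

  if supplied_hash = stored_hash then put 'Password accepted';
  else put 'Password rejected';
run;
```

Only the salt and the 32-byte hash need to be stored; the clear-text password never has to be written anywhere.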

MD5 has known vulnerabilities, especially with regard to uniqueness. A malicious person can use a relatively low-powered computer to craft two different inputs that produce an identical hash (a collision), which undermines the algorithm's usefulness wherever you rely on a hash to identify its input.

Enter the SHA256 algorithm. It's the same idea as MD5, but without the known vulnerabilities. Here's a program example:

data _null_;
  format hash $hex64.;
  hash = sha256("SHA256 is part of SAS 9.4m1!");
  put hash;
run;

Output (formatted as hexadecimal so as to be easier on the eyes than 256 ones-and-zeros):

876CF270E81BA3E6219F9518AD9CBE303D8EEC734D4B5966F8D4FD9E89449C6C

As the name implies, it produces a value that is 256 bits (32 bytes) in size, as compared to 128 bits from MD5. Here's a useful article that compares the effectiveness of hash algorithms.
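To see the size difference side by side, this small sketch hashes the same input with both functions and formats each result as hexadecimal. The variable names are arbitrary, and it assumes an ASCII-based SAS session.

```sas
data _null_;
  length md5_hex $ 32 sha_hex $ 64;
  md5_hex = put(md5('abc'), $hex32.);      /* 16 bytes -> 32 hex characters */
  sha_hex = put(sha256('abc'), $hex64.);   /* 32 bytes -> 64 hex characters */
  put md5_hex= / sha_hex=;
run;
```

For the input "abc", the log should show the widely published test vectors: 900150983CD24FB0D6963F7D28E17F72 for MD5 and BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD for SHA256.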

If you've been wanting to hash your data in SAS, but you've been pooh-poohing the MD5 function -- well, now is your chance!


About Author

Chris Hemedinger

Director, SAS User Engagement

Chris Hemedinger is the Director of SAS User Engagement, which includes our SAS Communities and SAS User Groups. Since 1993, Chris has worked for SAS as an author, a software developer, an R&D manager and a consultant. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies.

26 Comments

  1. Thanks Chris. Wouldn't a length of $32 for the 'hash' variable in your last example be more appropriate?

    • Chris Hemedinger

      Tom, you're correct! I've wasted 32 precious bytes (per record) in my output data set. 32 bytes are stored, but I formatted as $hex64 to represent the two-hex-characters-per-byte in the displayed output. The optimized version would be:

      data cars;
        length hash $ 32;
        format hash $hex64.;
        set sashelp.cars;
        hash = sha256(cats(make,model));
      run;
    • Chris Hemedinger

      Thanks Jared! I'm glad to see this post was a bit ahead of the official curve...

  2. Pingback: Anonymization for data managers | SAS Users

  3. Chris - any idea why the SHA256 implementation SAS uses is so slow, compared to its MD5 implementation? Various tests performed by SAS-L members showed that the SHA256 algorithm seems to be about 10 to 30 times slower than MD5. While it's expected for it to be a bit slower, 10 to 30 times is a lot more than would be expected (3 times is more reasonable). 30 times is with a 'small' message, 10 times is with a 'large' message, in this case.

    • Chris Hemedinger

      Joe, I don't know the answer, but I can check with the development team. I have been watching the thread on SAS-L and I see the interest in using this to create a hash for every record in a data set. This function might not be optimized for that sort of volume. I haven't noticed on the thread: are folks trying the SHA256HEX and the SHA256HMACHEX functions too? These are new in 9.4m3.

    • Chris Hemedinger

      No, it does not -- the SAS/Secure components required are not part of the SAS University Edition. The MD5 function is available.

  4. I guess that the cryptographic part of the algorithm eats some resources.
    But if you wish to use hashing for observation comparisons, or just want to use it for building surrogate keys (instead of sequences), then there is no need for encryption.
    Does SAS have any plans for implementing non-cryptographic hash functions (like MurmurHash3, Fowler-Noll-Vo)?

  5. Are there going to be functions like rowhash and colhash? The cats() approach is slow and really limited once the data set becomes even moderately sized. A rowhash function would compute a sha256/512 hash for one row from all variables listed as arguments. Colhash, on the other hand, would give a sha256/512 hash from the columns given as arguments. That would help a lot in some basic integrity checks and stamping tasks.

    • Chris Hemedinger

      Good feedback, Bob. I don't know if there are plans for a row-wise or column-wise hash, but I know that many users have done what you hint -- created their own row hash with concatenation followed by the hash function.

      If the goal is checking integrity and differences, don't forget PROC COMPARE and its utility in that department.

  6. Hi Chris,
    I am currently tasked with calculating an MD5 Checksum for an entire text file, so that a third party vendor can calculate that same value when the file is transmitted to them to ensure the file was not altered in any way during the process. So what I've been trying to do is use the MD5 function for an entire file path, rather than just one variable or set of variables. I thought it was working, but I think I just realized that it is only hashing the name of the file, not the file contents. I'm not sure if SHA256 is any different in that regard. Any thoughts on this and what could possibly help me in SAS? I have to believe there is a way.

    • Chris Hemedinger

      Hi Luke, if you need the MD5 signature for an entire file, you might need to resort to an external tool. The MD5 and SHA256 functions are limited in that they can take only the max size of a character value in SAS (32,767 bytes) -- many files would be larger than that. Checksum utilities such as md5sum and sha256sum are built into Linux, and similar tools are available for install on Windows.

  7. Hi Chris-

    Thanks for the engagement on this topic.

    What informat is best for reading in the result of SHA256 among IB/PIB, PD, and RB?

    I'm using md5/sha256 to anonymize data, and would prefer to use a 9-digit number to a HEX string....

    Thanks,

    Jed

    • Chris Hemedinger

      Maybe I should know what these are: "IB/PIB, PD, and RB" -- but I don't and I'm having trouble guessing. Can you provide a little more context?

      • Whoops, sorry.

        IB, PIB, PD, and RB are informats that can be applied to binary data.

        IB (integer binary) and PIB (positive integer binary) are similar and the high-level description reads, "Reads [positive] integer binary (fixed-point) values."

        RB (real binary) "[r]eads numeric data that is stored in real binary (floating-point) notation."

        PD is for reading an IBM packed-decimal format, so maybe that rules itself out here.

        As I said, but did not elaborate on, I am using MD5 to anonymize identifiers, and was trying to create a numeric ID that I could use instead of a beastly hex. I know hex is prettier than binary, but that's sort of like saying a cow is prettier than a pig (individual tastes may vary).

        So I'm experimenting with using these informats on the results of MD5 (and plan to try SHA256) and finding the results a bit unwieldy. Maybe trying to use the hash like this is a fundamentally flawed premise? I don't know.

        • Chris Hemedinger

          I see, but I'm not sure you'll be able to get the "hash-like" experience this way. One advantage of a hash is that you can't reverse-engineer the original value, but you can verify that an original value is (or is not) the source for the hash (by reapplying the hash function + salt if needed). Numeric IDs can be used to replace a field, but I don't think you'll be able to come up with an integer that represents a unique hash that you can use in the same way. You'd have to keep a table of integer->value mappings, and if data privacy was the concern there would have to be a lot of process around that. Happy to be corrected on this though -- it's not my area of expertise.

  8. OK, thanks Chris.

    I was trying to create a unique ID number based on an input rather than create and maintain a crosswalk file, so the specifics of hash-verification aren't as important.

  9. Hi Chris

    We have SAS EG, version 7.1 at our company. Any idea if this supports the SHA256 function? My log would suggest not but would appreciate any clarification you could provide...

    23 data temp;
    24 set sample_0001;
    25 format hash_email $hex64.;
    26 hash_email = SHA256(email);
    ______
    68
    ERROR 68-185: The function SHA256 is unknown, or cannot be accessed.

    Thanks
    Byron

    • Chris Hemedinger

      Hi Byron, this function was added in SAS 9.4 Maint 1. You could be using SAS Enterprise Guide 7.1 with an earlier release of SAS -- you would need to check. You can run "proc product_status; run;" to check the version of SAS.
