How to work with emojis in SAS

When I was a computer science student in the 1980s, our digital alphabet was simple and small. We could express ourselves with the letters A..Z (and lowercase a..z) and numbers (0..9) and a handful of punctuation and symbols. Thanks to the ASCII standard, we could represent any of these characters in a single byte (actually just 7 bits). This allowed for a generous 128 different characters, and we had character slots to spare. (Of course, for non-English and especially non-Latin characters we had to resort to different code pages... but that was before the Internet forced us to work together. Before Unicode, we lived in a digital Tower of Babel.)

Even with the limited character set, pictorial communication was possible with ASCII through the fun medium of "ASCII art." ASCII art is basically the stone-age version of emojis. For example, consider the shrug emoji: 🤷

Its text-art ancestor (not strictly ASCII, as a sharp reader pointed out) is this: ¯\_(ツ)_/¯ While ASCII and text art currently enjoy a retro renaissance, the emoji has become indispensable in our daily communications.

Emojis before Unicode

Given the ubiquity of emojis in every communication channel, it's sometimes difficult to remember that just a few years ago emoji characters were devised and implemented in vendor-specific offerings. As the sole Android phone user in my house, I remember a time when my iPhone-happy family could express themselves in emojis that I couldn't read in the family group chat. Apple would release new emojis for their users, and then Android (Google) would leapfrog with another set of their own fun symbols. But if you weren't trading messages with users of the same technology, then chunks of your text would be lost in translation.

Enter Unicode. A standard system for encoding characters that allows for multiple bytes of storage, Unicode has seemingly endless runway for adding new characters. More importantly, there is a standards body that sets revisions for Unicode characters periodically so everyone can use the same huge alphabet. Emoji characters were first added to Unicode in 2010 (Unicode 6.0) and have been revised steadily since then with universal agreement.

This standardization has helped to propel emojis as a main component of communication in every channel. Text messages, Twitter threads, Venmo payments, Facebook messages, Slack messages, GitHub comments -- everything accepts emojis. (Emojis are so ingrained and expected that if you send a Venmo payment without using an emoji and just use plain text, it could be interpreted as a slight or at the least as a miscue.)

For more background about emojis, read How Emojis Work (source: How Stuff Works).

Unicode is essential for emojis. In SAS, the use of Unicode is possible by way of UTF-8 encoding. If you work in a modern SAS environment with a diverse set of data, you should already be using ENCODING=UTF8 as your SAS session encoding. If you use SAS OnDemand for Academics (the free environment for any learner), this is already set for you. And SAS Viya offers only UTF-8 -- which makes sense, because it's the best for most data and it's how most apps work these days.
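If you're not sure which encoding your session uses, a quick check along these lines can confirm it. This is a minimal sketch: it only reports the setting, because ENCODING is a startup option and can't be changed in a running session. The warning text in the last step is just an illustration.

```sas
/* Show the ENCODING system option for the current session */
proc options option=encoding;
run;

/* The automatic macro variable SYSENCODING holds the same information */
%put NOTE: Session encoding is &sysencoding;

/* Illustrative guard: write a warning to the log if the session is not UTF-8 */
data _null_;
 length enc $ 40;
 enc = getoption('encoding');
 if upcase(enc) not in ('UTF-8', 'UTF8') then
  put 'WARNING: Session encoding is ' enc 'and emoji characters might not be preserved.';
run;
```

If the check reports something other than UTF-8, the fix is usually a different startup configuration rather than a statement in your program.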

Emojis as data and processing in SAS

Emojis are everywhere, and their presence can enrich (and complicate) the way that we analyze text data. For example, emojis are often useful cues for sentiment (smiley face! laughing-with-tears face! grimace face! poop!). It's not unusual for a text message to be ALL emojis with no "traditional" words.

The website Unicode.org maintains the complete compendium of emojis as defined in the latest standards. They also provide the emoji definitions as data files, which we can easily read into SAS. This program reads all of the data as published and adds features for just the "basic" emojis:

/* MUST be running with ENCODING=UTF8 */
filename raw temp;
proc http
 url="https://unicode.org/Public/emoji/13.1/emoji-sequences.txt"
 out=raw;
run;
 
ods escapechar='~';
data emojis (drop=line);
 length line $ 1000 codepoint_range $ 45 val_start 8 val_end 8 type $ 30 comments $ 65 saschar $ 20 htmlchar $ 25;
 infile raw;
 input;
 line = _infile_;
 /* skip comment lines (starting with #) and blank lines */
 if substr(line,1,1)^='#' and line ^= ' ' then do;
  /* read the raw codepoint value - could be single, a range, or a combo of several */
  codepoint_range = scan(line,1,';');
  /* read the type field */
  type = compress(scan(line,2,';'));
  /* text description of this emoji */
  comments = scan(line,3,'#;');

  /* for those emojis that have a range of values */
  val_start = input(scan(codepoint_range,1,'. '), hex.);
  if find(codepoint_range,'..') > 0 then do;
   /* blank is included as a delimiter so trailing spaces are not read as part of the value */
   val_end = input(scan(codepoint_range,2,'. '), hex.);
  end;
  else val_end=val_start;

  /* build the ODS and HTML representations from the first codepoint */
  if type = "Basic_Emoji" then do;
   saschar = cat('~{Unicode ',scan(codepoint_range,1,' .'),'}');
   htmlchar = cats('<span>&#x',scan(codepoint_range,1,' .'),';</span>');
  end;
  output;
 end;
run;
 
proc print data=emojis; run;

(As usual, all of the SAS code in this article is available on GitHub.)

The "features" I added include the Unicode representation for an emoji character in SAS, which could then be used in any SAS report in ODS or any graphics produced in the SG procedures. I also added the HTML-encoded representation of the emoji, which uses the form &#xNNNN; where NNNN is the Unicode value for the character. Here's the raw data view:

When you PROC PRINT to an HTML destination, here's the view in the results browser:
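Note that the program above builds saschar and htmlchar from only the first code point on each line. When a line holds a range, every code point from val_start through val_end is its own emoji. Here is a minimal sketch of expanding one range into one row per code point. The range 231A..231B is hard-coded for illustration, and the data set name range_demo and the variables codepoint and hexcode are hypothetical names, not part of the original program.

```sas
data range_demo;
 length codepoint_range $ 45 hexcode $ 5 htmlchar $ 25;
 codepoint_range = '231A..231B';   /* illustrative range, not read from the file */
 val_start = input(scan(codepoint_range,1,'. '), hex.);
 val_end   = input(scan(codepoint_range,2,'. '), hex.);
 /* one output row for each code point in the range */
 do codepoint = val_start to val_end;
  hexcode  = put(codepoint, hex5.);   /* zero-padded hex, such as 0231A */
  htmlchar = cats('<span>&#x', hexcode, ';</span>');
  output;
 end;
 keep codepoint hexcode htmlchar;
run;

proc print data=range_demo; run;
```

The same DO loop could be applied to the full emojis data set if you need every emoji in a range represented, at the cost of a longer table.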

In search of structured emoji data

The Unicode.org site can serve up the emoji definitions and codes, but this data isn't exactly ready for use within applications. One could work through the list of emojis (thousands of them!) and tag these with descriptive words and meanings. That could take a long time and to be honest, I'm not sure I could accurately interpret many of the emojis myself. So I began the hunt for data files that had this work already completed.

I found the GitHub/gemoji project, a Ruby-language code repository that contains a structured JSON file that describes a recent collection of emojis. From all of the files in the project, I need only one JSON file. Here's a SAS program that downloads the file with PROC HTTP and reads the data with the JSON libname engine:

filename rawj temp;
proc http
 url="https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json"
 out=rawj;
run;
 
libname emoji json fileref=rawj;

Upon reading these data, I quickly realized the JSON text contains the actual Unicode character for the emoji, and not the decimal or hex value that we might need for using it later in SAS.
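To see this for yourself, it can help to list the tables that the JSON engine creates and preview a few rows. This is a sketch that repeats the same download; the variable names (emoji, description, category) are the ones the complete program in this article selects from the root table.

```sas
filename rawj temp;
proc http
 url="https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json"
 out=rawj;
run;

libname emoji json fileref=rawj;

/* List the tables that the JSON engine derived from the file */
proc datasets lib=emoji;
quit;

/* Preview a few records: the emoji column holds the character itself */
proc print data=emoji.root(obs=5);
 var emoji description category;
run;
```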

I wanted to convert the emoji character to its numeric code. That's when I discovered the UNICODEC function, which can "decode" the Unicode sequence into its numeric values. (Note that some characters use more than one value in a sequence.)
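As a small illustration of that conversion, here is a sketch that applies UNICODEC with the 'esc' type to a single character, the same call the complete program uses. It assumes a UTF-8 session and a program file saved as UTF-8 so that the emoji literal survives; the variable names are just for this example.

```sas
/* MUST be running with ENCODING=UTF8 */
data _null_;
 length emoji_char $ 40 emoji_code $ 200;
 emoji_char = '👪';
 /* 'esc' asks for the backslash-u escaped form of each code point */
 emoji_code = unicodec(emoji_char, 'esc');
 put emoji_char= emoji_code=;
run;
```

The result is plain ASCII text, which makes it easy to store, compare, and parse with ordinary string functions.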

Here's my complete program, which includes some reworking of the tags and aliases attributes so I can have one record per emoji:

filename rawj temp;
proc http
 url="https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json"
 out=rawj;
run;
 
libname emoji json fileref=rawj;
 
/* reformat the tags and aliases data for inclusion in a single data set */
data tags;
 length ordinal_root 8 tags $ 60;
 set emoji.tags;
 tags = catx(', ',of tags:);
 keep ordinal_root tags;
run;
 
data aliases;
 length ordinal_root 8 aliases $ 60;
 set emoji.aliases;
 aliases = catx(', ',of aliases:);
 keep ordinal_root aliases;
run;
 
/* Join together in one record per emoji */
proc sql;
 create table full_emoji as 
 select  t1.emoji as emoji_char, 
    unicodec(t1.emoji,'esc') as emoji_code, 
    t1.description, t1.category, t1.unicode_version, 
    case 
     when t1.skin_tones = 1 then t1.skin_tones
     else 0
    end as has_skin_tones,
    t2.tags, t3.aliases
  from emoji.root t1
  left join tags t2 on (t1.ordinal_root = t2.ordinal_root)
  left join aliases t3 on (t1.ordinal_root = t3.ordinal_root)
 ;
quit;
 
proc print data=full_emoji; run;

Here's a snippet of the report that includes some of the more interesting sequences:

The diversity and inclusion aspect of emoji glyphs is ever-expanding. For example, consider the emoji for "family":

  • The basic family emoji code is \u0001F46A (👪)
  • But since families come in all shapes and sizes, you can find a family that better represents you. For example, how about "family: man, man, girl, girl"? The code is \u0001F468\u200D\u0001F468\u200D\u0001F467\u200D\u0001F467, which includes the codes for each component "member" all smooshed together with a "zero-width joiner" (ZWJ) code in between (👨‍👨‍👧‍👧)
  • All of the above, but with a dark-skin-tone modifier (\u0001F3FF) for 2 of the family members: \u0001F468\u0001F3FF\u200D\u0001F468\u200D\u0001F467\u200D\u0001F467\u0001F3FF (👨🏿‍👨‍👧‍👧🏿)
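To see how that last sequence is put together, you can split the escaped code into its parts and label each one. This is a minimal sketch that works on the escaped string exactly as written above. The data set name sequence_parts and the role labels are hypothetical, and it relies on the skin-tone modifiers occupying the range 1F3FB through 1F3FF.

```sas
data sequence_parts;
 length emoji_code $ 200 hexval $ 8 role $ 20;
 emoji_code = '\u0001F468\u0001F3FF\u200D\u0001F468\u200D\u0001F467\u200D\u0001F467\u0001F3FF';
 /* treat both the backslash and the lowercase u as delimiters */
 do position = 1 to countw(emoji_code, '\u');
  hexval = scan(emoji_code, position, '\u');
  codepoint = input(scan(emoji_code, position, '\u'), hex8.);
  if codepoint = 0200Dx then role = 'zero-width joiner';
  else if 01F3FBx <= codepoint <= 01F3FFx then role = 'skin tone modifier';
  else role = 'base emoji';
  output;
 end;
 drop emoji_code;
run;

proc print data=sequence_parts noobs; run;
```

For this sequence, the split yields nine parts: four base emojis, two skin-tone modifiers and three joiners. A breakdown like this is one way to derive features (for example, "has a skin-tone modifier") from emoji codes without rendering the glyph.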

Note: I noticed that not all browsers have caught up on rendering that last example. In my browser it looks like this:

Conclusion: Emojis reflect society, and society adapts to emojis

As you might have noticed from that last sequence I shared, a single concept can call for many different emojis. As our society becomes more inclusive around gender, skin color, and differently capable people, emojis are keeping up. Everyone can express the concept in the way that is most meaningful for them. This is just one way that the language of emojis enriches our communication, and in turn our experience feeds back into the process and grows the emoji collection even more.

As emoji-rich data is used for reporting and for training of AI models, it's important for our understanding of emoji context and meaning to keep up with the times. Already we know that emoji use differs among different age generations and across other demographic groups. The use and application of emojis -- separate from the definition of emoji codes -- is yet another dimension to the data.

Our task as data scientists is to bring all of this intelligence and context into the process when we parse, interpret and build training data sets. The mechanics of parsing and producing emoji-rich data is just the start.

If you're encountering emojis in your data and considering them in your reporting and analytics, please let me know how! I'd love to hear from you in the comments.

About Author

Chris Hemedinger

Director, SAS User Engagement

Chris Hemedinger is the Director of SAS User Engagement, which includes our SAS Communities and SAS User Groups. Since 1993, Chris has worked for SAS as an author, a software developer, an R&D manager and a consultant. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies.

13 Comments

  1. barefootguru on

    The ツ character in the middle of your ‘ASCII-art ancestor’ is Tsu, which is definitely not part of ASCII :]

    • Chris Hemedinger on

      Oh 💩! You're right! That's what I get for lazy Googling.

      ( )    ( )
         0  0
           @
      

  2. Great work as usual Chris. And just in time for World Emoji Day (July 17) 🥳🥳🥳

    I'm finally hoping that we can use emojis in place of the red traffic light signal on dashboards. We need something that describes the situation more clearly.
    Like this one is great 🔞 but only if the goal was 18.

    Here's my suggestions ranked by seriousness of missed goal ...
    🤪
    😱
    🤢
    🤬
    💩

    • Chris Hemedinger on

      It may or may not be a coincidence that I wrote this post near World Emoji Day. Did you know that it's July 17 because that's the date depicted (📅) on the calendar emoji?

  3. I wish you had included code for how to set UTF8 as the _system_ option, or even code to expose the current session option. This seems to be a different operation than using the ENCODING= in the FILENAME or other statement.

    • Chris Hemedinger on

      ENCODING=UTF8 is a startup or config option. Setting it requires an admin (usually) in a central SAS environment. Sometimes people create a "SASApp UTF8" workspace definition that points to the ../nl/u8/sasv9.cfg file in the SAS application install path. PROC OPTIONS GROUP=LANGUAGECONTROL; should show you current values.

  4. Carola Röttig on

    Hi,
    maybe it's stupid - but how can i show the same table in a visual analytic report?
    for me it looks like this:
    00A9 FE0F copyright © ~{Unicode 00A9} Basic_Emoji 169 169
    00AE FE0F registered ® ~{Unicode 00AE} Basic_Emoji 174 174
    1F004 mahjong red dragon 🀄 ~{Unicode 1F004} Basic_Emoji 126980 126980
    1F0CF joker 🃏 ~{Unicode 1F0CF} Basic_Emoji 127183 127183
    1F170 FE0F A button (blood type) 🅰 ~{Unicode 1F170} Basic_Emoji 127344 127344

    i know that i can show emojis by copying/pasting the emoji in a calculated variable, but that's only a single solution.
    any help greatly appreciated
    best regards.
    Caro

    • Chris Hemedinger on

      I'm not sure how to get the same in Visual Analytics. ODS will automatically encode in HTML when using the ODS HTML or HTML5 destination. SAS Visual Analytics might require a calculated value with a function to display the character.

  5. Carola Röttig on

    ps. in my report i don't see the emojis like here, only the html/unicode values

  6. Shubham Arora on

    People are using emojis for so long. They are wonderful innovation in terms of expressing expressions in a very short and correct way. Use them in SAS has blown my mind. Thanks Chris, you wrote a very nice blog.

  7. Hi,
    I wonder if there is a way to recognise emoijs using Visual Text Analytics?
    Actually I see them as a picture like 🙂 in searched documents, but I can't define any rule to match them.
    Or better solution would be import the data with some encoding which would change the 🙂 into "U+1F600" and then to use this phrase in LITI rules? But the user would lose the information about - for example - emotions.
    Best regards,
    Sophie
