Jedi SAS Tricks - These aren't the labels...

In the first Star Wars movie, Obi-wan uses Jedi mind tricks to convince the stormtroopers that the droids they see are not the droids they're looking for.  A colleague at SAS passed along a question from a SAS user where the column labels they were seeing were NOT the labels they were looking for:

Does anybody have code to modify SAS column labels so that the variable name and label are simultaneously displayed when viewing the table?  For example, we would like to create a new label that has the value 'VarName: Label for VarName' instead of just 'Label for VarName'.

I started thinking about how I would customize the labels using SAS code. I'd want the process to be as automated as possible, and I'd want to accomplish the task without rewriting all of the data in the table just to change the labels. I settled on using a little PROC SQL with its dictionary table magic, a pinch of SAS macro, and finally the raw power of PROC DATASETS to do the deed.  Let's start by creating a dataset to play with:

data haircuts;
  input Name $ Amt;
  label name='Person' Amt='Price of Last Haircut';
datalines;
Luke 25
Obi-wan 15
Darth 0
;

The dataset (with labels displayed) looks like this:

jedi_label_magic_hair_original

Now, we'll use PROC SQL and the dictionary.columns table to construct the ATTRIB statements we'll deploy in PROC DATASETS:

proc sql noprint;
select cat('attrib ', strip(Name), ' Label="',catx(': ',Name,Label),'";')
  into :code separated by ' '
  from dictionary.columns
  where memname='HAIRCUTS' and LIBNAME='WORK'
;
quit;

The macro variable CODE now holds all of the ATTRIB statements we will need in the subsequent PROC DATASETS step:

attrib Name Label="Name: Person"; 
attrib Amt Label="Amt: Price of Last Haircut";

Finally, we'll submit the PROC DATASETS code to make the changes to our dataset descriptor:

proc datasets lib=work memtype=data nolist;
  modify haircuts;
  &code
run;

And violá! Labels just like the customer ordered:

jedi_label_magic_hair_final

So now, these are the labels we were looking for and my work here is done (for today, at any rate).

Until next time - may the SAS be with you!
Mark

Post a Comment

US public schools projected to be majority-minority in 2014

Being that today is a Throwback Thursday, I decided to re-create one of my favorite things from elementary school -- show-and-tell! And of course I'll be showing and telling you about a graph ... using school data! :)

chalkboard

OK - that SAS graph certainly is "Throwback" but it's not the one I'm going to show-and-tell today! LOL

The end of August is the traditional time for all the kids to head back to school in the US, and I saw an interesting graph in a Pew Research article about the racial/ethnic composition of US public schools. This fall, for the first time, the combined minority students will outnumber white students.

Pew's graph was interesting, but there were a few things about it that I didn't like. For example, they only showed bars for ever other year (even numbered years), except for the first two bars (1997 and 1998). Also, the "percent total minority enrollment" numbers were placed on the "black" bar segment, which could be misleading (unless you carefully read the labeling). There were also a lot of labels on the graph - for example they scattered labels around the graph showing what colors represented what racial/ethnic group, rather than using a color legend.

I decided to try to create my own version of the graph using SAS, to see if I could "clean it up" and make things a little more straightforward.

I did some checking and found a table of the data (p. 33) in a paper published by the National Center for Education Statistics (NCES). I copy-n-pasted it into the datalines section of my SAS program, and then used Proc Transpose to rearrange the data so it would be easier to work with. I used Proc Gchart's subgroup option so that each race was represented by a bar segment with a different color (using the exact same colors as the Pew graph). The bar segments would normally be ordered alphabetically, therefore I manually assigned numeric values in the data to produce the desired order, and then created a user-defined-format so that the numbers would print in the legend as the desired text.

public_school_ethnicity

I made several changes from the original. Since there was data for every year, I plotted all the data (rather than just the even years). I show the total majority and minority percent values to the extreme left and right, rather than in the middle (highlighting the 2014 values in red, since they are the important ones). I added reference lines at 50% since that is the important/transition value the graph was created to show. And if you click my graph (above) to see the full-size interactive version, you'll notice it has html hover-text for each bar segment, in case you want to see the exact values.

What do you think of the new graph? Any suggestions to make it even better? :)

Post a Comment

Turnout vs popular party in European elections

In the last edition of the European elections, the main question was not only “Who is going to win the elections?” but also “What is going to be the turnout?” As it has taken a month and half to establish the real figure (the previous figure was based on exit polls so was preliminary) we can finally dive deeper into it.

The relationship between voter participation and party fortunes has received much attention in the voting behavior literature. Specifically, a number of studies on the advanced industrial democracies postulate that left-of-center parties benefit from higher turnout.

Turnout is seen as a critical test for European democracy, and it has fallen steadily and consistently since 1979, from 62% in the first election in 1979 to 43% in the 2009 election.The final turnout figure for this year’s European Parliament elections has been revealed as 42.54% – the lowest-ever turnout and lower than the number communicated immediately after the elections. This means that the eighth legislature of the European Parliament will not be more representative for the European population than the previous one.

In statistics we make a difference between the population and a sample. The population is the entire collection of individual members of a group of interest. When the entire population can’t be measured, we make use of a sample. A sample is a subset of a population that is drawn to enable inferences to the population. A sample should be representative of the population, meaning that the sample’s characteristics are similar to the population’s characteristics. A representative sample should be an unbiased indication of what the population is like.

Four EU countries have a compulsory voting law for the European Parliament elections; Belgium, Luxembourg, Greece and Cyprus. Theoretically the votes in these countries would reflect the entire population. While Belgium and Luxembourg show a turnout of above 85%, the figure is still low in the two other countries (57% in Greece and 42% in Cyprus).

Will the eighth legislature of the European Parliament be representative for the entire population? This is a question that we can’t answer, however statistics allow us to calculate if there is an association between the turnout in the different member states and the outcomes of the elections.

A commonly used test that examines whether there is an association between two categorical variables is the Pearson chi-square test. According to this method an association exists between two categorical variables if the distribution of one variable changes when the value of the other variable changes. If there is no association, the distribution of the first variable is the same regardless of the level of the other variable. In other words, the chi-square test measures the difference between the observed cell frequencies and the cell frequencies that are expected if there is no association between the variables. If you have a significant chi-square statistic, there is strong evidence that an association exists between your variables.

In order to perform the Pearson chi-square test I combined the turnout data with the winning European party per state and I created a variable ‘Turnout category’.

Country Winning party Turnout Turnout category
Slovakia EPP 13.0% 0-30%
Czech Republic EPP 19.5% 0-30%
Poland EPP 23.8% 0-30%
Croatia EPP 24.3% 0-30%
Slovenia EPP 24.5% 0-30%
Hungary EPP 29.2% 0-30%
Latvia EPP 30.0% 0-30%
Portugal S&D 34.5% 30%-40%
Romania S&D 34.7% 30%-40%
UK EFD 36.0% 30%-40%
Estonia ALDE 36.4% 30%-40%
Netherlands ALDE 37.0% 30%-40%
Bulgaria EPP 40.2% 40%-50%
Finland ALDE 40.9% 40%-50%
Cyprus EPP 42.4% 40%-50%
France NI 42.4% 40%-50%
Spain EPP 44.7% 40%-50%
Austria EPP 45.0% 40%-50%
Lithuania ALDE 47.3% 40%-50%
Germany EPP 47.9% 40%-50%
Sweden S&D 51.0% 50%-60%
Ireland EPP 51.2% 50%-60%
Denmark EFD 55.0% 50%-60%
Italy S&D 57.2% 50%-60%
Greece GUE/NGL 57.4% 50%-60%
Malta EPP 74.8% 60%-100%
Luxembourg EPP 85.5% 60%-100%
Belgium G/EFA 90.0% 60%-100%

Next step was to turn this table into a crosstab. Therefore I used the Table Analysis task in SAS Enterprise Guide. Within the Table Analysis task the Chi-Square tests can be calculated.

chi-square

If we look at the rows of the crosstab below we see that European People’s Party (EPP) is the most popular party in 15 of the 28 European countries. If we look at the columns of the crosstab we conclude that in all 7 countries with a turnout between 0 and 30% European People’s Party was elected. This was not the case in any of the 5 countries with a turnout between 30% and 40%. This is an illustration of the fact that there is an association between the turnout and the winning party.This conclusion is confirmed by the Pearson chi-square as the outcome of the test was significant (χ2=38.09 and Exact P=0.013).

elections1

Based on this we could presume that a right party benefits from the lowest turnout but keep in mind that an association does not necessarily imply causation!

Post a Comment

Will California dry up and fall off?

When the recent earthquake hit California, one of my friends commented that "California is so dry right now, I'm surprised it didn't just break off and fall into the ocean!"

It got me wondering, just how bad and widespread is California's drought? I did a few Web searches and found lots of dramatic before-and-after pictures of their lakes, which are down to about 35% of capacity statewide. Some of the pictures made me think of Southern Peru, and how huge bays and bodies of water there had turned into deserts (millions of years ago), leaving behind fossils of whales which I got to see a few years ago. Here's a picture I took of my buddy Joe taking a picture of one of those whale fossils...

Joe_photo_whale

Of course, we're all glad California didn't fall into the ocean, and I'm sure they'll make a comeback from this drought ... but it got me wondering what data is available, and how I might visually analyze it.

I found that the water.weather.gov site provides shapefiles containing precipitation data. I downloaded the current data for the last 60 days, showing the rainfall in each area of the US as a percent of the average rainfall for that time period. I used Proc Mapimport to get the data into SAS, and found that it was point data (rather than polygons), so I couldn't use Proc Gmap directly on the data. But I was able to use annotate to color-code the points and place them on a blank Gmap. The points are so dense that they have no space between them, and produce what looks like a contour map.

And here is the map! It definitely shows that most of California has had dramatically less rainfall during the past 60 days than the historical average. Here's a link to the code, if you want to try creating your own maps.

precipitation_map

 

Post a Comment

How to show recessions (or other ranges) on a time series plot

It's easy to plot events that happened at a certain time, but what about events that extended over a range of dates, such as recessions? ... This blog post teaches you a nice trick to use for that!

Let's say you have a plot of the labor force participation rate over time, and you want to also see when the recessions started and ended. Here's the way I would plot it:

labor_participation

How do you do that in SAS? If you've seen any of my previous blogs, you probably know that the trick I use to add things to graphs is almost always annotate. Annotate gives you a way to programmatically 'draw' extra things on a graph. And in this case, I use it to draw a colored box behind the graph, extending in the left/right direction from the start date to the end date of each recession, and extending in the up/down direction from the bottom of the axis to the top.

Here's a short example that shows how to code it. Notice that I use the data coordinate system for the x direction (xsys='2') so I can specify the dates, and I use a %-based coordinate system for the y direction (ysys='1') so I can simply specify the values 0 and 100%. The when='b' draws the solid-colored bar before (ie, behind) so it doesn't obscure the plot lines.

data anno_recessions;
xsys='2'; ysys='1'; when='b';
x='15apr1960'd; y=0; function='move'; output;
x='15feb1961'd; y=100; function='bar'; color='cxf0f1e6'; style='solid'; output;
run;

What about events other than recessions? - Certainly! It could be any event with a start date and end date. And if it's not obvious what the event is, just stick a label on it!

homes

Or what about which president was in office? Yep - you can use the same trick to show that!

voters

What about "graying out" a special region of a Salmonella chart? Sure! With a slight modification, you can use an alpha-transparent gray color, and annotate it after the graph (when='a' so it is drawn on top).

salmonella_epi_curve

What about showing a season of the year, such as 'winter'? ... You betcha!

snowfall_barchart

You can even use this technique to simply alternate a background color (like the old green bar paper) instead of using reference lines.

dow_unblinged1

 

I've set up each of the graphs above so you can click on them to see the 'info' page which has a link to the SAS code used to create the graph. Feel free to download the code, and modify it to work with your data.

Now it's your turn - do you have any tricks that you'd like to share for enhancing time series plots? If so, leave a comment below!

Post a Comment

Mapping 206 hospitals in 29 states!

If you're a really big company, with many locations around the country, how do you keep track of all that? ... With a great map, of course!

I recently read a CNN article about the Community Health Systems network being hacked - exposing the names, Social Security numbers, physical addresses, birthdays and telephone numbers of 4.5 million patients. That's mind boggling, isn't it?!?

I'm always looking for different ways to use SAS, and I thought maybe we could use it to plot the locations of their 206 hospitals on a map, so people could easily see if their hospital is one of those impacted by the hackers.

The article included a map, but it was small and didn't really provide enough detail to be very useful. My new/improved SAS version is below - you can click it to see the full size map, and it will let you hover your mouse over each marker to see the hospital name. I also included the full text list of all the hospitals (below the map), in case you want to search by city name, etc.

community_health_systems

That's a nice looking SAS map, eh?!? You're probably wondering "How'd he do that?!?" ...

Well, I started with a list of all the hospitals, and the city & state where each was located. I used Proc Geocode to find the latitude & longitude of the cities, and then plotted a marker at each of those locations using annotate. I used Proc SQL to count the number of hospitals and states, and saved the results into macro variables which I also annotated on the map. ODS HTML allowed me embed html tags so that each marker has hover-text with the name of the hospitals. And, most importantly, it's all data driven (no hard-coding of marker locations or numbers) - therefore to change what I'm plotting on the map, all I have to do is change the data!

Have you got some data that you might want to plot on a map like this? Here's a link to the SAS code, in case you want to try your hand at re-using it. Feel free to leave a comment about this map, or a map you might create with your own data!

 

Post a Comment

What's trending in your neck of the woods?

"Trending" is one of those words that's trending these days, as we verbify everything. And if you understood that, then you'll probably love this blog! :-)

For those of you not hip to the lingo ('hip' meaning 'knowing about and following the newest styles, fashions, etc.'), dictionary.com defines trending as "to emerge as a popular trend; be currently popular."

I recently ran across an article describing a new map that shows which YouTube videos are trending in various parts of the US. The map interface allows you to select things like gender and age, to see what videos are most popular (or trending) with that demographic. It's really cool - give it a try! ... but please come back to read the rest of the blog, so you can learn how to create such maps in SAS!

youtube_trending_map

 Like I do with most cool maps & graphs, I decided to come up with a way to create a similar map using SAS! I don't have access to the YouTube viewership data, so I took the liberty of creating some 'plausibly real' data showing what SAS-related YouTube videos are trending in various cities. And whereas the original map appears to use pre-determined cities (arranged in neat rows and columns by hand), I will set mine up so the city locations are data-driven.

I used Proc Geocode to determine the latitude and longitude of each city (note that if you had IP addresses of the viewers rather than city, Proc Geocode can also be used on IP addresses!) I then created an annotate dataset encoding commands to move to each lat/long coordinate, and then switch to a relative-percent coordinate system and then move a certain percentage down and to the left of the center coordinate, and then draw an image icon that extends the same percentage distance up and to the right of the center coordinate. Here's a link to the actual code.

And here's my SAS version of the map! Note that the data is fabricated (not real), but the videos that the icons link to are real.

trending_videos_map

Now that you know how to create such a map in SAS, what data might you have that would be interesting to plot in this way, with little image/icons? Share your ideas in the comments section!

 

Post a Comment

If US state borders were redrawn -- which new state would you live in?

Find out which state you'll live in, if the US state borders are redrawn so we have 50 states with equal population! (Don't worry! - This is just a fun/hypothetical "what if" blog!)

To get you in the mood for this topic, here's a picture of one of the many vintage globes my good friend (and antique dealer) Reggie has in his personal collection. I always like looking at old globes to see if I can find borders & country names that have changed, and then try to guess the age of the globe based on that information. ... Which brings us to the topic at hand - changing the borders and names of the 50 US states!

globe

For those of you not familiar with the US, it is divided into 50 states (represented by the 50 stars on our flag), and the population of some states is huge (such as California, with 38 million people), while the population of other states is small (such as Wyoming, with 600,000 people). This population disparity makes it a bit awkward when trying to determine how much influence each state has in the US government, etc.

I recently saw a map Neil Freeman created, that proposed dividing the US into 50 new states with an equal population in each area. Being a "map guy" myself, this caught my attention. I read the details in his article, and I liked the factors that he had used in coming up with the new groupings - "The map began with an algorithm that grouped counties based on proximity, urban area, and commuting patterns. The algorithm was seeded with the fifty largest cities. After that, manual changes took into account compact shapes, equal populations, metro areas divided by state lines, and drainage basins."

redraw_50_states

But, in scrutinizing his map, I found that I couldn't easily determine which of the current states & counties were included in the new proposed states. Therefore I created my own version of the map with SAS, which could easily answer those questions!

In my SAS version, I use the counties as my basic building blocks, and I add html hover-text so you can hover over any county and see the current state & county, and the proposed new state name. I also annotate the state borders (in white) so you can easily see how the current state borders compare to the new state borders. Therefore, while Neil's map makes a better static poster (which was his goal, by the way), the SAS map provides more analytic capabilities and insight. ... And having both maps gives you the best of both worlds!

Here is my SAS map. Click the static thumbnail below, to see the full-size interactive map with html hover-text.

us_equal_population_map

If this change were to happen, what new state would your current residence be in, and what new state would you prefer to live in (and why)?

Post a Comment

Your SAS Visual Analytics expert is here

Ask the expertWouldn’t it be nice to have an expert to answer your SAS Visual Analytics question?

Now you do and the best part, it’s free!

Beginning September 8, SAS will be hosting a one-hour Ask the Expert session each Monday through November. Each interactive session will focus on a specific Visual Analytics topic area as indicated below:

SAS VA schedule

After each session, a session recap will be posted to the SAS Visual Analytics Community. The session recap will include a link to the session recording as well as a list of question and answers. For each question we will include links to additional details and more information. Make sure to check out the session recap whether you can attend the live session or not.

Please start gathering your questions and plan to attend one or all twelve sessions. Use this link to register for each session that you would like to attend.

We look forward to answering your questions and helping you to get the most out of SAS Visual Analytics!

Post a Comment

Not the same ol’ middle tier

The SAS Middle Tier is all new for SAS 9.4, and you might not recognize it. Gone are the third-party web application servers. Gone is the third-party Java development kit. Missing major components like those may cause you to ask yourself, “Does SAS still have a middle tier?”  With the introduction of the all new SAS Environment Manager the answer is an emphatic, “YES!” The third-party products are not needed because every component is 100 percent SAS homegrown.

The new SAS Middle Tier Architecture looks like this:

middle_tier_1

No need to worry. We understand that even though you see some familiar applications, much of this diagram is likely new to you. To help you navigate the new SAS 9.4 Middle Tier we introduce to you the new SAS Platform Administration: Middle Tier Administration course.

A highlight of the course is the SAS Environment Manager, a web-based administration tool for the SAS environment. The following screen shot shows how the SAS Environment Manager displays the status of your entire environment in one dashboard:

middle_tier_2

middle_tier_3

The availability column is refreshed every minute indicating the health of the resource in question.

While more functionality will be added to the SAS Environment Manager over the SAS 9.4 lifecycle, it is not yet a replacement for its older brother, SAS Management Console. SAS Management Console still rules when it comes to managing metadata access.

You can learn more by checking out the SAS 9.4 Intelligence Platform: Middle-Tier Administration Guide. Or view the administration curriculum path to see what else is offered.

Post a Comment