We will remember them

Every year, on 11 November at 11 am – the eleventh hour of the eleventh day of the eleventh month – we pause to remember the men and women who have died or suffered in all wars, conflicts and peace operations. November 11 is therefore also known as Remembrance Day, a memorial day observed in Commonwealth of Nations (formerly the British Commonwealth) member states since the end of the First World War to remember the members of their armed forces who have died in the line of duty.

Remembrance Day has a special meaning for me as I grew up in Flanders Fields, the Belgian region where the First World War saw more than 500,000 soldiers killed. Every morning when I went to school, I passed underneath the Menin Gate Memorial to the Missing.

DSC02607 (2)

The Menin Gate Memorial to the Missing is a war memorial in Ypres, Belgium, which bears the names of more than 54,000 officers and men from the United Kingdom and Commonwealth forces (except New Zealand and Newfoundland) who fell in the Ypres Salient before 16th August 1917 and who have no known grave.

Even though I was not yet a data scientist at the time, I have always wanted to know more about those engraved names. With the First World War having started 100 years ago, in 1914, I thought it was time for an investigation, and I found what I had been looking for on the Commonwealth War Graves Commission website.

DSC02614 (3)

Not only did I find all the names of the casualties, I also learned about their country of origin, their date of death, their age and their rank in the army. I loaded these data into SAS Visual Analytics in order to quickly gain some insights.

We will remember... their nationalities

A simple pie chart shows that about 75% of the deceased came from the United Kingdom, roughly 10% each were Canadian and Australian, and about 1% each of the engraved names on the Menin Gate are Indian and South African.

Graph_3

We will remember... their date of death

Secondly, I created a line chart with the date of death on the X-axis. What immediately struck me is the peak on 31/07/1917. Some research told me that the Battle of Passchendaele started on that day. It became infamous not only for the scale of casualties, but also for the mud.

VisualAnalytics_4
Another insight this chart gives us is that the British were present from the beginning of the war in 1914 until the end, while there seems to be a shift for the others: first the Indian troops, then the Canadian forces, followed by the Australians and finally the South Africans.
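Finding the peak date in a chart like this is simple enough to sketch in code. The post uses SAS Visual Analytics; below is an analogous Python sketch, where the daily counts are hypothetical stand-ins for the real casualty data:

```python
from datetime import date

# Hypothetical daily death counts; the real memorial data shows a sharp
# peak on 31 July 1917, the opening day of the Battle of Passchendaele.
daily_deaths = {
    date(1917, 7, 30): 120,
    date(1917, 7, 31): 1500,
    date(1917, 8, 1): 400,
}

# The day with the highest count is the peak visible in the line chart.
peak_day = max(daily_deaths, key=daily_deaths.get)
```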

We will remember... their ranks

The fatalities on the Menin Gate are associated with 63 different ranks, but five of them represent 93% of the total: Private (70%), Lance Corporal (8%), Rifleman (7%), and Sergeant and Corporal (both 4%). In the bar chart below we see how the countries stack up.

VisualAnalytics_5

We will remember... their age

Although 19 was the minimum legal age for armed service overseas in the United Kingdom, many younger boys served their country in the First World War. When we look at the distribution of age, we clearly observe a heavily skewed distribution.

VisualAnalytics_6
Although about half of the age values are missing, the box plot below is a good indicator of the spread of ages among the different ranks. The youngest victims were the riflemen, with an average age of 25. The sergeants were the “oldest”, dying at the age of 28 on average. The other ranks (private, corporal and lance corporal) were on average 26 years old when they lost their lives on the battlefield.
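The rank-level summary behind that box plot can be sketched as follows. This is a Python stand-in for the SAS Visual Analytics step, with made-up records; as in the real data, missing ages are simply excluded from the averages:

```python
from statistics import mean

# Hypothetical sample records: (rank, age); None marks a missing age,
# since roughly half the ages in the memorial data are unrecorded.
records = [
    ("Private", 26), ("Private", None), ("Rifleman", 24),
    ("Rifleman", 26), ("Sergeant", 28), ("Sergeant", None),
    ("Corporal", 26), ("Lance Corporal", 26), ("Private", 26),
]

def mean_age_by_rank(records):
    """Average the known ages per rank, skipping missing values."""
    by_rank = {}
    for rank, age in records:
        if age is not None:
            by_rank.setdefault(rank, []).append(age)
    return {rank: mean(ages) for rank, ages in by_rank.items()}

averages = mean_age_by_rank(records)
```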

VisualAnalytics_7

I would like to conclude this post with an extract from “For the Fallen”, a poem by Robert Laurence Binyon (1869–1943), published in The Times on 21st September 1914.

They shall grow not old, as we that are left grow old:
Age shall not weary them, nor the years condemn.
At the going down of the sun and in the morning
We will remember them.

DSC02608 (2)

Post a Comment

Credit risk modeling: Remove the guess work

What's the probability that a firm will default on its debt? That’s the big question for many financial institutions. One way you can answer it is with credit risk modeling.

Starting today, we're offering a new Business Knowledge Series course on that topic through our popular e-Learning format. That means you can take the course anywhere, anytime (like right now).

The course, Credit Risk Modeling Using SAS, will help you learn how to develop credit risk models in the context of the recent Basel guidelines.

I caught up with one of the instructors, Bart Baesens, to find out more about the course, the benefits, and how it can solve real-world problems.


Interested? You can start the course today.

Post a Comment

Hey! Where have you been?!?

There's recently been a "States I've Visited" application going around on Facebook, where users create a map showing all the US states they've visited, and then post it on their page for their friends to see. I wondered if SAS could do a better job?...

Here's a screen-capture of one of my friends' maps, created with the m.maploco.com application. It's a pretty simple map, using two colors, and from a visual perspective I have no complaints. It's a good map.

states_ive_visited

But when I thought about creating my own map, I felt a bit limited by the application. For example, there are certain states that I've visited, but only while I was driving through that state to get somewhere else. And I also wanted to provide some details about where I went in the state, or what I did there.

Therefore I used SAS to create my own map, so I could do it the way I wanted. I made the states I had driven through a lighter red, to distinguish them from states I'd actually spent quality time in. And I added html hover-text with more information about my 'visit' to each state. Below is a snapshot of my map - click it to see the full-size map with html hover-text:

states_visited
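The two-shade colour scheme described above could be sketched like this. It's a minimal Python stand-in for the downloadable SAS code, and the state categorisations and hex colours are illustrative, not my actual travel history:

```python
# Three categories, with a lighter red distinguishing states only
# driven through from states where real time was spent.
CATEGORY_COLORS = {
    "visited": "#c0392b",         # full red: spent quality time there
    "driven_through": "#f1948a",  # lighter red: only passed through
    "not_visited": "#dddddd",     # grey: never been
}

# Illustrative assignments; any state not listed counts as not visited.
state_status = {
    "NC": "visited",
    "VA": "driven_through",
    "CA": "not_visited",
}

def fill_color(state):
    """Map fill colour for a state, defaulting to not-visited."""
    return CATEGORY_COLORS[state_status.get(state, "not_visited")]
```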

Feel free to download my code, and modify it to create your own map (and post it to your Facebook page, or wherever you want). Who knows, you might even want to add some changes and enhancements!


Post a Comment

Are patent trolls finally on the decline?

Are you a legitimate hard-working company that has been threatened with a lawsuit, by a patent troll? If so, the graphs in this blog should make you happy!

Speaking of 'happy' and 'troll' - here's a picture of a happy Troll Doll from my friend Hannah. Don't you just hate how patent trolls give the word 'troll' a bad name? Hannah commented that, "Patent trolls could learn a thing or two from good luck Trolls," and I wholeheartedly agree! - hahaha!

troll_doll

In recent years, patent trolling has been very profitable. Perhaps that is one of the reasons the number of patent legal-case filings had been increasing (in combination with other factors, such as the joinder provisions of the America Invents Act, which block a patentee from filing suit against multiple unrelated parties in a single lawsuit).

But enough about the number of patent case filings going up - we're here to talk about them recently going down! I found an interesting article by Lex Machina showing a graph of the number of patent case filings per month since 2011; the number of filings had gone down considerably during the past few months. I decided to create a similar SAS graph, which I include below (click the graph to see the full-size version, with html hover-text on the plot markers). Note that I suppressed the horizontal axis tick marks and values, and annotated the years along that axis instead (annotate is so handy that way!).

patent_case_filings

Although a line plot is a good way to show trends, I always like to plot data in several different ways to get a more complete mental picture. With quantities, I often like to use bar charts, because the heights of the bars give you a way to visually compare the quantities, and the 'area' of the bars gives you a more direct visual representation of the data. So here's a bar chart of the same data, grouped by year, and using the SAS default colors for the htmlblue style. Note that no annotation was required in the bar chart (the year labels are provided by the group axis).

patent_case_filings1
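The grouping step behind the bar chart, rolling monthly filing counts into yearly totals, can be sketched as follows. This is a Python stand-in for the SAS steps, and the counts are hypothetical, not the Lex Machina figures:

```python
from collections import defaultdict

# Hypothetical monthly filing counts keyed by (year, month).
monthly_filings = {
    (2011, 1): 210, (2011, 2): 190,
    (2012, 1): 260, (2012, 2): 270,
    (2013, 1): 300, (2013, 2): 280,
}

def yearly_totals(monthly):
    """Sum monthly counts into yearly totals for a grouped bar chart."""
    totals = defaultdict(int)
    for (year, _month), count in monthly.items():
        totals[year] += count
    return dict(totals)

totals = yearly_totals(monthly_filings)
```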

So, which do you prefer in this situation - the line chart or the bar chart, and why?

Post a Comment

Why I love the Analytics conference series

Maggie Miller and I were chatting about the reasons for enjoying the Analytics 2014 conference in Las Vegas and I made a comment she thought was peculiar. I said that I liked the conference because I could talk about complicated things. She asked me to explain what I meant by that. I will do my best in this post to explain.

analytics

Here's me presenting about some "complicated things" at the Analytics conference.

I learned how to talk about complicated concepts in graduate school. I imagine many of you learned this skill in graduate school as well. I will never forget the first time I gave a seminar. WHAT A DISASTER!! I still recall my dissertation advisor saying, “Yeah, that sucked.” So what happened? Why was my talk a disaster, and how did I fix it? Aside from adjusting to the structure of an economics seminar (a free-for-all), the single biggest change I made was that I learned to talk to my graduate school colleagues about my research. I realized that I had a collection of people in my inner circle who were all smarter than me, which gave me an opportunity to put my ideas to the test. And that is what I did. At each opportunity I talked about my “identification strategy” and potential “endogeneity” issues. And, guess what? The next seminar was a success and I've never looked back.

So how does this relate to Analytics 2014? Selling your analytic models isn’t about the quality of your estimates. It isn’t about the complexity of your models. It is entirely determined by your ability to explain your models to other people. Sometimes it is to people who know more statistics than you. Sometimes it is to smart people who don’t know statistics. Acquiring these skills is entirely the result of practice. And in my three years of attending conferences as part of SAS, I have found no better place to talk about complicated models with sophisticated modelers and smart people who don’t have as much statistical experience.

The Analytics conference gives us the opportunity to start a conversation about a complicated econometric or statistical model with people who know more than we do. We can start our conversation without extensive background on the assumptions of the model. The time with SAS employees at the exhibitor booths allows for interaction with the developers of the models, who know the math but perhaps not the applications. These interactions make us all more comfortable talking about complicated statistical models.

So, when Maggie asked me to explain what I meant, perhaps I was reflecting on what I learned early in my graduate school career about presenting complicated material. Practice helps. While it will never be easy to present complicated material, the Analytics Conference Series events provide an additional opportunity to stand and confidently talk about your work in analytics. We hope to see you next year in Las Vegas at Analytics 2015.

Post a Comment

Dr. Mohammad Abbas, SAS UK & Ireland's top data scientist, forecasts the UK's energy consumption in 2020 using SAS

Top Data Scientist

SAS UK & Ireland recently ran a competition to find the region's 'top data scientist'; the challenge was to produce a forecast of energy demand for the UK in the year 2020 based on the data provided. Competition for this coveted award was fierce, with the winner claiming a trip to SAS Global Forum in the USA and the chance to feature their submission on the SAS Professionals Network.

I recently caught up with Dr. Mohammad Abbas to discuss how he solved the challenge.

Phil:  Could you tell us a bit about your background?

pylon

Mohammad: I hold a Master's Degree in Organic Geochemistry and a Ph.D. in Inorganic Chemistry. While working in the public sector as a chemical analyst in an animal health laboratory, I developed a strong interest in how statistical applications and experimental design are used in animal health. I pursued this interest by gaining a Diploma in Statistics from the Open University, and I've since devoted considerable time to experimenting with analytics using data sets drawn from various disciplines.

Phil: Why did you choose to enter this competition?

Mohammad: Well, I saw the Top Data Scientist Competition as an opportunity to test drive my skills in Big Data Analytics.  Tackling a large analytical project in a predefined time scope was a worthy challenge. It offered me the opportunity to constantly re-evaluate my skills and identify ways to achieve a result.

Phil: The challenge was to forecast energy consumption in 2020. How did you go about tackling the problem?

Mohammad: Having spent some time examining the 47 or so datasets and doing some background reading on energy consumption, I was in a position to develop some approaches to tackling the problem. In essence, it consisted of three key phases: exploratory data analysis, identifying the key model parameters and then selecting a model.

Phil:  An interesting approach, could you tell me a bit more about each phase?

Mohammad: Generally, exploratory data analysis is by far the most important step in any analytical process, and I started by investing a significant amount of time in understanding and visualising the data. It was through this step that I was able to build data blocks and make logical connections between data objects.

UK_population_prediction

Next, I needed to identify the key model parameters. With energy data, there are a lot of variables which can be used at a later stage in the modelling process. The task at this stage was to ask questions of the data and subdivide the answers into clearly defined groups. For example, what impact do economic factors have on energy consumption? How should factors such as gross domestic product, housing, population and disposable income be taken into account? How was energy 'intensity' (that is, energy consumption across the domestic, industrial, transport and services sectors) calculated and presented in the data sets? What was the relationship between energy consumption in primary equivalents and final energy consumption?

Phil:  What do you mean by energy consumption in primary equivalents and final energy consumption?

Mohammad: By this I mean the difference between the amount of energy generated and the final amount consumed. Some energy is lost in the production and transmission of power; burning coal to generate electricity loses some of the coal's energy in the process, and further power is lost when that electricity is transmitted via pylons, for example.

domestic_energy_consumption

I needed to answer all of these questions and more to choose the best variables. Based upon these findings, I subdivided the key parameters into three distinct groups:

  1. Economic factors and related energy consumption variables
  2. Energy intensities by sector (domestic, industry, transport and services)
  3. Energy consumption in primary equivalents and final energy consumption.

Phil:  OK, so how did you go about selecting the best model?

Mohammad: SAS offers a wide array of modelling procedures, and choosing which model to use depends upon a clear understanding of the analytical problem and how much you know about the various statistical modelling methods available. Of course, you also need solid diagnostic skills.

To meet the challenge, it was essential to reduce the number of variables analysed to as few as were relevant; this is known in statistical parlance as 'reducing dimensionality'. I also needed to take data quality into account, and standardisation was needed because some figures were expressed in thousands and others in millions. Likewise, some energy consumption data was expressed in tonnes of oil equivalent and some in terawatt-hours, so these units had to be converted.
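For the unit conversion Mohammad mentions, the standard factor is that 1 tonne of oil equivalent (toe) is defined as 11.63 MWh, so one million tonnes of oil equivalent (Mtoe) corresponds to 11.63 TWh. A minimal sketch:

```python
# 1 toe = 11.63 MWh by definition, hence 1 Mtoe = 11.63 TWh.
TWH_PER_MTOE = 11.63

def mtoe_to_twh(mtoe):
    """Convert million tonnes of oil equivalent to terawatt-hours."""
    return mtoe * TWH_PER_MTOE

def twh_to_mtoe(twh):
    """Convert terawatt-hours to million tonnes of oil equivalent."""
    return twh / TWH_PER_MTOE
```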

Phil:  How did you go about reducing the number of variables, the 'dimensionality' as it's called?

Mohammad: There are a number of ways to reduce dimensionality, one of which combines dimensionality reduction techniques with a regression model. You can use methods such as 'factor analysis' and 'principal component analysis' individually to reduce dimensionality, or combine them with a regression model to obtain a powerful unified approach known as a 'Partial Least Squares Regression Model'. Of course, SAS provides the ability to do all of this.
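The dimensionality-reduction idea can be illustrated with a small principal component analysis sketch, in Python/NumPy rather than the SAS procedures Mohammad used. The data here are synthetic: six observed variables driven by two latent factors, so two components should capture nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic predictors: 50 observations, 6 variables built from 2 latent
# factors (plus a little noise), mimicking redundant energy variables.
base = rng.normal(size=(50, 2))
W = np.array([[1.0, 0.5, -1.0, 2.0],
              [0.5, -1.0, 1.0, 0.3]])
X = np.hstack([base, base @ W]) + 0.01 * rng.normal(size=(50, 6))

# Standardise to mean 0 / unit variance, as Mohammad describes, then
# project onto the leading principal components.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]          # keep the top 2 factors
scores = Z @ components                     # reduced representation
explained = eigvals[order[:2]].sum() / eigvals.sum()
```

With only two true latent factors, `explained` comes out close to 1, which is exactly the situation where replacing many variables with a handful of factors is safe.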

Phil:  So which fundamental questions were you trying to answer?

Mohammad: I was trying to address two key questions. Firstly, how much variation within the predictor variables (those which explain the values of other variables, sometimes known as independent variables) could I explain? For example, atmospheric temperature could explain energy consumption: as it gets colder, more people turn on their heating and hence use more power. Secondly, how much variability in the target variables could be explained by my choice of predictors? In other words, my target variables concerned energy consumption in 2020, so to what extent did the predictor variables I had chosen help to explain, and hence forecast, that?

Phil:  So what results came out of this process?

Mohammad:  My dimensionality reduction techniques reduced the large number of variables into a handful of factors.  Then the partial least square model generated what are known as factor loadings, weights and scores, which helped me to explain how much each factor contributed to the final forecast and how accurate those forecasts would be.  Also, examining the various models' outputs and their associated diagnostic plots helped me to shape the final prediction process.

Actual and Predicted electricity demand in the UK

Obviously, trying to predict a single value (energy consumption in 2020) carries a large amount of uncertainty. So, I ran the model a number of times using different inputs. I tried broad economic factors, electricity consumption and energy intensity (consumption) for each specific economic sector, and finally I used randomisation as a means of assessing my model's ability to differentiate between irrelevant (noise) variables and those with real predictive power. This allowed me to forecast electricity consumption for the UK in 2020 with a difference of approximately 80 TW-h (terawatt-hours) between the highest and the lowest predicted values.

Phil:  Amazing, so what did you find out?

Mohammad: I predict that the overall demand for electricity in the UK in 2020 will be 527 TW-h (+/- 30 TW-h). This represents an increase of 14.6% relative to 2013. Given the potential growth in population, housing and usage of electrical devices in the UK in the next few years, I think this is pretty accurate.
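As a quick sanity check on those figures: a 2020 forecast of 527 TW-h that is 14.6% above the 2013 level implies a 2013 baseline of roughly 460 TW-h:

```python
# Back out the implied 2013 consumption from the interview's numbers.
forecast_2020 = 527.0   # TW-h, central prediction
increase = 0.146        # 14.6% relative to 2013

implied_2013 = forecast_2020 / (1 + increase)  # ~460 TW-h
```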

Finally, I would like to say that I am delighted to have been named the first winner of this competition. For me, the most appealing aspect of the competition was the challenge of taming a large volume of data, drawing valuable insights from it and relating those findings to the real world we live in. This is what Big Data Analytics is all about.

bigdataskills

UK firms are struggling to find the big data skills they need; click here to read new research by SAS and the Tech Partnership highlighting the extent of the problem facing British businesses.


Post a Comment

Graphs that make you go hmm... (early voting data)

Back in the 90s, there was a song by the C&C Music Factory about things that just didn't quite make sense - the song was called Things That Make You Go Hmm.... And in that same spirit, this blog post is about Graphs That Make You Go Hmm...

I'm not really into politics, but I look forward to elections just so I can see what they do with all the data. Some of the graphs are good, some are bad, and some just make me go "Hmm... what were they thinking???". Here's an example of such a graph that appeared in our local newspaper:

early_voters1

At first glance, the graph seemed OK, and told me that as working-age voters get older they tend to do more early voting, and then after they retire they do less early voting (I presume that's partly because retired people don't need to worry about their work schedule any more, and partly because there are fewer older people still living).

But then I started looking at the finer details of the graph, such as tick marks along the axes. Why did they choose 17 as the starting point for the age axis? You have to be 18 to vote. And the horizontal axis tick marks were spaced by 11 years ... until it reached age 61, and then they incremented by 10 years, and then went back to 11 years. Why? And why didn't they choose some increment of 5000 for the tick marks on the vertical axis? Why did they include a shadow to the right of the bars? The shadow makes the entire graph be visually biased towards the right. The hover-text for the bars shows values such as 'Count: 36800' - why not format the number with a comma like the axis values, maybe something like 'Voters: 36,800'. Hmm...
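Picking sensible tick increments (1, 2 or 5 times a power of ten, so tick values land on round numbers like 5,000 or 20,000) is a standard charting trick, and it's what the newspaper's axis missed. A sketch of the idea in Python:

```python
import math

def nice_tick_step(data_max, target_ticks=5):
    """Pick a 'nice' axis increment: 1, 2 or 5 times a power of ten,
    chosen so roughly target_ticks ticks cover 0..data_max."""
    raw = data_max / target_ticks
    magnitude = 10 ** math.floor(math.log10(raw))
    for mult in (1, 2, 5, 10):
        if mult * magnitude >= raw:
            return mult * magnitude
    return 10 * magnitude
```

For an axis topping out at 100,000, this yields a step of 20,000; asking for more ticks on a 36,800 maximum yields the 5,000 increment the post suggests.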

But the thing that really just made me SMH - when I re-sized my browser window, the vertical axis got re-drawn with a scale way taller than needed (making the bars very short), and the most significant digit of the highest value on the axis got chopped off (at first glance, I thought it was 30,000 instead of 100,000). I guess whatever dynamic/interactive software they're using to draw the graph is just too fancy. Hmm...

early_voters2

So, I decided to create my own SAS version of this graph, and try to avoid all the problems shown above. Here's what I came up with. Can you name all the improvements? What else would you change?

nc_early_voters_2014

Post a Comment

Sports analytics - visualize your results!

In sports these days, there's a lot more data to keep track of than just the score! How can you make sense of it all? Being the Graph Guy, of course I recommend graphing it!

Here's an example that's up close and personal for me - dragon boat racing... Below is a cool head-on picture of our team paddling a dragon boat back to the dock after a race (note that the boat is over 40 feet long, and there are 22 people in it). We are Raleigh Relentless!

dragonboat_head_on

When our team goes to a race it's an all-day event, with each team racing in several heats against various combinations of the other teams. In each heat, you can easily see how you did against the other boats in that heat, but what's more interesting and important is how your team is performing compared to all the other teams across the entire day. The times are available in tabular form, but I created the following chart that I think provides a much better way to make sense of the data:

dragonboat_richmond_races_2014
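Boiling a day of heats down to overall standings can be sketched like this (a Python stand-in for the chart's underlying summary; the other team names and all the times are invented):

```python
# Hypothetical heat times in seconds for each team across the day.
heat_times = {
    "Raleigh Relentless": [138.2, 136.9, 135.4],
    "River Dragons":      [140.1, 139.5, 138.8],
    "Paddle Pirates":     [134.8, 136.2, 135.0],
}

def standings(times):
    """Rank teams by their best (fastest) time of the day."""
    best = {team: min(runs) for team, runs in times.items()}
    return sorted(best, key=best.get)

order = standings(heat_times)
```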

A nice visual analysis of sports data can help capture the fans' attention, help teams know which strategies work better than others, or even help with logistics and seating charts in the stadiums.  I've prepared a collection of examples to show you a few of the ways SAS can be used to help visualize sports-related analytics - hopefully you can reuse some of these examples with your own data. Click the screen-capture below to see the samples page:

sports_analytics

So, what's the most interesting sport to you, from a data perspective? How do you analyze (or wish you could analyze) the data from this sport?

Post a Comment

Are you at risk? Which blood types do vampires prefer?

SAS software has long been used to help analyze 'risk' - what about using it to help determine your risk of being attacked by a vampire?!?

On a previous Halloween, I was the victim of a Vampire attack. Here's the photographic proof...

vampire_attack

Having the most common blood type (O+), I thought I was at low risk of attack (with vampires allegedly preferring the rarer blood types, according to this report) -- but I guess a really hungry vampire will lower her standards from time to time.

So, what is your blood type, and how common or rare is it in the human population? Assuming that vampires prefer the more rare blood types, what is your risk of a vampire attack? And if there is geographical clustering of the rare blood types, might vampires be lured to those areas (assuming they use analytics to determine such things)?

I found some data about blood types by country, and plotted the values on a world map. Perhaps you can analyze these maps to help determine the rarity of your blood type, and whether you live in a vampire-prone area of the world. I've included small thumbnails of my maps below, and you can click on them to see the full size maps with html hover-text for each country:

blood_type_map

blood_type_map1

blood_type_map2

blood_type_map3

blood_type_map4

blood_type_map5

blood_type_map6

blood_type_map7


Post a Comment

Tracking Ebola: Using SAS bubble maps

In a previous blog post, I showed how to layer colored areas on a SAS map to show both countries, and the areas within countries that had cases of Ebola. But as the Ebola epidemic has spread, more data has become available, and in this blog I show how to represent that additional data by annotating bubble markers on the map.

While perusing the Internet for the latest Ebola information, I came across a map that the World Health Organization maintains. It is interactive, and lets you pan and zoom the map; as you zoom, it shows progressively more detail. It's a pretty cool map. After zooming in, I did a screen-capture to include below:

ebola_bubblemap_2014_original

But as I studied the map to try to determine the current status of Ebola, I noticed a few problems. The shades of color for the land were very similar to the colors of the bubbles, even though they were representing slightly different things. Also, the colors used were transparent, and therefore when they overlapped (such as bubbles overlapping other bubbles, or overlapping land) then the transparent colors combined and produced darker colors -- which made it very difficult to determine which legend colors they matched.

Therefore I decided to create my own SAS map, and use solid/non-transparent colors (so there was no color blending), and also use very different/distinct colors for the land and the bubbles (so it is easier to match up to the legend).

For the main bubbles, I used the SAS %centroid() macro to determine the location of the center of each region in the map, and I then used the annotate pie function to draw the bubbles.
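The %centroid macro's job, finding a representative centre point for each map region, can be approximated for a single polygon with the standard shoelace-based centroid formula. A sketch in Python rather than SAS:

```python
def polygon_centroid(points):
    """Centroid of a simple (non-self-intersecting) polygon given as a
    list of (x, y) vertices, via the shoelace area/centroid formulas."""
    n = len(points)
    area2 = cx = cy = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0   # twice the signed area of this edge's triangle
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = area2 / 2.0
    return (cx / (6.0 * area), cy / (6.0 * area))

# The centroid of a 2x2 square sits at its middle.
center = polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)])
```

Real map regions are more ragged than a square, but the formula is the same; the bubble is then annotated at the returned coordinates.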

I used specific lat/long coordinates for the locations of the Ebola Treatment Centers, and overlaid several pieces of geometry (all using variations of the annotate pie function) to create a white bubble with a red cross.

Below is a snapshot of my SAS version of the map. Click here to see the interactive version, which has html hover-text over all the land areas and bubbles, and drill-down for the Ebola Treatment Centers.

ebola_bubblemap_2014

Here is a link to the SAS code, if you'd like to experiment with the map.

Post a Comment