Credit risk modeling: Remove the guesswork

What's the probability that a firm will default on its debt? That’s the big question for many financial institutions. One way you can answer it is with credit risk modeling.

Starting today, we’re offering a new Business Knowledge Series course on that topic through our popular e-Learning format. That means you can take the course anywhere, anytime. (Like right now.)

The course, Credit Risk Modeling Using SAS, will help you learn how to develop credit risk models in the context of the recent Basel guidelines.

I caught up with one of the instructors, Bart Baesens, to find out more about the course, the benefits, and how it can solve real-world problems.


Interested? You can start the course today.


Hey! Where have you been?!?

There's recently been a "States I've Visited" application going around on Facebook, where users create a map showing all the US states they've visited, and then post it on their page for their friends to see. I wondered if SAS could do a better job?...

Here's a screen-capture of one of my friends' maps, created with the application. It's a pretty simple map, using two colors, and from a visual perspective I have no complaints. It's a good map.


But when I thought about creating my own map, I felt a bit limited by the application. For example, there are certain states that I've visited, but only while I was driving through that state to get somewhere else. And I also wanted to provide some details about where I went in the state, or what I did there.

Therefore I used SAS to create my own map, so I could do it the way I wanted. I made the states I had driven through a lighter red, to distinguish them from states I'd actually spent quality time in. And I added html hover-text with more information about my 'visit' to each state. Below is a snapshot of my map - click it to see the full-size map with html hover-text:


Feel free to download my code, and modify it to create your own map (and post it to your Facebook page, or wherever you want). Who knows, you might even want to add some changes and enhancements!



Are patent trolls finally on the decline?

Are you a legitimate hard-working company that has been threatened with a lawsuit, by a patent troll? If so, the graphs in this blog should make you happy!

Speaking of 'happy' and 'troll' - here's a picture of a happy Troll Doll from my friend Hannah. Don't you just hate how patent trolls give the word 'troll' a bad name? Hannah commented that, "Patent trolls could learn a thing or two from good luck Trolls," and I wholeheartedly agree! - hahaha!


In recent years, patent trolling has been very profitable. Perhaps that is one of the reasons that the number of patent legal-case filings had been increasing (in combination with other factors, such as the joinder provisions of the America Invents Act that block a patentee from filing suit against multiple unrelated parties in a single lawsuit).

But enough about the number of patent case filings going up - we're here to talk about them recently going down! I found an interesting article from Lex Machina that showed a graph of the number of patent case filings per month since 2011, and the graph showed that the number of filings had gone down considerably during the past few months. I decided to create a similar SAS graph, and I include it below (click the graph to see the full size version, with html hover-text on the plot markers). Note that I suppressed the horizontal axis tick marks and values, and annotated the years along that axis instead (annotate is so handy that way!).


Although a line plot is a good way to show trends, I always like to plot data in several different ways to get a more complete mental picture. With quantities, I often like to use bar charts, because the heights of the bars give you a way to visually compare the quantities, and the 'area' of the bars gives you a more direct visual representation of the data. So here's a bar chart of the same data, grouped by year, and using the SAS default colors for the htmlblue style. Note that no annotation was required in the bar chart (the year labels are provided by the group axis).


 So, which do you prefer in this situation - the line chart or the bar chart, and why?


Why I love the Analytics conference series

Maggie Miller and I were chatting about the reasons for enjoying the Analytics 2014 conference in Las Vegas and I made a comment she thought was peculiar. I said that I liked the conference because I could talk about complicated things. She asked me to explain what I meant by that. I will do my best in this post to explain.


Here's me presenting about some "complicated things" at the Analytics conference.

I learned how to talk about complicated concepts in graduate school. I imagine many of you learned this skill in graduate school as well. I will never forget the first time I gave a seminar. WHAT A DISASTER!! I still recall my dissertation advisor saying, “Yeah that sucked.” So what happened? Why was my talk a disaster and how did I fix this? Aside from adjustment to the structure of an economics seminar (a free for all), the single biggest change I made was that I learned to talk to my graduate school colleagues about my research. I realized that I had a collection of people in my inner circle who all were smarter than me. It provided me an opportunity to put my ideas to the test. And that is what I did. At each opportunity I talked about my “identification strategy” and potential “endogeneity” issues. And, guess what? The next seminar was a success and I’ve never looked back.

So how does this relate to Analytics 2014? Selling your analytic models isn’t about the quality of your estimates. It isn’t about the complexity of your models. It is entirely determined by your ability to explain your models to other people. Sometimes it is to people who know more statistics than you. Sometimes it is to smart people who don’t know statistics. Acquiring these skills is entirely the result of practice. And in my three years of attending conferences as part of SAS, I have found no better place to talk about complicated models with sophisticated modelers and smart people who don’t have as much statistical experience.

The Analytics conference gives us the opportunity to start a conversation about a complicated econometric or statistical model with people who know more than we do. We can start our conversation without extensive background on the assumptions of the model. The time with SAS employees at the exhibitor booths allows for interaction with developers of models who know the math but perhaps not the applications. These interactions allow us all to be more comfortable talking about complicated statistical models.

So, when Maggie asked me to explain what I meant, perhaps I was reflecting on what I learned early in my graduate school career about presenting complicated material. Practice helps. While it will never be easy to present complicated material, the Analytics Conference Series events provide an additional opportunity to stand and confidently talk about your work in analytics. We hope to see you next year in Las Vegas at Analytics 2015.


Dr. Mohammad Abbas, SAS UK & Ireland's top data scientist, forecasts the UK's energy consumption in 2020 using SAS

SAS UK & Ireland recently ran a competition to find the region's 'top data scientist'. The challenge was to produce a forecast of energy demand for the UK in the year 2020 based on the data provided. Competition for this coveted award was fierce, with the winner claiming a trip to SAS Global Forum in the USA and the chance to feature their submission on the SAS Professionals Network.

I recently caught up with Dr. Mohammad Abbas to discuss how he solved the challenge.

Phil:  Could you tell us a bit about your background?

Mohammad: I hold a Master’s Degree in Organic-Geochemistry and a Ph.D. in Inorganic Chemistry. While working in the public sector as a chemical analyst in an animal health laboratory, I developed a strong interest in how statistical applications and experimental design are used in animal health. I pursued this interest by gaining a Diploma in Statistics from the Open University and I’ve since devoted considerable time to experimenting with analytics using data sets drawn from various disciplines.

Phil: Why did you choose to enter this competition?

Mohammad: Well, I saw the Top Data Scientist Competition as an opportunity to test drive my skills in Big Data Analytics.  Tackling a large analytical project in a predefined time scope was a worthy challenge. It offered me the opportunity to constantly re-evaluate my skills and identify ways to achieve a result.

Phil: The challenge was to forecast energy consumption in 2020. How did you go about tackling the problem?

Mohammad: Having spent some time examining the 47 or so datasets and doing some background reading on energy consumption, I was in a position to develop some approaches to tackling the problem. In essence, it consisted of three key phases: exploratory data analysis, identifying the key model parameters and then selecting a model.

Phil:  An interesting approach, could you tell me a bit more about each phase?

Mohammad: Generally, exploratory data analysis is by far the most important step in any analytical process and I started by investing a significant amount of time in understanding and visualising the data. It was through this step that I was able to build data blocks and make logical connections between data objects.

Next, I needed to identify the key model parameters. With energy data, there are a lot of variables which can be used at a later stage in the modelling process. The task at this stage was to be able to ask questions of the data and subdivide those answers into clearly defined groups. For example, what impact do economic factors have on energy consumption? How should factors such as gross domestic product, housing, population and disposable income be taken into account? How was energy 'intensity' (that is, energy consumption across the domestic, industrial, transport and services sectors) calculated and presented in the data sets? What was the relationship between energy consumption in primary equivalents and final energy consumption?

Phil:  What do you mean by energy consumption in primary equivalents and final energy consumption?

Mohammad: By this I mean the difference between the amount of energy generated and the final amount consumed. Some energy is lost in the production and transmission of power: burning coal to generate electricity loses some of the coal's energy in the process, and further power is lost when that electricity is transmitted via pylons, for example.

I needed to answer all of these questions and more to choose the best variables. Based upon these findings, I subdivided the key parameters into three distinct groups:

  1. Economic factors and related energy consumption variables
  2. Energy intensities by sector (domestic, industry, transport and services)
  3. Energy consumption in primary equivalents and final energy consumption.

Phil:  OK, so how did you go about selecting the best model?

Mohammad: SAS offers a wide array of modelling procedures; and choosing which model to use depends upon a clear understanding of the analytical problem and how much you know about the various statistical modelling methods available. Of course, you also need solid diagnostic skills.

To meet the challenge, it was essential to reduce the number of variables analysed to only those that were relevant; this is known in statistical parlance as 'reducing dimensionality'. I also needed to take data quality into account, and standardisation was needed because some figures were expressed in thousands and others in millions. In addition, some energy consumption data was expressed in tonnes of oil equivalent and some in terawatt-hours, so these units had to be converted.
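The unit standardisation described above can be sketched as follows. This is a minimal illustration in Python rather than SAS. It relies on the standard definition that 1 tonne of oil equivalent is 11.63 MWh; the function names and sample figures are hypothetical and are not taken from the competition data.

```python
# Standard definition: 1 tonne of oil equivalent (toe) = 11.63 MWh,
# so 1 million toe (Mtoe) = 11.63 TWh.
TWH_PER_MTOE = 11.63

def ktoe_to_twh(ktoe):
    """Convert thousand tonnes of oil equivalent (ktoe) to terawatt-hours."""
    return ktoe / 1000.0 * TWH_PER_MTOE

def to_units(value, scale):
    """Rescale a figure quoted in 'thousands' or 'millions' to plain units."""
    factors = {"units": 1, "thousands": 1_000, "millions": 1_000_000}
    return value * factors[scale]

# Hypothetical figures, for illustration only.
print(ktoe_to_twh(1000))            # 1000 ktoe is 1 Mtoe
print(to_units(27, "thousands"))    # a figure quoted in thousands
```

Converting everything to one unit and one scale before modelling keeps the variables comparable, which matters for the dimensionality reduction step that follows.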

Phil:  How did you go about reducing the number of variables, the 'dimensionality' as it's called?

Mohammad: There are a number of ways to reduce dimensionality. Methods such as 'factor analysis' and 'principal component analysis' can be applied on their own, or you can combine dimensionality reduction with a regression model to obtain a powerful unified approach known as 'partial least squares regression'. Of course, SAS provides the ability to do all of this.
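To make the idea of combining dimensionality reduction with regression concrete, here is a minimal single-component partial least squares sketch in plain Python. It is not Mohammad's SAS model: the data are hypothetical, the function names are made up for illustration, and a real analysis would use several components and a dedicated procedure.

```python
def center(col):
    """Subtract the mean from every value in a column."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def pls1_one_component(columns, y):
    """Fit a one-component PLS model for a single target.

    columns: list of predictor columns; y: target values.
    Returns (weights, scores, fitted values on the centred scale).
    """
    X = [center(c) for c in columns]
    yc = center(y)
    # Weight of each predictor: its covariance with the target, normalised.
    w = [sum(a * b for a, b in zip(col, yc)) for col in X]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: one combined factor value per observation.
    t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(len(yc))]
    # Regress the centred target on that single factor.
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    fitted = [q * v for v in t]
    return w, t, fitted

# Hypothetical data: two perfectly collinear predictors and one target.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]
y = [3, 6, 9, 12, 15]

weights, scores, fitted = pls1_one_component([x1, x2], y)
print([round(v, 3) for v in weights])
print([round(v, 3) for v in fitted])
```

In this toy case the two predictors collapse into one factor that reproduces the centred target, which is the sense in which the method reduces dimensionality and fits a regression in one step.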

Phil:  So which fundamental questions were you trying to answer?

Mohammad: I was trying to address two key questions. Firstly, how much variation within the predictor variables (those variables which explain the values of other variables, sometimes known as independent variables) could I explain? For example, atmospheric temperature could explain energy consumption: as it gets colder, more people put on their heating and hence use more power. Secondly, how much variability in the target variables could be explained by my choice of predictor variables? In other words, my target variables concerned energy consumption in 2020, so to what extent did the predictor variables I had chosen help to explain, and hence forecast, that?

Phil:  So what results came out of this process?

Mohammad: My dimensionality reduction techniques reduced the large number of variables into a handful of factors. Then the partial least squares model generated what are known as factor loadings, weights and scores, which helped me to explain how much each factor contributed to the final forecast and how accurate those forecasts would be. Also, examining the various models' outputs and their associated diagnostic plots helped me to shape the final prediction process.

Obviously, trying to predict a single value (energy consumption in 2020) has a large amount of uncertainty associated with it. So, I ran the model a number of times using different inputs. I tried broad economic factors, electricity consumption and energy intensity (consumption) for each specific economic sector, and finally I used randomisation as a means of assessing my model's ability to differentiate between irrelevant (noise) variables and those with real predictive power. This allowed me to forecast electricity consumption for the UK in 2020 with a difference of approximately 80 TW-h (terawatt hours) between the highest and the lowest predicted value.
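One common form of randomisation check is a permutation test: shuffle a predictor so that any real link to the target is destroyed, and see how often the shuffled version looks as strong as the original. The sketch below is an assumption about what such a check could look like, not a description of the method used in the winning entry; the data and function names are hypothetical.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=1000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(xs)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(shuffled, ys)) >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical data, for illustration only.
predictor = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
target = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1]

observed_r = pearson_r(predictor, target)
p_value = permutation_p_value(predictor, target)
print(round(observed_r, 3), p_value)
```

A predictor with real explanatory power should rarely be matched by its shuffled copies, whereas a noise variable will be matched often.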

Phil:  Amazing, so what did you find out?

Mohammad: I predict that the overall demand for electricity in the UK in 2020 will be 527 TW-h (+/- 30 TW-h). This represents an increase of 14.6% relative to 2013. Given the potential growth in population, housing and usage of electrical devices in the UK in the next few years, I think this is pretty accurate.
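As a quick check on the figures in this answer, the forecast and the stated percentage increase together imply a 2013 baseline of roughly 460 TW-h, and the +/- 30 TW-h band implies a range of 497 to 557 TW-h. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
forecast_twh = 527.0   # central forecast for 2020, from the interview
margin_twh = 30.0      # stated +/- band
increase = 0.146       # stated 14.6% increase relative to 2013

# The baseline follows from: forecast = baseline * (1 + increase)
implied_2013_twh = forecast_twh / (1 + increase)
low = forecast_twh - margin_twh
high = forecast_twh + margin_twh

print(f"Implied 2013 demand: {implied_2013_twh:.1f} TW-h")
print(f"Forecast range for 2020: {low:.0f} to {high:.0f} TW-h")
```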

Finally, I would like to say that I am delighted to have been named as the first winner of this competition. From my experience, the most appealing thing about this competition was the challenge of taming a large volume of data, drawing valuable insights from it, and relating those findings to the real world we live in. This is what Big Data Analytics is all about.

UK firms are struggling to find the big data skills they need. Click here to read new research by SAS and Tech Partnership highlighting the extent of the problem facing British businesses.





Graphs that make you go hmm... (early voting data)

Back in the 90s, there was a song by the C&C Music Factory about things that just didn't quite make sense - the song was called Things That Make You Go Hmm.... And in that same spirit, this blog post is about Graphs That Make You Go Hmm...

I'm not really into politics, but I look forward to elections just so I can see what they do with all the data. Some of the graphs are good, some are bad, and some just make me go "Hmm... what were they thinking???". Here's an example of such a graph that appeared in our local newspaper:


At first glance, the graph seemed OK, and told me that as working-age voters get older they tend to do more early voting, and then after they retire they do less early voting (I presume that's partly because retired people don't need to worry about their work schedule any more, and partly because there are fewer older people still living).

But then I started looking at the finer details of the graph, such as tick marks along the axes. Why did they choose 17 as the starting point for the age axis? You have to be 18 to vote. And the horizontal axis tick marks were spaced by 11 years ... until it reached age 61, and then they incremented by 10 years, and then went back to 11 years. Why? And why didn't they choose some increment of 5000 for the tick marks on the vertical axis? Why did they include a shadow to the right of the bars? The shadow makes the entire graph be visually biased towards the right. The hover-text for the bars shows values such as 'Count: 36800' - why not format the number with a comma like the axis values, maybe something like 'Voters: 36,800'. Hmm...
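Two of those fixes, round tick increments on the vertical axis and comma-formatted hover-text, take very little code. Here is a minimal sketch in Python rather than SAS; the function names are hypothetical, and the 36800 value is the one quoted from the original graph's hover-text.

```python
import math

def axis_ticks(max_value, step=5000):
    """Tick values from 0 up to the first multiple of `step` >= max_value."""
    top = math.ceil(max_value / step) * step
    return list(range(0, top + step, step))

def hover_text(count):
    """Format a bar's hover-text with a thousands separator."""
    return f"Voters: {count:,}"

print(axis_ticks(36800))
print(hover_text(36800))   # Voters: 36,800
```

Rounding the top of the axis up to the next multiple of the step also avoids an axis that is far taller than the data needs.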

But the thing that really just made me SMH - when I re-sized my browser window, the vertical axis got re-drawn with a scale way taller than needed (making the bars very short), and the most significant digit of the highest value on the axis got chopped off (at first glance, I thought it was 30,000 instead of 100,000). I guess whatever dynamic/interactive software they're using to draw the graph is just too fancy. Hmm...


So, I decided to create my own SAS version of this graph, and try to avoid all the problems shown above. Here's what I came up with. Can you name all the improvements? What else would you change?



Sports analytics - visualize your results!

In sports these days, there's a lot more data to keep track of than just the score! How can you make sense of it all? Being the Graph Guy, of course I recommend graphing it!

Here's an example that's up close and personal for me - dragon boat racing... Below is a cool head-on picture of our team paddling a dragon boat back to the dock after a race (note that the boat is over 40 feet long, and there are 22 people in it). We are Raleigh Relentless!


When our team goes to a race it's an all day event, with each team racing in several heats against various combinations of the other teams. In each heat, you can easily see how you did against the other boats in that heat, but what's more interesting and important is how your team is performing compared to all the other teams across the entire day. The times are available in tabular form, but I created the following chart that I think provides a much better way to make sense of the data:


A nice visual analysis of sports data can help capture the fans' attention, help teams know which strategies work better than others, or even help with logistics and seating charts in the stadiums.  I've prepared a collection of examples to show you a few of the ways SAS can be used to help visualize sports-related analytics - hopefully you can reuse some of these examples with your own data. Click the screen-capture below to see the samples page:


 So, what's the most interesting sport to you, from a data perspective? How do you analyze (or wish you could analyze) the data from this sport?


Are you at risk? Which blood types do vampires prefer?

SAS software has long been used to help analyze 'risk' - what about using it to help determine your risk of being attacked by a vampire?!?

On a previous Halloween, I was the victim of a Vampire attack. Here's the photographic proof...


Since I have the most common blood type, O+, I thought I was at low risk of attack (with vampires allegedly preferring the rarer blood types, according to this report) -- but I guess a really hungry vampire will lower her standards from time to time.

So, what is your blood type, and how common or rare is it in the human population? Assuming that vampires prefer the more rare blood types, what is your risk of a vampire attack? And if there is geographical clustering of the rare blood types, might vampires be lured to those areas (assuming they use analytics to determine such things)?

I found some data about blood types by country, and plotted the values on a world map. Perhaps you can analyze these maps to help determine the rarity of your blood type, and whether you live in a vampire-prone area of the world. I've included small thumbnails of my maps below, and you can click on them to see the full size maps with html hover-text for each country:













Tracking Ebola: Using SAS bubble maps

In a previous blog post, I showed how to layer colored areas on a SAS map to show both countries, and the areas within countries that had cases of Ebola. But as the Ebola epidemic has spread, more data has become available, and in this blog I show how to represent that additional data by annotating bubble markers on the map.

While perusing the Internet for the latest Ebola information, I came across a map that the World Health Organization maintains.  It is interactive, and lets you pan and zoom the map. And as you zoom, it shows progressively more detail. It's a pretty cool map. After zooming-in, I did a screen-capture to include below:


But as I studied the map to try to determine the current status of Ebola, I noticed a few problems. The shades of color for the land were very similar to the colors of the bubbles, even though they were representing slightly different things. Also, the colors used were transparent, and therefore when they overlapped (such as bubbles overlapping other bubbles, or overlapping land) then the transparent colors combined and produced darker colors -- which made it very difficult to determine which legend colors they matched.

Therefore I decided to create my own SAS map, and use solid/non-transparent colors (so there was no color blending), and also use very different/distinct colors for the land and the bubbles (so it is easier to match up to the legend).
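The color-blending problem is easy to reproduce with the usual "over" compositing rule, in which each channel becomes alpha times the top color plus (1 - alpha) times whatever is underneath. The sketch below is a Python illustration with hypothetical colors and a made-up function name; it is not taken from the WHO map or from the SAS code.

```python
def blend_over(top_rgb, alpha, bottom_rgb):
    """Composite a color with opacity `alpha` over an opaque background."""
    return tuple(round(alpha * t + (1 - alpha) * b)
                 for t, b in zip(top_rgb, bottom_rgb))

white = (255, 255, 255)
red = (255, 0, 0)

one_layer = blend_over(red, 0.5, white)        # one bubble on a white map
two_layers = blend_over(red, 0.5, one_layer)   # a second bubble on top
solid = blend_over(red, 1.0, one_layer)        # opaque: background ignored

print(one_layer, two_layers, solid)
```

With 50% opacity the overlap of two identical bubbles is a deeper red than either bubble alone, so it no longer matches any legend entry. With solid colors the result is always exactly the legend color.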

For the main bubbles, I used the SAS %centroid() macro to determine the location of the center of each region in the map, and I then used the annotate pie function to draw the bubbles.
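For readers curious about what "the center of a region" can mean, one standard definition is the area-weighted centroid of the region's boundary polygon. The Python sketch below implements that formula for a single simple polygon. It is an illustration only: the internals of the SAS %centroid() macro are not described here and may differ, and the function name and sample polygon are hypothetical.

```python
def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0          # twice the signed area of the polygon
    cx = cy = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

# Hypothetical region: a 4-by-2 rectangle.
region = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(region))
```

A bubble drawn at this point sits in the visual middle of a compact region, although for strongly concave shapes the centroid can fall outside the boundary.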

I used specific lat/long coordinates for the locations of the Ebola Treatment Centers, and overlaid several pieces of geometry (all using variations of the annotate pie function) to create a white bubble with a red cross.

Below is a snapshot of my SAS version of the map. Click here to see the interactive version, which has html hover-text over all the land areas and bubbles, and drill-down for the Ebola Treatment Centers.


Here is a link to the SAS code, if you'd like to experiment with the map.


What can universities do to fill the analytics skills gap?

Who do you want to be when you grow up? And can I offer you a suggestion?

Considering the huge shortfall in analytical talent we’re facing, we should be asking those two questions more often. Many of the people who are considering a change in careers or searching for their first career don’t realize there are rewarding – even lucrative – career opportunities in data analytics. What can we do to help them see themselves as data pros?

Dr. Michael Rappa is the founding director of the Institute for Advanced Analytics and a professor in the Department of Computer Science at North Carolina State University. Jennifer Priestly is a Professor of Applied Statistics and Data Science and the Director of the Center for Statistics and Analytical Services at Kennesaw State University. Both directors have strong opinions on how to provide the market with the type of graduates who can tackle the big data problems that change the world. (Sound hokey? Then check out Project Data Sphere and UN Global Pulse.)

Filling the talent gap isn’t going to happen overnight, but with a few purposeful steps we can make it happen. Here are a few of Priestly and Rappa’s recommendations:

  1. Teach differently. Priestly says universities need to innovate just as the private sector is. “We can’t teach the way we have always taught,” she says. “Universities need to use the new resources and tools and change their thinking to fit this generation of problems and this generation of learners.”
  2. Make the path clear. Universities can no longer assume that students will self-select courses that will help them match up to employer expectations. “We have to purposefully direct that path by changing course offerings, educate students on the opportunities and incentivize them to pursue that career.” (North Carolina State University (NCSU) was the first to create an M.S. in Analytics.) Rappa says NCSU stopped thinking in terms of individual courses and developed a 10-month, full-time intensive curriculum. The cohort moves through the program together and learns as much from one another as from the course work.
  3. Look at the employer as the customer. In one of my undergraduate marketing courses, we were asked to research what universities could do to increase enrollment and improve student success. Our survey results were limited to a small segment of the student body, but it seemed conclusive that universities should treat students and prospective students as the customer. Rappa disagrees (and so do I, now that I’m in the workforce). He says universities will set the students up for success in the workforce by asking employers what kind of graduate will help them move their organization forward.
  4. Create team players. According to Rappa, the traditional response when an employer says that graduates are missing a skill set is to develop a course that teaches the skill. But with teamwork, that’s not really a skill you can learn without living it. “If you want team players, students need to work in teams,” says Rappa. He says that they are successful because they work together to solve a real problem from sponsoring organizations’ real data.
  5. Develop lifetime learners. Employers expect applicants to have knowledge of the current analytics technology, but they don’t expect or want them to show up on the first day knowing everything there is to know about analytics. What they want is someone who is curious and hungry for more problems to solve. Universities need to produce someone who can be productive from the start. You can do that by giving them interesting and challenging projects to work on – give a purpose to their learning.
  6. Partner with corporations to gain real – even messy – data. The answers to real-world problems can’t be found in the back of the textbook. Priestly says that universities should call on organizations to sponsor contests or give students the kind of data sets they can expect to see in their career.
  7. On-the-job training. Provide internship opportunities that introduce the students to real-world uses of analytics. Assign them to projects that capitalize on their creativity. Priestly says to encourage organizations to bring students a problem to solve. “It’s a heck of a lot cheaper than hiring a consultant firm and while you’re at it you will help create a pipeline of experienced talent,” she says.

Rappa’s message to universities: “Be bold. Break free of your comfort zones to educate students in powerful ways that justify the kinds of loans students take out …. The message to industry is push back on universities in your area. Incentivize them to break free and offer new programs.”
