Higher education and analytics

It's my favorite time of year! The leaves are changing. Football is back. And it's also time for our annual Analytics conference.

One of the best parts about my job is getting to attend the conference each year and host the Inside Analytics video series.

Not everyone at the conference gets the chance to have one-to-one time with so many speakers.

My first interview was with Dr. Goutam Chakraborty, professor of marketing at Oklahoma State University.


If you're at the conference this week, here's my list of the six things you need to do.

Look for more updates this week. The conference runs from Oct. 20-21.


How do men rate women on dating websites? (Part 2)

I always recommend looking at data in several different ways, to get a more complete picture of what's really going on - such is the case with the member 'ratings' on dating websites. Let's take a look at some data from a different angle...


In a recent blog post, I analyzed which ages men and women found most attractive in the opposite sex. The graphs indicated that men rated 20-year-old women the most attractive, whereas women rated men closer to their own age most attractive. This sparked quite a bit of discussion (such as the comments in the cross-posting of the blog on allanalytics.com).

So I decided to look at the ratings data in a different way - this time ignoring age, and just looking at how men and women rate each other in general. I found some histograms on p. 16 of Christian Rudder's new book Dataclysm that showed almost what I was looking for, and I then used some graphs from his blog to estimate the data so I could create similar charts in SAS.

Whereas the men of all age groups consistently rated 20-year-old women the most attractive (which produced a very lopsided chart), their ratings of all women in general produced a very symmetrical chart. In Rudder's book he even describes it as "close to what's called a symmetric beta distribution - a curve often deployed to model basic unbiased decisions." Therefore it appears that men are very unbiased/honest in the way they rate women.
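To make the "symmetric beta distribution" idea a little more concrete, here is a minimal Python sketch (not the SAS code used for the charts). A beta distribution is symmetric when its two shape parameters are equal, so the curve mirrors around its midpoint. The shape value of 3 is an assumption chosen only for illustration; it is not estimated from Rudder's data.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def is_symmetric(a, b, points=(0.1, 0.25, 0.4)):
    """Check that the density mirrors around 0.5 at a few sample points."""
    return all(
        math.isclose(beta_pdf(x, a, b), beta_pdf(1 - x, a, b)) for x in points
    )

# Hypothetical shape parameter, chosen only for illustration.
SHAPE = 3.0

if __name__ == "__main__":
    # Print the curve at a few points; values left and right of 0.5 match.
    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"x={x:.1f}  density={beta_pdf(x, SHAPE, SHAPE):.3f}")
```

With unequal shape parameters (say 2 and 5) the same check fails, which is one simple way to see how lopsided a ratings curve is.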


By comparison, women rated men very poorly. Rudder mentions that women only rate one guy in six as "above average."


What causes this huge difference in how men and women rate each other? Is one being more honest than the other? Are they rating based on different criteria (perhaps men are rating based on looks, and women are rating based on whether or not they think the men would make a good mate)? Perhaps women are hesitant to rate a man highly, because they know that will trigger okcupid to send that man a message letting him know which woman rated him highly? What other factors might be influencing this data?

Feel free to leave your thoughts & theories on this topic in the comments section!



Is Hadoop the answer to big data?

Having spent a quarter of a century working on databases and database-related technologies, I have developed an aura of skepticism toward any new product that hits the market billed as the best thing we have ever seen. It’s not that I love to revel in “I told you so” moments; it’s just that I have seen too many products fly high in the sky only to disappear like meteors.

For many, Hadoop’s entrance into the database field meant that technology had finally come up with the only possible instrument equipped with a framework capable of handling “big data.” On top of that, its affordability unequivocally meant that the end was in sight for the traditional relational databases that had so far dominated the scene. Today, after much time and effort spent on integrating Hadoop into their environments, many of the companies that were quick to jump on its bandwagon are discovering that, despite having an important role in their infrastructure, Hadoop is not the godsend that many thought it would be.

Why is that? The explanation is simple. At the end of the day, Hadoop is another technological tool, just like its relational database counterparts. Big data, on the other hand, is not about technology, but rather about business needs. This means that Hadoop shouldn’t be considered the sole player in the field of data analysis. For example, it makes sense to use Hadoop to run broad exploratory analysis of large data, but a relational database is still a better option for operational analysis of what was uncovered. Hadoop is also good for looking at the lowest level of detail in a data set, but relational databases make more sense when it comes to storing transformed and aggregated data. As Facebook analytics chief Ken Rudin puts it, “you need to use the right technology to fit your business needs.”

A recent survey commissioned by an IT company found that more than 30% of the companies interviewed had already deployed Hadoop, with an additional 30% planning to deploy it within 12 months. One interesting finding was that the majority of these companies planned to combine Hadoop’s data analysis capabilities with those provided by other databases already integrated into the companies’ infrastructures. According to the study, the goal was and still is to use Hadoop to perform raw data analysis, while using traditional databases to take care of non-analytic workloads, especially transaction-oriented ones, and to perform data analysis on aggregated data coming from Hadoop.

Take eBay, for example. The San Jose, Calif.-based company’s three-tier data analytics approach is an example of the kind of role Hadoop can find within an organization alongside other traditional relational databases. Structured data resides in the first tier, an enterprise data warehouse that is used for daily housekeeping items, such as feeding business intelligence dashboards and reports. The second tier consists of a Teradata data management platform that is used to store huge amounts of semi-structured information. Fully unstructured data such as textual information lives in the third tier, a Hadoop cluster reserved for deeper research, analysis and experimentation.

The moral of the story is that Hadoop is not a synonym for big data, but one of the many players you need to mine and analyze your data. A good reason to hang on to those other databases a little longer.

I’ll be talking about big data and Hadoop at Analytics 2014 along with Josh Wills from Cloudera and my SAS colleagues Wayne Thompson and Kelly Hobson. Check out our panel presentation and round table discussion on Hadoop. We hope to see you there!

  • Panel discussion with SAS and Cloudera on Big Data and Hadoop: Moving beyond the hype to realize your analytics strategy with SAS® - Monday, October 20, 3:00-3:50 pm
  • Round Table discussion on Practical Considerations for SAS Analytics in a Hadoop Environment – Tuesday, October 21, 12:30-1:45 pm

You can also check out our starter services on Visual Analytics and Visual Statistics and the Expert Exchange for Hadoop.


The SAS model factory – a big data solution

Do you have too many models to build, too many to manage, too few analytic resources or too much data?  A Model Factory may be your answer.

The mindset of analytics is changing.  It is shifting from a “craftsman”-dominated culture, in which multiple weeks were spent cycling through data and developing a model, to a production-oriented environment where analytically derived information almost instantaneously follows the strategic conceptualization of ideas.

This transformation is significantly accelerated by the integration of the SAS Model Factory.

The idea of a “Model Factory” may bring to mind a mechanical age of smokestacks and assembly lines.  When Henry Ford revolutionized the car making process by introducing the assembly line – the process that is still used worldwide in auto manufacturing today – he laid the foundation for the democratization of the car. The assembly line reduced the cost of making a car enough that it could be sold to a much larger audience.

What do we really mean by Model Factory?

A factory is a place where something is made or assembled quickly and in great quantities.

A model factory is a process in which predictive models are built automatically, quickly and in great quantities, enabling an automated scoring process.

Why would you use a Model Factory?

  • Perhaps you have limited technical and/or analytic resources.
  • You have too many models to build and manage because you have various target variables and/or you segment your customers prior to modeling.
  • If you have thousands of customer attributes, you may need to select only a subset that is appropriate for each model.
  • Perhaps you need to perform repetitive data preparation with variable transformations, handling of missing values, etc.
  • You have Big Data which slows down model building and scoring.
  • In brief, you are unable to build models fast enough.

Can the model factory process be automated?

It consists of:

  • Model Initiation
  • Model Development
  • Model Deployment
  • Model Monitoring
  • Model Recalibration/Rebuild
  • Model Retirement

From a Factory Perspective, it looks like:

[Figure: SAS model factory]

You choose to write a code-based Model Factory

You can use Base SAS and SAS/STAT with the High Performance Procedures to enable hundreds or thousands of models to be built automatically on as much data as you have.  With the needed code, your data will be structured properly.  Transformations and missing values will be handled automatically.  Good enough models will be built.  And, no analytical skills will be needed to run the process.

Model Factory Deployment

  • Run Macro Driven Code
  • Parameter file
      – Manual entry
      – Point-and-Click entry
  • Code processes parameter file and data
  • Code runs analytic models
  • Model Factory code produces Scoring code
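The flow in the list above (a parameter file drives the code, the code prepares the data and fits a model per entry, and the result is something that can score new records) can be sketched in a few lines. This is only an illustrative Python sketch, not the SAS macro code the workshop covers: the parameter entries, field names, sample rows, mean imputation, and one-variable least-squares model are all assumptions made for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def impute_mean(values):
    """Replace missing values (None) with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def run_factory(params, rows):
    """Build one model per parameter entry; return scoring functions by name."""
    scorers = {}
    for p in params:
        subset = [r for r in rows if r["segment"] == p["segment"]]
        xs = impute_mean([r[p["input"]] for r in subset])
        ys = [r[p["target"]] for r in subset]
        intercept, slope = fit_line(xs, ys)
        # Bind the fitted coefficients so each scorer keeps its own model.
        scorers[p["name"]] = lambda x, a=intercept, b=slope: a + b * x
    return scorers

# Hypothetical "parameter file": one entry per model to build.
PARAMS = [
    {"name": "model_a", "segment": "A", "input": "visits", "target": "spend"},
    {"name": "model_b", "segment": "B", "input": "visits", "target": "spend"},
]

# Hypothetical data, including one missing input value.
ROWS = [
    {"segment": "A", "visits": 1, "spend": 2},
    {"segment": "A", "visits": 2, "spend": 4},
    {"segment": "A", "visits": 3, "spend": 6},
    {"segment": "B", "visits": 1, "spend": 5},
    {"segment": "B", "visits": None, "spend": 7},
    {"segment": "B", "visits": 3, "spend": 9},
]

if __name__ == "__main__":
    scorers = run_factory(PARAMS, ROWS)
    for name, score in scorers.items():
        print(name, score(4))
```

Adding a model means adding a line to the parameter list rather than writing new modeling code, which is the point of the factory approach.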

SAS has other solutions for model building

If you have fewer models to build and/or you have the needed analytic resource for model development, these Point-and-Click solutions may be sufficient:

  • Enterprise Miner
  • Rapid Predictive Modeler – run from Enterprise Guide

What can you do to build a Model Factory?

  • Take classes in Data Mining techniques
  • Read documents about data mining
  • Have internal working meetings to review goals and desired results
  • Engage consultants

In summary, we understand that you have experienced the chaos associated with building and maintaining a multitude of models.  The solution to your modeling problems may be the Model Factory Solution which replaces the chaos with automation, efficiency, and repeatability.  For more information, you may contact the author.  For more on this topic, attend the SAS Model Factory pre-conference workshop at Analytics 2014 in Las Vegas on Sunday, October 19, 2014, 1-5 pm.


How does Amazon deliver packages so quickly?

With Amazon Prime's 2-day shipping, I seldom go to physical stores any more. How do they deliver items so quickly? Let's analyze some data to find out...

There are very few services/memberships that I truly feel like I'm getting "a good deal" for my money - and Amazon Prime is one of them. Amazon has a huge product selection, with detailed information about each product. There's also a good search engine that always seems to work the way I want it to. And they have a large customer base, and the customers frequently post very useful product reviews & ratings. But the feature that impresses me the most is the 2-day shipping! I used to hate ordering things through the mail, because it typically took 5-10 days - but with Amazon Prime's free 2-day shipping, I usually have the item quicker than I could have found the time to drive to physical stores shopping for it.

For example, here's an exact replacement antenna I recently bought on Amazon for my vintage 1980's Conion boombox. Believe me - I could have driven around town for weeks looking in electronics stores, and still not found one!


How does Amazon deliver their packages so quickly? Some claim that they use a fleet of unmanned drone aircraft to deliver their packages. They don't - or at least not yet! (see Amazon Prime Air proposal)


But what they do have is a network of huge distribution warehouses, strategically placed across the US. So, while you're shopping online, they check to see if the item you want is in a warehouse close enough to your location that it could be delivered within 2 days. And, of course, they have people working in the warehouses around the clock pulling the items you order, and packaging them to ship immediately after you order them.

I was curious which warehouse(s) were closest to me, and found a map on the Amazon website. I could see that three states bordering North Carolina have a distribution center, but I couldn't tell exactly where the warehouses were located within each state. I did a bit more searching and found an article that listed the addresses of the warehouses. With that info, I was able to use Proc Geocode to estimate the latitude/longitude of each warehouse, and plot them on a SAS map (click the map below to see the interactive version with html hover-text over each marker, and links to bring up a Google satellite map of each location):
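Once each warehouse has a latitude/longitude, finding the closest one is a matter of computing great-circle distances. Below is a minimal Python sketch of that step using the haversine formula (this is not the SAS code behind the map). The site names and coordinates are made-up placeholders, not real warehouse locations, and the home location is likewise hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest(home, sites):
    """Return the (name, lat, lon) site closest to the home (lat, lon)."""
    return min(sites, key=lambda s: haversine_km(home[0], home[1], s[1], s[2]))

# Placeholder coordinates for illustration only.
HOME = (35.8, -78.6)
SITES = [
    ("Site A", 34.0, -81.0),
    ("Site B", 37.5, -77.5),
    ("Site C", 36.0, -86.5),
]

if __name__ == "__main__":
    name, lat, lon = nearest(HOME, SITES)
    print(name, round(haversine_km(HOME[0], HOME[1], lat, lon)), "km")
```

Straight-line distance is only a rough proxy for delivery time, since trucks follow roads and carrier schedules, but it is enough to rank candidate warehouses.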





How do men rate women on dating websites?

What age women do men prefer on dating websites? - Let's have a look at the data...

When I first started using computers in the early 80s, I thought it would be great to have everyone take a survey, and then let computers show you who your best matches were. The computer would be a modern day Cupid! ... I don't know what Cupid really looks like, but here's a picture of a likely candidate my friend Reggie has in his collection of antiques:


Speaking of Cupid & online matchmaking ... one of the most popular free dating websites is called okcupid. Each user provides some personal information when they register (such as age), and optionally answers questions on various topics. Users also have the opportunity to interact with other users in various ways such as sending messages and 'rating' the other users on a scale from 1 to 5 ... and the people who run the website have access to all this data!

Christian Rudder was one of the founders of okcupid, and was in charge of their analytics team. His job was to "make sense of the data their users created" - what a great job, eh!?! And he has shared some very interesting graphical analyses in blogs and articles. One of his recent articles analyzed how people rated others' profiles on their dating site, and for each age (20-50) it showed the age of the people those users rated the highest.

Who did the men rate highest? For almost every age of men, the age of the women they rated highest was around 20. Below is my SAS graph very similar to their graph in the article:


It's an interesting graph, but I had to study it for a few minutes, and read the article, to be able to understand exactly what it was saying. And as usual, I couldn't leave well enough alone, and decided to try to make a few improvements to it...

First, I decided to sort the bars so that the older men are at the top (instead of at the bottom), as is customary with population pyramid charts. This small change helped make the graph's layout more familiar to me, and more logical.


Another thing that needed work - all those tiny numbers on the bars were difficult to read (I'm not getting any younger, you know!). So I decided to go with the more traditional approach of showing the numbers as tick marks along the axis, with reference lines. I made my axis symmetrical around the origin (zero). I also added html hover-text to my html output so you can easily see the exact values for a specific bar (click this link, or the graph below, to see the interactive version with hover-text).


I also decided to make the colors a bit more meaningful/mnemonic ... blue for guys, and pink for girls. And one last enhancement that I think is very important - I added a descriptive title to the graph, so people would know what it represents, without having to read through the text of the article.


Now that we've analyzed the men's preferences, how about the women?


Hmm ... so the men all rate the 20 year old women the highest, and the women rate men who are close to their own age the highest? Why the difference? What factors might be affecting this data?

I have a few theories, but first I invite you to share your theories in a comment!



Machine Learning at Scale with SAS and Cloudera

Imagine being able to get into your car and say “Take me to work.” Then, it automatically drives as you read the morning paper.  We’re not there yet. But we’re closer than you think. Google has already developed a prototype for a driverless car in the U.S.  Driverless cars are just one example of machine learning. It’s used in countless applications including those that predict fraud, identify terrorists, recommend the right products to customers at the right moment and correctly identify a patient’s symptoms in order to recommend the appropriate medications.

The concept of machine learning has been around for decades. What’s new is that it can now be applied to huge quantities of complex data. Less expensive data storage, distributed processing, more powerful computers, and the analytical opportunities available have dramatically increased interest in machine learning systems.

Machine learning focuses on the construction and study of systems that can learn from data. The goal is to develop deep insights from data assets faster, extract knowledge from data with greater precision, improve the bottom line and reduce risk.

Considerable overlap exists between statistics and machine learning. Both disciplines focus on studying generalizations (or predictions) from data.  A big difference between the two is that statistics focuses more on inferential analysis, drawing conclusions about a larger population from the sample that represents it. Statistics also looks at things like parameter estimates, error rates, distribution assumptions and so forth to understand empirical data with a random component.

Naturally you want a scalable machine learning platform that provides enterprise-ready storage, data processing, and management along with the analytics. The deep partnership of Cloudera and SAS provides modern distributed analytical products such as SAS In-Memory Statistics for Hadoop and SAS Visual Statistics collocated with your CDH5.0 cluster.

To learn more about Big Data and machine learning with SAS and Cloudera, check out the panel presentation and round table discussion on Hadoop at Analytics 2014 in Las Vegas!

  • Panel discussion with SAS and Cloudera on Big Data and Hadoop: Moving beyond the hype to realize your analytics strategy with SAS® - Monday, October 20, 3:00-3:50 pm
  • Round Table discussion on Practical Considerations for SAS Analytics in a Hadoop Environment – Tuesday, October 21, 12:30-1:45 pm

You can also check out our starter services on Visual Analytics and Visual Statistics and the Expert Exchange for Hadoop.


Big data becoming reality with SAS

In the last 5 years, the buzzword "big data" has spread like wildfire.  One could argue big data had been around prior, but during this time media outlets such as the Wall Street Journal and C-level executives started to take a keen interest in this topic. No longer a problem specific to the internet giants like Google, Yahoo, and Facebook, companies across all industries were beginning to explore data storage and management strategies beyond the traditional relational database realm.

Why the need for this shift? Data was generated at higher volumes at faster speeds, and was not always easy to work with.  Unstructured data, or data not organized or formatted in a predefined way, came from all directions: call logs, click streams, cheap sensor data, social media posts and many other sources.  A new strategy was necessary to capture and extract business value from this information.

One particular project, Hadoop, was taking off, and SAS was there to get in on the action.  Hadoop is an open-source software framework that runs on low cost commodity hardware and has the ability to scale to accommodate massive amounts of data.

The idea originated in the mid-2000s and gained serious momentum with the formation of the company Cloudera in 2008. Cloudera provides a platform for enterprise level customers to utilize the power of Hadoop.  Those enterprise level customers often have a need for analytics, so it is no surprise that SAS formed an alliance with Cloudera in 2013.

Currently at SAS, there are products such as In-Memory Statistics for Hadoop, Visual Analytics, Visual Statistics, and High-Performance Analytics, to leverage analytics on data in Hadoop. Big data is no longer just a buzzword: customers are able to use it to gain greater insights and a competitive advantage.

For more on this topic, come see the panel presentation and round table discussion on Hadoop at Analytics 2014 in Las Vegas!

  • Panel discussion with SAS and Cloudera on Big Data and Hadoop: Moving beyond the hype to realize your analytics strategy with SAS® - Monday, October 20, 3:00-3:50 pm
  • Round Table discussion on Practical Considerations for SAS Analytics in a Hadoop Environment – Tuesday, October 21, 12:30-1:45 pm

The taxman cometh - for Amazon.com!

Do you order things online, to avoid paying sales tax? Those "good old days" might be coming to an end soon...

Here's a snapshot of my latest purchase from Amazon.com (a little something for the Talk Like a Pirate party I had on Sept 19):


In the US, each of the 50 states handles sales taxes a little differently, especially when it comes to online purchases. In general, if an online retailer has a physical presence in your state (such as a store or warehouse), then that online retailer must charge you sales tax for your online purchases. And in my state (North Carolina), even if the online retailer does not charge you sales tax, you are supposed to pay a use tax when you file taxes at the end of the year.

As consumers have been buying more online, and less in local stores (for convenience, price, etc), the states have seen a decrease in sales tax revenue. Therefore many states are pressuring online retailers to collect sales taxes for the state - especially the large online retailers like Amazon.com.

Being an Amazon Prime customer myself, I wondered how many states currently force Amazon to collect sales taxes. I did a few searches, and found a nice detailed map in a Wall Street Journal article that showed what I was looking for. But their map was somewhat 'busy' (showing 4 different categories of taxation), and took a while for me to understand. Therefore I decided to create a simplified version using SAS.

In my SAS map, I made the states where Amazon.com has to collect sales tax red (and all other states a light/subdued color). I also added a timestamp, which will become important because the list of states that do/don't charge tax will likely change in the future.



Do you have to pay sales taxes for your online purchases? What are other countries doing? What's your suggestion on the best/most-equitable way to handle it?



Which cars get the most speeding tickets?

Is the type of car you drive more likely, or less likely, to get a speeding ticket? Let's analyze some data to find out!

Do red cars attract more attention from the police, and get more tickets? How about cars with a 'racing stripe'? Or cars with a big chromed motor, a blower, and side pipes (such as the one in the picture below that I took at a local car show)?  Zoom, zoom!


Of course, cars don't get speeding tickets - people do. But perhaps people who drive fast (and get lots of tickets) tend to drive certain types of cars? A recent CNN article used data for people who had gotten a quote from insurance.com, and listed the top 20 cars ranked by the percentage of people wanting to insure that car who had a recent ticket.

I would imagine the insurance quote data is a little biased. For example, the people looking for an insurance quote might be more likely to have tickets than the general population (that might be why they're looking). But nonetheless, the data is 'interesting' so let's go with it!

The CNN article showed each of the top 20 cars on a separate page, which made it time-consuming to see all 20, and also made it difficult to compare them. Therefore I created a simple SAS bar chart to overcome those problems:


Seeing the 20 cars with the most tickets was interesting, but it made me curious about the 20 cars with the fewest tickets. Therefore I dug up that data, and created a similar bar chart for the fewest tickets. Note that I scaled it the same as the previous chart, so it would be easy to visually compare the two charts:


Of course, while I was scraping around to find the data for the above charts, I also got the data for all the cars in between (over 500 different models in all). And with all that data, I had to try to visualize it all at once! I created a scatter plot, with the data grouped by make along the vertical axis (similar to the bar chart layout), and sorted the makes by their average number of tickets.
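The data preparation behind these three charts (pick the top and bottom 20, then group models by make and sort the makes by their average) is simple to sketch. The Python below is only an illustration of that logic, not the SAS code behind the graphs. The makes, models, and percentages are invented placeholders, and the field name pct_ticketed (percent of quote-seekers with a recent ticket) is an assumed name for the measure.

```python
from collections import defaultdict
from statistics import mean

def top_n(records, n, highest=True):
    """Return the n models with the highest (or lowest) ticket percentage."""
    return sorted(records, key=lambda r: r["pct_ticketed"], reverse=highest)[:n]

def makes_by_average(records):
    """Group models by make and sort makes by average ticket percentage."""
    by_make = defaultdict(list)
    for r in records:
        by_make[r["make"]].append(r["pct_ticketed"])
    averages = {make: mean(vals) for make, vals in by_make.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

# Invented placeholder records, for illustration only.
RECORDS = [
    {"make": "Make A", "model": "A1", "pct_ticketed": 30.0},
    {"make": "Make A", "model": "A2", "pct_ticketed": 20.0},
    {"make": "Make B", "model": "B1", "pct_ticketed": 10.0},
    {"make": "Make C", "model": "C1", "pct_ticketed": 28.0},
]

if __name__ == "__main__":
    print([r["model"] for r in top_n(RECORDS, 2)])
    print([r["model"] for r in top_n(RECORDS, 1, highest=False)])
    for make, avg in makes_by_average(RECORDS):
        print(make, avg)
```

Sorting the makes by their average gives the scatter plot a meaningful vertical order, so the makes with the most-ticketed drivers land at one end of the axis.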


You can click any of the graphs above to see the interactive version, with html hover-text, and drilldowns that do a Google search for images of that vehicle!

Did any of the cars in the best and worst 20 surprise you? Do you own one of those cars, and can you confirm whether or not you have speeding ticket(s)? What other factors do you think influence your probability of getting a speeding ticket? Do you have any 'tricks' for not getting speeding tickets, that you'd like to share?

