How does Amazon deliver packages so quickly?

With Amazon Prime's 2-day shipping, I seldom go to physical stores anymore. How do they deliver items so quickly? Let's analyze some data to find out...

There are very few services or memberships where I truly feel like I'm getting "a good deal" for my money - and Amazon Prime is one of them. Amazon has a huge product selection, with detailed information about each product. Its search engine always seems to work the way I want it to. And its large customer base frequently posts very useful product reviews and ratings. But the feature that impresses me the most is the 2-day shipping! I used to hate ordering things through the mail, because it typically took 5-10 days - but with Amazon Prime's free 2-day shipping, I usually have the item sooner than I could have found time to drive around to physical stores shopping for it.

For example, here's an exact replacement antenna I recently bought on Amazon for my vintage 1980s Conion boombox. Believe me - I could have driven around town for weeks looking in electronics stores, and still not found one!

antenna

How does Amazon deliver their packages so quickly? Some claim that they use a fleet of unmanned drone aircraft to deliver their packages. They don't - or at least not yet! (see Amazon Prime Air proposal)

Amazon tests parcel delivery by drones

But what they do have is a network of huge distribution warehouses strategically placed across the US. While you're shopping online, they check whether the item you want is in a warehouse close enough to your location that it could be delivered within 2 days. And, of course, they have people working in the warehouses around the clock, pulling the items you order and packaging them to ship immediately.
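
The routing decision described above - is a warehouse that stocks the item close enough? - boils down to a nearest-neighbor search over the warehouse locations. Here's a toy sketch in Python (Amazon's real logistics are far more involved, and the warehouse list below is hypothetical):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3956 * asin(sqrt(a))

# Hypothetical warehouses: (name, lat, lon, item_in_stock)
warehouses = [
    ("Chester, VA",     37.35,  -77.44, True),
    ("Phoenix, AZ",     33.45, -112.07, True),
    ("Coffeyville, KS", 37.04,  -95.62, False),
]

def nearest_stocked(lat, lon):
    """Return the closest warehouse that actually has the item."""
    stocked = [w for w in warehouses if w[3]]
    return min(stocked, key=lambda w: haversine_miles(lat, lon, w[1], w[2]))

# A shopper in Raleigh, NC:
print(nearest_stocked(35.78, -78.64)[0])
```

If the nearest stocked warehouse is within the 2-day delivery radius, the order qualifies; the real system obviously weighs inventory, carriers, and cutoff times too.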

I was curious which warehouse(s) were closest to me, and found a map on the Amazon website. I could see that three states bordering North Carolina have a distribution center, but I couldn't tell exactly where the warehouses were located within each state. I did a bit more searching and found an article that listed the addresses of the warehouses. With that info, I was able to use Proc Geocode to estimate the latitude/longitude of each warehouse, and plot them on a SAS map (click the map below to see the interactive version with html hover-text over each marker, and links to bring up a Google satellite map of each location):

amazon_fulfillment_centers

How do men rate women on dating websites?

What age of women do men prefer on dating websites? Let's have a look at the data...

When I first started using computers in the early 80s, I thought it would be great to have everyone take a survey, and then let computers show you who your best matches were. The computer would be a modern-day Cupid! ... I don't know what Cupid really looks like, but here's a picture of a likely candidate that my friend Reggie has in his collection of antiques:

cupid

Speaking of Cupid and online matchmaking ... one of the most popular free dating websites is OkCupid. Each user provides some personal information when they register (such as age), and optionally answers questions on various topics. Users can also interact with other users in various ways, such as sending messages and 'rating' other users on a scale from 1 to 5 ... and the people who run the website have access to all this data!

Christian Rudder was one of the founders of OkCupid, and was in charge of their analytics team. His job was to "make sense of the data their users created" - what a great job, eh!?! And he has shared some very interesting graphical analyses in blogs and articles. One of his recent articles analyzed how people rated others' profiles on the dating site: for each age of rater (20-50), it showed the age of the people those users rated the highest.

Who did the men rate highest? For almost every age of men, the age of the women they rated highest was around 20. Below is my SAS graph, which is very similar to the one in their article:

okc_rating_men

It's an interesting graph, but I had to study it for a few minutes, and read the article, to be able to understand exactly what it was saying. And as usual, I couldn't leave well enough alone, and decided to try to make a few improvements to it...

First, I decided to sort the bars so that the older men are at the top (instead of at the bottom), as is customary with population pyramid charts. This small change helped make the graph's layout more familiar to me, and more logical.

okc_rating_men1

Another thing that needed work: all those tiny numbers on the bars were difficult to read (I'm not getting any younger, you know!). So I went with the more traditional approach of showing the numbers as tick marks along the axis, with reference lines, and I made the axis symmetrical around the origin (zero). I also added hover-text to my html output so you can easily see the exact value for a specific bar (click this link, or the graph below, to see the interactive version).

okc_rating_men2
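
The symmetric-axis idea from the paragraph above can be sketched generically (a Python illustration, not the SAS code behind the graph): plot one group as negative values so its bars extend left of zero, then label the ticks with absolute values so both sides read as positive counts.

```python
def pyramid_data(left_counts, right_counts):
    """Negate the left-hand group so its bars extend left of zero."""
    return [-c for c in left_counts], list(right_counts)

def symmetric_ticks(left_counts, right_counts, step):
    """Tick positions symmetric around 0, labeled with absolute values."""
    lim = max(max(left_counts), max(right_counts))
    lim = ((lim + step - 1) // step) * step   # round up to a whole tick
    positions = list(range(-lim, lim + 1, step))
    labels = [str(abs(p)) for p in positions]
    return positions, labels

# Made-up counts for three age groups:
left, right = pyramid_data([30, 45, 50], [25, 40, 55])
pos, lab = symmetric_ticks([30, 45, 50], [25, 40, 55], step=20)
print(left, right)
print(pos, lab)
```

Any plotting package can then draw horizontal bars from `left` and `right` and apply `pos`/`lab` as the axis ticks, giving the familiar population-pyramid look.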

I also decided to make the colors a bit more meaningful/mnemonic ... blue for guys, and pink for girls. And one last enhancement that I think is very important - I added a descriptive title to the graph, so people would know what it represents, without having to read through the text of the article.

okc_rating_men3

Now that we've analyzed the men's preferences, how about the women?

okc_rating_women3

Hmm ... so the men all rate the 20 year old women the highest, and the women rate men who are close to their own age the highest? Why the difference? What factors might be affecting this data?

I have a few theories, but first I invite you to share your theories in a comment!

Machine Learning at Scale with SAS and Cloudera

Imagine being able to get into your car and say "Take me to work," and then it drives automatically while you read the morning paper. We're not there yet, but we're closer than you think. Google has already developed a prototype for a driverless car in the U.S. Driverless cars are just one example of machine learning. It's used in countless applications, including those that predict fraud, identify terrorists, recommend the right products to customers at the right moment, and correctly identify a patient's symptoms in order to recommend the appropriate medications.

The concept of machine learning has been around for decades. What's new is that it can now be applied to huge quantities of complex data. Less expensive data storage, distributed processing, more powerful computers, and new analytical opportunities have dramatically increased interest in machine learning systems.

Machine learning focuses on the construction and study of systems that can learn from data. The goal is to develop deep insights from data assets faster, extract knowledge from data with greater precision, improve the bottom line and reduce risk.

Considerable overlap exists between statistics and machine learning; both disciplines focus on making generalizations (or predictions) from data. A big difference is that statistics leans more on inferential analysis - drawing conclusions about a larger population from a sample - and examines parameter estimates, error rates, distribution assumptions, and so forth to understand empirical data with a random component.

Naturally you want a scalable machine learning platform that provides enterprise-ready storage, data processing, and management, along with the analytics. The deep partnership between Cloudera and SAS provides modern distributed analytical products, such as SAS In-Memory Statistics for Hadoop and SAS Visual Statistics, collocated with your CDH 5.0 cluster.

To learn more about Big Data and machine learning with SAS and Cloudera, check out the panel presentation and round table discussion on Hadoop at Analytics 2014 in Las Vegas!

  • Panel discussion with SAS and Cloudera on Big Data and Hadoop: Moving beyond the hype to realize your analytics strategy with SAS® - Monday, October 20, 3:00-3:50 pm
  • Round Table discussion on Practical Considerations for SAS Analytics in a Hadoop Environment – Tuesday, October 21, 12:30-1:45 pm

You can also check out our starter services on Visual Analytics and Visual Statistics and the Expert Exchange for Hadoop.


Big data becoming reality with SAS

In the last 5 years, the buzzword "big data" has spread like wildfire. One could argue that big data had been around before then, but during this time, media outlets such as the Wall Street Journal and C-level executives started to take a keen interest in the topic. The problem was no longer specific to internet giants like Google, Yahoo, and Facebook: companies across all industries began to explore data storage and management strategies beyond the traditional relational database realm.

Why the need for this shift? Data was being generated at higher volumes and faster speeds, and was not always easy to work with. Unstructured data - data not organized or formatted in a predefined way - came from all directions: call logs, click streams, cheap sensor data, social media posts, and many other sources. A new strategy was necessary to capture and extract business value from this information.

One particular project, Hadoop, was taking off, and SAS was there to get in on the action.  Hadoop is an open-source software framework that runs on low cost commodity hardware and has the ability to scale to accommodate massive amounts of data.

The idea originated in the mid-2000s and gained serious momentum with the formation of Cloudera in 2008. Cloudera provides a platform for enterprise-level customers to utilize the power of Hadoop. Those customers often need analytics, which is why it is no surprise that SAS formed an alliance with Cloudera in 2013.

Currently, SAS offers products such as In-Memory Statistics for Hadoop, Visual Analytics, Visual Statistics, and High-Performance Analytics to leverage analytics on data in Hadoop. Big data is no longer just a buzzword: customers are using it to gain greater insights and a competitive advantage.

For more on this topic, come see the panel presentation and round table discussion on Hadoop at Analytics 2014 in Las Vegas!

  • Panel discussion with SAS and Cloudera on Big Data and Hadoop: Moving beyond the hype to realize your analytics strategy with SAS® - Monday, October 20, 3:00-3:50 pm
  • Round Table discussion on Practical Considerations for SAS Analytics in a Hadoop Environment – Tuesday, October 21, 12:30-1:45 pm

The taxman cometh - for Amazon.com!

Do you order things online, to avoid paying sales tax? Those "good old days" might be coming to an end soon...

Here's a snapshot of my latest purchase from Amazon.com (a little something for the Talk Like a Pirate party I had on Sept 19):

pirate_rings

In the US, each of the 50 states handles sales taxes a little differently, especially when it comes to online purchases. In general, if an online retailer has a physical presence in your state (such as a store or warehouse), then that online retailer must charge you sales tax for your online purchases. And in my state (North Carolina), even if the online retailer does not charge you sales tax, the buyer is supposed to pay a use tax when they file taxes at the end of the year.
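
The rule just described can be reduced to a toy function (a deliberate simplification - real "nexus" rules vary by state, and the 4.75% rate below is just an example):

```python
def taxes_owed(price, rate, retailer_has_presence):
    """Simplified model of the rule above: a retailer with an in-state
    physical presence collects sales tax at checkout; otherwise the buyer
    owes an equivalent use tax when filing. Returns (sales_tax, use_tax)."""
    if retailer_has_presence:
        return price * rate, 0.0
    return 0.0, price * rate

# A $100 purchase at a 4.75% rate, from an out-of-state retailer:
print(taxes_owed(100.0, 0.0475, False))
```

Either way the same amount is nominally due - the question is just who is responsible for remitting it.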

As consumers have been buying more online, and less in local stores (for convenience, price, etc), the states have seen a decrease in sales tax revenue. Therefore many states are pressuring online retailers to collect sales taxes for the state - especially the large online retailers like Amazon.com.

Being an Amazon Prime customer myself, I wondered how many states currently force Amazon to collect sales taxes. I did a few searches, and found a nice detailed map in a Wall Street Journal article that showed what I was looking for. But their map was somewhat 'busy' (showing 4 different categories of taxation), and took a while for me to understand. Therefore I decided to create a simplified version using SAS.

In my SAS map, I made the states where Amazon.com has to collect sales tax red (and all the other states a light/subdued color). I also added a timestamp, which will become important as the list of states that do/don't charge tax will likely change in the future.

amazon_sales_tax

Do you have to pay sales taxes for your online purchases? What are other countries doing? What's your suggestion on the best/most-equitable way to handle it?

Which cars get the most speeding tickets?

Is the type of car you drive more likely, or less likely, to get a speeding ticket? Let's analyze some data to find out!

Do red cars attract more attention from the police, and get more tickets? How about cars with a 'racing stripe'? Or cars with a big chromed motor, a blower, and side pipes (such as the one in the picture below that I took at a local car show)?  Zoom, zoom!

fast_motor

Of course, cars don't get speeding tickets - people do. But perhaps people who drive fast (and get lots of tickets) tend to drive certain types of cars? A recent CNN article used data for people who had gotten a quote from insurance.com, and listed the Top 20 cars where the highest percentage of people wanting to insure that car had a recent ticket.

I would imagine the insurance quote data is a little biased. For example, the people looking for an insurance quote might be more likely to have tickets than the general population (that might be why they're looking). But nonetheless, the data is 'interesting' so let's go with it!

The CNN article showed each of the top 20 cars on a separate page, which made it time-consuming to see all 20, and also made it difficult to compare them. Therefore I created a simple SAS bar chart to overcome those problems:

most_ticketed_cars

Seeing the 20 cars with the most tickets was interesting, but it made me curious about the 20 cars with the fewest tickets. Therefore I dug up that data, and created a similar bar chart for the fewest tickets. Note that I scaled it the same as the previous chart, so it would be easy to visually compare the two charts:

most_ticketed_cars1

Of course, while I was scraping around to find the data for the above charts, I also got the data for all the cars in between (over 500 different models in all). And with all that data, I had to try to visualize it all at once! I created a scatter plot, with the data grouped by make along the vertical axis (similar to the bar chart layout), and sorted the makes by their average number of tickets.

most_ticketed_cars2
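
The grouping step behind that scatter plot - order the makes by their average ticket rate - is straightforward data manipulation. Here's a minimal Python sketch with made-up numbers (the original analysis was done in SAS):

```python
from collections import defaultdict

# Hypothetical (make, pct_with_recent_ticket) records for individual models:
records = [
    ("Subaru", 18.1), ("Subaru", 20.8),
    ("Toyota", 12.3), ("Toyota", 14.9), ("Toyota", 10.2),
    ("Buick", 9.1),
]

# Collect each make's model-level percentages:
by_make = defaultdict(list)
for make, pct in records:
    by_make[make].append(pct)

# Sort makes by their mean ticket rate, highest first - this ordering
# becomes the vertical-axis layout of the scatter plot:
order = sorted(by_make, key=lambda m: sum(by_make[m]) / len(by_make[m]), reverse=True)
print(order)
```

With 500+ models this is exactly the kind of group-and-sort that a SQL step or SAS PROC MEANS would do before plotting.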

 You can click any of the graphs above to see the interactive version, with html hover-text, and drilldowns that do a Google search for images of that vehicle!

Did any of the cars in the best and worst 20 surprise you? Do you own one of those cars, and can you confirm whether or not you have speeding ticket(s)? What other factors do you think influence your probability of getting a speeding ticket? Do you have any 'tricks' for not getting speeding tickets, that you'd like to share?

Analysis of credit scores, and automobile loans

Have you heard the old saying that "Banks only loan money to people who don't need it"?  Let's analyze the data and see if that is true!...

I'm very much a car-guy, and I love learning about all the new vehicles, and love the new-car feel ... and even the smell.  It's hard to not like a nicely detailed sporty vehicle. For example, here's a picture of the Miata a co-worker (and fellow car enthusiast) recently bought. Looks really nice sitting there on the Blue Ridge Parkway, doesn't it!

jims_miata2

 ... and with the price of vehicles these days, most people need a loan to buy one. Speaking of car loans, I recently saw a very interesting article by Liberty Street Economics showing how many dollars in car loans were originated, grouped by credit score. I found the raw data, downloaded it, and created my own SAS version of the graph. I kept mine very similar to their original, but cleaned up the time axis a little (only showing the year at each tick mark), stacked the color legend values, and included markers on the lines (which I think provides a little more visual insight into how fast the data is changing).

auto_loan_originations

As you can see in the graph, subprime lending (to people with lower credit scores) took the biggest hit during the recent recession, but is currently making a comeback.

Later in the article, they show the same graph split into 2 categories: auto finance companies, and banks & credit unions. The auto finance companies tend to cater to subprime lending more than the banks & credit unions do. Rather than scaling them both to the same axis as the first graph ($30 billion), I let each auto-scale to show the spread of the data in my SAS versions.

auto_loan_originations1

auto_loan_originations2

And, I guess in answer to the original question, it appears that banks do loan money to people who need it (i.e., people who have low credit scores) - close to $6 billion this year. But they loan a lot more money to people with higher credit scores.

Anybody got any inside-insight into this data, or ideas about other ways to graph this data? - Feel free to share it in a comment!

Help! Why does the WHERE clause choke on the INPUT function?

A student brought in this coding problem after her manager had been struggling with it for a while. They played guessing games, but to no avail. Here's what happened when they submitted DATA step and PROC SQL code using a WHERE clause with an INPUT function:

 data aileen;
length hcn $10.;
input prov $ hcn $;
datalines;
BC 9999999698 
AB 612345800 
99 1 
CA V79999915 
QC NIGS999996 
ON 0 
ON 9876543210 
;
run;
 
 
NOTE: The data set WORK.AILEEN has 7 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.09 seconds
 
 
data dswarn;
set aileen;
where input(hcn,10.)>=1000000000; 
run;
 
WARNING: INPUT function reported 'WARNING: Illegal first argument to function' while processing
         WHERE clause.
WARNING: INPUT function reported 'WARNING: Illegal first argument to function' while processing
         WHERE clause.
NOTE: There were 2 observations read from the data set WORK.AILEEN.
      WHERE INPUT(hcn, 10.)>=1000000000;
NOTE: The data set WORK.DSWARN has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
 
 
proc sql;
create table sqlwarn as
select * from aileen
where input(hcn,10.)>=1000000000;
quit;
 
WARNING: INPUT function reported 'WARNING: Illegal first argument to function' while processing
         WHERE clause.
WARNING: INPUT function reported 'WARNING: Illegal first argument to function' while processing
         WHERE clause.
NOTE: Table WORK.SQLWARN created, with 2 rows and 2 columns.
 
172  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Their take:

The data step and sql procedure both generated this warning twice. (I think it is the 2 HCNs with leading characters that generated the warnings)

The Solution:

They are on the right track. The WHERE clause generates a warning whenever an invalid value is passed through the INPUT function. Two records contain leading character data, and when the INPUT function tries to convert them to numeric values, the conversion fails - hence the two warnings. The fix is the "?" informat modifier on the INPUT function. With it, SAS suppresses the messages about the failed conversions, resulting in a clean log. The output data set is the same either way: the two values with leading characters convert to missing, so the WHERE condition excludes those observations. The big takeaway? The "?" modifier doesn't change the result - it just ensures a clean log. (The related "??" modifier additionally prevents the automatic variable _ERROR_ from being set to 1.)

data dswarn;
set aileen;
where input(hcn,?10.)>=1000000000; 
run;
 
 
NOTE: There were 2 observations read from the data set WORK.AILEEN.
      WHERE INPUT(hcn, 10., '?')>=1000000000;
NOTE: The data set WORK.DSWARN has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.06 seconds
 
 
proc sql;
create table sqlwarn as
select * from aileen
where input(hcn,?10.)>=1000000000;
quit;
 
NOTE: Table WORK.SQLWARN created, with 2 rows and 2 columns.
 
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Resources:

I’d love to claim I came up with the solution. However, it was the encyclopedic support.sas.com to the rescue. Here’s where I learned about this error and how to fix it.

Isn’t that pretty amazing? I’m always pleasantly surprised by how much knowledge is available on support.sas.com and it’s all free!!


Just say no (not only) to OLS

Zubin Dowlaty

This guest post was written by Zubin Dowlaty, who has 20+ years' experience in the business intelligence and analytics space. At Mu Sigma, he works closely with Fortune 500 companies, counseling them on how to institutionalize data-driven decision-making. Zubin focuses his efforts on rapidly bringing innovative analytics technology and statistical techniques into the Mu Sigma ecosystem.

There is an old adage that says "don't put all your eggs in one basket," and for those who like financial advice, the only free lunch is diversification. Both sayings emphasize reducing risk by changing behavior. Minimizing risk when interpreting and generalizing analytical models is not new - methods such as bootstrapping are used to reduce the risk of over-fitting. However, the idea of diversification tends to be underutilized in the analytics workflow.

In the big data space, we are witnessing a trend toward NoSQL technologies, with multiple tools and frameworks at our disposal for accessing data. For the data analyst, prepping data for modeling consumes a tremendous amount of time - anecdotal estimates usually put 60-80% of a data scientist's time toward preparing the model-ready data set.

Why, then, once the data prep is complete, do most analysts estimate an OLS regression model and stop? Or, at best, select only one modeling technique, then interpret, refine the model, and present results? In corporate America, at least, there is a clear bias toward running only one technique. This one-model bias clearly goes against the spirit of minimizing risk through a portfolio approach. Ensembles - the technical term for combining multiple models - should be the default, not the exception.

Design thinking and design principles are beginning to be taught in major graduate business schools. One of the major principles of design thinking is prototyping and ideation. The ensemble approach is also a natural 'po' - a provocation, in design-thinking terms - applied to the champion model. Design concepts align with the ensemble approach, especially for measurement and forecasting use cases: one should explore and provoke the 'champion' model.

Let’s say you have selected an OLS model to be your champion. One should challenge this model by running a portfolio, or ensemble of methods to improve insight and generalization. Given the various assumptions around robustness, functional form, structural change within our data, it can be very risky to estimate one model. The idea is not to run various models like in the traditional ensemble sense, but rather to aggregate the various models in some form to get a better predictor. Again, it would be ideal to run an ensemble, leveraging the advantages of robustness, variable importance, and functional form in other techniques, in order to improve your dominant champion model, not replace it.

With today’s computational resources, the marginal time and cost of running many models are near zero – there is no longer any excuse. The bottom line, it’s a mindset change.

Join me in a discussion of these ideas at the Analytics 2014 conference.  We will review over a dozen models used in a business use case, in order to harden the champion model.  We will demonstrate how the ensemble approach to improving your champion model can significantly improve interpretability as well as trust in your model outcomes.


IMDb ratings for "The Big Bang Theory"

In my previous blog, we visualized how many people viewed each episode of The Big Bang Theory TV series. Now let's analyze how well people liked each episode...

As a starting point, I looked around to see if anyone else already had a plot of the episode ratings. I found a plot, but it didn't give me as much information as I wanted. So I downloaded the IMDb data, imported it into SAS, and started working on a new/improved version of the plot.

I decided to create 2 plots - one with the axis going from 0-10 (to see the "big picture"), and then a second plot 'zoomed-in' to better see the more subtle changes in ratings. Below are snapshots of my two graphs - click them to see the full-size interactive versions. My interactive graphs have html hover-text and drill down for each marker - I think this adds a huge amount of value to the graph, over the original graph!

big_bang_theory_ratings

big_bang_theory_ratings1

So although the number of viewers has gone up by several million people (as shown in the graph in my previous blog), the ratings have generally been going down slightly. What's your theory on the reason for that, and what do you think about the few outlier episodes that were higher/lower than the rest?
