More on analysis of means and model building

I had the privilege of participating in JMP’s Analytically Speaking series a couple of weeks ago (June 8, 2016). While I was able to answer many questions submitted during the live broadcast, there were additional questions that are answered in this blog post. In addition, look for future blog posts with more details on analysis of means (ANOM).

Anne Milley and I prepare for the live webcast.

Q. Your book on ANOM is mainly based on SAS; do you intend to provide something similar based on JMP?

A. Our book, The Analysis of Means: A Graphical Method for Comparing Means, Rates, and Proportions, was written prior to the implementation of ANOM in JMP. It does include an appendix with some SAS code, but otherwise the material is not software-dependent. All of the examples could now easily be reproduced in JMP (the figures would just look like JMP figures rather than SAS figures). We have no plans for an updated volume.

Q. Where do I find the Analysis of Means menu in JMP?

A. Beginning with JMP 9, the ANOM menu is found in the Fit Y by X platform. When X is categorical and Y is continuous, there is an Analysis of Means menu under the red triangle. When both X and Y are categorical, there is an Analysis of Means for Proportions option under the red triangle.
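
For script users, the corresponding launches in JSL look roughly like this. It is a minimal sketch with hypothetical column names; the ANOM items can then be chosen from each report's red triangle, or sent as messages whose exact names are listed in the Scripting Index.

// Continuous Y, categorical X: Fit Y by X opens the Oneway platform,
// whose red triangle holds the Analysis of Means menu.
ow = Oneway( Y( :Measurement ), X( :Group ) );

// Categorical Y and X: Fit Y by X opens the Contingency platform,
// whose red triangle holds Analysis of Means for Proportions.
ct = Contingency( Y( :Outcome ), X( :Group ) );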

Q. What hardware do you use for your analysis? Specifically, clustering big data sets?

A. I work on a MacBook Pro with 16 GB of RAM. I tend to work with smaller rather than bigger data sets, so I rarely have an issue with computing speed. If I run a simulation that is computationally intensive, then I might let it run while I take a break.

Q. How can analysis of means or JMP be used for analysis of financials in support of audit work?

A. The analysis of means can be used in many different instances. The key is that you have a number of groups and you want to compare the group means, rates or proportions to the overall mean, rate or proportion. In addition, there are ANOM-type procedures to compare the variability across groups (i.e., to test for homogeneity of variance). In your case, perhaps you would use ANOM if you are auditing multiple departments or similar departments across a large company and you are interested in knowing if any one department had a different mean value of some measure or a different rate of some item you are auditing.

Q. How do JMP and model building fit into tools that enable machine learning?

A. Machine learning is one type of model building. JMP has many different modeling capabilities, and JMP Pro has even more. Recently, I have found the Generalized Regression platform (JMP Pro) to be extremely powerful in my model building. I work in areas where it is important not only to build a model for prediction, but also to be able to interpret the factors and their contribution to the model.

If you’re interested in learning more about some of these topics, you can see my full conversation with Anne Milley.

Empowering AP Statistics teachers: Free JMP workshops this summer

For the fifth consecutive year, JMP is sponsoring AP Statistics teacher workshops during the summer break. These workshops are designed for those who want to employ data analysis software in their AP Stat course and who have taught the course at least twice.

Two sessions are offered this summer, at SAS offices in Irvine, CA, July 11-12 and in Arlington, VA, August 8-9. Limited seats are still available. The sessions are paid for by SAS and complimentary to qualified teachers on a first-come, first-served basis.

Sessions are led by AP Statistics leaders and textbook authors Daren Starnes and Chris Olsen, with instructional support by our very own Mia Stephens. The workshop will highlight the benefits of using software and strategies for teaching modern data-driven pedagogy. Attendees will get hands-on training with JMP Student Edition, along with lesson plans, course materials and a personal copy of JMP Student Edition software for Windows or Mac.

Past attendees of this workshop have found the practical, hands-on approach especially helpful in transitioning from calculators to computers. And if you know of a local AP Stat teacher who might be interested, please share this post with him or her.

Want more information? You can get all the details and request to attend via the workshop page. If you have any questions, please contact Mia Stephens at mia.stephens@jmp.com.

What does a winning thoroughbred horse look like?

In a previous post, I wrote how pedigree might be used to help predict outcomes of horse races. In particular, I discussed a metric called the Dosage Index (DI), which appeared to be a leading indicator of success (at least historically). In this post, I want to introduce the Center of Distribution (CD) as a metric that can help us predict a horse’s potential regarding speed and distance.

Specifically, with the Belmont Stakes set to be run on Saturday, I want to combine DI and CD to analyze what some horse racing fans refer to as the concept of Dual Qualifiers. Historically speaking, there is a general rule of thumb that if a horse has a DI below 4.0 and a CD below 1.0, then it has an advantage; the horse qualifies as a favorite relative to both metrics, making it a Dual Qualifier. However, I wonder if we might use analytics to improve on this historical rule of thumb.

Explanation of Key Metrics

“Dosage,” as I explained in my earlier post, is a system that attempts to explain the portion of a horse’s potential that might be due to its pedigree. It is described in terms of the Dosage Profile, Dosage Index and Center of Distribution. In this discussion, I am most interested in the following terms:

Dosage Index (DI). A horse with a high DI has been bred for speed, and since this is racing, that seems important! But these high-stakes races are relatively long, and the Belmont Stakes is the longest of the Triple Crown races. Owners, trainers and fans must also consider the stamina necessary to run 12 furlongs (or 1.5 miles).

Center of Distribution (CD). The CD ranges between -2 and +2 and represents the balance of speed and stamina. Thoroughbred racehorses are bred with distance in mind, and CD points to that distance.

We can use the historical rule of thumb (regarding DI and CD) to quantify a horse’s ability to “run fast,” and also to have enough stamina to “run long.” But can we improve it?

I looked at Belmont Stakes races from 2005 to 2015 and analyzed these metrics, seeking to develop a profile that would enable us to understand with greater clarity what a winning horse looks like.

Developing the Profile

Using the Clustering platform in JMP, I analyzed DI and CD with the K-Means method at its default settings. Three clusters were returned with the following parameters:

[Figure: K-Means cluster summary for the three clusters]
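
A launch along these lines could reproduce that step. It is a minimal JSL sketch, assuming a table that already has DI and CD columns; the exact launch arguments and the name of the Save Cluster Formula item may differ slightly by JMP version (check the Scripting Index).

// Cluster the two pedigree metrics into three groups
// (the post used the default K-Means settings).
dt = Current Data Table();
km = dt << K Means Cluster(
   Y( :DI, :CD ),
   Number of Clusters( 3 )
);

// Save the cluster formula back to the table so new horses
// can be scored later (see the Conclusion below).
km << Save Cluster Formula;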

With the new clusters identified, I plotted all the data as a Scatterplot Matrix (an option within the K-Means analysis platform).

[Figure: Scatterplot matrix of DI and CD, colored by cluster]

Next, I wondered how horses performed relative to their clusters. I used a Local Data Filter to isolate those horses finishing “in the money” and found something interesting: 21 of 34 horses were assigned to Cluster 2 (which is appropriately colored green).

[Figure: Scatterplot matrix filtered to horses finishing in the money]

Then, I took the analysis a step further. To which clusters were the race winners assigned? Another Local Data Filter revealed that in the previous 11 races, nine of the winners were assigned to Cluster 2.

[Figure: Scatterplot matrix filtered to race winners]

My next question was: Relative to DI and CD, what parameter ranges define the true profile of a Cluster 2 horse? The Distribution platform in JMP provided extensive insight.

[Figures: Distributions of DI and CD for Cluster 2 horses]

The Distribution analysis revealed that this cluster ranges from 1.8 to 3.62 in DI and from 0.5 to 0.92 in CD. Summary statistics offered greater precision for each metric.
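
That step can be scripted as well. Here is a minimal sketch, assuming the cluster assignments were saved to a Cluster column; the Local Data Filter mirrors the approach used on the scatterplots above.

// Distributions of DI and CD, restricted to the Cluster 2 horses.
dist = Distribution( Column( :DI ), Column( :CD ) );
dist << Local Data Filter(
   Add Filter( Columns( :Cluster ), Where( :Cluster == 2 ) )
);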

Conclusion

Developing a profile of horses is a fun exercise, but I have found similar approaches to be of great value for human behaviors, whether in marketing, risk, healthcare, education, crime or athletics. Quantifying your subject of interest is a first step in my preferred approach to analytics; it creates the opportunity for truly targeted modeling and thus better results.

The Clustering platform in JMP allows you to save the formula used to create your clusters, so you can readily apply it to the wider population or your new customers. In the case of the Belmont Stakes, I used my research on the previous 11 races to determine my set of Cluster 2 horses for the 2016 race.

[Figure: Saved cluster formula column]

So thanks to this analysis, I know what my Belmont Stakes horses look like for 2016. Do you know what your customers, patients and students look like?

Complicated stuff in simple words from Randall Munroe

In a world filled with jargon, it’s refreshing to hear from a subject-matter expert who can communicate in a direct and uncomplicated fashion – so that even a layperson would understand.

You could say this is Randall Munroe’s mission. Munroe is masterful at using math, science and comics to make a point. His website, xkcd, showcases stick figure comics with themes in computer science, technology, mathematics, science, philosophy, language, pop culture and romance.

And in his latest book, Thing Explainer, he uses the 1,000 (or, rather, ten hundred) most common words in the English language to explain concepts like how smartphones work, the periodic table and nuclear reactors. As the book’s subtitle suggests, it's about complicated stuff in simple words.

In fact, in the middle of writing this blog post, I ordered a copy on Amazon for my kids. They are preschoolers who always ask questions that I think I should know the answer to, but don’t. This book might be my go-to for the next 10 or 15 years!

Bill Gates calls it “a brilliant concept” because “if you can’t explain something simply, you don’t really understand it.” Houghton Mifflin recently announced that a selection of his drawings and explanations will be included in new editions of its high school-level chemistry, biology and physics textbooks.

Munroe is a former NASA roboticist who, on a typical day, puzzles over absurd hypothetical questions about science, many of which come in from fans of his blog What If? For example, he spent several days trying to answer this question: “If all digital data in the world were stored on punch cards, how big would Google's data warehouse be?” If you’ve ever watched his TED talk, you know how that one ended.

How does he get to an answer?

Much like a statistician or data analyst, he uses what he knows to model for things that he doesn’t know. “I love calculating these kinds of things, and it's not that I love doing the math,” Munroe says in his TED talk. “I do a lot of math, but I don't really like math for its own sake. What I love is that it lets you take some things that you know, and just by moving symbols around on a piece of paper, find out something that you didn't know that's very surprising. And I have a lot of stupid questions, and I love that math gives the power to answer them sometimes.”

Munroe will join us as the closing keynote speaker for Discovery Summit, Sept. 19-23 at SAS. You won’t want to miss it!

“Bolder” statistics with Karen Copeland

Karen Copeland, Ph.D., owner and sole employee of Boulder Statistics, a statistical consultancy, will be the guest on Analytically Speaking on June 8.

Dr. Karen Copeland will be our featured guest on Analytically Speaking on June 8. She is the owner of Boulder Statistics, a successful consultancy to a wide array of industry sectors around the world — medical device, diagnostics, chemicals, marketing, environmental, consumer and food products, pharmaceuticals, and web analytics, among them. When Karen named her company, she may not have intended it to be a play on words, but I think it’s fitting. She has made some bold steps in her career.

She works with scientists and engineers and enjoys the diverse projects she is given, as well as the challenge of learning new methods to be an effective problem-solver. With more than 20 years of applying statistical methods, and experience in both academia and industry before starting her consultancy, she has some interesting stories and experiences to share. She also co-authored the books The Analysis of Means: A Graphical Method for Comparing Means, Rates, and Proportions and Introductory Statistics for Engineering Experimentation, as well as a number of journal articles.

You may recognize Karen from her popular posts to the JMP Blog over the last few years. Check out her most recent posts on model visualization.

Karen has a great deal of technical expertise — on such topics as analysis of means, experimental design and data visualization — but she also has other important skills contributing to her success, like effective communication and the ability to see the big picture to know which questions to ask and identify the best path forward. We'll cover those subjects in our live interview.

We hope you will join us June 8. If you can’t join the live webcast, you can always catch the archive, which is usually available by the following day.

Remaking a mosquito trends chart

Recreating graphs is a hobby of mine. It both helps me test the limits of JMP and sharpens my own data handling and visualization skills. This time, there was a third benefit: finding a significant data error in the published chart.

I recently saw this interesting mosquito trends chart as part of an article, “When the mosquitoes will be biting in your state,” on the Washington Post’s Wonk Blog. It shows Google search trends for the word “mosquito” by state, with each state on a different scale.

[Figure: Washington Post chart of mosquito search interest by state]

It’s not a typical analytical graph, but I thought the layout would be a good test of Graph Builder’s small multiples grouping, and I was intrigued by the overall lack of geographic pattern. For instance, the article mentioned that Tennessee is so unlike its neighbors, and the same can be said for other states. The main point of using a map is to show geographic patterns, but the connections are pretty weak here.

Getting the Data

The article nicely includes a link to the Google Trends page for mosquito trends for the whole United States, and that page nicely has a Download as CSV menu item.

However, that’s where the niceness ends for getting the data. Each state trend is on a separate page, there’s not a separate URL for the download, and the CSV file is not really a pure CSV file. At least the state URLs followed a pattern, so I wrote a script to open each state in a separate tab in a web browser. Then I had to manually click the Download as CSV menu item for each tab. Each “CSV” followed a regular pattern, which included some descriptive text and multiple embedded tables. An example snippet:

Web Search interest: Mosquito
Wisconsin (United States) 2014-2015

Interest over time
Week,Mosquito
2014-01-05 - 2014-01-11,4
2014-01-12 - 2014-01-18,3
2014-01-19 - 2014-01-25,4
...
2015-12-27 - 2016-01-02,4

Top metros for Mosquito
Metro,Mosquito
Duluth MN-Superior WI,100
Wausau-Rhinelander WI,85

Fortunately, JMP’s text import wizard let me tell it how many lines to skip and how many to read. After doing it once in the wizard, I looked at the generated script and was able to put it in a loop to read the other files the same way:
Open( "report (" || Char(i) || ").csv",
   Columns( Column( "Week", Character), Column( "Mosquito") ),
   Import Settings( Column Names Start( 5 ),
      Data Starts( 6 ), Lines To Read( 104 ) )
);

After doing all that and splitting the first field into separate start and end dates, I could concatenate all the tables into one big table containing all the state data: two years of weekly values for each state.
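
The date-splitting step might look something like this in JSL. It is a small sketch; the new column names are my own, and the Word() positions assume the "start - end" layout shown in the snippet above.

// Split "2014-01-05 - 2014-01-11" into its start and end dates.
dt = Current Data Table();
dt << New Column( "Week Start", Character, Formula( Word( 1, :Week, " " ) ) );
dt << New Column( "Week End", Character, Formula( Word( 3, :Week, " " ) ) );
// The per-state tables can then be combined into one big table
// with the data table Concatenate command.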

First Look at the Data

We can see all the data fairly well in this small multiples line chart overlaid by year.

[Figure: Small multiples of weekly mosquito search interest by state, overlaid by year]

We don’t have the geographic arrangement, but we can already see enough to compare against the original, and two features stood out to me:

  1. Sometimes the years were quite different within a state (especially Arizona and Hawaii).
  2. Tennessee does not have the spikiness of the original.

The first observation suggests we probably need more years of data to establish what the article calls a “typical” year, and I haven’t pursued that yet. The second item was more of a mystery. I initially thought it was an artifact of the binning, since the original chart shows data by month instead of by week, but that didn’t hold up as I looked at the data more closely. I double-checked my graph against the Google Trends page for Tennessee, and they agreed. I continued on, hoping to discover the source of the discrepancy.

Arranging the States

The above chart uses the Group Wrap role in Graph Builder to lay out the states in a grid alphabetically. For complete control, I assigned each state a row and column value and used those values in the Group Y and Group X roles.

[Figure: Graph Builder small multiples with states arranged in a custom grid]
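
The corresponding Graph Builder launch looks roughly like this. It is a sketch with assumed column names: Row and Col hold my hand-assigned grid positions, and Year is derived from the week start date.

// Small multiples laid out by explicit grid position rather than alphabetically.
Graph Builder(
   Variables(
      X( :Week Start ), Y( :Mosquito ),
      Group X( :Col ), Group Y( :Row ),
      Overlay( :Year )
   ),
   Elements( Line( X, Y, Legend( 1 ) ) )
);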

In an effort to approximate the original chart, I switched from the overlaid lines to smoothed area charts. (I didn’t do bars because I still haven’t tried converting weeks to months – or maybe there's a way to get monthly data from Google Trends.) It was enough to notice that some of my states looked like other states in the original. The original Tennessee and Ohio look a lot like my South Dakota and North Dakota; the original Pennsylvania looks a lot like my Oregon. Luckily, a pattern occurred to me. The original chart’s states are off by one alphabetically!

Almost, anyway. After further study, I realized only the states after District of Columbia were off by one. Our charts agreed on the other states. Coincidentally, I also had a similar error in my data where I had downloaded the DC data twice, and my initial charts were off by one before DC. Weird. After some Twitter messages, the author confirmed my findings and quickly updated the Wonk Blog post graph and commentary.

Scaling the Data

In the data from Google Trends, each state’s interest levels are scaled so the maximum value is 100. That means the magnitudes are not comparable from state to state as you would expect with a small multiples chart. Just the patterns (e.g., spikiness) can be compared in this case.

How could I convert the state data to be on the same scale? The Google Trends page for the whole US includes a list of summary levels for each state. If those summaries represent each state’s average value, we can make the adjustment with a scale factor. Here’s the result with all the states on a common scale.

[Figure: Smoothed area charts with all states on a common scale]
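
For reference, the rescaling can be written as a single column formula. This is a hedged sketch with hypothetical column names: State Summary holds the summary level from the US-level Trends page, and State Mean holds the average of that state's own 0-100 series.

// Rescale each state's 0-100 series so that its average matches the
// state's summary level, putting all states on a common scale.
dt = Current Data Table();
dt << New Column( "Mosquito Scaled", Numeric, Continuous,
   Formula( :Mosquito * :State Summary / :State Mean )
);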

Notice I also used a different state layout of my own design, trying to give my home state of North Carolina a truer positioning. Within the constraints of equal-sized rectangles, it's impossible to preserve all the geographic properties. The layout in the original chart is from Jon Schwabish at PolicyViz and has the nice property that the overall shape resembles the overall shape of the country.

Given the thinness of many of the curves, I added a background color based on the state's average value, which, besides allowing a big-picture view of the yearly pattern, helps anchor the areas in their cells and makes the labeling clear.

[Figure: Area charts with cell background color showing each state's average level]

Insights Gained

Though the Wonk Blog article makes it clear that the Google Trends data may not be representative of any real-world trends, by redoing the analysis I did gain a few insights I couldn't get from the original article and graph:

  • Some states have significant year-to-year variation.
  • Monthly versus weekly aggregation may be an issue.
  • The state data was normalized by the maximum value.
  • The Dakota spikes are really high (if my re-scaling is right).

I realize it’s not always practical to do our own remakes of graphs and analyses, but it’s a great way to really understand the nuances of the data.

AMA Advanced Research Techniques (ART) early registration ends Thursday!

The co-chairs of last year's AMA ART Forum kick off the last day of the 2015 conference. Join us this year in Boston!

I’ve posted here in the JMP Blog about the American Marketing Association’s Advanced Research Techniques (ART) Forum and the impressive work that’s presented there every year. As co-chair, I am doubly excited for this year’s conference, which will take place June 26 – 29 in Boston, MA.

We had an amazing group of paper submissions that we used to build the program schedule. Take a look. You'll see sessions on such topics as Social Networking, Market Segmentation, and analyzing Large and Unstructured data. I’ll be chairing one of two sessions on Choice Modeling, and my conference co-chair, Rex Du, will be chairing a session on Marketing Metrics. We’ll also have a full slate of tutorials to choose from on the Sunday before the official conference begins. Among the tutorials are Introduction to Discrete Choice Experiments, Probability Models for Customer-Base Analysis, and Leveraging Online Search Trends in Marketing Research.

One of the things I love about ART Forum is the interaction between academic researchers and marketing research professionals. The conference sessions and format are designed to encourage discussion. In every session, academics present alongside experts who are working in industry. Post-presentation discussions from market leaders show how to work through problems that might happen when implementing a new technique, as well as analyze the gains from doing so. This year, our presenters have made a special effort to make sure white papers and how-to guides are part of the conference take-home materials.

Between the tutorials for basic training and the cutting-edge research techniques on display at the main conference, ART Forum is a great training opportunity for people who are new to marketing research as well as for experienced professionals.

Early bird registration ends this Thursday, so register soon. I hope to see you in Boston next month!

When should programming come into play in statistics courses?

Does exposure to coding in statistics courses dampen students' enthusiasm — for both programming and statistics?

Both academically and professionally, more courses are being offered and developed to make more people comfortable with data, analysis and risk assessment. This necessitates some use of statistics, and software is pretty much a tool of the trade. Software — some new, some enhanced, some commercial and some open source — is increasingly available to broader audiences and is ever-changing.

For the quantitative courses I took in college, I had to learn some coding languages to use SAS, SPSS and SHAZAM. I was not a fan of learning JCL and other programming languages initially and found learning the syntax of the languages an impediment to understanding statistical concepts.

On the positive side, even my limited coding skills later proved useful for my career, but for many of my classmates, the exposure to coding dampened their enthusiasm — for both programming and statistics. Once I was exposed to the highly visual and interactive experience that JMP provides in data exploration and analysis, I wondered whether I would have understood statistical concepts more quickly and whether fellow classmates would have had greater enthusiasm for statistics had we used JMP.

More intro stats courses are being offered as MOOCs. Many universities are evolving their curricula to include business analytics and other courses to appeal more broadly to engage more people in statistical thinking. Professionally, more basic data analysis courses are being offered as well. In light of all this, it’s interesting to see which software is used: spreadsheets, interactive visual software like JMP, some SAS interfaces, interfaces to R, Minitab, etc., as well as language-based approaches like R, SAS, Python and others.

What factors affect which software is used in courses?

I wonder if I would have understood statistical concepts more quickly if I had had access to JMP in college.

Having written a blog post about teaching statistics with JMP and continuing to engage with academics on how they teach statistical concepts, I’m curious about the motivating factors in choosing software for use by students with such varied levels of numeracy. Often, cost is the driving factor. Open source software is freely available. Excel is so ubiquitous that it is essentially perceived as free (but many recognize the limitations of spreadsheets).

Another motivating factor of some intro-level courses may be to leave the students with more marketable skills, and knowing a popular programming language is certainly such a skill (in addition to knowing about data analysis, of course).

Yet another consideration could be inertia: the software that is already installed, what has been used before and what the instructor already knows.

Teaching how to think statistically

But beyond these factors, many instructors truly want to engage more students to see and feel the power of data, to experience what it is to “think statistically.” They recognize that many people will appreciate and benefit from understanding statistical concepts, but may never go on to learn any programming languages. They may be capable of statistical thinking without knowing how to program. Obvious examples would be doctors and judges, whose recommendations and decisions can powerfully affect people's lives.

I recently finished reading Risk Savvy: How to Make Good Decisions by Gerd Gigerenzer. For many important decisions regarding our health, finances and more, he shares well-founded research on how we can better assess risk to make better decisions. For example, he has done a lot of work with doctors to better communicate probabilities to their patients (in short, he advises translating probabilities into natural frequencies). For more along these lines, David Spiegelhalter, who has done a great deal to educate the masses about understanding uncertainty and the many things to consider in presenting risk to decision-makers, has written a great blog post with interactive graphics on 2845 ways to spin the Risk.

Understanding risk is part of thinking statistically, an important skill in this data-rich era. There is considerable evidence that interactive data visualization plays an important role in attracting the broadest audience and giving more people a foundational understanding of important statistical concepts. Through dynamic and interactive graphs, learning becomes play.

Observations from statistics professors

Many professors and instructors offer compelling reasons for taking a visual path (and choosing JMP) as a means to introduce more people to statistical thinking. For example, here are a few excerpts from an interview last year with Christian Hildebrand, Assistant Professor of Marketing Analytics at the Geneva School of Economics and Management:

  • “[Students] said ‘Wow, I never knew that statistics could even be fun!’ That’s when I realized that the statistical software is not just a medium, it is an environment that can actually help in understanding statistical concepts better. JMP was a big amplifier for that.”
  • “With the software focusing so heavily on visualization, it’s much easier for you to really understand what is the issue in the data. It’s critical for students to understand their data better by interacting with the data in a software environment like JMP.”
  • "What students really loved about the software was that they had a very intuitive way of learning. This intuition is very important because statistics is very much cognitive, and you have to learn the basics. At the same time, it is very important to still be creative and to think about new hypotheses, and very often you learn that out of the data. The capabilities you have with JMP — with the rich visualization capabilities — those are key to understand statistical concepts better.”

Peter Goos, Full Professor at the University of Antwerp in the Department of Environment, Technology and Management, and David Meintrup, Professor of Mathematics and Statistics at the Ingolstadt University of Applied Sciences, co-authored Statistics with JMP: Graphs, Descriptive Statistics and Probability. In their preface, they say:

"We chose JMP as supporting software because it is powerful yet easy to use…. We believe that introductory courses in statistics and probability should use such software so that the enthusiasm of students is not nipped in the bud. Indeed, we find that, because of the way students can easily interact with JMP, it can actually spark enthusiasm for statistics and probability in class."

David Meintrup also recently shared this story: "I always end the first session on JMP with Graph Builder. The first time my students see how to interactively create a map of the unemployment rate in Europe over the years 2000-2015, they are blown away. I can see how their facial expression changes, and from that point on I don't need to worry about motivation anymore."

Iddo Gal, Senior Lecturer and past Chair of the Department of Human Services at the University of Haifa, and past President of the International Association for Statistical Education:

"In 2015, I attended the JMP workshop (three hours) in our IASE Satellite in Rio, and remember being particularly impressed with these tools, which far exceed options in other packages, and for me can help our participants see what is unique about it and also does not require strong formal/procedural skills. I also recall how the local (Brazilian) statisticians were taken by surprise — they said they work so hard to impart the technical [formulaic, statistical] underpinnings of multivariate stuff and running traditional analyses, and their students struggle with traditional outputs — yet within 15 minutes into the visualization portion of the JMP workshop, all of a sudden, they realized how their students can view things so much easier and understand and see what is coming out.”

In an interview earlier this year, Jason Brinkley, biostatistician and senior research methodologist at American Institutes for Research, discussed some of his experiences teaching with JMP, drawing on his 2014 Discovery Summit paper, Using JMP as a Catalyst for Teaching Data-Driven Decision Making to High School Students. Though the course targeted high school students who were gifted in math and science, Jason explained that this hands-on approach was well received, especially by the students who had not yet taken Advanced Placement Statistics. They could see and feel the power of data, and this piqued their interest. Jason said, “You could see the passion start to come up from the students, not necessarily about the research but about the data.”

What about you?

For those of you in the noble profession of teaching, how do you teach statistical concepts to a broad audience? Is some level of programming involved from the beginning, do you take a more visual approach, or do you give the students options to choose the tools they use?

For those of you who were/are students, how were you introduced to statistics? Did you have to learn a programming language first or did you learn via an interactive tool like JMP? If the former, do you think you would’ve understood the concepts more quickly if you’d had a more visual introduction? If the latter, did you later invest in learning a language (perhaps JSL?) anyway because it helped you do more with your data?

Thanks for your interest, and I look forward to hearing from you!

Does the pedigree of a thoroughbred racehorse still matter?

You may have heard that a horse named Nyquist won the Kentucky Derby recently. Nyquist was the favorite going into the race, though he was not without his doubters. Many expert race prognosticators questioned his stamina, and I was curious about the basis for those comments.

My due diligence revealed that breeders (and race handicappers) have a language of their own when speaking of these great thoroughbred horses. For example, in evaluating a horse’s potential, they speak of “dosage,” which at first glance appears to imply the use of performance-enhancing drugs. However, this is a term used (at least) since the earliest part of the 20th century (in France and England), and it refers to the horse’s pedigree.

Here is a brief explanation of commonly used terms:

Dosage. A system designed to predict the distance potential (stamina) of horses based on the sires in the first four generations of their pedigrees. Categories range from most speed/least stamina to least speed/most stamina, and points are assigned over this range to form the Dosage Profile.

Two key statistics can be generated from the Dosage Profile:

The Dosage Index (DI). DI is the ratio of points in the “speed wing” to points in the “stamina wing.” The average DI for race horses in North America is 2.4. Nyquist’s DI is 7, indicating that he has 7 times as much speed as stamina in his pedigree. Since 1940, there has only been one other horse (Strike the Gold, 1991) with a DI higher than 7!

Center of Distribution (CD). CD provides a point of reference relative to the horse’s pedigree with a metric ranging from +2.00 to -2.00. A positive CD indicates the horse has speed (in the pedigree), and a negative CD indicates the horse has displayed more stamina. The average racehorse in North America has a CD of 0.70. Nyquist’s CD is 1, indicating that he should have more speed than the average horse.
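
For concreteness, here is how those two statistics are commonly computed from the five Dosage Profile categories (Brilliant, Intermediate, Classic, Solid, Professional); the formulas are not spelled out in this post, and the profile below is a made-up example. DI = (Brilliant + Intermediate + ½Classic) / (½Classic + Solid + Professional), and CD = (2×Brilliant + Intermediate − Solid − 2×Professional) / total points. So a hypothetical profile of 10-4-8-2-0 gives DI = (10 + 4 + 4) / (4 + 2 + 0) = 3.0 and CD = (20 + 4 − 2 − 0) / 24 ≈ 0.92.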

The fundamental theory behind dosage is that the higher the DI and CD, the lower the distance potential of the horse; conversely, the lower the DI and CD, the greater the distance (stamina) potential of the horse.

The rule of thumb has been that a horse with a DI higher than 4.0 and a CD higher than 1.25 will not perform well and would not win the Triple Crown races like the Kentucky Derby, the Preakness or the Belmont Stakes. For 50 years, this rule of thumb appeared to be valid, as there were only two race winners with a DI of at least 4 from 1940 to 1990, but something changed the game in 1991.

[Figure: Dosage Index of Kentucky Derby winners since 1940]

In 1991, a horse named Strike the Gold kicked down the barn door of pedigree metrics and ushered in a new era in high-stakes horse racing as he won the Kentucky Derby with a DI of 9. I sought to analyze this change using control charts from JMP’s Quality and Process platform.

Control charts use the data to differentiate what is typical from what should be considered special. In creating my control chart, I realized that 1991 wasn’t the first time that such a change had occurred since 1940; it’s just the first one that emphasized that something was rendering metrics like DI and CD outdated.

In the chart below, note the change in the average values of DI in each time period. What’s causing these changes? New training methods? Something else?

[Figure: Control chart of Kentucky Derby winners' DI, showing shifts in the average]
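
A launch along these lines could produce such a chart. It is a minimal sketch, assuming a table of Derby winners with Year and DI columns; role and option names may differ slightly by JMP version.

// Control chart of the winners' DI by year; shifts in the center line
// correspond to the changes in average DI noted above.
Control Chart Builder(
   Variables( Subgroup( :Year ), Y( :DI ) )
);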

On to the Preakness!

Given what we have seen in Kentucky Derby winners since 1940, can we throw out the historical data in this post-1990 world of racing? Does the DI rule of thumb still hold true? And, does pedigree still matter in the world of horse racing?

Here’s a graph depicting more than 30,000 horse races since 1980, and it indicates that the rule of thumb still holds true: The longer the race, the lower the DI.

[Figure: DI versus race distance for more than 30,000 races since 1980]

The DI metric has seen some interesting activity in the Preakness as well, but since 1990 it has actually been trending lower. And since 1940, the average DI for the Preakness (a 9.5-furlong race) winner is 2.82, a number that falls within historical expectations (according to the chart above).

[Figure: DI of Preakness winners over time]

Conclusions

For more than 50 years, racehorses in the Kentucky Derby behaved as expected relative to their pedigree-related metrics. At least in the Kentucky Derby, something has apparently changed, and horses that previously would not have even been considered contenders have been winning! Might it be better training methods enabling speed horses to be better trained to run the distance? Might it be drugs? What do you think?

Regardless, the numbers indicated that Nyquist would not win the 10-furlong Kentucky Derby. On Saturday, he’ll run in the Preakness, a 9.5-furlong race. The numbers again say “No,” but this horse has a calmness about him that’s rare. We'll have to watch the race and see how it turns out!

Graph Builder tutorial materials

I'll be leading a pre-conference tutorial on Graph Builder at this year's Discovery Summit conference in Cary. We'll start with the basics and then walk through more advanced ways to create effective visualizations.

We did a similar tutorial this spring in Amsterdam, and materials from that course are posted in the JMP User Community. There you will find pictures and source materials (data tables and scripts) for recreating 100 different graphs, some simple, some advanced.

Take a look, and if you're interested, sign up for the live tutorial in September.
