Discovery Summit: Best-in-class analytics conference

Four times a year, we host Discovery Summit, where scientists, engineers, statisticians and researchers exchange best practices in data exploration, and learn about new and proven statistical techniques.

Past attendees have called the event a “best-in-class conference to benchmark best practices in analytics” with sessions that are “immediately relevant to daily work.”

Save your seat among them.  The conference is Sept. 19-23 at SAS world headquarters, and there’s no better time to learn from fellow JMP users and to grow your network of analytically minded people.

As always, you’ll have a chance to meet with developers in person, hear real-world case studies from power users and find inspiration in thought leader keynotes.

You can also add training courses or tutorials to your week. Training courses, led by SAS education instructors, combine lectures, software demonstrations, question-and-answer sessions and hands-on computer workshops for an interactive learning experience. Tutorials, led by JMP developers, are a rare opportunity for you to go in-depth on specific topics with the experts themselves.

And here’s the inside scoop: Sign up soon because the first 225 people to register will have the opportunity to attend the opening dinner held at the home of SAS co-founder and JMP chief architect John Sall.

 


The QbD Column: Split-plot experiments

Split-plot experiments are experiments with hard-to-change factors that are difficult to randomize and can only be applied at the block level. Once the level of a hard-to-change factor is set, we can run experiments with several other factors keeping that level fixed.

To illustrate the idea, we refer in this blog post to an example from a pre-clinical research QbD (Quality by Design) experiment. As mentioned in the first post in this series, QbD is about product, process and clinical understanding. Here, we focus on deriving clinical understanding by applying experimental design methods.

The experiment compared, on animal models, several methods for the treatment of severe chronic skin irritations[1]. Each treatment involved an orally administered antibiotic along with a cream that is applied topically to the affected site. There were two types of antibiotics, and the cream was tested at four different concentrations of the active ingredient and three timing strategies.

The experiment was run using four experimental animals, each of which had eight sites located on their backs from the neck down. Thus, the sites are “blocked” by animal. For each animal, we can randomly decide which sites should be treated with which concentration by timing option. The antibiotics are different. They are taken orally, so each animal could get just one antibiotic, and it would then apply to all the sites on that animal.

The analysis included a number of outcomes, and the most important were those that tracked the size of the irritated area over time, as a fraction of the initial size at that site. The primary CQA (Critical Quality Attribute) summarizing the improvement over time is the area under the curve[2] (AUC), and that is the response that we will analyze in this blog post. The AUC is an overall measure of the rate of healing, with low (high) values when healing is rapid (slow).
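As a rough illustration of how such a response could be computed, the sketch below applies the trapezoidal rule to the irritated area expressed as a fraction of its initial size, then divides by the largest AUC as described in footnote [2]. The trapezoidal rule, the function name `relative_area_auc` and the measurements are assumptions made for illustration; the post does not say how the AUC was computed, and these are not data from the study.

```python
def relative_area_auc(times, areas):
    """Trapezoidal AUC of the irritated area as a fraction of the initial area."""
    if len(times) != len(areas) or len(times) < 2:
        raise ValueError("need at least two matching time/area observations")
    baseline = areas[0]
    fractions = [a / baseline for a in areas]
    auc = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        auc += width * (fractions[i] + fractions[i - 1]) / 2.0
    return auc


# Hypothetical measurements (day, area in mm^2) for two sites; not study data.
days = [0, 2, 4, 6]
fast_site = [40.0, 20.0, 10.0, 0.0]
slow_site = [40.0, 36.0, 30.0, 24.0]

raw = {
    "fast": relative_area_auc(days, fast_site),
    "slow": relative_area_auc(days, slow_site),
}

# Normalize to [0, 1] by dividing by the largest AUC, as in footnote [2].
largest = max(raw.values())
normalized = {site: value / largest for site, value in raw.items()}

for site in raw:
    print(f"{site}: raw AUC = {raw[site]:.2f}, normalized = {normalized[site]:.2f}")
```

The site that heals quickly ends up with the smaller normalized AUC, which matches the interpretation above: low values indicate rapid healing.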

For more details on split-plot experiments, a great source is the introductory paper by Jones and Nachtsheim[3].

Why is split-plot structure important?

In the topical cream treatment study, the animals form experimental blocks. The basic reason for considering blocks in the data analysis is that we expect results from different sites on the same animal to be similar to one another, but different from those for sites on other animals. We take advantage of this property when we compare timing and concentration. Those comparisons are at the “within animal” level, which neutralizes the inter-animal variation and thus improves precision. For the antibiotics, the differences between animals will affect our comparison. The fact that we think of each animal as a block means that we do expect to see such differences. We need to take this into account both in designing the experiment and in analyzing the results.

What are whole plots and sub plots?

We use the term “whole plots” to refer to the block-level units and “sub plots” to refer to the units that are nested within each whole plot. In the example above, the animals serve as whole plots and the sites as sub plots. The terminology goes back to Sir R.A. Fisher, the pioneer of the statistical design of experiments.  Fisher worked at an agricultural research station in Rothamsted, UK, in the early 20th century. Typical experiments at this station involved comparing types of crops, planting times, and schedules of irrigation and fertilization.

Fisher observed that some of these factors could be applied only to large plots of land, whereas others could be applied at a much finer spatial resolution. So “whole plots” to Fisher were the large pieces of land, and “sub plots” were the small pieces that made up a whole plot. Some experiments have more than two such levels. Nowadays, we continue to use these terms, even though split-plotting affects many kinds of experiments, not just field trials in agriculture, and the “whole units” and “sub units” usually are not plots of land. In the QbD context, they consist of animal models, like in the example used here, batches of material or setup of production processes.

When does split-plotting occur?

There are many possible sources of split-plot structure in an experiment. Sometimes, as above, we have “repeat measurements,” but at different conditions, of the same experimental subject. Sometimes the experiment involves several factors that are difficult to set to varying experimental levels, such as a column or bioreactor temperature. In that case, it is common to set the hard-to-change factors to one level and leave them at that level for several consecutive observations, in which the other factors are varied. This leads to a split-plot experiment, with a new whole plot each time the hard-to-change factor(s) are set to new levels.

Sometimes a production process naturally leads to this sort of nesting. For example, consider an experiment to improve production of proteins for a biological drug. The process begins by growing cells in a medium; then the cells are transferred to wells in a plate where they produce protein. An experiment might include some factors that affect the growth phase and others that affect only protein production. Dividing the cells in a flask among several different wells makes it possible to test the production factors at a split-plot level.

How do I design a split-plot experiment?

The Custom Design Tool in JMP makes it easy to create a split-plot design. First, enter the factors in your experiment. The split-plot structure is specified using the column labeled “Changes” in the factor table. The possible entries there are “easy,” “hard” and “very hard,” corresponding to three levels of nesting among the factors. Factors that can be assigned at the individual observation level are declared as “easy” (the default setting). Factors that can be applied only to blocks of observations are declared as “hard.”

There may be a third level consisting of factors that can be applied only to “blocks of blocks,” and these factors are labeled as “very hard.” Figure 1 shows the factor table for our experiment. Antibiotic is the hard-to-change factor because it is applied to the animal, not the individual sites. Timing and concentration are numerical factors, but the company wished to get information for comparing all the levels under consideration, without the need to extrapolate by a regression model. So we decided to declare all the factors as categorical. Note that a concentration of 0 means applying a base cream with no addition of the compound being tested.

Figure 1: Factor definition for the experiment


The next step is to specify any special constraints. For example, some factor combinations may be impossible to test, or there might be some inequality constraints that limit the factor space. Then you need to declare which model terms you want to estimate, including main effects and interactions. In our experiment, the company wanted to estimate the main effects of the three factors. They wanted information on the two-factor interactions but did not consider it essential; however, the experiment is large enough to permit us to estimate all these terms. If there were fewer runs, we could indicate the “desired but not crucial” status by clicking on the estimability entry for these terms and choosing the “if possible” option. See Figure 2.

Figure 2: Model definition for the experiment


We are then asked to specify the number of whole plots, i.e., the number of animals available.  For our study, there were four animals. Finally, we need to specify the number of runs. The Custom Design tool recommends a default sample size, tells us the minimum possible size and allows us to specify a size. In our experiment, it was possible to stage eight sites on each animal, for a total of 32 runs.

Clicking on the “Make Design” button generates the design shown in Table 1. There are four whole plots with each antibiotic assigned to two of them. There are 12 combinations of timing and concentration, but only eight sites on each animal. So it is important to make an efficient choice of which of these treatment combinations will be assigned to each site. Moreover, there is no textbook solution for this allocation problem. This is a setting where the algorithmic approach in JMP is extremely helpful.

Table 1. The 32-run design for three factors in four whole plots of eight runs each. The factor “antibiotic” can only be applied at the “whole plot” level.


The design found by JMP uses the two-hour time scheme for 12 sites and each of the other schemes for 10 sites. Each concentration is used eight times. Each timing by concentration option is used either two or three times. (Note that we would need 36 runs to have equal repetition of the combinations, but the experiment has only 32 sites.) The design automatically includes a Whole Plots column – in our experiment, this tells us which animal is studied, so we changed the name of the column to "Animal."
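The run-count arithmetic in this paragraph can be checked with a few lines of Python. This is only a bookkeeping sketch: the labels for the timing strategies and concentrations are placeholders, and it does not reproduce the optimal-design algorithm that JMP uses to choose the allocation.

```python
from itertools import product

# Placeholder labels: three timing strategies and four concentrations (C0 = base cream).
timings = ["T1", "T2", "T3"]
concentrations = ["C0", "C1", "C2", "C3"]
animals, sites_per_animal = 4, 8

combos = list(product(timings, concentrations))   # all timing-by-concentration options
runs = animals * sites_per_animal                 # sites available in total

# With 32 runs spread over 12 combinations, replication cannot be equal.
base_reps, extra = divmod(runs, len(combos))

# Smallest run count at or above 32 that allows equal replication.
runs_for_balance = -(-runs // len(combos)) * len(combos)

print(f"{len(combos)} combinations, {runs} runs")
print(f"{extra} combinations get {base_reps + 1} runs, "
      f"{len(combos) - extra} get {base_reps} runs")
print(f"equal replication would need {runs_for_balance} runs")
```

The split into eight combinations with three runs and four combinations with two runs is consistent with the counts reported for the design: 12 + 10 + 10 sites across the timing strategies and eight sites per concentration.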

How does a split-plot experiment affect power?

The power analysis in Table 2 is instructive. We see that the power for the timing and concentration factors is much higher than for the antibiotic. The higher power is because we are able to compare levels of these factors “within animal,” thus removing any variability between animals. For the antibiotics, on the other hand, the comparison is affected by the variation between animals, so that the relevant sample size is actually four (the number of animals) and not 32 (the number of sites).
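To see why the antibiotic comparison is so much less precise, it may help to compare the variance of a difference between two means at each level of the design. The sketch below uses textbook formulas for an idealized balanced split-plot, with two animals per antibiotic and eight sites per animal, and it assumes each animal contributes equally to the two sub-plot levels being compared. The actual 32-run design is not perfectly balanced, so the numbers are indicative only. Variances are in units of the within-animal variance, and the between-to-within ratio of 1 is an assumed value.

```python
def whole_plot_diff_variance(var_between, var_within, animals_per_level, sites_per_animal):
    """Variance of the difference between two whole-plot (antibiotic) means."""
    per_mean = (var_between / animals_per_level
                + var_within / (animals_per_level * sites_per_animal))
    return 2.0 * per_mean


def sub_plot_diff_variance(var_within, runs_per_level):
    """Variance of the difference between two sub-plot level means when animal
    effects cancel (each animal contributes equally to both levels)."""
    return 2.0 * var_within / runs_per_level


var_within = 1.0        # work in units of the within-animal variance
variance_ratio = 1.0    # assumed between/within ratio
var_between = variance_ratio * var_within

antibiotic = whole_plot_diff_variance(var_between, var_within,
                                      animals_per_level=2, sites_per_animal=8)
concentration = sub_plot_diff_variance(var_within, runs_per_level=8)

print(f"antibiotic difference variance:    {antibiotic:.3f}")
print(f"concentration difference variance: {concentration:.3f}")
print(f"ratio: {antibiotic / concentration:.1f}")
```

Under these assumptions the antibiotic difference has 4.5 times the variance of a difference between two concentrations. On top of that, the whole-plot error is estimated from only four animals, which leaves two degrees of freedom after fitting the two antibiotic means.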

Table 2. Power analysis for the 32-run design.


It is important to realize that the power analysis must make some assumptions. These include the size of the factor effects (the Anticipated Coefficients) and the magnitude of the variances. The entry for RMSE in the table is for the site-to-site variation. There is also an assumption about the ratio of the “between animal” variation to the “within animal” variation. The default assumption is that they are roughly the same size, that is, a ratio of 1. If you thought that most of the variation was between animals, the default should be changed to a number greater than 1. To do so, click on the red triangle next to Custom Design, select the Advanced Options link, and then the Split-Plot Variance Ratio link.

How do I analyze a split-plot experiment?

The analysis follows the same general structure as for other designed experiments; see the earlier blog posts in this series. The major difference is that we need to add the factor Animal to the model as a “random effect.” It is this random effect term that tells JMP that the experiment is split-plot. Use the Fit Model link under Analyze. If you do this from the design table, the list of model terms will automatically include the random effect. If you access the data differently, then you will need to add Animal to the list of model effects and declare it as a random effect by highlighting the term and clicking on the “Attributes” triangle next to the list of model terms. The first option there is “Random Effects.”

What happened in the skin irritation experiment?

We analyze the results on AUC.  Effective treatment combinations will have low values of AUC. Table 3 shows tests assessing whether the factors have significant effects. There is a clear effect associated with concentration (p-value=0.002). The effect for timing has a p-value of 0.076, so there is an indication of an effect, but much weaker than for concentration. The F-statistic for comparing the two antibiotics is larger than the one for timing. However, it has a p-value of 0.078, close to the one for timing. The reason is that the antibiotic comparison is at the “whole plot” level and so has more uncertainty, and much lower power, than the comparisons of timing strategies and concentrations.

Table 3. Effect tests.


None of the interactions is strong. So concentration is clearly the dominant factor. Table 4 and Figure 3 summarize the estimated effect of concentration on AUC. There is a clear relationship, with higher concentration leading to lower AUC, hence faster healing.

Table 4. Estimated mean AUC for the 4 concentrations.


 

Figure 3. Plot of the estimated mean AUC by concentration


The analysis includes a random effect for “Animal,” reflecting the team’s belief that part of the variation is at the “inter-animal” level. Table 5 shows estimates of the “within animal” (residual) and the “between animal” variances.  The Var Component column lists the estimated variance components: 0.0029 at the “within animal” level and 0.0010 at the “between animal” level. The first column gives the between animal variance as a fraction of the “within animal” variance, estimated to be about 0.34 for our experiment.
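The ratio quoted above follows directly from the two variance components. The short calculation below reproduces it from the rounded values in the text and also expresses the between-animal component as a share of the total variance, a summary that the post itself does not report.

```python
# Variance components as reported in the text (rounded values).
var_within = 0.0029    # "within animal" (residual)
var_between = 0.0010   # "between animal"

# Between-animal variance as a fraction of the within-animal variance.
ratio = var_between / var_within

# Share of the total variance attributable to animals (intraclass correlation).
share_between = var_between / (var_between + var_within)

print(f"between/within ratio:   {ratio:.2f}")
print(f"between share of total: {share_between:.2f}")
```

Because the inputs are rounded, the computed ratio can differ slightly in the third decimal place from the value JMP shows.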

Table 5. The estimated variance components.


What are the take-home messages?

The topical cream study provided valuable information that the cream is more effective at higher concentrations. The use of multiple sites per animal permitted “within animal” comparisons of the concentrations and timing, so that the positive effect of increasing concentration could be discovered with a small number of animals. The “between animal” variation was only about 1/3 as large as the “within animal” variation. This was a surprise, as we had expected that there would be substantial inter-animal variation. Of course, the estimate of inter-animal variation is based on a very small sample, and thus quite variable, so we will still be careful to take account of split-plot structure in future experiments like this one. Consequently, factors that must be administered by animal, rather than by site, will be detectable only if they have very strong effects or if the number of animals is increased.

Coming attractions

The next post in this series will look at the application of QbD to design analytic methods using Fractional Factorial and Definitive Screening Designs.

References

[1] For more information on testing topical dermatological agents, see the FDA “Guidance for Industry” document at http://www.fda.gov/ohrms/dockets/ac/00/backgrd/3661b1c.pdf

[2] In the example, we use data normalized to [0,1] after dividing all by the largest AUC.

[3] Jones, B. and Nachtsheim, C. (2009). Split-Plot Designs: What, Why, and How, Journal of Quality Technology, 41(4), pp. 340-361.

About the Authors

This blog post is brought to you by members of the KPA Group: Ron Kenett, David Steinberg and Benny Yoskovich.

Ron Kenett


David Steinberg


Benny Yoskovich



Discovery Summit China focuses on global trends in data analysis

Feng-Bin Sun of Tesla delivers a keynote speech at Discovery Summit China.


About 200 experts, analysts, and JMP users and fans from all trades and professions gathered in Shenzhen for Discovery Summit China 2016.

The conference focused on the latest global trends in data analysis and its application.

Attendees came from government, banking, automotive, pharmaceutical, energy, semiconductor, electronic and public service organizations, to name a few. The annual analytics event took place at the Four Seasons Hotel in Shenzhen on April 29.

The day began with three keynote talks, featuring:

  • SAS co-founder and Executive Vice President John Sall on the JMP story. The JMP creator detailed the design and evolution of the software over 27 years, from release 1 to release 12, and noted that JMP 13 will be out in September.
  • Feng-Bin Sun of Tesla on data analysis in high-tech product research and development. He provided an overview of data analysis trends as seen in leading companies and examples featuring product reliability.
  • Author Kaiser Fung on why numbersense is a priceless asset in data science. He likened the data analysis process to running an obstacle course full of trapdoors, dead ends and diversions, explaining why the best analysts have a keen sense of direction as they navigate data.

Experienced JMP users led breakout sessions on the following topics:

  • Statistical modeling in crop research.
  • Continuous improvement in data analysis.
  • Producing China’s first integrated circuit package substrate.
  • Multiple correspondence analysis.
  • Design of experiments in high-tech.
  • Groundbreaking virtual product packaging.

During those presentations and in question-and-answer sessions, summit attendees participated in thorough and lively discussions about global trends in data analysis and its application, as well as best practices in data analysis and data-driven decision making.

John Sall (center) answers questions during an Ask the Experts session.


During the Ask the Experts sessions, attendees spent one-on-one time with JMP developers to learn tips and tricks, see demonstrations of new or unfamiliar features, and offer suggestions for upcoming versions of the software.

Comments from attendees reflected the high quality of the presentations and deep interest in JMP. "This is a great event, I've learned a lot from presenters," said Wendy Yang from Dow Chemical. Ying Zhang from BUCM said, "I did not know that JMP could be so excellent for statistical education, and I will consider using JMP when writing dissertations." Chongfa Yang from Hainan University called the conference "one of the best data analysis events I've ever attended."

Those who wanted an opportunity to go in-depth and hands-on with JMP for design of experiments (DOE) and predictive modeling attended pre-conference training at Shenzhen University.

"I have gained a lot from the training course and meetings during the past few days,” said attendee Liangqing Zhu, from Cargill. “I feel more confident to encourage my colleagues to love data analysis."

The day concluded with a Chinese-style feast and entertainment.

The third annual conference in China will take place in Beijing in 2017.

John Sall talks about The Design of JMP in his keynote speech.


 

Kaiser Fung explains why "numbersense" is priceless in his keynote talk.


 

Jianfeng Ding talks with attendees during an Ask the Experts session.


 

Kaiser Fung signs copies of his book during the one-day analytics conference


 

Like all Discovery Summit conferences, this one was highly interactive.


 

The evening entertainment included Chinese opera.



JMP Clinical is coming to PharmaSUG!

Everyone’s favorite mash-up of JMP and SAS software will be at PharmaSUG in the Mile-High City, May 8-11.

Stop by our booth in the exhibition hall to see demos of JMP and JMP Clinical, as well as of JMP Genomics, another JMP and SAS combination. You can be among the first to see the new interface and features of JMP Clinical 6.0!

In addition, attend these PharmaSUG sessions to take a deeper dive into many of the features that JMP Clinical has to offer:

  • Paper AD02: “Efficient safety assessment in clinical trials using the computer-generated AE narratives of JMP Clinical” (May 9, 1:45-2:35 p.m., Location: Centennial G). Learn how JMP Clinical leverages CDISC standards and the Velocity Template Engine to generate truly customizable narratives for patient safety directly from source data sets.
  • Demo Theater: “Assessing data integrity in clinical trials using JMP Clinical” (May 10, 3:30-4:30 p.m.). Fraud is an important subset of topics involving data quality. Unlike other data quality findings in clinical trials that may arise due to carelessness, poor planning, or mechanical failure, fraud is distinguished by the deliberate intention of the perpetrator to mislead others. Although statistical and graphical tools are available to identify unusual data, fraud itself is extremely difficult to diagnose. However, whether or not data abnormalities are the result of misconduct, the early identification, resolution, and documentation of any lapse in data quality is important to protect patients and the integrity of the clinical trial. This presentation will describe examples from the literature and provide numerous practical illustrations using JMP Clinical.

Finally, stop by and chat with Richard Zink at the SAS booth (May 10, 11:00-11:30 a.m.). He’ll be autographing copies of his two SAS Press books:

Don’t yet have a copy of these books? Stop by the SAS Press booth where you can purchase copies at 20% off or you can pick up a free excerpt of either title.


Kobe Bryant took 30,699 shots, and I've plotted them all using JMP

The Los Angeles Times recently produced a graphic illustrating the 30,699 shots that the recently retired Kobe Bryant took over the span of his 20-year career. It became such a topic of conversation that the Times later offered the graphic for $69.95 (plus shipping). The paper also published a follow-up article that described the process of creating the graphic and revealed that four software packages were utilized: Python, CartoDB, Pandas and Matplotlib.

I decided to commemorate Kobe’s retirement by recreating the graphic using only one software package: JMP, and in particular Graph Builder in JMP. The graph below represents the shots taken over the span of his 20-year career, displayed boldly in Lakers purple and gold!

KobeShotChart

How might data like this be used to the advantage of a team?

I’m convinced I could use this data to help a team win games -- and maybe even an NBA championship -- because what we have here is a serious intelligence-producing tool!

I grew up playing basketball and had the chance to interact with a lot of great coaches. I observed the way they approached “preparation,” and it has since influenced the way I approach analytics. One coach who was particularly influential was eight-time NCAA champion Pat Summitt, who always started by asking a question such as, “What can the opponent do that will cause my team problems?”

The answer to this crucial question then becomes the driving factor in game preparation.

So, as I looked at the shot chart, I thought about what great coaches would do with this information. I believe they would first ask this question: “If I’m playing the Lakers, how could their best player hurt my team?” And, in asking the question they would be driving the preparation and, specifically, how their team would prepare to stop Kobe Bryant.

To better evaluate the process, I used a Local Data Filter to look back to the 2009-2010 season when the Lakers last won the NBA championship. All season long, a “dream” matchup was anticipated between the Boston Celtics and the Lakers, and both teams advanced to the finals.

So, had I been coaching the Celtics, and prior to the beginning of the championship series, it would have been an intense period of getting my hands into the data and repeatedly asking these questions:

  1. How is Kobe Bryant scoring on us?
  2. How can we stop him?

What can we learn from the data?

Using Graph Builder and a Local Data Filter, I looked at the regular season games between the Celtics and Lakers to see if patterns might emerge indicating where Kobe was most effective.

Take a look at the graph below. You can see that Kobe was effective on the left wing (right side of the chart), and made 10 of 15 shots from a particular mid-range zone. He was probably feeling comfortable in that zone, and that had to change!
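For readers who want to see the logic behind this kind of filtered view outside of Graph Builder, here is a minimal Python sketch that filters shot records by season and opponent and then tallies makes and attempts by zone. The record layout, field names, zone label and opponent codes are all hypothetical. Only the 10-of-15 total for the left-wing zone comes from the text; the remaining records are invented purely to exercise the filter.

```python
from collections import defaultdict


def zone_summary(shots, season, opponent):
    """Return {zone: (made, attempts)} for one season and one opponent."""
    counts = defaultdict(lambda: [0, 0])
    for shot in shots:
        if shot["season"] == season and shot["opponent"] == opponent:
            counts[shot["zone"]][0] += int(shot["made"])
            counts[shot["zone"]][1] += 1
    return {zone: (made, attempts) for zone, (made, attempts) in counts.items()}


# Illustrative records only (hypothetical field names and codes).
zone = "left wing mid-range"
shots = (
    [{"season": "2009-10", "opponent": "BOS", "zone": zone, "made": True}] * 10
    + [{"season": "2009-10", "opponent": "BOS", "zone": zone, "made": False}] * 5
    + [{"season": "2009-10", "opponent": "DEN", "zone": zone, "made": True}] * 3
)

summary = zone_summary(shots, season="2009-10", opponent="BOS")
made, attempts = summary[zone]
pct = made / attempts
print(f"{zone}: {made}/{attempts} = {pct:.1%}")
```

The filter step plays the role of the Local Data Filter, and the per-zone tally is what the shaded regions of the shot chart summarize visually.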

KobeImage2

The graph below reveals something disturbing: Kobe was penetrating the lane, and while he  missed some shots (indicating that he was contested), a player of this caliber simply must be denied the lane as he would find a way to score.

KobeImage3

So to summarize the strategy, the Celtics must not allow Kobe to gain his “comfort zone” on the left wing, and they must deny him the lane. Did the Celtics accomplish this? Well, not exactly.

The Celtics had Ray Allen, Paul Pierce, Kevin Garnett and Rajon Rondo. But apparently, they didn’t have Graph Builder to help them derive key intelligence. The graph below indicates that effectively nothing changed in the playoff series versus the regular season: Kobe hit 59 percent of his shots in that championship series in the key areas that have been identified (areas where he had demonstrated success during the regular season), and the Celtics lost to the Lakers 4 games to 3.

KobeImage4

Conclusion

By asking the right questions, and efficiently visualizing the data in Graph Builder, we can uncover insights in the data that might otherwise go undetected. Such an approach is not applicable only to the Lakers and Celtics, but also to any industry. How can something like this help your work or business?

Thanks to my SAS associate Tony Cooper for helping to build the basketball court in JMP Graph Builder! Watch this step-by-step video to see how we did it:


Teaching modern stats – and assessing reasoning – at scale

If you are an instructor who teaches large-enrollment introductory statistics courses and wishes to teach a modern data-driven course, read on.

You know about the challenges of assessing student mastery in courses where there are hundreds – or even thousands! – of students and little or no support for grading homework or exams. Online tools are available, but often these scalable tools for assessment are limited to multiple-choice formats emphasizing procedural technique and/or hand computation. Instructors who want to adopt modern approaches that emphasize concepts, application and use of data analysis software may find these online tools less than adequate for assessing student performance. So how might an instructor of a high-volume course who is constrained by grading resources assess students on their statistical reasoning abilities?

JMP is pleased to announce a new partnership with WebAssign, a provider of online assessment tools used by more than 2,600 colleges, universities and high schools. The result of this partnership is a collection of new assessment items integrating interactive and dynamic JMP graphs with questions directed at interpretation and reasoning. Best of all, these assessment items are free to WebAssign users and cover many of the concepts introduced in introductory statistics courses. While the assessment items look and act like JMP graphs and output, they are actually JMP HTML5 output embedded into the WebAssign application. So, the items completely stand alone and don’t require JMP to be installed or accessed.

Learn more and check out a sample assignment.

JMP's HTML5 output provides interactivity within a browser environment.


Check out the video below that illustrates these and other assessment capabilities with JMP. We hope you will give these new assessment items a try and consider adopting JMP for your course. For more information about accessing these items and using them in your own course, contact WebAssign.

One more thing...

In a previous post, we shared with you that JMP has integrated learning “applets,” or what we have referred to as interactive learning modules, into the core JMP product. These are concept demonstration tools that help students “see” the nature of core statistical concepts in an interactive and visual manner. These JMP applets are similar to Java applets that have been available for many years, but with the added benefit of a single interface and the ability to both analyze and explore concepts from the same data of interest.

JMP's Confidence Interval of the Mean Learning Module

With the partnership with WebAssign, JMP can now support three critical needs in your introductory statistics class: data analysis and visualization, concept demonstration and now assessment. While these foundational functions are fairly standard in the modern course, they have largely been provided by different products, which can add complexity to the learning process.  Feel free to contact us at academic@jmp.com if you have any questions or would like to see JMP in action.


Using analytics to explore the eras of baseball

In my previous post, I showed how we explored the eras of baseball using a simple scatterplot that helped us generate questions and analytical direction. The next phase was figuring out how I might use analytics to aid the “subject matter knowledge” that had been applied to the data. Could I confirm what had been surmised regarding the eras of the game? And, might we use analytics to surface more periods of time in the game that we should investigate, or at least, attempt to explain?

I chose to work with the Fit Model platform in JMP, and in particular Generalized Regression.

Generalized Regression is a method that we might use to build a predictive model, but in this case I wasn’t seeking to build a predictive model. I wanted to take advantage of the variable selection capabilities that are available in the Lasso Option of this method. I wanted to see which variables were selected as “significant and important,” and in what order. I thought that what’s selected will ultimately be the years in which I’m interested – that is, I thought I would confirm my assessment of the eras of the game and probably find some interesting information that my “eyeball” method had overlooked.

Longtime SAS associate and friend Tony Cooper is one of the people who talks baseball with me. Besides being a sports fan in his own right, Tony is an expert JSL (JMP Scripting Language) programmer. In mere minutes, he created a script to help me develop a set of dummy variables that enabled better differentiation of run production through the years. It was a key to the results!
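Tony's JSL script is not shown in this post, so the exact form of the dummy variables is not known. One plausible encoding for surfacing "change" is a set of step indicators, one per candidate year, each equal to 1 from that year onward. With that coding, a variable selected by the Lasso marks a year where the level of run production shifted. The Python sketch below illustrates the idea; the function name `step_dummies` and the column naming are hypothetical.

```python
def step_dummies(years):
    """Build one step indicator per candidate year: 1 if season >= that year, else 0.

    The earliest year is skipped because its indicator would be 1 for every row
    and would duplicate the intercept.
    """
    candidates = sorted(set(years))[1:]
    return {
        f"from_{c}": [1 if y >= c else 0 for y in years]
        for c in candidates
    }


# A short run of seasons, used only to show the shape of the coding.
years = [1918, 1919, 1920, 1921, 1922]
dummies = step_dummies(years)

for name, column in dummies.items():
    print(name, column)
```

A nonzero coefficient on `from_1920`, for example, would indicate a shift in runs per game beginning in 1920 relative to the seasons before it.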

In Generalized Regression, I was looking for a method that would help me surface “change,” and I had seen this tool in action in other areas. My expectations were high, and this method did not disappoint. In fact, the results were exciting, indicating that this method could uncover “change” not only in baseball data but also in other areas!

Utilizing the JMP platform for Generalized Regression enabled me to walk through the years of the game and confirm suspicions about the eras of the game. And it also found what might not necessarily be an “era” but a demonstration of the effects of expansion in specific years, or (for example) the act of raising the mound in 1968 and then lowering it in 1969.

The graph and table shown below tell a story: The Solution Path shows the order in which variables enter the model. And, in the accompanying table, the estimates tell us if RPG (Runs Per Game) went up or down in the associated year.

[Figure: Solution Path graph with the accompanying table of parameter estimates]
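For readers who want to see the idea outside of JMP, here is a minimal Python sketch of how a Lasso path can surface "change" years. It is not Tony's JSL script or the Generalized Regression platform itself. It assumes the dummy variables are step indicators (1 from a candidate year onward, 0 before), fits the Lasso by plain coordinate descent, and records the order in which the dummies become nonzero as the penalty shrinks. The function names and the synthetic data are made up for illustration.

```python
def step_dummies(years):
    """One 0/1 column per candidate change year t: 1 when year >= t."""
    # The first year is skipped: its column would be all ones (no contrast).
    return {t: [1.0 if y >= t else 0.0 for y in years] for t in years[1:]}


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values inside [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_entry_order(columns, y, n_lambdas=10, shrink=0.95, sweeps=100):
    """Return [(year, 'up' or 'down')] in the order dummies enter the model."""
    n = len(y)
    names = list(columns)
    X = {t: center(columns[t]) for t in names}
    scale = {t: sum(v * v for v in X[t]) / n for t in names}
    beta = {t: 0.0 for t in names}
    resid = center(y)
    # Smallest penalty at which every coefficient is still zero.
    lam = max(abs(sum(a * b for a, b in zip(X[t], resid))) / n for t in names)
    order = []
    for _ in range(n_lambdas):
        lam *= shrink  # relax the penalty a little at each step of the path
        for _ in range(sweeps):
            for t in names:
                x = X[t]
                # Covariance of column t with the residual, adding back its own fit
                rho = sum(a * b for a, b in zip(x, resid)) / n + scale[t] * beta[t]
                new = soft_threshold(rho, lam) / scale[t]
                delta = new - beta[t]
                if delta:
                    resid = [r - delta * a for r, a in zip(resid, x)]
                    beta[t] = new
        entered = {t for t, _ in order}
        for t in names:
            if beta[t] != 0.0 and t not in entered:
                order.append((t, "up" if beta[t] > 0 else "down"))
    return order


# Synthetic data (not real baseball numbers): RPG jumps from 4.0 to 5.0 in 1910.
years = list(range(1900, 1920))
rpg = [4.0] * 10 + [5.0] * 10
print(lasso_entry_order(step_dummies(years), rpg))
```

With this toy series, the 1910 dummy is the first to enter, and its positive sign says runs went up that year. That is the same kind of reading the Solution Path and estimates table give for the real data.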

Just like the scatterplot, this visualization led to questions as each year was presented. For example:

  • What was going on in 1993? Answer: It could have been the start of the “Steroid Era,” or it could be due to expansion (two new teams joined the league) and an indication of the dilution of pitching talent as runs went up, or both!
  • What was happening in 1942? Answer: Runs went down as major league players left the league to join the military. The period from 1942 to 1946 is generally referred to as the “War Years.”
  • What was happening in 1920 and 1921? Answer: Baseball was leaving the “Dead Ball Era” for the “Live Ball Era.” Both players (like Babe Ruth) and events affected the game.
  • What about 1963? Answer: Baseball historians traditionally point to this year as the beginning of the period of expansion, and some call the period from 1963 to present the “Expansion Era.”
  • And finally, why were runs down in 2010? Answer: Baseball had cracked down hard on steroid users, and it may have caused a decrease in overall runs.

Conclusion

So, while having some fun playing with data from a favorite game, we were also able to demonstrate the effectiveness of generating questions and analytical direction using the simple scatterplot. We then showed how we could surface information at even greater levels using Generalized Regression. Potential applications to areas outside the arena of sports are endless. How might you use Generalized Regression?

Interested in seeing more?

  • You can learn more about the JSL script Tony Cooper created to aid in the analysis (and download it for yourself) by visiting the JMP File Exchange.
  • Follow the analysis using the Generalized Regression method by watching this step-by-step video:

 


The power of crowdsourcing data science ideas

I work with some brilliant people – there’s no doubt about that. Just around the corner and down the hall, you’ll find one of the most brilliant of them: Russ Wolfinger.

Russ is the JMP Director of Scientific Discovery and Genomics here at SAS, leading an R&D team for genomics and clinical research. He’s also a thought leader in linear and nonlinear mixed models, multiple testing and density estimation.

Over the past year, I’ve heard rumblings of Russ’ involvement with various data science competitions like Kaggle and DREAM. These competitions are an excellent way to crowdsource ideas to solve some of the most complicated and pressing problems. They are open to data scientists around the world who want to lend their expertise and develop their skills in the process.

A couple of weeks ago, I had the chance to hear Russ talk about his obsession with these competitions. He is passionate about his work in this arena and in awe of the sharp, innovative minds in the data science and predictive modeling community, minds that are investing hundreds of hours developing effective models to deal with the complex problems posed by big corporations and nonprofits alike.

Just ask him about it, and you’ll feel his enthusiasm right away. Even if you’re like me and don’t completely understand everything he’s throwing out there, you’ll be inspired by the excitement and implications of harnessing the cognitive talents of the top data scientists.

Russ came in fourth place in Kaggle’s Rossmann Store Sales challenge and was co-winner in the Prostate Cancer DREAM Challenge. He says the bag of tricks you need to solve problems of this nature is getting bigger, and he credits his participation for keeping him on top of the latest methods.

Of course, Russ is using JMP, SAS, Python and R to work on these challenges. He finds JMP software especially well-suited for the discovery and exploration phase of model building, saving him loads of time.

If you’re intrigued, I hope you’ll watch Analytically Speaking on Wednesday, May 11 at 1 p.m. ET. Russ is our guest, and data science competition is the topic of discussion.


Spotting the elusive cheetah: An Earth Day story

Earth Day focuses attention on big questions: What’s the future for Homo sapiens? How can we coexist sustainably with other species?

To stem the loss of biodiversity and ensure continued provision of essential ecosystem services, world leaders adopted the 20 Aichi Biodiversity Targets in 2010, to be fulfilled by 2020. One key target (Target 11) prescribes an expansion of the global protected area system to at least 17 percent of land surface and 10 percent of oceans.

One of the most revered conservation biologists of our time, E.O. Wilson, a Professor at Harvard and Duke, has an even bolder vision: We should set aside half the planet for the other species we share the Earth with.

That’s music to the ears of many who care about the future of our planet. But how to decide what to put aside to allow the majority of species to flourish? We’ll need not only political will, but more robust data on the numbers and distribution of the 16,000 endangered species we know (possibly 0.1 percent of all species).

Currently, species distribution maps more closely represent where conservation biologists range than the species they study.

Take one of the best studied species in the world: the cheetah. Much of its distribution across Africa (for example, Angola and Sudan) is completely unknown.

The cheetah (Acinonyx jubatus) is Africa's most endangered felid and listed as Vulnerable with a declining population trend by the IUCN Red List of Threatened Species. There are only between 7,000 and 10,000 cheetah globally, and Namibia has around a third of them.

The challenge of spotting cheetah

Spotting cheetah is not as easy as it might seem. You’d be forgiven for thinking such an iconic large cat, especially one that’s active during the day, should be pretty easy to count and geotag. Not so fast! Cheetah are surprisingly elusive, and generally have huge ranges, up to 1,400 sq km in Namibia. Roaming across commercial farmland in Namibia, they are often persecuted for raiding domestic livestock and have learned to keep a low profile. As a result, population estimates range from 2,905 to 13,520 and distribution maps show huge country-sized gaps over Africa.

Shortly after we’d formed WildTrack to monitor endangered species, we were approached by the N/a’an ku sê Wildlife Foundation, based near Windhoek in Namibia. The conservationists there were trying to make peace with local landowners – who were mostly tough, no-nonsense Afrikaners – to mitigate human/cheetah conflict. Unless the Foundation could relocate troublesome cheetahs to new areas, the farmers would simply shoot them. At that point, the then-Director of Research, Dr. Florian Weise, was convinced by his own trackers that footprints offered a new key to censusing and managing Namibian cheetah populations, helping understand distributions to keep cheetah and livestock separate. He reached out to ask us if FIT, our footprint identification technique, could help.

A 'training data set' of cheetah footprints

Our first step in developing the FIT algorithm for the cheetah was to collect a "training data set" of footprints from cheetah of known ID, age-class and sex. Drawing on lessons learned from tiger footprinting at Carolina Tiger Rescue in North Carolina, we thought about adapting the technique to Namibian conditions. Florian and his expert team at N/a’an ku sê engaged the capable help of Chester Zoo in the UK, Cheetah Conservation Botswana, AfriCat and Foundation SPOTS in the Netherlands, where the training data set of cheetah footprints would be collected.

We had some initial challenges finding the right substrate texture to hold clear prints. We tried different sand/water mixtures and experimented in coaxing cheetah to walk at a "natural" pace over the sand trails we laid. Before long, we demonstrated that captive cheetah can directly contribute toward the conservation of their wild counterparts through their footprints and had collected 781 footprints (male:female ratio was 395:386) belonging to 110 trails, from 38 cheetah.

Chester Zoo Cheetah Adaeze sample

Processing pictures of footprints

The digital images of footprints were then ready to be processed in the FIT software that runs as an add-in to JMP data analysis and visualisation software from SAS. FIT first takes the prints and optimises their presentation and orientation to a standardised format. It then allows measurements of distances, angles and areas to be made between anatomical points on the footprint image.
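To give a feel for the kind of measurements involved, here is a minimal Python sketch of distances, angles and areas computed from landmark coordinates on a footprint image. It is not the FIT add-in's code. The landmark names and coordinates are hypothetical and only illustrate the geometry.

```python
import math


def distance(p, q):
    """Straight-line distance between two landmark points (x, y)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_at(vertex, a, b):
    """Angle in degrees at `vertex` between the rays toward a and b."""
    ang_a = math.atan2(a[1] - vertex[1], a[0] - vertex[0])
    ang_b = math.atan2(b[1] - vertex[1], b[0] - vertex[0])
    deg = abs(math.degrees(ang_a - ang_b)) % 360
    return 360 - deg if deg > 180 else deg


def polygon_area(points):
    """Area enclosed by landmarks listed in order (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Hypothetical landmarks in image units; real FIT landmarks are not shown here.
heel_pad = (0.0, 0.0)
toe_left = (-2.0, 5.0)
toe_right = (2.0, 5.0)

print(distance(heel_pad, toe_left))
print(angle_at(heel_pad, toe_left, toe_right))
print(polygon_area([heel_pad, toe_right, toe_left]))
```

Each footprint then becomes one row of such numbers, which is the form a classifier needs.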

These are fed into a robust cross-validated discriminant analysis and the output processed by Ward’s clustering. The resulting output is a dendrogram that allows us to tweak the algorithm to provide classification for the number of individuals, sex and age-class. For cheetah, we have consistent accuracies of >90 percent for individual identification.
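The clustering step can be sketched in a few lines of Python. This is a simplified illustration, not the FIT implementation: it skips the discriminant analysis entirely and assumes each trail has already been reduced to a short vector of scores. It then applies Ward's rule, which repeatedly merges the two groups whose union adds the least within-group variance. The function name and the toy scores are hypothetical.

```python
def ward_clusters(points, k):
    """Group points into k clusters with a naive version of Ward's method."""
    clusters = [
        {"members": [i], "centroid": list(p), "size": 1}
        for i, p in enumerate(points)
    ]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                sq_dist = sum(
                    (u - v) ** 2 for u, v in zip(ca["centroid"], cb["centroid"])
                )
                # Ward's merge cost: increase in within-cluster sum of squares
                cost = ca["size"] * cb["size"] / (ca["size"] + cb["size"]) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        size = ca["size"] + cb["size"]
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [
                (ca["size"] * u + cb["size"] * v) / size
                for u, v in zip(ca["centroid"], cb["centroid"])
            ],
            "size": size,
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c["members"]) for c in clusters)


# Toy scores for six trails (made up): two animals, three trails each.
trail_scores = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(ward_clusters(trail_scores, 2))
```

In practice the number of individuals is not known in advance, which is why the dendrogram matters: it shows the whole merge sequence so a cut-off can be chosen.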


Last year, we were approached by the Journal of Visualized Experiments (JoVE) to publish an article. JoVE is the world’s first peer-reviewed scientific video journal and offered us an ideal opportunity to communicate about FIT in JMP for cheetah to a wide audience.

Our paper in JoVE details the whole process of FIT for cheetah, with video footage from Namibia: Jewell, Z. C., Alibhai, S. K., Weise, F., Munro, S., Van Vuuren, M., Van Vuuren, R. Spotting Cheetahs: Identifying Individuals by Their Footprints. J. Vis. Exp., e54034, doi:10.3791/54034 (2016).

Let the field monitoring begin!

The FIT algorithm for cheetah is up and running in the field. Duke University Master's student Kelly Laity, supervised by Professor Stuart Pimm, N/a’an ku sê and WildTrack, has categorized the quality of footprints that can be used by FIT. We’re now going to apply the technique to monitoring cheetah at N/a’an ku sê fieldwork sites and are confident that it will begin to shed light on true numbers and distribution of cheetah.

FIT algorithms for many other species are in the pipeline, and together with many emerging technologies in conservation, engineering, forensic science, computer science and other disciplines, can begin to map species better.

Moving forward, if we are to set aside land for other species, we’ll need to make more efficient use of the land we have. We’ll need new technologies to intensify agriculture and develop innovative architectural engineering strategies to make cities healthier and better places to live.

Earth Day is a day to think about the challenges we face. How we approach them will, quite simply, determine our future on Earth. E.O. Wilson’s clear vision is a wonderful start.

To learn more about WildTrack, here's a quick look at our mission:


Using data visualization to explore the eras of baseball

In consulting with companies about building models with their data, I always talk to them about how their data may differentiate itself over time. For instance, are there seasons in which you might expect a rise in flu cases per day, or is there an economic environment in which you might expect more loan defaults than normal? These are examples of key pieces of information that come with a challenge: How do you identify these periods in your data where change occurs? And, can you explain the change?

This topic is always at the top of mind when I work with customers. This week, the game of baseball is also on my mind, as the season is now underway.

Recent discussions with some of my friends who are baseball fans have centered on the history of the game, and how various rules and events affected the game over time. Or had the game been affected? Some thought the rule changes and events could not have had a significant impact, while others were noncommittal.

As a statistician with data and tools to analyze it, I decided to do a bit of research. It occurred to me that this was a nice opportunity to illustrate how we might discern the periods of time that have been affected by events, policy or rules. We could have fun with baseball data while keeping in mind that the same approach could apply to businesses in other industries.

My approach

Major League Baseball (MLB) makes its data publicly available through Sean Lahman (SeanLahman.com). From his robust set of files, I built a comprehensive database of SAS data sets featuring baseball data.

The approach to discerning different periods of time (or in the case of baseball, eras) was twofold: First, I would rely on the expert opinion of … myself. And second, I would explore an analytical technique to see if the result would agree and support expert opinion – and also, would it surface more periods of interest? You'll see this second part in a follow-up post.

I like to develop my data using the SAS Data Step, and did so from within SAS Enterprise Guide. In doing so, I developed a simple metric representing the Runs Per Game (RPG), believing that would be the metric that I could use to represent rule changes over time. It’s been said that “runs are the currency of baseball,” and if a rule or event disrupted the normal production of runs over time, then we should discuss it! I built the data set and seamlessly sent it to JMP.
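I did this step in the SAS Data Step, but the metric itself is simple enough to sketch in a few lines of Python. The version below assumes one record per team per season with total runs and games, and defines RPG for a year as total runs divided by total team-games, which is one reasonable reading of "mean RPG" rather than the exact calculation I ran. The field names are hypothetical.

```python
from collections import defaultdict


def mean_rpg_by_year(team_seasons):
    """Runs per team-game for each year, from one record per team-season."""
    runs = defaultdict(int)
    games = defaultdict(int)
    for row in team_seasons:
        runs[row["year"]] += row["runs"]
        games[row["year"]] += row["games"]
    return {year: runs[year] / games[year] for year in sorted(runs)}


# Made-up team-season records, only to show the shape of the input.
sample = [
    {"year": 1968, "runs": 500, "games": 162},
    {"year": 1968, "runs": 600, "games": 162},
    {"year": 1969, "runs": 700, "games": 162},
]
print(mean_rpg_by_year(sample))
```

Dividing summed runs by summed games weights each team by the games it actually played, which matters in seasons where schedules were uneven.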

A graph spurs discussion

Using Graph Builder in JMP, I quickly created one of my favorite means of analytical communication: the scatterplot. This one featured the mean RPG versus Year. And, as soon as I built the graph (and shared it), the questions and observations from my friends started to flow:

  • Why were there so many runs before 1900?
  • Why were there so few runs between 1900 and 1920?
  • Why did runs fall off in the early 1940s?
  • Runs didn’t rise as much as I had expected in the 2000s…
  • What era are we in now?

The graph evolved a bit as we discussed these questions. Here’s the scatterplot of Mean Runs Per Game Through the History of Baseball that triggered these questions and many more.

Mean Runs Per Game Through the History of Baseball

I added the colors and reference lines as the eras of the game were differentiated in our discussions. The majority of the questions directly related to eras as identified by baseball historians.

Some of the questions (and answers) were as follows:

Why were there so many runs scored before 1900?

  • Until 1887, the batter could essentially call the pitch (i.e., “high or low”), and the pitcher was obligated to comply.
  • Until 1885, “flat” bats were used.
  • Until 1883, pitches were launched below the waist and had less velocity.
  • Until 1877, there were “fair foul hits” where balls that might hit inbounds and “kick out” before first or third base were considered hits (today they are called foul balls).
  • This era was known as the “19th Century Era.”

Why were there so few runs scored from 1900 to 1920?

  • Many manufacturers produced baseballs with poor and inconsistent specifications.
  • Teams used the same ball literally until the cover came off – it became dirty and difficult to see.
  • This era was known as the “Dead Ball Era.”

What happened to increase run production after 1920?

  • After Ray Chapman was hit and killed by a pitch, baseball began using clean balls. Witnesses stated that Chapman didn’t even flinch, which led most to believe that he hadn’t seen the ball approaching.
  • Home run hitters like Babe Ruth emerged.
  • Consistent manufacturing (with consistent rubber cores) made the baseballs come off the bats more readily.
  • This era was known as the “Live Ball Era.”

What happened in the early 1940s? It appears runs fell off again.

  • Replacement players played the games as “the regulars” joined the military during World War II.
  • This period is not always called an era, but referred to as the “War Years.”

Questions continued to bubble up…

The discussion continued in many interesting directions, for example:

  • The era from 1947 to 1962 is referred to as the “Integration Years,” as Jackie Robinson joined the Dodgers on April 15, 1947.
  • Baseball historians are still unsure what to call the era that begins in 1963, or even how many eras might exist from 1963 to the present. Here are some of the events and rule changes that have affected the game since 1963:
    • The league expanded from 16 to 30 teams, effectively diluting the talent among teams and prompting many to refer to this entire period from 1963 to present as the “Expansion Era.”
    • The American League instituted the “Designated Hitter” into the game in 1973, leading some to refer to the period from 1973 to present time as the “Designated Hitter Era.”
    • Rumors of players using performance-enhancing drugs surfaced in the mid-1990s, resulting in some calling 1995 to 2009 the “Steroid Era.”

What’s cool about all this “discovery” is that it happened from the initial scatterplot, and the identification of what appears to be clusters of years with similar RPG. As we identified clusters of years with similar run production, we either explained the reason behind the cluster, or noted it as a period of time having a change due to an unknown cause (and looked forward to researching it further!).
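If you want to put numbers next to the eras we talked through, a small Python sketch like the one below can label each year and average RPG within each era. The boundary years are my reading of the discussion above (for example, treating 1920 as the first Live Ball year and 1942 to 1946 as the War Years), so treat them as assumptions you can adjust. The function names are made up for illustration.

```python
from bisect import bisect_right

# First year of each era after the earliest one (assumed boundaries).
ERA_STARTS = [1900, 1920, 1942, 1947, 1963]
ERA_NAMES = [
    "19th Century Era",
    "Dead Ball Era",
    "Live Ball Era",
    "War Years",
    "Integration Years",
    "Expansion Era",
]


def era_of(year):
    """Name of the era a year falls in, based on ERA_STARTS."""
    return ERA_NAMES[bisect_right(ERA_STARTS, year)]


def mean_rpg_by_era(rpg_by_year):
    """Unweighted mean of yearly RPG values within each era."""
    totals = {}
    for year, rpg in rpg_by_year.items():
        total, count = totals.get(era_of(year), (0.0, 0))
        totals[era_of(year)] = (total + rpg, count + 1)
    return {era: total / count for era, (total, count) in totals.items()}


# Made-up yearly values, only to show the call.
print(mean_rpg_by_era({1901: 4.0, 1902: 3.0, 1930: 5.0}))
```

Changing a boundary in ERA_STARTS and rerunning is a quick way to see how sensitive an era's average is to where you draw the line.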

Next week, we'll use analytics to try to confirm these eras of the game and possibly uncover more periods worth investigating.

Interested in seeing more? This step-by-step video shows how I created the graph in Graph Builder:
