Celebrating statisticians: George E.P. Box

In this International Year of Statistics, we at JMP are celebrating famous statisticians on a monthly basis. This month is my turn, and early this year I chose Professor George E.P. Box as the subject of my celebration. I was looking forward to writing this piece because I knew George personally and have been an admirer of his since the beginning of my career.

Sadly, George passed away in late March, and I wrote a remembrance of him for the JMP Blog at that time. That blog post expresses what I would have written in a post celebrating him. So, instead of speaking in general about his life and accomplishments, in this post I will focus on one of his many great papers. My plan is to write several such blog posts this month, each emphasizing a different one of his wonderful publications. One of the benefits for me is that I get to reread these papers.

In this post, I want to focus on the first part of his two-part paper with J. Stuart Hunter on the family of regular two-level fractional factorial designs, published in Technometrics in 1961.

This seminal paper is 40 pages long, and one thing I found notable about it was that the mathematical content did not go past arithmetic and a little algebra! Despite this, there are many fundamental results in this paper, but all are stated in natural language without formal proofs. That was refreshing.

How does the paper begin?

The paper starts with a brief exposition of two-level full factorial design in k factors. It shows how these designs can estimate interactions of all orders up to the k-factor interaction. This provides the motivation and background for introducing a half fraction of the full factorial design. They illustrate the construction method using the 2^(4-1) design, showing how one starts with the full factorial design in three factors and then adds a fourth factor by computing the elementwise product of the first three factors.
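The construction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' or JMP's procedure: the variable names are my own, levels are coded -1 and +1, and the run order produced by `itertools.product` is not necessarily the order used in the paper's tables.

```python
from itertools import product

# Full factorial in three two-level factors, coded -1 / +1 (8 runs).
full_factorial_3 = [list(run) for run in product((-1, 1), repeat=3)]

# Half fraction 2^(4-1): factor 4 is the elementwise product of factors 1, 2 and 3.
half_fraction = [run + [run[0] * run[1] * run[2]] for run in full_factorial_3]

for run in half_fraction:
    print(run)
```

Each of the four columns is balanced (four runs at -1 and four at +1), which is one quick check that the fraction was built correctly.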

Can you show this in JMP?

To reconstruct their example in JMP the way they did it, we start with the Full Factorial designer. We call our factors 1, 2 and 3 and our response Y. We compute the 4th column using the formula editor. The Y column in our Table 1 below has the same values as the ones they use in their Table 3.

Table 1: Formula column for factor 4 as the elementwise product of the first three.

Of course, we could also just use the Screening Designer in JMP to enter 4 factors. The design we want is the first in the list. :-)

What happens next?

They now have a design with 8 runs that is just half as many runs as are in the full factorial design with four factors. With the full factorial design, you can estimate 16 effects – the overall average, 4 main effects, 6 two-factor interactions, 4 three-factor interactions and 1 four-factor interaction (16 = 1 + 4 + 6 + 4 + 1). Now with 8 runs, you can only estimate 8 effects. It turns out that the construction the authors use confounds the 16 effects of the full factorial into 8 pairs of effects. The average is confounded with the four-factor interaction. The 4 main effects are each confounded with one of the 4 three-factor interactions. Finally, the 6 two-factor interactions are confounded in 3 pairs (8 = 1 + 4 + 3).
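The pairing of effects can be checked by brute force. The sketch below (names are my own, not from the paper) builds the contrast column for each of the 16 effects in the 8-run design and groups effects whose columns are identical. The overall average is written as I.

```python
from itertools import combinations, product
from math import prod

# The 8-run half fraction: factor 4 = product of factors 1, 2 and 3.
runs = [r + (r[0] * r[1] * r[2],) for r in product((-1, 1), repeat=3)]

def effect_column(factors):
    """Contrast column for an effect: product of the named factor columns.

    An empty tuple gives the all-ones column used for the overall average.
    """
    return tuple(prod(run[f - 1] for f in factors) for run in runs)

# Group the 16 effects of the full factorial by their column in the half fraction.
alias_groups = {}
for size in range(5):
    for factors in combinations((1, 2, 3, 4), size):
        alias_groups.setdefault(effect_column(factors), []).append(factors)

for group in alias_groups.values():
    print(" = ".join("".join(map(str, f)) or "I" for f in group))
```

The 16 effects fall into 8 groups of two: the average with the four-factor interaction, each main effect with a three-factor interaction, and the two-factor interactions in three pairs.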

Figure 1 below shows the analysis from the JMP Screening platform. The values JMP reports are half of the quantities Box and Hunter report, because they define their effects as being the difference in the response when changing from one level of the factor to the other. JMP defines an effect as the change in the response due to a one-unit change in the factor. Since one level of the factor is coded -1 and the other is coded +1, each factor changes by two units going from its low to its high level. Thus, the effect of a one-unit change is half the effect of going from low to high.
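A tiny numeric example makes the factor of two concrete. The response values below are made up for illustration and are not from the paper or from the JMP report.

```python
# Hypothetical responses for a single factor coded -1 / +1 (not the paper's data).
x = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [10.0, 14.0, 11.0, 17.0, 9.0, 15.0, 12.0, 16.0]

high = [yi for xi, yi in zip(x, y) if xi == 1]
low = [yi for xi, yi in zip(x, y) if xi == -1]

# Box and Hunter's definition: mean response at the high level minus mean at the low level.
box_hunter_effect = sum(high) / len(high) - sum(low) / len(low)

# Change per one-unit change in x: the least-squares slope, which for a
# balanced -1/+1 column reduces to sum(x * y) / n.
slope = sum(xi * yi for xi, yi in zip(x, y)) / len(x)

print(box_hunter_effect, slope)  # 5.0 2.5
```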

Figure 1: JMP screening analysis showing aliases for the two-factor interactions

How does the rest of the paper go?

Of course, the paper is much too long for me to cover everything Box and Hunter introduce – especially not in this level of detail. Here are some of the big concepts:

  • Generalizing their 4-factor example, they show that the best way to create a half fraction of a k factor full factorial design is to start with a full factorial design in k – 1 factors and then calculate the last column by computing the elementwise product of the original k – 1 columns. They also show that one can reconstitute the full factorial by combining a half fraction with a second half fraction in which every value is obtained by multiplying the corresponding value in the first fraction by –1. This leads to the concept of a foldover design – a term they also introduce here.
  • They introduce the idea of design resolution and define resolution III, IV and V designs. Introducing the idea of a saturated design, they describe resolution III designs of 7 factors in 8 runs, 15 factors in 16 runs and 31 factors in 32 runs. They also throw a bone to Plackett and Burman (1946) mentioning their constructions of 11 factors in 12 runs, 19 factors in 20 runs, 23 factors in 24 runs, etc.
  • They introduce the idea of design generators and use this idea to show how to block the fractional factorial designs in groups of runs that each have 2, 4, 8 or some other power of 2 runs per block.
  • They show how to obtain designs of resolution IV by folding over a design of resolution III and introduce the idea of design projectivity. For example, they state that every resolution IV design projects to a full factorial (or replicated full factorial) in any three of the factors. The benefit of this is that if only three factors turn out to be important, it is possible to estimate all the interaction effects of those three factors. And, it does not matter which three are important.
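The foldover idea in the last bullet can be illustrated with the saturated design of 7 factors in 8 runs. The sketch below is my own illustration under stated assumptions: the generators (4 = 12, 5 = 13, 6 = 23, 7 = 123) are one common choice and not necessarily the ones printed in the paper, and `max_main_vs_2fi` is a hypothetical helper name.

```python
from itertools import combinations, product

# Resolution III design: 7 factors in 8 runs, built from a full factorial in 3 factors.
base = list(product((-1, 1), repeat=3))
res3 = [(a, b, c, a * b, a * c, b * c, a * b * c) for a, b, c in base]

# Foldover: append a copy of the design with every sign reversed (16 runs).
folded = res3 + [tuple(-v for v in run) for run in res3]

def max_main_vs_2fi(design):
    """Largest absolute correlation between any main-effect column
    and any two-factor interaction column."""
    k = len(design[0])
    worst = 0
    for m in range(k):
        for i, j in combinations(range(k), 2):
            dot = sum(run[m] * run[i] * run[j] for run in design)
            worst = max(worst, abs(dot))
    return worst / len(design)

print(max_main_vs_2fi(res3))    # 1.0: some main effect is fully confounded
print(max_main_vs_2fi(folded))  # 0.0: main effects are clear of two-factor interactions
```

Reversing all signs flips every main-effect column but leaves every two-factor interaction column unchanged, so the two halves cancel in each cross product.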

Where has design for screening gone in the 50+ years since then?

It is a tribute to the combined power and simplicity of this approach that the regular two-level fractional factorial designs are still in frequent use today. The construction and analysis of these designs do not require a computer, which made them popular when computers were rare. Of course, the calculations can be a bit tedious, so having a computer do them for you makes for fewer errors and more free time.

In the same year as the publication of this paper, Hall published 5 different orthogonal arrays for 15 factors in 16 runs. The saturated design in Box and Hunter’s paper was one of the 5. Hall's paper was also fundamental, as it turns out that all the 16-run orthogonal arrays for fewer factors are projections of the Hall arrays.

Forty years later, Sun, et al. (2002) catalogued all the orthogonal 16 run designs for 5 to 14 factors. For 9 to 14 factors, the 16 run designs of Box and Hunter are all of resolution III, which means that main effects are confounded with two-factor interactions. Sun, et al. found designs in these cases where none of the two-factor interactions confounds a main effect. Instead, some two-factor interactions may be correlated either plus or minus one-half with a main effect. The benefit of these designs is that main effects can be identified without the built-in ambiguity that resolution III designs entail.

References

Box, G. E. P. and Hunter, J. S. (1961) "The 2^(k-p) Fractional Factorial Designs Part I" Technometrics Vol 3, No. 3 311-351.

Hall, M. Jr. (1961). Hadamard matrix of order 16. Jet Propulsion Laboratory Research Summary, 1, 21–26.

Sun, D. X., Li, W., and Ye, K. Q. (2002), “An Algorithm for Sequentially Constructing Non-Isomorphic Orthogonal Designs and Its Applications,” Technical Report SUNYSB-AMS-02-13, State University of New York at Stony Brook, Dept. of Applied Mathematics and Statistics.


Is your data too precise?

There is usually a desire to record every measurement as precisely as possible. In theory, that is good, but for the purposes of data analysis, more precision isn't always better.

It is usually best to examine any continuous variable and determine a reasonable precision for the recorded values. For instance, suppose I have a variable X, and X has the values shown in the table below:

The data is recorded to 10 decimal places. But if we are analyzing this data, do we really need that many digits to the right of the decimal? One way to examine this is to look at the plot of the estimates of the mean and standard deviation of this data, as a function of the level of rounding used. A good question to ask is "How sensitive are the summary statistics to the level of precision?"


A rule of thumb

The table and the chart show that we don't lose very much information in the estimate of the mean even if we round all the way to one decimal place. A good rule of thumb about the needed precision for a variable is to divide the standard deviation of the unrounded data by 3, and round to the decimal place of the leading significant digit of the result. In the example shown here, the standard deviation of the unrounded data is 0.3612, so 1/3 of that is 0.12, whose leading significant digit is in the first decimal place, indicating that one decimal place is sufficient.
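The rule of thumb is easy to turn into a small function. This is a sketch of my reading of the rule, not code from JMP; the function name `decimals_for_sd` is an assumption, and values of sd/3 at or above 1 are simply mapped to zero decimal places.

```python
import math

def decimals_for_sd(sd):
    """Decimal places suggested by the sd / 3 rule of thumb.

    Takes the decimal position of the leading significant digit of sd / 3.
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    third = sd / 3
    return max(0, -math.floor(math.log10(third)))

print(decimals_for_sd(0.3612))  # 0.3612 / 3 is about 0.12, so 1 decimal place
```

To apply it to a column of values, compute the sample standard deviation of the unrounded data first (for example with `statistics.stdev`) and then round the column with `round(value, decimals_for_sd(sd))`.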

Does this matter when building models?

Recently, I encountered a predictive modeling problem where the time it took to fit a decision tree model was longer than desired (many hours of computational time). The data set being used was fairly large (several million rows). Investigating the problem, I discovered that many of the continuous predictor variables (the Xs, or the factors) were recorded out to many decimal places. After I applied the rounding rule of thumb, the computation time for fitting a decision tree model decreased to several minutes. Why did this happen? The recursive partitioning algorithm that builds decision trees uses the unique levels of each continuous predictor and builds binary splits based on each unique level in order to find the optimum split. If the factors in the model are overly precise, this adds a large amount of computational overhead with often very little benefit in improved model accuracy. Rounding the factors reduced the number of unique levels and made the model fitting algorithm perform much faster.
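To see how much rounding shrinks the search, the sketch below counts candidate split points for a simulated predictor before and after rounding. The data are simulated, not the data from my problem, and the count assumes the exhaustive search described above (one candidate split between each pair of adjacent unique values); actual tree implementations may search differently.

```python
import random

random.seed(0)
# Simulated continuous predictor recorded at full floating-point precision.
x = [random.gauss(0.0, 0.36) for _ in range(100_000)]

def candidate_splits(values):
    """Number of binary split points between adjacent unique values."""
    return len(set(values)) - 1

print(candidate_splits(x))                          # close to one per row
print(candidate_splits([round(v, 1) for v in x]))   # a few dozen at most
```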

Binning Data

In practice, it is also known that even less precise indicators of the levels of continuous variables can be useful and lead to better models. One approach to reducing the precision of a predictor variable is to employ binning. Binning simply assigns each value of a continuous variable to a categorical level. Rounding is one example of binning. Another example of binning is to build a histogram of the data and use the bins in the histogram as the predictor. A previous blog post described an interactive binning tool that you can use to do this sort of binning manually. A third example of binning is to use a supervised approach, where the bins in the predictor are chosen in a way that maximizes the predictive ability of the binned variable.

For the data analysis problem I faced, with overly precise data leading to long computational time, I had 100 predictor variables that needed some level of binning applied. The interactive binning approach would have taken too long, and the supervised binning approach is, in itself, computationally intense, so it was taking quite a bit of time. I decided to employ an unsupervised binning approach that simply looks for groupings or clusters in the predictor variables, one variable at a time.

Binning Data Using Normal Mixture Distributions

Consider the slightly "lumpy" variable shown in the histogram below. A Normal Mixture distribution using 3 normal distributions is fit to the data (Hotspot>Continuous Fit>Normal Mixtures>Normal 3 Mixture). The parameter estimates from the mixture distribution are recorded, and a binning formula is created that assigns each row to the distribution in the mixture for which the row has the highest associated probability.

The binned variable is an integer that preserves the ordering in the data, so it can be used as a continuous, ordinal or nominal variable.
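A binning formula of this kind might look like the following sketch. The parameter estimates are hypothetical placeholders, not values from the example or from the add-in, and the function names are my own. Components are listed in order of increasing mean so that the bin number follows the ordering of the data.

```python
import math

# Hypothetical estimates for a 3-component normal mixture, sorted by mean:
# (mixing proportion, mean, standard deviation)
components = [(0.3, 10.0, 1.0), (0.5, 15.0, 1.5), (0.2, 22.0, 2.0)]

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mixture_bin(x):
    """Return the 1-based index of the component with the highest
    weighted density (proportional to its posterior probability) at x."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in components]
    return scores.index(max(scores)) + 1

print([mixture_bin(v) for v in (9.2, 14.1, 16.8, 23.5)])
```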


To automate the process of using normal mixture distributions for unsupervised binning, I created a JMP add-in (available for download on the JMP File Exchange -- requires a free SAS profile).

As part of your data preparation for modeling, take into account the data precision of your predictor variables, and see if lower precision might be helpful as you do your analysis.  I hope the ideas shared in this post are useful to you as you work to build better models.


JMP attends the PSI Conference in Edinburgh

I had the pleasure of interviewing Richard Zink, Principal Research Statistician Developer in the JMP Life Sciences division, prior to his visit to the UK to speak at the PSI (Statisticians in the Pharmaceutical Industry) Conference in Glasgow on 14 May. His PSI talk is titled "Assessing the Similarity of Subjects Within a Study Site" and will discuss sampling approaches and describe how the availability of extensive computerized logic and validation checks early in the clinical trial not only ensures data quality, but can be used to identify potentially fraudulent activities. Richard has been instrumental in the development of JMP Clinical, especially for Pharmaco-Vigilance (PV), clinical trial fraud, patient narratives and Bayesian methods.

International Conference on Harmonisation (ICH) guidelines suggest that clinical trial data should be actively monitored to ensure data quality. What do you see as the limitations of and issues surrounding on-site monitoring of clinical trials?

Traditionally, this is a very manual process where monitors compare case report form (CRF) pages against the physician records. Not only is this time-consuming, but traveling to numerous clinical sites can be extremely expensive. There are also limitations in how the data can be reviewed. When working with paper, it would be extremely difficult to examine trends in a variable across time or compare the results of multiple subjects. It is also not possible to compare results across investigator sites.

What is your view of risk-based monitoring of clinical trials?

I think people have interpreted the ICH guidelines very literally and have gotten in the habit of performing 100 percent source data verification (SDV) for all CRF fields. People may spend a lot of time reviewing fields that have little chance of error, and at the end of the day, may have little impact on the findings of the clinical trial. In risk-based monitoring, we would take a random sample of available CRF pages and perform a thorough review of these sampled pages. Only if the number of errors exceeded a certain error rate would more CRFs be sampled. Of course, it would be important to sample from all relevant CRF domains. Further, the sampling fraction may be based on the importance of the data to the study. For example, given their importance, 100 percent of all data for the primary endpoint and serious adverse events (SAEs) may be reviewed.
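The sampling scheme described in this answer can be sketched as a simple loop. This is only an illustration of the idea under my own assumptions: the function name, the batch size and the error-rate threshold are hypothetical, and a real monitoring plan would stratify by CRF domain and review critical fields in full.

```python
import random

def review_crf_pages(pages, has_error, batch_size, max_error_rate, rng):
    """Review random batches of CRF pages, drawing another batch only
    while the observed error rate exceeds the allowed rate.

    Returns (pages reviewed, errors found).
    """
    remaining = list(pages)
    rng.shuffle(remaining)
    reviewed, errors = 0, 0
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        reviewed += len(batch)
        errors += sum(1 for page in batch if has_error(page))
        if errors / reviewed <= max_error_rate:
            break
    return reviewed, errors

# Example: 100 pages, of which pages 0 to 4 contain an error (made-up data).
rng = random.Random(2013)
print(review_crf_pages(range(100), lambda p: p < 5, 10, 0.05, rng))
```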

Given the drive to reduce the cost of clinical trials, how can statistical sampling be used to achieve data quality whilst minimizing cost of analyzing trials?

Hopefully, the sampling will be performed in such a way as to minimize the amount of on-site monitoring at the sites. This is beneficial in terms of travel costs, but also in terms of the number of person-hours spent manually reviewing the data. It is also extremely important to perform central monitoring of the data using a robust set of computational tools. This can include checks for outliers or implausible values, but may include more complex analyses to examine trends across time, identify missing data, or identify noteworthy differences between the investigator sites.

What are the emerging trends in discovering fraud in clinical trials, and how do you see tools evolving to meet the requirements for examining data for quality and fraud?

I think fraud has always been an important issue. Fraud, compared to other issues regarding data quality, is different in that there is a deliberate intent to deceive. Other issues regarding data quality may be due to poor training or carelessness. If you have identified unusual values that point to a data quality problem, going the extra step to say that this is necessarily due to fraud is difficult. At the end of the day, whether the problem is due to fraud or carelessness, data quality issues need to be identified early so that appropriate remedies can be applied to minimize disruption to the trial and maintain trial integrity. In the last 10 to 15 years, there have been some publications describing statistical methods used to identify fraud in clinical trials. With numerous competing priorities, implementing these methods in practice may be difficult. Data standards certainly help to ensure that any software developed can apply to different study teams, therapeutic areas or companies. Interactive graphical methods are useful to get as many team members involved as possible, with the ability to drill down to interesting cases. Of course, there are a lot of ways in which things can go wrong. Making these reviews efficient will be extremely important.

JMP Clinical covers so much more than just improving data quality and uncovering fraud. What are the key capabilities that you would highlight about the software?

I think the software has something for everybody, but what I find very satisfying about the software is its interactivity and the ability to review graphical results and statistical summaries side by side. In addition to the seven fraud detection tools, JMP Clinical has customizable patient profiles and adverse event narratives that allow for more straightforward clinical review and reporting. There is a snapshot comparison feature that allows the user to identify new or modified records as the study database is updated. There is a built-in notes feature that allows users to save and view notes at the analysis-, subject- or record-level. For the more analytically minded, we have a robust set of analyses for adverse events with adjustment using FDR and double FDR for incidence or time-to-event analyses, and a new feature that makes use of Bayesian hierarchical modeling. There is also an extensive set of predictive modeling tools and cross-validation features.


Registration is open for Discovery Summit

Every year, JMP users and developers get together to challenge theories, benchmark best practices, and share tips, tricks and innovative methods. Discovery Summit is an opportunity for JMP users — including everyone from total beginners to seasoned experts — to explore new strategies and methodologies and ultimately leave better equipped to analyze data and spread analytic excellence.

And we’ve got good news: Registration is open now!

This year’s conference, located in San Antonio, is the perfect opportunity for analytics trailblazers to refine their skills, present their ideas and network with other JMP users. In addition to breakout sessions and poster presentations, Discovery Summit features pre- and post-conference training, keynote speeches from New York Times blogger Nate Silver and statistics professor Dick De Veaux, and the official unveiling of JMP 11 by co-founder and Executive Vice President of SAS, John Sall.

It might seem like such a premier event would come at a premium price, but registration for Discovery Summit is only $600. And remember: That includes all meals during the conference, if you choose, from dinner Sept. 9, to the last-chance networking lunch on Sept. 12. Plus, many attendees will qualify for a reduced rate of $300 through one of the several discount programs.

So, whether you’re looking to kickstart your JMP skills as a new user or learn the latest advanced techniques for experienced users, Discovery Summit offers a fun, high-value opportunity. Join the rest of the JMP community for this celebration of analytics.

Register for Discovery Summit today!


Analyzing adverse events using Bayesian hierarchical models

You may be asking yourself… “Two Bayesian posts in a row? What is going on?”

Though my statistical training focused on Frequentist methodologies, I am a big believer in using whatever tools help me gain insight into the statistical problem I happen to be focusing on at the moment. Frequentist or Bayesian, it makes no difference. In fact, because I'm a biostatistician, my appreciation of these two schools of statistical inference makes complete sense.

It is a common misunderstanding that biostatistics focuses on the statistical analysis of medical or biological data. Nothing could be further from the truth. Let’s dissect the word “biostatistics.” According to the dictionary, “bi-” means “two.” So clearly, “bi-” and “statistics” refers to the two major inferential paradigms of modern statistics. What does the “o” stand for? It stands for “oh, how I love.”

So let’s review:

bi                   “Bayesian or Frequentist”

o                     “oh, how I love”

statistics       “statistics”

So biostatistics means “Bayesian or Frequentist, oh, how I love statistics.”

Ramblings aside, today we are going to talk about the analysis of adverse events using Bayesian hierarchical models. It is no surprise that the analysis of adverse events, and the analysis of safety endpoints in general, tends to involve a large number of comparisons between two treatment arms. Further, many adverse events occur infrequently. Finally, typical analysis of adverse events ignores how events may be related to one another.

To address these issues, Berry & Berry (2004) define a three-level hierarchical mixture model that:

  1. Analyzes all events simultaneously. Inference is based on summarizing MCMC samples from the posterior distribution.
  2. Allows related AEs to borrow strength from one another (based upon system organ class membership, say).
  3. Modulates extreme results, possibly due to the rarity of some events.

The AE Bayesian Hierarchical Model Analytical Process (AP) fits the Berry & Berry (2004) model in JMP Clinical 4.1 using PROC MCMC to compare the treatment-emergent adverse event profile of two treatment arms. There are options to specify the number of burn-in samples and posterior samples, the thinning rate and the number of independent Markov chains. Further, users can alter the assumed hyperparameter values suggested in Berry & Berry (2004).

Figure 1. Bayesian Volcano Plot for the Difference in Proportions (Nicardipine - Placebo)

Figure 1 summarizes the adverse event analysis of Nicardipine versus Placebo using a Bayesian version of a volcano plot (Figure 2). In lieu of the –log10(p-value) for the y-axis, we use the posterior exceedence probability, which is initially defined as p[difference in proportions > 0|data] (or odds ratios could be analyzed instead). For each adverse event, this probability is calculated as the number of posterior samples where the difference in proportions (Nicardipine – Placebo) is > 0 divided by the total number of posterior samples (here:  10,000). This can be interpreted as the probability that Nicardipine has excess risk for an event compared to Placebo. The user is free to alter the definition of the posterior exceedence probability and redraw the volcano plot.
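The calculation itself is just a proportion of posterior draws. In the sketch below the draws are simulated from a normal distribution purely for illustration; in practice they would be the MCMC samples for one adverse event, and the function name is my own.

```python
import random

random.seed(1)
# Stand-in for 10,000 posterior draws of the difference in proportions
# (Nicardipine - Placebo) for one adverse event. Simulated, not real output.
posterior_diff = [random.gauss(0.03, 0.02) for _ in range(10_000)]

def exceedance_probability(samples, threshold=0.0):
    """Fraction of posterior samples strictly greater than the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

# Posterior mean: the plain average of the posterior draws.
posterior_mean = sum(posterior_diff) / len(posterior_diff)

print(exceedance_probability(posterior_diff), posterior_mean)
```

Changing `threshold`, or counting samples below it instead, corresponds to redefining the exceedence probability.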

Figure 2. Volcano Plot with FDR Adjustment

The x-axis is similar to the Frequentist volcano plot, except that we use the posterior mean of the difference in proportions (which is just the average of all posterior samples). Bubbles are sized so that the area of the bubble is proportional to the total number of subjects experiencing the event across both treatments combined. Color indicates the system organ class to which events belong. It is straightforward to observe that several vascular, respiratory and psychiatric disorders show increased risk for Nicardipine.

Figure 3 contains Forest Plots that summarize the equal-tailed and HPD posterior credible intervals for all events where p[difference in proportions > 0|data] > 0.80. Using the equal-tailed intervals, we can see there are several events that appear to have higher risk for Nicardipine: Isosthenuria, Phlebitis, Pleural effusion, Hypotension and Agitation. These events also stand out in Figure 2.

However, the Bayesian model highlights additional events: Atelectasis (collapse of part of the lung), Delirium and Disorientation. Atelectasis was experienced by 86 (19.2%) and 63 (13.8%), Delirium by 6 (1.3%) and 0, and Disorientation by 8 (1.8%) and 3 (0.7%) subjects for Nicardipine and Placebo, respectively. The raw and FDR-adjusted p-values were 0.0293 and 0.4100 for Atelectasis, 0.0132 and 0.2876 for Delirium, and 0.1222 and 0.5416 for Disorientation, respectively. Redefining the exceedence probability and examining events where p[difference in proportions < 0|data] > 0.80 shows an increased risk of Hypertension and Vasoconstriction for Placebo (which agrees with the analysis in Figure 2). However, Sinus Bradycardia shows no important difference between treatments. This is likely due to the fact that there are 12 other events in the Cardiac Disorders system organ class with no treatment effects. Ideally, the Bayesian analysis should assess the sensitivity of the results to different prior assumptions, but we have identified some additional adverse events that differ between the treatments and require greater scrutiny.

Figure 3. Forest Plots of Equal-Tailed and HPD Credible Intervals

Alternatively, a model that adjusts for subject-exposure can be used to analyze adverse events in JMP Clinical (Xia, Ma & Carlin, 2010). For subjects with an event, their exposure is calculated as the time from the start of dosing until their first event. The exposure for subjects without an event is computed as their entire time on study.
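The exposure rule can be written as a small helper. This sketch makes several assumptions that are mine, not the paper's or JMP Clinical's: exposure is measured in whole days, "entire time on study" is taken as the time from the start of dosing to the end of the subject's participation, and the function name is hypothetical.

```python
from datetime import date

def exposure_days(dosing_start, study_end, first_event=None):
    """Exposure in days: up to the first event if there is one,
    otherwise the subject's entire time on study."""
    end = first_event if first_event is not None else study_end
    return (end - dosing_start).days

# Subject with an event on day 10, and a subject with no event.
print(exposure_days(date(2013, 1, 1), date(2013, 1, 31), date(2013, 1, 11)))  # 10
print(exposure_days(date(2013, 1, 1), date(2013, 1, 31)))                     # 30
```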

I'll be discussing these methods in further detail at SAS Global Forum. Feel free to review my paper here.

We are gearing up for SAS Global Forum! Are you?

What is SAS® Global Forum, you ask? It is the premier event for SAS professionals worldwide, offering educational and networking opportunities, as the conference has done since the first meeting in 1976. The conference is an annual event planned and sponsored by the SAS Global Users Group, which is open to all SAS software users throughout the world.

This year’s conference will be held in San Francisco’s Moscone Center West from April 28 – May 1, and JMP will be shown in many ways. You can find JMP incorporated into some of the sessions listed below. And some of these presentations will be by our very own JMP development staff.

500-2013 Opinion Mining and Geo-Positioning of Textual Feedback from Professional Drivers
Speakers: Mantosh Kumar Sarkar, Oklahoma State University; Goutam Chakraborty, Oklahoma State University
Location: Room 2004 - Moscone West
Date: Monday, April 29, 5:30 p.m.-5:50 p.m.

008-2013 Tips and Techniques for Moving SAS Data to JMP Graph Builder for iPad
Speaker: Michael Hecht, SAS
Location: Room 2014 - Moscone West
Date: Tuesday, April 30, 8:00 a.m.-8:50 a.m.

434-2013 From Big Data to Big Statistics
Speaker: John Sall, SAS
Location: Room 3016 - Moscone West
Date: Tuesday, April 30, 9:30 a.m.-10:20 a.m.

010-2013 Give the Power of SAS to Excel Users Without Making Them Write SAS Code
Speaker: William Benjamin, Owl Computer Consultancy LLC
Location: Room 2014 - Moscone West
Date: Tuesday, April 30, 9:30 a.m.-9:50 a.m.

331-2013 10-Minute JMP
Speaker: George Hurley, The Hershey Company
Location: Room 2003 - Moscone West
Date: Tuesday, April 30, 3:45 p.m.-3:55 p.m.

105-2013 Be Customer Wise or Otherwise: Combining Data Mining and Interactive Visual Analytics to Analyze Large and Complex Customer Resource Management (CRM) Data
Speaker: Junyao Ji, SAS
Location: Room 2004 - Moscone West
Date: Tuesday, April 30, 5:30 p.m.-5:50 p.m.

179-2013 Assessing Drug Safety with Bayesian Hierarchical Modeling Using PROC MCMC and JMP
Speaker: Richard Zink, SAS
Location: Room 2000 - Moscone West
Date: Wednesday, May 1, 11:00 a.m.-11:50 a.m.

See JMP in action at the SAS Support and Demo Area, located on Level 1 of Moscone West. This space is 64,256 square feet of SAS user delight. From Sunday evening until 5 p.m. on Wednesday when the demo floor closes, you can stop by the JMP booth and visit with JMP developers, ask those burning questions you may have, and talk with other JMP and SAS users about how you are using the software -- and if you’re lucky, you might even be able to find out a little bit about new features that will be available in JMP 11, which is coming out later this year.

We will have cool giveaways, including JMP stickers and window clings (so you can spot your fellow JMP users), and you can join us in support of the International Year of Statistics by sporting an I ♥ Statistics, Statistics 2013, or 2013* pin. We will also be dealing out a new design of playing cards to show our love of statistics in every language.


If you aren’t able to make it to San Francisco at all this year, you can still be a part of the action via SAS Global Forum Livestream! You can watch the opening session, the presentation by SAS Co-Founder and JMP chief architect John Sall -- From Big Data to Big Statistics on Tuesday morning from 9:30 a.m.-10:30 a.m. PST -- and Wednesday morning’s Live Report, all via the Livestream!

That’s just a sampling of what you can expect from JMP at this year’s SAS Global Forum. If you’d like to learn more, visit the SAS Global Forum website. We hope to see you in San Francisco!


French JMP users meet near Marseille

The plan was for French JMP users to meet in the south of France, which is usually a very pleasant place in April. Unfortunately, this year was simply cold and rainy, but that did not keep JMP users from gathering in Marseille.

The users were welcomed by Jean-Francois Christaud, Device Director at ST Microelectronics, who spoke about speed as a key competitive factor. Building on this theme, Florence Kussener of JMP showed users how to quickly build applications in JMP using the Application Builder.

A wide range of presentations from Air Liquide,  Elanco and Lafarge Ciments followed on such topics as data visualisation, design of experiments and SPC. You can probably tell from the group picture that everyone enjoyed the day, but what you probably can’t see is that everyone is wearing their pins celebrating the International Year of Statistics.

Attendees were surprised to discover that users in other industries were solving problems similar to their own, which led to great conversations. This is what the user group meeting is all about, and it worked. After Toulouse in 2010, Lyon in 2011, Paris in 2012 and Marseille this year, the only question that remained unanswered was: where next?


Predictive modelling returns to the UK

We have had such a favourable response to our seminar on Building Better Models that we held our third one on 10 April, with nearly 100 people attending. It's become a global success, having been delivered 20 times throughout the world. The seminar is based on George Box's concept that "all models are wrong, but some are useful," and it investigates how different predictive modelling methods can be used to make the models built by JMP Pro as useful as possible. George Box is one of the great statisticians of the last century and, sadly, passed away earlier this month, which added poignancy to the event.

The seminar was kicked off by Sam Gardner, who gave the audience an introduction to statistical modelling and explained why it is important to avoid overfitting: an overfitted model ends up "predicting" the noise rather than the underlying signal. In JMP Pro, you can hold back a "Validation" sample to help you select the best model; the best model is the one for which the RMSE on this held-back data is at a minimum. JMP Pro also allows you to evaluate how good the model is by holding back a third "Test" sample from the data. It has a range of techniques to assess this, including R² (the proportion of variability in the response that the model explains), AIC, BIC, confusion matrices, and ROC and lift curves.
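The validation idea can be shown in miniature. The numbers below are invented for illustration and have nothing to do with the seminar's case studies; the model names are placeholders, and this is plain Python rather than anything JMP Pro does internally.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical held-back validation responses and each candidate model's predictions.
validation_y = [3.0, 5.0, 7.0, 9.0]
candidates = {
    "simple": [3.5, 5.5, 6.5, 8.5],
    "medium": [3.1, 4.9, 7.2, 8.9],
    "complex": [2.0, 6.5, 5.5, 10.5],  # fits training noise, so it validates poorly
}

scores = {name: rmse(validation_y, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

A separate test sample, never used for fitting or selection, would then give an honest estimate of how well the chosen model predicts.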

Sam gave an introduction to decision trees, and Robert Anderson of the JMP UK team then showed how these, coupled with holding back data to validate the model, can help allocate engineering resources to the right tasks, using a semiconductor process engineering case study. The upshot was a more productive use of engineering effort, saving time and cost, compared with traditional statistical modelling methods.

Sam followed this with a research and development example, in which he built a linear regression model to predict the melting point of pharmaceutical compounds based on molecular structure. He combined this with other techniques, such as exploratory data analysis to identify important compounds, and variable reduction using principal components analysis and clustering.

JMP Pro allows you to apply boosting and bagging techniques to build better decision trees. Robert showed how these could be used to build a more accurate model for predicting customer churn at a telecommunications company, so that customers could be marketed to appropriately. He also showed how you can create an interactive visualisation of the model, the Prediction Profiler in JMP (see image below), so that you can use scenario analysis to communicate with your executives and reach better business decisions.

JMP Pro enables the use of advanced predictive modelling techniques such as cross-validation and multi-layer neural network models.

Neural networks are very flexible models and so are prone to overfitting. Sam showed how to build better neural net models and compare them, using a financial risk case study. He also talked about how neural nets are starting to be used to build "models of models" using legacy data, so that scientists can conduct experiments on those models, saving time and resources in the lab. You can find an example of how Goodyear uses this to design better tyres at our website.

Following on from the popularity of this seminar, we will be offering two webcasts on building better models on 1 and 7 May. You and your colleagues can register via our webcasts page. We will also be holding a one-day, hands-on workshop on 25 June for new users of JMP Pro. Places on this are strictly limited, so if you would like further details, you can contact me via email.

Visualization of life sciences data

Recently, Georges Grinstein, head of the Bioinformatics Program and Co-Director of the Institute for Visualization and Perception Research at the University of Massachusetts Lowell, was in our studios hosting a webcast and promoting his upcoming seminar Exploring Data Visualization in Life Sciences Research. I had a chance to sit down with Dr. Grinstein and ask him about the changing nature of life sciences data, analysis and the new technologies that allow us to visualize new insights into what our data is telling us.

Why do you believe that it’s important to visualize life sciences data and results of analysis?
Without visualization, we depend on numbers, which most often represent summaries or aggregations, statistics, or other computed results. These can hint at structure in the data but are often too precise to show related structures. I often think of these numbers as representative of not just the data but aliases of the data as well.

When should you visualize life sciences data?
It depends on the task, but my gut feeling says almost always.

Although visualization of this kind of data can bring new insights, are there any dangers to relying on visuals?
Yes, there's danger if one jumps to conclusions without validating either analytically or visually the "insights." I think of these insights as hypotheses, and so they must be validated. But visualization really does bring on new insights (and there are many different visualizations).

Data sets are extremely large. Do you need to do anything with your data before you begin creating visual representations of it?
This depends on the task as well. Some algorithms and some visualizations cannot deal with large data, or they take much too much time to generate results. In such cases, data reduction has to take place through such techniques as subsetting, dimensional reduction and sampling, to name a few. And if one wants to interact with a large data set, this can be tricky.

What’s next in visualization formats and trends?
Here are four trends that you should be aware of:
1. Much tighter coupling between analysis and visualization.
2. More interaction with visualizations of larger data.
3. Parallelization of many algorithms and visualizations.
4. Precomputation of large data sets to save time in the later discovery steps.

Dr. Grinstein will be presenting at the Exploring Data Visualization in Life Sciences Research seminars April 24 at the Broad Institute in Cambridge, MA, and on April 25 at the Chauncey Conference Center in Princeton, NJ. Visit our website for more information on – including how to sign up for – the Cambridge or Princeton seminars.

Using JMP to evaluate MCMC diagnostics

It’s no secret that JMP excels in the visual exploration of data. There’s a healthy dose of statistics, too. But when asked about Bayesian methods, JMP is probably not the first software package that comes to mind. JMP 10 does contain Bayesian D-optimal and I-optimal designs in our design of experiments (DOE) features, and Bayesian variance components are available for variability charts. While the Bayesian methods in JMP may be limited, JMP is the perfect tool to evaluate and summarize posterior samples obtained from Markov chain Monte Carlo (MCMC) estimation.

Let me be clear: JMP does not have methods to fit models in the Bayesian paradigm using MCMC, but it is valuable for understanding the posterior samples obtained from other packages such as PROC MCMC, WinBUGS or BRugs. All you need is the freely available JMP add-in for MCMC Diagnostics (free SAS profile required). Below I’ll describe the add-in using posterior samples of 40 adverse event treatment parameters (log-odds ratios) obtained from PROC MCMC using data from a vaccine trial described in Mehrotra & Heyse (2004).

Figure 1. JMP MCMC Diagnostics Dialog

The MCMC Diagnostics dialog (Figure 1) displays all of the variables (COLUMNS) of the input data set. The only requirement to run the add-in is that at least one PARAMETER be specified. If nothing else is specified, it is assumed that all samples are from a single Markov chain, and samples are numbered in trace plots from 1 to the total number of rows in the data set. If ITERATION is provided, trace plots reflect the appropriate sample numbers (say, if burn-in samples were removed). CHAIN specifies a numeric value identifying each chain if multiple Markov chains are generated to assess parameter convergence to the target distribution. COLOR PREFERENCE specifies the color (default Blue/Red) of any credible intervals that exclude the NULL VALUE (default 0). Under the defaults, intervals entirely to the right or left of the null value will be blue or red, respectively. ALPHA (default 0.05) sets the level of the (1-α)×100% credible intervals shown in the forest plots.

Figure 2. MCMC Diagnostics Including Histogram and Density Function of Posterior Samples, Trace Plots, Autocorrelation Assessment and Gelman-Rubin Statistics

The add-in generates the MCMC Viewer Window in Figure 2. The Diagnostics Tab provides histograms, density function curves and summary statistics of the posterior samples from Chain 1 for all parameters. Trace plots summarize the behavior of the Chain 1 samples over the iterations and can be used to assess convergence of the chain to the target distribution. Histograms and summary statistics summarize the autocorrelation of Chain 1 posterior samples up to lag 25. If the analysis includes multiple Markov chains, trace plots summarize all chains simultaneously, and Gelman-Rubin Statistics are provided.
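To make two of these diagnostics concrete, here is a rough Python sketch of a sample autocorrelation function and of the classic Gelman-Rubin potential scale reduction factor. The add-in's exact formulas are not given here, so treat these as the common textbook versions; the function names and the simulated AR(1) chains are assumptions for illustration only.

```python
import random
from statistics import mean, variance

def autocorrelation(samples, max_lag=25):
    """Sample autocorrelation of one chain at lags 1..max_lag."""
    n, m = len(samples), mean(samples)
    c0 = sum((x - m) ** 2 for x in samples) / n
    return [sum((samples[t] - m) * (samples[t + lag] - m)
                for t in range(n - lag)) / (n * c0)
            for lag in range(1, max_lag + 1)]

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: n times variance of chain means
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

def ar1_chain(n, rho, start, rng):
    """Simulated stand-in for MCMC output: an AR(1) series with correlation rho."""
    x, out = start, []
    for _ in range(n):
        x = rho * x + rng.gauss(0, 1)
        out.append(x)
    return out

rng = random.Random(0)
chains = [ar1_chain(2000, 0.5, start, rng) for start in (-5.0, 0.0, 5.0)]

acf = autocorrelation(chains[0])        # lag-1 value should sit near rho = 0.5
rhat_mixed = gelman_rubin(chains)       # close to 1 when the chains agree

# Shift one chain to mimic chains that have not converged to the same distribution.
rhat_stuck = gelman_rubin([chains[0], [x + 3 for x in chains[1]]])
print(round(acf[0], 2), round(rhat_mixed, 3), round(rhat_stuck, 2))
```

Values of the Gelman-Rubin statistic near 1 suggest the chains are sampling the same distribution; values well above 1 suggest they are not.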

Figure 3. 95% Equal-Tailed and HPD Credible Intervals of Posterior Samples

The interactivity of JMP is a key benefit of the add-in. The diagnostic output of all parameters except the first is initially collapsed. This output can be opened or closed by selecting the outline boxes in the Tab. By default, a nonparametric kernel density curve is fit to the posterior samples in the histograms. However, the user can add multiple reference lines from the red triangle menu of the histogram. If needed, a partial autocorrelation or variogram summary figure can be generated from the red triangle menu of any trace plot.

The Forest Plots of Credible Intervals Tab provides two figures of 95% credible intervals for the parameters using samples from Chain 1. Figure 3 summarizes equal-tailed credible intervals of the posterior samples. Here, the lower and upper endpoints for these intervals correspond to the 2.5th and 97.5th percentiles of the samples, respectively. In addition, Figure 3 summarizes the 95% highest posterior density (HPD) credible intervals, which are the narrowest intervals covering 95% of all samples. For both plots, the mean and median sample values are summarized using circles and diamonds, respectively. Only the intervals for T_17, which corresponds to the adverse event of irritability, exclude the assumed null value of 0. If needed, the underlying statistics for these figures are a button-click away.
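If you want to see how the two kinds of interval differ, the following Python sketch computes both from a list of draws. It is an illustration under assumptions, not the add-in's code: the HPD routine uses the simple "narrowest window of sorted samples" approach, which is only appropriate for unimodal posteriors, and the skewed gamma draws are simulated stand-ins for real posterior samples.

```python
import math
import random

def equal_tailed_interval(samples, alpha=0.05):
    """Interval between the alpha/2 and 1 - alpha/2 sample quantiles."""
    s = sorted(samples)

    def quantile(p):
        # Linear interpolation between neighbouring order statistics.
        pos = p * (len(s) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    return quantile(alpha / 2), quantile(1 - alpha / 2)

def hpd_interval(samples, alpha=0.05):
    """Narrowest window of sorted samples holding at least (1 - alpha) of the draws."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil((1 - alpha) * n)
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

rng = random.Random(42)
draws = [rng.gammavariate(2.0, 1.0) for _ in range(10000)]  # right-skewed toy posterior

et = equal_tailed_interval(draws)
hpd = hpd_interval(draws)
print("equal-tailed:", et)
print("HPD:         ", hpd)
```

For a symmetric posterior the two intervals nearly coincide; for a skewed one like this, the HPD interval is narrower and shifted toward the mode.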

Figure 4. Univariate Probability Calculator

The Univariate Posterior Probability Calculator enables the user to define probability statements for the parameters, the results of which are summarized in a table (Figure 4). Ranges can be added manually, or the sliders can be used to select limits, which are restricted to the range between the minimum and maximum values of the samples for all parameters from Chain 1.

Figure 5. Multivariate Probability Calculator

The Multivariate Posterior Probability Calculator lets the user define probability statements that consider two or more parameters simultaneously. Figure 5 illustrates this calculator using only the treatment parameters from the first five adverse events. We calculate the posterior probability that treatment has an undesirable effect (essentially each parameter greater than 0) on asthenia/fatigue, fever, infection-fungal, infection-viral and malaise simultaneously as 0.1756. The multivariate calculator makes use of the JMP Data Filter to select data table rows meeting the criteria defined in the filter. Alternatively, the user can open the data table and select rows manually, or apply a function to the columns of interest. Once rows are selected, the user can push the Calculate Posterior Probability button.
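The calculation behind such a joint statement is simple: it is the fraction of posterior draws (rows) that satisfy every condition at once. Here is a small Python sketch of that idea. The column names T_1 to T_5 and the simulated draws are hypothetical, so the result will not match the 0.1756 obtained from the vaccine trial samples.

```python
import random

def joint_posterior_probability(rows, condition):
    """Fraction of posterior draws (rows) for which `condition(row)` is true."""
    return sum(1 for row in rows if condition(row)) / len(rows)

# Tiny hand-checkable case: only one of the four draws has both parameters above 0.
tiny = [{"a": 1, "b": 1}, {"a": 1, "b": -1}, {"a": -1, "b": 1}, {"a": -1, "b": -1}]
p_tiny = joint_posterior_probability(tiny, lambda r: r["a"] > 0 and r["b"] > 0)

# Simulated stand-in for a table of posterior samples, one row per draw.
rng = random.Random(7)
names = ["T_1", "T_2", "T_3", "T_4", "T_5"]  # hypothetical column names
rows = [{name: rng.gauss(0.2, 1.0) for name in names} for _ in range(5000)]

p_all = joint_posterior_probability(rows, lambda r: all(r[n] > 0 for n in names))
p_one = joint_posterior_probability(rows, lambda r: r["T_1"] > 0)
print(p_tiny, p_all, p_one)
```

Because the joint event is counted row by row, any correlation between parameters in the posterior samples is respected automatically, which is something you cannot get by multiplying univariate probabilities.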

Finally, while we analyzed odds ratio parameters on the log-scale, these could have easily been transformed to odds ratios using the Transform function under the Column menu.
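In code terms, the transformation is just an exponential applied to every draw. The short Python sketch below uses five made-up log-odds ratio values (not trial data) to show that quantile-based summaries carry over directly, while means do not.

```python
import math
from statistics import mean, median

log_odds_draws = [-0.4, 0.1, 0.3, 0.7, 1.2]  # hypothetical draws on the log scale
odds_ratio_draws = [math.exp(v) for v in log_odds_draws]

# exp() is increasing, so the median (and any other quantile) maps straight across.
median_matches = median(odds_ratio_draws) == math.exp(median(log_odds_draws))

# The mean does not: the mean odds ratio exceeds exp(mean log-odds ratio).
mean_gap = mean(odds_ratio_draws) - math.exp(mean(log_odds_draws))

# The null value of 0 on the log scale corresponds to an odds ratio of 1.
null_odds_ratio = math.exp(0)
print(median_matches, round(mean_gap, 3), null_odds_ratio)
```

One consequence is that equal-tailed interval endpoints can simply be exponentiated, whereas HPD intervals should be recomputed on the transformed samples.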

Mehrotra, DV & Heyse, JF. (2004). “Use of the False Discovery Rate for Evaluating Clinical Safety Data.” Statistical Methods in Medical Research 13: 227-238.
