Reliability regression with binary response data (probit analysis) with JMP

Many readers may be familiar with the broad spectrum of reliability platforms and analysis methods for reliability-centric problems available in JMP. The methods an engineer will select – whether to solve a problem, improve a system or gain a deeper understanding of a failure mechanism – depend on many things. Is the system or unit under study repairable or non-repairable? Is the data censored, and if so, is it right-, interval- or left-censored? What if there are no failures? How can historical data on the same or similar component be used to augment understanding?

I’d like to address a data issue specific to the response variable. The Reliability Regression with Binary Response technique can be a useful addition to the tools that reliability engineers or medical researchers use to answer critical business and health-related questions. When the response variable is simply a count of failures, rather than the much more common continuous response, alternate analytical procedures should be used. For example, say you are testing cell phones for damage caused by dropping them onto the floor. You might test 25 phones at each of several heights above the floor, e.g., 5 feet, 8 feet, etc. Then you simply record the number of failures (damaged phones) per sample set. In a health-related field, you may want to test the efficacy of a new drug at differing dosages, or compare different treatment types and record the patient survival counts.

The purpose of this blog post is to help you understand how you can perform regression analysis on reliability and survival data that has counts as the response. This is known as Reliability Regression with Binary Response Data, sometimes referred to as Probit Analysis. The data in Table 1 is a simple example from a class I attended at the University of Michigan a number of years ago. The study is focused on evaluating a new formulation of concrete to determine failure probabilities based on various load levels (stress factor). A failure is defined as a crack of some specified minimum length. Some questions we would like to answer include the following:

  • For a given load, say 4,500 lbs., what percent will fail?
  • What load will cause 10%, 25%, and 50% of the concrete sections to crack?
  • What is the 95% confidence interval that traps the true load where 50% of the concrete sections fail?

Table 1: Concrete Load Study

The data contains three columns. The Load column is the amount of pressure, in pounds, applied to the concrete sections. Trials is the number of sections tested, and Failures is the number of sections that failed as a result of crack development under the applied pressure. We will use JMP’s Fit Model platform to perform the analysis. Depending on the distribution you choose for your analysis, Table 2 below shows the corresponding Link Function and the transformation, if any, to apply to your x variable.

Distribution   Link Function   Transformation on X
SEV            Comploglog      None
Weibull        Comploglog      Log
Normal         Probit          None
Lognormal      Probit          Log
Logistic       Logit           None
Loglogistic    Logit           Log

Table 2: Depending on your distribution, this table will guide you to the appropriate Link and Transformation selections in the Fit Model Dialog.
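
To make Table 2 a little more concrete, here is a minimal Python sketch (not JMP code) of the three inverse link functions and the table itself as a lookup. The names `TABLE_2` and `failure_probability` are my own for illustration, and the intercept and slope are whatever your fitted model reports.

```python
import math
from statistics import NormalDist

# Inverse links: map the linear predictor eta to a failure probability.
def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

def inv_probit(eta):
    return NormalDist().cdf(eta)

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Table 2 as a lookup: distribution -> (inverse link, log-transform x?)
TABLE_2 = {
    "SEV":         (inv_cloglog, False),
    "Weibull":     (inv_cloglog, True),
    "Normal":      (inv_probit,  False),
    "Lognormal":   (inv_probit,  True),
    "Logistic":    (inv_logit,   False),
    "Loglogistic": (inv_logit,   True),
}

def failure_probability(distribution, intercept, slope, x):
    """Predicted failure probability at stress level x (illustrative helper)."""
    inv_link, log_x = TABLE_2[distribution]
    z = math.log(x) if log_x else x
    return inv_link(intercept + slope * z)
```

The only difference between each pair of rows (SEV vs. Weibull, Normal vs. Lognormal, Logistic vs. Loglogistic) is whether x enters the model on the log scale.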

Open the data table and click the JMP Analyze menu, then select Fit Model. Once the dialog window opens, select the Failures and Trials columns and add them to the Y dialog. Add Load as a model effect, then highlight Load in the Construct Model Effects dialog, click the red triangle next to Transform and select Log. Your model effect should now read Log(Load), as seen in the completed Fit Model dialog screen below. Select Generalized Linear Model for Personality, Binomial for Distribution (since we are dealing with counts) and Comp LogLog for the Link Function (since we are using a Weibull fit for this example).

Figure 1: Completed Fit Model Dialog for fitting a Weibull in our example.


Next select Run. You will see the output in Figure 2:

Figure 2: Initial output with Regression Plot and associated output. Note the Log(Load) parameter estimate of 4.51 is the Weibull shape parameter.
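
For readers wondering why the Log(Load) estimate can be read as the Weibull shape parameter, here is a short derivation. The symbols $b_0$, $b_1$, $\beta$ and $\eta$ are notation I am introducing here, not labels from the JMP report: $b_0$ and $b_1$ are the intercept and Log(Load) estimates, $x$ is the load and $p$ is the failure probability.

```latex
\begin{aligned}
\log\bigl(-\log(1 - p)\bigr) &= b_0 + b_1 \log x \\
p &= 1 - \exp\bigl(-e^{b_0} x^{b_1}\bigr) \\
  &= 1 - \exp\!\left[-\left(\frac{x}{\eta}\right)^{\beta}\right],
  \qquad \beta = b_1, \quad \eta = e^{-b_0 / b_1}
\end{aligned}
```

The last line is the Weibull cumulative distribution function, so the slope on Log(Load) plays the role of the shape $\beta$, and the intercept determines the scale $\eta$.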

So now let’s begin to answer the questions we posed at the beginning. To find out what percent of sections fail at a load of 4,500 lbs, go to the red triangle at the top next to the output heading Generalized Linear Model Fit. Select Profilers > Profiler. See Figure 3. Scroll down in the report window and drag the vertical red dashed line to select 4,500 for load, or highlight the load value on the x-axis and type in 4,500. You will see that at a load of 4,500 pounds, we can expect a 45% failure rate. The associated confidence interval may be of interest as well. With this current sample, the failure rate could be as low as 29% or as high as 65%.

Figure 3: Prediction Profiler with a load of 4,500 pounds.
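
If you want to sanity-check the Profiler reading by hand, the following Python sketch reproduces the point estimate. It is not JMP output: it assumes the Log(Load) estimate of 4.51 from Figure 2 and backs out the intercept from the B50 of 4,639 pounds reported in the Inverse Prediction output (Figure 5), so it inherits the rounding in both numbers. It does not attempt the confidence interval, which needs the full covariance of the estimates.

```python
import math

SHAPE = 4.51        # Log(Load) estimate reported in Figure 2
B50_LOAD = 4639.0   # load at 50% failure reported in Figure 5

# Assumption: recover the intercept from the reported B50, using
# log(-log(1 - 0.5)) = intercept + SHAPE * log(B50_LOAD).
INTERCEPT = math.log(-math.log(0.5)) - SHAPE * math.log(B50_LOAD)

def failure_probability(load):
    """Comp LogLog model with Log(Load) as the single effect."""
    eta = INTERCEPT + SHAPE * math.log(load)
    return 1.0 - math.exp(-math.exp(eta))

p_4500 = failure_probability(4500.0)
print(f"Predicted failure fraction at 4,500 lbs: {p_4500:.2f}")
```

The result lands at roughly 0.45, in line with the 45% shown in the Profiler.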


Now, to find out what load will cause 10%, 25% and 50% of the concrete sections to crack, we again go to the red triangle at the top of the report and select Inverse Prediction. You will see the dialog shown in Figure 4. Type in 0.10, 0.25 and 0.50 to obtain results for 10, 25 and 50 percent, respectively.

Figure 4: Dialog for Inverse Prediction

Scroll down in the report where you will find the Inverse Prediction output. See Figure 5. The predicted load value, in pounds of pressure, for the B10 is 3,055, for the B25 is 3,817 and for the B50 is 4,639. A corresponding plot, which includes a visual representation of the confidence intervals, is also provided.

Figure 5: Inverse Prediction output.
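
Inverse prediction for a Weibull fit is just the Weibull quantile function, so the point estimates can be cross-checked with a few lines of Python. As a sketch under stated assumptions: the shape is the 4.51 estimate from Figure 2, and the scale is backed out from the reported B50 rather than taken from the JMP report. `SCALE` and `load_at_fraction` are illustrative names.

```python
import math

SHAPE = 4.51        # Log(Load) estimate reported in Figure 2
B50_LOAD = 4639.0   # reported B50, used here to recover the scale

# Assumption: Weibull scale implied by the reported B50 and shape.
SCALE = B50_LOAD / math.log(2.0) ** (1.0 / SHAPE)

def load_at_fraction(p):
    """Load at which a fraction p of sections is predicted to fail."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return SCALE * (-math.log(1.0 - p)) ** (1.0 / SHAPE)

for p in (0.10, 0.25, 0.50):
    print(f"B{round(p * 100)}: {load_at_fraction(p):,.0f} lbs")
```

The B10 and B25 values come out within a few pounds of the 3,055 and 3,817 in Figure 5, which is a useful check that the shape estimate and the inverse predictions describe the same fitted curve.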

Finally, we would like to find the 95% confidence interval that traps the true load where 50% of the concrete sections fail. Again, refer to the Inverse Prediction output in Figure 5. The interval runs from a lower bound of 3,873 pounds to an upper bound of 5,192 pounds, so we can be 95% confident that the true load at which 50% of the sections fail lies in that range.

JMP has numerous capabilities for reliability analysis, with many dedicated platforms such as Life Distribution, Reliability Growth and Reliability Block Diagram, to name just a few. However, as you can see here, you can also perform other reliability and survival analyses using other JMP analysis platforms.

Combining city & state information on map in Graph Builder - Part 1

Showing a map within Graph Builder in JMP has become a popular way to visualize data. This is partly because you can color the geographic area of interest based on a variable in the data table (Figure 1).


Figure 1

Or you can plot cities as points if you have latitude and longitude information (Figure 2).


Figure 2

But what if you want to combine both?

A customer wanted to do exactly that. This JMP user was trying to show specific cities with states of interest while coloring those states on a particular property that was in a data table. On top of that, the JMP user wanted to be able to hover over the city to display its name and additional city information.

No problem! I’ll show you how. In my example, I'll use city pollution and population data found in data set (found in the Sample Data under Help in JMP), and I'll join it with some state-level crime data (total crime, in this case). I'll use the crime data from data set, which is also found in the Sample Data directory in JMP. The goal here is to show crime rate for each state in a given year and be able to see pollution levels for a given city in that state. The purpose is to explore a potential link between the two without plotting too much information.

The desired graph looks like this (Figure 3):


Figure 3

To create the desired graph, I will need to overlay the cities in their geographic location as points on top of the states, while at the same time making sure that only the states are colored. To make the graph, you would do the following, in order:

  1. Drag Latitude and Longitude to the Y and X areas, respectively.
  2. Drag State to the Map Shape Zone.
  3. Remove Smoother Graph Type by clicking its icon on the top Graph Type Toolbar.
  4. Drag State Total Crime Rate to the Color Zone.
  5. Drag and drop Points Graph Type onto the plot.
  6. Go to the Points section (see Figure 4). Find the Variables subsection, click on the “…” button and uncheck Color and Map Shape (see Figure 4). This option is needed to remove the coloring from the points and to allow them to center on their geographic coordinates instead of being centered on the state.

Figure 4

For presentation purposes, I need to remove the axes (they do not add any information here) and change the color of the gradient representing total crime rates to something that is sequential instead of divergent (so I display the information in a more informative way). Right-clicking on each axis and removing tick marks and label gets rid of most of the axis. Next, I right-click on the center of the graph and then go to Graph>Border to uncheck the left and bottom border. If Latitude and Longitude still appear, I can select the text and delete it. I now have the graph/map depicted in Figure 3, but I am not done yet.

I wanted to be able to hover over each city and see the city name and additional meta-data/information found in the other columns. To make this happen, I:

  1. Select the columns of interest on the data table.
  2. Right-click on one of the column headers and choose Label/Unlabel (see Figure 5).

Figure 5

When I hover the cursor over the city of interest, I get the information I want. I now have the desired output and behavior, as in Figure 3.

Now I can explore each city of interest without having to plot all the information on the same graph!

However, what if I wanted to show more information about the cities on the map? How would I show something like population size for each city and one of the pollution columns in the map without having to hover over each city? Stay tuned – the answer to these questions will come in a follow-up blog post.

Teaching with JMP, part 2

After writing the post on Teaching statistics with JMP last month, I didn’t think about a follow-on post since we had so many wonderful comments. But when we heard from Roger Hoerl at Union College about the thesis his student, Keilah Creedon, wrote (using JMP for the designed experiment part), it seemed a great opportunity to call attention to some good work.

When we hosted Roger and Ronald Snee for a webcast last year, Roger had just transitioned from leading GE Global Research to teaching at Union College. Roger and Ronald are the authors of Statistical Thinking: Improving Business Performance, an excellent book and one we recommend.

Roger is also co-author with Presha Neidermeyer of Use What You Have: Resolving the HIV/AIDS Pandemic. Roger kindly shared a copy of this book, which takes a statistical-thinking approach to the HIV/AIDS pandemic. In his words: “We have a disease that’s preventable and it’s treatable and billions of dollars have been spent on it. It’s the most studied disease in history and yet millions of people are still dying. Why? How can this be? It doesn’t add up.”

Thus, he chose to spend a sabbatical he was awarded studying this pandemic and writing about it. He and his co-author take a long-term look at a complex problem, recognizing that change is constant and that you have to look at the big picture with a goal of incremental improvement over time.

Keilah’s thesis, "Evaluating the Connection Between Gender Based Violence and HIV/AIDS," takes a statistical-thinking approach as well. She focused on one of the goals of the United Nations joint program on HIV/AIDS (UNAIDS) of eliminating gender inequalities, which includes addressing violence — a key risk factor for women with HIV.

She expanded one of UNAIDS' Excel-based models to incorporate the effect of gender-based violence with a sensitivity analysis of the revised model, using a designed experiment approach. The results indicate that gender-based violence is a significant contributor to the HIV/AIDS epidemic and that addressing gender-based violence should be an important goal of the HIV/AIDS response. But Keilah’s statistical thinking didn’t stop there. She went on to point out many ways to address gender-based violence and noted a few programs that seem particularly promising (Stepping Stones and One Man Can).

It has been said that teaching is the most noble profession. Teaching students how to think statistically is a gift that keeps on giving, a philosophy of learning and action that makes the world a better place. Our thanks to all the teachers and mentors who inspire statistical thinking and to the students who are motivated to put this skill to good use.

Identifying re-enrolled subjects in clinical trials, the sequel

This past June, at the Drug Information Association (DIA) annual meeting, I had the opportunity to present and participate in a panel discussion on innovative approaches to ensure quality and compliance in clinical trials. Not surprisingly, a majority of the discussion focused on sponsor responsibilities for building quality into its clinical program, as well as the responsibilities of investigator sites participating in clinical trials. As often happens, there was a lull in questions being asked to the panel, so I took the opportunity to ask a question to the audience: How do we address the issue of patients enrolling multiple times within the same clinical trial, or multiple times within the same clinical program?

Unfortunately, there was no good solution offered to this problem, even from individuals representing regulatory agencies. Patient privacy makes it difficult to identify instances of the “professional patient.” To address individuals who may enroll in multiple clinical trials within the same development program, the best advice was to include exclusion criteria in each protocol that prevent someone from participating a second (or subsequent) time. This might be effective if patients try to enroll within the same site for multiple studies, but it will likely not help if they try to enroll elsewhere.

In a previous post, I discussed ways to identify these subjects by using birth dates and initials to match potential re-enrollers. Demographic and physical characteristics can help highlight the interesting matches. But rules on patient privacy may mean that data on birth dates and initials are not available in the study database. For example, clinical sites in Germany can only provide the birth year for study participants, which makes this data far less useful for identifying potential re-enrollers.

Figure 1. Summary of Between-Patient Distances

Because data on initials or birth date may not be available in the study database, JMP Clinical 5.0 offers a new method to identify potential re-enrollers using data collected at the study site. The Cluster Subjects Across Study Sites analysis calculates the similarity between pairs of patients within subgroups based on gender, race and/or country (user-specified) using pre-treatment data on vital signs, laboratory measurements, physical examinations etc. (Figure 1). Users can further highlight interesting pairs of subjects as those within a few kilograms or centimeters of one another, or those within a small range of age differences (which can vary depending on the duration of the clinical trial). Further, cluster analyses help identify sets of subject IDs for those individuals who may have participated three or more times.
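
To illustrate the general idea of scoring between-patient similarity within subgroups, here is a minimal Python sketch. It is not the algorithm JMP Clinical uses: the standardized Euclidean distance, the column names and the five subject records are all assumptions invented purely for illustration.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical pre-treatment records, made up for this example only.
subjects = [
    {"id": "S1", "sex": "F", "country": "US", "height_cm": 165.0, "weight_kg": 62.0, "sbp": 118.0},
    {"id": "S2", "sex": "F", "country": "US", "height_cm": 172.0, "weight_kg": 80.0, "sbp": 135.0},
    {"id": "S3", "sex": "F", "country": "US", "height_cm": 165.5, "weight_kg": 61.5, "sbp": 119.0},
    {"id": "S4", "sex": "M", "country": "US", "height_cm": 180.0, "weight_kg": 85.0, "sbp": 125.0},
    {"id": "S5", "sex": "M", "country": "US", "height_cm": 168.0, "weight_kg": 70.0, "sbp": 110.0},
]
FEATURES = ("height_cm", "weight_kg", "sbp")
GROUP_KEYS = ("sex", "country")

def closest_pairs(records):
    """Return (distance, id_a, id_b) for every within-subgroup pair, closest first."""
    # Standardize each measurement so that no single unit dominates the distance.
    scale = {}
    for f in FEATURES:
        values = [r[f] for r in records]
        scale[f] = (mean(values), pstdev(values) or 1.0)

    # Only compare subjects who share the same subgroup (e.g. sex and country).
    groups = {}
    for r in records:
        groups.setdefault(tuple(r[k] for k in GROUP_KEYS), []).append(r)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            dist = math.sqrt(sum(
                ((a[f] - b[f]) / scale[f][1]) ** 2 for f in FEATURES
            ))
            pairs.append((dist, a["id"], b["id"]))
    return sorted(pairs)

pairs = closest_pairs(subjects)
```

Pairs at the top of the sorted list are the ones worth a closer look. In this toy data, S1 and S3 sit almost on top of each other, which is the kind of pattern that might flag a possible re-enroller for review.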

While I have presented these analyses in the context of identifying subjects who re-enroll within the same clinical trial, the same approaches can be used to identify patients who have participated in multiple studies within the same clinical program. If these cases are identified early, sponsors can minimize the amount of data collected on these individuals.

If you want to learn more about the issue of identifying re-enrolled subjects in clinical trials, you may want to pick up a copy of my new book Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS, which is being released today (you can buy it now!). Let me know what you think of the book.

JMP smashes new ground at The Shard

The Shard in London was the venue for a seminar on statistical discovery in consumer and marketing research. (Copyright of The View from the Shard. Image used with permission.)

JMP UK broke new ground last week when we held a new seminar on Statistical Discovery in Consumer and Marketing Research. And what better place to hold it than The Shard, the latest building to smash through London’s skyline. It stands out as the highest building in the city, so it is fitting that JMP, which stands out as the best in desktop data analytics, should choose that venue.

The event was attended by delegates from a wide range of industries, from media to marketing, and from finance to pharmaceuticals. We aimed to show how you can use JMP to:

  • Get deep insight into your consumer and market research data
    • Through the unique marriage of advanced analytics with compelling visuals.
  • Get more from your current environment, be it a database, Excel, SAS, SPSS or some other statistical package.
    • JMP is simple to install and easy to use.
  • Build better models to understand what drives your customers’ behaviour.
    • Perform scenario analysis with your clients and executives.
  • Ultimately, make better marketing decisions faster.

Ian Cox, our European Marketing Manager, introduced the seminar. He described how we would use case studies to show how you can separate the signal from the noise in your data easily. He also acknowledged, to nods round the room, that people in marketing tend not to be statistical experts, so having a simple way to access the right method is important.

Ian Cox introduces the seminar with London's Tower Bridge in the background.

The case studies covered a wide range of themes, from visualizing and exploring general and geographic data to building models to understand the drivers behind customers leaving your business. Robert Anderson explained how you can build better models by breaking the data set into three portions -- one to train the model, another to validate that it's the best model to avoid overfitting, and one, which is not used in the model-building process, to test how good the model is. He showed how you can use bootstrap forest combined with this model validation method to build a robust model that you can have confidence in -- without having to understand the statistics behind the method. He demonstrated how you could compare models to get an idea of which would give the best return on your campaign. He also showed how you can profile these models so that you, clients and executives can do “what if” analysis to test different scenarios.
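
For anyone who has not seen a three-way split before, the idea can be sketched in a few lines of Python. This is not what JMP does internally, and the 60/20/20 proportions and the function name are assumptions chosen for illustration (the seminar did not specify proportions).

```python
import random

def three_way_split(rows, train=0.6, validate=0.2, seed=0):
    """Shuffle rows, then cut them into training, validation and test portions."""
    if train <= 0 or validate <= 0 or train + validate >= 1:
        raise ValueError("train and validate must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],                         # used to fit the model
        shuffled[n_train:n_train + n_validate],     # used to choose between models
        shuffled[n_train + n_validate:],            # held back to judge the final model
    )

train_rows, validation_rows, test_rows = three_way_split(range(100))
```

The key point is that the test portion plays no part in fitting or choosing the model, so its error rate is an honest estimate of how the model will perform on new customers.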

JMP Profiler helps you see the drivers behind your analysis and explore different scenarios.

A member of the audience asked how easy it is to deal with outliers and missing values in JMP. Robert explained that we see these and correlations as “messy data” and that JMP has many ways of dealing with it, such as:

• Using the missing data pattern to understand where you have missing values.
• Using Informative Missing in JMP to understand if “missingness” is important.
• Using the Bootstrap Forest in JMP to build models that are robust to outliers.

We showed how rich the modelling techniques in the software are through two examples of other modelling techniques for particular problems:

• Using Partial Least Squares (PLS) to analyse short and wide data sets where regression techniques would not be effective, for example, with sensory panel data.
• Using Uplift Modelling to target your campaign at the people who are going to respond best to it.

The event proved very popular, with twice as many people registering as there were seats available. We are thinking about whether to hold the event again, so if you would be interested in it, let me know by emailing me.

New JMP book on risk-based monitoring & fraud detection in clinical trials

Risk-based monitoring (RBM) is a hot topic in the clinical trials arena. It’s a new way of designing and operating clinical trials. Now, rather than visiting each clinical trial site and reviewing all patient records, you identify just those who present the higher risk of poor data quality, fraud and weak operational efficiency. Last month at the annual Drug Information Association conference, Richard Zink gave two presentations on the topic of RBM and turned up the heat on interest in the subject. Richard, who is the lead developer on RBM for JMP and a prolific blogger on the topic, will release his book, Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS, on July 16.

Recently, we talked with Richard about his book and asked for his insight into what RBM means to clinical trials and the pharmaceutical industry.

Why did you write the book, and how do you hope practitioners will use it?
I thought a book would be a good idea to help current users get the most out of JMP Clinical. Initially, I was thinking about a book on safety topics. However, while I was mowing my lawn one random Saturday, I had an epiphany that many of the features that I was currently developing and had recently developed addressed issues related to data quality in clinical trials. Data quality is an important topic, since the well-being of patients needs to be protected, as well as the integrity of study conclusions. However, a significant proportion of trial costs in the pharmaceutical industry are associated with the review of trial data. The goal is to maintain high-quality data with more efficient review, leveraging statistics and graphics as much as possible.
I hope users from all backgrounds can use the text to gain insights into how they can identify and investigate issues related to data quality and misconduct in clinical trials. Further, the examples illustrate to new JMP users how easy it is to extend analyses beyond the reports that are immediately provided. I never used JMP while I was in pharma, which in retrospect is very unfortunate since it is so easy to explore data using JMP. I hope to convince others in clinical development of how great JMP is – mostly so they don’t make the same mistake I did.

Tell me something surprising you discovered while researching the book.
In pharmaceutical trials, source data verification (SDV) – which is the act of comparing the case report forms (CRFs) used to collect trial data against physician records – has traditionally been performed so that all CRF fields are verified. On the surface, reviewing every data point seems like a good way to capture errors and provide a high-quality database. However, this manual process doesn’t easily allow for comparisons across time, patients and other clinical centers, which are extremely effective methods for identifying unusual data. Furthermore, despite some misconceptions, 100 percent SDV is not a requirement of the FDA. Given the cost of traditional methods, their limitations in assessing various aspects of data quality and the lack of regulatory requirement, I am a bit surprised at how these practices became so widespread. The good news is that the industry and numerous regulatory agencies are working to identify effective and efficient means of providing quality data, all while maintaining sufficient protections for the well-being of trial participants.

What is one of the most important insights from this book?
It’s really important to understand that every team member has a role to play in the review of trial data, and each individual brings a unique skill set important for understanding patient safety, protocol adherence, or data insufficiencies that can affect the final analysis, clinical study report, and subsequent regulatory review and approval.

What do you think is the first step a pharma company/sponsor has to take to begin implementing RBM?
First, companies will need to identify all of the data sources they are interested in using for RBM, and there are many possibilities. For example, the study database can provide information on safety outcomes, disposition and enrollment. Randomization systems provide data on enrollment, screen failures and drug dispensation. Database management systems contain data on queries to resolve data inconsistencies, missing CRF pages, and how responsive investigators are to addressing issues. Statisticians and programmers may write programs to identify protocol deviations and eligibility violations. Monitors may manually collect data on protocol violations, and other issues at the site that wouldn’t otherwise be available electronically. Once these sources are identified, they need to be integrated. The process to refresh this integrated data set needs to be straightforward so that it can be used on a frequent basis.

What value will companies see with RBM, and what pitfalls will they encounter?
There will be cost savings to companies. I also believe there will be improvements to data quality by focusing efforts on the data most important to the trial, utilizing statistics and graphics to identify important signals efficiently.
I think there will be challenges in integrating data as I described above, but I think the biggest issue will be one of ownership of the RBM process. Who is responsible? Data management? Statistics? Clinical operations? IT? All of them have important roles to play throughout the process, but I don’t think it will be the responsibility of any one department. Clearly defining the needs and responsibilities of each team member will be important for RBM to run like clockwork. There will need to be investment in sufficient training to educate people on new processes, and to get them comfortable with new tools for performing their jobs.

How do RBM and this book ultimately help patients in our healthcare system?
Current monitoring practices are estimated to consume up to a third of the cost of a clinical trial budget. The potential savings could be used to make medications more affordable, allow for greater innovation in established treatment areas or pursue development in areas of unmet need. Further, I think the proactive risk-based approach to quality in clinical trials will more effectively protect the individuals participating in clinical trials.

Seeking the Divock Origis of desktop discoveries

This is an awkward time to write a pro-Brussels piece. The soccer player of my house (right) is quite upset, pacing around, trying to get over the heartbreaking, nail-biting World Cup loss that ended with the USA down one goal to Belgium’s two – and all of those goals scored in extra time.

But now it’s time to put that loss aside to focus on an upcoming win. That win would be Discovery Summit Brussels. That’s right. The Discovery Summit you’ve grown to love in the US will have its European debut in March 2015. Provocative plenary talks, in-depth paper presentations, beautiful posters and built-in networking opportunities – this Discovery will have all that and more.

I’m writing this now because the call for papers is open. Will the Kevin De Bruynes of data visualization, the Vincent Kompanys of statistical analysis and the Divock Origis of desktop discoveries please come forward? Take a shot at getting your paper or poster accepted. You’d be among the analytic movers and shakers of Europe and of the wider JMP world.

Here are all the details.

I’m looking forward to meeting the JMP champions at Discovery Summit Brussels!

Jonah Berger on marketing and why things catch on

You will not want to miss the featured keynote for the last day of the JMP Discovery Summit: Jonah Berger, professor of marketing at the Wharton School of the University of Pennsylvania, expert on viral marketing, and author of Contagious: Why Things Catch On. The book is an excellent read for anyone interested in learning why some ideas spread like crazy and others don’t. Jonah took the time to answer a few questions about his book and his research.

In your book, you mention that you forwarded a useful email from a financial services company. Do you think email is still and will continue to be an effective way to market?

Jonah: Email is certainly one effective marketing channel. But just like regular advertisements, people are more likely to trust their friends and colleagues than they are communication from companies. Word of mouth is over 10 times as effective as traditional advertising.

Your book describes some of your own research. What research are you working on now?

Jonah: I'm working on over a dozen projects related to word of mouth and consumer behavior. We just published papers on how sharing things online versus offline changes what people pass on and how talking to one person rather than many changes what people share. Another project we’re working on is how characteristics of content impact whether people read it or not. People often point to views as a valuable metric, but just because someone clicked on something doesn’t mean they actually read it. We’re looking at what leads people to keep reading online content vs. click but not read.

Your research is highly collaborative. Who would you love to work with on a research project, and why?

Jonah: My favorite collaborators are people who have very different skills than I do. It’s great to work with someone who approaches problems differently and lots of insight gets generated in filling in the gaps.

What’s your favorite story from among the ones you shared in your book, and why?

Jonah: The book starts with a great story about a $100 cheesesteak. Most cheesesteaks are $5 or $6 at the local sandwich spot, but this one costs around 20 times as much. It’s made of Kobe beef, lobster and truffles, and comes with a half bottle of champagne, but best of all it’s a great Trojan Horse story for the restaurant that sells it -- a great example of how a small business used the power of word of mouth to grow their brand.

In addition to the $100 cheesesteak, Jonah’s book had several other examples of things that went viral. Two that I had to show my daughter were The Force: Volkswagen Commercial and the Dove evolution video (which she had to watch five times each). Jonah is sure to have more interesting things to say on why things catch on at Discovery Summit. Of his STEPPS principles, you are sure to learn some things that will provide you with social currency, practical value and interesting stories.

Note: This is the third blog post in a series on keynote speakers at Discovery Summit 2014. The first post was about David Hand. The second post was about Michael Schrage.

Sun shines on first Irish JMP User Forum

The sun came out in all its splendour for the first Irish JMP User Forum. As the media reported, Ireland's mini-heatwave rivalled temperatures in Rio. JMP users gathered in one of Ireland's most beautiful stately homes, Carton House, to celebrate the day in style.

The Chair, Rakhi Baj from AbbVie, kicked off the day with a lively welcome, before the users had three excellent presentations of how JMP is used at major companies across Ireland:

-- Stephen Keegan, Abbott's site Statistician, talked about how the number of diabetes test strips they manufacture has exploded ninefold in as many years to more than 4 billion units. This has resulted in a similar growth in the volume of Abbott's data. Abbott has been proactive by migrating its data exploration from Excel to JMP and automating many activities with JMP Scripting Language (JSL). This has enabled the company to increase its understanding of the manufacturing process, recognize trends and identify critical process parameters.

-- David Connolly, Operational Excellence Lead at Celestica, which manufactures ink cartridges, took JSL one step further by showing how you can use it to "facilitate efficiency (or laziness!) and make yourself look grand by scripting." He finds JMP very useful for turning data into information that drives better decision making. He uses JSL to automate routine tasks: to provide his management with simple, clear summaries, to reduce the time taken and risk involved with repetitive tasks, and to make analyses scalable across Celestica.

-- Melvyn Perry, Statistician Manager at Pfizer, showed how he uses the modelling capabilities of JMP to ensure that the problems don't mushroom when one batch of drug substance becomes many batches of drug product in manufacture, by ensuring better process capability. He did this by showing how he uses assay variability to explain some of the variation in drug product.

After a wonderful lunch in the sunshine, the users listened to the top 10 tips of Jeff Perkinson, JMP Customer Care Manager. Jeff also introduced the users to the JMP User Community website.

Rakhi ended the day by soliciting feedback. The response was overwhelmingly positive: Enough users put themselves forward to present to justify a second forum in the not-too-distant future. Several users also offered to hold the next forum at their offices, which would make the event even more user-focused.

Mmm cookies: a tale of discrete numeric variables, disallowed combinations and alias optimality

How do individual ingredients affect the taste ratings of cookies? Design of experiments can help find out. (Photo courtesy of Trish O'Grady)

When I was in graduate school, one of my hobbies was baking cookies for the department. With a basic chocolate chip cookie recipe, it wasn’t uncommon for me to swap the chocolate chips for another ingredient that was on sale that week (I was a grad student, after all). Baking for the department also meant I had enough volunteers rating the cookies to get a reasonable response value for each batch. What if I wanted to find out how each of these ingredients individually affects the taste of the cookie?

For this example, I’m looking at walnuts, raisins, chocolate chips, pecans, coconut, toffee and brownie chunks. I can afford to make 14 batches of cookies, so a simple approach would be to make two batches with each ingredient. However, it’s going to be very difficult to pick up the differences between ingredients unless I have very little variation between batches, which is easier said than done. This sounds like a good opportunity to use design of experiments, and specifically Custom Design in JMP, so that I can use multiple ingredients in a batch! However, here are a couple of things I need to consider:

  1. If I’m using multiple ingredients per batch, it’s very likely there are active two-factor interactions that I’m not interested in estimating.
  2. The structure of the cookie is going to break down with too many added ingredients, so I decide not to bake cookies with more than four ingredients in a batch.

To handle the first issue, I can use an Alias Optimal design, but the second issue is a bit trickier.

Ideally, I would treat each ingredient as a two-level categorical factor, with levels indicating presence or absence. However, restricting each batch to have no more than four ingredients would be difficult. Another idea would be to use continuous variables from 0 to 1, and use a linear constraint that the sum should be less than or equal to four. This yields a design that looks good in terms of alias optimality, but the linear constraint makes it tough for the coordinate-exchange algorithm to find whole numbers for the ingredients, and I end up with a design full of decimal values that I don’t want to deal with. If only there were a way I could use a continuous variable that wasn’t allowed to be a fraction…

Discrete Numeric Variables

Why don’t I treat them as discrete numeric variables? This way I’m still dealing with a continuous variable, but I’m restricting the number of possible values. I open up a new Custom Design and enter my discrete numeric variables, as shown in the figure below.

(Figure: factor setup for the cookie design)

After clicking the “Continue” button, I can select Alias Optimality from the red triangle at the top of my open Custom Design:

(Figure: optimality criterion choice)

I’m almost ready to create the design – I just need to set up my linear constraint... only to realize that I’m not able to use the linear constraint interface with discrete numeric variables. Now what?

Disallowed Combinations

I previously blogged about disallowed combinations in the context of map shapes and space-filling designs. For this example, I want to use the linear constraint that ensures I use no more than four ingredients – it just needs to be switched to a disallowed combination. That is, I want to disallow any run in which the sum of the ingredients is greater than four. Heading back up to the red triangle and selecting “Disallowed Combinations,” I tell the Custom Designer to not use any run where the sum exceeds four:

(Figure: disallowed combinations for the cookie design)
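Outside JMP, the effect of this rule can be sketched as a filter over every presence/absence run. The Python below is only an illustration of the constraint, not what the Custom Designer does internally; the names `INGREDIENTS`, `MAX_PER_BATCH` and `is_disallowed` are hypothetical.

```python
from itertools import product

# Ingredient list taken from the post; the variable names are illustrative.
INGREDIENTS = [
    "walnuts", "raisins", "chocolate chips", "pecans",
    "coconut", "toffee", "brownie chunks",
]
MAX_PER_BATCH = 4  # no more than four ingredients in a batch


def is_disallowed(run):
    """A run is a tuple of 0/1 flags, one per ingredient (1 = present)."""
    return sum(run) > MAX_PER_BATCH


# Every possible presence/absence combination of the seven ingredients.
all_runs = list(product((0, 1), repeat=len(INGREDIENTS)))

# Runs the designer would still be allowed to pick from.
allowed_runs = [run for run in all_runs if not is_disallowed(run)]

print(len(all_runs), len(allowed_runs))
```

Of the 128 possible combinations, 99 satisfy the constraint, so the designer still has plenty of candidate runs to choose from.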

My Alias Optimal Design

I can now set the run size to 14, and click the “Make Design” button. I get a design that looks like this:

(Figure: the 14-run cookie design)

A quick look verifies that each batch of cookies has either three or four ingredients. But now for the moment of truth – how did I do in terms of alias optimality? Looking at the Color Map on Correlations reveals that the main effects are orthogonal to the two-factor interactions – this means that the Alias Matrix is all zeroes except in the row for the intercept.

(Figure: Color Map on Correlations for the cookie design)
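For readers who want the algebra behind that statement, here is a short sketch using the usual definition of the alias matrix. The symbols are my own notation rather than anything shown in JMP: X_1 holds the intercept column and the main-effect columns M (coded -1/+1), X_2 holds the two-factor interaction columns, and n is the number of runs. The sketch assumes each factor is balanced, so the intercept column is orthogonal to M, and that M^T M is invertible.

```latex
% Bias in the main-effects model when two-factor interactions are active
\[
  \mathbb{E}[\hat{\beta}_1] = \beta_1 + A\,\beta_2,
  \qquad
  A = (X_1^{\top} X_1)^{-1} X_1^{\top} X_2 .
\]
% With X_1 = [\mathbf{1} \;\; M], \mathbf{1}^{\top} M = 0 and M^{\top} X_2 = 0:
\[
  X_1^{\top} X_1 =
  \begin{pmatrix} n & 0 \\ 0 & M^{\top} M \end{pmatrix},
  \qquad
  X_1^{\top} X_2 =
  \begin{pmatrix} \mathbf{1}^{\top} X_2 \\ 0 \end{pmatrix}
  \;\Longrightarrow\;
  A =
  \begin{pmatrix} \tfrac{1}{n}\,\mathbf{1}^{\top} X_2 \\ 0 \end{pmatrix}.
\]
```

Only the intercept row of A can be nonzero, so active two-factor interactions may bias the intercept estimate but not the main-effect estimates.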

Final Thoughts

As I mentioned in my previous blog post, using an Alias Optimal design involves a trade-off – there’s a loss of estimation efficiency vs. the D-optimal design. However, the Alias Optimal design gives me worry-free estimation of the main effects even in the presence of two-factor interactions.

To get all the main effects unaliased by any two-factor interaction, you need your design to consist of pairs of runs that are mirror images of each other (that is, each 0 in one row has a 1 in the corresponding column of its paired row and vice versa). This implies that you need an even number of runs in your design. So, it was fortunate that I could afford to do 14 runs!
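The mirror-image argument can be checked numerically. The sketch below builds an illustrative 14-run design, which is not the design JMP produced: seven base runs with three ingredients each (cyclic shifts of one pattern, an assumption made purely for the demonstration), plus their mirror images. It then confirms that every main-effect column is orthogonal to every two-factor interaction column in -1/+1 coding. All function names are hypothetical.

```python
from itertools import combinations

N_FACTORS = 7


def cyclic_base_runs():
    """Seven 0/1 runs, each with three ingredients (shifts of positions 0, 1, 3)."""
    runs = []
    for shift in range(N_FACTORS):
        present = {(p + shift) % N_FACTORS for p in (0, 1, 3)}
        runs.append([1 if j in present else 0 for j in range(N_FACTORS)])
    return runs


def fold_over(runs):
    """Append the mirror image of each run (every 0 becomes 1 and vice versa)."""
    return runs + [[1 - v for v in run] for run in runs]


def coded(run):
    """Recode 0/1 as -1/+1, the usual coding for effect columns."""
    return [2 * v - 1 for v in run]


def main_by_interaction_sums(design):
    """Inner product of each main-effect column with each interaction column."""
    coded_runs = [coded(run) for run in design]
    sums = {}
    for j in range(N_FACTORS):
        for k, l in combinations(range(N_FACTORS), 2):
            sums[(j, k, l)] = sum(r[j] * r[k] * r[l] for r in coded_runs)
    return sums


design = fold_over(cyclic_base_runs())
sums = main_by_interaction_sums(design)

# Each mirrored pair contributes x*y*z + (-x)*(-y)*(-z) = 0 to every sum.
all_orthogonal = all(value == 0 for value in sums.values())
ingredient_counts = sorted({sum(run) for run in design})

print(len(design), all_orthogonal, ingredient_counts)
```

Because a run with three ingredients mirrors to a run with four, the paired design also respects the four-ingredient limit, which matches the three-or-four pattern seen in the JMP design.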

Now, it's time to bake cookies...
