An eggciting designed eggsperiment

What's the best method for getting hard-boiled eggs that are easy to peel and attractive? (Photos by Carroll Co)

A typical scene in my kitchen: I make a batch of hard-boiled eggs with the hope of an easy peel and a beautifully cooked center. But when I sit down to enjoy my egg, I find that, sadly, it’s not so easy to peel – or I have discoloration around the yolk (or worse yet, sometimes both occur).

Here's how I've been preparing my hard-boiled eggs: I start with the eggs in a pot of cold water. Then, I bring the pot to a boil, remove it from the heat and cover the pot for 12 minutes. After a recent disappointing experience with both overcooked and hard-to-peel eggs, I decided to investigate further in a quest to make better hard-boiled eggs.

My Internet search revealed that almost everyone claims to have a foolproof way to make hard-boiled eggs, but a quick browse through comments shows mixed results. Some common themes and questions appear, so it sounded like the perfect opportunity to use a designed experiment to separate fact from folklore.

For a first try at this eggsperiment, my budget for runs was two dozen eggs – same size and brand, purchased two weeks apart. Perhaps a future experiment will use more eggs, but I wanted the peeler (my wife) to be blind to how each egg was prepared, and since I wasn’t going to be doing the peeling myself, 24 eggs seemed to be the limit of what I could ask of her. I also quailed at the thought of eating so many egg salad sandwiches in a short period of time.

While most cooking methods for hard-boiled eggs start with cold water, a recent blog post had me intrigued about putting the eggs directly into boiling water.

So I ultimately decided on the following factors to study:

  1. Cooking method (start in cold water or drop into boiling water)
  2. Age of the egg (purchased two weeks ago or newly purchased)
  3. Cooling method (ice bath or cold tap water)
  4. Pre-cool crack (yes or no)

The pre-cool crack indicates whether I cracked the egg before using the cooling method in 3. If you’re familiar with design of experiments, you may recognize that not all of these factors are equally easy to change. For factors 2-4, I can assign these on an egg-by-egg basis (that is, they’re easy to change). For the cooking method, it is much more convenient if I cook more than one egg at a time. Thus, cooking method is a hard-to-change variable, or whole plot variable in the parlance of split-plot designs.

This means that the estimate of the effect of the cooking method is based on the number of batches I cook rather than the number of eggs. I ultimately decided on six batches of four eggs, or six whole plots. While this gives me only three batches for each cooking method, I hoped that I would get at least some indication whether changing the cooking method mattered. For the easy-to-change factors, I’m more likely to detect the important effects because of the number of eggs I have.

For the cooking method, I cooked one batch at a time in the same pot, using the same amount of water (2 cups) for each batch. For the cold-water start, I heated the pot on medium until the water reached 188 degrees Fahrenheit, then turned off the heat and covered the pot for 10 minutes. For the boiling method, I waited until the water just started boiling, put the eggs in for 11 minutes, and reduced the heat to medium so that the water stayed at a simmer.

The Responses

The responses I measured were peel time, attractiveness of the egg and ease of peel.

My main purpose here was to find out about ease of peeling, but there is still the aspect of whether or not a peeled egg is aesthetically pleasing. The final responses measured were:

  1. Peel time (in seconds)
  2. Attractiveness of the egg (rating from 1 to 5)
  3. Ease of peel (rating from 1 to 5)

While 1 and 3 seem similar, the peel time is likely to be very noisy and may not always pick up on frustration that can arise while peeling, which ease of peel should capture.

The Experiment

Now it’s time to design the experiment. The first step is to enter my responses and factors in the Custom Design platform, which is the first item under the DOE menu. We get something that looks like this:

eggs_factors_responses

Notice that all of the factors are set to “Easy” under the Changes column in the Factors table. To change cook start to be hard-to-change, click on the “Easy” under the Changes column for the cook start factor and select “Hard” from the list that comes up.

eggs_make_HTC

If we click the Continue button at the bottom, it’s time to set up the rest of the design. By default, the model is set to be able to estimate the main effects. With 24 eggs, we should be able to look at two-factor interactions, so I select Interactions -> 2nd to have the Custom Designer ensure the design can estimate all the main effects and two-factor interactions.

eggs_make_interactions

Finally, we need to set up the appropriate run size. Recall that we want six batches of four eggs (24 eggs total). Under the Design Generation tab, this means we set the Number of Whole Plots to 6, and the Number of Runs to 24.

eggs_design_generation

Click the Make Design button, and the experiment is ready to go. The design will look something like this:

egg_final_design
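If you script your designs, the Custom Design window can also save everything above as JSL (via the red triangle menu's Save Script option). The sketch below is only an outline of what that saved script tends to look like for this design: the keyword names (Add Response, Add Factor, the trailing changes flag on each factor, Number of Whole Plots, Set Sample Size) are assumptions from memory and can differ by JMP version, so treat the script saved from your own session as the authoritative form.

    DOE(
        Custom Design,
        {Add Response( Minimize, "Peel Time", ., ., . ),
        Add Response( Maximize, "Attractiveness", ., ., . ),
        Add Response( Maximize, "Ease of Peel", ., ., . ),
        // the trailing argument marks changes: 1 = hard to change (whole plot), 0 = easy
        Add Factor( Categorical, {"Cold Start", "Boiling Start"}, "Cook Start", 1 ),
        Add Factor( Categorical, {"Two Weeks Old", "New"}, "Egg Age", 0 ),
        Add Factor( Categorical, {"Ice Bath", "Tap Water"}, "Cooling Method", 0 ),
        Add Factor( Categorical, {"Yes", "No"}, "Pre-Cool Crack", 0 ),
        Set Random Seed( 20140801 ),     // any seed, included only for reproducibility
        Number of Whole Plots( 6 ),      // six batches of eggs
        Set Sample Size( 24 ),           // 24 eggs total
        // two-factor interaction terms (Interactions > 2nd) omitted here for brevity
        Make Design}
    )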

Any predictions as to the results? I’ll reveal the results next week.

Post a Comment

Scagnostics JMP Add-In: A new way to explore your data

Scagnostics (scatterplot diagnostics) were introduced by John and Paul Tukey and later popularized by Leland Wilkinson in Graph-Theoretic Scagnostics (2005). The measures were refined in High-Dimensional Visual Analytics: Interactive Exploration Guided by Pairwise Views of Point Distributions (2006).

The beauty of scagnostics is the ability to visually explore a data set. JMP has a built-in Scatterplot Matrix (SPLOM) feature that lets you compare the relationships between many pairs of variables at once.

However, SPLOMs lose their effectiveness when the number of variables gets too large. Figure 1 shows a portion of the SPLOM report.

Figure 1. SPLOM for Drosophila Aging Data

Let's explore the Drosophila Aging data (in the JMP Sample Data), which has 48 observations and 100 numeric variables. Notice in Figure 1 the substantial number of variables in this data set. This can be overwhelming, and it hampers our ability to inspect the data visually. In Figure 1, only about 15 percent of the full SPLOM is shown. In a world where data sets are growing every day, we need to be able to extract meaningful information from the relationships between our variables. That’s where scagnostics comes in! Scagnostics assesses five aspects of scatterplots: outliers, shape, trend, density and coherence.

This summer, I wrote a JMP add-in (which you can download from the File Exchange if you have a free SAS profile) that allows you to interactively explore data using nine graph-theoretic measures. The add-in combines three current features of JMP: Distribution, Scatterplot Matrix and Graph Builder. Each point in the scagnostics scatterplot matrix represents one 2D scatterplot of two variables from the original data. When you select a point in the scatterplot matrix in the bottom left, Graph Builder shows the corresponding scatterplot for those two variables in the bottom right.

As an example, one point has already been selected in the SPLOM in Figure 2. The corresponding variables are log2in_Tsp42Ej and log2in_CG6372. For this pair of variables, there are two discernible clusters of data, which shows up as a high Clumpy value.
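If you want to look at a single pair outside the add-in, a couple of lines of JSL will do it. A minimal sketch, assuming the sample table is named Drosophila Aging.jmp in your Sample Data folder (the two column names are the ones mentioned above):

    // Open the sample data and plot one candidate pair to see what a high
    // Clumpy value looks like. This is not the add-in's code, just a quick check.
    dt = Open( "$SAMPLE_DATA/Drosophila Aging.jmp" );
    dt << Graph Builder(
        Variables( X( :log2in_Tsp42Ej ), Y( :log2in_CG6372 ) ),
        Elements( Points( X, Y ) )
    );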

Figure 2. Scagnostics for Drosophila Aging Data – Clumpy Example

Figure 3 below shows that if we select a point with a high Monotonic value, we observe a clear association and a strong linear relationship between the variables log2in_alpha_Cat and log2in_CG3430der.

Figure 3. Scagnostics for Drosophila Aging Data – Monotonic Example

Another key aspect of scagnostics is outlier detection. Review the Graph Builder plot in Figure 4 below. When we inspect the two variables log2in_CG18178 and log2in_BcDNA_GH04120, we see two data points that visually appear to be outliers. A substantial Outlying value, along with a relatively high Skewed value, supports the notion that this pair of variables has major outliers.

Figure 4. Scagnostics for Drosophila Aging Data – Outlying Example

As we compare the original SPLOM report in Figure 1 to the recursive SPLOM and Graph Builder reports in Figures 2, 3 and 4, we uncover much more informative and enlightening analyses.

Now it’s time to download the Scagnostics Add-In and begin your own exploration!

Post a Comment

JMP add-in measures distance between 2 points

JMP has many tools and features that allow you to interactively explore and analyze data. But what if you just want to measure the distance between two points? You could compute the distance with the standard distance formula, but what if the coordinates are latitude and longitude pairs? The distance formula would not be much help there. Thanks to the extensibility of JMP, I was able to develop a new add-in to do one simple task: measure distance. The add-in, called Distance Tool, is an interactive tool that enables you to perform quick and effortless measurements.

In addition to Euclidean distance, the Distance Tool offers several other distance metrics to select from; a quick JSL sketch of the simpler ones appears after this list. The tool can compute:

  1. Euclidean distance
  2. Absolute difference between coordinate components
  3. Taxicab distance
  4. Great-circle distance
  5. Other various distance metrics
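For the first few metrics, the arithmetic is simple enough to do directly in JSL. A minimal sketch for two points, with made-up coordinates:

    // Simpler metrics for two points (x1, y1) and (x2, y2); values are made up.
    x1 = 1; y1 = 2;
    x2 = 4; y2 = 6;
    euclidean = Sqrt( (x2 - x1) ^ 2 + (y2 - y1) ^ 2 );            // straight-line distance: 5
    componentDiff = Eval List( {Abs( x2 - x1 ), Abs( y2 - y1 )} ); // per-coordinate differences: {3, 4}
    taxicab = Abs( x2 - x1 ) + Abs( y2 - y1 );                     // city-block distance: 7
    Show( euclidean, componentDiff, taxicab );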

The tool first finds all graphs that contain objects with measurable distances (Figure 1). The graphs are then assigned a unique key value based on the window title and position. You can then make measurements in the current graph by simply clicking and dragging (Figure 2).

Figure 1: The original graph

 

Figure 2: Euclidean measure between two points

Now what if you have a graph with latitude and longitude as its axes? No problem. The Great-Circle metric allows you to measure geographic distances between latitude/longitude coordinates (Figure 3).
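Under the hood, a great-circle measure is usually the haversine formula. Here is a minimal JSL sketch of that math, assuming coordinates in decimal degrees and a mean Earth radius of 6,371 km; it illustrates the calculation rather than reproducing the add-in's internal code:

    // Haversine great-circle distance in kilometers between two lat/long points.
    greatCircleKm = Function( {lat1, lon1, lat2, lon2},
        {Default Local},
        r = 6371;                 // mean Earth radius in km (assumed constant)
        d2r = Pi() / 180;         // degrees to radians
        dLat = (lat2 - lat1) * d2r;
        dLon = (lon2 - lon1) * d2r;
        a = Sine( dLat / 2 ) ^ 2 +
            Cosine( lat1 * d2r ) * Cosine( lat2 * d2r ) * Sine( dLon / 2 ) ^ 2;
        2 * r * ArcSine( Sqrt( a ) );   // the last expression is returned
    );
    Show( greatCircleKm( 51.5, -0.1, 35.8, -78.8 ) );  // e.g., roughly London to Cary, NC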

Figure 3

With the tool’s custom scale feature, you can even set your own scale for graphs with arbitrary axes.

Figure 4

The tool even allows you to trace out a path or polygon shapes, as in the image of an animal footprint (Figure 4).

All the measurements are recorded in separate data tables to give you the ability to store, analyze and organize the information you want (Figure 5).

Figure 5

The tool’s various options and features make it a powerful add-in for JMP.

You can download the Distance Tool add-in from the JMP File Exchange.

Post a Comment

Tips for learning JMP Scripting Language (JSL)

After using JMP in my AP Statistics course this past year, I realized what remarkable software it is. With just a few clicks, JMP could help me complete my homework!

In addition to being a homework helper, JMP was capable of handling large data sets, executing every type of analysis or test I’d ever heard of, creating beautiful custom graphs, and offering countless other features for the more advanced statistician. I was captivated by the software.

Having (self-proclaimed) proficiency in one computer programming language, C#, and brief exposure to Base SAS, I wanted to add to my arsenal of programming languages during my summer internship at JMP. JMP Scripting Language (JSL) was the perfect choice, incorporating both my love of JMP and my interest in programming. This summer, I used JSL to analyze the popularity of some of our marketing assets.

These are a few things I found useful during my plunge into JSL:

Experimentation with JMP: I took time to explore and play with the functions in JMP. That helped me learn some JSL syntax because many of the JSL functions are similar to or abbreviated versions of their JMP interface equivalents. 

Experience with Programming Languages: My knowledge of C# turned out to be helpful for learning JSL as the two shared several characteristics, including the use of loops and Boolean expressions. However, if you have not had exposure to other programming languages, you can still learn JSL.

Use of a Book: I read and replicated the code in the book Jump into JMP Scripting by Wendy Murphrey and Rosemary Lucas. The book explains the basics of what scripts are; how to obtain, run and edit scripts; and brief JSL statements. It also has frequently asked questions with correctly coded answers. All of this information helped jump-start my learning of JSL. I used the FAQ section as exercises – reading, copying and running the code – which helped me memorize the syntax and its uses.

Scripting Guide PDF in JMP: Although I learned the basics of JSL from a helpful introductory book, I was still not quite ready to start writing scripts freehand. So I sifted through the Scripting Guide PDF book under the Help menu in JMP as my next step. The Scripting Guide provided the details I needed to ease into coding, specifically syntax rules.

Viewing, Editing and Experimenting with Scripts Created Using JMP Commands: Once I had a firmer grasp of JSL and felt ready to try writing scripts, I began by running analyses and viewing the scripts they generated. Editing those scripts to change how my reports looked, adding functions to them and experimenting with new JSL syntax let me learn gradually before writing scripts on my own.
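For example, running Analyze > Distribution on a sample table and saving the script gives a starting point like the one below (the column names here are from Big Class.jmp); it is much easier to edit this than to write the launch from scratch:

    // A saved-script starting point: open a sample table, run Distribution,
    // then edit the script instead of re-clicking through the dialog.
    dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
    dt << Distribution(
        Continuous Distribution( Column( :height ) ),
        Nominal Distribution( Column( :sex ) )
    );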

Capturing Scripts

Sample Data Sets: While I was editing scripts generated by JMP, I used the sample data sets that come with the software. The diverse sample data sets were perfect for trying out different analyses and experimenting with scripts.

JSL Syntax Reference PDF in JMP: When I felt ready to write scripts completely on my own, I used the JSL Syntax Reference PDF under the Help menu in JMP. It’s an excellent resource for learning and searching for JSL functions.

Searching the Help Index: Another helpful resource was the Help Index under the Help menu in JMP. I used it to learn more about the functionalities of JSL while writing scripts.

Help Menu

Another resource, which I didn’t use but is helpful, is the Scripting Index under the Help menu in JMP. It has a dictionary of all JSL syntax and shows example code for each function, which is useful when learning new functions.

After learning JSL, I believe I have a deeper understanding of how JMP works and the ways I can use it. It has been a very enjoyable experience for me, and hopefully it will be for you, too!

Post a Comment

John Sall on less data drudgery

One of the guiding principles for developers of JMP software is to keep the user “in flow.” They try to minimize the disruptions to the discovery process so you can stay focused on solving the problem at hand, rather than having to take multiple steps to overcome a data or analysis obstacle. The goal is to flow like water instead of drudging along.

This year, JMP celebrates 25 years of designing for an ever-smoother user experience. With 25 years of enhancements, we have many examples of capabilities that speed up discovery: one-click bootstrapping, Prediction Profiler, Assess Variable Importance, Fit Y by X, optimal designs, Graph Builder, Recode, Model Comparison — the list goes on and on.

The next version of JMP and JMP Pro, scheduled for release in March 2015, will bring more such clever capabilities to make your analytic journey even smoother. Since JMP 12 will be launched six months after this year’s Discovery Summit, John Sall, Co-Founder and Executive Vice President of SAS and creator of JMP, will devote his keynote speech to a sneak peek at some of the new features. Not to give too much away, but here are a few things to pique your interest about what he might share:

  • Easier import, access and manipulation of data — including big data.
  • Many new data utilities to compress, bin and recode. (The substantial recode enhancements along with the Excel Import Wizard for the Mac are among my most-used new features lately.)
  • New modeling utilities — smart ways to explore and better handle outliers and missing data, new validation options and predictor screening.
  • New and enhanced analysis methods, several of which collapse several steps into one.
  • Easier sharing of analysis results and data movies.
  • Notice anything different about this JMP table of the keynote speakers at Discovery Summit 2014?

Screen Shot 2014-07-25 at 10.18.35 AM

Consider this a teaser list, as there are many more enhancements and capabilities coming in JMP 12 to decrease your data drudgery and augment the ways you can share insights. Hope you'll join us to hear from John Sall at Discovery Summit next month!

Post a Comment

Reliability regression with binary response data (probit analysis) with JMP

Many readers may be familiar with the broad spectrum of reliability platforms and analysis methods for reliability-centric problems available in JMP. The methods an engineer selects – whether to solve a problem, improve a system or gain a deeper understanding of a failure mechanism – depend on many things. These dependencies could include whether the system or unit under study is repairable or non-repairable. Is the data censored, and if so, is it right-, interval- or left-censored? What if there are no failures? How can historical data on the same or similar component be used to augment understanding?

I’d like to address a data issue specific to the response variable. The Reliability Regression with Binary Response technique can be a useful addition to the tools that reliability engineers or medical researchers use to answer critical business and health-related questions. When the response is a count of failures rather than the much more common continuous measurement, alternate analytical procedures should be used. For example, say you are testing cell phones for damage from being dropped onto the floor. You might test 25 phones at each of several heights above the floor, e.g., 5 feet, 8 feet and so on, and simply record the number of failures (damaged phones) per sample set. In a health-related field, you might want to test the efficacy of a new drug at differing dosages, or compare different treatment types, and record the patient survival counts.

The purpose of this blog post is to help you understand how you can perform regression analysis on reliability and survival data that has counts as the response. This is known as Reliability Regression with Binary Response Data, sometimes referred to as Probit Analysis. The data in Table 1 is a simple example from a class I attended at the University of Michigan a number of years ago. The study is focused on evaluating a new formulation of concrete to determine failure probabilities based on various load levels (stress factor). A failure is defined as a crack of some specified minimum length. Some questions we would like to answer include the following:

  • For a given load, say 4,500 lbs., what percent will fail?
  • What load will cause 10%, 25%, and 50% of the concrete sections to crack?
  • What is the 95% confidence interval that traps the true load where 50% of the concrete sections fail?
Table 1: Concrete Load Study

The data contains three columns. The Load column is the amount of pressure, in pounds, applied to the concrete sections. Trials is the number of sections tested, and Failures is the number of sections that failed as a result of crack development under the applied pressure. We will use JMP’s Fit Model platform to perform the analysis. Depending on the distribution you choose to fit, Table 2 below will help you select the correct Link Function and, if required, the appropriate transformation for your x variable.

Distribution   Link Function   Transformation on X
Sev            Comp LogLog     None
Weibull        Comp LogLog     Log
Normal         Probit          None
Lognormal      Probit          Log
Logistic       Logit           None
Loglogistic    Logit           Log

Table 2: Depending on your distribution, this table will guide you to the appropriate Link and Transformation selections in the Fit Model Dialog.
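To see why a Weibull fit corresponds to the Comp LogLog link with a log transformation on load, it helps to write the model out (this is standard GLM algebra rather than additional JMP output). The binomial model with that link is

    $$ \log\bigl(-\log(1 - p)\bigr) = \beta_0 + \beta_1 \ln(\text{load}), $$

which rearranges to

    $$ p = 1 - \exp\!\bigl(-e^{\beta_0}\,\text{load}^{\,\beta_1}\bigr)
         = 1 - \exp\!\left[-\left(\tfrac{\text{load}}{\alpha}\right)^{\beta_1}\right],
       \qquad \alpha = e^{-\beta_0/\beta_1}. $$

This is exactly the Weibull cumulative distribution function, with shape parameter β1 and scale parameter α, which is why the Log(Load) estimate reported below can be read as the Weibull shape parameter.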

Open the data table, click the JMP Analyze menu, then select Fit Model. Once the dialog window opens, select the Failures and Trials columns and add them to the Y role. Add Load as a model effect, then highlight Load in the Construct Model Effects dialog, click the red triangle next to Transform and select Log. Your model effect should now read Log(Load), as in the completed Fit Model dialog shown below. Select Generalized Linear Model for Personality, Binomial for Distribution (since we are dealing with counts of failures out of trials) and Comp LogLog for the Link Function (since we are fitting a Weibull in this example).
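If you prefer to launch the analysis from a script, the completed dialog corresponds roughly to the JSL below. The option names (Personality, GLM Distribution, Link Function) are my best recollection of how Fit Model saves its script, so treat them as assumptions and use Save Script from your own Fit Model window to get the exact form for your version of JMP.

    // Rough JSL equivalent of the completed Fit Model dialog (option names assumed).
    dt = Current Data Table();           // the concrete load study table
    dt << Fit Model(
        Y( :Failures, :Trials ),         // events/trials binomial response
        Effects( Log( :Load ) ),         // log transformation on the load effect
        Personality( "Generalized Linear Model" ),
        GLM Distribution( "Binomial" ),
        Link Function( "Comp LogLog" ),
        Run
    );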

Figure 1: Completed Fit Model Dialog for fitting a Weibull in our example.

 

Next select Run. You will see the output in Figure 2:

Figure 2: Initial output with Regression Plot and associated output. Note the Log(Load) parameter estimate of 4.51 is the Weibull shape parameter.

So now let’s begin to answer the questions we posed at the beginning. To find out what percent of sections fail at a load of 4,500 lbs., go to the red triangle at the top, next to the output heading Generalized Linear Model Fit, and select Profilers > Profiler (see Figure 3). Scroll down in the report window and drag the vertical red dashed line to 4,500 for Load, or highlight the load value on the x-axis and type in 4,500. You will see that at a load of 4,500 pounds, we can expect roughly a 45% failure rate. The associated confidence interval may be of interest as well: with the current sample, the estimate could range from as low as 29% to as high as 65%.

Figure 3: Prediction Profiler with a load of 4,500 pounds.

 

Now, to find out what load will cause 10%, 25% and 50% of the concrete sections to crack, we again go to the red triangle at the top of the report and select Inverse Prediction. You will see the dialog in Figure 4. Type in 0.10, 0.25 and 0.50 to obtain results for 10, 25 and 50 percent, respectively.

Figure 4: Dialog for Inverse Prediction

Scroll down in the report to find the Inverse Prediction output (see Figure 5). The predicted load values, in pounds of pressure, are 3,055 for the B10, 3,817 for the B25 and 4,639 for the B50. A corresponding plot, which includes a visual representation of the confidence intervals, is also provided.

Figure 5: Inverse Prediction output.
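These inverse predictions are simply the fitted link function solved for load (again just algebra on the model above, not extra JMP output). For a target failure fraction p, the predicted load is

    $$ \text{load}_p = \exp\!\left( \frac{\log\bigl(-\log(1-p)\bigr) - \beta_0}{\beta_1} \right), $$

so the B10, B25 and B50 values come from plugging p = 0.10, 0.25 and 0.50 into this expression along with the fitted parameter estimates.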

Finally, we would like to find the 95% confidence interval that traps the true load at which 50% of the concrete sections fail. Again, refer to the Inverse Prediction output in Figure 5. The interval from a lower bound of 3,873 pounds to an upper bound of 5,192 pounds traps, with 95% confidence, the true load at which 50% of the sections fail.

JMP has numerous capabilities for reliability analysis, with many dedicated platforms such as Life Distribution, Reliability Growth and Reliability Block Diagram, to name just a few. But as you can see here, you can also perform reliability and survival analyses using other JMP platforms.

Post a Comment

Combining city & state information on map in Graph Builder - Part 1

Showing a map within Graph Builder in JMP has become a popular way to visualize data. This is partly because you can color the geographic area of interest based on a variable in the data table (Figure 1).

CJK_Blog_07-2014_total_crime_Figure-1

Figure 1

Or you can plot cities as points if you have latitude and longitude information (Figure 2).

CJK_Blog_07-2014_pollution_Figure-2

Figure 2

But what if you want to combine both?

A customer wanted to do exactly that. This JMP user was trying to show specific cities within states of interest while coloring those states by a particular property in the data table. On top of that, the JMP user wanted to be able to hover over a city to display its name and additional city information.

No problem! I’ll show you how. In my example, I'll use city pollution and population data found in the Cities.jmp data set (found in the Sample Data under Help in JMP), and I'll join it with some state-level crime data (total crime, in this case) from the CrimeData.jmp data set, which is also in the Sample Data directory. The goal here is to show the crime rate for each state in a given year and still be able to see pollution levels for a given city in that state. The purpose is to explore a potential link between the two without plotting too much information.
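The join itself can be done through Tables > Join or with a short script. A minimal JSL sketch, assuming both tables are already open and each has a State column to match on (adjust the table and column names to whatever your copies actually use):

    // Join city-level data with state-level crime data on the State column.
    cities = Data Table( "Cities" );      // table names assumed; adjust as needed
    crime = Data Table( "CrimeData" );
    combined = cities << Join(
        With( crime ),
        By Matching Columns( :State = :State )
    );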

The desired graph looks like this (Figure 3):

CJK_Blog_07-2014_combined_Figure-3

Figure 3

To create the desired graph, I will need to overlay the cities in their geographic location as points on top of the states, while at the same time making sure that only the states are colored. To make the graph, you would do the following, in order:

  1. Drag Latitude and Longitude to the Y and X areas, respectively.
  2. Drag State to the Map Shape Zone.
  3. Remove Smoother Graph Type by clicking its icon on the top Graph Type Toolbar.
  4. Drag State Total Crime Rate to the Color Zone.
  5. Drag and drop Points Graph Type onto the plot.
  6. Go to the Points section and find the Variables subsection, click on the “…” button and uncheck Color and Map Shape (see Figure 4). This step removes the coloring from the points and allows them to center on their geographic coordinates instead of being centered on the state.
CJK_Blog_07-2014_step-6_Figure-4

Figure 4

For presentation purposes, I need to remove the axes (they do not add any information here) and change the gradient representing total crime rates to a sequential color scheme instead of a divergent one, so the information is displayed more clearly. Right-clicking on each axis and removing the tick marks and labels gets rid of most of the axis. Next, I right-click on the center of the graph, go to Graph > Border and uncheck the left and bottom borders. If the Latitude and Longitude axis titles still appear, I can select the text and delete it. I now have the graph/map depicted in Figure 3, but I am not done yet.

I wanted to be able to hover over each city and see the city name and additional meta-data/information found in the other columns. To make this happen, I:

  1. Select the columns of interest on the data table.
  2. Right-click on one of the column headers and choose Label/Unlabel (see Figure 5).
CJK_Blog_07-2014_label-table_Figure-5

Figure 5

When I hover the cursor over the city of interest, I get the information I want. I now have the desired output and behavior, as in Figure 3.
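If you would rather script the labeling step, the column Label message does the same thing as the right-click. The message name and the column names below are assumptions (Cities.jmp may spell them differently), so check the Scripting Index entry for Label if anything errors:

    // Mark the columns whose values should appear when hovering over a point.
    dt = Current Data Table();
    Column( dt, "City" ) << Label( 1 );    // column names here are hypothetical
    Column( dt, "POP" ) << Label( 1 );
    Column( dt, "OZONE" ) << Label( 1 );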

Now I can explore each city of interest without having to plot all the information on the same graph!

However, what if I wanted to show more information about the cities on the map? How would I show something like population size for each city and one of the pollution columns in the map without having to hover over each city? Stay tuned – the answer to these questions will come in a follow-up blog post.

Post a Comment

Teaching with JMP, part 2

After writing the post on Teaching statistics with JMP last month, I didn’t think about a follow-on post since we had so many wonderful comments. But when we heard from Roger Hoerl at Union College about the thesis his student, Keilah Creedon, wrote (using JMP for the designed experiment part), it seemed a great opportunity to call attention to some good work.

When we hosted Roger and Ronald Snee for a webcast last year, Roger had just transitioned from leading GE Global Research to teaching at Union College. Roger and Ronald are the authors of Statistical Thinking: Improving Business Performance, an excellent book and one we recommend.

Roger is also co-author with Presha Neidermeyer of Use What You Have: Resolving the HIV/AIDS Pandemic. Roger kindly shared a copy of this book, which takes a statistical-thinking approach to the HIV/AIDS pandemic. In his words: “We have a disease that’s preventable and it’s treatable and billions of dollars have been spent on it. It’s the most studied disease in history and yet millions of people are still dying. Why? How can this be? It doesn’t add up.”

Thus, he chose to spend a sabbatical he was awarded studying this pandemic and writing about it. He and his co-author take a long-term look at a complex problem, recognizing that change is constant and that you have to look at the big picture with a goal of incremental improvement over time.

Keilah’s thesis, "Evaluating the Connection Between Gender Based Violence and HIV/AIDS," takes a statistical-thinking approach as well. She focused on one of the goals of the United Nations joint program on HIV/AIDS (UNAIDS) of eliminating gender inequalities, which includes addressing violence — a key risk factor for women with HIV.

She expanded one of UNAIDS' Excel-based models to incorporate the effect of gender-based violence with a sensitivity analysis of the revised model, using a designed experiment approach. The results indicate that gender-based violence is a significant contributor to the HIV/AIDS epidemic and that addressing gender-based violence should be an important goal of the HIV/AIDS response. But Keilah’s statistical thinking didn’t stop there. She went on to point out many ways to address gender-based violence and noted a few programs that seem particularly promising (Stepping Stones and One Man Can).

It has been said that teaching is the most noble profession. Teaching students how to think statistically is a gift that can keep on giving, a philosophy of learning and action that makes the world a better place. Our thanks to all the teachers and mentors who inspire statistical thinking and to the students who are motivated to put this skill to good use.

Post a Comment

Identifying re-enrolled subjects in clinical trials, the sequel

This past June, at the Drug Information Association (DIA) annual meeting, I had the opportunity to present and participate in a panel discussion on innovative approaches to ensure quality and compliance in clinical trials. Not surprisingly, a majority of the discussion focused on sponsor responsibilities for building quality into its clinical program, as well as the responsibilities of investigator sites participating in clinical trials. As often happens, there was a lull in questions being asked to the panel, so I took the opportunity to ask a question to the audience: How do we address the issue of patients enrolling multiple times within the same clinical trial, or multiple times within the same clinical program?

Unfortunately, there was no good solution offered to this problem, even from individuals representing regulatory agencies. Patient privacy makes it difficult to identify instances of the “professional patient.” To address individuals who may enroll in multiple clinical trials within the same development program, the best advice was to include exclusion criteria in each protocol that prevent someone from participating more than once. This might be effective if patients try to enroll in multiple studies at the same site, but it will likely not help if they try to enroll elsewhere.

In a previous post, I discussed ways to try to identify these subjects by using birth dates and initials to match potential re-enrollers. Demographic and physical characteristics can help highlight the interesting matches. But rules on patient privacy may mean that data on birth dates and initials are not available in the study database. For example, clinical sites in Germany can provide only the birth year for study participants, which makes this data far less useful for identifying potential re-enrollers.

Figure 1. Summary of Between-Patient Distances

Because data on initials or birth date may not be available in the study database, JMP Clinical 5.0 offers a new method to identify potential re-enrollers using data collected at the study site. The Cluster Subjects Across Study Sites analysis calculates the similarity between pairs of patients within subgroups based on gender, race and/or country (user-specified) using pre-treatment data on vital signs, laboratory measurements, physical examinations etc. (Figure 1). Users can further highlight interesting pairs of subjects as those within a few kilograms or centimeters of one another, or those within a small range of age differences (which can vary depending on the duration of the clinical trial). Further, cluster analyses help identify sets of subject IDs for those individuals who may have participated three or more times.

While I have presented these analyses in the context of identifying subjects who re-enroll within the same clinical trial, the same approaches can be used to identify patients who have participated in multiple studies within the same clinical program. If these cases are identified early, sponsors can minimize the amount of data collected on these individuals.

If you want to learn more about the issue of identifying re-enrolled subjects in clinical trials, you may want to pick up a copy of my new book, Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS, which is being released today (you can buy it now!). Let me know what you think of the book.

Post a Comment

JMP smashes new ground at The Shard

The Shard in London was the venue for a seminar on statistical discovery in consumer and marketing research. (Copyright of The View from the Shard. Image used with permission.)

JMP UK broke new ground last week when we held a new seminar on Statistical Discovery in Consumer and Marketing Research. And what better place to hold it than The Shard, the latest building to smash through London’s skyline. It stands out as the tallest building in the city, so it is fitting that JMP, which stands out as the best in desktop data analytics, should choose that venue.

The event was attended by delegates from a wide range of industries, from media to marketing, and from finance to pharmaceuticals. We aimed to show how you can use JMP to:

  • Get deep insight into your consumer and market research data
    • Through the unique marriage of advanced analytics with compelling visuals.
  • Get more from your current environment, be it a database, Excel, SAS, SPSS or some other statistical package.
    • JMP is simple to install and easy to use.
  • Build better models to understand what drives your customers’ behaviour.
    • Perform scenario analysis with your clients and executives.
  • Ultimately, make better marketing decisions faster.

Ian Cox, our European Marketing Manager, introduced the seminar. He described how we would use case studies to show how you can separate the signal from the noise in your data easily. He also acknowledged, to nods round the room, that people in marketing tend not to be statistical experts, so having a simple way to access the right method is important.

Ian Cox introduces the seminar with London's Tower Bridge in the background.

The case studies covered a wide range of themes, from visualizing and exploring general and geographic data to building models to understand the drivers behind customers leaving your business. Robert Anderson explained how you can build better models by breaking the data set into three portions -- one to train the model, another to validate that it's the best model and avoid overfitting, and one, not used in the model-building process, to test how good the model is. He showed how you can use the bootstrap forest combined with this model validation method to build a robust model you can have confidence in -- without having to understand the statistics behind the method. He demonstrated how you could compare models to get an idea of which would give the best return on your campaign. He also showed how you can profile these models so that you, your clients and executives can do “what if” analysis to test different scenarios.
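For readers who want to try that three-way split themselves, here is a minimal JSL sketch that randomly assigns each row to a training, validation or test portion; the 60/20/20 proportions are only an illustrative choice, not a recommendation from the seminar:

    // Randomly assign rows to Training / Validation / Test (60/20/20, illustrative).
    dt = Current Data Table();
    dt << New Column( "Validation", Character );
    For Each Row(
        u = Random Uniform();
        :Validation = If( u < 0.6, "Training", u < 0.8, "Validation", "Test" );
    );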

JMP Profiler helps you see the drivers behind your analysis and explore different scenarios.

A member of the audience asked how easy it is to deal with outliers and missing values in JMP. Robert explained that we see these, along with correlations, as “messy data” and that JMP has many ways of dealing with them, such as:

• Using the missing data pattern to understand where you have missing values.
• Using Informative Missing in JMP to understand if “missingness” is important.
• Using the Bootstrap Forest in JMP to build models that are robust to outliers.

We showed how rich the modelling techniques in the software are through two examples of other modelling techniques for particular problems:

• Using Partial Least Squares (PLS) to analyse short and wide data sets where regression techniques would not be effective, for example, with sensory panel data.
• Using Uplift Modelling to target your campaign at the people who are going to respond best to it.

The event proved very popular, with twice as many people registered as there were seats available. We are thinking about whether to hold the event again, so if you would be interested, please let me know by email.

Post a Comment