Alias Optimal versus D-optimal designs

In my previous blog entry, I discussed the purpose of the Alias Matrix in quantifying the potential bias in estimated effects due to the alias terms. In this blog post, we look at an example that creates a D-optimal design and an Alias Optimal design with the same number of factors and runs.

Consider an experiment where we have six two-level categorical factors denoted by X1 to X6, and a budget of 12 runs. Our interest is in the main effects.

We start with the D-optimal design. Using the Custom Designer with six factors and 12 runs for the main effects, JMP will find a D-optimal design by default. If we take a look at the Color Map On Correlations below, we see that all the main effects are orthogonal to each other but correlated with two-factor interactions (absolute correlation of 0.3333). Since our model contains only main effects, looking at the design diagnostics and estimation efficiencies reveals that we are not going to find a better design for getting precise estimates of the effects in this model.
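That 0.3333 correlation can be checked outside JMP as well. As a sketch only (an assumption on my part, since JMP's D-optimal search may return a different but statistically equivalent design), the 12-run Plackett-Burman design is one orthogonal main-effects design of this size:

```python
import numpy as np

# Standard generator row for the 12-run Plackett-Burman design; the next
# 10 rows are cyclic shifts of it, plus a final row of all -1s.
gen = np.array([1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1])
pb12 = np.vstack([np.roll(gen, i) for i in range(11)] + [-np.ones(11, dtype=int)])

X = pb12[:, :6]                       # columns for the six factors X1..X6
corr = np.corrcoef(X, rowvar=False)   # main effects: all off-diagonals are 0

# Correlation of a main effect with a two-factor interaction, e.g. X1 vs X2*X3
r = np.corrcoef(X[:, 0], X[:, 1] * X[:, 2])[0, 1]
print(round(abs(r), 4))               # → 0.3333
```

The main-effect columns are mutually orthogonal, while each main effect is correlated ±1/3 with the two-factor interactions that do not involve it, matching the color map pattern described above.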

[Figure: Color Map On Correlations for the D-optimal design]

Taking a look at the Alias Matrix, we find the following:

[Figure: Alias Matrix for the D-optimal design]

The entries in the Alias Matrix and the Color Map On Correlations do not, in general, correspond to each other as directly as they do in this example. If you try this on your own, you may also see the positive and negative values in different positions in the Alias Matrix.


Q&A with consumer research expert Rob Reul

Anne Milley prepares to interview Rob Reul of Isometric Solutions Inc. about consumer research for an Analytically Speaking webcast.


Rob Reul, founder and Managing Director of Isometric Solutions, has decades of experience helping businesses understand what customers want. In his Analytically Speaking webcast, Rob talked about using data analysis to focus product development in areas that customers consider most critical. Rob demonstrated some great success stories in choice modeling and customer satisfaction surveys. In this interview, we asked Rob some more detailed questions about how these methods can be used to understand customer needs as related to software quality, feature enhancement, and managing a continuing conversation about user experience.

Melinda: Businesses are always looking to grow, which means attracting new customers, but it also means keeping current customers happy. Can you tell us about the differences between consumer research that’s intended to find new markets and consumer research that’s intended to examine performance for existing customers?

Rob: Research that seeks new markets usually coincides with the search for unmet needs. The common phrase “necessity is the mother of invention” rings true. These unknowns are best isolated by studies of preference. Preference experimentation presents choices that respondents select from based on their interests. These studies often include an economic variable, introducing a financial dimension that expresses choice preferences in terms of a respondent’s willingness to pay. Together, this characterizes a new market venture by coupling economics with the probabilities of preference.

Research that examines performance seeks to increase a company’s competitiveness by evaluating the extent to which existing requirements are met. These are held as expectations. Thus an expectation scale is recommended because it is much more exacting than a satisfaction scale.

Melinda: Quality is an evolving issue for those of us who make software. Software is developed in the equivalent of a laboratory, so there’s a disconnect between how we make the product and how the product is used. What can consumer research teach the software industry about measuring customer happiness?

Rob: This is an interesting slice of the research equation. Looking into “happiness” (although initially overlooked by many) has lately been recast (in software) as the user experience. This new emphasis on the “experience” has been pursued by many with some success, the belief being that the better the user’s experience, the stronger the affinity the user will have toward the software. Stronger affinity would then likely extend to greater levels of overall satisfaction, product loyalty and product referral.

Melinda: Any thoughts on identifying focus areas for product development based on actual customer needs?

Rob: Software product development based on actual needs is known as the study of “use cases.” Here, researchers first seek to understand the very nature of the work task. They study the challenge the software user faces and what he or she seeks to accomplish. With this knowledge, software research focuses ensuing software development on ways to better meet those needs.

Melinda: When developing technical products (like statistical software), what the consumer wants is often only half the picture. What’s desired is often not technically feasible and sometimes does not solve the customer’s true problem. Can you talk about how consumer research can be linked to product research for technical products? How do you build a through-line between what the customer asks for and what their actual needs are? 

Rob: As I touched on earlier, the deconstruction of the “use-case” helps to understand exactly what the software user seeks to accomplish. With that understanding, draw the line between those task needs and specific software functionality. Regarding customers’ lofty desires vs. feasibility, customers will continue to be customers, and those who best meet their true needs (stated or derived) will prevail.

Missed Rob’s Analytically Speaking webcast? View it on demand for an in-depth conversation on consumer and market research.

Update 6/10/14: If you're interested in learning more about consumer research, take a look at the upcoming training on the topic:


What is an Alias Matrix?

When I create a design, the first place I typically look to evaluate the design is the Color Map On Correlations. Hopefully, I see a lot of blue, implying orthogonality between different terms. The Color Map On Correlations contains both the model terms and alias terms. If you have taken a design of experiments/linear regression class, you may be familiar with the idea that correlation among predictors inflates the standard error of the estimates. However, the majority of the familiar design diagnostics relate to the model terms. What if we have missed some terms in our model? Not surprisingly, terms missing from the model can still affect our results. Fortunately, we do have a way to assess this for those terms specified in the list of alias terms.

What effect do missing terms have on the model estimates?

We will dig into the technical details below, but the takeaway message is that active terms not in the model can bias the estimates of terms in the model. If a missing term is specified in the list of alias terms, the Alias Matrix gives us a means of quantifying that bias. The rows of the Alias Matrix correspond to each of the model effects, while the columns represent the different alias terms and how they influence the expected value of the effect estimate for each of those model effects.



Determining chemical concentration with standard addition: An application of linear regression in JMP

One of the most common tasks in chemistry is to determine the concentration of a chemical in an aqueous solution (i.e., the chemical is dissolved in water, with other chemicals possibly in the solution). A common way to accomplish this task is to create a calibration curve by measuring the signals of known quantities of the chemical of interest - often called the analyte - in response to some analytical method (commonly involving absorption spectroscopy, emission spectroscopy or electrochemistry); the calibration curve is then used to interpolate or extrapolate the signal of the solution of interest to obtain the analyte's concentration.

However, what if other components in the solution distort the analyte's signal? This distortion is called a matrix interference or matrix effect, and a solution with a matrix effect would give a different signal compared to a solution containing purely the analyte. Consequently, a calibration curve based on solutions containing only the analyte cannot be used to accurately determine the analyte's concentration.

Overcoming Matrix Interferences with Standard Addition

An effective and commonly used technique to overcome matrix interferences is standard addition. This involves adding known quantities of the analyte (the standard) to the solution of interest and measuring the solution's analytical signals in response to each addition. (Adding the standard to the sample is commonly called "spiking the sample.") Assuming that the analytical signal still changes proportionally to the concentration of the analyte in the presence of matrix effects, a calibration curve can be obtained based on simple linear regression. The analyte's concentration in the solution before any additions of the standard can then be extrapolated from the regression line; I will explain how this extrapolation works later in the post with a plot of the regression line.

Procedurally, here are the steps for preparing the samples for analysis in standard addition:

1) Obtain several samples of the solution containing the analyte in equal volumes.
2) Add increasing and known quantities of the analyte to all but one of the solutions.
3) Dilute the mixture with water so that all solutions have equal volumes.

These three steps are shown in the diagram below. Notice that no standard was added to the first volumetric flask.

The image above was made by Lins4y via Wikimedia, with some slight modifications.

At this point, the five solutions are now ready for analysis by some analytical method. The signals are quantified and plotted against the concentrations of the standards that were added to the solutions, including one sample that had no standard added to it. A simple linear regression curve can then be fitted to the data and used to extrapolate the chemical concentration.

Determining the Concentration of Silver in Photographic Waste: An Illustrative Example in JMP

The following example is from pages 117-120 in "Statistics for Analytical Chemistry" by J.C. Miller and J.N. Miller (2nd edition, 1988). The light-sensitive chemicals on photographic film are silver halides (i.e., ionic compounds made of silver and one of the halogens: fluorine, chlorine, bromine or iodine). Thus, silver is often extracted from photographic waste for commercial reclamation. The concentration of silver in a sample of photographic waste was determined by standard addition with atomic absorption spectroscopy. Here are the data in JMP:

I used the "Fit Y by X" platform and the "Fit Line" option under the red-triangle menu to implement simple linear regression. (You can also do this with the "Fit Special" option; just click "OK" without adjusting any settings.) After adjusting the axes and adding some captions, I get the following plot:

This plot illustrates the key idea behind using this calibration curve. The magnitude of the x-intercept is the concentration of the silver in the original solution. To understand why this is so, consider the absorbance at the following two values:

  • at x = 0, the value of y is the absorbance of the solution with no added standard (i.e., it corresponds to the concentration of silver that we ultimately want).
  • at the x-intercept, there is no absorbance.

Thus, the magnitude of the difference between x=0 and the x-intercept is the concentration of silver that is needed to produce the signal for the original solution of interest! Our job now is to determine the x-intercept.
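Outside JMP, the extrapolation is a one-liner once the line is fit. A sketch with hypothetical, made-up standard-addition data (not the textbook values):

```python
import numpy as np

# Hypothetical data: concentration of added Ag standard (ug/mL) vs. absorbance
added = np.array([0, 5, 10, 15, 20, 25], dtype=float)
absorb = np.array([0.32, 0.41, 0.52, 0.60, 0.70, 0.77])

# Simple linear regression, then extrapolate to zero absorbance
slope, intercept = np.polyfit(added, absorb, 1)
x_intercept = -intercept / slope

# The magnitude of the x-intercept is the estimated concentration
print(f"estimated concentration: {abs(x_intercept):.2f} ug/mL")  # → 17.76 ug/mL
```

The x-intercept itself is negative; its magnitude is the concentration of analyte already present before any standard was added.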

Using a little linear algebra, we can mathematically obtain the x-intercept. However, there is a clever way to find it in JMP using the "Inverse Prediction" function under "Fit Model". (I thank Mark Bailey, another JMP blogger, for his guidance on this trick!)

First, let's run the linear regression again by "Fit Model" under the "Analyze" menu.

fit model - standard addition

Notice how JMP automatically suggests "Standard Least Squares" in the top right of this dialog window.

Here is the output from "Fit Model".

Fit Model Output - standard addition

Now, to get the x-intercept, let's go to the red-triangle menu for "Response Absorbance." Within the "Estimates" sub-menu, choose "Inverse Prediction." This allows us to predict an x-value given a y-value. Since we need the x-intercept, the y-value (absorbance) needs to be zero. I prefer to use a significance level of 1%, so I set my confidence level at 0.99.

inverse prediction (confidence interval) - standard addition

There is an option on the bottom left that says "Confid interval with respect to individual rather than expected response," and you may be wondering what it means. This option allows you to get the prediction interval, which quantifies how certain I am about the x-value (silver concentration) of a new observation at "Absorbance = 0". In contrast, a confidence interval quantifies how certain I am about the mean silver concentration at that particular absorbance. A prediction interval takes into account two sources of variation:

  1. Variation in the estimation of the mean x-value.
  2. Variation in the sampling of a new observation.

A confidence interval takes only the first source of variation into account, so it is narrower than a prediction interval.
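The two variance sources show up directly in the standard-error formulas: the prediction interval carries an extra "1 +" term that the confidence interval does not. A quick check with hypothetical calibration data (again, not the textbook values):

```python
import numpy as np

# Hypothetical calibration data
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([0.32, 0.41, 0.52, 0.60, 0.70, 0.77])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))   # residual standard deviation
Sxx = np.sum((x - x.mean())**2)

x0 = 0.0                                  # point at which we predict
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / Sxx)        # confidence interval
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)    # prediction interval
print(se_pred > se_mean)   # True: the PI adds the new-observation variance
```

The squared standard errors differ by exactly s², the variance of a single new observation, which is why the prediction interval is always the wider of the two.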

Since I am interested in the x-intercept alone and not a new observation at zero absorbance, let's leave that option unchecked and just use a confidence interval.

Here is the output that has been added to the bottom of this results window.

confidence interval of x-intercept - standard addition

The estimate of the x-intercept (in magnitude, the concentration of silver in the original solution at zero absorbance) is 17.2605 µg/mL, and its 99% confidence interval is [14.4811, 20.5585].


Standard addition is a simple yet effective method for determining the concentration of an analyte in the presence of other chemicals that interfere with its analytical signal. Its use of simple linear regression can be easily implemented and visualized in JMP using the "Fit Model" platform, and its "Inverse Prediction" function provides an easy way to not only estimate the analyte's concentration, but also to generate a confidence interval for it.


J.C. Miller and J.N. Miller. Statistics for Analytical Chemistry. 2nd Edition, 1988, Ellis Horwood Limited. Pages 117-120.

G. R. Bruce and P. S. Gill. "Estimates of Precision in a Standard Addition Analysis." Journal of Chemical Education, Volume 76, June 1999.

Eric Cai works as a statistician in the Laboratory Program of the British Columbia Centre for Excellence in HIV/AIDS in Vancouver, British Columbia, Canada. He also shares his passion about statistics, machine learning, chemistry and math via his blog, The Chemical Statistician; his YouTube channel; and his Twitter feed @chemstateric. This is Eric's first post as a guest blogger for JMP.

For more information on how JMP can be used in chemistry and chemical engineering, visit our website.


See optimal settings with JMP Pareto Efficient Frontier

Pareto Efficient Frontier (PEF) is becoming an increasingly popular tool for measuring and selecting project or design parameters that will yield the highest value at the lowest risk. PEF is used widely in many industrial areas, such as selecting the best exploration projects in oil and gas, finding optimal design parameters in consumer product research, and even setting the right pricing of products and services in sales and marketing. This tool is especially useful for anyone involved in project, product or service management, as it allows you to clearly see the points you care about most among all the other points in a graph.

We can easily create a PEF in JMP using the features in Graph Builder and Row Selections. Let's look at some design team data from 500 tested units. The team wanted to find the tested units that would provide the lowest Battery Voltage (V) while operating at the highest Ambient (C) temperature. We would describe these points as the most “dominant”: the highest Ambient (C) values at the lowest Battery (V) settings. This matters to the design team because they need to find the tested units that can operate at the highest temperatures with the lowest strain on the battery.
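The dominance rule behind the PEF is simple to state: a unit is dominant if no other unit beats it on both criteria at once. A brute-force sketch in Python with simulated data (the real design-team table is not reproduced here, and ties are ignored for simplicity):

```python
import numpy as np

# Simulated stand-in for the 500 tested units
rng = np.random.default_rng(1)
ambient = rng.uniform(20, 80, 500)    # Ambient (C): higher is better
battery = rng.uniform(3.0, 4.2, 500)  # Battery (V): lower is better

# A unit is dominant (on the Pareto frontier) if no other unit has
# both strictly higher Ambient (C) and strictly lower Battery (V).
dominant = np.array([
    not np.any((ambient > a) & (battery < b))
    for a, b in zip(ambient, battery)
])
print(dominant.sum(), "dominant points out of", len(ambient))
```

The dominant points form the red ridge that the Select Dominant feature highlights in the graphs below; everything else is dominated by at least one better unit.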

Table 1: Design Team Data Partial Snapshot

Looking at a Distribution of the parameters, we can see the spread of the data for Ambient (C) and Battery (V) against their respective statistics. Most of the tested units seem to be within specifications, so let's go on to the next view to help find the PEF.

View 1 - Distributions: Specifications

The next graph was created in Graph Builder with a scatterplot and smoother line view of the Ambient (C) and Battery (V) points. Now we can use the Row Selection features in the Row menu to help find the PEF points. The Row Selection – Select Dominant option gives us an input box where we can choose which parameters to consider for dominance. Note that we checked the box for Ambient (C) so that the highest points for this parameter are favored, while leaving the box for Battery (V) unchecked so that the lowest points for this parameter are favored.

View 2 - Graph Builder: Overlay

Row Selection: Select Dominant

Row Selection: Select Dominant Input Box

This allows us to highlight the dominant points with high Ambient (C) temperature at low Battery (V). To make it more visual, we colored the dominant points red. Now we can start to see the PEF, as this is the ridge of red points on our graph.

View 3 – Graph Builder: Overlay w/ PEF Points Selected

While these dominant points are still selected, we can also use the Row Selection – Name Selection in Column from the Row menu to create a new column in our data tables that will identify the dominant and non-dominant points with an indicator. In this case, we created a new column called PEF (for Pareto Efficient Frontier) where a “1” indicates a dominant point and a “0” indicates a non-dominant point.

Row Selection: Name Selection in Column

Row Selection: Name Selection in Column Input Box

This will let us use the PEF column as an overlay column in our graph. Combined with a data filter, we can just select the PEF indicator points and clean up our Graph Builder view to just show the dominant point ridge where the tested units performed with the lowest Battery (V) at the highest Ambient (C). We can now easily see where our tested products will operate the most efficiently in our design.

View 4 – Graph Builder: PEF w/ Filter

Note: This post was co-written with Jeff Perkinson, Customer Care Manager.


Using JMP to visualize a solid state drive reconditioning process

This past week, I noticed that my computer had seriously slowed down. My usual tasks seemed to take forever, and even my standard JMP demos were taking quite a bit longer than I was used to. I tried the normal things, such as repairing permissions, checking the memory and looking for corrupt kernel extensions, even going so far as to install a clean version of Mac OS Mavericks to see if that would fix the behavior I was seeing.

Then it dawned on me that it might be something going on with my solid state drive (SSD). Since I still had the original hard drive (a standard spinning drive) that came with my MacBook Pro, I installed that and tried booting from the original drive.

A program I use to see how a computer is performing is Geekbench. It runs a number of processor-, graphics- and disk-intensive tasks and then reports single- and multi-core performance numbers that you can compare against a database of computers specified similarly to yours. Then you can see if you are achieving comparable performance.

Well, as it turns out, the performance of my computer should be around 2,000 for single-core and 10,000 for multi-core. I was getting 700 for single-core and 3,000 for multi-core with the SSD. When I put the original HDD back into my laptop, performance increased to about 2,000 and 8,500, much closer to what it should be.

So obviously something was going on with my drive. Not giving up, I decided to see if I could recondition it. I also thought this was a great opportunity to collect and visualize some data using JMP. I used a program called DiskTester, from the Digiloyd Tools Suite. One of the functions in DiskTester is a recondition SSD function. This writes a large chunk of data to all of the free space of a drive and lets you iterate a number of times. The program reports the chunk offset in MB, average write speed, current write speed as well as a minimum and maximum write speed.

The drive, according to the Digiloyd Tools developer, “responds to this treatment by cleaning up and defragmenting itself.” If this process works on my drive, I should see some pretty bad performance for the first iteration that drastically improves after a few iterations.

So I erased my drive, booted from my other internal drive and started the reconditioning process, letting it run overnight and collecting eight iterations of raw data.

DiskTester gave me the option to copy the raw data to the clipboard, which I did, and then created a .txt file that I will now import into JMP.

I’ll use File > Open to get the .txt file, and then when given the option, I’ll choose Open As: Data (Using Preview), which gives me an option to inspect the data before getting it into JMP.


I’m happy with the way the columns look, and I see by the 123 icon that all my data will be coming in as continuous, which is what I want. In this window, I have the option to give the columns names, which I will do.


And now I have my raw data into JMP, but there is one more step I will need to do before I can visualize the results. You can see I am missing an important column, which is the iteration number. I’ll need this to use as a phase or grouping variable. Fortunately, I can generate this pretty easily in JMP. I'll create a new column and then right-click to get to the column info. When you create a new column, you have the option to initialize data. I'll pick sequence data, and then enter 1 for the From, 8 for the To (as I want data from iteration from 1 to 8) and 1 for the Step.

I know that my last block is 482,176 MB and the program is writing 128 MB chunks, which means each iteration will have 3,767 unique measurements. So I will put 3767 into the Repeat each value N times field.

I can check my work by looking at rows 3,767 and 3,768. Sure enough, 3,767 is labeled iteration 1, and 3,768 is labeled iteration 2. Now I’m ready to go.
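The same iteration column could be built outside JMP too; for instance, NumPy's `repeat` mirrors the "repeat each value N times" sequence option:

```python
import numpy as np

# Iteration numbers 1..8, each repeated 3,767 times (482,176 MB / 128 MB chunks),
# mirroring JMP's "initialize data as sequence" column option
iteration = np.repeat(np.arange(1, 9), 3767)

# Rows 3,767 and 3,768 (1-indexed) straddle the boundary between iterations 1 and 2
print(iteration[3766], iteration[3767])   # → 1 2
```

This reproduces the check above: the 3,767th value is the last of iteration 1, and the 3,768th is the first of iteration 2.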


I’ll use Graph Builder to see what’s going on in the data. Having the Offset in GBs instead of MBs may be a better way to display the labels on the X-axis for all of my plots, so before I go any further, I'm going to transform MB to GB by right-clicking on the Offset MB, clicking on Formula and then taking Offset MB and dividing by 1000. This creates a custom transform column without having to go back to the data table. I'll want to use this later, so I'll rename it Offset GB, right-click and then select: Add to Data Table. The plot below shows the pattern of performance readings across the drive for each iteration. I’ve turned the transparency of the points down to 0.1 so we can see them better and also added a lower spec limit of 110 MB/sec to the graph.


As you can see in the graph, the performance on the first iteration is all over the place. While the average write performance is decent (91 MB/sec), there are many 128 MB chunks that are being written much more slowly to the disk. By iteration 2, however, things are starting to improve drastically. The average write performance has increased to 115 MB/sec. By iteration 5, things are starting to settle in, and at that point, I seem to be seeing asymptotic behavior in write performance.

What remains through all the iterations, however, is a band of blocks that are in the high 40s for write performance. Even by the end of the test, 130 blocks are in the lower performing sector. This is vastly improved from the first iteration, where 2,294 blocks are below the spec limit. If I add a Local Data Filter to Graph Builder, I can focus on just the first and last iteration, and compare performance. While the write performance seems to be greatly variable on iteration 1, by iteration 8, there are three straight bands, indicating consistent performance over all the tested sectors of the drive. The cause of the lower-performing sectors is still a bit of a mystery to me, but I suspect it may be something in the operating system, where the program is being interrupted by some other system task causing a drop in measured performance. (If anyone has a better hypothesis, leave it in the comments).


So the big question is: “Did it work?” Well, I am happy to report that it did. After running the reconditioning procedure and reinstalling the drive in my laptop, my Geekbench score is back to 2,900/9,500, which is what I should expect given my hardware specifications. And the drastic drop in speed that I noticed on my computer is no longer there.


Michael Schrage on experimentation, innovation & communicating analytic results

We are delighted that Michael Schrage, Research Fellow at MIT Sloan School of Management’s Center for Digital Business, will be a keynote speaker at Discovery Summit 2014. I first encountered Michael when he skillfully moderated an Analytics Exchange Panel at the 2009 conference.

Earlier this year, Michael was the featured expert in an Analytically Speaking webcast. We spent extra time together thanks to the wintry weather in North Carolina! That enabled us to have some interesting conversations, and I'm pleased to share more of his perspective on some important topics: experimentation, innovation and communication of analytic results.

Why don't organizations do more (good) experimentation?

Michael: This is a deceptively difficult question. I see many organizations perform "tests" rather than experiments. The web has inspired a whole new generation of A/B tests and testing. You certainly see technical/engineering folks use Taguchi and Box/Fisher "design of experiments" statistical methodologies. But, no, I really don't run across that many organizations or innovation teams culturally committed to "experimentation" as central to how they learn, iteratively design or manage risk. I fear that DOE has turned itself into a "black box" for technical quants rather than a platform for "new value creation" and exploration by entrepreneurs and innovators.

But why? My empirically anecdotal answer would be that most business people think in terms of "ideas" and "plans" and "programs" rather than in terms of "testable hypotheses" and "experiments." Experimentation is for geeks, nerds and quants — not business management and leaders. Designing business experiments is what we delegate, not celebrate or see as strategic. A second reason that I've seen surface when organizations resist the fast, cheap and simple "experiments" option is that experimentation doesn't "solve the problem." It only gives insight. A lot of people in management are much more interested in paying for "three-quarter" solutions — or even half-baked ones — than genuine insights. In other words, they see the products of experiment as more interesting than compelling. Of course, when one sees the digital successes of the Googles, Amazons and Netflixes in using experiments to innovate, as well as optimize, you have to wonder just how much of this resistance reflects generational dysfunction, not simple ignorance.

You've said there is a tension between incremental and disruptive innovation. What are your observations on organizational cultures that foster both kinds of innovation?

Michael: Well, almost by definition, "incremental" is likelier to be easier and less disruptive than "disruptive" innovation. Remember, I'm not a fan of "innovation" for innovation's sake; I believe innovation is means to an end, and we want to make sure we understand and agree about which aspects of that desired "end" are tactical versus strategic. We need to have the courage and honesty to confront the possibility that "disruptive" innovations will be better for our customers, clients and us in the nearer or longer term. We need to be confident that our disruptive innovations will advantage us with our customers while concurrently disadvantaging our competitors. Culturally speaking, companies that are proud to the point of arrogance about their technical skills and competences tend to be understandably reluctant about truly "disruptive" innovation because it disrupts their sense of themselves and what they think they're good at. Organizations more focused on UX, customer service, client relationships and a broader/bigger "mission" seem more culturally comfortable with disruption because it is a means to a greater end rather than just an opportunistic tactic.

You've said sometimes we need to look for approaches versus solutions. Can you relate that to JMP?

Yes, I've largely gotten out of the "solutions business" both in my teaching and advisory work. Almost everyone I work with is pretty smart, so my focus now is less on the transmission of my expertise than on the cultivation of their capabilities. I want my students and clients to be able to embrace, understand, and exploit a new power and capability that lets them find and customize the solutions they want. I am not there to "solve their problems." I'm there to facilitate how they choose to bring both existing and new capabilities to bear on solving problems in the ways that are culturally and economically compatible with their needs, not my expertise and "experience." How does that relate to JMP? That's easy — I've been a part of the JMP community long enough to know that your best customers and users come up with novel and compelling ways to get value from your products. You learn as much from them as they from you. I'm comfortable arguing that mutual/collaborative learning is more about an "approach" than a "solution."

Share with us the importance of communicating analysis results to executives.

That importance can't be overstated, but the heuristic I look forward to offering and discussing is that the purpose of communicating analytical results should not be to make executives feel stupid or ignorant but to make them feel smarter and more engaged. If all your communications do is fairly and accurately convey useful results in an accessible way, you're underperforming.

We hope you will bring your own questions to ask Michael live in September. If you’ve never attended Discovery Summit, perhaps it’s time for you to experiment with a new conference unlike any other.

Note: This is the second blog post in a series on keynote speakers at Discovery Summit 2014. The first post was about David Hand.


Two kinds of dot plots

The name “dot plot” can refer to a variety of completely different graph styles. Well, they have one thing in common: They all contain dots. For analytic use, the two most prominent styles are what we might call the Wilkinson dot plot and the Cleveland dot plot.

The Wilkinson dot plot displays a distribution of continuous data points, like a histogram, but shows individual data points instead of bins.

Wilkinson Dot Plot

Though variations of such plots have been around for more than 100 years, Leland Wilkinson’s seminal paper “Dot Plots” largely standardized the form. Last summer, support for Wilkinson dot plots in JMP was greatly enhanced by an add-in, which is now built into JMP 11.1 (see the Wilkinson dot plot blog post).
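JMP builds these plots for you, but the core idea in Wilkinson's paper is easy to sketch: walk through the sorted values and gather each run of points that falls within one dot-width of the first point in the run into a single stack of dots. Here is a minimal stdlib-only Python sketch of that binning step; the data values are made up for illustration and are not from any JMP sample table.

```python
def dot_plot_stacks(values, dot_width):
    """Group sorted values into Wilkinson-style dot stacks.

    Each stack collects consecutive sorted values that lie within
    `dot_width` of the first value in the stack; a plot would draw
    the stack as a column of dots centered on the stack's mean.
    Returns a list of (center, count) pairs.
    """
    stacks = []
    current = []
    for v in sorted(values):
        # Start a new stack once a value falls outside the dot width
        if current and v - current[0] > dot_width:
            stacks.append((sum(current) / len(current), len(current)))
            current = []
        current.append(v)
    if current:
        stacks.append((sum(current) / len(current), len(current)))
    return stacks

# Illustrative data: yields stacks of 3, 2 and 1 dots
# centered near 1.1, 3.025 and 7.0
stacks = dot_plot_stacks([1.0, 1.1, 1.2, 3.0, 3.05, 7.0], dot_width=0.5)
```

This left-to-right pass is why a Wilkinson dot plot, unlike a histogram, keeps every individual observation visible: the stacks follow the data rather than a fixed bin grid.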

The Cleveland dot plot is featured in William S. Cleveland’s book The Elements of Graphing Data and displays a continuous variable versus a categorical variable.

Cleveland Dot Plot

This kind of dot plot is similar to a bar chart, but instead of using length to encode the data values, it uses position. As a result, the dot plot does not need to start its data axis at zero, can use a log axis and is more flexible for overlaying multiple variables. Cleveland breaks down the estimation aspect of graph perception into three parts: discrimination, ranking and ratioing. In general, dot plots help with the first two at the expense of the third, making relative proportions less accessible. For instance, it’s easier to see that one bar is twice as long as another without consulting the axis than to make the same judgment from two dot positions.

Cleveland’s books, along with Wilkinson’s The Grammar of Graphics, were influential in the creation of Graph Builder, and as a result, the Points element is the default view in Graph Builder for both continuous and categorical data.

Below is a Graph Builder recreation of Cleveland’s display of barley yields. A challenge: Can you spot the odd feature of the data?


The use of dotted lines is presumably a constraint of black and white printing, and it’s more common to see faint gray lines in dot plots. Beyond the usual drag-and-drop of variables into roles, the Graph Builder steps to make the dot plot above are:

  • Add a Value Ordering property for the Variety column (on the Y axis) to match Cleveland's order.
  • Put the Site variable in the Group Wrap role and set the number of columns to be 1.
  • Turn off Show Title for Site.
  • Turn on grid lines for the Y axis.
  • Change the legend position to the bottom.

And now the answer to the challenge: The odd feature of the data is that the 1931 values are generally greater than the 1932 values except for the Morris site, which suggests the values may have been swapped.

For more discussion of Cleveland dot plots, see the article “Dot Plots: A Useful Alternative to Bar Charts” by Naomi Robbins.

Post a Comment

New contingency analysis add-in for JMP

A contingency analysis determines whether there is a relationship between two categorical variables. A caterer, for example, might be interested in knowing whether entrée selections at an event were related to gender, given the following data:
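The test behind such an analysis is Pearson's chi-square test of independence: compare the observed cell counts with the counts you would expect if the two variables were unrelated. A stdlib-only Python sketch follows; the counts are hypothetical, not the caterer's actual data.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic for a two-way table of counts.

    Expected count for cell (i, j) = row_total_i * col_total_j / grand_total.
    The statistic sums (observed - expected)^2 / expected over all cells.
    Returns (statistic, degrees of freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = gender, columns = entree (beef, chicken, fish)
observed = [[20, 10, 10],
            [10, 20, 10]]
stat, df = chi_square_independence(observed)
print(round(stat, 3), df)  # → 6.667 2
```

Comparing the statistic against a chi-square distribution with `df` degrees of freedom (which JMP reports for you) gives the p-value for the hypothesis of no relationship.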


The contingency platform in JMP requires the X and Y variables to be contained in two columns, with the cell counts in a third:


With the “Contingency (Table Format)” add-in, developed especially for students and others who are new to JMP, you can launch a contingency analysis with data that is arranged in “crosstab” format, without having to stack the data first:
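The reshaping the add-in saves you from doing by hand is a standard crosstab-to-stacked transformation: one (X, Y, count) record per cell. A minimal Python sketch of that step, with hypothetical row and column labels:

```python
def stack_crosstab(row_labels, col_labels, counts):
    """Convert a crosstab (rows x columns of cell counts) into
    stacked (X, Y, count) records, one per cell, which is the
    layout the Contingency platform expects."""
    return [(r, c, counts[i][j])
            for i, r in enumerate(row_labels)
            for j, c in enumerate(col_labels)]

# Hypothetical crosstab: gender by entree
rows = ["Female", "Male"]
cols = ["Beef", "Chicken", "Fish"]
counts = [[20, 10, 10],
          [10, 20, 10]]
stacked = stack_crosstab(rows, cols, counts)
# First record: ("Female", "Beef", 20); six records in total
```

In JMP terms, the stacked records become the two categorical columns plus a count column that you would assign the Freq role.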


Simply specify the variables as shown here:


The add-in creates the contingency report and includes buttons that let you easily swap the rows and columns of the report if desired:


If the “Show New Table” option is selected, the add-in also generates a stacked table containing a script that will launch the platform in the standard manner:


This add-in, along with many others, is available on the JMP File Exchange. (A free SAS profile is needed for access.)

Post a Comment

Coming in July: Book on centralized monitoring of clinical trials

In the spirit of shameless self-promotion (er, full disclosure) and with the goal of collecting huge royalty checks (that is, promoting the efficient review of clinical trials), I’d like to make everyone aware of the forthcoming SAS Press title Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP and SAS.

Clinical trials are expensive. If study costs continue to rise at the current pace, clinical trials to establish efficacy and tolerability will become impossible to conduct, with these potential consequences: making drugs unavailable for areas of unmet need, stifling innovation in established treatment areas, or placing an extreme price burden on consumers and health care systems.

People have suggested many innovations and ways to streamline the development process and improve the likelihood of clinical and regulatory success. For example, adaptive design methodologies allow you to stop a clinical trial early if there is overwhelming efficacy or excess toxicity, or when the novel compound has little chance to distinguish itself from control. Extensive modeling and simulation exercises can suggest the most successful path forward in a clinical program based on the available data and reasonable assumptions based on past development. Patient enrichment based on genomic markers can help select a study population more likely to receive benefit from the drug, resulting in smaller clinical trials.

Other innovations have more to do with the operational aspects of clinical trials. These include electronic case report forms (eCRFs), new technologies for collecting diary data or obtaining laboratory samples, or new software that enables the efficient review of data for quality and safety purposes. And still other innovations involve the regulatory submission and review process through electronic submissions and data standards.

Despite these many advances and innovations, costs continue to rise in many instances. One area for obvious improvement involves how sponsors review trial data. Traditional interpretation of international guidance documents has led to extensive on-site monitoring, including 100 percent source data verification. The cost of these activities is estimated at up to a third of the entire study! This substantial expense has led the industry and numerous regulatory authorities to question the value of traditional approaches.

This book shows how you can use the combination of statistics, graphics and data standards to take a proactive approach to data quality. Numerous examples illustrate the various techniques available within JMP Clinical. Further, I show how you can use JMP add-ins to extend and customize the present capabilities. It will be available in July as a black-and-white print book or full-color e-book. It makes a great gift for everyone on the trial team.

Post a Comment