## How to create an axis break in JMP

I’ve been asked three times this year about how to make a graph in JMP with an axis break. Before I show how, I want to ask “Why?” The obvious answer to “Why?” is “to show items with very different values in one graph,” but that’s a little unsatisfying. I want to know why they need to be in one graph. The advantage of a graphical representation of data over a text representation is that we can judge values based on graphical properties like position, length and slope. However, once we break the scale, those properties are no longer as comparable. We effectively have two separate graphs after all – which is actually how we can make such views in JMP.

Related to my “Why?” inquiry, I’ve had a difficult time finding a compelling real-world example to illustrate an axis break, so I made some hypothetical data. Say we have timing values for a series of 100 runs of some process. Usually, the process takes a few seconds per run. But sometimes there’s a glitch, and it takes several minutes. Here’s the data on one graph (all on one y scale).

We can see where the glitches are, but we can’t see any of the variation in the normal non-glitch runs. Some would also object to the “wasted” space in the middle of the graph. However, those aren’t necessarily bad attributes. The non-glitch variation is lost because it’s insignificant compared to the glitch times, and the space works to show the difference. Nonetheless, if our audience already understands those features of the data, we can break the graph in two to show both subsets on more natural scales.

Now we can see that the non-glitch times are increasing on some curve. The “trick” in Graph Builder is to add the variable to be split to the graph twice in two different axis slots. Then we can adjust the axes independently, perhaps even making one of them a log axis. The Graph Spacing menu command adds the spacer between the graphs to emphasize the break. It’s easier to show than explain, so here’s an animated GIF of those steps.
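Outside of JMP, the same idea comes down to giving each subset its own axis limits. Here is a minimal Python sketch of that arithmetic, not of Graph Builder itself. The function name, the 60-second threshold, the padding and the simulated timings are all assumptions made up for illustration.

```python
import random


def split_for_axis_break(values, threshold, pad=0.05):
    """Split values at `threshold` and return padded (low, high) limits per panel."""
    def limits(subset):
        lo, hi = min(subset), max(subset)
        margin = (hi - lo) * pad or 1.0  # fall back to 1.0 if the subset has no spread
        return (lo - margin, hi + margin)

    lower = [v for v in values if v < threshold]
    upper = [v for v in values if v >= threshold]
    if not lower or not upper:
        raise ValueError("threshold must leave data on both sides of the break")
    return limits(lower), limits(upper)


# Hypothetical timings in seconds: mostly a few seconds, with three glitches of several minutes.
random.seed(1)
timings = [random.uniform(2, 6) for _ in range(100)]
for i in (17, 58, 83):
    timings[i] = random.uniform(180, 300)

lower_limits, upper_limits = split_for_axis_break(timings, threshold=60)
print("non-glitch panel limits:", lower_limits)
print("glitch panel limits:", upper_limits)
```

Each pair of limits would drive one of the two panels, which is why the two halves really behave like separate graphs.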

I skimmed a few journals looking for examples of broken axes. Here’s an example of a pattern I saw a few times for drug treatment studies where the short-term and long-term responses are both interesting. This graph is from Annals of Internal Medicine and shows two different groups’ responses to an HIV treatment.

Each side of the axis break uses different units of time, which fits perfectly with the idea that there are really two separate axes. One thing that bothers me about this graph, though, is the connection of the lines across the gap. Notice the difference in my JMP version:

With different x scales, the slopes should be different. That is, the change per week (slope on the left) should be flatter than the change per year (slope on the right) for the same transition. Fortunately, Graph Builder takes care of this for you, but it’s something to be aware of when you’re reading these kinds of graphs in the wild.

The broken line from the HIV study is an example of how an axis break can distort the information encoded by the graphic element. A more serious distortion occurs when bar charts are split by a scale break since the bars can no longer do their job of representing values with length. I’m not even going to show a picture of that. Never use a scale break with a bar chart.

When making graphs with scale breaks, make sure each part works on its own, because perceptually they really are separate graphs.

## Father's Day fun with toy cars and DOE

My father and I have collected and customized diecast cars for the past 15 years. But lately, dyeing the cars has become a challenge. (Photos courtesy of Caroll Co)

With Father’s Day fast approaching, it seemed fitting that I should share a story about a father and son bonding over design of experiments (DOE) and toy cars. Full disclosure: Some (including their wives) think both the father and son in this tale are too old to be playing with toy cars.

My father and I began collecting diecast vehicles 15 years ago. To this day, whenever we go to a store that sells toy cars, it’s our first stop, regardless of what we’re shopping for. And both of us have bedrooms in our homes that have officially become toy rooms.

Back when we started collecting cars, my father and I would often customize vehicles using this simple method: We would get a vehicle that came painted white from the factory. We would prepare a popular fabric dye according to its package directions and leave the white car in the dye for 15 minutes. The car would come out looking great. But lately, when my father has tried dyeing some cars using this method, the results have been disappointing.

My parents were recently visiting us in North Carolina from their home in Canada. So my father and I figured, what better way to spend some time together than designing an experiment to see if we can find a new recipe?

The factors

Our initial thought was that it might be a matter of adjusting how much dye to use. However, there are other possibilities to consider at the same time. It may also depend on the color of dye or length of time in the liquid. The dye is now also available in both solid and liquid forms. We had used the solid form in the past, and my father had used liquid in his recent attempts. Some online searches also suggested that we should add vinegar. Of course, there’s also an issue with what car(s) to use. I was hoping to use four different castings, but it turns out that it requires some extra care, as I’ll discuss shortly.

To summarize, our factor list was as follows:

• Car: A/B/C/D
• Dye type: Solid/liquid
• Dye amount: low/high (2 Tbsp liquid/4 Tbsp liquid per half cup, or 1 tsp dry/2 tsp dry per half cup)
• Length of time: 15 mins/30 mins
• Dye color: red/blue/yellow
• Vinegar: yes/no

The need for covariates

As many collectors know, it’s not easy finding a particular vehicle in multiples when searching at stores. Not only are there few vehicles in white at any given time, it’s also extremely unlikely to find them in equal quantities. After enough searching, I was fortunate enough to find four different castings, with five of vehicle A and four each of B, C and D. These quantities are close enough to balanced that I could probably find a 17-run design and relabel as appropriate, but I would prefer this to be taken care of during the design phase.

Fortunately, this is easy to accomplish by entering the cars as a covariate. Sometimes we use covariates as a means of choosing optimal subsets, but they are also useful when you have an underlying set of design runs that you want the design to obey. All I needed to do in this case was to create a data table with a column for “car” and fill out 17 rows with five A’s and four each of B, C and D. When I go to DOE > Custom Design, under Add Factors there’s an option for Covariate.

This lets me select the column “car” and add it as a factor.
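To make the bookkeeping concrete, here is a small Python sketch (not JMP) that builds the 17-row “car” column and counts the distinct settings of the other five factors that a design could draw from. The variable names are made up for illustration, and the sketch says nothing about how Custom Design actually picks the runs.

```python
from itertools import product

# Cars on hand, as described above: five of A, four each of B, C and D.
cars_on_hand = {"A": 5, "B": 4, "C": 4, "D": 4}
car_column = [car for car, count in cars_on_hand.items() for _ in range(count)]

# Remaining factors and their levels, taken from the factor list.
other_factors = {
    "dye_type": ["solid", "liquid"],
    "dye_amount": ["low", "high"],
    "time_min": [15, 30],
    "dye_color": ["red", "blue", "yellow"],
    "vinegar": ["yes", "no"],
}
candidate_settings = list(product(*other_factors.values()))

print(len(car_column))          # runs fixed by the covariate column
print(len(candidate_settings))  # distinct settings of the other five factors
```

The covariate fixes which car goes in each of the 17 rows, and the design only has to choose the remaining factor settings for each row.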

The design

After I load in the covariate, all I need to do is add the rest of the factors:

I keep the number of runs at 17, and when I create the design, it’s accommodating the numbers of each car that I have on hand.

Next time, I’ll talk about the analysis (notice I haven’t even made mention of a response yet). But in the meantime, happy Father’s Day to all the fathers reading this!

## JMP Student Edition 12: Free with leading intro stats textbooks

Introductory Statistics is notorious for being one of the least popular courses required for graduation. Fortunately, modern approaches to teaching statistics are changing the perceptions and popularity of statistics for the better. These approaches are largely driven by data, rather than mathematics, and the modern data-driven interface of JMP is empowering instructors to teach engaging introductory courses.

To meet the needs of these modern courses, JMP has just released the latest version of JMP Student Edition, a streamlined version that contains all of the analysis and visualization tools covered in any introductory statistics course. (Our website shows a comparison of JMP and JMP Student Edition.) Its easy-to-use, point-and-click interface, Windows and Mac compatibility, and intuitive navigation make it an ideal companion for learning statistics, too.

JMP Student Edition contains all the univariate, bivariate and multivariate statistics and graphics covered in most first-year courses.

The latest version is now easier to get. Many leading textbooks include access to download a copy of JMP Student Edition through an authorization code packaged in their print or e-text products. Each downloaded copy provides 24 months of use, so students can continue to enjoy JMP Student Edition well beyond their introductory course.

Based on JMP 12.1, the latest release contains several new and improved features:

• For instructors who wish to emphasize randomization and resampling to motivate inference, Bootstrapping is built in to the Distribution platform.
• For courses that wish to introduce concepts and applications of big data, we have removed the limit on data size.
• Some courses, such as those in business statistics, are beginning to introduce concepts of analytics. So, we have included elements of the Partition platform to provide classification and regression trees.

• To facilitate student projects and communicating results, JMP Student Edition now features a direct export to PowerPoint as well as an interactive HTML output option.
• “Applet-like” conceptual demonstrations of fundamental ideas such as probability, sampling distributions and confidence intervals, to name a few, are now included and integrated into the help menu. Easy to access, they also feature the ability to use your own data to simulate concepts.

• Mapping tools in JMP Student Edition now include street maps.

Additional effort went into tailoring JMP Student Edition to the standard output seen in textbooks. Default output of one- and two-variable graphs will more closely reflect standard practice.

For middle and high schools that need a license for computer labs or classrooms, we offer a special five-year schoolwide license. Contact JMP academic sales for more information at academic@jmp.com. Additional teaching resources for AP Statistics are freely available at our website.

## Did LeBron James step up his game in the playoffs?

The Golden State Warriors beat the Cleveland Cavaliers to win the NBA championship despite the best efforts of LeBron James. With the Cavaliers depleted by injuries (particularly to Kevin Love and Kyrie Irving), James was faced with carrying his team against a very talented and well-rounded Warriors team. And he was most certainly up for the challenge: LeBron had an amazing series, shouldering even more responsibility than usual and making it competitive against the Warriors.

LeBron’s performance in the finals got me wondering: Can we pinpoint exactly when he started to increase his output? Did he step up his game for the finals in particular, or had he been ramping it up throughout the playoffs? Or maybe his performance in the finals was nothing unusual, although I seriously doubted that.

First things first. We should plot his data for the entire season. There are many ways to evaluate a basketball player’s impact on the court. But for our purposes, let’s just look at his points scored, rebounds and assists.

The data seem a little too noisy to say confidently where LeBron started to increase his output. It’s probably safe to say that his rebounds started to increase around game number 75 (which happens to be the beginning of the playoffs), but it is hard to say. So let’s see if we can use a statistical model to help us find the changepoints.

Finding the changepoints

One approach to finding changepoints in our response is to fit a model like

E(points in game 1) = $\beta_0$

E(points in game 2) = $\beta_0 + \beta_1$

E(points in game 3) = $\beta_0 + \beta_1 + \beta_2$

and so on. This model generalizes to:

E(points in game $j$) = E(points in game $j-1$) + $\beta_{j-1}$.

So anytime one of our $\beta_{j-1}$ is nonzero, we know that our mean has shifted up or down at game $j$. We can use a variable selection technique to tell us exactly which of those parameters should be nonzero. If we use the Lasso for estimation and selection (available in the Generalized Regression platform in JMP Pro), this model is a special case of a model called the fused lasso.
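For readers who want to see the mechanics outside of JMP, here is a minimal Python sketch of that idea. It is not the Generalized Regression platform's implementation: it is a plain cyclic coordinate descent for the lasso, written for illustration only. Following the three displayed equations, `beta[0]` is the mean for game 1 and `beta[k]` is the jump between game `k` and game `k + 1`. The function names, the penalty value and the data are made up.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_changepoints(y, lam, sweeps=2000):
    """Lasso on the jump coefficients by cyclic coordinate descent.

    Minimizes 0.5 * sum((y - fit)^2) + lam * sum(|beta[k]| for k >= 1).
    beta[0] (the mean for game 1) is left unpenalized.
    """
    n = len(y)
    beta = [0.0] * n
    fit = [0.0] * n
    for _ in range(sweeps):
        # Unpenalized baseline: absorb the average residual.
        shift = sum(yi - fi for yi, fi in zip(y, fit)) / n
        beta[0] += shift
        fit = [fi + shift for fi in fit]
        for k in range(1, n):
            m = n - k  # number of games affected by jump k (list positions k..n-1)
            rho = sum(y[i] - fit[i] for i in range(k, n)) + m * beta[k]
            new = soft_threshold(rho, lam) / m
            delta = new - beta[k]
            if delta:
                for i in range(k, n):
                    fit[i] += delta
                beta[k] = new
    return beta


# Made-up scoring data: the mean steps up after the sixth game.
points = [10.0] * 6 + [20.0] * 6
beta = fit_changepoints(points, lam=1.0)
biggest = max(range(1, len(beta)), key=lambda k: abs(beta[k]))
print("largest jump is between games", biggest, "and", biggest + 1)
```

Larger values of `lam` push more of the jumps to exactly zero, which is what turns the fit into a small number of flat segments.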

And the model says...

Let’s take a look at the results of our fused lasso model for LeBron’s points, rebounds and assists. The prediction functions for these models give us a much clearer picture than when we looked at the raw data. LeBron’s points remained constant throughout the regular season, started to increase throughout the playoffs and peaked during the finals. His rebounds steadily increased over the regular season, but increased more dramatically throughout the playoffs. His assists jumped up during the playoffs as well.

You want your superstars to respond on the biggest stage, and I feel like LeBron truly did that. Things looked bleak when both Kevin Love and Kyrie Irving got injured in the playoffs, but the remaining Cavaliers were up for the challenge. The Warriors were expected to run them off the court, but the Cavaliers were able to make it a competitive and entertaining series, thanks in large part to LeBron’s historic performance. And this is high praise considering that the Cavaliers took out my beloved Atlanta Hawks in the Eastern Conference Finals!

Reference

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91-108.

## How to combine Squarify and Split layouts with hierarchical Treemaps

In my previous blog post, I introduced you to a new layout algorithm for Treemaps in JMP, called Squarify. I explained how Split preserves the order of your data while Squarify orders the data by size.

But what if you have data that is hierarchical? JMP can display hierarchical data in a Treemap by creating groups. Each group is laid out in tiles, just like each category is tiled within the group. If you select Split, then the groups are laid out using Split, preserving the order of your data. Then the categories are laid out using Split within each group. Similarly, selecting Squarify will reorder the groups, displaying the largest group in the upper-left corner and working its way down to the smallest group in the bottom-right corner. Then categories will be laid out the same way within the groups. But is there a way to combine these two techniques in one Treemap? There is with hierarchical data.

Mixed mode

When you have hierarchical data, you can select the third layout option, called Mixed. Mixed mode will lay out the groups using Split, which will preserve the order of your groups. But then it will display the categories using Squarify, which will order the categories within each group from largest to smallest.
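The ordering rules for the three layouts can be sketched in a few lines of Python. This covers only the order in which tiles would be placed, not the tiling geometry, and it is not JMP's code. The function name and the counts are made up for illustration.

```python
def layout_order(groups, mode):
    """Return [(group, [categories])] in the order the tiles would be placed.

    groups: dict of group name -> dict of category -> size, in data order.
    mode:   "split" keeps data order everywhere,
            "squarify" sorts groups and categories by size (largest first),
            "mixed" keeps group order but sorts categories within each group.
    """
    if mode not in ("split", "squarify", "mixed"):
        raise ValueError("unknown layout mode: " + mode)

    def by_size(categories):
        return sorted(categories, key=categories.get, reverse=True)

    names = list(groups)
    if mode == "squarify":
        names.sort(key=lambda g: sum(groups[g].values()), reverse=True)
    sort_categories = mode in ("squarify", "mixed")
    return [
        (g, by_size(groups[g]) if sort_categories else list(groups[g]))
        for g in names
    ]


# Hypothetical counts, not the real San Francisco data.
crimes = {
    "Sunday": {"Assault": 30, "Larceny/Theft": 90, "Vandalism": 20},
    "Monday": {"Assault": 40, "Larceny/Theft": 100, "Vandalism": 25},
}
mixed = layout_order(crimes, "mixed")
squarified = layout_order(crimes, "squarify")
print(mixed)
print(squarified)
```

With these made-up numbers, Mixed keeps Sunday ahead of Monday while putting the largest category first inside each day, and Squarify moves the larger day to the front.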

Again, I will demonstrate this using the San Francisco crime data from the sample data folder installed with JMP. My previous post showed the number of incidents of each type of crime. We will do that again, but this time let's group it by the day of the week and let's select Mixed from the layout menu option.

Looking at this Treemap, we see each day of the week as a grouping. Since Mixed uses Split for the groups, the order of our weekdays is preserved, with Sunday in the top left and Saturday in the bottom-right corner. But the categories are laid out using Squarify. This gives us nicely shaped rectangles that are easy to compare and orders them with the largest value in the top-left corner of each grouping.

In my earlier post on Squarify, we saw that Larceny/Theft was the most common type of crime. Grouping by the day of the week, we see that it is the most common crime every day of the week.

Pop quiz

What if we wanted to know which day of the week has the most crime (which group is the largest)? Well, hopefully by now you know how to find the answer to that question....

Yes! That's right. You would use Squarify, and you'd see that the answer is Monday by looking in the top-left corner. (What is it about Mondays?)

Get this sample data set -- or some other hierarchical data -- and try these new options for yourself.

## Using the Disallowed Combinations Filter in JMP 12

In a previous blog post, I investigated my travel time to work using an estimate from Google Maps. In that post, my possible departure times to and from work were the same every day. However, it’s not uncommon in designs, even when using computer simulators, to have restrictions on the design space. Since JMP 11, this could be accommodated for space filling designs using Disallowed Combinations, but it required a Boolean JSL expression, and you needed to remember that categorical factors have to be specified with an ordinal value. We tried to make specifying disallowed combinations easier in JMP 12 with the new Disallowed Combinations Filter.

In the commute time example, after I move past the first screen, there’s an outline box for Define Factor Constraints underneath the factors. Linear Constraints works as before, and the Disallowed Combinations Script is the option if you want to use Disallowed Combinations via a JSL expression.

Let’s take a look at the Disallowed Combinations Filter. Selecting that option brings up a list of the factors in something that has a similar look to the Data Filter for Data Tables.

For this example, maybe I want a restriction so that if I leave at 8:30 a.m. or later (i.e., morning >= 60), then my evening commute starts after 5:00 p.m. (i.e., evening > 30). In other words, I want to disallow morning >= 60 and evening <= 30 together. I simply select the morning and evening factors, click the “Add” button to have them in the filter, and set the sliders to the appropriate condition, like below:

When I create the design, none of the rows will have both morning >= 60 and evening <= 30 together.
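The logic behind that filter is just a Boolean condition on the factor settings. Here is a rough Python sketch of it, not the JSL expression that JMP saves. The function name and the candidate grid (15-minute steps from 0 to 120) are assumptions made up for illustration.

```python
from itertools import product


def is_disallowed(morning, evening):
    """True for the combination being ruled out: a late start with an early finish."""
    return morning >= 60 and evening <= 30


# Hypothetical candidate grid: minutes after the earliest departure, in 15-minute steps.
levels = range(0, 121, 15)
candidates = list(product(levels, levels))
allowed = [(m, e) for m, e in candidates if not is_disallowed(m, e)]

print(len(candidates), "candidate points,", len(allowed), "allowed")
```

Any design built only from `allowed` points has no row with both morning >= 60 and evening <= 30.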

I especially like the Disallowed Combinations Filter with categorical factors. Instead of the above condition, maybe I have a Tuesday afternoon meeting that doesn’t let me leave until after 5:00 p.m. (i.e., disallow evening <= 30 on Tuesdays) and a morning appointment on Thursdays for which I want to leave before 8:30 a.m. (i.e., disallow morning >= 60 on Thursdays). I select evening and day first and put in that condition, choose OR to add the second condition, and then add the second condition by selecting morning and day. My filter looks like this:

Now I can go ahead and create the design.
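The two day-specific conditions joined by OR can be written the same way. Again this is a plain Python sketch, not the JSL that JMP generates (which, as noted earlier, refers to categorical levels by ordinal value). The function name and the 30-minute grid are hypothetical.

```python
from itertools import product


def violates_schedule(day, morning, evening):
    """True if a run breaks either day-specific restriction."""
    early_finish_tuesday = day == "Tuesday" and evening <= 30
    late_start_thursday = day == "Thursday" and morning >= 60
    return early_finish_tuesday or late_start_thursday


days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
levels = [0, 30, 60, 90]  # hypothetical departure offsets in minutes
candidates = list(product(days, levels, levels))
allowed = [run for run in candidates if not violates_schedule(*run)]

print(len(candidates), "candidate runs,", len(allowed), "allowed")
```

Each additional restriction becomes one more term in the OR, which is why a long list of conditions can be easier to maintain as a script than through the filter.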

Final Thoughts

The Disallowed Combinations Filter appears in Custom Design, Space Filling Design, Covering Arrays and Augment Design. If you create a design with the Disallowed Combinations Filter, the saved script has Disallowed Combinations converted into the JSL Boolean expression. This means that running the DOE script does not bring up the Data Filter, but rather the “Use Disallowed Combinations Script.” I have found this useful to create larger disallowed combinations scripts with lots of “OR” statements when the Filter method begins to get tedious.

## A case study analyzing and visualizing ethanol fuel data

People have been producing ethanol for about 9000 years[i]. Ethanol is made using a fermentation process. The fermentation process converts sugars into ethanol using yeast. For most of history, the ethanol produced was for consumption purposes. In the last 35 years, there has been an increase in ethanol fuel use[ii]. Currently, ethanol fuel is produced in more than 200 ethanol plants throughout the US[iii].

In any ethanol plant, the goal is to consistently produce high ethanol yield and limit the batch-to-batch variance. Anne Chronic of Phibro Ethanol Performance Group, a division of Phibro Animal Health Corporation, recently presented at the International Fuel Ethanol Workshop & Expo (FEW) on how JMP can be used to help accomplish these goals. Anne’s talk, called “Finding Your Rock Star Operator,” focused on a specific case study. In the case study, the plant was targeting a specific process called Clean In Place (CIP). The CIP process is used to clean out process tanks and piping on a routine basis. Anne discovered that the different operators had a wide range of performance. She found that one was a “rock star” operator, while the others were not performing as consistently.

The "rock star" operator was discovered using JMP

After Anne discovered the trend, the results were communicated back to the plant. The plant then had a discussion about CIP with all of the operators, emphasizing the importance of following the standard operating procedures for CIP. The CIP process often requires recirculation of water (and/or chemicals) and time to reach high temperatures needed for adequate cleaning. The operators who were not performing as consistently had been attempting to save time and water (and/or chemicals) during CIP; they did not realize that doing so was affecting the rest of the fermentation process. After the CIP discussion took place, the plant saw an increase in the overall temperature and a dramatic reduction in variance.

The increase in mean temperature and reduction in variance can be seen with the phased control chart

To summarize the results of the case study, Anne used the Partition platform in JMP to show the improvement by the operators after the CIP discussion.

The ethanol fuel industry is very dependent on yield and reproducibility. Even small improvements in yield lead to substantial profits. Several talks at the FEW conference mentioned that a 0.1% yield improvement can equal ~$50,000-$500,000 annually, which highlighted the large impact that JMP can have in analyzing and visualizing ethanol fuel industry data.

Notes

[i]  Gately, Iain (2009). Drink: A Cultural History of Alcohol. New York: Gotham Books. ISBN 1592404642.

[ii] Renewable Fuels Association (6 March 2012). "Accelerating Industry Innovation – 2012 Ethanol Industry Outlook" (PDF). Renewable Fuels Association. Retrieved 18 March 2012. See pp. 3, 8, 10, 22 and 23.

## AMA Advanced Research Techniques (ART) Forum is next week!

There’s a saying in publishing that you measure a book’s impact in two ways: how many people buy it, and how many people read it. There’s a similar saying in statistics: You measure a technique’s impact by how many people know about it and by how many people actually use it. One of the things I love about being a developer at JMP is that I get to make statistical techniques that might otherwise be difficult to use or time-consuming in practice accessible to a wide audience of users.

That is also why I’m excited to be on the committee for the American Marketing Association’s Advanced Research Techniques (ART) Forum that’s taking place next week in San Diego. ART is dedicated to bringing academics and applied researchers together. The conference sessions and format are designed to encourage discussion. In every session, academic researchers present alongside applied researchers who are working in industry. Post-presentation discussions from market leaders show how to work through problems that might happen when implementing a new technique, as well as how to analyze the gains from doing so.

The conference also hosts tutorials that serve as refresher courses for established techniques. This year, the topics include Machine Learning for Marketing, Knowledge Representation, Choice Modeling, and others. This makes the ART Forum a great training opportunity for people who are new to marketing research, as well as for experienced professionals.

Last year, when I attended ART for the first time, I remember thinking, “These folks are exactly the kind of market researchers I want to work with in this field.” This year in San Diego, I’ll be participating as part of the committee, and I couldn’t be more excited.

There’s still time to register if you’re interested in attending this year, and if you’re already registered, I’ll see you in San Diego!

## Using a covering array to identify the cause of a failure

My last blog entry discussed using a covering array to test the Preferences for the Categorical platform. While the hope is that all of the tests pass, in this blog we consider what we can do when one of those tests fails. When you use “Make Table” after creating a design in the Covering Array platform, there are two important pieces to pay attention to: the first column with missing values labelled “Response”, and a table script called “Analysis”. The Response column uses values of 1 and 0 to correspond to whether or not a particular run passed or failed according to what we’re measuring. For the Categorical platform, a value of 1 would be recorded if the platform behaves as expected, and 0 if there’s a fault of some kind. The “Analysis” script is to be used once the Response column is filled in.

When performing the test, ideally we observe all 1’s in the Response column. In the Data Table provided on the File Exchange, I went through and collected some hypothetical data: each of the runs passed except for the last one. What do we do with this failure? It would be nice to narrow down the potential causes.

Analysis

The first place to look for a cause is whether any single factor could have caused the failure. Since each preference option occurs in more than one row and everything else passed, it’s not a single factor causing the issue. The next likely candidate would be a 2-option cause. We have 45 Choose 2 = 990 different 2-option combinations in that row. That doesn’t seem very informative. However, many of these combinations appear elsewhere in the design and the platform passed those tests, so those can be eliminated as potential causes. Going through the list of potential causes and eliminating those that have appeared elsewhere would be a tedious task – which is what the Analysis script takes care of for us. Running that script:

The Analysis report shows the potential causes for the failure, reduced from the 990 pairs contained in the failing row to just 16. What’s more, the follow-up testing has been simplified to checking those pairs.
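The elimination step itself is easy to sketch. The Python below is an illustration of the idea, not the Analysis script: it collects every pair of settings seen in a passing run and reports the pairs in failing runs that never passed anywhere. The function name and the tiny three-factor design are made up.

```python
from itertools import combinations


def potential_pair_causes(runs, passed):
    """Pairs of (factor, level) settings found in a failing run but in no passing run."""
    def pairs(run):
        settings = sorted(run.items())
        return set(combinations(settings, 2))

    seen_in_pass = set()
    for run, ok in zip(runs, passed):
        if ok:
            seen_in_pass |= pairs(run)

    suspects = set()
    for run, ok in zip(runs, passed):
        if not ok:
            suspects |= pairs(run) - seen_in_pass
    return suspects


# Hypothetical design with three two-level factors; only the last run fails.
runs = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]
passed = [True, True, True, True, False]

suspects = potential_pair_causes(runs, passed)
for pair in sorted(suspects):
    print(pair)
```

In this toy case the failing run contains three pairs, and the pair A=1 with B=1 is cleared because it also appears in a passing run, leaving two pairs to investigate.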

Any prior knowledge about the system makes this process even easier. In our example, the tester has knowledge that the “ChiSquare Test Choices” were recently updated, and can first look at those 2 cases. It’s worth noting that clicking on any of the potential causes highlights the failure row and columns corresponding to it. This is useful if you’re dealing with many rows and/or columns and want a quick way to be able to subset the table.

Final Thoughts

We went from a task of testing preferences that looked impossible to something that gave reasonable coverage in just 16 runs. We could go even further by creating a strength 3 covering array – with some optimizing, I found a design that had 65 runs (and the 4-coverage was over 96%). Constraints where some combinations cannot be run together can also be accommodated with disallowed combinations. Any luck using covering arrays in your own work? Leave me a comment and let me know. Thanks for reading!

## Why are some dogs adopted faster than others?

Carah Gilmore entered her school's science fair competition and advanced to higher levels with her project analyzing data about pet adoptions. (Images courtesy of Ryan Gilmore)

Here at JMP, we love pets. So we were thrilled to hear that a young scientist used our software to explore data about pet adoptions from local animal shelters. The project is adorably titled "Furever Friends."

How young is this scientist? She is 10 years old, and her name is Carah Gilmore. Her father, Ryan Gilmore, works at SAS as a Senior Application Developer.

Carah's school, Mills Park Elementary in Cary, NC, strongly recommended that students take part in the science fair, so Carah chose a topic that would help a charity dear to her heart.

"My dad and I take pictures for an organization named Rescue Ur Forever Friend to help pets get adopted quicker. I was wondering how I could help, so I thought of this project – what factors determine how fast a dog gets adopted," the rising sixth-grader says.

To identify those factors, she needed some data. She was able to get data from Rescue Ur Forever Friend and Second Chance Pet Adoptions. Ryan says the final data table had around 1,400 rows and 12 columns.

Carah looked at adoption time by breed.

Ryan had worked in technical support for JMP and therefore knew the software very well. That's why he decided to teach Carah how to use Graph Builder and Tabulate to perform her analysis.

But before starting on her analysis, Carah had to do a bit of data preparation.

"The data from both organizations was given as Excel files. Carah combined the data into a single Excel file, which was then imported into JMP. New columns were created to compute the length of stay and age of the animals. Other columns were also created to categorize the dogs based on color and breed," Ryan says.

What Carah learned

Black Dog Syndrome is the idea that black dogs spend more time waiting for a new home than lighter-colored dogs. Carah aimed to test that hypothesis and see what else she could learn.

"My results showed that it took longer for black dogs to be adopted," Carah says.

Carah's graph from her science fair project showing that black dogs take more days to get adopted than dogs of other colors.

Black dogs took 83 days on average to be adopted, whereas brown dogs took an average of 65 days. Gray dogs fared the best, waiting only 38 days on average for a new home.

Carah also found that female dogs were more quickly adopted than male dogs. As might be expected, large dogs took more days to be adopted than medium or small dogs. But what's surprising is that extra-large dogs spent the fewest days in the shelter before adoption, when looking at size alone.

Competing at science fairs, and beyond

At her elementary school science fair in January, Carah placed third, which qualified her to compete at the regional science fair the following month. She placed in the top eight among more than 100 students there and competed at the state level in March. While she did not place in the state competition, she did learn a lot along the way and helped spread the word about pet adoption.

What was the best part of doing this project for Carah?

"Helping the organizations. Creating the formulas and graphs in JMP. Also, finding out how long it does take different dogs to get adopted," Carah says.

The science fairs are over for this year, but she hopes all who have seen her findings (including you, dear reader) will "try to adopt a rescue dog and take a look at the black dogs because they’re just like the other dogs."

Bravo, Carah!