1,000 posts and counting

I'm happy to report that we posted our 1,000th post at the JMP Blog yesterday, with Ryan Lekivetz's post on using space filling designs to explore data about his work commute!

Ryan's post is an example of the technical content that JMP Blog readers love best. So, I looked back to 2007 when this blog began to find the top 10 posts on technical topics. Here they are, in order of most views:

  1. Saving graphs, tables and reports in JMP by Mary Loveless (2010)
  2. Graphical output options using JMP by Daniel Valente (2011)
  3. Introducing definitive screening designs by Bradley Jones (2012)
  4. What factors affect office temperature? A design in JMP by Audrey Shull (2010)
  5. What good are error bars? by Mark Bailey (2008)
  6. Visualizing Derek Jeter's 3,000-hit milestone by Lou Valente (2011)
  7. Histogram color by Xan Gregg (2007)
  8. JSL tip: Finding unique values in a list by Xan Gregg (2011)
  9. Image analysis of an elephant's foot in JMP by John Ponte (2013)
  10. "The desktop computer is dead" and other myths by John Sall (2013)

Thanks for reading, and let us know what you would like us to write about!


Using a space filling design and Google Maps to plan my commute: Part 1


Can a space filling design help make my commute to work easier? (Photo by Geoffrey Arduini via Unsplash: https://unsplash.com/geoffreyarduini)

Whenever I’m getting ready to go to or from work, I always have a notion that if I leave by a certain time, my drive will be smoother due to less traffic. Of course, even if I leave at the same time every day, my drive time is still going to vary. When Google Maps recently added a feature that gives an estimated travel time range for a future trip, I decided to see if I could find anything interesting about my daily commute. This also seemed like a perfect opportunity to try out the new space filling design features in JMP 12.

While I don’t know all of the details of how the time is estimated, I can treat the response as coming from a computer simulator with a deterministic response: If I put in the same departure time/day over and over, I keep getting the same expected time. So, it’s a waste of resources to use replicate runs. I collected the data all in one sitting, since it’s unclear to me how/when Google Maps updates these estimates.

To demonstrate the utility of the space filling design, let’s assume that all I have available to me is the total travel time based on the time I leave in the morning and the time that I leave work in the evening (i.e., I don’t have the individual morning/evening trip times available). I then want a design that has a good mix of morning and evening departure times, ideally over different weekdays. My morning departure can occur between 7:30 a.m. and 9:00 a.m., while the evening departure is between 4:30 p.m. and 6:00 p.m. For simplicity, I let the time variables range from 0 to 90, representing the number of minutes from the start of the range.

Suppose I only have the resources to collect data from 150 points. Why even use a space filling design, when I could just spread the points over evenly spaced intervals? Since I can specify departure times in increments of whole minutes, collecting data over the whole range would require 91 x 91 x 5 = 41,405 points. Even using 10-minute increments, I would need 11 x 11 x 5 = 605 points, and I would lose any chance of observing whether travel times change between the 10-minute marks. I also want to make sure that I don’t use the same times on each day, because if there are no differences between days, I would just be collecting the same results each time.

In JMP, I go to DOE -> Space Filling Design and change the continuous factors to morning and evening with ranges of 0 to 90. I add a five-level categorical factor for day (note this feature is only available beginning in JMP 12):

[Screenshot: Space Filling Design dialog with Morning and Evening factors (0 to 90) and a five-level Day factor]

When I hit Continue to go to the next screen, the only option with categorical factors is Fast Flexible Filling, which uses the MaxPro criterion in JMP 12 to better explore the boundaries. The morning and evening times will come out as fractions, which I’ll need to round off when collecting the data, but let’s take a look at the 150-run design I found. Ignoring the days, we get a good mix of departure times:

[Figure: morning vs. evening departure times for the 150-run design]
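
As an aside for readers who want to experiment outside JMP, here is a rough sketch of the space filling idea in Python. It is not the Fast Flexible Filling algorithm or the MaxPro criterion that JMP uses; it just spreads 150 hypothetical (morning, evening, day) runs over the region with a simple maximin random search.

```python
# Illustrative sketch only -- not JMP's Fast Flexible Filling / MaxPro algorithm.
# It spreads 150 (morning, evening) points over the 0-90 region by keeping the random
# candidate design whose closest pair of points is farthest apart (a maximin criterion).
import itertools
import random

random.seed(1)
N_RUNS, CANDIDATES = 150, 100
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def min_pairwise_distance(design):
    """Smallest squared distance between any two (morning, evening) settings."""
    return min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
               for a, b in itertools.combinations(design, 2))

best_design, best_score = None, -1.0
for _ in range(CANDIDATES):
    candidate = [(random.uniform(0, 90), random.uniform(0, 90)) for _ in range(N_RUNS)]
    score = min_pairwise_distance(candidate)
    if score > best_score:
        best_design, best_score = candidate, score

# Spread runs across the five weekdays and round to whole minutes for data collection.
runs = [(round(m), round(e), DAYS[i % len(DAYS)]) for i, (m, e) in enumerate(best_design)]
print(runs[:5])
```
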
And if we look at the departure times day by day, we also see a good spread:

[Figure: morning and evening departure times shown separately for each weekday]

The observant reader may notice that the scale on the axes is no longer 0 to 90. That is a matter of changing the Column Info and is best left for another blog post if there’s any interest (just leave me a comment to let me know).

Now that we’ve created a design, it is time to collect some data. My next blog post will show the analysis to see if anything interesting results.


New in JMP 12: Data table support for images

When most people hear the word “data,” their thoughts turn to numbers or text. Increasingly, however, the data we may wish to analyze includes images. Consider the following data, in which a dozen voters ranked the top 25 college football teams at the end of the 2014 season. Teams receive (26-k) points for each kth-place vote. That is, 25 points for each 1st-place vote, 24 points for each 2nd-place vote, etc., down to 1 point for a 25th-place vote. While voter, rank and points have been captured numerically, the college names are nowhere to be seen — only a logo is present.
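
As a quick illustration of that scoring rule (the ballots below are made up, not the actual 2014 poll data), a few lines of Python tally points per team the same way:

```python
# A kth-place vote is worth 26 - k points: 25 for 1st, 24 for 2nd, ..., 1 for 25th.
from collections import defaultdict

votes = [  # (voter, team, rank) -- hypothetical ballots for illustration
    ("Voter 1", "Ohio State", 1), ("Voter 1", "Oregon", 2),
    ("Voter 2", "Ohio State", 1), ("Voter 2", "TCU", 2),
]

points = defaultdict(int)
for voter, team, rank in votes:
    points[team] += 26 - rank

for team, total in sorted(points.items(), key=lambda kv: -kv[1]):
    print(team, total)
```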

Before JMP 12, in order to analyze data like this we would have had to create a new column, typing in each school’s name (and that assumes we can recognize each school’s icon, which we may not). With support for images within data tables in JMP 12, we can simply analyze the table in its present state.

Using the Summary platform, we cast the Icon column as a Group column, and request the statistics of interest.

[Screenshot: Summary platform dialog with the Icon column cast as a Group column]

Notice I’ve used another of the new features in JMP 12: the Histogram statistic, available at the bottom of the Summary platform's Statistics drop-down menu. This writes images of histograms (of the ranking values, in this case) to each row in the summary table.

Sorting the results in descending order of the Sum(Points) column produces the table below. Notice that the histograms give us a quick overview of the trends in the data, while the statistics provide supplementary detail.

[Table: summary table sorted by Sum(Points), with a histogram image for each team]

Using the scroll bar to move quickly down the rows of the table, we can see the distributions move from left to right, and it is also easy to see the increased variability in the voters’ opinions of teams whose point totals fell in the middle of the distribution. No. 1 Ohio State exhibits no variability in its rankings, but Ole Miss’ histogram indicates a considerable spread of opinion: that school was ranked as highly as 11th by one voter, while another voter left it unranked (its N Rows value is 11, although there were 12 voters).

[Table: ranking histograms for teams farther down the sorted table]

We can also use these images as labels in graphs. Below, I’ve plotted the maximum vs. minimum rankings using Fit Y by X, and I’ve drawn a line at Y = X. Since each point’s vertical distance from the line gives us the range of its rankings, it is easy to pick out the teams with the greatest difference between their maximum and minimum rankings. By turning on the Icon column’s Label property, and labeling all rows, we can hover over any point to see the icon associated with it.

[Graph: maximum vs. minimum rankings with the line Y = X]

While hovering, you’ll see an icon for a pin in the upper right corner. Click it to keep the information visible as you move elsewhere on the graph.

[Graph: hover label pinned to a point, showing the team's icon]

Once the icon is pinned, you can right-click on the icon to access other options, such as linking to the point with a tag line and inserting text.

[Graph: right-click options for a pinned icon, including a tag line and inserted text]

As you can see, the support for images within data tables is a powerful feature in JMP 12. Next time, I’ll have more on using images within graphs, such as how to create images from analysis results to make a linked graph like the one below. See you then!

[Graph: linked graph built from images of analysis results]

To learn more about what's new in JMP 12, watch the live webcast on June 11 or the on-demand version anytime.


Excel Import Wizard in JMP 12

The Excel Import Wizard was introduced in JMP 11 for Windows. The goal of the Wizard was to simplify the task of importing data that often did not follow the rule of “line 1 contains headers, line 2 starts the data.” The feedback we've received suggests that we were successful.

This blog post does not detail how to use the Wizard. The documentation is a good resource for this, as is Chuck Pirrello’s video “Working with Excel Data.” Rather, I will aim to give you an overview of the changes to the Excel Import Wizard in JMP 12 and show you how easy it is to import your Excel data into JMP.

First, the wizard is available on the Mac in JMP 12. We apologize for the delay with this, but please know that all of the functionality that you have had in the Windows release of JMP is now available on the Mac. All of the new features that I describe in this post are in the Mac version of the wizard.  Potentially of even greater significance for Mac users is that the wizard allows you to open .XLSX formatted files. XLSX is the newer Excel format introduced with Excel 2007. Prior versions of JMP required Mac users to open .XLSX files via an ODBC driver, which could be cumbersome.

The Mac support is the biggest news in the latest version of JMP. Because of user feedback, we have also added a few more features. The first is “Show all rows” support, which addresses two requests. Some users want to see all of the data in large tables, not just the first 100 rows. By checking this box, you now get a preview of the entire worksheet. Checking the box also makes another important adjustment that you can’t see: JMP uses all of the data in the column to guess the column’s data type. Users were running into cases where they had a large number of rows, and the last few rows contained some character data. Since JMP would use the first 100 rows to guess the data type, it could guess a numeric type. If row 205 contained character data, that data could be lost when the worksheet was imported, because character data cannot be represented in a numeric cell. With “Show all rows” checked, character data discovered far down in the column causes the column to be typed as Character, preserving all of the data. Be aware that if you select this option, large tables can take considerable time to load.

[Screenshot: Excel Import Wizard with the Show all rows option]
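
To see why scanning every row matters, here is a toy illustration (plain Python, not JMP's import code) of how a type guess based on the first 100 values can differ from one based on the whole column:

```python
# Toy illustration: a column whose first rows look numeric turns character farther down.
def guess_type(values):
    """Return 'Numeric' only if every scanned value parses as a number."""
    for v in values:
        try:
            float(v)
        except ValueError:
            return "Character"
    return "Numeric"

column = [str(i) for i in range(204)] + ["N/A", "pending"]   # character data appears at row 205

print(guess_type(column[:100]))   # 'Numeric'   -> the later character values would be lost
print(guess_type(column))         # 'Character' -> all values preserved
```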

The green plus signs that you see next to the settings labels are new in JMP 12. If you first select a row or column and then press the green plus button next to the desired setting, JMP calculates the correct row or column number for the setting. This can be important if you have hidden rows or columns, or if the data does not start on the second row and you are no longer viewing the empty rows before the start of the data.

The data below gives an example of using the green plus. There are observations and text below the last country, Kenya, that I don’t want to import into JMP. I select the row containing Kenya.

[Screenshot: worksheet preview with the row containing Kenya selected]

JMP will figure out that the row in question is in fact row 40, and it will fill in the field and update the display.

[Screenshot: the "Data ends with row" setting filled in with row 40]

The screenshot above also shows one final option. Sometimes workbooks contain columns that are empty, but the column header itself conveys important information. If you uncheck the “Suppress empty columns” option, the empty column will be imported. By default, JMP removes this type of column.

Note: A version of this blog post appeared first in Brian's JMP User Community blog, The Rest of the World.


Promoting interdisciplinary research at QPRC 2015

The Quality & Productivity Research Conference is about to take place in our backyard (so to speak) at the School of Textiles at North Carolina State University. And it includes a field trip to our front yard.

The theme of the meeting is Creativity and Innovation for a Connected World. As the conference website says, QPRC 2015 aims "to stimulate interdisciplinary research among statisticians, scientists, and engineers in quality and productivity, industrial needs, and the physical and engineering sciences." JMP is all about that.

Attendees who peruse the schedule will notice lots of involvement by JMP and SAS (including many JMP bloggers). In fact, JMP is a co-host, and conference co-chairs are Di Michelson and Mia Stephens.

Here's what JMP and SAS folks will be doing during the conference:

Wednesday, June 10

  • 8:30-9:00 a.m. Welcome by Di Michelson and John Sall
  • 9:00-10:00 a.m. Plenary speech by Chris Gotwalt
  • 10:30 a.m.-noon Applications of Text Analytics, invited session organized by Mark Bailey and chaired by Cat Truxillo; presenters include Murali Pagolu
  • 1:30-3:00 p.m. Modern Regression Methods, invited session organized and chaired by Di Michelson; presenters include Clay Barker

Thursday, June 11

  • 3:30-4:00 p.m. JMP demo by Byron Wingerd
  • 4:00-5:30 p.m. Modern DOE Training Methods, panel organized by Di Michelson; panelists include Mark Bailey
  • 4:00-5:30 p.m. Holistic DOE: Learning in the Face of Uncertainty, contributed session by Scott Wise and Dan Valente
  • 5:30-6:00 p.m. Bus tour of SAS campus
  • 6:00-7:00 p.m. Reception at SAS campus
  • 7:00-8:00 p.m. Plenary speech by Jerry Williams

Friday, June 12

Friends of JMP at QPRC 2015

William Meeker of Iowa State University, whom you may have seen in some of our webcasts, is presenting a short course, Experiences and Pitfalls in Reliability Data Analysis and Test Planning, on June 9 before the conference begins. Dr. Meeker is also a panelist in a session on Reliability and Quality Control on June 10. That same day, JMP Discovery Summit Europe presenter Heath Rushing of Adsurgo is on the Applications in Text Analytics panel. And Dennis Lin of Penn State University, who appears on Analytically Speaking on June 10, serves on the panel on Developments in Design the next day, June 11.

For the full schedule, view the PDF. And if you like what you see, register. Online registration closes June 5, but you can also register at the door when the conference begins.

Also, note that QPRC attendees receive 20% off JMP Training and Books. Stop by the JMP Training table while at the conference June 11 and 12 for more information.


The QbD column: A QbD factorial experiment

A quick review of QbD

The first blog post in this series described Quality by Design (QbD) in the pharmaceutical industry as a systematic approach for developing drug products and drug manufacturing processes. Under QbD, statistically designed experiments are used to efficiently and effectively investigate how process and product factors affect critical quality attributes. These experiments lead to a “design space,” a collection of production conditions that reliably provides a quality product.

A QbD case study

In this post, we present a QbD case study that focuses on setting up the design space. In this case study, the formulation of a generic steroid lotion is designed to match the properties of an existing brand using in-vitro tests. In-vitro release is one of several standard methods that can be used to characterize performance of a finished topical dosage form. Important changes in the characteristics of a drug product formula (or the chemical and physical properties of the drug it contains) should show up as a difference in the drug release profile. A plot of the amount of drug released per unit area (mcg/cm²) against the square root of time should yield a straight line. The slope of the line represents the release rate, which is formulation-specific and can be used to monitor product quality. The typical in-vitro release testing apparatus has six cells in which the tested generic product is compared to the brand product. A 90% confidence interval for the ratio of the median in-vitro release rates of the generic and brand products is computed and then expressed as a percentage. If the interval falls within the limits of 75% to 133.33%, the generic and brand products are considered equivalent.
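
As a side note on the release-rate calculation, the sketch below fits that straight line for one hypothetical cell: cumulative amount released regressed on the square root of time, with the slope as the estimated release rate. The measurements are invented for illustration.

```python
# Release rate = slope of amount released (mcg/cm^2) versus sqrt(time); values are made up.
import numpy as np

time_hr = np.array([0.5, 1.0, 2.0, 4.0, 6.0])
released = np.array([110.0, 155.0, 220.0, 310.0, 380.0])
slope, intercept = np.polyfit(np.sqrt(time_hr), released, 1)   # degree-1 fit: [slope, intercept]
print(slope)
```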

Initial assessment

An initial risk assessment maps the risks in meeting specifications of critical quality attributes (CQAs). Table 1 presents expert opinions on the impact of manufacturing process variables on various CQAs. Cooling temperature was considered to have low impact, and the order of ingredient addition was determined by considering the risk of contamination. Consequently, these two factors were not studied in setting up the process design space.

Table 1: Risk assessment of manufacturing process variables

The responses that will be considered in setting up the process design space include eight quality attributes:

  • Assay of active ingredient.
  • In-vitro permeability lower confidence interval.
  • In-vitro permeability upper confidence interval.
  • 90th percentile of particle size.
  • Assay of material A.
  • Assay of material B.
  • Viscosity.
  • pH values.

Three process factors are considered: temperature of reaction, blending time, and cooling time.

The experiment

To estimate the effects of the three factors on the eight responses, the company used a full factorial experiment with two center points. (See Table 2, which presents the experimental array in standard order.)

Table 2: Full factorial design with two center points

The JMP Prediction Variance Profile is useful for studying the properties of the design. It shows the ratio of the prediction variance to the error variance, also called the relative variance of prediction, at various factor-level combinations (Figure 1). The relative variance is minimized at the center of the design. As expected, if we were to use a half fraction with four experimental runs at corners of the cube, instead of the eight-point full factorial, the variance would double.

Figure 1: Prediction variance profile for the full factorial design of Table 2 (top) and the half-fraction design (bottom)
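
The "variance doubles" statement can be checked outside JMP by computing the relative prediction variance x0′(X′X)^(-1)x0 at the design center for both designs. The sketch below assumes a main-effects-only model, so treat it as an illustration rather than a reproduction of the profiler in Figure 1.

```python
# Relative prediction variance at the center: 2^3 full factorial vs. a 4-run half fraction.
import numpy as np
from itertools import product

def relative_variance_at_center(runs):
    X = np.column_stack([np.ones(len(runs)), np.array(runs, dtype=float)])  # intercept + 3 main effects
    x0 = np.array([1.0, 0.0, 0.0, 0.0])                                     # center of the design
    return x0 @ np.linalg.inv(X.T @ X) @ x0

full = list(product([-1, 1], repeat=3))
half = [r for r in full if r[0] * r[1] * r[2] == 1]   # half fraction with defining relation I = ABC

print(relative_variance_at_center(full))   # 0.125
print(relative_variance_at_center(half))   # 0.25 -- twice the full factorial
```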

Conclusions about the factor effects

Each response is analyzed separately. Using the analysis of viscosity as an example, we see that the target value was 5000 and that results outside the range 4000-5600 were considered totally unacceptable. The 10 experimental results, shown in Figure 2, ranged from 4100 to 5600.

Figure 2: Cube display of viscosity responses

The strongest effect was associated with temperature, which was positively correlated with viscosity. However, none of the effects achieved statistical significance at the 5% level (see Table 3).

Table 3: Analysis of viscosity using second order interaction model

The case for center points

The two center points in the experimental array allow us to test for nonlinearity in the response surface by comparing the average response at the center points (which for viscosity is 4135.5) to the average of the responses at the corners of the cube (in this case, 4851.6). We can obtain a formal significance test of nonlinearity by adding an “indicator” column – which has the value 1 for the center points and 0 for all other points – to the experimental array and including it in the model.
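
A minimal sketch of that indicator-column test is shown below, with made-up viscosity values (the actual 10 results are not reproduced here). The indicator's coefficient estimates the center-minus-corner difference, and its t-ratio gives the formal test of nonlinearity.

```python
# Curvature test via an indicator column: 8 factorial corners plus 2 center points.
import numpy as np
from itertools import product

y = np.array([4700.0, 4100.0, 4850.0, 5600.0, 4400.0, 5300.0, 5000.0, 4860.0,   # hypothetical corners
              4120.0, 4150.0])                                                   # hypothetical centers

corners = np.array(list(product([-1, 1], repeat=3)), dtype=float)
centers = np.zeros((2, 3))
indicator = np.r_[np.zeros(8), np.ones(2)]            # 1 for the center points, 0 otherwise

X = np.column_stack([np.ones(10), np.vstack([corners, centers]), indicator])
beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = rss.item() / (len(y) - X.shape[1])
se = np.sqrt(mse * np.linalg.inv(X.T @ X)[-1, -1])
print(beta[-1], beta[-1] / se)                        # center-minus-corner estimate and its t-ratio
```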

Combining all the response variables

The goal is to map out a design space that simultaneously satisfies requirements on eight responses: Active Assay, In-Vitro Lower, In-Vitro Upper, D90, A Assay, B Assay, Viscosity, pH. To achieve this objective, we apply a popular solution called the desirability function (Derringer and Suich, 1980), which combines the eight responses into a single index. For each response, Yi(x), i = 1, . . . , 8, the univariate desirability function di(Yi) assigns numbers between 0 and 1 to the possible values of Yi, with di(Yi) = 0 representing a completely undesirable value of Yi, and di(Yi) = 1 representing a completely desirable or ideal response value.

The desirability functions for the eight responses are presented graphically in Figure 3. For Active Assay, we aim to be above 95% and up to 105%: assay values below 95% yield a desirability of 0, and assays above 105% yield a desirability of 1. For In-Vitro Upper, we do not want to be above 133%. Our target for D90 is 1.5, with results above 2 and below 1 having zero desirability. The design space can be assessed by an overall desirability index using the geometric mean of the individual desirabilities: Desirability Index = [d1(Y1) × d2(Y2) × … × dk(Yk)]^(1/k), with k denoting the number of measures (in our case k = 8). Notice that if any response Yi is completely undesirable (di(Yi) = 0), then the overall desirability is zero. From Figure 3 we can see that setting Temp=65, Blending Time=2.5 and Cooling Time=150 gives an overall Desirability Index of 0.31.

 

Figure 3: Prediction profiler with individual and overall desirability functions and variability in factor levels: Temp=65, Blending Time=2.5, Cooling Time=150
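
To make the desirability arithmetic concrete, here is a small sketch using simplified linear-ramp desirabilities and invented response values; the desirability functions actually used in JMP and in this study are richer than these.

```python
# Two simplified Derringer-Suich-style desirabilities and their geometric mean.
def larger_is_better(y, low, high):
    """0 below `low`, 1 above `high`, linear ramp in between."""
    return min(max((y - low) / (high - low), 0.0), 1.0)

def target_is_best(y, low, target, high):
    """1 at `target`, falling linearly to 0 at `low` and `high`."""
    if y <= low or y >= high:
        return 0.0
    return (y - low) / (target - low) if y <= target else (high - y) / (high - target)

# Hypothetical values: Active Assay at 99% (goal 95-105%), D90 at 1.6 (target 1.5, limits 1-2).
d = [larger_is_better(99.0, 95.0, 105.0), target_is_best(1.6, 1.0, 1.5, 2.0)]

overall = 1.0
for di in d:
    overall *= di
overall **= 1.0 / len(d)          # geometric mean; any d_i = 0 drives the index to 0
print(d, overall)
```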

How variation in the production conditions affects desirability

JMP allows us to study the consequences of variability in the factor levels. Based on past experience, the production team was convinced that the inputs would follow independent normal distributions about their target settings, with standard deviations of 3 for Temp, 0.6 for Blending Time, and 30 for Cooling Time. Figure 3 shows how this input variability is transferred to the eight responses and to the overall desirability index via the statistical models from the experiments. JMP allows us to simulate responses and visualize how the variability in factor-level settings affects the variability in response. Viscosity and In-Vitro Upper show the smallest variability relative to the experimental range.

Finding the design space

To conclude the analysis, we applied the JMP Contour Profiler to the experimental data, fitting a model with main effects and two-factor interactions. The overlay surface is limited by the region where In-Vitro Upper is above 133, which is not acceptable. As a design space, we identified operating regions with blending time below 2.5 minutes, cooling time above 150 minutes, and temperature ranging from 60-75 degrees Celsius. Once approved by the regulator, these operating regions are defined as the normal range of operation. Under QbD, operating changes within this region do not require preapproval, only post-change notification. This change in regulatory strategy is considered a breakthrough relative to traditional inspection doctrines and provides significant regulatory relief for the manufacturer.

Monitoring production quality

An essential component of QbD submissions to the FDA is the design of a control strategy. Control is established by determining expected results and tracking actual results in the context of expected results. The expected results are used to set up upper and lower control limits as well as to present the design space (see Figure 4). A final step in a QbD submission is to revise the risk assessment analysis. At this stage, the experts agreed that, with the defined design space and an effective control strategy accounting for the variability presented in Figure 3, all risks in Table 1 have been reset as low.

Figure 4: Contour Profiler in JMP with overlap of eight responses

References

  • Derringer, G., and Suich, R. (1980), "Simultaneous Optimization of Several Response Variables," Journal of Quality Technology, 12(4), 214-219.
  • Kenett, R.S., and Zacks, S. (2014), Modern Industrial Statistics with Applications in R, MINITAB and JMP, Wiley.

Improvements to Design of Experiments book in JMP 12

In JMP 12, you will find new examples, improved organization and enhanced content in the Design of Experiments (DOE) book. You can find this book under the Help menu: Help > Books > Design of Experiments Guide.

New Examples

Baristas at Counter Culture Coffee in Durham, NC, helped with our coffee experiment.

In the Starting Out chapter, a new example describes an experiment carried out at Counter Culture Coffee, a local coffee roaster in Durham, North Carolina.

With help from the baristas, the experiment was conducted by two of our DOE developers, Bradley Jones and Ryan Lekivetz, and the manager of the documentation team, Sheila Loring. The goal of the experiment was to determine which factors had an effect on the strength of brewed coffee and to optimize the strength.

We walk you through the design and analysis of the experiment, from start to finish, following the framework for experimental design (describe, specify, design, collect, fit, predict).

The chapter on the powerful Custom Design platform has been completely revised, along with the chapters on Definitive Screening Designs and Evaluate Designs.

We’ve added new examples and enhanced existing examples throughout, but particularly in the Examples of Custom Designs chapter. In that chapter, you might find the examples dealing with covariates and randomization restrictions of particular interest. In cases where a design is created using a random seed, we have added that information so that you can recreate the exact design that appears in the documentation.

Improved Organization

We noticed that our users could benefit from a more streamlined organization of the content. We developed the following outline for the chapters:

  • Introduction – Decide whether this platform is right for your needs
  • Overview – Understand what the platform does
  • Example – Learn how to use the platform
  • Window Descriptions – Understand the process of implementing DOE in this platform
  • Option Descriptions – Find out what the features in the platform do
  • Technical Details – Learn more about implementation details like formulas and algorithms that apply to the platform

You might be new to DOE and want to walk through how to use a platform. Or, you might be an experienced user looking for more detail about a particular feature. Whatever your background or need, our goal in restructuring the content is to enable you to find what you are looking for quickly.

Enhanced Content

Not everyone is familiar with design of experiments. With new users in mind, we added:

  • A short introductory chapter that briefly explains each DOE platform.
  • A comprehensive Starting Out chapter that covers the workflow of DOE and includes an in-depth example to walk new users through the process of using DOE.
  • Even more examples to step users through different types of scenarios they might encounter or specific features they may want to use in DOE. Examples are an efficient way to help users decide if a particular approach is right for them, and to guide them through how to use it.

To meet the needs of all users, we identified concepts whose descriptions could be improved with further explanation and added more details accordingly.

We hope that these improvements will enhance your experience with DOE and help you achieve your goals. For JMP 12, we were able to add these enhancements to several chapters in the DOE book, and our goal is to continue improving the rest of the book.

We would love to hear your feedback. Please let us know your thoughts on the changes and share your suggestions for improvement.


Receive this message from Phil Simon

Do you ever take the time to write book reviews on Amazon? I’ve thought about it many times, but I never seemed to find the time to follow through — until I read Phil Simon’s latest book, Message Not Received: Why Business Communication is Broken and How to Fix It. The more people who read this book, the better off we will all be.

Many of Phil’s books have compelling reviews. This excerpt from a five-star Amazon review by Jack Spain of The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions sums it up: “…we must:

  • Value and appreciate the power of speaking with data
  • Unleash the power of the data we have access to
  • Improve our ability to communicate succinctly, effectively, and with the right context on a timely basis
  • Increase the leverage of our data resources with innovative data visualization tools on the market today.”

Multitalented JMP colleague Melinda Theilbar, Senior Research Statistician Developer, got her own paragraph in the acknowledgements for The Visual Organization for her contributions to chapter 6, part of which you can see here, from page 122:

[Figure: graph from page 122 of The Visual Organization]

If you join us for Analytically Speaking with Phil Simon on May 27 — and we hope you will — you will hear more about his latest books and a variety of other topics for which Phil is a sought-after consultant and speaker — strategy, technology, organizational culture, communication, big data, and more. For a small sampling of the kinds of things Phil has to say, check out his guest blog post last year on numeracy. If you are unable to join the live webcast, you can always view the archived version.

Wide data discriminant analysis

With multivariate methods, you have to do things differently when you have wide (many thousands of columns) data.

The Achilles heel of the traditional approaches for wide data is that they start with a covariance matrix, which grows with the square of the number of columns. Genomics research, for example, tends to have wide data:

  • In one genomics study, there are 10,787 gene expression columns, though there are only 230 rows. That covariance matrix has more than 58 million elements and takes 465MB of memory.
  • Another study we work with has 49,041 genes on 249 individuals, which results in a covariance matrix that uses 9.6GB of memory.
  • Yet another study has 1,824,184 gene columns, yielding a covariance matrix of 13.3TB. Computers just don’t have that much memory – in fact, they don’t have that much disk space.

Even if you have the memory for these wide problems, it would take a very long time to calculate the covariance matrix, on the order of n·p², where n is the number of rows and p is the number of columns. Furthermore, it gets even worse when you have to invert the covariance matrix, which would cost on the order of p³. In addition, the covariance matrix is very singular, since there are far fewer rows than columns in the data.

So you would think that multivariate methods are impractical or impossible for these situations. Fortunately, there is another way, one that makes wide problems not only practical, but reasonably fast while yielding the same answers that you would get with a huge covariance matrix.

The number of effective dimensions in the multivariate space is on the order of the number of rows, since that is far smaller than the number of columns. All the variation in that wide matrix can be represented by the principal components, those associated with positive variance, i.e., non-zero eigenvalues. There is another way to get the principal components using the singular value decomposition of the (standardized) data. Now that you are in principal component space, the dimension of these principal components is small; in fact, all the calculations simplify because the covariance matrix of principal components is diagonal, which is trivial to invert.
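
Here is a sketch of that SVD route on simulated data of roughly the first study's shape. This is just the underlying linear algebra, not JMP's implementation: the p-by-p covariance matrix is never formed.

```python
# Principal components of a wide matrix via the thin SVD of the standardized data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 230, 10_787                                     # rows, columns (p >> n)
X = rng.standard_normal((n, p))                        # simulated stand-in for expression data

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)       # standardize each column
U, s, Vt = np.linalg.svd(Z, full_matrices=False)       # U: n x n, s: n, Vt: n x p

scores = U * s                                         # principal component scores (n x n)
variances = s**2 / (n - 1)                             # eigenvalues of the covariance matrix
print(scores.shape, variances[:5])
```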

For example, let’s look at that genomics data set with 1,824,184 columns across 249 rows. I was able to do the principal components of that data using JMP 12 on my 6-core Mac Pro in only 94 minutes. Doing it the other way would clearly have been impossible.

With discriminant analysis, the main task is to calculate the distance from each point to the (multivariate) mean of each group. However, that distance is not the simple Euclidean distance, but a multivariate form of the distance, one called the Mahalanobis distance, which uses the inverse covariance matrix.

But think about the standardized data – the principal components are the same data, but rotated through various angles. The difference is that the original data may have 10,000 columns, but the principal components only have a few hundred, in the wide data case. So if we want to measure distances between points or between a point and a multivariate mean (centroid), then we can do it in far fewer dimensions. Not only that, but the principal coordinates are uncorrelated, which means that the matrix in the middle of the Mahalanobis distance calculations is diagonal, so the distance is just a sum of squares of individual coordinate distances.

We take the principal components of the mean-centered data by group. Then we add back the mean to this new reduced-dimension data. The transformed mean is just the principal components transformation of the original group means.
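
Continuing that sketch, the code below classifies each row by its distance to the nearest group centroid in principal-component space, where the Mahalanobis distance reduces to a variance-scaled sum of squares. It is a simplification of the wide linear method described here (for instance, it uses overall principal components rather than group-mean-centered ones) on simulated data.

```python
# Discriminant-style classification in principal-component space for wide data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 5_000
X = rng.standard_normal((n, p))
groups = np.array([0] * 30 + [1] * 30)
X[groups == 1, :10] += 2.0                               # give group 1 some real signal

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
keep = s > 1e-8                                          # components with non-zero variance
scores = (U * s)[:, keep]
scale = s[keep] ** 2 / (n - 1)                           # diagonal "covariance" of the scores

centroids = np.vstack([scores[groups == g].mean(axis=0) for g in (0, 1)])
d2 = ((scores[:, None, :] - centroids[None, :, :]) ** 2 / scale).sum(axis=2)
print((d2.argmin(axis=1) == groups).mean())              # resubstitution accuracy on toy data
```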

Example

There is a set of breast cancer gene expression data, a benchmark used by the Microarray Quality Consortium, with 230 rows and 10,787 gene expression columns. A discriminant analysis using the wide linear method takes only 5.6 seconds (on the Mac Pro) and produces a cross-validation misclassification rate of only 13.7% on the hardest-to-classify category, ER_status. This misclassification rate is close to the best achieved for this problem, which is notable given how expensive the other methods are.

[Figure: wide linear discriminant analysis results for the breast cancer data]

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.


Creating a custom map for my workout data in JMP 12

Custom map of female human body showing muscle groups

Using an add-in, I created this custom map in JMP. See below for how to make it.

In the previous post in this fitness and food series, I showed some examples of how I have been using the Local Data Filter to customize graphs of my workout data. JMP 12 offers a new alternative to list-based data filters in the form of selection filters: graphs that can act as filters for one or more other graphs. JMP developer Dan Schikore posted about this new capability and showed a few examples that used geographic maps as filters. I will be presenting visualizations of my workout data at Discovery Summit 2015 in San Diego, and Dan will be presenting a poster on selection filters. We hope you will come to the conference to learn more about this feature and all the other cool additions made for JMP 12!

When I mentioned to JMP visualization expert Xan Gregg that I wanted to use selection filters in my workout data project, he told me about an add-in in the File Exchange in the JMP User Community called the Custom Map Creator, described in this post by Justin Mosiman. The Custom Map Creator simplifies the process of creating a set of new map shape files that JMP can use to define objects or areas. Justin blogged about creating a custom map of the floor containing JMP staff offices. JMP testing manager Audrey Ventura later blogged about combining this map with other information to understand various factors that influenced temperature patterns on the JMP development office floor.

Xan thought I could use Justin's add-in to make a custom muscle structure map for visualizing and filtering my workout data. After I downloaded and ran the Custom Map Creator, a JMP graph window opened. I located a picture of female muscle structures, saved it and dragged the picture into the graph window. Then I clicked to define shapes like "Shoulders" and "Back." As I clicked to define the various muscle areas, the add-in automatically generated the shape file and X-Y coordinate file that JMP requires to define a custom map. Within an hour, I had generated a detailed custom map shape file defining front and back views of muscle areas. A picture of my completed map shape file is shown below.

[Figure: completed custom muscle map shape file, front and back views]

In retrospect, I could have been more specific with the muscle structures I traced, for example, outlining individual back muscles or creating different shapes for the front, back and side of the shoulder area. However, I thought that a map with muscle shape areas for all primary muscle areas that I work in the gym was a great starting point!

To make my custom map visible to JMP, I placed the shape files created by the add-in into the folder C:\Users\~myusername\AppData\Roaming\SAS\JMP\Maps.

In a previous post, I described how I used a JMP formula column to calculate a Total Weight Lifted metric for every row in my workout data table. Total Weight Lifted is the product of the amount of weight, number of weights used, repetitions and sets, and can be summarized at various levels, from row, to exercise, to primary body part and body zone. My data table already included primary body part as a grouping variable, which enabled me to match up my data with the shapes defined by my custom map. First, I created a graph of primary body part colored by the proportion of total weight lifted. To make this graph, I opened a new Graph Builder window, then performed the following steps:

  • Dragged Total Weight Lifted to the Color variable zone on the right
  • Chose % of Total as my Summary Statistic under the Map Shapes properties section
  • Dragged Primary Body Part into the Map Shape zone in the lower left part of the graph
  • Right clicked on the legend, chose Gradient, and clicked on the color bar next to Color Theme to select Muted Yellow to Red

[Figure: Graph Builder setup for the body map colored by % of total weight lifted]

To finalize the graph, I clicked Done in the Graph Builder window. This graph summarizes across all the data on my workouts that I have entered so far, and it's obvious that I lifted the greatest percentage of total weight during back exercises, followed by chest and shoulders, then various lower body areas, with smaller percentages attributable to calf and arm exercises. The proportions of weight lifted in this overall graph will likely change as I continue to enter data on older workouts where I lifted more weight for lower body exercises.

By adding year as a wrap variable and using a Local Data Filter to restrict the months and years shown, I can create custom views, like the one below that contrasts the proportion of weight I lifted for each body area during six different Januaries. This graph shows me that in 1999 and 2000, I was lifting a greater percentage of total weight when training quadriceps and calves. I focus much less on those body areas in recent years, as my focus has shifted to shoulders, chest and back.

[Figure: body maps colored by % of total weight lifted, wrapped by year for six Januaries]

Stay tuned for my next blog post, where I will show some examples of how I used my custom muscle map as both a selection filter and a target for selection filtering by other graphs.
