Wide data discriminant analysis

With multivariate methods, you have to do things differently when you have wide (many thousands of columns) data.

The Achilles' heel of the traditional approaches on wide data is that they start with a covariance matrix, which grows with the square of the number of columns. Genomics research, for example, tends to have wide data:

  • In one genomics study, there are 10,787 gene expression columns but only 230 rows. That covariance matrix has more than 58 million unique elements (it is symmetric) and takes 465 MB of memory.
  • Another study we work with has 49,041 genes on 249 individuals, which results in a covariance matrix that uses 9.6 GB of memory.
  • Yet another study has 1,824,184 gene columns, yielding a covariance matrix of 13.3 TB. Computers just don’t have that much memory – in fact, they don’t have that much disk space.

Even if you have the memory for these wide problems, it would take a very long time to calculate the covariance matrix, on the order of n·p² operations, where n is the number of rows and p is the number of columns. Furthermore, it gets even worse when you have to invert the covariance matrix, which costs on the order of p³. In addition, the covariance matrix is singular, since there are far fewer rows than columns in the data.

So you would think that multivariate methods are impractical or impossible for these situations. Fortunately, there is another way, one that makes wide problems not only practical, but reasonably fast while yielding the same answers that you would get with a huge covariance matrix.

The number of effective dimensions in the multivariate space is on the order of the number of rows, since that is far smaller than the number of columns. All the variation in that wide matrix can be represented by the principal components, those associated with positive variance, i.e., non-zero eigenvalues. And the principal components can be obtained directly from the singular value decomposition (SVD) of the (standardized) data, without ever forming the covariance matrix. Once you are in principal component space, the dimension is small; in fact, all the calculations simplify because the covariance matrix of the principal components is diagonal, which is trivial to invert.
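
To make that concrete, here is a minimal sketch of the idea in Python (not JSL, and not JMP's implementation); the data sizes echo the first study above, and the random matrix is just a stand-in:

```python
import numpy as np

# Principal components for wide data (n rows << p columns) via the thin SVD,
# so the p-by-p covariance matrix is never formed.
rng = np.random.default_rng(0)
n, p = 230, 10_787                       # rows, columns (stand-in for the study above)
X = rng.normal(size=(n, p))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize the columns
U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # thin SVD: no p-by-p matrices anywhere

scores = U * s                    # principal component scores: at most n columns
variances = s**2 / (n - 1)        # the non-zero eigenvalues of the covariance matrix

# The covariance of the scores is diagonal (up to rounding), hence trivial to invert.
print(np.allclose(np.cov(scores, rowvar=False), np.diag(variances), atol=1e-8))
```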

For example, let’s look at that genomics data set with 1,824,184 columns across 249 rows. I was able to do the principal components of that data using JMP 12 on my 6-core Mac Pro in only 94 minutes. Doing it the other way would clearly have been impossible.

With discriminant analysis, the main task is to calculate the distance from each point to the (multivariate) mean of each group. However, that distance is not the simple Euclidean distance, but a multivariate form of it called the Mahalanobis distance, which uses the inverse covariance matrix: the squared distance from a point x to a mean μ is (x − μ)′ S⁻¹ (x − μ), where S is the covariance matrix.

But think about the standardized data – the principal components are the same data, just rotated through various angles. The difference is that the original data may have 10,000 columns, while the principal components number only a few hundred in the wide data case. So if we want to measure distances between points, or between a point and a multivariate mean (centroid), we can do it in far fewer dimensions. Not only that, but the principal coordinates are uncorrelated, which means that the matrix in the middle of the Mahalanobis distance calculation is diagonal, so the distance is just a sum of squares of individual coordinate distances.

We take the principal components of the data after centering it by the group means, then add the means back to this new reduced-dimension data. The transformed group means are just the principal component transformation of the original group means.
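
Putting the pieces together, here is a hedged sketch (again Python, not JMP's wide linear implementation) of classifying points by their Mahalanobis distance to each group centroid in principal component space, assuming the scores and variances come from the group-mean-centered principal components described above:

```python
import numpy as np

def classify(scores, variances, labels, new_scores):
    """scores: n-by-k PC scores; variances: k PC variances; labels: n group labels."""
    groups = np.unique(labels)
    centroids = np.array([scores[labels == g].mean(axis=0) for g in groups])
    # Mahalanobis distance with a diagonal covariance is just a weighted sum of squares.
    d2 = ((new_scores[:, None, :] - centroids[None, :, :]) ** 2 / variances).sum(axis=2)
    return groups[d2.argmin(axis=1)]      # assign each point to the nearest group
```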

Example

There is a set of breast cancer gene expression data, a benchmark used by the Microarray Quality Consortium, with 230 rows and 10,787 gene expression columns. A discriminant analysis using the wide linear method takes only 5.6 seconds (on a Mac Pro) and produces a cross-validation misclassification rate of only 13.7% on the hardest-to-classify category, ER_status. That misclassification rate is close to the best achieved for this problem – and the other methods are expensive.


Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.


Creating a custom map for my workout data in JMP 12

Custom map of female human body showing muscle groups

Using an add-in, I created this custom map in JMP. See how to make this below.

In the previous post in this fitness and food series, I showed some examples of how I have been using the Local Data Filter to customize graphs of my workout data. JMP 12 offers a new alternative to list-based data filters in the form of selection filters, or graphs that can act as filters for one or more other graphs. JMP developer Dan Schikore posted about this new capability and showed a few examples that used geographic maps as filters. I will be presenting visualizations of my workout data at Discovery Summit 2015 in San Diego, and Dan will be presenting a poster on selection filters. We hope you will come to the conference to learn more about this feature and all the other cool additions made for JMP 12!

When I mentioned to JMP visualization expert Xan Gregg that I wanted to use selection filters in my workout data project, he told me about an add-in in the File Exchange in the JMP User Community called the Custom Map Creator, described in this post by Justin Mosiman. The Custom Map Creator simplifies the process of creating a set of new map shape files that JMP can use to define objects or areas. Justin blogged about creating a custom map of the floor containing JMP staff offices. JMP testing manager Audrey Ventura later blogged about combining this map with other information to understand various factors that influenced temperature patterns on the JMP development office floor.

Xan thought I could use Justin's add-in to make a custom muscle structure map for visualizing and filtering my workout data. After I downloaded and ran the Custom Map Creator, a JMP graph window opened. I located a picture of female muscle structures, saved it and dragged the picture into the graph window. Then I clicked to define shapes like "Shoulders" and "Back." As I clicked to define the various muscle areas, the add-in automatically generated the shape file and X-Y coordinate file that JMP requires to define a custom map. Within an hour, I had generated a detailed custom map shape file defining front and back views of muscle areas. A picture of my completed map shape file is shown below.


In retrospect, I could have been more specific with the muscle structures I traced, for example, outlining individual back muscles or creating different shapes for the front, back and side of the shoulder area. However, I thought that a map with shapes for all the primary muscle areas that I work in the gym was a great starting point!

To make my custom map visible to JMP, I placed the shape files created by the add-in into the folder C:\Users\~myusername\AppData\Roaming\SAS\JMP\Maps.

In a previous post, I described how I used a JMP formula column to calculate a Total Weight Lifted metric for every row in my workout data table. Total Weight Lifted is the product of the amount of weight, number of weights used, repetitions and sets, and can be summarized at various levels, from row, to exercise, to primary body part and body zone. My data table already included primary body part as a grouping variable, which enabled me to match up my data with the shapes defined by my custom map. First, I created a graph of primary body part colored by the proportion of total weight lifted. To make this graph, I opened a new Graph Builder window, then performed the following steps:

  • Dragged Total Weight Lifted to the Color variable zone on the right
  • Chose % of Total as my Summary Statistic under the Map Shapes properties section
  • Dragged Primary Body Part into the Map Shape zone in the lower left part of the graph
  • Right clicked on the legend, chose Gradient, and clicked on the color bar next to Color Theme to select Muted Yellow to Red

Body shape colored by weight lifted setup

To finalize the graph, I clicked Done in the Graph Builder window. This graph summarizes across all the data on my workouts that I have entered so far, and it's obvious that I lifted the greatest percentage of total weight during back exercises, followed by chest and shoulders, then various lower body areas, with smaller percentages attributable to calf and arm exercises. The proportions of weight lifted in this overall graph will likely change as I continue to enter data on older workouts where I lifted more weight for lower body exercises.

By adding year as a wrap variable and using a Local Data Filter to restrict the months and years shown, I can create custom views, like the one below that contrasts the proportion of weight I lifted for each body area during six different Januaries. This graph shows me that in 1999 and 2000, I was lifting a greater percentage of total weight when training quadriceps and calves. I focus much less on those body areas in recent years, as my focus has shifted to shoulders, chest and back.

Body shape colored by weight lifted by year

Stay tuned for my next blog post, where I will show some examples of how I used my custom muscle map as both a selection filter and a target for selection filtering by other graphs.


The QbD Column: Overview of Quality by Design

Developing new drugs is a complex, lengthy and expensive endeavor. When the process leads to an approved drug, the result is improved patient care and great benefits for the developers. But many promising drugs never live up to expectations. The US Food and Drug Administration (FDA), observing that new drug approvals were decreasing and development costs were steeply increasing, reviewed the drug development process. In an unusual industry-government partnership, the FDA, with support from major pharmaceutical companies, launched a Quality by Design (QbD) initiative to help streamline the drug development and approval process. Since statistically designed experiments and general multivariate methods play a central role in QbD, it is a fertile arena for statistical applications.

This blog post, the first in a series dedicated to issues related to Quality by Design, gives a broad overview of QbD. Subsequent blog posts will go into detail on some of the statistical issues in QbD and specifically how JMP can help to solve them.


Using the Interactive HTML Profiler in JMP 12

The Interactive HTML Profiler output is new to the latest version of JMP. It is meant to replace and improve upon the interactive output provided by the Flash (SWF) export facility in previous versions of JMP, one major advantage being that the HTML5 technology used is supported on mobile devices such as the iPad whereas Flash is not.

If you are unfamiliar with the Profiler platform, here's what you need to know: It enables you to explore cross sections of predicted responses across multiple factors. The Profiler guide in JMP documentation provides further details on how to use this platform. Interactive HTML is a feature we introduced in JMP 11, and it is supported on most modern browsers. For more information on Interactive HTML export capabilities in general see Save the Report as Interactive HTML.
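
To make the idea of a profiler trace concrete, here is a bare-bones sketch in Python (not JSL, and not how JMP computes it): hold every factor at its current setting, sweep one factor across a grid, and evaluate the prediction formula.

```python
import numpy as np

# One profiler trace: predicted response along one factor, all others held fixed.
def profile_trace(predict, current_settings, factor_index, grid):
    x = np.tile(current_settings, (len(grid), 1))  # repeat the current settings per grid point
    x[:, factor_index] = grid                      # vary only the chosen factor
    return predict(x)

# Toy prediction formula (a stand-in for any fitted model): 10 + 3*x0 - 2*x1
predict = lambda X: 10 + 3 * X[:, 0] - 2 * X[:, 1]
print(profile_trace(predict, current_settings=[1.0, 0.5],
                    factor_index=0, grid=np.linspace(0, 2, 5)))
```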

Give it a try yourself! Export an Interactive HTML Profiler and open it in a web browser. You'll see that you can hover your mouse over the reference lines or the curves themselves on the desktop in order to display tooltips that show the values at that point while in identification mode (see Heman Robinson's recent post for a description of the mobile interface and the different modes). This example was prepared using the Diamonds Data.jmp sample data set.


An Interactive HTML Profiler displaying a tooltip for the curve Pred Formula Price for the continuous factor Carat Weight.

Interacting with a curve of the Interactive HTML Profiler sets its reference variable and updates all related profile curves. Clicking anywhere within the graph (or touching on a mobile device) moves the reference variable to that value, or you can click and drag a reference line to update the curves. The Profiler behaves a little differently than other Interactive HTML plots in that the reference lines can be dragged in both identification and brushing modes.

Alternatively, you can click on the edit box for continuous variables, or the dropdown box for categorical variables, below the chart to edit the value manually. This allows you to enter a precise value or, for continuous variables, to explore the curve outside of the currently displayed range of values. For mobile devices, this method of entry is particularly useful, since a precise value is harder to achieve with a finger than with a mouse.


Selecting a value for a categorical factor from the dropdown control.

In addition to these two types of data, the Interactive HTML Profiler can handle simple mixture constraints from JMP, just as Flash could. When this type of data is present, additional controls appear that let you lock and unlock a constrained variable. The following Profiler, using the Plasticizer.jmp sample data, has a mixture constraint that the three factors must sum to 1.


Mixture constraint profiler showing unlock/lock factor toggle controls.

The unlock/lock toggle button pairs allow you to lock a factor in this mixture at its current value so that it does not change when you change the other factors. For example, if you were to click the lock button below the factor p1, the reference line becomes a solid red line and the lock button is toggled down to indicate this factor is locked. Note how the range of the curves for p2 and p3 becomes further restricted since p1 is locked at 0.6615, leaving only 0.3385 for the other two factors to share -- all three must still sum to 1 (see the figure below).


By clicking the lock button, the factor is toggled to the locked state.

You can still click and drag p1, but the other factors will be constrained with p1 fixed at the new value.

The Profiler provides a menu next to its collapsible outline box, which allows you to reset the Profiler to its initial state. This is especially useful in the new embedded Profiler (described under New Features below) if you have collapsed other sections of the report or selected markers in other charts and don’t want to lose these changes by refreshing your browser. Just click the menu item next to the Profiler and select “Reset.”


Resetting a Profiler is a useful way to get back to the initial state without refreshing the entire webpage.

New Features Compared to Flash

Interactive HTML Profilers cover the same features as the Flash version (which is still present in JMP 12). In addition to using HTML5, which makes these Profilers available on mobile devices, we included a few new features that weren’t present in Flash.

Mobile-Friendly User Interface

A mobile-accessible interactive report requires a mobile-friendly user interface, and Interactive HTML reports were designed in JMP 12 for ease of use on tablets and phones. See Using JMP 12 Interactive HTML reports on mobile devices for more details on how these reports work on mobile.

Embedded Profiler in Fit Model Least Squares Platform

While Flash allowed you to export only the Profiler Platform, Interactive HTML also allows you to export a Fit Model Least Squares report with an interactive Profiler alongside the other interactive tables and charts.

Remembered Settings

The Fit Model report pictured below also shows exported remembered settings, another new feature in the Interactive HTML Profiler. While preparing a Profiler in JMP, if you find a combination of factor settings that is interesting, you can use the Remember Settings feature under Factor Settings in JMP to save them to a table below the Profiler. In the exported Interactive HTML, these values can be applied to your Profiler by clicking the Apply button to the left of the row of values, as seen in the picture below.


Fit Model results for a famous mixture experiment.

The Profiler provides a wealth of information about your model. And in JMP 12, exporting to Interactive HTML lets you communicate with colleagues who don’t have JMP yet. The best way to get familiar with this exciting new platform on the web is to try a few examples: JMP 12 Profiler Examples. The pictures in this blog post are available there as live Interactive HTML files to explore, as well as a few other examples.


Handling outliers at scale

In an earlier blog post, we looked at cleaning up dirty data in categories. This time, we look at cleaning dirty data in the form of outliers for continuous columns.

In industry, it’s not unusual to have most of your values in a narrow range (for example between .1 and .7) with a few values well outside that range (such as 99, which is the missing value code). Some of the most common examples of outliers include:
• Missing value codes, often a string of 9s.
• Obviously miscoded, manually entered values, especially when the decimal point is misplaced.
• Instrument failures, which result in an instrument failure code inserted where the data should be.
• Unusual events, such as an electrical short causing a current surge or a voltage dropout.
• Long-tailed distributions, such as time-to-failure distributions, where large values may be rare, but real.
• Contamination outliers, in which a few items have crept into a collection of items that was presumed to have been clean.

If we had only a few columns to explore and clean up, it would be reasonable to do it by hand. However, with dozens, or thousands of columns, you need some help. The “Explore Outliers” facility in Cols->Modeling Utilities is the new tool to use.

The most straightforward approach is to get some tail quantiles and then flag values beyond some multiple of the interquantile range past those tail quantiles. The ANSUR Anthropometry data provides a good illustration; it contains 131 body measurements on 3,982 individuals. Before analyzing this data, you need to first check for and be aware of outliers.

For each selected variable, Explore Outliers first calculates the 10% and 90% quantiles to determine a good estimate for the range of most of the data. If there is a huge number of ties, meaning that the 10% and 90% quantiles are the same, the utility reaches further into the tail to obtain a value. It then multiplies the interquantile range by 3 (the default ‘Q’ scaling factor) and looks for any values that are further out from the 10% and 90% quantiles than three times the interquantile range. This is a pretty extreme default outlier selection criterion, but you are probably most interested in the most remote outliers, at least at first. You can adjust the quantiles and multiplier later to be more sensitive.
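
Here is a rough sketch of that rule for a single column, in Python rather than JSL; the function name and the simulated data are illustrative only, not part of JMP:

```python
import numpy as np

# Quantile-based outlier rule: flag values more than q interquantile ranges
# beyond the low/high tail quantiles.
def quantile_outliers(x, low=0.10, high=0.90, q=3.0):
    x = np.asarray(x, dtype=float)
    q_low, q_high = np.nanquantile(x, [low, high])
    spread = q_high - q_low                  # the interquantile range
    return (x < q_low - q * spread) | (x > q_high + q * spread)

rng = np.random.default_rng(1)
values = np.append(rng.uniform(0.1, 0.7, size=200), 99.0)  # 99 plays the missing-value code
print(np.where(quantile_outliers(values))[0])              # flags index 200, the 99
```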

The default scan of the ANSUR data revealed two columns with a value of -999, which is more than three interquantile ranges below the 10% quantile. In this case, it is obvious that they are missing-value codes, not real data.

Since the other 129 variables did not show outliers, we can more easily focus on the columns that did by checking “Show only columns with outliers”. Selecting these columns in the display will select the columns in the data table.

Next, we need to determine which action to take to address these outliers. There are six action buttons. You can select the rows and use the Rows command “Next Selected” to advance the data table view to each selected row. You can color the rows to emphasize that the row had an outlier. You can color the cells to make them stand out as you browse the data table, as shown below. You can also exclude the rows to hold them back from an analysis.

Exploring Outliers in JMP 12

In this case, since the outliers really represent missing data, the two remaining actions present the obvious choices: Either click “Add to missing value codes”, which installs a “missing value code” column property for that column, or click “Change to missing”, which changes the value to a hardware (NaN) missing value that JMP always recognizes as missing. You might want to use the former button if there are multiple missing value codes that have distinct meanings.


We also check for high-nines values, since these are often used as missing value codes; but here, none of the high-nines values found are far from the upper tail of the data, so they are not likely to represent missing values.

Sometimes outliers are outlying in a multivariate sense, meaning that they may be nicely in the quantile range of each column, but the data is in clusters, and some points are far away from other points. Using Fisher’s Iris data as an example, we see that there are really three clusters, representing the three species. Because the species are widely separated, there is plenty of near-empty space where an outlier can be distant from all the other points, even though not a univariate outlier. We have the k-nearest-neighbor facility to find these outliers. For each point, it finds the distance of that point to the nearest point, the second-nearest point, the third-nearest point, and so on. Here, we identify a point that is far away from the nearest point.
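
The same idea in sketch form (Python, not JMP's implementation): compute, for each point, the distance to its k-th nearest neighbor, and look for points where that distance is unusually large.

```python
import numpy as np

# Distance from each point to its k-th nearest neighbor (brute force, fine for small data).
def knn_distance(X, k=1):
    X = np.asarray(X, dtype=float)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # pairwise distances
    return np.sort(d, axis=1)[:, k]      # column 0 is the zero distance to itself

# Two tight clusters plus one isolated point between them: the isolated point
# is not extreme on either coordinate, yet its nearest-neighbor distance is largest.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2)), [[1.5, 1.5]]])
print(knn_distance(X, k=1).argmax())     # 100, the index of the isolated point
```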


If we look at this point in a Scatterplot Matrix, we see that it is not unusual in any of the single coordinates, but it is far away from other points. It is far away in Sepal width when considering its location with respect to Petal length and Petal width. We happen to know the identities of these clusters, and the unusual feature is that it has a very low Sepal width compared to others in its species.


To continue, we might have multiple outliers together, i.e., points that are not far from their nearest neighbors but are still far from the rest of the data. Here is the distance of each point from its third-nearest neighbor.


If you look at the Scatterplot Matrix of the two blue k=3 outliers, focusing on the Sepal width by Sepal length scatterplot, you see that they are near each other but separate from the other points.


Thus, multivariate outliers can be explored and considered to discern what is happening. Outliers in the k-nearest-neighbor sense can arise in contamination situations, for example, if an iris from a totally separate species were mixed in and misidentified in the Fisher Iris data.

What to do about outliers depends upon the situation, though some actions are clear. If you have missing value codes, you should add a missing value code property so that the analysis platforms will treat it as missing. If you have a misplaced decimal point, you should correct it if it is a clear case, or change it to a missing value if it is not clear.

Sometimes finding outliers can lead to important discoveries. They shouldn’t automatically be treated like dirt that doesn’t fit.

This topic was also addressed in a previous blog post.

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.


Using Local Data Filter to customize workout graphs in JMP 12

My latest quantified self data visualization project uses JMP 12 to clean up and visualize my workout data. As I described in my previous post, I cleaned up exercise names in my data table using the JMP 12 Recode platform. I also used Recode to create hierarchical exercise groupings, assigning exercises to primary body areas, and body areas to general body zones, including upper body, lower body and core. (Check out other recent posts on JMP 12 Recode updates.)

The stacked bar chart below summarizes all of the January weight workout data that I recorded in my workout notebooks, showing the total weight I lifted (calculated by a formula column) by body zone. Of course, I have data for other months and years, but I thought I would start my data entry project with January, the traditional month for exercise-related New Year's resolutions. I found weight training data in my notebooks for 13 different Januaries, spanning the periods 1999-2005 and 2010-2015.

Below, I used the JMP Local Data Filter to restrict the graph to show only data from January workouts and only exercises I classified as Lower Body or Upper Body exercises. I excluded two other exercise categories: Core, including mainly body weight exercises that contributed very little to my total weight lifted metric, and rows where exercise name was missing, which were cardio workouts.

January total weight lifted by body zone LDF

To recreate this graph:

  • Drag Total Weight Lifted to the Y axis. (For details on how I calculated this metric with a formula column, see my previous post.)
  • Drag Year to the X axis, then drag Month above it to create a nested axis. You can create Year and Month columns from a continuous date variable as Transform columns within a launch dialog, or by right-clicking on a continuous Date column and choosing New Formula Column > Date > Year, then repeating these steps to create a Month Abbr column.
  • Drag Body Zone into the Graph Builder Overlay drop zone. I customized my colors by adding Value Colors as a column property after double-clicking on the Body Zone column.

I added a Local Data Filter by choosing that option from the Graph Builder red triangle Script menu. Local Data Filter is also available in many other JMP platforms, and you can use it to filter your graphs without affecting the underlying row states in your table.

The body zone bar graph I created gave me an initial glimpse of how much my lifting patterns have shifted over time. In 1999, I was working out very regularly at a gym, usually following a body part split routine with a three-day rotation that included a heavy leg training day. In fact, lower body exercises accounted for ~60 percent of the total weight I lifted in January 1999, with the main contribution coming from heavy leg presses and calf raises, as you can see below in the bar chart by exercise. I usually performed sets of leg presses and calf raises one after the other. Between the heavy weights, multiple sets, and a typical rep range of 15-20, the total weight lifted numbers for these lower body exercises added up quickly.

Stacked bar by exercise for Jan 1999

My focus on lower body training had lessened by January 2000, and I reduced my leg training even more in the following year as I began working out almost exclusively at home with free weights and a Bowflex. Although my husband and I picked up additional workout equipment over the next few years, I never went back to heavy leg training. As life got busy, I didn't use the equipment that we had as consistently either. This is quite obvious from the drop in my total weight lifted metric starting in January 2001, as I spent less and less time working out during graduate school and early parenthood. I became pregnant in early 2004 and lifted the least weight ever in a January in 2005 -- except of course for the years 2006-2009, when I didn't record any workouts at all during the first month of the year.

Below is a graph where I cleared the Local Data Filter restriction on Month and showed the data I've entered so far. In recent years, most of the weight I lift is attributable to upper body exercises. While my total weight lifted in December 2014 was close to the total amount I lifted in January 1999, lower body exercises now account for only ~20 percent of my total weight lifted metric.

Totals by body zone all data 3-10-15

I also used Local Data Filter to customize my graphs to show the maximum amount of weight I lifted for a subset of exercises by year. I decided to focus on my first two years back to consistent training (1999 and 2000) and my last two years (2013 and 2014) of data. I chose a short list of dumbbell exercises I had performed regularly in at least three of those four years, including staple exercises like dumbbell (DB) hammer bicep curls, flat and incline dumbbell chest presses, lateral raises and shoulder presses. Interestingly, the maximum amount of weight I lifted was greatest for nearly all exercises in 2014, with the exception of a single exercise where I used a lot of weight during one workout in 2013. I had to check the original data for that workout to verify that number wasn't a mistake (it wasn't). My training programs last year have included more low-rep sets than I have done in the past, and for those sets I tend to use heavier weights.

Max weight in 1999-2000 2013-2014

Creating this graph was a bit trickier, mostly because I had to select markers that could be overlaid, unjittered, but without completely obscuring one another. To make it, I performed the following steps in Graph Builder:

  • Dragged DB or BB weight to the Y axis.
  • Dragged Exercise Name to the X axis.
  • Dragged Year to the Overlay field.
  • Chose Max as the Summary Statistic in the left hand properties pane.
  • Filtered to show only data for the years 1999, 2000, 2013 and 2014.
  • Changed the marker color, shapes, and transparencies so that I could distinguish year by marker and color.

Stay tuned for my next blog post, where I'll show how I created a custom muscle map to visualize my workout data!


Exploring hard drive test data

Which computer hard drives are most reliable? And how often should you replace a hard drive? Those are some of the questions I hope to answer in exploring and analyzing data about hard drive tests. A company called Backblaze, which offers online data backup services, generously makes this data available to the public via its website. According to the description of the Backblaze Hard Drive Test Data, the company needs an adequate supply of hard drives that are both reasonably reliable and economically feasible. Over the last two years, Backblaze collected daily data on the hard drives that were in service. Company researchers have published some conclusions on the best drive and a replacement schedule for drives, but they also were curious about what else could be found in the data. So let's see if my analysis agrees, disagrees, and/or finds something else.

My first impression of the data is: It is big. How big? There are 631 daily files, each about 4 MB to 9 MB; newer files are bigger. The total size of all the CSV files is around 3.5 GB. Each CSV file seems to have the same format, which is very good.

Instead of concatenating the files in SQLite, I do that in JMP. The resulting JMP table has more than 17 million rows. The columns are: “date”, when the row was recorded; “serial_number”, which is the hard drive ID; “capacity_bytes” which is the size of the hard drive; “failure” which indicates whether that hard drive failed on that day; 40 columns of raw SMART statistics about the hard drive on that day; and 40 columns of normalized SMART statistics.
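
Outside JMP, the same concatenation step might look like this in pandas (the file paths are hypothetical, and only the columns named above are kept):

```python
import glob
import pandas as pd

cols = ["date", "serial_number", "capacity_bytes", "failure"]
daily = [pd.read_csv(f, usecols=cols, parse_dates=["date"])
         for f in sorted(glob.glob("backblaze/20*/*.csv"))]   # 631 daily CSV files
drives = pd.concat(daily, ignore_index=True)                  # 17+ million rows
```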

And the resulting JMP table is more than 12 GB! Hmm, what just happened? The description of the data source mentioned that many SMART statistics in the data are missing; missing values in a JMP table are stored as double-precision floating-point numbers, but they occupy no space between two commas in a CSV file. I guess there are a lot of missing values in the CSV files. That explains why the binary data file ends up much larger than the total of all the CSV files. We will come back to the missing value issue later.

Now I want to look at the life distribution of a randomly selected hard drive available in the Backblaze warehouse, regardless of manufacturer and model. (I am assuming the recorded hard drives form a good sample that represents all the hard drives that they have.)

From the resulting data, we can collect the number of days to failure or censoring for every hard drive. In case you are not familiar with the terminology “censoring,” it means that the hard drive had not failed at the time when its last record was saved. We usually say that failure and censoring are events, and we use “time-to-event” to refer to the time to failure or censoring. The calculation of time-to-event is carried out by computing the date range for each serial number. Using the calculated time-to-event values, we can compute a nonparametric estimate of the failure distribution from all the hard drives that they have used.
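
Continuing the pandas sketch above (hedged, and not the JMP steps I actually used), the time-to-event table and a nonparametric estimate could be computed like this, using the concatenated drives table; lifelines is a third-party package used here only to illustrate the Kaplan-Meier-style estimate:

```python
from lifelines import KaplanMeierFitter

# One row per drive: first and last recorded date, and whether it ever failed.
life = (drives.groupby("serial_number")
               .agg(start=("date", "min"), end=("date", "max"), failed=("failure", "max")))
life["days"] = (life["end"] - life["start"]).dt.days + 1   # time to failure or censoring

km = KaplanMeierFitter().fit(life["days"], event_observed=life["failed"])
print(1 - km.survival_function_)   # estimated probability of failure by each day
```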

The next plot shows the estimate. How do we interpret the plot? Each dot in the plot has two coordinates: the x-coordinate is the time in days, and the y-coordinate is the estimated probability that a randomly chosen hard drive will fail before that time. For example, a point around 300 days has a probability around 0.03. So if 100 randomly chosen hard drives start running on day one, we should expect about 3 of them to fail by 300 days.

As a statement, we say the failure probability (sometimes also called the failure rate) at 300 days is 0.03. Notice that the failure probability (failure rate) here is different from the failure rate that the Backblaze Hard Drive Test Data web page discusses; see their best hard drive blog post. Their failure rate is related to, but does not seem to be exactly, the recurrence rate in a renewal process in our terminology. I believe that I can use my failure rate to derive something similar to their rate, but not the other way around. I will look closer later.


If life were always easy, we could probably fit the data using a parametric distribution, e.g., Weibull, Lognormal, etc. If so, we should see the nonparametric estimate riding along a smooth curve without bumps. But we're not so lucky this time!

However, it is not surprising to see bumps or turning points in the estimate. According to the Wikipedia page on S.M.A.R.T., hard drive failures are of two types: predictable failures and unpredictable failures. It sounds like they are talking about failure modes: wear-out, and all others. We could assume that each failure mode can be modeled by a single distribution, e.g., Weibull or Lognormal. Then we can apply the famous bathtub failure rate.

This is not that easy, either. It is surprising to see at least two obvious turning points in the nonparametric estimate: one around 150 days, and the other around 550 days. If the latter divides the predictable failures from the unpredictable ones, does the first turning point tell us that we may classify the unpredictable failures into more than one mode?

Assuming the turning point around 550 days is where failures become more predictable, it is surprising that the failures build up gradually faster before that point. Isn't that counterintuitive? I would expect unexpected failures to slow down, according to what we should expect from the bathtub failure rate phenomenon.

Before we dive deeper, I want to take a look at the SMART statistics in the data. I chose to look at the statistic labelled SMART 187, because it is highlighted in a Backblaze blog post, Hard Drive SMART Stats, as one of the most promising variables for deciding whether a hard drive needs replacement.

I’ve made a simple scatterplot of the SMART 187 raw values against the ages of the hard drives when the corresponding SMART 187 values were recorded. The left panel draws a scatterplot for the good hard drives that had not failed through their last records. The right panel draws a scatterplot for the failed hard drives. According to the explanations on Hard Drive SMART Stats, the differences in the two plots are expected, i.e., the SMART 187 values in the right-hand panel appear to be larger than those in the left-hand panel. That is consistent with the claim that the higher the SMART 187 value, the more likely the hard drive is to fail.


No hard-to-understand surprises so far. Good. But those are probably well-known facts. Now what? We can’t be afraid to get our hands dirty, I guess.

To be continued…


Interactive HTML Bubble Plot platform in JMP 12

The Interactive HTML Bubble Plot output is new to the latest version of JMP. It is meant to replace and improve upon the interactive output provided by the Flash (SWF) export facility in previous versions of JMP. If you are unfamiliar with the Bubble Plot platform, the Essential Graphing  guide in the JMP Documentation provides details on how to use this platform.

Here are a few pictures of Interactive HTML Bubble Plot examples running in a browser:

Examples shown: election, flu, crime and hurricane data bubble plots.

But to really appreciate the platform, you need to interact with it. So, please try out some web-based Bubble Plot platforms to see for yourself.

If you already have JMP 12 and want to share your bubble plots with someone who doesn't have JMP, or just want to interact with them yourself using an iPad or Android tablet, all you need to do is save as Interactive HTML. Once you open the file in a browser, you can interact with it as follows:

Interacting with bubbles within the plot area
Selection, brushing, tooltips, and moving the time label are all provided in the Interactive HTML Bubble Plot. The toolbar provides modes to make it possible to extend selections in a mobile-friendly way where users don't have Ctrl and Shift keys.


Selecting bubbles
Click on a bubble to select it. Ctrl-click on another bubble to add it to the selection, or click the third button from the left on the toolbar before clicking on other bubbles. Tooltips will appear, providing information about the selection. Selected bubbles will also appear brighter than unselected bubbles and may display differently during animation according to the “Trail Lines” and “Trail Bubbles” settings found in the Bubble Plot platform menu.

Brushing bubbles
Click on the button to the left of the help button, then click in the bubble plot and drag out a rectangle to select bubbles touched by the rectangle. Click within the rectangle and drag to move the rectangle to touch and select other bubbles. To add to the selected bubbles while dragging the rectangle, hold down the Ctrl key or click on the third button from the left (if it’s not already depressed) before dragging the rectangle (“brushing”).

Tooltips
On a desktop browser, they behave as usual. On mobile devices, tap a bubble to display the tooltip.

Time Label
In dynamic bubble plots, where a Time role is assigned, a time label will appear in the plot. Drag the time label within the plot to prevent it from obscuring bubbles or hide the time label by disabling the "Show Time Label" menu item.

Controls under the plot area
Several controls appear under the plot area that control animation, bubble size, and the splitting/combining of bubble groups.


They have been styled to be easier to use on mobile devices and behave the way they do in the desktop version. The first slider controls time and is named by the column associated with the Time role.

  • Tap or click the slider track to jump forward or backward in time.
  • Drag the slider's thumb to advance or rewind.

The "Speed" slider controls the animation speed when you press the "Play" button.

  • Tap or click the slider track to quickly increase or decrease the animation speed.
  • Drag the slider's thumb to increase or decrease the animation speed gradually.

The "Bubble Size" slider controls the size of the bubbles.

  • Tap or click the slider track to quickly increase or decrease the bubble size.
  • Drag the slider's thumb to increase or decrease the bubble size gradually.

Click or tap the "Back" button to step the animation backward.

Click or tap the "Step" button to step the animation forward.

Click or tap the "Play" button to start the animation. The "Pause" button will appear to allow you to pause the animation.

Click or tap the "Split" button to split the selected bubbles.

Click or tap the "Combine" button to combine selected bubbles.

Here are some examples of the controls in action:

Animation

Split and combine

As in the desktop version, the controls only appear when they apply, as described in the following table:

Animation Controls
  • <Time variable> slider – Controls which time values appear in the bubble plot. You manually drag the slider to see a progression of time. Click and drag on the time variable in the bubble plot to move its position. Only appears if you have specified a variable for Time.
  • Speed – Adjusts the speed of the animation. Only appears if you have specified a variable for Time.
  • Bubble Size – Adjusts the size of the bubbles. The bubbles maintain their relative size, but their absolute size can be adjusted. Appears on all bubble plots.
  • Back – Adjusts the time value back by one unit and shows the previous time value. Only appears if you have specified a variable for Time.
  • Play / Pause – Press Play to animate the bubble plot. It moves through all of the time values in order, and loops back to the beginning when the last time period is reached. Press Pause to stop the animation. Only appears if you have specified a variable for Time.
  • Step – Adjusts the time value forward by one unit and shows the next time value. Only appears if you have specified a variable for Time.
  • Split – Splits selected bubbles. Only appears if you have specified two ID variables.
  • Combine – Combines selected bubbles. Only appears if you have specified two ID variables.

Menu interface
The interactive HTML bubble plot menu provides items that control the display of bubbles, labels, trails, and how roles are aggregated.


Use the "Draw" menu to select whether bubbles are filled, outlined, or both.

Use the "Label" menu to select whether labels are shown on all bubbles, selected bubbles, or none.

Use the "Trail Bubbles" menu to select whether all bubbles, selected bubbles, or no bubbles leave trailing bubbles as they animate over time.

Use the "Trail Lines" menu to select whether all bubbles, selected bubbles, or no bubbles leave trailing lines as they animate over time.

Use "Show Time Label" to show or hide the time label.

Use the "Aggregate" menu to select whether X, Y, Color, or Size roles represent sums or mean values.

Use "Select All" to select all bubbles.

Here are some examples using the menu:

Line Trails

Bubble Trails

What's included?

In my previous post, Coming in JMP 12: Interactive HTML Bubble Plot, I showed a comparison of the desktop version and the Interactive HTML version, but I didn't cover how much of the rich feature set is available in the Interactive HTML version. We tried to capture the essential features of the Bubble Plot platform, many of which appear in the Bubble Plot platform's red triangle menu. So here's a comparison of the red triangle menu and the Interactive HTML version's menu.


We hope you like the features we chose to support in the Interactive HTML Bubble Plot Platform. Please let us know if we missed something you would really like to see included.


How to create custom menus in JMP

Two of the biggest reasons our users love JMP are its interactivity and its ability to dramatically reduce the time needed to perform routine yet critical analysis tasks. Indeed, the primary strengths of JMP lie in the ways in which it differs from conventional software, and we repeatedly hear from users who can’t imagine doing their jobs with any other tool.

New users, however, may want a little assistance as they make the transition to JMP, and the JMP Starter (View > JMP Starter on Windows and Window > JMP Starter on Macs) is often where they begin. Options are categorized and described, and clicking a button launches the appropriate JMP platform.


JMP Starter Menu

The case for custom menus

Periodically, users have explained that it would be nice to have the ability to create a modified version of the JMP Starter, or a general-purpose menu, for reasons like these:

  • Providing context-specific menus for a variety of job roles.
  • Linking to documents, web pages and other resources.
  • Launching custom applications.
  • Including additional instructive or descriptive text.

Now, you can quickly and easily create and manage customized menus like the ones in the JMP Starter, with a script, data table and example add-in that I’ve uploaded to the JMP User Community’s File Exchange (a free SAS.com Profile is required to access the files). While the add-in is chiefly for illustrative purposes, it also installs with many icons that you may find useful as you create your custom menus -- see the download page for details.

Creating a custom menu

To create a custom menu, first place your information into the first five columns of the Specifications data table:

  • Category: Give a name for the submenu in which you want the selection to appear.
  • Icon: If you would like an icon to appear next to the button, include a link to it here. NOTE: Icons are not resized!
  • Button: Enter the text you’d like to appear on the button.
  • Script: Enter the JSL you want to execute when the button is pressed. This could be anything from a single-word platform launch to an entire JSL script.
  • Description: Enter any helpful text you’d like to include next to the button.

The next two images show a specifications table and the custom menu produced by it.


Specification Table


Sample Menu

Once you’ve entered all the information for each button, create a column of 1/0 flags for each menu you’d like to create. Use a 1 to show the item, and a 0 if you do not want to show the item. In the screenshot below, we see that the Beginner, Intermediate and Advanced menus will each present a different combination of buttons.


Menu Flags in Specifications File

Now, run the Create Custom Menu Scripts data table script (included in the table), which will prompt you to select one or more menus to create:


When you press OK, a script is produced for each of the selected menus. Run the appropriate script to produce each menu. (The example below corresponds to the Intermediate menu.)


It is also easy to package a collection of menu scripts as an add-in, allowing users to access any of them as desired.



Accessing data at scale from databases

Many JMP users get their data from databases.

A few releases ago, we introduced an interactive wizard import dialog to make it easier to import from text files. In a subsequent release, we created a feature that lets you import Web page tables into JMP data tables. In JMP 11, we introduced an interactive wizard to simplify importing data from Excel spreadsheets, covering such functionality as going across tabs and stacking groups of columns. But there was one area that we had yet to improve: making it easier to import data from databases – until now.

So what are the problems to solve? Data from databases usually means an abundance of data from many different tables. To get at the data you want, you need to join and filter, something that requires too much knowledge about the data and too much expertise with the SQL language.

The amazing thing about how a database stores data is how regularized (normalized) it is, so that the same data is not duplicated across multiple tables. There are three huge advantages to this:

1. Since data is not duplicated, there is less wasted storage space.
2. Updating a field is a matter of updating one place instead of many.
3. Data is naturally more consistent.

Databases are organized for efficient storage and efficient transactions, but not usually for efficient analysis access.

Often the data you need to analyze is scattered across multiple tables. You have to join them to obtain what you need, which can make for significant work.

You could just import all the data from each table, and then join them inside JMP, but there are two reasons why this isn’t the best method. First, you usually do not need all the data from each table, so you waste time and space when you import more than you need. Second, it is a burden to specify how to join the tables together.

Alternatively, you could use SQL to specify a join and then select the rows and columns to import. While this makes for more efficient use of the hardware, it isn’t always easy since you have to learn SQL, become familiar with the details of each table, and write and debug the SQL code to send to the database.

But JMP can help you do all of these things with ease.

Databases are organized to help with access. Each table usually has a primary key to uniquely identify each row of the table. The table also has “foreign keys,” which are columns made to match primary keys in other tables — this relates that table to other tables. The typical join is then an “outer join,” matching the foreign key in one table to the primary key in another table. This is a one-to-many (outer) join, since a foreign key may not be unique but a primary key is unique to its table.
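
As a generic sketch of the kind of query this produces (hypothetical table and column names, run through Python's built-in sqlite3 only so the example is self-contained): the primary table's foreign key is matched to the secondary table's primary key with a left outer join, and a filter picks the rows to keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER,            -- foreign key into customers
                         amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 12.0), (12, NULL, 5.0);
""")
rows = conn.execute("""
    SELECT o.order_id, o.amount, c.region
    FROM orders AS o                                      -- primary table (unit of analysis)
    LEFT OUTER JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE o.amount > 10                                   -- the filtering step
""").fetchall()
print(rows)   # [(10, 99.5, 'East'), (11, 12.0, 'East')]
```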

In JMP, the new Query Builder dialog is organized to make this process easy. First, you specify the primary table, the one whose rows make up the unit of analysis. This primary table will usually have columns designated as foreign keys. Simply specify some secondary tables, and suddenly, the join becomes obvious and automatic.

Next, you need to select the variables and rows to keep. Selecting the variables is easy in Query Builder — select them from a list that shows the table from which each variable comes, along with the data type. Selecting rows involves specifying filters, but this is easy, too. Query Builder retrieves all the categories for categorical variables and the ranges for continuous variables, and then presents controls so you can pick just the categories and ranges you want.

But it gets even better, since Query Builder goes several steps further. You can preview any table or prospective join output right in the dialog window. You can also examine the resulting SQL code, plus you can save the query to a file so you can use it again to get the data. When you save the query, you can specify that some of the filtering conditions be set to prompt you, so that the next time you run the query, you can change the categories or ranges you specify to filter the data.

You can read about Query Builder in this blog post from Eric Hill, the lead developer of the facility.

We live in the age of big data, and most of this data is stored in databases. Making it easy to get this data into an analysis helps turn “big data” into “big statistics.”

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.
