Summarizing patient safety with standardized MedDRA queries (SMQs)

If you work or have worked within the pharmaceutical industry, then you are likely familiar with MedDRA, the Medical Dictionary for Regulatory Activities. This dictionary makes it possible for drug and device companies to perform analyses of adverse events or medical history. First, MedDRA provides a way to consistently map verbatim terms described by a patient and subsequently reported by an investigator. For example, at a clinic visit, the physician may ask whether the patient has experienced any ill effects since the previous visit. The patient may reply with any or all of the following: headache, head throbbing, pounding in head, head pressure, head pain, pain in head.

Analyzing these verbatim terms directly would be difficult, since what most would call a "headache" is described in many different ways. Prior to analysis, these verbatim terms are coded so that similar terms are described in a consistent fashion. A company may maintain a list of synonyms that is applied to each new study to auto-encode verbatim terms as a first step. Terms that aren't initially coded (such as those that include misspellings) may have appropriate MedDRA terms applied by a clinician or other medical reviewer. These MedDRA-coded terms are referred to as Preferred Terms (PTs).

Figure 1. Hierarchy of Hepatic Disorders

The second benefit of the MedDRA dictionary is that the terms are grouped within a medically relevant hierarchy. PTs are grouped within Higher Level Terms (HLTs), which are grouped within Higher Level Group Terms (HLGTs), which are grouped within System Organ Classes (SOCs), which essentially describe the major systems of the body. Lower Level Terms (LLTs) represent the greatest granularity and are generally thought of as synonyms contributing to the PTs. Most analyses within the pharmaceutical industry present the frequency and percentage of PTs summarized within SOCs, though the frequency and percentage of patients experiencing events within each SOC are also often summarized.

Figure 2. Distribution of SMQs

One facet of the MedDRA dictionary that is less frequently utilized is the Standardised MedDRA Queries (SMQs). SMQs are groupings of PTs and LLTs that describe a medical condition, syndrome or disease. For example, if you wanted to identify subjects with Hepatic Disorders, there are numerous PTs and LLTs that could identify an individual as having hepatic disease. Identifying these patients can help determine whether they are experiencing drug-induced liver injury (DILI) or if they had underlying liver disease when they entered the clinical trial. SMQs can exist individually, or they can describe disease areas complex enough to have their own hierarchies (Figure 1).

Figure 3. PTs Contributing to Neuroleptic Malignant Syndrome

JMP Clinical 5.0 has new features to summarize MedDRA SMQs directly from the user’s set of MedDRA dictionary files. Available analyses allow the team to easily identify and present the distribution of SMQs occurring on trial (Figure 2), summarize the LLTs and PTs contributing to observed SMQs (Figure 3), as well as compare the incidence of SMQs between treatment arms (Figure 4). Further, summary data are provided to allow the analyst to present hierarchies like Figure 1 with frequencies and percentages of subjects by treatment, with values summarized at either the individual SMQ level or cumulatively considering whether a patient belonged to any sub-SMQs. Analyses can use Narrow sets of terms for specificity or Broad sets of terms for sensitivity. JMP Clinical can even handle the SMQs with more complicated algorithms. These analyses can provide greater insight into the underlying safety of patients within the clinical trial.

Figure 4. Incidence Analysis of Narrowly-Defined SMQs

 


Get more from Graph Builder: Points with error bars

Error bars are a great way to visually show the variability in your data. They are often seen overlaid on top of bars and points, and a wide range of error bar views is available in Graph Builder in JMP.

To see what's possible, let's first build a bar chart using the “Car Physical Data” set from the JMP Sample Data files library. Upon opening Graph Builder, we add the continuous parameter “Gas Tank Size” to the left-hand y-axis and the categorical parameter “Type” to the bottom x-axis. Then we select the Bar icon from the top graph elements menu.  Lastly, we go to the bottom left-hand statistics dialog box for Bar and select Mean under “Summary Statistic” and Standard Error under “Error Bars.” This presents a view like the one below. Note that in this graph, we also reduced the transparency of the bar shading so that you can see the error bars more easily when they are overlaid on top of the bars.
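If you prefer scripting, roughly the same view can be produced with a Graph Builder script. Here is a minimal JSL sketch (the option names follow what Graph Builder typically saves, so treat the details as illustrative rather than exact):

// Hedged sketch: bar chart of mean Gas Tank Size by Type, with standard-error bars
dt = Open( "$SAMPLE_DATA/Car Physical Data.jmp" );
dt << Graph Builder(
	Variables( X( :Type ), Y( :Gas Tank Size ) ),
	Elements(
		Bar( X, Y,
			Summary Statistic( "Mean" ),     // mean Gas Tank Size per vehicle Type
			Error Bars( "Standard Error" )   // standard-error bars on each bar
		)
	)
);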

Graph Builder View 0: Error Bars on Top of Points

But what if we wanted to add error bars not to a bar chart, but instead to a display with points? This is an increasingly popular view that gives a feel for the spread of the individual points as well as the added information of error bars. One way to do this is to take the same Graph Builder view we generated above, change from the Bar icon to the Points icon in the top graph elements menu, and again select Mean under "Summary Statistic" and Standard Error under "Error Bars." However, as you can see in the graph below, we now lose the points and get only the error bars on the graph.

Graph Builder View 1: Point Error Bars

To create a graph with both error bars and points, we can add another points element from the top graph element menu and drag it down into the graph. Notice that it creates two Points dialog boxes in the left-hand graph statistics section. For the second Points dialog box, we can now select Mean under "Summary Statistic" and Standard Error under "Error Bars." This presents a view as shown below. We again reduced the transparency of the points so you can see the error bars overlaid on the points more easily. (Note that another way of adding the second Points dialog is to right-click on the graph, select Add and then choose Points.)
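In script form, the two Points elements might look something like this sketch, continuing with the dt reference from the earlier snippet (again, the exact saved options may differ):

// Hedged sketch: raw points plus a second Points element that carries the mean and error bars
dt << Graph Builder(
	Variables( X( :Type ), Y( :Gas Tank Size ) ),
	Elements(
		Points( X, Y ),                        // the individual observations
		Points( X, Y,
			Summary Statistic( "Mean" ),       // collapse the second element to the mean...
			Error Bars( "Standard Error" )     // ...and show standard-error bars
		)
	)
);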

Graph Builder View 2: Points & Error Bars Mixed

As you can see in the resulting graph, this visual is definitely not very easy to interpret as the error bars are buried within the points. An innovative solution to our problem would be to shift the error bars to the right of the points so we can see both easily on the same graph. We will explore three different ways to do this:

  • Solution 1: Use “label spacers” rows to enable side-by-side views of the points and error bars.
  • Solution 2: Create a “dummy variable” column in our data to create side-by-side views of the points and error bars.
  • Solution 3: Use JMP scripting to automate a custom view.

Solution 1: Add “Label Spacer” Rows

  • First, we add a new categorical column to our data called "Label." Then, we enter one label (in this case, “A”) for the entire column.
  • Then, we add 15 blank rows to the bottom of the data table and number them 1 through 15 in the Label column. This column helps create space between the points and error bar visuals. (A JSL sketch of these table steps appears after the graph below.)
Table View 1: Data Table with “Label Spacers”

  • Now we create the graph by again bringing in two points elements into the graph and asking for points statistics on the first one and error bar statistics on the second.
  • We now space over the error bars on the graph by putting the “Label” column on the Overlay. Notice how this shifts the error bars to the left of the points, but keeps them under the correct x-axis label.
  • To clean up the view, we will also turn off the legend (from the top red-triangle menu) and un-jitter the points in the graph statistics area.
Graph Builder View 3: Points & Error Bars Via Label Spacers
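For reference, the table preparation in Solution 1 can also be scripted. Here is a minimal JSL sketch (the column name Label matches the steps above; everything else is illustrative):

// Hedged sketch of the Solution 1 table prep
dt = Open( "$SAMPLE_DATA/Car Physical Data.jmp" );
dt << New Column( "Label", Character, "Nominal", Set Each Value( "A" ) );  // same label for every existing row
dt << Add Rows( 15 );                                                      // 15 blank spacer rows at the bottom
For( i = 1, i <= 15, i++,
	dt:Label[N Rows( dt ) - 15 + i] = Char( i )                            // number the spacer rows 1 through 15
);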

Solution 2: Create a “Dummy Variable” Column

  • We need to create a dummy variable for our output by adding another output column and effectively doubling our data set size.
  • Then we add a label column to create separation on our graph. To clean up the view, we labeled the two label types “(” and “)”.
Table View 2: Data Table with Dummy Variable

  • Next, we build the graph by bringing in both output columns and asking for points statistics on the first one and error bar statistics on the second.
  • Then, we create a nested label on the y-axis by bringing in the Label column to create the separation.
Graph Builder View 4: Error Bars via Dummy Variable

Solution 3: JMP Scripting for a Custom View

As the saying goes, “All things are possible with JMP JSL custom scripting!” Below is an example of a custom-scripted view with side-by-side points and error bars. This script does not do any table manipulation or dummy variable creation like the first two solutions. It manipulates the graph points themselves using scripting. So the end product lets us immediately generate our desired view with just the click of a button.

Graph Builder View 5: JMP Scripted Custom View

Whichever way you choose to create a graph with both points and error bars, it is easy to see how you can use the versatile Graph Builder platform (along with JMP scripting) to create new and exciting data visualizations!

Note: Brady Brady, Chris Kirchberg and Xan Gregg contributed to this blog post.


JMP Pro for linear mixed models — Part 1

JMP Pro 11 has added a new modeling personality, Mixed Model, to its Fit Model platform. What’s a mixed model? How does JMP Pro fit such a model? What are the key applications where mixed models can be applied? In this and future blog posts, I will try to dispel myths about mixed models and illustrate the software’s capabilities with real-life examples.

What’s a Linear Mixed Model?

Linear mixed models are a generalization of linear regression models, y=Xβ+ε. This model is fit to a sample of cross-sectional data by standard least squares to estimate the fixed-effect parameters, β. Extending the model to allow for random effects γ with design matrix Z, the regression model becomes y=Xβ+Zγ+ε. It’s called a mixed model because it contains both fixed effects and random effects.

We make the following assumptions about the random effects γ and the random error ε: (1) γ and ε are normally distributed, and (2) there is no covariance between γ and ε. JMP provides an unstructured covariance structure for γ and several commonly used structures for ε. Using the restricted maximum likelihood (REML) method, JMP jointly estimates β as well as the covariance matrices for γ and ε. To fit such a model, additional data on each subject are required, or, in the case of spatial data, the locations of the measurements are needed. (In recent years, mixed model theory has been extended to encompass such statistical methods as empirical Bayes, ridge regression, time series and smoothing splines. However, I limit the scope of my discussion to the “traditional” use of linear mixed models.)
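Written out in the usual textbook form (with G and R denoting the covariance matrices of γ and ε, respectively), these assumptions are:

γ ~ N(0, G),  ε ~ N(0, R),  Cov(γ, ε) = 0,  so that  Var(y) = ZGZ′ + R.

The Random Effects tab determines the form of G, the Repeated Structure tab determines R, and REML estimates both along with β.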

Why Mixed Models?

When responses are correlated or an important causal factor is omitted, failing to account for that leads to under- or overestimating the effects of treatment and other factors.

Here are some common use cases for mixed models:

  • Allowing coefficients (e.g., intercept and slope) to vary randomly across subjects (random coefficient models). A variant is the individual growth model, which can be applied to predict individual growth trajectory and degradation data analysis.
  • Analysis of randomized block designs, and split-plot designs where hard-to-change and easy-to-change factors result in multiple error terms.
  • Controlling for unobserved individual heterogeneity in the form of random effects (panel data models).
  • Analysis of repeated measures where within-subject errors are correlated.
  • Correlated responses where different measures are taken from the same subjects.
  • Subjects are hierarchical (e.g., students within schools). This is known as a hierarchical linear model or multilevel model.
  • Spatial variability (geostatistics).

The list goes on and on. With JMP Pro 11, you can easily specify and fit all of these models using the point-and-click interface and review the results in a user-friendly way. Before I turn to my first example, let me outline the general steps for specifying your mixed model in JMP Pro.

Steps for Specifying Mixed Models

  1. Select Analyze => Fit Model, and choose the Mixed Model personality.
  2. Select a continuous response variable as Y and construct fixed effects as you normally would do with a standard least squares fit.
  3. Use the Random Effects tab to specify random coefficients or random effects.
  4. Use the Repeated Structure tab to select a covariance structure for model errors.
  5. Click Run.

Example 1: Random Coefficient Models — Allowing Coefficients to Vary Randomly Across Subjects

In this example, we’re interested in estimating the effect on wheat yield of pre-planting moisture in the soil while allowing each variety to have random deviation from population effects. So, a random coefficient model is called for. The experiment randomly selects 10 varieties from the wheat population and assigns each to six one-acre plots of land. In total, 60 observations with six measurements of yield for each variety are collected. (The data, “Wheat,” is available in JMP’s Sample Data folder.)

I followed the steps laid out above to specify my random coefficient model. From the Fixed Effects tab, I added fixed effects (i.e., population intercept and population Moisture effect).

From the Random Effects tab, I used the Nest Random Coefficients button to specify that a variety’s intercept and Moisture effect vary randomly from one to another. Note that JMP’s covariance structure for random coefficients is unstructured.

From the Repeated Structure tab, I selected Residual for the model error term.

This example is detailed in the JMP documentation. Let's examine the results. First, take a look at the Random Effects Covariance Parameter Estimates report.

The variance estimate for Intercept is 18.89 with a standard error estimate of 9.11, so the z-score is 2.07 (=18.89/9.11). Using the Normal Distribution function from the JMP Formula Editor (or looking it up in a standard normal distribution table in any statistics textbook), we find the p-value to be 0.0192, indicating that the variation in baseline yield (i.e., without any pre-planting watering) across varieties is statistically significant. Similarly, we obtain the p-value for Cov(Moisture, Intercept), 0.3777, and the p-value for Var(Moisture), 0.0380. Although the sign on the covariance estimate is negative, there is no statistical evidence that this negative correlation is significant. The variation in the Moisture effect across varieties is significant at α=0.05.
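As a quick sanity check, the Intercept p-value quoted above can be reproduced in a JSL script window with the Normal Distribution function (this is just arithmetic, not anything the platform requires):

z = 18.89 / 9.11;                    // variance estimate divided by its standard error
p = 1 - Normal Distribution( z );    // upper-tail probability of the standard normal
Show( z, p );                        // z is about 2.07, p is about 0.019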

The Random Coefficients report gives the BLUP (Best Linear Unbiased Predictor) values for how each variety is different from the population intercept and population Moisture effect (reported in Fixed Effects Parameter Estimates). For Variety 1, the estimated moisture effect on its yield is 0.61 (=0.66-0.05), baseline yield is 34.39 (=33.43+0.96), and the predicted yield equation is Yield=34.39+0.61*Moisture.

Combining both the fixed effects and random coefficient estimates, we find a significant overall effect on wheat yield of moisture and discover significant variation in the moisture effect across different varieties. The random coefficient model produces a BLUP prediction equation for yield for each variety.

Other Specifications of Random Coefficient Models

The Individual Growth Model is a type of random coefficient model in which a random time effect is estimated for each individual. After adding a continuous time variable (e.g., day or month) as a random effect, use the Nest Random Coefficients button to request a separate slope and intercept for each individual.

In education research, subjects are often nested in a hierarchical order. By adding multiple groups of random effect statements, you can fit hierarchical linear models (multilevel models).

Stay tuned. In my next blog post, I will discuss using mixed models for panel data, repeated measures and spatial regression.


Using Neural platform in JMP Pro for automated creation of validation column

JMP Pro is a great tool for quickly building multiple models with your data using a variety of techniques, namely tree-based methods (the Bootstrap Forest and Boosted Tree options in the Partition platform), neural networks and penalized regression (using the Generalized Regression personality in Fit Model).

When building predictive models, you need sound ways to validate your model, or you can easily get into trouble with overfitting. Many modeling platforms in JMP Pro support a validation column. The validation column is used to split your data into training and validation portions. Training data is used to build the model, and validation data is used to tune the model. Sometimes a third split – test – is used to simulate new data so that you may see how the model performs with data previously unseen by the model.

Let’s say you want to use 70 percent of your data to build the model and save 30 percent to validate or fine-tune the model. You might think that taking a random sample of all of your rows might be the best way to go – but that can easily lead to problems if you are dealing with lots of outliers or a rare event. A simple random sample could easily place all of the important data points (like the rare events) in the training set or validation set. This creates suboptimal modeling conditions and may lead you to build models that are not very useful.

The Neural platform in JMP Pro can help you create an unbiased and balanced validation column automatically in just a few steps. I’m using the Boston Housing data set, which is in the Sample Data in JMP (Help > Sample Data). For this data, I want to predict a house’s mvalue based on a number of possible predictors. The Distribution below shows that the response has a number of high mvalues that we want to make sure are equally divided into our training and validation sets.

In the Neural Model Launch, I can specify a holdback proportion for the validation method. Because I want a 70/30 split, I’ll put .3 into the field. The Neural platform will automatically sort the response from lowest to highest value and then randomly assign each record to either the training or validation set based on my desired proportion.

I’m not particularly interested in the actual Neural model here, so I can just accept the defaults and click Go. Then, from the red triangle menu on the fit, I select the “Save Validation” option.

This will automatically create a new Validation column in my data table, with each row tagged “Training” or “Validation.”

If we again fit a Distribution to the response with validation as the By, you can see that the properties of the training and validation sets are very close. The Neural platform has done a great job of dividing my data up in an intelligent way –  automatically.

Now, I’m ready to go about the process of building my models – knowing that I have a solid data splitting scheme that will let me build the most informative and useful model with my data. Thanks to Chris Gotwalt, the developer of the Neural Platform, for showing me this powerful capability. It has certainly been the quickest and most reliable way to build a validation column in JMP Pro that I have found so far.


My 4 little-known favorites for the JMP data table

If you work with JMP, then one of your primary points of contact is through the JMP data table. The data table is exceedingly rich in functionality in order to accommodate all different kinds of data and applications.

My work with JMP as a software tester requires me to try many different data sources with many different settings and preferences. However, I always find myself coming back to my favorite core group of settings and features that I use over and over. You all probably know about the big ones such as Recode, Standardize Attributes and Value Ordering. I'd like to share with you some of the little-known favorites of mine.

  1. Keypad Enter key moves down in JMP preferences. When you find yourself needing to enter data into a JMP data table manually, you might enter by row or by column. JMP assumes you're entering all the data on one row before moving to the next row. I tend to prefer working with columns, so when I press Enter, I want to move to the next row and not the next column. I use this so frequently that this is the first thing I double-check any time I install a new version of JMP.
  2. Quickly adjust to appropriate column widths. Simply press the Alt key (Option key for Macintosh users) and drag-resize to change all column widths at once. I do this all the time since I learned about it a few months ago.
  3. Data View. Especially when working with large data, I often use the Data Filter or other methods to explore. When I need to view a quick subset, I use the Data View. With some subset of rows selected, right-click on the "Selected" item inside the Rows pane (lower left corner) in your data table. You'll see a submenu; choose Data View. A new table displays with just the subset rows showing. For additional fun, this is a linked subset, which means that any change you make in this subtable will also occur in your main table. I find this super-useful for managing my large tables.
  4. Column compression. Tables with many rows or columns can take up a lot of disk space. In some cases, you can reduce the table disk size (and also the memory consumed by your table) by compressing the space taken up by eligible columns. Select the columns you'd like to compress, then choose Cols->Compress Selected Columns. JMP will compress the ones it can and leave the rest alone.

These are only a few of the data table gems. What are some of your most-used but perhaps little-known favorites?


Lookup tables in JMP

A few days ago, I showed a customer how she could use lookup tables in JMP, and I thought it would be a good idea to share this with everyone.

Those of you who have used lookup tables elsewhere already know how handy they can be. For those who have never used one, let’s first look at a simple example: assigning a letter grade based on a numeric grade. (By the way, for more on using JMP with grade books, look out for an upcoming blog post on that topic by JMP Academic team member Julian Parris.)

Suppose we have the following correspondence between numeric and letter grades:
E: x < 60
D: 60 <= x < 70
C: 70 <= x < 80
B: 80 <= x < 90
A: 90 <= x

We’d like to automatically assign letter grades to the numeric grades below — but how?

The most common way is to use a formula column with an IF clause. Statements are evaluated in the order encountered, and once a statement evaluates to “True,” execution stops — so the clause must be constructed with this in mind:

IF() statement approach

Letter grades after formula execution
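The formula itself might look like the following sketch (it assumes the numeric grades live in a column called Numeric Grade; adjust the name to match your table). Because the conditions are checked in order, each grade band only needs its upper bound:

// Hedged sketch of the IF()-based letter-grade formula
If( :Numeric Grade < 60, "E",
	:Numeric Grade < 70, "D",
	:Numeric Grade < 80, "C",
	:Numeric Grade < 90, "B",
	"A"                          // everything at 90 or above
)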

This technique works perfectly well, but unfortunately, it can become cumbersome in certain instances:

  • What happens if we need to consider a list of many values, not just a few? Writing a long IF clause is tedious and error-prone.
  • What if the list of values is not known in advance but determined during run-time? We can handle this by using JSL to write code dynamically, but most people prefer to avoid this if they can.

Fortunately, we can tackle cases like these with the help of a powerful matrix function: Loc Sorted().

How does Loc Sorted() work?

  • Loc Sorted(x, y) takes as inputs a matrix x, sorted from low to high, and y, which can be either a matrix or a scalar.
  • Loc Sorted() returns the index (or indices, if y is a matrix) of the last value in x that is less than or equal to y.

For example, running this code gives a result of [3], because 14 is greater than or equal to the number stored in the 3rd position of the x matrix, but less than the number stored in the 4th position of the x matrix:

x = [0, 5,10,15,20,25];
y = 14;
show(loc sorted (x, y));

 

Similarly, running this code gives a result of [4,6,6,1]:

x = [0, 5,10,15,20,25];
y = [17, 25, 50, 3];
show(loc sorted (x, y));

 

Note that the minimum of the x matrix should be the lowest value of y you expect to encounter, because 1 is returned for any y value that is less than all of the x values:
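For instance, a sketch of that case, run in a script window:

x = [0, 5, 10, 15, 20, 25];
y = -3;                          // smaller than every value in x
Show( Loc Sorted( x, y ) );      // returns [1], as described above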

 

In the letter grade example, we can use loc sorted() to pick the grade from list {“E”, “D”, “C”, “B”, “A”}:
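A sketch of that column formula follows (again assuming a column called Numeric Grade; the cutoffs are the lower bound of each grade band, and the last line is the value the formula returns):

// Hedged sketch: letter-grade lookup via Loc Sorted()
cuts   = [0, 60, 70, 80, 90];                       // lower bound of E, D, C, B, A
grades = {"E", "D", "C", "B", "A"};
grades[ Loc Sorted( cuts, :Numeric Grade )[1] ]     // pick the grade for the current row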

This is great if we’ve got only a few categories to consider. But what happens when we have a whole table full of options, or need to not merely look up a single value, but look up several different values and use them together?

Fortunately, this is easy to do. All we need is an extra table to hold the information. Computing tax from a tax table is a classic (and timely) example of such a case.

*** Warning: I am not a tax professional. I do not play one on TV. Please obtain your 2013 tax tables from the IRS.

In today’s example, which is for illustrative purposes only, should not be construed as tax advice, and is not from a tax professional, we will use the table below, which I made from information I found online.

Sample Tax Bracket Table

Our goal is to compute the tax owed, given the table above and the amount of income being taxed.

First, we need to place this information into a data table (notice that we will lower the contents of the first column by $1, because of the way Loc Sorted() works). To follow along with the example, use the following names for the table and columns, or download the example from the JMP File Exchange (download requires free SAS login).

Table name: MarriedFilingJointlyTable

Column names:

  • Cuts (income cutoff for each of the tax brackets)
  • PreTax (sum of tax owed on income in all lower brackets)
  • MargRate (tax rate on last “block” of income)

We want to use the table above to determine the tax for the taxable incomes in the following income table:

Armed with Loc Sorted() and the two tables above, we’re ready to begin.

In the income table, we add a column by selecting Cols > New Column…

We enter “Tax” as the Column Name and select “Currency” as the Format.
We then select “Formula” from the Column Properties drop-down, at which point the Formula Editor appears.

Did you know that we can actually enter a program into a column formula? That is what we will do here — and even though our script is only three lines, the easiest and least error-prone way to do this is by copying and pasting from a script window.

So, open up a script window (File > New > New Script) and type the following:
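Here is a minimal sketch of those three lines, using the table and column names defined above (an illustration rather than the exact original script):

dt = Data Table( "MarriedFilingJointlyTable" );                                     // Line 1: reference the open tax table

Bracket = Loc Sorted( Column( dt, "Cuts" ) << Get Values, :Taxable Income )[1];     // Line 3: which bracket row applies

dt:PreTax[Bracket] + dt:MargRate[Bracket] * (:Taxable Income - dt:Cuts[Bracket]);   // Line 5: tax owed (the formula's value)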

Line 1 points the variable dt to the tax table:

  • Whenever we want to use a column from the tax table — rather than the income table, where the formula actually will reside — we need to preface that column name with dt.

Line 3 determines which row of the tax table we will use for a given amount of income:

  • We call Loc Sorted() using all of the values from the Cuts column in the tax table as x, and the value of Taxable Income in the current row of the income table as y.
  • We store the result of this call as Bracket, which we will use to find the row we need in the tax table.
  • For example, for the first income value, because $ 41,196.47 is between the cutoffs contained in rows 2 and 3 of the tax table, Bracket equals 2, so we will end up using row 2 in the tax table to compute the tax.

Line 5 computes the tax. Using the first income value of $41,196.47 as an example:

  • dt:PreTax[Bracket] is the value stored in the appropriate row of the PreTax column of the tax table. For the first income value listed, this is the sum of all taxes on income below the cutoff closest to, but not exceeding, $41,196.47. (In this case, it is $1,785.)

       To this amount, we must add the product of:

  • dt:MargRate[Bracket], the marginal tax rate in the highest bracket at which $ 41,196.47 is taxed (In this case, it is 0.150), and
  • :Taxable Income – dt:cuts[Bracket], which is the amount of income taxed at this rate (In this case, it is $41,196.47 - $17,850.)

Once we’ve entered the script into the script window, we select all of it, copy it and paste it into the red box in the formula editor (which until now has contained nothing):

After pressing “OK” to close the formula editor and “OK” to close the column properties editor, we find that the tax has been computed for each row in the income table, and see by the “+” icon that the tax is formula-based; the lookup table needs to be open whenever you wish to re-evaluate the formula. (If you prefer to remove the formula at this point, simply click on the “+” icon, select “Clear” when the formula editor opens, then click “OK.”)

For those of you who prefer scripting, the solution is similar. In the script below, I’ve written values directly to the table without a formula, but using the formula() option in the << New Column () message would work just as well.
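A minimal sketch of such a script might look like this (it loops over the rows and writes values directly rather than attaching a formula; it assumes the income table is the active data table and reuses the names from the example above):

// Hedged sketch of a scripted (non-formula) version
dtTax = Data Table( "MarriedFilingJointlyTable" );       // the tax bracket table
cuts  = Column( dtTax, "Cuts" ) << Get Values;           // cutoffs as a matrix for Loc Sorted()

dtIncome = Current Data Table();                         // assumes the income table is active
dtIncome << New Column( "Tax", Numeric, "Continuous" );
For Each Row(
	bracket = Loc Sorted( cuts, :Taxable Income )[1];
	:Tax = dtTax:PreTax[bracket] + dtTax:MargRate[bracket]
		* (:Taxable Income - dtTax:Cuts[bracket]);
);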

And there you have it. That’s all there is to using lookup tables in JMP! If you’re like many of our customers, you’ll see plenty of opportunities for their use — enjoy.


Dinosaurs, chaos and statistics (oh, my!)

“They believed that prediction was just a function of keeping track of things. If you knew enough, you could predict anything. That's been cherished scientific belief since Newton.”

“And?”

“Chaos theory throws it right out the window.”

- Michael Crichton, Jurassic Park

I’ve been thinking a lot about chaos lately.

Perhaps it’s because I have two young sons (ages 4 and 7) who thwart my every attempt to predict what they’ll do next. Maybe it’s a natural consequence of my chosen profession. After all, statistics as a discipline exists because of the variability inherent in all things and our natural desire to identify the order in the chaos. Or maybe it has to do with my recent discovery of this song, which features Jeff Goldblum’s chaotician, Dr. Ian Malcolm, from the movie Jurassic Park.

Yeah, it’s definitely that last one.

Dr. Malcolm studies chaos theory (Nonlinear equations? Strange attractions?) and uses his knowledge of the unpredictability of complex systems to try and convince John Hammond not to open Jurassic Park due to the potential danger. Sure enough, his prediction comes true (ironic?), and the all-female dino population takes advantage of the frog DNA used to complete their genetic code to swap genders in order to mate and lay eggs because “life finds a way.”

(And increasingly terrible sequels need new dinosaurs… no one could have predicted that. Ahem.)

However, you don’t need to be a student of chaos theory to appreciate the important role statistics plays in our everyday lives, which was evident last year through the International Year of Statistics -- and this year, as we celebrate the 175th Anniversary of the American Statistical Association.

The ASA was originally founded in 1839 as the American Statistical Society. It is a little-known fact (read: fabrication) that later that same year, a puckish Englishman named Sir William A. Cronym developed a means to abbreviate text by using the first letter of each word in a phrase to form a new word. The development of these “acronyms” (as they came to be known) revolutionized how people communicated with one another, particularly over the telegraph, a fledgling technology at the time. These acronyms necessitated a name change of the recently founded American Statistical Society to the American Statistical Association in 1840. Better to change the name of the organization than be known as that “gaggle of American derrières.”

Today, the ASA is the second-oldest continuously operating professional association in the United States, after the American Philosophical Society, which was founded in 1743 by Benjamin Franklin. Its members serve in industry, government and academia in more than 90 countries. With close to 18,000 members, the ASA is the largest professional statistics organization in the world.

So, happy birthday, ASA! I predict another 175 years of statistical leadership and excellence. That is, of course, until someone clones a bunch of dinosaurs… then, well, we’ll just see what happens.


Is big data a big deal?

Maybe… but messy data is a bigger deal.

Big data hit the mainstream over the past year or so. I know this because the BBC has produced several programmes covering it. What I’ve heard is that there is no clear definition of what big data is and why it is important. When I ask people if they have big data, they overwhelmingly say “yes,” whether they have a thousand or many millions of rows of data or observations. So who is right? It depends.

Nowadays, statistical software, including software designed to maximise the power of the desktop like JMP Pro, can easily handle data sets with millions of rows. What is more important is the number of columns. Very tall and very wide data sets are truly big. These may require standard statistical methods such as sampling to build useful models, bringing model building within the power of a desktop computer.

So if big data is easily manageable, what are the real challenges faced by today's analysts, engineers and scientists? We surveyed delegates at the two model-building seminars held recently in Marlow and Edinburgh and uncovered an interesting finding: All of the delegates had messy data.

Make the most of messy data

You have messy data if you have missing data, empty cells, outliers or wrong entries. Traditional statistical methods, such as logistic and linear regression, throw out rows where cells are missing, resulting in a poorer model. Outliers also throw the model off, making it less useful.

John Sall discussed a new way of dealing with messy data called "Informative Missing" in his blog post. This takes the use of missing data beyond imputation to a new realm: Missing data might actually be informing you of something that is important and so must be included in your model. An example would be a loan applicant leaving part of their application blank in order to hide a poor credit history; this would be a critical finding for a credit analyst to model. If you are working in a manufacturing setting, data might be missing because the result was literally off-the-scale, which could be useful information to capture in the model. If you are modelling the activity of substances based on their chemical properties, you might have missing data for, say, decomposition temperature if the material was not seen to decompose over the measured temperature range; so if you include this information in the model, you would get better predictions of activity.

There is a new class of modelling techniques called shrinkage methods that are designed to provide you with a model that predicts well and has the smallest number of variables, even when you have strong correlations between input variables. The Generalised Regression personality allows you to use these methods from within the Fit Model platform. Used along with Informative Missing, it has the added benefit of using all rows of data when building the model -- even with messy data.

Decision tree-based methods are good for dealing with outliers, because the point at which the split occurs is not biased. JMP Pro users are telling us that they are building useful models without having to clean their data because of this and Informative Missing: With robust modelling techniques, you might be able to skip data cleansing and still produce a good model. Now that is truly revolutionary. Decision trees also have the added advantage of being visual, allowing you to explain your findings to execs.

What do I do if I have messy data?

JMP Pro is the software designed to deal with your messy data.

We will be running an exclusive, hands-on workshop in the UK for new users of this software on 12 June, so if you would like to join us in Marlow, let me know.

If you would like your managers to see how JMP Pro deals with these problems, you can ask them to join the webcast on 3 April when we will be showing two case studies.


NCSU leads the pack in statistics conference contest

I zone out when my colleagues go on and on about North Carolina college basketball. With mascots like Blue Devils and Demon Deacons, you’d think it would be more fun to me than it is.

Add to that the fact that my university days were not spent at any of the big schools in North Carolina or surrounding areas. So I don’t really get all of this rivalry stuff.

However, I am surprised to see that North Carolina State University is dominating so thoroughly in the competition for free admission to the Women in Statistics conference May 15-17. Where are those brilliant stats students at UNC, Duke, NC Central, Wake Forest University, East Carolina University, UNC Greensboro, UNC Wilmington and UNC Charlotte?

It may be because SAS has its roots at NC State. Or maybe it is because NC State’s analytics programs are superior. I don’t really think that, but I’ll put that out there if it’ll get more entries coming in from other area schools.

In case you missed my earlier blog post announcing the competition, here’s the scoop: The JMP team and WIN (the SAS Women’s Initiatives Network) want to empower three statistics students by helping them attend the Women in Statistics conference. Entering is as easy as writing a short essay telling us why you want to attend this conference, what you hope to get out of it and how you plan to use statistics in your career. You can submit your entry at jmp.com/wis. You can learn more about the conference here, and you can read the contest rules here.

Says one entry from NC State, “I would love to attend this conference in order to hear speakers talk about their experience in their particular field in order to try to hone in on which route would be best for my career path.” Sounds like a winning reason to get a free pass.

OK, rival schools: Are you going to let NC State’s stat students take all of the free passes? To be considered in the competition, you need to get your entries in before 5:00 p.m. ET April 11, 2014. Submit your essay today!


Munich votes for new mayor – First run-off election in 36 years

Elections are a beloved but controversial topic worldwide. Feelings are often strong, debates intensify on a daily basis, and positions become polarized. Elections for presidents or parliaments get a lot of attention from both news media and citizens. But I believe local elections can be fascinating as well. That’s why I invite you to take a deeper look into the current elections for a new mayor in Munich, Germany.

Munich was voted the world’s most livable city in 2007 and 2010. Munich is the green city of the Oktoberfest, of beer, of Bavarian veal sausage and of coziness! So what? you might say.

It’s the end of an era: Munich’s mayor for the past 20 years has been the social democrat Christian Ude, who was not allowed to run for office again because of his age. Twelve (!) politicians were hoping they would win the opportunity to replace Ude on March 16. But nobody won a majority of the votes. On March 30, we will have a historic vote: the first run-off election in 36 years.

But which parties have already had the opportunity to lead Bavaria’s main city for a six-year term as mayor? The graph below shows you the winners since 1952.


Figure 1 – JMP Graph Builder stacked diagram: Voting results (in percentage) of the larger (above) and smaller (middle) parties, as well as turnout (in percentage), for all mayoral elections since 1952.

Recalling the past

Let’s take a trip down memory lane with Figure 1: Just twice in post-war history has Munich gone “black,” the color of the conservative party CSU (Christian Social Union). Right after the war, Karl Scharnagl of the CSU was installed by the American occupying power as mayor in 1947. Then in 1948, the city council voted for the “red” Thomas Wimmer of SPD (Social Democratic Party) as the new mayor. Since then, the “red” party has led Munich almost continuously. Only once, in 1978, did Erich Kiesel of the CSU benefit from a candidate’s change of SPD party and win in a run-off election. But this lasted for only one legislative period.

The era of Christian Ude started in 1993, and he managed to gain more and more votes over the years despite decreasing turnout. In 2008, he won 66 percent of the vote. Since then, some things have changed. The “black” party has regained some strength in both Germany’s and Bavaria’s elections. Although Munich's citizens still voted for Ude, they became increasingly dissatisfied. Munich has been growing, causing more housing, traffic, child care and education problems. Now 36 years after the last CSU mayor in Munich, that party has a chance to win again.

There were two TV debates before the election on March 16, in which the four main candidates participated: Dieter Reiter (SPD), Josef Schmid (CSU), Sabine Nallinger (Alliance '90/The Greens (GRUENE)) and Michael Mattar (FDP, Free Democratic Party). (Eight other parties also sent their candidates, although they have very little chance of winning the election.) All candidates said they would make it a high priority to work on the issues of growth, housing, traffic, child care and education. So their messages were very similar. Everyone felt it was time for a change and that Reiter was unlikely to repeat Ude’s results.

And if we look at historic data, which party other than the CSU would have any chance of winning at all? The graph below illustrates the answer to this.


Figure 2 – JMP Multivariate Analysis: Correlation between the three largest parties with mayoral candidates in the 2008 city council election, colored by eight clusters specifying different voting behavior in the voting districts. All but the districts in Cluster 2 show the same correlation trend between SPD and CSU. Districts in Clusters 1 and 4 show almost no correlation between GRUENE and SPD, and districts in Clusters 6, 7 and 8 show almost no correlation between GRUENE and CSU. Where there is a moderate or strong relationship between parties, the points trend in an increasing or decreasing direction; horizontal, vertical or roughly circular distributions show low or no correlation.

It is difficult to predict the outcome when clustering the districts based on their voting behavior and visualizing the relationships between the parties in a multivariate analysis (see Figure 2). The graph shows that most districts voted for either SPD or CSU, although in the districts in Cluster 2, GRUENE were neck-and-neck with CSU. Supposing that many GRUENE supporters voted for Ude to prevent a CSU victory, the “losing Ude” effect might be a game-changer for Nallinger and GRUENE in other districts as well.

The here and now

On March 16, 2014, Munich voted, and for the parties, the results were shocking as well as exciting -- especially since the turnout was even lower than in 2008: about 42 percent. Reiter saw a 26 percent drop-off compared to Ude’s result in 2008, ending with 40.4 percent. It was expected that he would lose votes, but it was surprising how much he lost. Schmid received 36.7 percent, a 12.3 percent increase compared to his 2008 result. Because the mayor has to have an absolute majority, Munich must go to the polls again at the end of March for a run-off election between Reiter and Schmid.


Figure 3 – JMP Cluster Analysis with Dendrogram (left) and Parallel Plot (right): Voting behavior by polling station for Munich’s 2014 mayoral election, split into nine clusters. Y-axis: absolute number of votes per district; X-axis: parties.

Although Nallinger did not make the run-off, it would be grossly negligent not to note her valiant fight, which achieved 14.7 percent of the vote. This is a much bigger surprise than the neck-and-neck race between Reiter and Schmid. If you take a deeper look into the city districts (Figure 3), you can see a much more differentiated scenario than in 2008, when SPD was almost always on top. In 2014, there are many more districts where CSU is ahead of or in the same range as SPD (clusters 3, 4, 5, 7, 8 and 9), although some districts are without doubt still led by SPD (clusters 1, 2 and 6). However, many districts also show a close race between CSU and GRUENE for second place (clusters 1 and 2).

Looking at the overall win-loss results, you can see that GRUENE won over many citizens, especially non-voters, as did CSU. In contrast, SPD couldn’t replicate their 2008 results in 2014 and lost support, both to non-voters and their competitors CSU and GRUENE (Figure 4). Out of 150,000 lost SPD voters, almost 96,000 didn’t vote at all.


Figure 4 – JMP Graph Builder: Bar charts of win-loss of votes by party for Munich’s 2014 mayoral election (left) and voting proportions of base, swing and non-voters (right): Of the people who voted for CSU in 2014, 69 percent also did so in 2008, 20.9 percent voted for a different party’s candidate in 2008, and 10.1 percent were non-voters in 2008.

Nallinger’s mayoral run boosted her party’s result for the city council as well. The Greens achieved more than 15 percent (+2.3 percent). However, the big surprise is that the largest party has changed from SPD (31.4 percent, minus 8.3 percent) to CSU (35.2 percent, plus 7.5 percent). Now it’s clear that the Ude advantage is gone.

Foresight

On March 30, the new mayor will finally be selected in a run-off election. Dieter Reiter is less than 6 percent ahead of Josef Schmid. Of course, the Conservatives are hoping their results will give them the win. I’m skeptical. Based on historic political closeness and alliance, the SPD probably can count on the supporters of GRUENE. Nallinger’s result of almost 15 percent of the vote, if added to Reiter's, could create a majority.

At the same time, taking into account that there is no true majority anymore in the city council for SPD and Alliance '90/The Greens, it may be challenging for a “red” mayor (in case he makes it). He would have to take care of all Munich citizens, including many of those who didn't vote for him. In any case, there will not be another era like Ude’s for SPD.
