“The desktop computer is dead” and other myths

The desktop or laptop is now in decline, squeezed on one side by mobile platforms and on the other by the cloud. As a developer of desktop software, I believe it is time to address the challenges to our viability. Is software for the desktop PC now the living […]

Big real data is different from big simulated data: Benchmarking

To benchmark computer performance on statistical methods with big data, we can just generate random data and measure performance on that, right? Well, simulated data may not behave the same as real data. Let’s find out. Logistic Regression Suppose that we are benchmarking logistic regression. So […]
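The teaser stops short, but the setup it describes can be sketched: simulate data of a given size, then time a model fit on it. The simulation scheme, model coefficients, and fitting loop below are illustrative assumptions, not the post's actual benchmark:

```python
# Hypothetical benchmark sketch in pure Python; the post's real benchmark
# (data sizes, software, model details) is not shown here.
import math
import random
import time

def simulate(n, seed=1):
    """n rows of (x, y): y is Bernoulli with log-odds 0.5 + 2x (assumed)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(0.5 + 2.0 * x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows

def fit_logistic(rows, steps=200, lr=0.5):
    """Fit intercept and slope by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

rows = simulate(5_000)
t0 = time.perf_counter()
b0, b1 = fit_logistic(rows)
print(f"fit took {time.perf_counter() - t0:.3f}s, slope estimate {b1:.2f}")
```

Whether a timing like this transfers to real data is exactly the question the post raises: real data may have different sparsity, separation, and convergence behavior than clean Gaussian draws.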

It’s not just what you say, but what you don’t say: Informative missing values

Sometimes emptiness is meaningful. If a loan applicant leaves his debt and salary fields empty, don’t you think that emptiness is meaningful? If a job applicant leaves the previous-job field empty, don’t you think that emptiness is meaningful? If a political candidate fills out a form that has an […]
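One common way to let a model use that emptiness is to code each missing value as an explicit 0/1 indicator column. The field names and fill values below are illustrative assumptions, not the post's example:

```python
# Hypothetical sketch: turning "field left empty" into a feature.
def add_missing_indicators(rows, fields):
    """For each field, add a 0/1 '<field>_missing' flag and fill the gap."""
    out = []
    for row in rows:
        new = dict(row)
        for f in fields:
            missing = row.get(f) is None
            new[f + "_missing"] = 1 if missing else 0
            if missing:
                new[f] = 0.0  # placeholder value; the flag carries the signal
        out.append(new)
    return out

applicants = [
    {"salary": 52000.0, "debt": 9000.0},
    {"salary": None,    "debt": None},  # the emptiness itself is a predictor
]
coded = add_missing_indicators(applicants, ["salary", "debt"])
```

After coding, a fitted model can assign the "left it blank" behavior its own effect instead of silently dropping or averaging over those rows.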

Big Data always has significant differences but not always practical differences: Practical significance and equivalence

When you have millions of observations of real data and fit a simple relationship between two variables, a non-significant test is strong evidence of fraud. The one kind of data that is reliably non-significant for very tall data tables is simulated data. We live […]
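The arithmetic behind this is simple: the standard error shrinks like 1/√n, so with enough rows even a trivially small difference becomes statistically significant. A quick illustration with made-up numbers (a mean difference of 0.1% of one standard deviation):

```python
# Hypothetical illustration: a practically meaningless difference becomes
# statistically significant once n is large enough.
import math

def z_stat(delta, sigma, n):
    """Two-sample z statistic for mean difference delta, n per group."""
    return delta / (sigma * math.sqrt(2.0 / n))

delta, sigma = 0.001, 1.0  # a tiny, practically irrelevant effect
for n in (1_000, 1_000_000, 100_000_000):
    z = z_stat(delta, sigma, n)
    print(f"n={n:>11,}  z={z:6.2f}  significant at 0.05: {abs(z) > 1.96}")
```

The p-value answers "is the difference exactly zero?", which at big-data scale it almost never is; the practical question is whether the difference is large enough to matter, which is what equivalence tests address.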

Bad data happens to good people: Robust to outliers

In semiconductor data, it is common for probe measurements that encounter an electrical short to fall far out in the distribution, i.e., to be outliers. When we test whether means are equal, these outlying values inflate our estimate of the standard deviation, sigma. Remember that the […]
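The inflation is easy to demonstrate. Below, a single shorted-probe reading blows up the classical standard deviation, while a median-based (MAD) estimate barely moves. The readings are made-up values, not the post's data, and the MAD approach is one common robust alternative, not necessarily the one the post uses:

```python
# Hypothetical sketch: one extreme outlier inflates the classical SD,
# while a median-absolute-deviation estimate stays stable.
import statistics

readings = [1.01, 0.98, 1.02, 0.99, 1.00, 1.03, 0.97, 1.01]
shorted  = readings + [25.0]  # an electrical short: far out in the tail

def mad_sigma(xs):
    """Robust sigma estimate: 1.4826 * median absolute deviation."""
    med = statistics.median(xs)
    return 1.4826 * statistics.median(abs(x - med) for x in xs)

print("classical SD:", statistics.stdev(readings), "->", statistics.stdev(shorted))
print("MAD sigma:   ", mad_sigma(readings), "->", mad_sigma(shorted))
```

Since the t statistic divides by the sigma estimate, an inflated sigma can mask a real mean shift, which is why outlier-robust screening matters.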

Not just filtering coincidences: False discovery rate

Purely random data has a 5% chance of testing significant at the 0.05 level. Choose the most significant p-values from many tests of random data, and you will be selecting tests that are significant by chance alone. Suppose we have a process that we know is stable and consistent. We measure lots of […]
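One standard tool for this is the Benjamini-Hochberg procedure, which controls the false discovery rate across many tests; the sketch below uses made-up p-values (the post may well use a different FDR method):

```python
# Hypothetical sketch of the Benjamini-Hochberg FDR procedure.
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank  # largest rank passing the BH step-up threshold
    return set(order[:k])

# 18 "null-looking" tests plus two genuinely strong signals (made up):
pvals = [0.0001, 0.0004] + [0.04, 0.12, 0.25, 0.33, 0.41, 0.48,
                            0.55, 0.60, 0.64, 0.70, 0.74, 0.79,
                            0.83, 0.87, 0.91, 0.94, 0.97, 0.99]
print(benjamini_hochberg(pvals))
```

Note that the raw p = 0.04 would pass a naive 0.05 cutoff but does not survive the FDR adjustment, which is exactly the "coincidence filtering" problem.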

Violating Anna Karenina Principle: LogWorth scaling

Leo Tolstoy’s classic novel Anna Karenina opens with the line: “Happy families are all alike; every unhappy family is unhappy in its own way.” It is a very memorable line. But looking at the Wikipedia entry for the Anna Karenina Principle, I saw this version for statistical significance tests: […]
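LogWorth, the scale named in the title, is simply −log10(p-value): it stretches the interesting end of the p-value range (near zero) so strong effects spread out instead of piling up against the axis. A minimal illustration:

```python
# LogWorth = -log10(p-value): p = 0.01 maps to 2, p = 1e-20 maps to 20,
# so differences among very significant effects become visible.
import math

def logworth(p):
    return -math.log10(p)

for p in (0.05, 0.01, 1e-6, 1e-20):
    print(f"p = {p:<7g} -> LogWorth = {logworth(p):g}")
```

On a plain p-value axis, 1e-6 and 1e-20 are both visually "zero"; on the LogWorth axis they sit at 6 and 20, fourteen units apart.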

Each statistic should have a graph to go with it – not!

When we thought about starting a new software system many years ago, we were enamored of an article published in 1973 in The American Statistician by Frank Anscombe called “Graphs in Statistical Analysis.” As you can read in Wikipedia, Anscombe cleverly devised four sets of data that had identical […]
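The quartet's trick is easy to verify directly. Using the first two of Anscombe's four published data sets, the means and correlations agree to two decimal places even though the scatterplots look nothing alike:

```python
# The first two of Anscombe's quartet data sets (values as published, 1973):
# identical x's, different y's, yet nearly identical summary statistics.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation coefficient."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

print(mean(y1), mean(y2))        # both about 7.50
print(corr(x, y1), corr(x, y2))  # both about 0.816
```

Set I is a noisy line while set II is a clean parabola, which is exactly why the summary statistics alone mislead and a graph does not.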

Big data = Dirty data

Note: JMP 11 launched last week. Today, we begin a Tuesday series of Big Statistics blog posts by John Sall about what has to change when you have Big Data, with an emphasis on screening. Data preparation is a big part of an analyst’s job, and when you have […]
