Monday, August 4. 2008
Experiments on Experiments, Models of Models
(NOTE: This is part three of a three-part series on stochastic optimization.)
Over the last two weeks, I introduced robust process engineering and stochastic optimization – the effort to achieve good product in the face of variation among the factors. Last week, I gave a cooking example. This week, I present a solution to the optimization problem.

In-Silico Surrogate

The inspiration for the solution comes from the world of computer experiments, also called in-silico science. Suppose you want to build the optimal passenger jet. You have factors like wing length, wing pitch, engine size and body composition, and you have responses like fuel economy, passenger volume, noise and speed. You create an experimental design with 64 runs, and you are ready to go. No problem. Each plane will cost around $85 million, so that makes the experiment cost around $5.44 billion. Oops. Your experimental budget is just $40,000. What do you do? You don’t build those planes. You create computer models of them and run those computer models to determine the performance characteristics. The planes are flown in-silico. If the models are good, they will report responses that are reasonably close to the real values. So you could, theoretically, optimize the characteristics using these models. But those in-silico models are expensive, too. Each run for the computational fluid dynamics could take hours on a supercomputer. So you develop what is called a surrogate model, or meta model. You make a space-filling experimental design that samples 100 or more factor combinations in the factor space. Then you carry out the runs on the supercomputer. Then you fit an interpolation model to those points, and now, instead of taking hours for each point in the visualization or optimization, the interpolation model takes a fraction of a second to evaluate. We now have a three-stage model for a computer experiment:

Real World << Expensive Computer Model << Cheap Surrogate Model

So let's return to the stochastic optimization problem.
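The three-stage idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `expensive_model` is a made-up stand-in for a slow simulation, the space-filling design is a simple Latin hypercube (evenly spaced levels per factor, shuffled independently), and the surrogate is a plain inverse-distance-weighted interpolator rather than any particular JMP fitting method.

```python
import math
import random

def expensive_model(x, y):
    """Hypothetical stand-in for a slow simulation (a real run might take hours)."""
    return math.sin(3 * x) * math.cos(2 * y) + 0.5 * x

def latin_hypercube(n_runs, n_factors, seed=0):
    """Evenly spaced levels on [0, 1] for each factor, shuffled independently."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        levels = [(i + 0.5) / n_runs for i in range(n_runs)]
        rng.shuffle(levels)
        columns.append(levels)
    return list(zip(*columns))

def build_surrogate(points, values, power=2.0):
    """Return a cheap inverse-distance-weighted interpolator through the sampled points."""
    def predict(x, y):
        num = den = 0.0
        for (px, py), v in zip(points, values):
            d2 = (x - px) ** 2 + (y - py) ** 2
            if d2 == 0.0:
                return v  # exactly on a design point: reproduce the sampled value
            w = 1.0 / d2 ** (power / 2)
            num += w * v
            den += w
        return num / den
    return predict

design = latin_hypercube(100, 2)
values = [expensive_model(x, y) for x, y in design]  # the only expensive calls
surrogate = build_surrogate(design, values)

# Compare the cheap prediction with the expensive answer at a new point.
print(surrogate(0.3, 0.7), expensive_model(0.3, 0.7))
```

Once the 100 expensive runs are paid for, every later evaluation in a plot or an optimizer goes through `surrogate`, which is the whole point of the approach.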
'Cooking' a Chemical Process

We needed to determine whether we should “cook” a chemical process “hot and fast” or “warm and slow.” If the factors could be fixed, the hot and fast settings would be best. But because the factors are subject to variation, we already know that “hot and fast” is not a good setting; the variation will cause about 4 percent of the batches to yield below the minimum of .55, and those batches will have to be discarded. So now we develop a surrogate model of the defect rate:

Chemical Process << Model + Variation Simulation << Surrogate Model for Defects

Though our model of chemical yield is cheap and easy, the model of the defect rate in the presence of variation is not cheap and easy, and it is obtained through Monte Carlo simulation. We generate 10,000 runs of random factor data at given factor center settings; we then calculate the yield and then develop a defect rate based on the portion of the simulations that fall below the lower specification limit. We already know defect rates for two points:
Remember that these are not fixed factor settings, but centers of the distribution of the factor settings, which have underlying variation with standard deviations of 1 and .03, respectively. Those defect rates are just estimates based on simulation. If you do a new Monte Carlo simulation, you will get slightly different values.

The Surrogate Experiment

So now we need to systematically vary the centers of the temperature and time distributions according to a space-filling experimental design. We use space-filling experimental designs because we expect a complex surface, and we can afford to investigate that surface. The workhorse space-filling design is the Latin Hypercube. These are easy to make. You just make an evenly spaced set of values for each factor and scramble them individually. The result will have a uniform distribution across each factor and at least a random joint distribution. The JMP Design of Experiments platform actually optimizes the scrambles to fill the space better. If the runs are computer experiments, you don’t have to worry about randomization and replication because there are no outside factors to randomize against. The Profiler Simulator in JMP has a built-in feature called “Simulation Experiment” that makes all this very easy. It prompts you to enter the number of runs and to identify the portion of the factor space you want to investigate (around current settings), and it performs the simulations and estimates the defect rates. In our case, we will ask it to run a computer experiment in 80 runs across the whole factor space. This is a lot of work. For each of 80 defect rate estimates, the software does 10,000 runs. Fortunately, computers are fast, so this takes less than half a minute. Here is how the space-filling design arranges the points and what the defect rates are at each point. ![]() Now we need to model this defect surface. The emerging standard fitting technique for computer models is the Gaussian Process model.
This model essentially calculates a weighted average of the neighboring points to predict each point on the surface. (Kriging and radial basis function neural nets are close relatives of Gaussian Process models.) After we fit the surface, we now call the optimizer to find the minimum on this surface. ![]() Now we know that to minimize defects, we cook it warm (526 degrees) and slow (.287) -- the opposite of the optimum for fixed-factor settings, which was hot and fast. The predicted log10 defect rate is -3.206, which corresponds to a defect rate of 10^-3.206, or about 0.000622, clearly much smaller than the 4 percent defect rate at the fixed-factor optimum. A cross-section at the minimum of the surface looks like this: ![]() Now let’s use simulation again to see whether the defect rate holds up to this prediction. ![]() The actual rate in this simulation is .0007. We have dropped our defect rate from 4 percent to .07 percent, which is about one-sixtieth of the defects from the previous settings. How about the average yield? Before, the average yield was .602; now it is .595, a small sacrifice to pay for the decreased variation.

Conclusion

This new technique worked when previous techniques -- which involved finding the flats -- didn’t work. Not only did it work, but it also enabled us to build an understanding of the defect rate behavior as a separate response surface that can be visualized, as well as optimized. What about the older techniques? If the variation is small relative to the curvature in the response surface, then local methods using the derivatives still work well. If the variation is large enough to be affected by the curvature (second derivative) of the response surface, then you need to switch to simulation experiments. With surrogate models, we now have a great new way to do stochastic optimization. Now we can tune our processes to be robust to variation in the factors, improving quality and reducing waste.

Monday, July 28. 2008
Cooking Optimization: Should You Cook Hot and Fast, or Warm and Slow?
(NOTE: This is part two of a three-part series on stochastic optimization.)
In my previous post, I introduced stochastic optimization. In this post, I show a real example. This example was reported in the classic text by George Box and Norman Draper: Empirical Model-Building and Response Surfaces (page 32), and JMP's Statistical R&D Director Brad Jones noticed that it works as a great robust process engineering example. Imagine you are doing serious cooking, but instead of making food, you are cooking up chemicals, perhaps even a life-saving drug. Your cooking pot is really a chemical reactor, and people are going to depend on your product to save lives. The reaction that cooks your chemical product has two big controllable factors:
The reaction you make converts the initial ingredient, A, into the chemical you want, B. But if you cook it too hot and long, the B that you make will turn into another chemical, C. Here is the picture. Remember that we want to maximize the green B, and minimize the blue and red, A and C: ![]() This fits a classic optimization framework that is certain to have a nice optimum that maximizes the yield of B. Here are the formulas as they are in the JMP table. The yield formula is a function of time and the reaction rates; the reaction rates are also formulas, functions of the temperature. We don’t even have to estimate the parameters theta1 to theta2; they are already known. The reaction temperature is already in Kelvin, so these are basically Arrhenius-type models, well-known to chemists. Yield ![]() k1 ![]() k2 ![]() So let's optimize. In JMP, we use the Profiler to visualize cross-sections of the response surface for yield, and we use a command there to find the settings that maximize yield. Here, we see that we must cook hot and fast to maximize yield at .621 (temperature at 539.95 degrees, time at .1158). ![]() Another perspective, using horizontal cross-sections, is available with the contour profiler, where we can see various combinations of temperature and time that will produce good yield of at least .60 (unshaded) or .61 (inside the red contour line), with the crosshairs at the optimal settings to produce a yield of .621. ![]() But we can’t really control the temperature or the reaction time exactly. The temperature and time vary, at least in a production situation. Suppose that the standard deviation of temperature is 1 and the standard deviation of time is .03. In the contour plot, that is represented by the black ellipse, which would contain 95 percent of the variation in the two factors. Notice that the variation on time is going to mean that many batches will fall into the pink zone and fail to achieve even a yield of .60. How bad will it be?
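To make the question concrete, here is a minimal Python sketch of this kind of simulation. It assumes the textbook closed form for the amount of B in a consecutive first-order reaction A → B → C, with Arrhenius-type rates. The four rate parameters are hypothetical stand-ins chosen only to give yields in a plausible range; they are not the Box and Draper values (the post shows those only as images), so the numbers printed here will not match the figures quoted in this post. The standard deviations (1 and .03) and the .55 lower specification limit do come from the post.

```python
import math
import random

# Hypothetical Arrhenius parameters: k = exp(LOG_A - B / T), with T in kelvin.
# These are NOT the Box and Draper values; they are placeholders for illustration.
LOG_A1, B1 = 21.27, 10000.0   # A -> B
LOG_A2, B2 = 10.67, 5000.0    # B -> C

def rate(log_a, b, temp):
    """Arrhenius-type reaction rate at temperature temp."""
    return math.exp(log_a - b / temp)

def yield_b(temp, time):
    """Fraction of B after `time`, starting from pure A (assumes k1 != k2)."""
    k1 = rate(LOG_A1, B1, temp)
    k2 = rate(LOG_A2, B2, temp)
    t = max(time, 0.0)  # a random draw below zero is treated as no reaction time
    return k1 / (k1 - k2) * (math.exp(-k2 * t) - math.exp(-k1 * t))

def defect_rate(temp_center, time_center, n=10_000,
                sd_temp=1.0, sd_time=0.03, lsl=0.55, seed=1):
    """Monte Carlo estimate of the share of batches yielding below the spec limit."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n):
        temp = rng.gauss(temp_center, sd_temp)
        time = rng.gauss(time_center, sd_time)
        if yield_b(temp, time) < lsl:
            bad += 1
    return bad / n

# Factor centers taken from the fixed-factor optimum quoted above.
print(yield_b(539.95, 0.1158))
print(defect_rate(539.95, 0.1158))
```

Changing `seed` gives a slightly different estimate each time, which is the usual behavior of a simulation-based defect rate.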
The Profiler has a built-in simulation facility, so we enter the standard deviations there and click the Simulate button. ![]() We have a lower specification limit of .55 for yield, which the Profiler's simulator shows as a red line on the histogram. If a batch fails to achieve .55, it must be discarded. At the current settings for the center of temperature and time, it is producing 4.2 percent bad batches. That is not good. Let's try other settings. Suppose that I lower the temperature to 535 and then set time to the point that maximizes yield for that temperature. There, my defect rate goes down to around 1.9 percent — much better. So the combinations that maximize a fixed yield do not minimize a defect rate in the presence of variation in the inputs. ![]() Remember my blog post about finding the “flats”? Most optimization ends up on a hill against some component limit. But if we find a flatter place, it will reduce the variation. The definition of flatness is that the slopes are very small or zero in every direction. We can model those slopes (gradients, derivatives). There is a built-in feature of the Profiler to specify that one or more factors are “noise factors” and that the Profiler should model the derivatives of the response surface with respect to those noise factors, and see if it can jointly optimize to maximize yield and minimize the slope. After maximizing this, we see that we are now on a flat area where the gradient is near zero in both directions. ![]() Now we use the simulator to calculate the defect rate. It is 3.3 percent. This is not much different from the fixed optimum – in fact, note that the factor settings are not much different from the fixed-optimal hot-and-fast settings. ![]() Haven’t we landed on a flat spot? Take a look at a surface plot. ![]() The two grids intersect at the current values, and you see that we have landed on a relatively flat spot near the top of the hill. But it is on the top of a fairly narrow ridge. 
Even though the first derivatives may be small here, the second derivative here is large because the sharp bending leads to a steep drop-off from that point. So we might consider finding a flat spot in a second-degree sense. But there are better ways to go about finding the stochastic optimum, that is, finding the factor centers that minimize the defect rate. Stochastic programming does this. But stochastic programming is hard. How can we make this simple? The answer will be in my next blog post. It turns out that we can reduce defects by an order of magnitude with this technique, so it is very valuable. We need to move from hot-and-fast to cooler-and-slower to achieve this, and there is a great way to find the best settings for this. UPDATE: The third blog post in this series is also available.

Thursday, July 17. 2008
Follow-Up on Tornado Charts for Data Visualization
Last week I showed how to make tornado charts in JMP and asked for input on the utility of these types of visualization. Here are thumbnails of the two alternative views of US population by age and sex.
![]() ![]() One commenter pointed out that the back-to-back bars of the tornado-style chart make it easier to see the baby boom population bump and the smaller next-generation bump 25 years later. While you can also see that general trend in the "normal" chart with side-by-side bars, it's not as obvious (especially the secondary bump) because the alternating gender values add choppiness to the trend curve. Besides the trend of population versus age, the chart also shows the breakdown of population by gender for each age group. The back-to-back bars are good enough to see major variations, such as the larger proportion of females in the older age groups, but the side-by-side bars are better if you want to highlight smaller variations by gender. In the example, only the side-by-side bars make it clear that males are more common in the younger age groups. The bottom line is that the appropriate visualization depends on the message you want to communicate.
Posted by Xan Gregg in Data Visualization, JMP - General, JMP 7 at 00:00
Wednesday, July 9. 2008
The Challenge of Optimizing Products and Processes
(NOTE: This is part one of a three-part series on stochastic optimization.)
To get to the top of a hill, you just keep going up. However, hills can have subpeaks, so sometimes you have to hunt around to keep going up. But going up is still the basic idea. This is what optimization is — finding the top of the hill. Operations research is about solving optimization problems more generally, with higher dimensional hills that might have fenced areas that are off-limits. Now imagine that instead of climbing the hill, you ride on a helicopter; you just tell the helicopter where to go, and then you parachute down from 5,000 feet above that location. Sounds easy. But there are clouds, so you can't see the hill itself, and there are random gusts of wind that can blow you hundreds of meters in any direction. Also, you have to land above a certain altitude, or you will get sick. You do get a few trial drops at different GPS locations, but you have to live around that target location, and you get one jump a day. Welcome to the world of stochastic optimization. Getting to high altitude is now a very messy business. Why study something that behaves this strangely and is this frustratingly difficult to understand? Well, it turns out that the future quality of the world's products and processes depends on just this type of situation. We try to optimize our products and processes, but then it turns out that the input factors vary, and the products and processes are no longer optimal. The input factors might change due to environmental factors: You know how to grow the best yielding corn crop, but unfortunately, you can't seem to control the weather to get the optimal yield. The input factors may vary due to natural variation: Your ingredients are the output of some other process, and you can't get all the variability out of that process — you can often control where the center of the distribution is of each factor, but you can't reduce the variation. The literature on this kind of optimization is not particularly rich. 
The field of study for this application is called robust process engineering — the struggle to make products and processes that behave well in the face of variation. The first good attempt at solving this kind of problem came from a Japanese engineer, Genichi Taguchi. He said that you construct an experiment in two directions. There are the Control factors that you assume are fixed, not subject to random variation. Then there are the Noise factors that in production you can't control completely — they have random variation. In an experiment, you might be able to control them, e.g., you can control the weather for a corn crop by growing it inside and controlling light and water. (In agriculture, that kind of experimental place has a name: phytotron.) Then you cross the experiment across both the Control factors and Noise factors. Next you derive the noise variation across the Noise design for each Control setting. Then you optimize with respect to both mean and variation, or some combined measure, a so-called signal-to-noise ratio. This worked. Taguchi clubs sprouted up all over the world, and engineers learned Taguchi’s method. Some Western statisticians looked at the method and said, “We can do better.” Various schemes emerged along with a recognition of what you should be looking for, which was this: There may be a lot of places on the hill that have good altitude, but among those good places, try to find the place that has the widest, flattest area around it. Then when you are randomly dropped around that target, you are likely to land in a narrow range of altitudes. For example, below you see the contours around Longs Peak in Rocky Mountain National Park. If you want to parachute to above 13,400 feet, then — rather than aiming for the peak above 14,200 feet, risking going off-course and landing at 12,400 feet off the northwest face — you aim for “The Loft,” which is a wide target above 13,400 feet. 
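Taguchi's crossed layout described above can be sketched as follows. This is a toy illustration, not anything from Taguchi's own examples: the `response` function and the factor levels are invented, and the summary uses the standard larger-the-better signal-to-noise ratio, -10 log10 of the mean of 1/y².

```python
import math
from statistics import mean, pstdev

def response(control, noise):
    """Hypothetical process output for one control setting and one noise level."""
    return 10.0 - (control - 2.0) ** 2 + noise * (control - 1.0)

control_levels = [0.0, 1.0, 2.0, 3.0]   # settings we can fix in production
noise_levels = [-1.0, 0.0, 1.0]         # varied on purpose only in the experiment

def sn_larger_is_better(ys):
    """Larger-the-better signal-to-noise ratio, in decibels."""
    return -10.0 * math.log10(mean(1.0 / y ** 2 for y in ys))

summary = {}
for c in control_levels:
    # Cross each control setting with every noise level.
    ys = [response(c, z) for z in noise_levels]
    summary[c] = (mean(ys), pstdev(ys), sn_larger_is_better(ys))

# Pick the control setting with the best combined measure.
best = max(summary, key=lambda c: summary[c][2])
for c, (m, s, sn) in summary.items():
    print(c, round(m, 3), round(s, 3), round(sn, 3))
print("best control setting by S/N:", best)
```

The table makes the trade-off visible: one setting can have almost no variation but a lower mean, while another has a higher mean with some spread, and the signal-to-noise ratio is one way to rank them on a single scale.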
![]() In my next blog post, we will see a real example involving some high-tech cooking, a chemical reaction example, and how a classic example with a well-known optimum gets its lesson reversed when variation is taken into account. UPDATE: The second and third blog posts in this series are now available. Credits: Warren Sarle wrote two neural net papers years ago about how optimizing is like climbing to the top of a hill, and that was the inspiration for my analogy. The map is from Google Maps.

Tuesday, June 17. 2008
Using JMP Genomics to Understand Chronic Pain
“Discovery,” the new name for the annual event formerly known as the JMP User Conference, seems like an appropriate moniker.
At yesterday’s opening session, I ran across conference attendees who are using interactive JMP software from SAS in some pretty amazing ways. There’s the greenhouse gas specialist from the University of North Carolina who is trying to create a carbon footprint for the Chapel Hill campus. There’s the statistician who is using JMP to control quality for a manufacturer of biotherapeutic products. And there’s Susan Dorsey, the nurse-turned-PhD neuroscientist at the University of Maryland Baltimore who is using JMP Genomics, the customized JMP software package for biostatisticians and researchers, in her quest to ease the suffering of people who suffer with chronic pain. Dorsey, an assistant professor in the UM-Baltimore School of Nursing, describes chronic pain as an epidemic in the United States – a condition that affects 75 million people. In a particularly cruel irony, some of the powerful drugs that extend life by decades for cancer patients and HIV/AIDS sufferers, among others, are themselves diminishing the quality of life because they leave behind pain that even morphine can’t control. This “peripheral neuropathy” also extends to patients with multiple sclerosis, fibromyalgia, diabetes, chronic fatigue syndrome and other illnesses. “You might cure the disease, but patients often have to come off the therapy because of the pain,” Dorsey explains. “So these are critical, critical problems.” Some patients won’t even report the pain, she added, for fear that they might be taken off the treatments that are keeping them alive. So Dorsey, using JMP Genomics, is working to figure out how chronic pain mechanisms work in hopes of identifying new therapeutic targets. Dorsey and her team at UM-Baltimore have identified a particular gene in mice – she calls it Gene X for now – that appears to play a role in reducing the effectiveness of morphine to control pain. She is using mice as subjects because they share remarkable genetic similarities with humans. 
Dorsey discovered JMP Genomics while she was looking for a replacement for another, more limited, genomics software package. She knew that the University of Maryland had a SAS license, so she asked for a demonstration of JMP Genomics. She now uses it for a variety of analyses, including exon, SNP and microarray. She said she likes its power, versatility and ease of use. “You don’t have to be a whiz-bang programmer to get your answers,” she says about JMP Genomics. “It’s very visual but also very statistically accurate.” Just the sort of technology that promotes discoveries – some of them potentially life-enhancing.
Posted by Anne Bullard in Academic, Customer Stories, Discovery, Genomics, JMP 7 at 13:42
Monday, June 9. 2008
Tip for Saving JMP Reports: Save the Data, too!
If you ever called or e-mailed a problem to JMP Technical Support, you may have been in contact with Duane Hayes. Duane manages JMP Technical Support for SAS. We recently discussed a tip that also may be helpful when sharing JMP reports with colleagues.
Duane: People often send us JMP reports, .JRP files, so we can recreate and solve their problem. They don't realize that the .JRP file is just a script, like the one shown below. To recreate the problem we also need the data table saved as a .JMP file.

Open("C:\Documents and Settings\hayes\Desktop\snapdragon");

Gail: What should they send you instead? Duane: Nothing instead. Something in addition. They must also send us the data - the .JMP file. We then save the .JMP file and edit the Open statement in the .JRP file to point to the location where we saved the data file. JMP users need to remember this when sharing reports with colleagues. Gail: So, what must they do to share report files? Duane: Users have three options. Option 1: Put the data in a shared directory to which their colleague has access, run the analyses and then save the .JRP (report) file to that same directory. That way, when the colleague opens the .JRP file from JMP, everything will work. Option 2: Send the .JRP file and the .JMP file to the colleague, who saves the .JMP file to a location of choice. In the .JRP (report) file, the colleague then edits the Open statement to point to that location. Gail: And I bet Option 3 is to create a JMP project. Duane: Right. This allows users to bundle and share everything related to their analyses, including documents, PowerPoint presentations, animations and more. Richard Potter described this in detail in his blog Projects in JMP 7. Projects are really in line with the spirit of JMP, which lets you build upon one analysis with another, or subset data, without having to start the analysis anew. Projects are a great way to leverage work you have already done and make it accessible to someone else for review or the next step in the analysis.
Posted by Gail Massari in JMP - General, JMP 7, Technical Support at 10:30
Monday, May 12. 2008
Take a Quick Look at Data Visualizations Using JMP
Check out the article on JMP in the new issue of sascom magazine. It’s lean on words and big on visuals.
You’ll see a variety of graphs created in JMP showing how the software helps answer and explore questions from a range of industries and organizations: pharmaceutical, marketing, public policy, financial services and manufacturing. If you get a chance to read the piece, please come back and let us know what you thought of it.
Posted by Arati Bechtel in Biz Viz, Data Visualization, JMP - General, JMP 7 at 14:30
Monday, May 5. 2008
Want to Know More About Split-Plot Designs?
Randomizing an experiment completely is often either impossible or prohibitively expensive. That's where split-plot designs can be valuable. Split-plot designs allow you to fix certain factors for several runs in a row. Within each block of runs (or whole plot), the factors that are hard to change remain fixed while the others vary at random from run to run. This makes the logistics of running a design simpler.
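The run structure described above can be sketched in Python. This only illustrates the layout (hard-to-change factor held fixed within each whole plot, easy factors randomized within it); it is not the optimal-design algorithm from the paper mentioned below, and the factor names and levels are hypothetical.

```python
import random
from itertools import product

def split_plot_design(hard_levels, easy_levels, replicates=1, seed=0):
    """Return (whole_plot_id, hard_setting, easy_setting) runs in execution order."""
    rng = random.Random(seed)
    # Each whole plot holds the hard-to-change factor at one level.
    whole_plots = [h for h in hard_levels for _ in range(replicates)]
    rng.shuffle(whole_plots)          # randomize the order of the whole plots
    runs = []
    for plot_id, hard in enumerate(whole_plots, start=1):
        subplots = list(easy_levels)
        rng.shuffle(subplots)         # randomize easy factors within the plot
        for easy in subplots:
            runs.append((plot_id, hard, easy))
    return runs

# Hypothetical factors: oven temperature is hard to change; additive and speed are easy.
oven_temps = [150, 200]
easy_settings = list(product(["additive A", "additive B"], ["slow", "fast"]))

runs = split_plot_design(oven_temps, easy_settings, replicates=2)
for run in runs:
    print(run)
```

With two oven temperatures, two replicates and four easy-factor combinations, the oven is reset only four times instead of up to sixteen, which is the logistical saving the design is meant to deliver.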
If you’ve had the opportunity to see JMP R&D Director Bradley Jones demonstrate how JMP Custom Designer handles split-plot designs elegantly and efficiently, you will want to read the paper he recently co-authored. In the May 2007 issue of Journal of the Royal Statistical Society: Series C (Applied Statistics), Brad and Professor Peter Goos from Universiteit Antwerpen introduce a new method for generating optimal split-plot designs. In the paper, they demonstrate the usefulness of this flexibility with a 100-run polypropylene experiment involving 11 factors. In the experiment, they found a design that is substantially more efficient than designs produced using other approaches. We also have more resources on design of experiments (DOE or DOX). If you have JMP, check out sample data and instructions for generating a split-plot design.
Posted by Gail Massari in Design of Experiments (DOE), JMP 7, Statistics at 10:00
Thursday, April 17. 2008
Try This Easy Way to Learn the JMP Partition Platform
![]() Marie Gaudard, Phil Ramsey and Mia Stephens have taught JMP and used it in their North Haven Group Six Sigma consulting practice since the release of JMP Version 4 in the 1990s. All three are strong believers in the value of the JMP Partition platform for novice to expert users. Marie is an ardent data miner. She recently added a new file to our File Exchange. It’s a JMP data table that will help you learn how to use partitioning to mine data. The problem is easy to understand: Some print jobs are ruined by a band of ink on the pages, and the production team wants to identify the factors that may be causing the ‘banding problem’. The data is easy to use: Marie provides data on more than 500 print runs and she includes embedded scripts that get you started. There is a tutorial to help you along: North Haven Group wrote a white paper that gives background on data mining and partitioning. The paper is in the form of a tutorial for implementing the techniques using JMP. I talked with Marie last week about why she likes JMP’s Partition platform. Marie: It is intuitive, powerful, easy to understand and users love it! It’s powerful because it handles large amounts of data and it’s trustworthy because it examines a very large number of possible splits and picks the optimum one. It is especially useful when the explanatory variables are nominal and have many levels. Me: Is it for JMP power users only? Marie: Emphatically, no! Wearing my trainer’s hat, I love it because it is so easy for my clients to use, and it gives them incredible insight into their data. We just teach a new user the basics of opening, saving, and navigating the interface and about the convention of the ‘red triangle’. The red triangle reveals lots of options available to them after they do their first split – options like small tree views and a leaf report.
They just click SPLIT to see, in a graphical tree view, the variables that are most likely to affect the outcome in which they are interested, and the nodes that describe how the variables are related to the outcome. Then, they can easily click PRUNE when they want to reverse the operation. Your readers will know what I mean as soon as they open the data and run our scripts. Me: How does it compare to other data mining tools you’ve used? Marie: Many other data mining tools that do this kind of analysis are largely inflexible. For example, they give you the final tree based on their built-in stopping rules. But you know more about your data and even about the constraints of the organization that you must consider. JMP lets you lock out a variable that may be interesting, but not useful for understanding the problem. Then you can, very easily, go back and split on other variables that may be more valuable. You can also split at specific nodes. We find this very valuable for gaining a deeper understanding of the data. And you can decide when to stop splitting, based on knowledge of the process or using criteria provided by the Partition platform itself.

Monday, March 31. 2008
Board Games, Dice and Probability – and a Love of Baseball
The inspiration for our latest data story – on using JMP for fantasy baseball – came from Lou Valente, one of our product managers and an ardent New York Yankees fan. Lou is a synthetic organic chemist, a Six Sigma Black Belt and a passionate practitioner of design of experiments. He was a worldwide quality manager at Kodak in the Synthetic Chemicals Division before joining JMP about a year ago. But he’s no stranger to JMP. Lou has been using JMP for work and fantasy baseball for nearly 20 years. He has won the championship eight times in his fantasy league in the past 19 seasons. His team, the Vintage Yanks, also has the most consistent performance in his league, as the JMP graph above shows.
For the past seven years, his mean win percentage has been nearly 70 percent, while most other teams’ figures are in the 40-50 percent range. Lou’s data file and analysis of the top 200 professional baseball players are the basis of the baseball data story. The file is available for download from the JMP File Exchange.

I chatted with Lou about the history of his interest in fantasy baseball.

Me: How did you get started playing fantasy baseball?

Lou: It began with board games, dice and probability. But it all had to do with my passion for baseball. My passion for statistics came from baseball. I grew up in New York in the 1960s, only 15 minutes from Yankee Stadium. I was very influenced by Mickey Mantle and Roger Maris. And my dad was in the minor league farm system for the Yankees. So we were a big baseball family. I taught my brother when he was 5 years old how to do batting averages. I showed him how 1 for 10 and 2 for 20 were the same thing. The calculation for earned run averages had to be normalized for nine innings, and this seemed like magic to him. Baseball definitely played a part in making us math-literate, and it made it fun. A board game came out in the ’60s called “Challenge the Yankees.” It was only out for two years. It was a game that had all of the All-Stars from all teams versus the Yankees, since at that time it required a team of All-Stars to compete with them.

Me: What was the game like?

Lou: That game introduced me to statistics and probability. Every card and every player had the numbers 2 to 12 on it, and by throwing dice, depending on what entries were on the card – single, home run, fly out, ground out – the game could approximate a player’s statistics using the probabilities of dice. My cousin used to come from Michigan every summer. He was in college at the time, and I was 10 years old. He showed me how each dice roll had a different probability. You could get a 2 only with snake eyes, whereas you could have a 3 with a 1 and a 2, or a 2 and a 1. Six, 7 and 8 were the most frequent rolls. So the person who invented “Challenge the Yankees” realized that you could approximate the statistics of baseball through the use of dice. Then the game disappeared ’cause if you didn’t live in New York, you probably didn’t buy it. On very rare occasions, you can find the original version on eBay for over $1,000.

Me: Then what did you do? You stopped playing?

Lou: No. Then Strat-O-Matic came along in the ’60s. That game is still played today, and there are conventions all over the United States for this baseball game. The guy who invented the game realized he could make “the game of games” by adding one more die. There was one red die and two white dice. And the red die indicated the columns, so now you had six columns and 2 through 12 under each column. Three columns resided on the hitter’s card, and three columns resided on the pitcher’s card. The statistical granularity of all the things that could happen increased by a factor of six. This board game really took off, and everyone and his brother was playing it. And every year, people buy new cards with updated statistics from the most recently completed season.

Me: Then this led you to fantasy baseball?

Lou: Yes. Before computers, fantasy baseball was all on paper. Then, with the age of computers, it became more automated and easier. I’ve been playing fantasy baseball since 1988, and I’ve had eight championships in 19 years. I don’t win any money. The winner in my league gets a free subscription to the fantasy baseball Web site, which is worth only $20. A lot of people do play for money, and I probably should have been playing for money. But that’s not what it’s all about. What it’s really about is baseball. As I got older, I sort of lost track of baseball because of college and graduate school. And I only knew the Yankee players. Fantasy baseball allows me to know every team and every player. It increased my awareness of the game of baseball again. Plus, it was fun to play against co-workers and gain bragging rights every year!
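For readers who want to check the dice arithmetic Lou describes, here is a minimal sketch in Python. It is not taken from either board game: the names `ways`, `two_dice` and `cells` are purely illustrative, and the only assumptions are fair six-sided dice and, for the last step, a third die that picks one of six columns independently of the other two.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) outcomes give each total.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Exact probability of each total from 2 to 12.
two_dice = {total: Fraction(count, 36) for total, count in sorted(ways.items())}

for total, prob in two_dice.items():
    print(f"{total:2d}: {ways[total]} of 36 ways, probability {float(prob):.3f}")

# A third die that selects one of six columns (an assumption about how the
# red die works) splits each total into six (column, total) cells.
cells = {
    (column, total): Fraction(1, 6) * prob
    for column in range(1, 7)
    for total, prob in two_dice.items()
}
print(len(cells), "distinct (column, total) cells")
```

A card designer can then assign outcomes such as a single or a home run to the cells whose combined probability comes closest to a player's real rate, which is the approximation Lou is talking about.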
Posted by Arati Bechtel in Biz Viz, Data Visualization, Design of Experiments (DOE), JMP - General, JMP 7, Statistics, Visual Six Sigma at 10:10
Monday, February 11. 2008
Fall in Love with JMP
Searching for a way to understand your business data better? JMP may be the perfect match – no flowers required!
Check out our new interactive data story about a fictional candy maker. It shows you – step by step – how JMP joins together data from different sources, helps you see sales trends and enables you to gain insights about specific product lines using the data visualization capabilities of the software. It gives you a taste of what JMP looks like and what it can do. If the data story piques your interest, download the fully functional trial version of JMP and explore it yourself using your own data. And be prepared to fall in love with JMP!
Posted by Arati Bechtel in Biz Viz, Data Visualization, JMP 7 at 10:47
Friday, January 25. 2008
Listening to the Voice of the Customer
What are your customers saying? To find out, some companies are employing Voice of the Customer research and JMP software, according to an article in Quality Digest.
Voice of the Customer (VOC) research aims to find out what attracts customers to a company and what drives them away from its competitors. It involves asking the right questions of the right people in the right way, says Rob Reul, Managing Director of Isometric Solutions in Minneapolis, Minnesota, and a practicing Six Sigma Black Belt since 1986. After collecting the right information, you have to visualize the data and uncover patterns – and that’s where JMP software comes in.

In the Quality Digest article, Reul explains how he conducted VOC research for a large office supply company that wanted to increase its market share. He used JMP to analyze more than 2 million customer records and to discern key relationships in survey data. It’s a fascinating case study about the value of customer loyalty.

If you’re going to ASQ’s Lean Six Sigma Conference in Phoenix, Arizona, in February, you can hear Reul speak. He’ll be presenting a session titled “Using Six Sigma to Trade Performance for Profit” on Tuesday, Feb. 12, 9:15 – 10:15 a.m., Concurrent Session E4. Check out the full conference program here. JMP representatives will be at the conference, too. Look for our Six Sigma specialist Leo Wright, who recently posted about using JMP to improve success in fly fishing.
Posted by Arati Bechtel in Data Visualization, JMP 7, Visual Six Sigma at 11:22
Thursday, January 17. 2008
Improving Your Fly-Fishing Odds – with JMP
Most of us who brave Midwest winters pursuing steelhead trout – the ocean-going form of rainbow trout – already know that you have to pay your dues. It’s well worth the effort when we achieve a solid hook set into the awesome power of the famed silver torpedoes. So, how should we pick the prime days to be on the water to increase the probability of success?
Sorry, I’m a bit of a statistics nut, but I like to improve my odds and predict success when at all possible. Some may say that there are “liars, damned liars and then, finally, statisticians.” While we can argue this for fun over a beer, the bottom line is that data does not lie. I assimilate all kinds of data after every fishing trip, trying to perfect the ideal model to optimize my selection of the perfect fishing day (yes, I fish during the imperfect days as well).

While there are many variables to consider, such as air and water temperature, cloud cover, flows, turbidity and, of course, the angler, don’t ignore the barometer. Barometric pressure is a very reliable factor to watch. It’s best to consider not only the current reading, but also whether pressure is steady, rising or falling at the moment, and what the trend has been up to that point.

If we consider the contour plot above, which I created using JMP software, we can see the preferred barometer range for reliable steelhead fishing is between 29.9 and 30.35. The colored areas in the plot – defined in the Fish-On legend at right – show the number of fish caught during my fishing expeditions. The Barometer variable is straightforward. Current Trend describes the barometric trend at the time: 0 is falling, 1 is steady, and 2 is rising.

Extremely high and low barometric readings generally equate to a tough day in the snow and cold. Steady readings between 29.9 and 30.35 usually will yield a decent day and result in solid hook-ups. Because the winter can bring frequent frontal changes to our area, the fish can get finicky or sometimes present the dreaded “lock-jaw” effect (that is, no bites). This generally happens when the barometric pressure is low and falling or moderate to high and rising. If barometric pressures are changing, I generally prefer rising pressure, especially following frontal lows.
For the statisticians in the audience, there certainly is a fair amount of noise present in this analysis because of the many other variables I mentioned earlier; but if you pay attention to these principles, your success rates will improve. For those of you who use your precious vacation days to go fishing, try these guidelines to plan for success!
Posted by Leo Wright in Data Visualization, JMP 7, Statistics at 10:19
Monday, November 12. 2007
Connected Trails in Bubble Plot
The most common question we get regarding Stephen Few's white paper and webcast on visualizing change is about the scripts for showing the connected trails in bubble plots. Often the existing bubble trails are overlapping, so the progression is clear, but when the bubble trails are spaced out it can be helpful to connect the bubbles with a line.
In addition to the journal file that contains the data and plots from the paper, you can also download the graphics script by itself from the JMP Extras area. The script can be added to any bubble plot (via Right-Click and Customize) if you edit the column names in the script.

Local( {s = "", c = 0, xx = {}, yy = {}},

The script requires that the data be sorted by the ID column, "State" in this case.

Friday, October 5. 2007
Superposition Magic - How can you identify several clusters that are at the same place at the same time?

If physicists can have their superposition magic, then so can statisticians. Suppose that you have lots of points across three variables in three groups. You need to count the number of points in each group. But you don't know which group each point comes from. And, by the way, each group has the same mean across all the variables. Try inventing an approach to do that before you read the rest of this blog entry.

Grouping points is usually the job of cluster analysis. The computer scientists have a more colorful name for this: unsupervised learning. It's easy. You just cluster the data so that points that are near each other form clusters, assign each point to the closest cluster, count the points in each cluster, and you are done. But these clustering methods never allow the clusters to overlap, much less have the same centers. So ordinary clustering just won't do this job.

Before we solve the problem, we have to be a little more specific about the data and generate a problem data set. The data will have multivariate normal distributions, and though each cluster will have the same mean, we will distinguish the clusters by giving each group a different covariance structure. To keep it simple, we will generate uncorrelated multivariate normal data, with each of the three clusters having a larger variance in one component direction that is unique to that cluster. So we generate a table, in this case with variances of 1 for each variable for each cluster, except that each cluster has one component with a variance of 4 instead of 1. The result is that each group sticks out in a different direction, though they have the same centers. Here is the JSL code that I used to make some data:

[Figure: JSL script to generate data]

And here is a picture of the multivariate normal density contours that result from doing this:

[Figure: Normal ellipsoid contours for 3 groups]

Notice that the groups with the large X and Y variances each have 35% of the data, and the group with the large Z variance has only 30% of the data. If the method estimates the proportions well, then the problem is solved. We don't need to identify each point; we just need to estimate the proportions, or equivalently, the numbers in each group.

Here is the secret: Instead of doing the usual hierarchical or k-means clustering, we do normal mixtures. That is, we fit the means, variances, covariances, and relative proportions of each group so as to maximize the likelihood of the data. This was implemented in JMP by Chris Gotwalt. Here are the results.

[Figure: Normal Mixtures Results]

Notice how well we did. The proportions are .343, .351, and .308 for the three groups, very close to the proportions used to generate the data, .35, .35, .30. And the means and standard deviations and correlations are close, too. Problem solved. Here is a picture of the data with the points colored by the most-probable group membership.

[Figure: Points colored for most-probable group]

Is this magic useful? It turns out that a very important type of counting is needed to measure the infection density in HIV cases. There is a special kind of white blood cell - a helper T cell - that expresses a protein called CD4. HIV infection is measured, in part, by how many of these cells are present in the blood samples, relative to other leucocytes. It turns out that you can make different types of white blood cells identify themselves by tagging their binding sites with different fluorescent dyes. Then you send the sample through an instrument called a flow cytometer, which makes a tiny jet of fluid droplets containing the cells. Several lasers of different wavelengths are shot through the droplets, and the fluorescence of each droplet is measured. The result is a huge data set, maybe half a million rows on 12 or so intensity measurements at different wavelengths. The data has a row for each cell. The different cells form clusters that overlap, and you need to count the cells in each cluster.

The current practice for doing this is to hand-drag polygons over the clusters, arbitrarily dividing the groups by eye. Using normal mixtures to do the counting would give a much more objective method and improve the reproducibility of the data. But there are lots of stray points that don't belong to any cluster. No problem. Chris Gotwalt's routines include a (Huber) robust method of handling outliers. Doing this in 12 dimensions for half a million points is currently pretty expensive, so we are looking for ideas on how to speed this up. Pretty good magic.
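To make the idea concrete outside of JMP, here is a minimal sketch of the experiment above in plain Python. It is not the JSL script from the post and not Chris Gotwalt's implementation. The function names `make_data` and `fit_mixture` are hypothetical, and the fit is simplified under several assumptions: covariances are kept diagonal, the EM algorithm is started with each component already stretched along a different axis, the iteration count is fixed, and there is no robust handling of outliers.

```python
import math
import random

def make_data(n, weights, rng):
    """Zero-mean, uncorrelated 3-D normal points; group k has variance 4 on axis k."""
    points = []
    for _ in range(n):
        k = rng.choices(range(3), weights=weights)[0]
        sds = [2.0 if axis == k else 1.0 for axis in range(3)]
        points.append([rng.gauss(0.0, sd) for sd in sds])
    return points

def log_density(x, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def fit_mixture(points, n_comp=3, n_iter=30):
    """EM for a normal mixture with diagonal covariances (a simplification)."""
    dim = len(points[0])
    props = [1.0 / n_comp] * n_comp
    means = [[0.0] * dim for _ in range(n_comp)]
    # Assumed starting values: component k is stretched along axis k.
    variances = [[2.0 if a == k else 1.0 for a in range(dim)] for k in range(n_comp)]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in points:
            logs = [math.log(props[k]) + log_density(x, means[k], variances[k])
                    for k in range(n_comp)]
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: update proportions, means and variances from weighted points.
        for k in range(n_comp):
            nk = sum(r[k] for r in resp)
            props[k] = nk / len(points)
            means[k] = [sum(r[k] * x[a] for r, x in zip(resp, points)) / nk
                        for a in range(dim)]
            variances[k] = [
                max(sum(r[k] * (x[a] - means[k][a]) ** 2
                        for r, x in zip(resp, points)) / nk, 1e-6)
                for a in range(dim)
            ]
    return props, means, variances

rng = random.Random(2007)
points = make_data(1500, [0.35, 0.35, 0.30], rng)
props, means, variances = fit_mixture(points)
print("estimated proportions:", [round(p, 3) for p in props])
```

The estimated proportions can be compared with the generating values of .35, .35 and .30. How close they come depends on the sample size, the starting values and the number of iterations, so a small run like this one should not be expected to match the JMP results shown above.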
ABOUT THIS BLOG
JMP Statistical Discovery Software from SAS
is proud to bring you this blog on all things related to
data visualization, visual Six Sigma, design of experiments
and other statistical topics.
The blog content appearing on this site does not necessarily represent the opinions of SAS. Your use of this blog is governed by the Terms of Use.

