Graph makeover: Fractal scatterplot

One of the marvels of the Internet is the On-Line Encyclopedia of Integer Sequences (OEIS). Started 50 years ago by Neil J. A. Sloane when he was a graduate student, the repository now contains more than 256,000 integer sequences and is run by the non-profit OEIS Foundation, with Sloane still at the helm. (Aside: I wanted to donate $1,000 to the foundation so I could have my name next to Donald Knuth's on the short list of $1,000 donors, but my wife provided a voice of reason, as usual.)

Why do we need an encyclopedia of integer sequences? If your study produces a sequence of integers, you can use the encyclopedia to see what the next term is or what else is known about the sequence. For a (contrived) example, if you're counting carbon trees and get counts of 4, then 9, then 18, then 42 trees as you add more and more atoms, you can look up "4, 9, 18, 42" in the OEIS and find sequence A000678, which has more information and even a formula.
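If you'd rather script the lookup than type terms into the website, the OEIS also answers search queries programmatically. Here's a rough Python sketch; the oeis.org JSON response format is an assumption (it has changed over the years), so the code tolerates both a wrapped and an unwrapped result list:

```python
import requests

def lookup_oeis(terms):
    """Search the OEIS for a partial sequence and print the top matches."""
    query = ",".join(str(t) for t in terms)
    resp = requests.get("https://oeis.org/search",
                        params={"q": query, "fmt": "json"}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    # Older responses wrap the matches in a dict; newer ones may return a bare list.
    results = payload.get("results", []) if isinstance(payload, dict) else payload
    for seq in (results or [])[:3]:
        print(f"A{seq['number']:06d}: {seq['name']}")
        print("   begins:", seq["data"].split(",")[:12])

lookup_oeis([4, 9, 18, 42])
```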

For the 50th anniversary, the foundation produced a poster featuring nine visually interesting sequences. One of them, A229037, the sequence built greedily so that no three equally spaced terms form an arithmetic progression, was shown as this scatterplot:

A229037src

The graph was generated automatically by the website, and the interesting aspect is that the sequence appears to have a fractal structure: each cluster of points looks like a larger version of an earlier cluster. However, two artifacts of the graph obscure that structure. There is serious over-striking of points, especially at the low end. Also, the clusters appear to grow more sparse only because the spacing is increasing while the dots stay the same size. To remedy those artifacts, I made a version with translucent, variably sized dots.

A229037xg

I think it's easy to see the fractal structure now. When I shared it with Sloane, he promptly added it to the sequence's web page and plans to update the poster as well. By the way, the sequence's defining rule has a graphical interpretation that hints at why it has a fractal structure: No three points in the scatterplot fall along the same line with equal spacing.
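If you want to experiment with this outside JMP, here's a rough Python sketch that regenerates the sequence with the greedy rule and plots it with translucent markers whose size grows with the index; the size scaling is arbitrary, and this is matplotlib rather than the JMP graph shown above:

```python
import matplotlib.pyplot as plt

def a229037(n_terms):
    """Greedy sequence: no three equally spaced terms form an arithmetic progression."""
    seq = []
    for n in range(n_terms):
        k = 1
        while True:
            # Reject k if seq[n-2i], seq[n-i], k would be equally spaced in value.
            if all(2 * seq[n - i] != seq[n - 2 * i] + k
                   for i in range(1, n // 2 + 1)):
                seq.append(k)
                break
            k += 1
    return seq

terms = a229037(2000)
n = range(1, len(terms) + 1)
sizes = [0.5 + 10 * i / len(terms) for i in n]   # dots grow with n (arbitrary scaling)
plt.scatter(n, terms, s=sizes, alpha=0.3, linewidths=0)
plt.xlabel("n")
plt.ylabel("a(n)")
plt.show()
```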

I'll leave you with a puzzle involving another sequence from the poster, A250001. It's the number of ways to arrange n circles, ignoring size and not allowing single-point intersections. This image from the poster shows 7 of the 14 ways of arranging 3 circles. Can you find the other 7? There are 173 ways of arranging 4 circles, and no one knows how many ways there are to arrange 5 or more circles.

A250001


Graph makeover: 3-D yield curve surface

A couple of weeks ago, The Upshot section of The New York Times produced this "glorious" interactive 3-D graph of the last 25 years of US Treasury yield curve data, titled "A 3-D View of a Chart That Predicts The Economic Future: The Yield Curve."

nytyield

The graph is very appealing at some level and comes with well-done animated flyovers that highlight some interesting features. Commentaries at Flowing Data and Visualising Data have been mostly positive. However, I'm always suspicious of 3-D views because of the extra mental step needed to translate values accurately and, for surfaces, the danger of missing information that is obscured. This graph works more as a backdrop for drill-downs into slices of interest than as a standalone data representation. While I think it adds value as context, there is also a complexity cost to consider, and it's worth exploring other views.

Getting the data was refreshingly easy in this case. The US Treasury Department provides the data in an HTML table, and the Import HTML feature in JMP brings it into a data table nicely. Though there are more than 100 HTML <table> elements in the web page, JMP correctly identifies the one that contains the data (the others are likely used for page layout). The only glitch was that the date values use two-digit years. Fortunately, JMP has a preference for how to interpret two-digit years, and after setting it to treat "90" as "1990," the dates came in correctly.
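To make the two-digit-year wrinkle concrete, here's a hedged Python sketch with pandas standing in for JMP's Import HTML; the tiny inline table and its numbers are made up for illustration:

```python
import io
import pandas as pd

# The inline table and its values are made up; pandas.read_html plays the
# role of JMP's Import HTML feature here.
html = """
<table>
  <tr><th>Date</th><th>3 Mo</th><th>10 Yr</th><th>30 Yr</th></tr>
  <tr><td>01/02/90</td><td>7.83</td><td>7.94</td><td>8.00</td></tr>
  <tr><td>01/02/15</td><td>0.02</td><td>2.12</td><td>2.69</td></tr>
</table>
"""
yields = pd.read_html(io.StringIO(html))[0]

# strptime's %y maps 69-99 to the 1900s and 00-68 to the 2000s,
# so "90" is read as 1990 and "15" as 2015 (the same choice as the JMP preference).
yields["Date"] = pd.to_datetime(yields["Date"], format="%m/%d/%y")
print(yields)
```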

First, I'll try a 3-D surface view in JMP for comparison. Though the original looks beautiful in many ways, one feature I thought strange for a surface graph is the way the loan term lengths are treated categorically. That is, the spacing between the 1-month and 3-month rates is the same as between the 20-year and 30-year rates. I've seen yield curves drawn both ways, and it usually doesn't matter too much since the curve is often simplified to one of three states: rising, level or inverted. But given the context of the graph's title about "predicting the future," it seems reasonable to treat the term length as a continuous value (that is, how far into the future we're predicting).
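In script form, treating the term length as continuous just means mapping the column labels to a number of years before plotting. A short continuation of the Python sketch above (the labels are assumptions about the table's headers):

```python
# Assumed column labels; extend the map to whatever terms the table contains.
term_years = {"1 Mo": 1/12, "3 Mo": 0.25, "6 Mo": 0.5, "1 Yr": 1, "2 Yr": 2,
              "5 Yr": 5, "10 Yr": 10, "20 Yr": 20, "30 Yr": 30}

# Reshape the wide table into one (Date, term, rate) row per observation.
long = yields.melt(id_vars="Date", var_name="term", value_name="rate")
long["term_years"] = long["term"].map(term_years)
```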

Here is the surface plot in JMP. I could play with the lighting and smoothing, but this lets us get a sense of the effect of a continuous representation of the term length.

yieldSurface

The call-outs of the original piece focus on the three possible 2-D profiles. Looking at the rate versus the term length with a separate curve for each date (yellow to red) produces an attractive view, even if not very informative.

yieldcurve1

With the coloring, we can sense the downward trend over time, though we miss the dips, which are obscured. Possibly this could serve as a backdrop if a few years of interest were highlighted and labeled.

Here's the same view with only one out of every 40 days shown. At least we can get a sense of the older low rates, which were previously obscured.

yieldcurve2
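Here's a rough matplotlib version of that subsampled view, continuing the Python sketch above: one curve per retained date, shaded from yellow for older dates to red for recent ones:

```python
import numpy as np
import matplotlib.pyplot as plt

dates = sorted(long["Date"].unique())[::40]                 # keep every 40th day
colors = plt.cm.YlOrRd(np.linspace(0.3, 1.0, len(dates)))   # yellow (old) to red (new)
for date, color in zip(dates, colors):
    day = long[long["Date"] == date].sort_values("term_years")
    plt.plot(day["term_years"], day["rate"], color=color, linewidth=0.7)
plt.xlabel("term length (years)")
plt.ylabel("yield (%)")
plt.show()
```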

Another way to slice the cube is to look at each term length's rate over time. This 2-D graph of two term lengths, representing short-term and long-term rates over the last 25 years, gives a clearer view:

yieldcurve3

To me, this 2-D view is clearer than the same profile shown within the context of the 3-D view. It's easier to see both the steady decline in the long-term rate and the periods when the short-term rate was higher than the long-term rate. Another embodiment of my favorite maxim, "Less is more."
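In script form, this slice is just two columns plotted against the date. A minimal continuation of the Python sketch above; picking "3 Mo" and "30 Yr" as the short and long terms is my assumption:

```python
import matplotlib.pyplot as plt

ax = yields.plot(x="Date", y=["3 Mo", "30 Yr"])   # one short-term and one long-term rate
ax.set_ylabel("yield (%)")
plt.show()
```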

Finally, here is a reproduction of the heat map of date versus term length, using the interest rate as the color. The cut-out for the missing 30-year rates in the mid-2000s is a good application of the "alpha hull" feature added to contour plots in JMP 11.

yieldcontour1

I usually like heat maps for 3-D data, but this one doesn't seem very informative. Maybe it's the amount of variation in the rate or the irregular spacing of the term length values, but it's harder for me to get a good sense of the data from this view. I think the core issue is that the interest rate is too important to be represented by color alone, which is necessarily imprecise.
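For comparison, here's how a similar heat map can be sketched in matplotlib, continuing the long-format frame built earlier; cells with missing rates, such as the 30-year gap, simply show as blanks rather than being trimmed by an alpha hull:

```python
import matplotlib.pyplot as plt

grid = long.pivot(index="term_years", columns="Date", values="rate")
plt.pcolormesh(grid.columns, grid.index, grid.values, shading="nearest")
plt.colorbar(label="yield (%)")
plt.xlabel("date")
plt.ylabel("term length (years)")
plt.show()
```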

One benefit of remaking graphs like this is that you discover some of the many decisions the designers had to consider when making the published view. A few substantive decisions for this data:

  • Continuous versus categorical term length.
  • An appropriate level of smoothing, since there were too many days in the history to show every value.
  • Dealing with gaps in the data.
  • Deciding which of many interesting data features merit call-outs.

I saved my work as a JMP script (uploaded to the JMP User Community), so I could redo it easily with, for instance, new data or new smoothing parameters for experimentation. It takes a little more effort to create a reproducible script from an interactive data exploration, but I'm finding the practice to be rewarding.


Q&A with market research expert Walter R. Paczkowski

Last month, we featured consumer and market research expert Walter R. Paczkowski, founder of Data Analytics Corp., on our Analytically Speaking webcast series. If you missed the live webcast, you can still view it on demand. Host Anne Milley took many audience questions but was unable to get to all of them, so Walter graciously agreed to answer some of them in this Q&A.

20150318_121214

Questions: (a) What is your approach to building hypotheses to be tested ahead of an analytics project? (b) Do you find that analytical work for B2B segments is much harder than for B2C segments because there can be so many factors in B2B that cannot be put into a model?

Answer: The responses to these first two questions are similar, so I'll answer them together. This is where the upfront qualitative work becomes an important part of the overall research design. Remember, there are two phases I advocate for most projects: qualitative followed by quantitative. The qualitative phase helps set the parameters for the quantitative phase. We generally don’t know all the parameters needed for the quantitative phase – key factors or attributes, levels for the factors, correct wording, important concepts, just to mention a few. The qualitative research, focus groups or one-on-one in-depth interviews with subject matter experts (i.e., SMEs) or key opinion leaders (i.e., KOLs), helps identify them. This makes the quantitative phase more focused and powerful.

What does this have to do with hypotheses and B2B factors? Hypotheses are just parameters, no different from a list of factors or attributes to include in the quantitative phase. Discussions with consumers or SMEs or KOLs can help formulate hypotheses that marketing personnel may never have imagined.

The same holds for B2B modeling – or, in fact, for any modeling for B2B, as well as B2C or B2B2C. If the list of factors is large, then seek help from SMEs and KOLs. They’ll help tell you what is important and what can be ignored. But this is just upfront qualitative research.

Question: Do you think organizations have a balanced approach to creating value from both the found data and the more information-rich data to be gained from well-designed surveys and experiments?

Answer: I’m not sure about “balanced,” but the use of both types of data is definitely there. Since I do work across a wide range of industries, I see many practices, the best and worst, which I talked about with Anne. Many of the large organizations, the sophisticated ones I mentioned in the interview, use these two sources of data to answer their key business questions and understand their markets. These are the ones who follow the best practice of using the right tools – the tool being the type of data in this case.

Over the past few years, I’ve presented workshops on choice modeling, a great example of an experimental approach, and working with Big Data, as I mentioned in the interview. Not only have they been well-attended, but I noticed that many attendees were from the same company, different divisions but nonetheless the same company. So the use of both types is there – I have the data!

Question: When pricing is a factor in a choice experiment, how well does the optimal price indicated by the experiment correspond to what the actual best price should be in the field?

Answer: This is a great question. And hard to answer. First, the last part of the question asked about “what the actual price should be in the field.” This is the whole purpose of the study – to find that price. I think what the question is really asking is whether or not the study replicates current existing prices. That can best be determined using the Profiler in JMP or a simulator by setting the base case to the current actual conditions. But these conditions won’t match exactly what’s in the market since market prices are driven by many other factors that the study can’t handle. Nonetheless, the study should come close. So look for the Profiler to help on this issue.


John Tukey on the rule of zero-origin scales

I saw the following post recently on Twitter:

Eric Jonas @stochastician Mar 16
There’s basically never a reason to start the y-axis of your comparison graph anywhere besides zero.

It generated several dissenting replies, including one from me. Coincidentally, I had just reread part of John Tukey's classic book Exploratory Data Analysis (1977), in which he shows a good counter-example to that guideline. His example comes from a discussion introducing a variation of the box plot called a "schematic plot." Tukey introduced the general box plot in 1969, and the schematic plot refines it with a specific set of rules for the whisker lengths and outlier displays; that refined form has always been the default in JMP, where it's called an "outlier box plot."
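For the record, the whisker rule is easy to state in code. This is a generic numeric sketch of the convention, not JMP's exact computation (quartile conventions differ slightly between implementations):

```python
import numpy as np

def tukey_fences(values):
    """Schematic-plot rule: whiskers stop at the farthest points inside the fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr              # the fences
    outliers = [v for v in values if v < lo or v > hi]   # drawn as individual points
    return (lo, hi), outliers

print(tukey_fences([2, 3, 3, 4, 4, 5, 5, 6, 14]))   # 14 falls outside the upper fence
```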

The example in question uses Lord Rayleigh's measurements of the mass of nitrogen. It's a very small data set by today's standards, and Tukey nicely lists the data in his book. I also checked Lord Rayleigh's publication, "On an Anomaly Encountered in Determinations of the Density of Nitrogen Gas" (1894), which contains a few more observations and a couple of minor differences from Tukey's data. I've attached CSV and JMP versions of the data set in my JMP User Community post, with Tukey's data in a separate column. Here's an excerpt from Lord Rayleigh's paper:

rayleigh1.png

Importantly, the measurements record the details about how the nitrogen itself was produced. The graph below shows the recorded weight (in grams per "globe") versus its source and the purifying agent.

nitrogen0.png

The main difference is whether the nitrogen comes from air or not, which is how Tukey shows it. Here are some of his text and figures.

tukey1.jpg

tukey2.jpg

tukey3.jpg

Although Tukey is comparing summary views (box plot vs. mean bar chart), his point holds for raw data as well. Here are JMP scatterplot versions of those plots.

nitrogen1.png

nitrogen2.png

It turns out that Lord Rayleigh's "nitrogen" from air also contained other elements unknown at the time, and the small differences led to the discovery of the element argon, for which he won a Nobel Prize.

So while a zero scale is often wise for comparison graphs, there is no substitute for making an intelligent choice. As Tukey suggests, the zero-origin plot doesn't make the case for a Nobel Prize.
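To see the scaling point without the original data, here's a hedged Python illustration; the masses are made-up values roughly in the neighborhood of Rayleigh's measurements, not his actual data:

```python
import matplotlib.pyplot as plt

# Illustrative values only (grams per globe), not Rayleigh's measurements.
chemical = [2.2990, 2.2987, 2.3001, 2.2985, 2.2995, 2.3010]
from_air = [2.3101, 2.3103, 2.3096, 2.3107, 2.3102]

fig, (zeroed, zoomed) = plt.subplots(1, 2, figsize=(8, 3))
for ax in (zeroed, zoomed):
    ax.boxplot([chemical, from_air], labels=["chemical", "from air"])
    ax.set_ylabel("mass of nitrogen (g)")
zeroed.set_ylim(0, 2.5)            # zero-origin scale: the two groups look identical
zeroed.set_title("zero origin")
zoomed.set_title("auto-scaled")    # default limits: the gap is unmistakable
plt.tight_layout()
plt.show()
```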


What statistical details do you want documented?

The documentation for JMP must meet the needs of JMP users with a diverse set of backgrounds. The needs of one group of users can differ markedly from the needs of another group. For instance, some users report that there is too much statistical jargon in the documentation, while others comment that there is too little statistical detail in the documentation. We could debate whether jargon and detail are the same thing or not, but even the group of users wanting more statistical details contains at least two sets of users. There are the statisticians who know a particular analysis and want details about the implementation choices made by the platform developer, but there are also those who aren’t familiar with a particular method and want to learn more about that particular analysis.

Over the past few releases of JMP, the documentation team has worked to improve the amount of statistical detail about the methods implemented in JMP and JMP Pro. For statisticians who want to know more about the in-depth details, we continue to add more formulas and algorithm descriptions. For users who don’t have a statistical background, we continue to add more examples and enhance the quality of the explanatory content. One method we have employed to achieve these goals is to present details of the statistical algorithms for a platform in a section at the end of the chapter for that platform. Many platform chapters now have a section at the end titled “Statistical Details for the XYZ Platform.”

But there are still platforms whose chapters are in need of statistical details! If there are specific statistical details that you would like to see included in the JMP documentation, please share them in the comments. As we continue to raise the level of statistical detail in the documentation, your comments will help the documentation team prioritize their efforts.

Note: A version of this blog post also appeared previously on the JMP User Community.


Using images to bring JMP 12 graphs to life

Earlier this year, I had the opportunity to speak at a National Wear Red Day lunch-and-learn at SAS. I was invited to share my data and experiences as we marked a day devoted to raising awareness of the sobering statistics about cardiovascular disease risk among women.

Heart attack and stroke are responsible for one of every three deaths in women and kill more women than any other disease by a wide margin. On the bright side, many important risk factors for these diseases are lifestyle-related and largely under our control. Over the past decade, National Wear Red Day has promoted awareness of these risk factors and helped bring to light gender inequalities in cardiovascular research and health care. Many women are now actively reducing their risks by adopting the Simple 7 lifestyle changes recommended by the American Heart Association.

Although I have blogged about my personal diet and fitness data and presented a Discovery Summit 2014 e-poster on the topic, this lunch-and-learn represented the first time I publicly discussed the connection between my own lifestyle data and my risk factors for heart disease and stroke. I shared this connection in part because cardiovascular disease has touched my own family. I lost one grandmother when her doctor misdiagnosed her symptoms as anxiety, failing to recognize key heart attack symptoms experienced more often in women than men. My other grandmother experienced a series of debilitating mini-strokes late in her life. I share these sad stories to help explain why I am passionate about encouraging more women to take active steps towards positive health changes and do what they can to maximize their chances of living to see their children, grandchildren and great-grandchildren grow up.

If you saw the first post in my fitness and food blog series, you saw how I took advantage of a new JMP 12 feature to embed pictures of myself in my weight data table. I pinned several representative pictures to my historical weight graph to highlight the changes in my body weight over the past 15 years. My weight metric tracks predictably with my other cardiovascular risk markers, like blood pressure, body fat, waist circumference and blood cholesterol composition.

Unlike a cholesterol test, however, tracking my weight is easy to do at home; as a result, I have lots of historical data! When I showed my weight graph during my talk on National Wear Red Day, it immediately resonated with the women in the audience. I heard from several people that seeing my pictures (in the second graph below) made the ups and downs in this chart much more meaningful than seeing the chart without them (first graph below).

Weight graph without picture 12-21-14

Weight Graph Grad School to Present 9-9-14

Obviously, I have been through many weight fluctuations over the years. I first actively tried to lose weight through dieting and exercise in middle school. I recently obtained my medical records from my undergraduate years, and they reflected a pattern similar to other stressful periods in my life: I gained weight (12 pounds in the first semester) and continued to pack on the pounds as the years went by, gaining a total of 30 pounds during my undergraduate years.

In the past, I shelved healthy habits and stopped tracking during holidays or when my time was tight. I then felt like a failure, and this negative thinking led me down a path of declining fitness, increasing weight and rising blood pressure. In reviewing my data in notebooks, I realized that I have always been most successful when I am actively tracking my efforts. Several years back, I recommitted to adopting data collection habits permanently, and this has helped me take the emotion out of my weight loss and maintenance efforts. Even in such an emotionally charged area, data can just be data, and continuing to collect it motivates me to keep my established habits going even during busy times. I track my food and workouts regardless of schedule, and I have learned to both scale down my routine during hectic stretches and enjoy social eating events without stress. My yearly health metrics now reflect my long-term commitment to positive lifestyle habits.

As I mentioned earlier, increases in my weight always lead to my other risk biomarkers heading in the wrong direction. It is hard to statistically separate the effects of diet and exercise on my biomarkers, since I tend to adopt a constellation of healthy lifestyle behaviors when actively working to address my weight. However, I observed that my risk biomarker numbers tend to be at their best levels when I am within my current, healthy maintenance weight range, and my cholesterol composition is not quite as good when I rise above that range.

Lab data

As I showed in my talk, and as you can see from the pink shading in the graph above, hearing the bad news about my heart disease risk biomarkers in early 2008 didn't prompt me to take action right away. In fact, my weight rose another 10 pounds over the next 18 months (during which I avoided yearly blood work) before I reached the point where I was ready to make the changes I needed. The important thing is that I did manage to make positive changes. Although my total cholesterol has stayed fairly consistent at around 200, you can see from the graph above that the composition of my good and bad cholesterol has changed greatly since 2008. My ratio of total cholesterol to HDL ("good") cholesterol was 3.8 in 2008, close to the worrisome threshold of 4, but it now sits near 2 after I raised my HDL 40 points and lowered my LDL 40 points.

In looking at this data, I wish I had access to multiple measurements at the same time to get a sense of my variation. Did having my blood cholesterol test on the Monday after my 2015 holiday break affect my numbers at all, compared to waiting a week post-vacation? Without cholesterol issues, I don't have an easy or inexpensive way to get more frequent cholesterol lab tests. Unfortunately, home blood cholesterol tests are expensive and don't provide the kind of detailed information I get from my yearly blood work. Although I don't have an identical twin to serve as a replicate, my youngest sister is almost exactly my height, and we look and are built very similarly. She shared with me her own cholesterol improvements after her 35-pound weight loss: She increased her HDL 6 points, reduced her LDL 25 points and reduced her triglycerides 33 points between June 2010 and November 2014, while her Hemoglobin A1C levels (an important biomarker for type 1 diabetics like her) dropped significantly.

I helped coach my sister through the changes she has made, and we in turn helped coach my mom and other sister as they have both improved their heart disease risk through lifestyle changes. My mother lost weight, became more active and improved her own cholesterol numbers. Between me, my mom and my two sisters, we have lost a combined total of more than 200 pounds over the past several years using the same strategies I have showed through my previous blog posts: reducing calorie intake and increasing activity to achieve a calorie deficit.

I started this blog post by talking about family, and will end talking about it, too. I think we sometimes forget the positive ripples that achieving health improvements can have on our social networks. I shared how my changes influenced my immediate family, but it goes even further than that. Once my mother achieved her own positive changes, she became a positive influence on a friend's post-stroke weight loss efforts, and her friend has now lost more than 100 pounds and adopted several new exercise activities.

To add pictures to your own data table in JMP 12:

  • Create a new column and change its Data Type to Expression.
  • Drag or copy/paste your pictures into appropriate cells in your Expression column.
  • If desired, change the marker color of the rows that contain pictures so that you can easily identify them in graphs.
  • Hover over points with pictures and pin the hover labels to keep them visible.

Keep in mind that the size of your pictures will affect the final size of your table when open in memory or saved to your file system. If you have the space to save large images in your table, yet want to use smaller sized versions in your hover labels, you can open up the Column Info dialog and select Expression Role near the bottom of the Column Properties list to modify the size of the picture as shown. Also new in JMP 12 is the ability to change the height of data table cells, so you can adjust the size of the thumbnail image as shown in your table.

Column Info Expression Role
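If you work outside JMP, you can get a rough analogue of pinned picture labels with matplotlib's annotation boxes. This sketch is not the JMP 12 Expression-column feature; the years, weights and image file name are all made up:

```python
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

# Made-up years and weights, and a hypothetical image file name.
years = [2000, 2005, 2010, 2015]
weights = [150, 180, 160, 145]

fig, ax = plt.subplots()
ax.plot(years, weights, marker="o")

img = plt.imread("photo_2010.png")            # hypothetical picture
thumb = OffsetImage(img, zoom=0.15)           # shrink it to thumbnail size
ax.add_artist(AnnotationBbox(thumb, xy=(2010, 160), xybox=(2010, 175),
                             arrowprops=dict(arrowstyle="->")))
ax.set_xlabel("year")
ax.set_ylabel("weight (lb)")
plt.show()
```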

I also used images in one of the final graphs I showed in my Discovery Summit 2014 e-poster to track my pregnancy weight changes alongside pictures of my expanding belly. Although I didn't end up including this one on my poster, I created an expanded version of this chart that included summary information about average calories per food item and average number of food items logged per week during my pregnancy. I noted that there was a definite decline in the diversity of my food log during March and April 2011 and a rise in the average calories per item logged.

Weight graph for blog

What was going on here? I suspected I knew the answer, recalling that early in my pregnancy, I struggled with an aversion to coffee, dairy, and many green, fresh vegetables. I reviewed my detailed food logs and summarized them with treemaps using the local Data Filter in JMP to restrict date ranges, as described in this post. My suspicions were confirmed: During the months when my stomach was most unsettled by nausea, I ate more combination foods and starchy carbs than usual, which were more calorie dense than my usual food choices.

Eating patterns early in pregnancy treemap

Eating patterns later in pregnancy treemap

What uses can you think of for pictures in your data tables, either for your own personal data or work-related projects? I think the possibilities are simply endless!


Discovery Summit Europe live blog: Beau Lotto

In the final keynote of Discovery Summit Europe in Brussels, we hear from Beau Lotto, renowned neuroscientist and Director of the Change Lab at University College London.

View the live blog of this speech.

See photos and tweets from the conference at jmp.com/live.


Discovery Summit Europe live blog: Bradley Jones and Peter Goos

On the second full day of Discovery Summit Europe, Bradley Jones, JMP Principal Research Fellow, and Peter Goos, Professor of Technology at the University of Antwerp, deliver a keynote speech on design of experiments.

View the live blog of this speech.

See photos and tweets from the conference at jmp.com/live.


Discovery Summit Europe live blog: Dick De Veaux

Williams College statistics professor and data mining expert Dick De Veaux gives a keynote speech at Discovery Summit Europe 2015 in Brussels, Belgium.

View the live blog of this speech.

See photos and tweets from the conference at jmp.com/live.


Discovery Summit Europe live blog: John Sall and Chris Gotwalt

JMP creator John Sall, who is also SAS Co-Founder and EVP, gives the opening keynote speech of Discovery Summit Europe 2015 in Brussels, Belgium. Sall is joined by Chris Gotwalt, Director of Statistical R&D for JMP, in a speech titled "Addressing the Challenges of Data Variety." The speech marks the official launch of JMP 12. To see a recording of this speech, watch the webcast premiere on Wednesday, March 25.

View the live blog of this speech.

See photos and tweets from the conference at jmp.com/live.
