Monday, November 2. 2009
Can You Dig It?
A few weeks ago, my wife and my youngest son heard what appeared to be a lunatic shrieking in our backyard. They rushed out the door to find me standing in the midst of about a dozen freshly dug holes, waving my arms and yelling at our mud-covered, 10-month-old Golden Retriever.
The dog, who had obviously excavated all of the holes, was simultaneously wagging her tail and sneezing the dirt out of her nose, utterly unimpressed with my tirade. My wife and son managed to get me inside with the promise of a cold drink and a warm compress, all the while calmly listening to me carry on about ruined landscaping and holes all the way to China. Once I had calmed down and my wife had taken the dog off to the tub to wash away the evidence, my son solemnly informed me that it would be impossible for Gracie to dig a hole to China, because there were lots of rocks in the way and, besides, the center of the earth was way too hot. He also reminded me how much fun it can be to dig.
He was right. It can be a lot of fun to dig. Now, I’m not talking about the back-breaking, working-on-a-chain-gang type of digging, but rather the treasure-hunting or the let’s-see-what-we-can-build type of digging. Just look at a bunch of kids at the beach or in a sandbox, and you’ll see what I mean. Old blues singers ask “Can you dig it?” Peter, Paul and Mary tell us that they “dig” rock & roll music. Paul McCartney planned to “dig the weeds” when he turned 64 (somehow, I don’t think that ever happened). The seven dwarves sang about “dig, dig, digging” in the mines where a million diamonds shine. The pirates of yore and modern-day archaeologists spend their days digging for buried artifacts and other treasures. Digging can be pretty cool!
As genomics scientists, we also spend our days digging. Faced with a mountain of data, we dig and we sift and we dig some more, all the while looking for the clues buried in those mountains. Without the right tools, the task can be daunting; in fact, like treasure hunters without a map, we could dig forever without ever finding anything of value. Here at SAS, we make the tools that make the digging easier.
For example, new features in JMP Genomics 4.1, which is due out later this year, include interactive, graphical tools that will allow you to visually evaluate and compare complex statistical results for thousands of genes and other markers across an entire genome. Once you’ve identified one or more genes of interest, other new features make it easy for you to annotate those genes and link your analysis to online databases. These tools won’t help you dig to China, but they will help you sift through your data until you find the treasure. Excuse me, now. I have some digging to do.
Thursday, October 29. 2009
Just the Facts, Ma'am...
Now that JMP Genomics v4.1 has gone to production, I have found a little time to catch up on my reading. As I was perusing a recent issue of the Proceedings of the National Academy of Sciences (USA), I was struck by the fact that the traditional limit of 5 pages has been scrapped and that articles of 6 or more pages have become common. Papers have grown as investigators try to pack as much information as they can into more and more space. Even authors of Science reports, which have traditionally been limited to 2500 words, get around page limitations by regularly including links to Supporting Online Materials, web sites that contain vast amounts of additional data and descriptions. All of this information tends to overwhelm the reader.
Perhaps I’m just showing my age again, but I miss the days when you were limited to a strictly enforced number of words or pages and there were no Supplementary Online Materials. Authors had to ensure that all of the supporting facts and figures needed to tell a good story were included in the article. These requirements forced authors to carefully consider which pieces of data were the most important and to use the most concise language possible. Good papers were razor sharp and told you just what you needed to know. They were a pleasure to read.
“Just the facts, ma’am…” On the old Dragnet TV series (I’m most definitely showing my age here!), Joe Friday used these words to cut through all the fluff and get down to the essentials. We follow the same philosophy when we write the documentation for JMP Genomics. We have divided the JMP Genomics User Guide, by theme, into nine different volumes. We describe each process in its own chapter. We tell you what the process does and what you need to run it. We then show an illustrative example and tell you how to interpret the results. Each chapter is structured in the same way; once you learn how it works for one process, you know how it works for every process.
A colleague of mine, who documents a different software package, recently told me that the JMP Genomics documentation was rather simple. Rather than being insulted, I took her comment as the highest form of praise. Software documentation should not contain a lot of fluff. Instead, it should just tell you what you need to know in an easily accessible format and then it should get out of your way. The JMP Genomics User Guide is designed to do just that.
Friday, August 14. 2009
Mouse in the House
I bet you know the words to Three Blind Mice and Hickory Dickory Dock and have seen at least some of Mighty Mouse, Tom & Jerry, Speedy Gonzales, An American Tail, Stuart Little, The Great Mouse Detective, Flushed Away, Ratatouille, and Despereaux. You’ve hugged Mickey and Minnie at Disney World. You’ve played the ultimate Rube Goldberg game Mouse Trap (and maybe even used the very handy 2D graphical JSL function with the same name). You have a fancy electronic one attached to your computer. You’ve read (CliffsNotes or SparkNotes for) John Steinbeck’s classic novel. A neutral observer of Western culture over the last century might easily conclude we love mice.
But hang on. Over the centuries, these furry little friends have shuttled countless fleas bearing Yersinia pestis, the microbe of bubonic plague (the Black Death), killing around one-third of Europe between 1347 and 1353 and causing numerous devastating outbreaks ever since. On a less severe but likely equally numerous scale, they’ve exasperated wives, husbands and children in homes everywhere. A $&*# one even found its way into our pantry last week; that’s not supposed to happen in 2009! Key science question: Why does it nibble daintily in bags of powdered sugar, chocolate candies and caramel but not touch the Fiber One or wheat bran? Regardless, an old-school trap baited with peanut butter summarily ended its mini-foray into food paradise.
What about the mouse’s bigger and tougher cousin, the rat? Who really loves or wants to be a rat? James Cagney, Willard, and Ratigan certainly don’t help. My father-in-law, who spent much of his career in a naval shipyard, tells stories of rats the size of felines and canines, and meaner than both.
So do we love or hate mice? Genomics will tip the scales in their favor. The little critters offer great promise to human health as a model organism, as the mouse genome is around 99% similar to ours. Commercial, government and academic institutions worldwide are making great strides in mouse comparative genomics; see the National Human Genome Research Institute Web site for some details. Top-tier researchers like David Threadgill (formerly at UNC, now at NC State) have provided great service to the scientific community with their extensive inbred lines and genetic analyses. The links between mice and men are often difficult and error prone, but steady progress continues (Arriba, Arriba! Ándale, Ándale!).
At SAS, we’re hoping to help with new features in JMP Genomics 4.1 (due out late this year) that provide graphical methods for interactively viewing and comparing advanced statistical results across an entire mouse or rat genome, including the ability to drill down and display tracks. Like Reepicheep in The Chronicles of Narnia, a hero of the 21st century could very well be one with an already-cool name: Mus musculus.
Wednesday, July 22. 2009
Technology Ahoy!
We are privileged to live in an age of advanced science and technology. Software is one of the most amazing parts of it, providing complex programs to control mind-bogglingly incredible inventions that help us with such things as unraveling the deep mysteries of genomics. As wide-sweeping as software has become, it’s important to keep it in perspective with respect to our interpersonal relationships and worldview. A fun trip I had this past weekend helped remind me of this.
The vessel: A 36-foot Catalina Sloop Rig
The voyagers:
• Pops, wise skipper and veterinarian
• Daniel, thoughtful son of Pops, Penn State devotee, and trip organizer
• Eric, fluent in Greek, Latin, and Phillies baseball
• Ben, DC native with an encyclopedic mind for trivia
• Yours truly
The voyage: From Bowley’s Quarters, Maryland (20 nautical miles north of Baltimore), to the Inner Harbor and a Saturday evening Orioles game at Camden Yards, fulfilling one of Ben’s lifelong dreams.
In critical need of stout nourishment before embarking on the Chesapeake, we frequented a converted double-wide trailer in the sticks a few miles inland. Plush accommodations consisted of lawn chairs and folding tables covered with plastic. In a few minutes, our waitress (who appeared to have been on a few voyages of her own) plopped down husks of fresh sweet corn, a pitcher of light beer and a pot of bright red steamed crabs, claws intertwined. Under expert guidance from Ben, we systematically dismantled the crustaceans and extracted the tasty edibles. With plenty of laughs and wounded fingers smarting with Old Bay seasoning, it was the perfect start.
We hoisted our pirate flag and set sail at dawn (Pacific time). Although routine by today’s standards, Pops’ sailboat was impressive and undoubtedly would have made brown sailors of old green with envy. The ropes were all neatly pulleyed; the main sail was raised easily with one of them and the furled jib simply unrolled with another. The galley came complete with refrigerator and freezer (ice cream!), stove, oven and sink. Every nook and cranny was purposefully utilized. Our electronic gear instantly informed us of water depth, boat speed, exact GPS coordinates, and wind speed and direction. We were really roughing it.
It was a glorious sunny day in the 80s with a stiff breeze against us. Via the magic of Bernoulli forces on our two sails and Pops’ skillful navigation, we steadily tacked our way southward.
To trim the jib on each side, we alternately cranked two nicely engineered double-ratcheted winches. The job was easy enough that Ben and I dozed sometimes in between tacks, earning ourselves the nickname “the napping wenches.” My primary duty was to spot crab-pot buoys, bestrung across the bay like large waterbugs. We only ran over two of them, which statistically I thought was excellent work, but the crew threatened me with a keelhauling if we hit any more. This is the practice of tying an unfortunate soul about the waist with ropes and dragging them under and across the barnacle-laden hull of the ship, iterating as necessary.
I was also responsible for lunch that day. Being in possession of the adverse allele of the CULN2 gene, I had managed the day before to bribe two of my daughters to do the honors. As Food Channel junkies, they came through in grand style with elegant ham wraps and a gourmet pasta salad. Catastrophe struck as a few of the wraps found their way to ice water in the cooler, but with the aid of one of Pops’ handy fans they were saved -- and shipmates none the wiser.
After several hours, we conquered the choppy water and crossed triumphantly under the Francis Scott Key Bridge. As if on cue, a very loud report from one of the cannons at Fort McHenry signaled our arrival and nearly startled Ben overboard. We passed several massive cargo ships, one with the sweet smell of brown sugar being scooped in huge buckets onto the conveyor chute of the Domino Sugar factory. We arrived with time to saunter through the chic Baltimore Inner Harbor condos and attractions and to wind our way up to Camden Yards.
At the ballpark during batting practice, there is keen competition to grab balls hit over the fences. With strategic anticipation sharpened on the high seas, Eric and Daniel managed to snag three. The game itself was rousing and close, but the Orioles lost their lead near the end.
Ben felt a curse and said he hadn’t been present for a win in this stadium for seven years. Then in the bottom of the 12th, Mora cracked a spectacular walk-off homer, and the dream was realized. Eric, watching real-time cell-phone updates, was more pumped that the Phillies came back from a 5-0 deficit to win 8-7 in the ninth. Even my poor Cleveland Indians eked out a victory, and a night of baseball perfection was complete. We celebrated back on the dock, staying up way past bedtime.
On the return trip the next day, alas, the winds were sucked to a standstill by Scylla and Charybdis. We reluctantly overcame them with the iron mainsail (compliments of Rudolf Diesel), and even had time to anchor offshore for a while and cannonball off the back of the boat.
This trip was a great mix of old and new, of nature and gadgetry. The best part was the camaraderie and conversation. We had some treasurable time for philosophy and theology, peppered with Nittany Lion – Buckeye trash talk. (Those with Y chromosomes usually require such things before they can open up.) Technology rocks, but as sailors of all time will attest, there’s no substitute for seamates.
Friday, June 19. 2009
Bogeys, Pictures and Numbers
This weekend features one of my all-time favorite sporting events: the golf US Open (plus Father’s Day on Sunday provides a convenient guilt-free excuse to actually watch it). This year, the tournament is held at the Bethpage Black course just outside of New York City. It has a classic sign:
Not sure who “We” represents, but there is an unquestionable tone of authority, with prepositions, definite and indefinite articles all capitalized. And note this course is not just for “Skilled Golfers”, but “Highly Skilled Golfers.” (Hats off to the pros this week who are also battling Mother Nature under very soggy conditions.) Some data sets should come with a similar warning. In genomics, we are now faced with experiments conducted on thousands of individuals with millions of measurements on each across a variety of complex molecular domains: genetic markers, transcript abundance, copy number, microRNA, protein and metabolite intensities, not to mention thousands of standard phenotypes. Analyzing such data sets properly certainly requires skill along with the best possible software. A primary goal of JMP since its inception more than 20 years ago has been to provide a dynamic and optimal combination of both statistics and graphics. Drop one of this pair, and you are going to miss something critical. JMP Genomics, although much younger, is definitely building on the same philosophy. Accomplishing this goal is difficult, but we continue to make progress and relish your feedback on how to do it better. Confession: Every time I pass a mirror and find no one is looking, I practice my golf swing. I’m a sucker for every newfangled idea on how to hit a golf ball better -- trust me, there is an infinite supply of them -- and just have to try it. I’ve been tinkering with my dang swing since graduate school and still don’t have it right. Not seeing him enter, I almost knocked a guy out one time with my mock follow-through in a small men’s room. One technology that’s a godsend is digital video with slow motion. Interactively successive freeze frames taken from good angles show exactly what I’m doing (even though I think I’m doing something else) -- that, of course, and actually striking that dumb little 1.68 inch diameter sphere and adding up my score. 
Pictures and Numbers. One day they might even get me to “Highly Skilled.”
Posted by Russ Wolfinger in Genomics, JMP - General, Statistics at 09:39
Monday, June 8. 2009
Benefitting from the Wisdom of Others
For about a year now, I’ve been having trouble with my faithful Craftsman Eager-1 push mower, which I bought at Sears about 15 years ago. The daggum thing starts up fine but then cuts off after about 2 seconds. This has led me to adopt the following algorithm:
1. Press rubber priming balloon 5 times.
2. Pull rip cord to start engine.
3. If the engine does not start, go to Step 1.
4. If the engine starts but then stalls after a few seconds, utter expletives and go to Step 1.
5. If the engine stays on, mow the grass.
6. If the engine cuts off again, utter stronger expletives and go back to Step 1.
My problem is that the number of iterations through Steps 1-4 has been steadily increasing by about 2 times per month. Now that I’m north of 20 times for just one mow, our longstanding relationship is really on the rocks. I think it’s time for a new mower, but what am I to do with this one?
On my way to the store to check out the latest new models (urggh-urggh), I pass a small house with a hand-written sign out front:
Good Used Lawn Mowers 555-1234
“What the heck,” I say to myself as I pop the number into my cell. After a few rings, an elderly voice with a friendly Southern drawl answers:
“Hello.”
“Hi, I’ve been having trouble with my mower and was wondering if you might be interested in it.”
“Bring it by on Saturday at 9 in the mornin’, ‘cause I like to sleep in on weekends. When you come, knock on the back door.”
“OK, see you then.”
On Saturday, I pull around behind the house to find a double-wide carport with what must be the largest assemblage of used hand mowers in the state of North Carolina. Some are obviously very old but are neatly arranged, and all appear to be ready for action. I knock on the door, and a spry old gentleman with a flannel shirt and workman pants pulled up over his stomach with suspenders greets me with a firm handshake and beckons me to an overstuffed chair in his small living room. We exchange pleasantries, and I learn he is 91 years old and spent 45 years as a railroad engineer. He has lived in Cary his entire life, and after his wife passed away, he started tinkering with mowers. We go out back to take a look at the Eager-1.
He deftly removes the air filter and squirts a bit of starting fluid into the exposed hole. “Give ’r a pull,” he tells me. I obey, and the Tecumseh engine roars to life but then quickly begins to stall in its usual fashion. But just before it completely dies, he gently places his index finger over the hole and, lo and behold, the engine coughs and cycles back to full power! It tries to stall again, but with perfect timing he chokes off the hole just enough to maintain that magic mixture of fuel and air. After a few more taps the engine is running steadily and better than ever. “Give ’r a try,” he says, pointing to a patch of grass. I am so ecstatic that I mow his whole backyard.
As software users and developers, we’re often tempted to abandon and bash old technologies and go for the latest, greatest new thing. While I’m certainly in favor of using the best means possible for the task at hand, sometimes those best means are those that have stood the test of time and have benefitted from the wisdom of those who have struggled through and solved myriads of problems using them. I put classic SAS software into this category – it provides a richly deep and powerful foundation for the processes in JMP Genomics. Our team continues to learn about clever new ways to use it to more effectively handle genomics data.
By the way, using this new-to-me-but-really-old-school technique, I can now start my Eager-1 with a single pull.
Tuesday, May 26. 2009
Our Evolving Science
Sometimes, I’m totally astounded at how much our science has advanced since my days as a graduate student. Back then, the closest anyone got to “genomic” studies of eukaryotic organisms involved “melting” DNA and watching it come back together using CoT curves. Cloning and sequencing a single cDNA could get you a paper in Science or even (if it was especially important) Nature. As a classically trained molecular biologist, I was used to thinking about single genes. We ran northern blots and probed them with single cDNA probes. On a good day, our sequencing gels could resolve up to 300 nucleotides, maybe a few more if you did multiple loadings. Experiments were labor-intensive and not very quantitative. While we could generate some very pretty pictures, we certainly couldn’t do statistics. Back then, our science was limited by a lack of data.
Since that remote time (half-way back to the Pleistocene, as my kids would say), we have made incredible progress. We have sequenced the genomes of a growing list of diverse organisms. We can quantitatively assess the expression of not just one gene, but of every gene in an organism, all at the same time. We can ask global questions that we could never have asked just a few short years ago, and we can get answers to those questions in a relatively short period of time. In fact, today we have the opposite problem: far too much data! We used to spend months or even years gathering a few crucial data points that could be assessed by a mere glance at an autoradiogram. Today, we can do an experiment in a fraction of the time, but the analysis takes so much longer.
Fortunately, our tools and skills are evolving along with our science. The initial release of JMP Genomics in 2006 married the visualization capabilities and ease of use of JMP software with the power of SAS. It offered researchers more than 100 different processes for importing, manipulating and analyzing the vast amounts of data generated by the new technologies. The recent release of JMP Genomics 4.0 builds on an already strong platform of data management and analysis tools. We have added features to and enhanced the power of all of the existing processes. In addition, we have added 16 totally new processes. In fact, this latest release contains almost 200 different processes for importing, assessing, normalizing, annotating, and exploring genetic and microarray data. Every process is fully documented and available for you to use as is or to adapt to your particular needs. You can modify existing processes and workflows or build new ones and add them to your menus. In addition, if there is something special that you need, just let us know, and we’ll work with you to build it. As always, we remain committed to helping you meet your research goals. We have come so far in such a short time.
Where will our science go next? You will help decide, and JMP Genomics will help you take us there. We’re already hard at work on our next release. Stay tuned. The best is yet to come.
Tuesday, May 19. 2009
3-D Pie Reply
Thanks for the enlightening comments to my blog post "I Like 3-D Pie Charts" and for the new graphs. While the bar charts from Joe and John are very nice, I prefer vertical bars because of their connection with the gravity orientation of trees, mountains, buildings and of course cell-phone signal strength. It’s interesting that Paige and Lee suggest 2D pie charts. Warning: The graphics gods are watching you. Daniel, I’m glad you did not sleep through Art Appreciation in college.
SAS is well known for its corporate amenities, and its cafés are no exception. I was in line for lunch the other day and came upon the always tempting dessert case. The middle shelf featured over a dozen haphazardly arranged pieces of another one of my all-time favorites: chocolate crème pie. Simply irresistible. Maybe it’s my upbringing, but I had absolutely no problem identifying the largest slice. I don’t think it’s a guy thing either, because in the extra second I took reveling about how good it was going to look on my tray, the woman behind me handily placed it on hers. How dare she! But I was smiling inwardly as I quickly grabbed the next largest one, not only because we were likely identical by state for a latent canine quantitative trait, but because I had some more assurance that at least some humans are actually decent at assessing the size of three-dimensional wedge shapes. The psychological experimental evidence cited to the contrary is largely conducted on college students who are really only good at determining how much liquid remains in bottle-shaped objects. This has me wondering if exploding pieces of the pie chart might actually be helpful in avoiding volumetric distortion. If nothing else, exploding and appropriately labeling one or more pieces seems like a great way to emphasize them, and, conversely, leaving the really thin slices unlabeled is desirable when they are not of interest. Regarding ordering of slices or bars, I forgot that in examples like this one there is an overriding analytical criterion. In statistical modeling in general, one typically puts main effects first, then two-way interactions, then succeeding higher order interactions and finally residual/unexplained variance. One way to accomplish this (or any other ordering) in JMP is with the Value Ordering column property. For our example, if you first run JSL code like the following…
…then the bars appear in the desired order. We use such a column property in JMP Genomics scripts to make sure chromosome plots appear in numerical instead of alphabetical order. If you’re having trouble comparing sizes of slices in the next pie chart you see, just pretend that it’s your favorite dessert and you’re really hungry. Works every time.
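For readers who want to see the ordering idea spelled out, here is a minimal sketch in Python rather than JSL. It is only an illustration of what a value-ordering property accomplishes, not the JMP implementation; the function names `order_by` and `chromosome_key` and the effect labels are made up for this example.

```python
import re


def order_by(values, ordering):
    """Return values sorted to follow an explicit ordering list.

    Values missing from the ordering list are placed at the end,
    keeping their original relative order (sorted() is stable).
    """
    rank = {value: index for index, value in enumerate(ordering)}
    return sorted(values, key=lambda value: rank.get(value, len(ordering)))


def chromosome_key(name):
    """Sort key that puts numbered chromosomes in numerical order,
    followed by lettered ones (e.g. X, Y) in alphabetical order."""
    label = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label.upper())


# Hypothetical variance-component labels: main effects first,
# then the two-way interaction, then residual variance.
desired = ["Treatment", "Time", "Treatment*Time", "Residual"]
effects = ["Residual", "Treatment*Time", "Time", "Treatment"]
ordered_effects = order_by(effects, desired)

# Plain alphabetical sorting puts "10" before "2"; the key fixes that.
chromosomes = ["10", "2", "X", "1", "Y", "21"]
ordered_chromosomes = sorted(chromosomes, key=chromosome_key)
```

An explicit list suits the modeling case, where the order comes from the analysis itself, while a key function is handier for chromosomes, where the rule is simply "numbers before letters."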
Posted by Russ Wolfinger in Data Visualization, Genomics, JMP 8, JSL, Statistics at 09:33
Tuesday, May 12. 2009
I Like 3-D Pie Charts
So you know I’m a faithful left-brained statistician who makes every attempt to adhere to the highest professional standards of data visualization and analysis. Graphics luminaries like Edward Tufte and Stephen Few have made very valuable contributions to the field, and I bow to their wisdom.
But I have a secret confession to make: I like three-dimensional pie charts. It’s wrong, and I don’t know why; I just like them. Actually, I may be starting to figure out why. (BTW, new research reveals that a telltale sign of having at least one geek allele is a propensity to begin sentences with the word “actually”; we’re searching for the corresponding genes with JMP Genomics.)
We know the commandments from the graphics gods: Keep it simple. Avoid chart junk. Let the data shine through. Favor linear over spatial comparisons. Eschew volumetric distortion. Wield Ockham’s Razor. Obey these commandments at all times. 3-D pie charts are the worst offenders and were long ago banished to graph purgatory. Few has explained. Why do I still like them?
Let me be your graph optometrist for a sec and ask that annoyingly simple question: “Better A or Better B?”
So ... Better A? [Figure A: bar chart]
Or ... Better B? [Figure B: 3-D pie chart]
While a usual reply to my optometrist is “I can’t freakin' tell because of those stinging drops you just put in my eyes,” in this case, the answer for me is B.
Some more background: The purpose here is to quickly and effectively convey the dominant sources of variation in a microarray experiment. Without doubt, the bar chart A has more detail and nicely uses linear instead of spatial comparison. It’s a great graph and in fact is the default one shown for such analyses in JMP Genomics.
Why B? It takes advantage of color, aggregation and 3-D aesthetics. The labels enable immediate identification with the data instead of forcing me to eyeball down to an X-axis and tilt my head to read them. In addition, blogosphere exigencies require omission of a critical feature: interactivity. It’s a spinnable graph that comes complete with slider bars that let you adjust degree of explosion and shininess. (Thanks to JMP experts Xan Gregg, who has written about 3-D pie charts in JMP, Craige Hales and David Barbour.) The ability to personally control the graph won me over.
Graph B also appears to be better suited for rapid scan viewing as recommended by Bill Cleveland. Heaven forbid: multiple 3-D pie charts! Could it be my artistic right brain has suddenly come to life like a vampire after decades of dormancy and is in need of a consultation with Van Helsing?
This all has something to do with philosophical presuppositions. Dutch philosopher Herman Dooyeweerd and colleagues have extensively discussed 14 Aspects of Reality arranged in a specific order: 1. Numerical 2. Spatial 3. Kinematic 4. Physical 5. Biotical 6. Sensitive-psychical 7. Logical 8. Cultural-historical 9. Social 10. Economical 11. Aesthetical 12. Juridical 13. Ethical 14. Fiducial
They make a convincing case that these ordered aspects are irreducible in the sense that you cannot eliminate any of them without getting into irrecoverable binds and self-refuting contradictions. Furthermore, nearly all philosophical conflicts throughout history have arisen from different attempts to make one of these aspects the divine/ultimate one upon which all others depend. (Such reductions have often turned Ockham’s Razor into Sweeney Todd’s.) Although there is a lot more to it, everything in creation possesses each of these aspects in varying degrees. For example, the computer on which you are reading this blog exists in space, has physical properties, has economic value, etc.
With reference to the bar and pie charts above, the bar chart relies primarily on the numerical, spatial and logical aspects, whereas the interactive pie chart adds aesthetics and kinematics. These latter two aspects make a big difference and enable the pie chart to connect with the viewer on more levels. We’re naturally drawn to things that are beautiful and exhibit pleasing colors, symmetry, and interactivity. We travel the world to engage with captivating wonders and works of art, both natural and man-made.
We reward business professionals and politicians who build their careers not on the substance of their message, but on the elegance and flair with which they convey it. We play Guitar Hero and Rock Band for hours on end. The pie chart also offers a biotic connection to various round delectables. More confessions: My wife makes the world’s best grated-apple pie, and I grew up devouring my mom’s to-die-for strawberry pie. I love pizza and cheesecake and even eat quiche from time to time. So I’m environmentally conditioned to be sorely tempted by the evil 3-D pie chart, and I’ve succumbed. So you still prefer a bar chart? That’s fine; the gods are pleased. For now, I’m cranking up Warrant’s “Cherry Pie” on my iPod and playing with some more interactive 3-D graphics.
Thursday, May 7. 2009
Drill Down Time
My youngest daughter was recently watching a rerun of Home Improvement featuring Tim “The Tool Man” Taylor and his mock TV show Tool Time. She said “Dad, you and Tim both seem to get into a lot of trouble when playing with power tools.” My wife added that throwing a statistical degree into the mix appears to have increased the trouble coefficient. My vehement objections were met with a reminder of several recent incidents, including how we spent most of New Year’s Eve this year without water due to a slight mishap with a copper joint I had sweated in a downstairs bathroom renovation project. OK, so I’m a DIY junkie and have a garage full of tools, but the operative word in this instance is slight.
Of course no home project would get very far without a power drill, and I have three: an 18V cordless with dual batteries, an old-school 3/8” Craftsman that works great whenever those accursed batteries weaken, and a red 1/2” Milwaukee complete with side handle and heavy-duty chuck. The Milwaukee is great for man stuff like drilling holes in railroad ties and mixing big batches of thinset mortar. Tim would give it an “Urghh Urghh” (my feeble attempt to spell his signature simian/caveman/Scooby-Doo/Chewbacca grunt).
Which brings us to the topic of this blog: Drill Down. Genomics data sets typically require conducting tens of thousands to millions (and in some cases even billions) of statistical tests, and we are immediately faced with how to most effectively present and explore the most significant ones. The idea behind Drill Down is to start with a summary graphic that highlights interesting points and enables you to interactively zoom in on them and obtain more statistical or graphical detail hierarchically. A prototypical example of such a summary graph is a volcano plot, e.g. as generated by JMP Genomics’ One-Way ANOVA Analytical Process:
Each point represents a gene. The X-axis charts some measure of change (e.g. a difference between least-squares means) and the Y-axis is the corresponding –log10 p-value. The plot has a characteristic V-shape because of the statistical relationship between X and Y. The most significant points are towards the top and resemble lava rocks erupting from a volcano [insert Tim Taylor grunt here] and the least significant ones--a vast majority--are nicely overstruck at the bottom. The plots are dynamically linked to other displays like dendrograms and parallel plots, complete with brushing and zooming, capabilities not available in R. (Historical note: I believe Greg Gibson and I coined the term “volcano plot” in a 2001 Nature Genetics paper.) After selecting the points you want in the volcano, go to the “Action Buttons” window, which offers a variety of Drill Down capabilities:
Drill Down with such Action Buttons is different from zoom a la Google Earth; you can launch sophisticated statistical analyses on the selected points, taking full advantage of either SAS or JMP functionality. For example, “Fit Model to Input Data for Selected Rows and Plot LS Means” launches JMP’s Fit Model platform on the selected genes (using the same model that you originally specified) and produces an informative integrated mix of statistical tables and graphics. Clicking “Plot OneWay Means by Chromosome and Position” fits one-way models in a SAS Data Step and then generates graphs like the following:
In this case the expression profiles of three treatment conditions are plotted along a very short section of the genome. The width of the confidence bands is interactively adjustable and the plot is embellished with gene track annotation that was automatically incorporated from a supplemental file specified in the initial JMPG dialog. You can drill down further and bring up various web pages corresponding to genes of interest. So “Urghh Urghh” for Drill Down! It’s available in a few other places in JMP Genomics (e.g. Multidimensional Scaling and Predictive Modeling) and you can expect more in future releases.
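For readers who like to see the arithmetic, here is a minimal Python sketch of how the volcano plot coordinates described above come about. It is not JMP Genomics code: the function names are made up for illustration, and it assumes each gene's test statistic is approximately normal, so the two-sided p-value comes from a normal tail area rather than from the t or F distribution an ANOVA would actually use. For a fixed standard error the p-value shrinks as the absolute difference grows, so points fan up and out on both sides of zero, which is where the V shape comes from.

```python
import math
import sys


def two_sided_p(z):
    """Two-sided p-value for a z statistic (normal approximation, an assumption)."""
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # Guard against underflow to 0 so that log10 is always defined.
    return max(p, sys.float_info.min)


def volcano_point(diff, se):
    """Return (x, y) for one gene: x is the difference, y is -log10(p)."""
    return diff, -math.log10(two_sided_p(diff / se))


def select_significant(points, min_y):
    """Keep the points at or above a chosen height on the Y-axis."""
    return [pt for pt in points if pt[1] >= min_y]


if __name__ == "__main__":
    # Hypothetical differences between least-squares means, common standard error.
    diffs = [-2.0, -1.0, -0.2, 0.0, 0.3, 1.0, 2.0]
    points = [volcano_point(d, se=0.5) for d in diffs]
    for x, y in points:
        print(f"x = {x:5.1f}   y = {y:6.2f}")
    print("selected:", select_significant(points, min_y=2.0))
```

The `select_significant` helper stands in for brushing the top of the plot with the mouse; the cutoff of 2.0 is arbitrary.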
By the way, that bathroom project is now finished and turned out beautifully. All three of the drills came in handy (if you want to borrow the Milwaukee please let me know). It took a mere four months to complete and cost only 3x more than our initial high-end estimate. More importantly, interlocutors are happily silent (at least until the next home improvement project begins).
Wednesday, April 29. 2009
Mixed Models: Yes and No
When asked if I miss being the lead developer of mixed model software at SAS, I usually reply “yes and no.” The “yes” comes from feeling very fortunate to have been involved with this powerful methodology throughout my nearly 20 years at SAS and a part of the strong legacy that began with the work of Jim Goodnight on PROC GLM and PROC VARCOMP in the early days of the company. A big debt of gratitude goes to Dave Delong, who initially pointed me in this direction, and to John Sall and Randy Tobias (our newest ASA Fellow!), who provided key pieces of code that formed the beginnings of PROC MIXED. Interestingly, John’s code was based on the pioneering work of Jennrich and Schluchter on covariance-structure modeling (the R side of the mixed model), while Randy’s code focused on generalizing REML estimation of variance components (the G side of the mixed model). Putting the two together was somewhat serendipitous and turned out to provide a wonderful technology for such diverse applications as animal breeding, clinical trials, manufacturing quality control, education assessment, space-time modeling and statistical genetics.
The “no” comes from confidence in the very impressive work done by our current mixed model gurus: Oliver Schabenberger (PROC GLIMMIX), Chris Gotwalt (JMP), Paul Wright (EVAAS) and Tianlin Wang (PROC HPMIXED). That, and my now decade-long passion for genomics and the true joy that comes from leading a top-notch team and working with all of the dedicated professionals who have helped JMP Genomics become successful. I’ve been amazed at how mixed model theory is a unifying theme throughout statistics, encompassing such methods as empirical Bayes, ridge regression, time series, kriging, and smoothing splines. Did you know you can fit a support vector machine using the radial basis functions in PROC GLIMMIX? In a bit of a throwback to SAS in the '70s, we’re using variance components on principal components to quantitatively compare sources of variability in microarray data. Most recently, we’ve been exploring how mixed models can effectively adjust for population stratification in genome-wide association studies. So I can’t seem to get away from mixed models — not that I really want to! Let me close with a mixed model curiosity that I was looking at just last week. 
Consider the following SAS code, which compares a simple linear regression model with the same model fit with random effects:

/* simulate data for a simple linear regression and compare fixed and random models */
%let seed = 2817340;

data xy;
  do slope = 1 to 5;
    do nx = 1 to 10;
      x1 = rannor(&seed);
      x2 = rannor(&seed);
      do rep = 1 to 10;
        y = x1*slope + x2*slope*2 + rannor(&seed);
        output;
      end;
    end;
  end;
run;

/* fixed model: x1 and x2 in the MODEL statement */
ods exclude all; ods noresults;
proc mixed data=xy method=ml;
  by slope;
  model y = x1 x2 / s;
  ods output fitstatistics=fs covparms=cp solutionf=sf;
run;
ods exclude none; ods results;

title "Fit Statistics from Fixed Model";
proc sort data=fs; by descr slope; run;
proc print data=fs; run;

title "Covariance Parameters from Fixed Model";
proc print data=cp; run;

title "Beta Hat from Fixed Model";
proc sort data=sf; by Effect Slope; run;
proc print data=sf; run;

/* random model: intercept-only MODEL statement, x1 and x2 in the RANDOM statement */
ods exclude all; ods noresults;
proc mixed data=xy method=ml;
  by slope;
  model y = / s;
  random x1 x2 / s;
  ods output fitstatistics=fsr covparms=cpr solutionf=sfr solutionr=srr;
run;
ods exclude none; ods results;

title "Fit Statistics from Random Model";
proc sort data=fsr; by descr slope; run;
proc print data=fsr; run;

title "Covariance Parameters from Random Model";
proc sort data=cpr; by CovParm Slope; run;
proc print data=cpr; run;

title "Beta Hat from Random Model";
proc print data=sfr; run;

title "Gamma Hat from Random Model";
proc sort data=srr; by effect slope; run;
proc print data=srr; run;

The second PROC MIXED call does something that I would have (up until last week!) recommended against: using simple covariates in a RANDOM statement with no SUBJECT= effect. It turns out that the gamma-hat estimates (the empirical BLUPs) are virtually identical with the beta-hat estimates from the first simple regression model. So covariance structure modeling is intimately connected with mean modeling.
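One way to see why the gamma-hats track the beta-hats is to write out the BLUP for a single covariate. With a fixed intercept, one random slope and independent errors, the mixed model equations reduce to a ridge-type estimate, gamma-hat = Sxy / (Sxx + lambda), where lambda is the residual variance divided by the variance component; lambda = 0 gives back ordinary least squares. The Python sketch below only illustrates that algebra and is not a translation of PROC MIXED: it uses one covariate instead of two, plugs in the variances rather than estimating them by ML, does not reproduce the rannor stream, and the function names are hypothetical.

```python
import random


def centered_sums(x, y):
    """Return (Sxx, Sxy), the centered sum of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, sxy


def ols_slope(x, y):
    """Ordinary least-squares slope with an intercept in the model."""
    sxx, sxy = centered_sums(x, y)
    return sxy / sxx


def blup_slope(x, y, resid_var, slope_var):
    """BLUP of a single random slope (fixed intercept), variances treated as known."""
    sxx, sxy = centered_sums(x, y)
    return sxy / (sxx + resid_var / slope_var)


def simulate(slope, n_x=10, n_rep=10, seed=2817340):
    """Same layout as the SAS data step, but with one covariate and Python's RNG."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n_x):
        xi = rng.gauss(0.0, 1.0)
        for _ in range(n_rep):
            x.append(xi)
            y.append(slope * xi + rng.gauss(0.0, 1.0))
    return x, y


if __name__ == "__main__":
    for slope in range(1, 6):
        x, y = simulate(float(slope))
        # A variance component of 1.0 is an arbitrary plug-in value for illustration.
        print(slope, ols_slope(x, y), blup_slope(x, y, resid_var=1.0, slope_var=1.0))
```

With 100 observations, Sxx is usually large compared with lambda, so the shrinkage toward zero is slight and the two columns printed by the sketch should be close.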
If you’re really into mixed models (or would like to be), here’s a question for you: Why are the variance component estimates from the second model approximately equal to the square of the slopes? Hint: The answer is not “yes and no.”
Posted by Russ Wolfinger in Genomics, SAS Integration, Statistics at 13:28
Tuesday, December 9. 2008
JMP Genomics User Guide Has Been Updated
Our goal at JMP Genomics has always been to make it easier for you, the genomics scientist, to analyze and interpret your data. The JMP Genomics User Guide has been there to help you choose which analytical process(es) best meet your needs by showing you what you need to run each process and, through the use of an illustrative example, how to evaluate and interpret your results. This documentation has grown and evolved as JMP Genomics has grown and evolved. We’re proud to announce and make available to our users the newest edition of the JMP Genomics User Guide.
This update to the JMP Genomics User Guide combines and extends the best features of the original JMP Genomics User Guide and the JMP Genomics User Guide – Supplement that you received with JMP Genomics. Existing chapters have been revised, and new chapters have been added. All chapters have been fully updated with detailed illustrations and explanations for our current version, JMP Genomics 3.2. The updated JMP Genomics User Guide is divided among the nine different volumes listed below:
To access this updated resource, just download the zipped file from the JMP Genomics Web site, unzip it and copy it into your JMP Genomics documentation folder. After you install it, you can access all nine volumes as before: Just select Genomics > Documentation and Help > User Guide to go to the User Guide entry page. From this portal, just click on the appropriate link to go to a specific manual. Each manual is dynamically cross-referenced to each of the other manuals – just follow the links embedded in the chapters to access the additional information you need. Registered users of JMP Genomics should watch their Inboxes for an e-mail containing instructions on how to download and install the latest update to the JMP Genomics User Guide.
Wednesday, September 17. 2008
Review Praises Marriage of JMP and SAS in JMP Genomics
Scientific Computing has published a review of JMP Genomics 3.2 by statistician John Wass titled "The Magic of Genomics." Wass also evaluated an early version of JMP Genomics in 2006.
JMP Genomics has evolved fast and come a long way since those early days, and Wass’ review recognizes that – along with the fact that genomics software in general is highly complex. Wass notes that JMP Genomics 3.2 “now is no more cumbersome than most, easier to use than others, and offers the statistical power of SAS while proffering the simplicity of JMP, altogether a fine marriage.” Indeed, JMP and SAS make a powerful combination. In my experience, the integration of JMP and SAS appeals to JMP users and non-users alike because users can both analyze and visualize large data sets.
Wass’ review also incorporates a brief history of the use of statistics by geneticists and describes a sample workflow in detail. It’s also packed with screenshots, which you can enlarge to get a closer look. He praises the thorough documentation and SAS’ respected reputation in statistics. If you'd like to learn more about JMP Genomics after reading the Scientific Computing review, sign up for one of our free Webcasts, or download a white paper or the PDF of the product brief. Also, our site summarizes the new features in our last few versions.
Wednesday, July 23. 2008
Support for Larger Genotype Data Sets in JMP Genomics 3.2
The introduction of 1 million SNP chips from Affy and Illumina, along with NextGen sequencing, has led to larger and larger SNP data sets. While JMP Genomics 3.1 supported analysis of SNP data sets as large as 1 million SNPs x 4,000 individuals, JMP Genomics 3.2 includes a number of processes that support data sets as large as 1 million SNPs x 10,000 individuals: Marker Properties, Missing Genotype by Trait Summary, Recode Genotypes, Case-Control Association, SNP-Trait Association and PCA.
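To get a feel for what a 1 million SNP x 10,000 individual data set means on disk, here is a back-of-the-envelope Python sketch. It assumes each genotype occupies 8 bytes, which is the default length of a SAS numeric variable; whether a given genotype data set is actually stored that way is an assumption, as is the allowance for working space, so treat both as knobs to turn rather than facts about JMP Genomics. The function names are made up for illustration.

```python
GB = 10 ** 9  # decimal gigabyte


def dataset_bytes(n_snps, n_individuals, bytes_per_genotype=8):
    """Raw size of an n_snps x n_individuals genotype table (assumed 8 bytes per cell)."""
    return n_snps * n_individuals * bytes_per_genotype


def free_space_needed(data_bytes, working_factor=2.0):
    """Scratch space to set aside, as a multiple of the data set size (an assumption)."""
    return data_bytes * working_factor


if __name__ == "__main__":
    size = dataset_bytes(1_000_000, 10_000)
    print(f"data set:   {size / GB:.0f} GB")
    print(f"free space: {free_space_needed(size) / GB:.0f} GB")
```

Under those assumptions the table itself comes to 80 GB, which is why drive space rather than RAM tends to be the thing to watch.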
Because of numerous code improvements in JMP Genomics genetics processes, performance for large data sets has also improved dramatically. The main limitation for working with these data sets is hard drive space, since these data sets are processed in SAS code, and SAS is file-based rather than RAM-limited. A simulated 1 million SNP x 10,000 sample data set is about 80 GB and requires about twice as much free space on the hard drive to analyze. Users working with such data sets commonly use 1 terabyte or larger hard drives.
To deal more efficiently with testing the limits on data set size, all JMP Genomics team members recently upgraded their testing and development Windows XP PC workstations to Dell 755s with large dual hard drives, dual-core processors and 3-4 GB RAM. Even so, delivery of a large genetics data set to be analyzed for MAQC sent us searching for extra storage for one of our main testing servers. Although we’re just getting started analyzing that data set, it’s given us a new appreciation for the challenges of storing and analyzing such large amounts of raw data. We are all excited to dig into a real, large data set. Simulated sets may help us test the limits of our code, but there is nothing like digging into real data.
Thursday, July 10. 2008
GCRMA Implementation in JMP Genomics 3.2
For JMP Genomics 3.2, we had several requests from users to implement GCRMA background correction in our CEL import process. Similar to the request to implement RMA for JMP Genomics 3.0, this was primarily motivated by customers’ desire to process relatively large data sets without being limited by RAM. Several customers (and our own testers) reported being limited to processing about 60-80 HU133 CEL files at a time using the GCRMA implementation in R/Bioconductor on a 32-bit 2GB Windows XP machine. We saw this when using R/Bioconductor as a standalone application or through the Bioconductor Expresso wrapper in JMP Genomics.
The JMP Genomics GCRMA implementation overcomes this memory limitation and allows processing of hundreds or even a thousand CEL files at a time. During the process, developer Tzu-Ming Chu worked through intermediate steps to implement the algorithm in SAS code. Similar to other commercial implementations of GCRMA, our implementation uses a version (2.1) of the algorithm, which depends on mismatch probes to estimate the background correction. JMP Genomics GCRMA gives results that are highly correlated with the R/Bioconductor implementation. Correlations among log2 GCRMA normalized intensities generated by the two implementations for the same arrays ranged from 0.995 to 0.999 for several data sets we tested. Incidentally, Tzu-Ming did find that several steps in the R implementation (e.g., line fits to a subset of data for estimating non-specific binding affinity) were sensitive to the nature of the random subsets chosen by different runs of the algorithm. He also found that the current released version of the gcrma package differs from the beta version currently under development, so users of this package should be aware of potential changes coming in a later version of the algorithm. Though the GCRMA normalization was introduced in Wu et al. (2004), the implementation in Bioconductor varies from version to version and may not match exactly what the original paper describes.
Tzu-Ming and the other developers always urge me to point out that normalization methods that include quantile normalization, such as RMA and GCRMA, are rather severe methods. In a Nature Biotechnology publication from MAQC I, Tong et al. (2006) showed clearly that Affymetrix external controls (probe sets with prefix AFFX) tend to perform inconsistently after GCRMA and RMA normalizations. Though the correlation among arrays does improve after GCRMA and RMA normalizations are applied, users of these methods may risk over-normalizing their data.
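To see why quantile normalization can be called severe, here is a bare-bones Python sketch of the basic idea: sort each array, average the sorted values across arrays rank by rank, and hand each probe the average for its rank. This is not the RMA or GCRMA code in JMP Genomics or Bioconductor; it skips background correction and summarization entirely, breaks ties by position rather than averaging them, and the function name is made up. After the step, every array contains exactly the same set of values, so any real array-wide difference in distribution is gone.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of intensities, one list per array."""
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean across arrays of the values at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    # Two tiny hypothetical arrays on very different scales.
    arrays = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
    print(quantile_normalize(arrays))  # [[6.0, 12.0, 18.0], [18.0, 6.0, 12.0]]
```

The second toy array started out five times brighter than the first; after the step the two differ only in which probe holds which value.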
Our primary motivators for implementing these methods in JMP Genomics were demand from commercial customers and the popularity of these algorithms in the genomics analysis market. As I sometimes remind the development team, we can’t tell the market what it wants. For our young product, this is the best example yet of this phenomenon.
ABOUT THIS BLOG
JMP Statistical Discovery Software from SAS
is proud to bring you this blog on all things related to
data visualization, visual Six Sigma, design of experiments
and other statistical topics.
The blog content appearing on this site does not necessarily represent the opinions of SAS. Your use of this blog is governed by the Terms of Use.

