What can a SAS Partner from Sweden and a space physicist at NASA learn from each other?
Read how Big Data can be a connection point between academia and the business world, where the two help each other learn new and old methods so that both parties can reach insights quicker. (Mr Mohammadi graduated in 2012 from our Data and Business Analytics consultant/Data Scientist programme in Sweden.)
What could a space plasma physicist have in common with a data warehouse consultant? At first glance, not much, but is that really so?
If we strip away the space physics terminology, ignore fluid dynamics and Maxwell's equations, and instead focus on what these scientists actually do with their data, it might not be so foreign after all. The fact is that both numerical physicists and experimentalists are completely dependent on their data for new insights. The data source is usually a numerical simulation based on a model or, in the case of an experimentalist, the instruments with which they measure their experiment. Either way, the result can be truly Big Data, huge even.
Next, the physicist would prepare the data for analysis in different ways. If you have a data warehouse background, you might have just thought quietly to yourself "ETL?" (Extract-Transform-Load), and you would be right to think so; in some sense it is an ETL process. Even so, data warehousing is used very little in this type of academic research. The question that comes to mind is: can Big Data be a connection point between academia and the business world, where the two help each other learn new and old methods so that both parties can reach insights quicker? Yes, I actually believe that it could.
With these recurring thoughts, we reached out to one of my former colleagues from my own space physics days, Lars Daldorff, who nowadays is contracted at NASA and works with numerical plasma simulations. We asked ourselves a simple question: what would happen if we took Lars's simulations of the Sun and structured them in such a way that they could be loaded into a Big Data, in-memory environment using out-of-the-box analytical methods? The challenge Lars Daldorff faced in his work at NASA was not producing big data volumes, but analyzing them effectively. To illustrate the point: when we as data warehouse consultants asked "How big is your data?", the reply we got was "How big do you want it to be?".
In the academic numerical world, fast-paced technical development and access to large supercomputer centers have meant that data production can easily be scaled up. However, much of what is produced is of little or no scientific interest; it is simply already-known physics or noise of various sorts.
Basically, this is a needle-in-a-haystack situation: the phenomenon of interest is somewhere in the data, but you usually do not know "where" or even "when" in the data it can be found. At the same time, the visualization and analysis methods usually used are time consuming. As a consequence, the researcher in question (in this case, a physicist) needs to slice the data by making qualified guesses as to "where" and "when" in the data the needle is. A simplified description of the process can be seen in Fig. 1. The problem with this process is that even if you are lucky and happen to find an interesting phenomenon on your first guess, you cannot be sure it is the only phenomenon of interest in the data.
This problem means that the time between gathering the data (from numerical simulations of the Sun, in our case) and gaining insight about it becomes very long. But what if you didn't have to visualize your data in slices? What if we could take the guesswork out of the process? What if it were possible to upload all of your data at once into a platform which would instantly tell you, using standardized methods, where the needle (or needles) can be found? What if, after you have found the needle(s), you could simply export the data of interest and do your full analysis only on what matters?
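As a toy sketch of that idea (plain Python with invented numbers, not the actual SAS workflow): instead of guessing which slices to inspect, score every cell of every time slice against the global background at once and let a simple threshold point out the needles.

```python
import random
import statistics

# Hypothetical stand-in for simulation output: values per (time, cell),
# mostly quiet background noise with two injected "events".
random.seed(1)
data = {(t, c): random.gauss(0.0, 1.0) for t in range(100) for c in range(50)}
data[(42, 7)] += 12.0   # a hidden "needle"
data[(77, 31)] += 15.0  # another one

# Score every point against the global baseline instead of
# hand-picking slices to look at.
values = list(data.values())
mean = statistics.fmean(values)
stdev = statistics.stdev(values)

# Flag points of interest: anything far outside the background noise.
poi = sorted((t, c) for (t, c), v in data.items() if abs(v - mean) > 6 * stdev)
print(poi)  # the two injected events, and nothing else
```

The point is not the threshold itself but the shape of the process: one pass over all the data produces the full list of "where" and "when" candidates, which can then be exported for deeper analysis.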
Why speculate? Let’s do it!
Soon the collaboration was under way and the first analyses started to come out. The phenomenon they wanted to study was the magnetic arches associated with sunspots: how these arches arise, and how they contribute to a considerable increase in the X-ray and ultraviolet radiation from the outer solar atmosphere (and hence into the upper atmosphere of Earth). The phenomenon can be seen in this beautiful YouTube clip that the Heliophysics group at NASA recently released as part of the "SDO" (Solar Dynamics Observatory) project.
There are still many open questions regarding these phenomena today, but their effects are clearly visible in the clip above. When these powerful arches are created, there are speculations that a phenomenon called "magnetic reconnection" occurs. It is this moment in the data that you need to identify, both spatially and in time, that is, both "where?" and "when?".
Fig. 2. Simplified description of the new "from creating/collecting data to insight" process, where we first load the entire data set, automatically analyze and visualize all "Point of Interest" (POI) candidates, and then export the data for deeper analysis by the subject expert.
We loaded the entire data set into SAS Visual Analytics, in our own setup on Microsoft Azure's cloud environment, and helped Lars Daldorff get started with the platform. Then we could start looking for the needle(s) in the haystack. The aim was to automatically identify where and when the phenomenon occurs, for all possible candidates. We wanted to replace the circular process described in Fig. 1 with the linear process described in Fig. 2. This can simplify and speed up how you actually get results and find insights: in this particular case regarding how the Sun works; in your case, perhaps, how your customers behave.
Fig. 3: Simulated data for one of the many "arches" that form at the surface of the Sun, and how we used SAS Visual Analytics to identify the crucial moment, the needle in the haystack, with the help of heat maps and decision trees.
What we see in Fig. 3 is how standardized methods that are widely used in the business world suddenly find use for a completely different type of data. These tools and methods do not care what your data is: the methods for identifying Points of Interest, performing analysis, building visualizations, and creating reports are the same, regardless of whether they are applied to business data or scientific data.
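To give a feel for the heat-map step without SAS Visual Analytics, here is a minimal stand-alone sketch in plain Python (all data, bin sizes, and thresholds are invented for illustration): aggregate the raw cells into coarse time/space bins and let the summary statistic reveal which bin is "hot". The same aggregation works whether the cells hold plasma quantities or sales figures.

```python
import random

# Hypothetical grid of simulated values: background noise plus one
# strong localized event (a stand-in for the "reconnection moment").
random.seed(2)
grid = [[random.gauss(0.0, 1.0) for x in range(40)] for t in range(40)]
grid[25][10] += 20.0  # the event

# Aggregate into coarse 10x10-cell bins, keeping the max per bin,
# which is exactly the kind of summary a heat map visualizes.
BIN = 10
bins = {}
for t, row in enumerate(grid):
    for x, v in enumerate(row):
        key = (t // BIN, x // BIN)
        bins[key] = max(bins.get(key, float("-inf")), v)

# Render a tiny text "heat map": hot bins get a '#'.
for tb in range(4):
    print("".join("#" if bins[(tb, xb)] > 10 else "." for xb in range(4)))
```

The single `#` that appears immediately shows roughly "where" and "when" to zoom in, which is the whole idea behind using a heat map as a first, data-agnostic triage step.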
Something the academic world is generally very good at is experimenting on its data: daring to play with it, exploring it with the mindset "I don't really know what I will find, but I hope it's something interesting!" Or, to quote a famous physicist:
“Experiment is the only means of knowledge at our disposal. Everything else is poetry, imagination.” - Max Planck
This is something we in the business world really could learn from. You do not always need to know in advance what explicit report or analysis the work should result in; there is great value in having all of your data easily accessible at your fingertips, so that you can experiment on it and, through your experiments, reach new insights about your business.
In conclusion, this is only the beginning. We have already submitted these preliminary results of our collaboration to the Joint Statistical Meeting in Seattle (http://www.amstat.org/meetings/jsm/2015/) and received approval to present them during the conference in August 2015. Our hope is that these results can help Lars Daldorff in his research at NASA and that this case can help show the value of explorative analysis of data. There are numerous possibilities for this kind of application, which could potentially help different types of researchers obtain quicker insights and results. More updates will follow! Keep your eyes open!
This work has been made possible by a very good collaboration between Lars Daldorff (contracted researcher at NASA) and Infotrek (a data warehouse consultancy company in Sweden), with contributions by Saiam Mufti and Lars Tynelius and support from SAS Institute Sweden. Thanks also to Laura Fernandes for helping with this text.