Hi there! I am Murali Sastry, a student pursuing my Master’s degree in Analytics at Capella University (CU). I love data and really enjoy digging into it to find valuable insights. If you’re a student in an analytics program like I am, let me give you a bit of advice. In addition to your coursework, look for opportunities to apply what you learn, either in a real-world setting by pursuing an internship, or with real-world data by participating in one of the many analytics competitions. In this blog post, I want to share the experience I had in my first data analytics competition. It has already helped me in many ways as I continue my studies and my career in analytics.
Analytics competition provides real-world opportunity
Fortunately for me, CU staff are well connected to industry events, especially in the Twin Cities Metropolitan Area (TCMA). Recently, our department chair sent students a flyer about an analytics competition. The competition was jointly sponsored by the Midwest Undergraduate Data Analytics Competition (MUDAC), an institution with a history of conducting analytics competitions in the academic community; Social Data Science (SDS), a non-profit society focused on applying analytic insights to the public sector; and MinneAnalytics, a seven-thousand-member non-profit organization. The competition was part of these organizations’ commitment to advancing the analytics professional community.
For this event, MinneAnalytics provided all the funding and the judges (eighty of them!). SDS worked with the Twin Cities Metropolitan Council to provide the three datasets (which exceeded six gigabytes), and MUDAC shared its five years of experience in hosting analytics competitions among academic institutions. Per MinneAnalytics, the competition focused on “Providing an experiential learning opportunity during which students grapple with real world data and research issues, and where the outcomes they determine matter.”
The competition was a great way to sink my teeth into real-world data, and what better opportunity could one have to experience real-world data than in competition, right? MinneAnalytics supported the event by bringing together students, employers, and judges (i.e., analytics professionals, budding or otherwise) in one place to showcase student presentations and insights. It was fabulous!
I worked as part of a team with fellow student Jared Mehlhaff and our faculty advisor, Prof. Stefanie Reay, who was willing to guide us through this project. The insights and recommendations that came from the competition would be utilized by TCMA authorities to positively impact communities.
Using analytics to solve a local community problem (specifically, lake water quality and its impact on property values) put things in perspective for us and motivated Jared and me to work together on this project. The funny thing is that our team lives in three different states (CT, MN, and NC), which meant we had to manage communication effectively. In fact, I met my teammate in person for the first time just before the day of the event.
Per MinneAnalytics, the primary goal of the water quality project was “to provide a thorough and insightful investigation of the relationship between various property characteristics and the water quality of lakes within the seven county Twin Cities region”.
Based on the problem description and the follow-up information session held by the sponsors, we understood that the task was to explore and present relationships between variables and their impact on each other, rather than to predict a target variable from predictor variables.
Event organizers provided three datasets: fourteen years of MetroGIS property data, thirty-four years of lake water quality data, and water proximity data. After reviewing the work ahead for problem definition and realizing that the deadline was fast approaching, we focused on simplifying these massive datasets so that we clearly understood the data before mining it with analytics tools. We used both SAS Enterprise Guide and R to gain insights. The datasets were 250 to 850 megabytes (MB) each and contained over seventy variables. We also had to address missing data: in the water quality dataset, for example, missing data ranged from 15% to 90%.
Our strategy was to stay on the island of simplicity, for ease of understanding and ease of use, and to select the critical variables that provided the biggest value for analysis. We linked the problem to the available data and selected variables that made sense from the property dataset (e.g., home price, number of bedrooms, and whether the property is lakefront) and from the lake water quality dataset (e.g., seasonal lake grade, physical condition, recreational suitability, Total Phosphorus, and Secchi Depth). For a deeper understanding of the lake water quality dataset, we needed a single variable that defined overall water quality for a lake, instead of the five separate water quality variables.
Further literature research (per Carlson) and communication with the Minnesota Pollution Control Agency (MPCA) revealed that one metric, the Trophic State Index (TSI), defines the overall health of a lake. This discovery was crucial for us, as TSI serves as a common denominator for lake water quality and takes into account Secchi Depth, Chlorophyll-a, and Total Phosphorus, which are critical parameters. TSI can be used to compare the water quality of lakes as well as of watersheds.
We calculated TSI for the sample lakes and ranked them from best (moderately clear) to worst (very green) within the sample; our ranking was identical to the ranking of lakes on the MPCA website. A very green lake indicates high algal bloom, shallower Secchi Depth, and potentially higher Total Phosphorus. TSI is rated from 0 to 100, with lower numbers indicating better water quality and higher numbers worse. We classified lakes with TSI <52 as “Moderately clear”, those with TSI of 53-65 as “Green”, and those with TSI >65 as “Very green” (Figure 1).
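For readers who want to try this themselves, here is a minimal Python sketch of a TSI calculation. It uses the commonly cited Carlson equations for the three component indices. Averaging the three components into one overall value is an assumption made for illustration and may not match exactly what our team or MPCA did. The function names are hypothetical, and because the classes above leave a gap between 52 and 53, the sketch assumes that values below 53 count as “Moderately clear”.

```python
import math


def tsi_secchi(depth_m):
    """Carlson TSI from Secchi Depth, in meters."""
    return 60.0 - 14.41 * math.log(depth_m)


def tsi_chlorophyll(chl_ug_per_l):
    """Carlson TSI from Chlorophyll-a, in micrograms per liter."""
    return 9.81 * math.log(chl_ug_per_l) + 30.6


def tsi_phosphorus(tp_ug_per_l):
    """Carlson TSI from Total Phosphorus, in micrograms per liter."""
    return 14.42 * math.log(tp_ug_per_l) + 4.15


def tsi_overall(depth_m, chl_ug_per_l, tp_ug_per_l):
    """Assumption: overall TSI is the simple mean of the three components."""
    parts = [
        tsi_secchi(depth_m),
        tsi_chlorophyll(chl_ug_per_l),
        tsi_phosphorus(tp_ug_per_l),
    ]
    return sum(parts) / len(parts)


def classify_lake(tsi):
    """Map a TSI value to the three classes used in this post.

    Assumption: the 52-53 gap in the stated classes is closed by
    treating everything below 53 as "Moderately clear".
    """
    if tsi < 53:
        return "Moderately clear"
    if tsi <= 65:
        return "Green"
    return "Very green"
```

Calling `classify_lake(tsi_overall(depth, chl, tp))` for each lake and sorting by the TSI value gives a best-to-worst ranking of the kind described above.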
Out of the three hundred thirty-two lakes in the TCMA, we took a stratified sample of forty-four lakes. The sample covered the 2002-15 period to match the property dataset, included all seven counties in the TCMA, and included at least five watersheds. For the presentation, we focused on two lakes and two watersheds, which we also used to characterize the property datasets. TSI gave us a common denominator for relating year-over-year property values to water quality trends through the years.
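A stratified sample of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure our team ran in SAS Enterprise Guide and R: the function name, the record layout (one dictionary per lake), and the choice of drawing a fixed number of lakes per stratum are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(lakes, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` lakes at random from each stratum.

    lakes:       list of dicts, one per lake (hypothetical layout)
    stratum_key: dict key that defines the strata, e.g. "county"
    per_stratum: assumed fixed number of lakes drawn per stratum
    seed:        fixed seed so the draw is reproducible
    """
    rng = random.Random(seed)

    # Group the lakes by stratum.
    groups = defaultdict(list)
    for lake in lakes:
        groups[lake[stratum_key]].append(lake)

    # Sample within each stratum so every stratum is represented.
    sample = []
    for key in sorted(groups):
        members = groups[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Stratifying by county guarantees that each county appears in the sample. A watershed requirement could be checked afterwards on the drawn sample.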
Figure 2 demonstrates the water quality TSI trend for a prominent watershed in TCMA and shows the range of TSI amongst 3 lakes that are in the sample for this watershed.
“So what?”, would be an obvious response, right? Let us see…
Our team used SAS Enterprise Guide and R for data manipulation and data analysis. We tried several statistical techniques, including linear regression, to investigate whether there were relationships between variables and, where there was a relationship, how strong or weak it was. We also checked the data for normality using normal probability plots.
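For anyone unfamiliar with normal probability plots, the idea can be sketched in Python as follows. This is a minimal illustration rather than our actual SAS or R code. The function name is hypothetical, and the plotting positions (i - 0.375) / (n + 0.25) are one common convention that I am assuming here; statistical packages may use a different one.

```python
from statistics import NormalDist


def normal_probability_points(values):
    """Return (theoretical quantile, observed value) pairs.

    If the data are roughly normal, plotting these pairs gives
    points that fall close to a straight line.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1

    points = []
    for i, value in enumerate(ordered, start=1):
        # Assumed plotting position for the i-th smallest value.
        p = (i - 0.375) / (n + 0.25)
        points.append((std_normal.inv_cdf(p), value))
    return points
```

Strong curvature or heavy tails in the plotted points suggest that the data depart from normality.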
Lakes with high TSI were classified as “Severe Algal Bloom” (physical condition) and “No Aesthetics Possible” (recreational suitability), and would potentially receive an F (or a 0) for the seasonal lake grade.
For the property dataset, we collated median property values per zip code and computed the change in value (delta) from one year to the next. Jared, my teammate, worked tirelessly on these huge property datasets to clearly understand and depict the property data and insights. (I could not have asked for a better teammate or a better faculty advisor for this crucial project. I am grateful to both of them.)
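The median-and-delta step can be sketched in Python like this. It is a simplified illustration, not Jared's actual code: the function names and the (zip code, year, value) record layout are assumptions, and the delta here is the plain difference between consecutive years.

```python
from collections import defaultdict
from statistics import median


def median_by_zip_year(records):
    """Median property value per (zip code, year).

    records: iterable of (zip_code, year, value) tuples (assumed layout)
    """
    buckets = defaultdict(list)
    for zip_code, year, value in records:
        buckets[(zip_code, year)].append(value)
    return {key: median(values) for key, values in buckets.items()}


def year_over_year_delta(medians):
    """Change in median value from the previous year, per zip code.

    Years with no data for the preceding year are skipped.
    """
    deltas = {}
    for (zip_code, year), value in medians.items():
        previous = medians.get((zip_code, year - 1))
        if previous is not None:
            deltas[(zip_code, year)] = value - previous
    return deltas
```

Using the median rather than the mean keeps a handful of very expensive properties from dominating a zip code's value.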
Can we show that water quality has an influence on property values?
Does water quality impact properties closer to the lake than those that are farther away? Does anybody want to guess? These were the questions we were trying to answer and solve.
So what exactly does this chart tell us?
Figure 3 shows that, as TSI increases, there is a steeper slope for those properties closer to the lake. This means that properties closer to the lake are impacted more by the water quality than those that are farther away. This is based on a Pearson correlation coefficient (r) of ~0.4. Since the share of variation explained is r-squared, the model explains only about 16% of the variation in property values from water quality (TSI). This regression therefore needs to be used with caution when predicting property values, as low r or r-squared values limit prediction precision.
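To make the relationship between slope, r, and r-squared concrete, here is a small Python sketch for a simple linear regression of property value on TSI. The function names are hypothetical and the code is an illustration of the standard formulas, not the model our team fit. In a simple linear regression, the share of variation explained is the square of r, so an r of 0.4 corresponds to an r-squared of 0.16.

```python
import math


def _centered_sums(xs, ys):
    """Return (sum of xy, sum of xx, sum of yy) deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Least-squares slope of y on x (e.g., property value on TSI)."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx


def variance_explained(r):
    """Share of variation explained in a simple linear regression."""
    return r * r
```

Computing `ols_slope` separately for properties near the lake and for those farther away, each against the same TSI values, is one way to compare how steeply each group responds to water quality.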
Why is it important to participate in such competitions for Analytics students?
From our team’s perspective, this competition gave us an opportunity to:
- Work with real-world data and potentially help the communities around us.
- Showcase our data understanding and insights with several teams of judges and receive their feedback.
- Review various analytics approaches from different teams addressing the same problem, helping us expand our thinking.
- Conduct a live interview through presentation of insights from data with many potential employers in the audience.
- Enjoy the value of teamwork and use our strengths to give back to society and experience analytics learning with real data.
- Network with many great analytics professionals.
The competition provided me with one of the best days of my life considering the effort, results, presentations, and an opportunity to network with hundreds of brilliant analytics minds in the business all in one place.
If you were to ask me, “would you consider going to another analytics competition?” I would certainly say yes.
Competitions like these provide unique opportunities to understand the same business problem(s) from other teams’ perspectives and to network with analytics professionals.
I had fun sharing my experience. I hope it inspires a lot more students to pursue the analytics profession to impact our societies in a positive way.
The world needs more analytics professionals to draw insights from the massive datasets all around us, and to move us from a “data rich, information poor” state to “information and decision bound” strategies that use data “to showcase a better society than we found for generations to come.” I hope you are, like me, working to join this field and make a difference.
I would like to express my heartfelt gratitude to CU, MUDAC, SDS, MinneAnalytics, and the judges for giving our team the opportunity to participate in the event.