Friday, October 28, 2016

Highlights from the Moore/Sloan Data Science Environments Summit (MSDSE 2016)

The Moore/Sloan Data Science Environments (MSDSE) program is an ongoing effort to enhance data-driven discovery by supporting cross-disciplinary academic data scientists at research institutions across the nation. Halfway through the program, researchers from the University of Washington (UW), New York University (NYU), and University of California, Berkeley (UCB) came together to present their latest research and discuss the potential future of data science at a three-day summit.


MSDSE Working Groups

In an effort to identify and tackle important issues in data science, the MSDSE program established a set of working groups at the three universities.

To educate people to do data science "correctly", the Education Working Group discussed ways to bring data science courses into the curricula of existing majors at universities. The working group has already created a variety of opportunities for beginner, intermediate, and advanced students to learn data science through bootcamps, tutorials, workshops, intensive courses, and so-called "hackathons" (although the latter was perceived as too hostile a term for most folks). Going forward, the goal is to inject data science into existing courses and to launch cross-university programs.

Whereas the Software Working Group offered hands-on workshops on platforms such as the Open Science Framework and the Journal of Open Source Software, the Careers Working Group discussed the future of researchers with software-heavy and data-intensive job descriptions. The group discussed MSDSE's vision of dual mentoring and joint appointments for tenure-track faculty, but also considered alternative, non-tenure-track career paths such as Professors of Practice and Research Software Engineers, and how these would fit into the current academic ecosystem.

Perhaps the most stimulating discussions were held in the Reproducibility Working Group, which tackled the question of how to make science reproducible (again). You may have heard of the Reproducibility Crisis: in several disciplines, such as the behavioral and cognitive sciences, it is often not possible to completely reproduce an experiment, simply because doing so would entail going back in time and recreating the exact experimental conditions. The same applies to the data-intensive sciences, where most projects have become so complex and involve so many researchers that it is often not feasible to take the steps necessary to make the research truly reproducible; at least, that is what some researchers claim. The truth is that in today's research environment there is little to no incentive to spend more time on documenting code and logging results than is absolutely necessary to get a paper out or to secure the next grant.

However, all working groups agreed that reproducibility should be more than just hygiene around the data: It should come as part of the "operating system" of science. It should be a given.

Creating such an environment might be difficult in the short term, because it will require a cultural shift in how we do science. Open Science is seen as integral to this idea, but it will require leading market forces, such as journals and conferences, to enforce these best practices. Several efforts are already under way: ACM journals now award badges to papers whose findings are reproducible, and more and more conferences are beginning to hand out prizes for the most reproducible paper. If more journals and conferences required authors to publish their data and code, that alone could go a long way. In addition, there is great technology being developed to support these goals. For instance, ImpactStory is an online platform that tracks your impact in the open science community by awarding badges for reproducibility, openness, and social media impact. ReproZip is a framework that lets you package your data, methods, and environment (including all software dependencies) into a self-contained bundle that can be unpacked into, for example, a Docker container, so you can store, share, and re-run entire experiments, no matter where you are.

All these tools are built around the one truth of scientific reproducibility. In order for something to be reproducible, you need three ingredients: the same data, the same methods, and the same environment.
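To make that concrete, here is a minimal sketch (mine, not something presented at the summit) of what recording those three ingredients for a single analysis run might look like in Python. The file names and manifest layout are invented, and the git and pip calls assume those tools are available on the PATH.

```python
# Minimal sketch: record the three ingredients of a reproducible run.
# File names and the manifest layout are hypothetical examples.
import hashlib
import json
import platform
import subprocess
import sys

def sha256(path):
    """Fingerprint an input file so others can verify they have the same data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    # 1. Same data: a checksum of every input file.
    "data": {"observations.csv": sha256("observations.csv")},
    # 2. Same methods: the exact version of the analysis code.
    "methods": {
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip()
    },
    # 3. Same environment: interpreter, OS, and installed packages.
    "environment": {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": subprocess.run(
            ["pip", "freeze"], capture_output=True, text=True
        ).stdout.splitlines(),
    },
}

with open("reproducibility_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Tools like ReproZip automate exactly this kind of bookkeeping, and go further by capturing the environment itself rather than merely describing it.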


Lightning Talks

To accommodate the large number of attendees, the preferred presentation format was the "lightning talk", which challenges researchers to present their work in just a few minutes. A selection of these talks is summarized below.

Fernando Perez (UCB) detailed new developments at Project Jupyter, including the next generation of the Jupyter notebook, which is being rebranded as JupyterLab. Matthias Bussonnier (UCB) showed some exciting possibilities for integrating Jupyter notebooks with a whole range of programming languages, such as C, R, Fortran, and whatever your heart desires, all in the same notebook. Also worth a look is nbdime, a tool for diffing and merging Jupyter notebooks under version control. Binder lets you turn a GitHub repo into a collection of interactive Jupyter notebooks. Julia is a high-level, high-performance dynamic language for technical computing. PySurfer offers various ways to visualize cortical surface representations of neuroimaging data. Whole Tale allows you to examine, transform, and republish research data that was used in a specific article. Cesium is a machine learning framework that specializes in time-series data.
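To give a flavor of what mixing languages in one notebook can look like, here is a minimal sketch (not from the talk) using IPython cell magics. Each blank-line-separated chunk is meant to be run as its own notebook cell; the %%bash magic ships with IPython, while the %%R magic assumes the rpy2 extension is installed. The variable names are invented.

```python
# Cell 1 (Python kernel): generate some data to share with other languages.
import numpy as np
samples = np.random.normal(size=1000)

# Cell 2: enable R cells (assumes the rpy2 package is installed).
%load_ext rpy2.ipython

%%R -i samples
# Cell 3 (R): the -i flag pulls the Python array into R; summarize it here.
summary(samples)

%%bash
# Cell 4 (bash): shell commands run directly from the notebook.
ls *.ipynb
```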

Apart from software, there were a number of exciting scientific contributions. Bob Rehder pointed out that Dynamic Bayesian Networks (DBNs), the de facto generative model for studying real-world causal systems, actually do not apply to most real-world problems, because the real world contains causal cycles, which the theory does not allow; his research thus focuses on finding a new normative model that does apply. Alexandra Paxton (UCB) talked about how big data can help cognitive scientists study cognition and behavior as they unfold in the real world. Ariel Rokem (UW) talked about building systems for analyzing large neuroimaging datasets, one such system being DiPy, a free and open-source software project for diffusion magnetic resonance imaging (dMRI) analysis. Todd Gureckis (NYU) contemplated ways to ask good scientific questions, and ultimately suggested that by building machines that are able to ask human-like questions, we might learn something about natural intelligence that current AI cannot capture. Brenden Lake (NYU) pointed out that humans are still much faster at learning representations than artificial learning systems, suggesting that there might be mutual benefits in studying data science in relation to the cognitive sciences, and vice versa. Dani Ushizima (BIDS) recapped this year's inaugural ImageXD workshop and planned for the next event. Finally, Micaela Parker and Sarah Stone (UW) reported on this year's very successful iteration of the Data Science for Social Good (DSSG) program at UW.


Wednesday, October 26 at NYU

On the last day of the summit, the group traveled from Mohonk Mountain Resort to the Kimmel Center at NYU in New York City, where they were joined by data science representatives from academia, industry, and government.

The first item on the agenda was a Careers Panel discussing the future of data science. Among others, Ed Lazowska (UW) explained the idea of pi-shaped faculty to the audience: modern researchers who combine a breadth of knowledge across a wide range of scientific domains not only with deep knowledge in their own domain, but also with a second "leg" of deep methodological knowledge. It is the purpose and goal of the MSDSE program to foster both the methodological and the domain skills of today's IGERT graduate students and Moore/Sloan postdocs so that they can prosper as pi-shaped faculty in the future. Jennifer Chayes (Microsoft Research) applauded the effort, but challenged the concept by jokingly observing that she had grown more than two legs over her many years of research experience in academia and industry, effectively rendering her more of an "octopus-shaped" scientist. Karthik Ram (UCB) lamented the daily challenges of researchers whose job descriptions fall outside the traditional academic positions. Together with Simon Hettrick (SSI), he reiterated the need for more job security for non-tenure-track researchers, which Laura Norén (NYU) backed up with survey data collected as part of her graphic sociology research at NYU. Lauren Ponisio (UCB, UC Riverside) represented the voice of postdocs, pointing out how difficult it is to realize the proposed concepts in an environment that expects postdocs to continue previously established research on short time frames rather than conduct more independent work, a situation that traps many in the so-called "postdoc purgatory".

Several lightning talks then touched on various science domains (such as astronomy, nematology, and sociology) and introduced open-source software tools that aid in their research. Mario Juric (UW) talked about the data analysis challenges posed by the Large Synoptic Survey Telescope, the largest optical survey of the sky ever to be built, which is estimated to produce ~1TB of data per day. Holly Bik (UC Riverside/Sloan) presented Phinch, an interactive, exploratory data visualization framework for biological data such as genes, proteins, and microbial species. Nick Adams (UC Berkeley) pointed out that AI requires social scientists and crowds. Alyssa Goodman (Harvard) talked about glue, a Python library for exploring relationships within and among related datasets, in which data from different datasets can be overlaid in an intuitive way and selections in any graph propagate to all others. Jake VanderPlas (CalPoly/Sloan/Moore) presented Altair, a statistical visualization library that lets you describe what you want plotted (declarative) rather than how to plot it (imperative), and lets the software figure out the rest.
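To give a flavor of that declarative style, here is a minimal sketch (not from the talk; the dataset and column names are invented): you map data columns to visual encodings, and Altair works out the axes, scales, and legend on its own.

```python
import altair as alt
import pandas as pd

# A tiny invented dataset: one row per car.
cars = pd.DataFrame({
    "horsepower": [130, 165, 150, 95, 110],
    "mpg": [18.0, 15.0, 16.0, 24.0, 22.5],
    "origin": ["USA", "USA", "USA", "Japan", "Europe"],
})

# Declarative specification: state *what* to encode, not *how* to draw it.
# Axes, scales, and the legend are inferred from the encodings below.
chart = alt.Chart(cars).mark_point().encode(
    x="horsepower",   # x position encodes horsepower
    y="mpg",          # y position encodes fuel efficiency
    color="origin",   # color encodes country of origin
)

chart  # in a Jupyter notebook, this line renders the chart
```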

The first keynote was held by Tracy Teal, co-founder and Executive Director of Data Carpentry. Modeled after Software Carpentry, her organization offers workshops that teach data skills to students and researchers across the nation. In the last year alone, Data Carpentry has held close to 100 workshops, and it currently counts over 700 volunteers worldwide who develop and teach lessons. Attendee satisfaction and workshop quality are usually measured with surveys handed out right at the end of a workshop, and judging from these, Data Carpentry seems to be doing extremely well. However, the steering committee is also thinking about assessing the effectiveness of its workshops over the longer term.

The second and last keynote of the day was given by Chris Ré, Assistant Professor of Computer Science at Stanford University, who presented computational tools for handling so-called "dark data". Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images; it lacks structure and is therefore essentially unprocessable by existing software. His software package DeepDive helps bring dark data to light by creating structured data (SQL tables) from unstructured information (text documents) and integrating it with an existing structured database. The follow-up project to DeepDive, called Snorkel, is intended to be a more lightweight framework with continued developer support.

All that was left to do was to celebrate a successful #2016DSSummit on the rooftop of a NYC skyscraper:

Picture courtesy of Prof. Juliana Freire at NYU.