Monday, August 29, 2016

UW Data Science for Social Good: Training students to use data-driven research for societal benefit

Data Science for Social Good (DSSG), a summer program at UW's eScience Institute, pairs interdisciplinary student teams with professional data scientists and subject-matter experts to work on social issues that really matter. This year the theme was "Urban Science", so for ten weeks, four teams with students from around the country combined data, tools, and programming to tackle transit system planning, disease control and prevention, tracking of socioeconomic status in developing countries, and sidewalk mapping for accessible route planning. The results are quite impressive.


The 2016 DSSG fellows at eScience (missing Kaicheng Tan)

Modeled after summer programs at the University of Chicago and Georgia Tech, the program at the University of Washington's eScience Institute brought together select graduate and advanced undergraduate students with academic researchers, data scientists, and public stakeholder groups to work on data-intensive research projects over 10 weeks.

This year's projects focused on Urban Science, aiming to understand and extract valuable, actionable information out of data from urban environments across topic areas including public health, sustainable urban planning, crime prevention, education, transportation, and social justice.

Each student was part of a team working on a research project that has concrete relevance to the local community on the theme of Urban Science. Projects involved analysis and visualization of data on a wide range of topics, including public health, sustainable urban planning, environmental protection, disaster response, crime prevention, education, transportation, governance, commerce, and social justice.

During a demo day held at UW's physics and astronomy department, students presented their work to academic researchers, non-profit organizations, the press, and the general public. Here's what each team had to say.

Mining online data for early identification of unsafe food products

The Centers for Disease Control and Prevention estimates that each year in the United States 48 million people experience foodborne illness, 128,000 are hospitalized, and 3,000 die. The estimated economic cost of foodborne illness is more than $15.5 billion annually. The most frustrating part of this is that a tainted shipment of frozen vegetables or the like might remain on the shelves for months before a company has enough evidence to issue a recall. This is true for online purchases, too. And, with all the people leaving reviews on websites such as Amazon, shouldn't it be possible to catch these things before they spread too far and cause greater harm?

That was the question asked by the Unsafe Foods team. The team looked up recent recalls and scraped several thousand reviews from Amazon's systems, in the hope of finding reliable patterns that would indicate the occurrence of foodborne illnesses. This is not as straightforward as you might think, because writing style and customer satisfaction level differ greatly from one reviewer to the next.

They deployed machine learning algorithms and statistical models, but ultimately found that the data was just too imbalanced (a common problem in anomaly detection): there is usually a large number of good reviews, but only a handful of recalls on which the algorithm could be trained. As a result, their algorithm showed high accuracy but low sensitivity: their processes successfully identified reviews relating to recalled products, but couldn't predict those recalls when applied to new data.
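To see how high accuracy and low sensitivity can coexist on imbalanced data, here is a minimal sketch in Python. The counts are invented purely for illustration and are not the team's results; the function names are likewise assumptions.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Share of actual positives that are detected (also called recall)."""
    return tp / (tp + fn)


# Hypothetical counts: 1,000 reviews, only 10 tied to a recalled product.
tp, fn = 2, 8      # recall-related reviews caught / missed
tn, fp = 985, 5    # ordinary reviews correctly passed / wrongly flagged

acc = accuracy(tp, fp, tn, fn)
sens = sensitivity(tp, fn)
print(f"accuracy = {acc:.3f}, sensitivity = {sens:.3f}")
# prints: accuracy = 0.987, sensitivity = 0.200
```

Because ordinary reviews dominate, a model can be right almost all of the time while still missing most of the reviews that matter, which is why accuracy alone is a poor yardstick here.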

Another obstacle they encountered right away was that the government records for recalls were incredibly messy. Who would have thought... According to the students it took a long time to simply join the data from the FDA, which is indexed by a universal product code (UPC), with Amazon's data, which is indexed by an Amazon product ID (ASIN). FDA entries were often ambiguous, wrong, or simply missing.
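A join of this kind might look roughly like the sketch below. It assumes a lookup table from ASIN to UPC is available from some source (the post does not say how the team obtained one), and the field names, helper names, and sample records are all made up for illustration.

```python
import re


def normalize_upc(raw):
    """Return a 12-digit UPC-A string, or None if the value is unusable."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))   # drop spaces, dashes, stray text
    if not digits or len(digits) > 12:
        return None
    return digits.zfill(12)                # restore dropped leading zeros


def join_recalls(fda_recalls, asin_to_upc):
    """Map each ASIN to the recall records that share its UPC."""
    recalls_by_upc = {}
    for rec in fda_recalls:
        upc = normalize_upc(rec.get("upc"))
        if upc is not None:                # skip missing or malformed codes
            recalls_by_upc.setdefault(upc, []).append(rec)
    matches = {}
    for asin, raw_upc in asin_to_upc.items():
        upc = normalize_upc(raw_upc)
        if upc in recalls_by_upc:
            matches[asin] = recalls_by_upc[upc]
    return matches


# Made-up records, for illustration only.
fda_recalls = [
    {"upc": "0 12345 67890 5", "reason": "possible Listeria"},
    {"upc": None, "reason": "undeclared allergen"},
]
asin_to_upc = {"B000EXAMPLE": "12345678905", "B000OTHER01": "099999999999"}

matched = join_recalls(fda_recalls, asin_to_upc)
print(sorted(matched))
```

Normalizing both sides before matching is the important step: one code written with spaces and another that lost its leading zero only line up once they are reduced to the same 12-digit form. Records with no usable code, like the second recall above, simply drop out of the join.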

Although the end product needs some work, the concept and tools they presented seemed very sound and robust. In fact, it is bewildering that Amazon itself has not produced anything of this quality, as they not only have access to much more data but are also well-known for all kinds of big data analysis.

Use of ORCA data for improved transit system planning and operation

Seattle's booming population has led to serious traffic and transportation issues. With the introduction of ORCA (One Regional Card for All), a common electronic fare payment system, the city hoped that the information gathered whenever someone boards a bus, streetcar, or subway would reveal travel behavior that could be used to improve regional transportation system planning and decision making. However, despite this torrent of data, little of it has been put to good use.

One group thus examined nine weeks' worth of ORCA card data, which consisted of 21 million individual data points that include location, route number, and time of boarding. The group also used data from bus sensors, which measure the number of footsteps at a stop to estimate the number of people boarding a bus. These records were then linked to vehicle location data (AVL) to determine where those boardings took place. In addition, for about half of those trips they estimated where the traveler exited the bus and, if the traveler transferred, how long that transfer took.
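One plausible way to link a fare-card tap to a vehicle position is to pick the location ping from the same vehicle that is closest in time. The sketch below shows that idea only; the field names (vehicle_id, time, lat, lon), the 120-second cutoff, and the sample values are assumptions, not the team's actual pipeline.

```python
from bisect import bisect_left
from collections import defaultdict


def build_avl_index(avl_pings):
    """Group AVL pings by vehicle and sort each group by time."""
    index = defaultdict(list)
    for ping in avl_pings:
        index[ping["vehicle_id"]].append((ping["time"], ping["lat"], ping["lon"]))
    for pings in index.values():
        pings.sort()
    return index


def locate_boarding(tap, index, max_gap=120):
    """Return (lat, lon) of the ping closest in time to the tap, or None."""
    pings = index.get(tap["vehicle_id"], [])
    if not pings:
        return None
    times = [p[0] for p in pings]
    i = bisect_left(times, tap["time"])
    # Only the pings just before and just after the tap can be the nearest.
    candidates = pings[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda p: abs(p[0] - tap["time"]))
    if abs(best[0] - tap["time"]) > max_gap:
        return None  # too far apart in time to trust the match
    return best[1], best[2]


# Made-up sample data; times are seconds since midnight.
avl = [
    {"vehicle_id": 7, "time": 100, "lat": 47.60, "lon": -122.33},
    {"vehicle_id": 7, "time": 160, "lat": 47.61, "lon": -122.32},
]
index = build_avl_index(avl)
near = locate_boarding({"vehicle_id": 7, "time": 150}, index)
far = locate_boarding({"vehicle_id": 7, "time": 1000}, index)
print(near, far)
```

With 21 million records, a real implementation would precompute the sorted time lists instead of rebuilding them on every lookup, but the matching logic would stay the same.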

"We’ve created a suite of applications in an integrated dashboard to shed some light on the data," said DSSG fellow Victoria Sass. The applications can visualize different subsets of the data, and allow users to look at patterns in ridership, overcrowding, and even the number of people using ORCA versus other payment methods. Differences and relationships between the numbers provide powerful insight into who’s riding where and when, and potentially into how to prevent problems like overcrowded buses. The city could also tap into this data to find out which companies are meeting certain "commute reduction" goals, like persuading X percent of employees to use transit. "These applications we’ve created offered a lot of insight into what could be done with this data, but there’s a lot more to be done."

Global Open Sidewalks: Creating a shared open data layer and an OpenStreetMap data standard for sidewalks

On-demand directions from Google or Apple are a godsend to those of us lacking basic navigational skills, but a major deficiency is an almost total lack of accommodations for people with disabilities. For instance, one route might be shorter but take the user along sloped or ill-maintained sidewalks with no curb cuts and no marked crosswalks. That’s a serious obstacle to someone using a walker or with limited sight, and the ability to prefer other routes would be invaluable.

The OpenSidewalks team decided to tackle this problem, but soon found it was even more difficult than they expected. OpenStreetMap allows for annotations such as those they wanted to add, but the standard edit tools are not suited to them. Municipalities must track their own sidewalks for maintenance purposes, and do, but that data (or at least the data the team had access to) was a total mess. The USGS maintains slope data, but it’s not easy to merge with the rest.

Therefore, the team created a custom editing app for OSM and established a schema for tagging the features they deemed most important: curb cuts, crossings, sidewalks and associated attributes like width, condition and so on. They presented their work at the State of the Map conference and later ran a “Mapathon” to test the effectiveness of their toolset; in a day, their volunteers annotated much of the University District.
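To give a feel for what such tags make possible, here is a rough sketch of how a routing tool might screen tagged features. The tag keys are modeled loosely on common OpenStreetMap keys (kerb, width, incline); the team's actual schema is not given here, and the thresholds and function names are purely hypothetical.

```python
def parse_number(value):
    """Parse strings such as '1.5' or '6%'; return None if not numeric."""
    try:
        return float(str(value).rstrip("%"))
    except ValueError:
        return None


def is_accessible(tags, min_width=1.0, max_incline=8.0):
    """Rough check of one tagged feature against hypothetical thresholds.

    min_width is in meters, max_incline in percent. Missing tags are
    treated as unknown and do not fail the check.
    """
    if tags.get("kerb") == "raised":
        return False                       # no curb cut at this point
    width = parse_number(tags.get("width"))
    if width is not None and width < min_width:
        return False                       # too narrow
    incline = parse_number(tags.get("incline"))
    if incline is not None and abs(incline) > max_incline:
        return False                       # too steep, uphill or downhill
    return True


# Made-up features for illustration.
flat_sidewalk = {"highway": "footway", "footway": "sidewalk",
                 "width": "1.5", "incline": "3%"}
steep_sidewalk = {"highway": "footway", "footway": "sidewalk",
                  "width": "1.5", "incline": "12%"}
raised_curb = {"barrier": "kerb", "kerb": "raised"}

results = [is_accessible(t) for t in (flat_sidewalk, steep_sidewalk, raised_curb)]
print(results)  # prints: [True, False, False]
```

Treating missing tags as passable is a design choice worth questioning: a cautious router might instead penalize unmapped segments, which is one more reason volunteer annotation matters.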

With luck, the editor and project will gain a bit of steam and friendly mappers around the country will start piecing together areas where this kind of effort is most needed.

CrowdSensing Census: A heterogeneous-based tool for estimating poverty

Household surveys and censuses, periodically conducted by National Statistical Institutes and the like, collect information describing the social and economic well-being of a nation, as well as the relative prosperity of its different regions. Such data is then used by agencies and governments to identify the areas most in need of intervention, for example, in the form of policies and programs that aim to improve the plight of their citizens. To provide the most value, socio-economic data needs to be up to date, and it ought to be possible to disaggregate it at the national and regional levels of granularity, and in between. However, due to the high cost associated with the data collection process, many developing countries conduct such surveys very infrequently and include only a rather small sample of the population, thus failing to accurately capture the current socio-economic status of the country’s population.

The CrowdSensing team thus aimed to develop a tool that could give you a general idea of important measures like poverty without going door to door and asking. Such a tool could be deployed by developing countries that can’t afford a manual census. The team gathered data from a whole range of sources: “points of interest” from OpenStreetMap, such as bike racks, bars, universities, and banks; call detail records taken from mobile providers; and an analysis of street layouts to determine their convenience and accessibility to other areas and resources.

However, the project just turned out to be too large in scope to be solved in a mere ten weeks. Many correlations were found between measurements they extracted and socioeconomic status, but ultimately there was just too much to sift through, too many possible variables to explore. Why were more bars indicative of a nicer neighborhood in Milan, but not Mexico City? Should having a radial layout to the city change how accessibility is scored? Should transient cell signals be downplayed if there’s a university nearby? Perhaps a follow-up project can answer these questions.


Edit: An earlier version of this post accidentally mentioned some of last year's projects in the introduction.