Thursday, May 12, 2016

Open science gone awry: How 70,000 OKCupid users just had their private data exposed

Earlier today, a pair of individuals allegedly affiliated with Danish universities publicly released a scraped dataset of nearly 70,000 users of the dating website OKCupid (OKC), including their sexual turn-ons, orientation, and plain usernames, and called the whole thing research. You can imagine why plenty of academics (and OKC users) are unhappy with the publication of this data, and an open letter is now being prepared so that the parent institutions can adequately deal with this issue.

If you ask me, the very least they could have done is to anonymize the dataset. But I wouldn't be offended if you called this study quite simply an insult to science... Not only did the authors blatantly ignore research ethics, but they actively tried to undermine the peer-review process. Let's have a look at what went wrong.

The ethics of data acquisition

"OkCupid is an attractive site to gather data from," Emil O. W. Kirkegaard, who identifies himself as a masters student from Aarhus University, Denmark, and Julius D. Bjerrek√¶r, who says he is from the University of Aalborg, also in Denmark, note in their paper "The OKCupid dataset: A very large public dataset of dating site users." The data was collected between November 2014 to March 2015 using a scraper—an automated tool that saves certain parts of a webpage—from random profiles that had answered a high number of OKCupid's (OKC's) multiple-choice questions. These questions include things like whether users ever do drugs (and similar criminal activity), whether they'd like to be tied up during sex, or what's their favorite out of a series of romantic situations.

Presumably, this was done without OKC's permission. Kirkegaard and colleagues went on to collect information such as usernames, age, gender, location, opinions on religion and astrology, social and political views, number of photos, and more. They also collected the users' answers to the 2,600 most popular questions on the site. The collected data was published on the website of the open-access journal, without any attempt to make the data anonymous. There is no aggregation, there is no replacement-of-usernames-with-hashes, nothing. This is detailed demographic information in a context that we know can have dramatic repercussions for subjects. According to the paper, the only reason the dataset did not include profile pictures was that they would take up too much hard-disk space. According to statements by Kirkegaard, usernames were left in plain text so that it would be easier to scrape and add missing information in the future.
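For readers wondering what "replacement-of-usernames-with-hashes" could look like in practice, here is a minimal sketch in Python. It is not anything the authors wrote or used; the records, field names, and the `pseudonymize` helper are all hypothetical. It uses a keyed hash (HMAC) with a secret key rather than a plain hash, because a plain hash of a username can be reversed simply by hashing candidate usernames and comparing the results.

```python
import hashlib
import hmac
import secrets


def pseudonymize(username: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the username as a hex string."""
    return hmac.new(key, username.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records; the field names and values are made up for illustration.
records = [
    {"username": "example_user_1", "age": 29},
    {"username": "example_user_2", "age": 41},
]

# Random secret key that stays private and is never published with the data.
secret_key = secrets.token_bytes(32)

# Build the version intended for release: the plain username is dropped
# and replaced by its keyed digest.
released = [
    {"user_id": pseudonymize(r["username"], secret_key), "age": r["age"]}
    for r in records
]
```

Keep in mind that this only hides the username. Detailed answers, age, and location can still single out a person, so a step like this would be a bare minimum and not real anonymization.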

Information posted to OKC is semi-public: you can discover some profiles with a Google search if you type in a person's username, and see some of the information they've provided, but not all of it (kind of like "basic information" on Facebook or Google+). In order to see more, you need to log into the site. Such semi-public information uploaded to sites like OKC and Facebook can still be sensitive when taken out of context, especially if it can be used to identify individuals. The fact that the data is semi-public does not absolve anyone of ethical responsibility.

Emily Gorcenski, a software engineer with NIH Certification in Human Subjects research, explains that all human subjects research has to follow the Nuremberg Code, which was established to guarantee ethical treatment of subjects. The first rule of the code states that: "Required is the voluntary, well-informed, understanding of the human subject in a full legal capacity." This was clearly not the case in the study in question.

To be clear, OKC users do not automatically consent to third-party psychological research, plain and simple. This study violates the first and most fundamental rule of research ethics (and Danish law, Section III, Article 8 of the EU Data Protection Directive 95/46/EC, just sayin'). In the meantime, an OKC spokesperson told Vox: "This is a clear violation of our terms of service, and the [US] Computer Fraud and Abuse Act, and we're exploring legal options."

A poor scientific contribution

Perhaps the authors had a good reason to collect all this data. Perhaps the ends justify the means...?

Often datasets are released as part of a bigger research initiative. However, here we're looking at a self-contained data release, with the accompanying paper simply presenting a few "example analyses", which actually tell us more about the personality of the authors than the personality of the users whose data has been compromised. One of these "research questions" was: Looking at a user's answers in the questionnaire, can you tell how "smart" they are? And does their "cognitive ability" have anything to do with their religious or political preferences? You know, racist, classist, sexist types of questions.

As Emily Gorcenski points out, human subjects research must meet the guidelines of beneficence and equipoise: the researchers must do no harm; the research must answer a legitimate question; and the research must be of benefit to society. Do the hypotheses here satisfy these requirements? "It should be obvious they do not," says Gorcenski. "The researchers appear not to be asking a legitimate question; indeed, their language in their conclusions seem to indicate that they already chose an answer. Even still, attempting to link cognitive capacity to religious affiliation is fundamentally an eugenic practice."

Conflict of interest and circumventing the peer-review process

So how on earth could such a study even get published? Turns out Kirkegaard submitted his study to an open-access journal called Open Differential Psychology, of which he also happens to be the sole editor-in-chief. Frighteningly, this is not a new practice for him. In fact, of the last 26 papers that got "published" in this journal, Kirkegaard authored or co-authored 13. As Oliver Keyes, a Human-Computer Interaction researcher and programmer for the Wikimedia Foundation, puts it so aptly: "When 50% of your papers are by the editor, you're not an actual journal, you're a blog."

Image retrieved 12 May 2016:

Even worse, it is possible that Kirkegaard abused his powers as editor-in-chief to silence some of the concerns raised by reviewers. Since the reviewing process is open, too, it is easy to verify that most of the concerns above were in fact raised by reviewers. However, as one of the reviewers pointed out: "Any attempt to retroactively anonymize the dataset, after having publicly released it, is a futile attempt to mitigate irreparable harm."

Image retrieved 12 May 2016:

Where to go from here

Open science can be a good thing. The Open Science Framework was created, in part, in response to the traditional scientific gatekeeping of academic publishing. Anyone can publish data to it, with the hope that the freely accessible information will spur innovation and keep scientists accountable for their analyses. And as with YouTube or GitHub, it's up to the users to ensure the integrity of the information, and not the framework.

Brian Nosek, executive director of the Center for Open Science, says the center is currently discussing whether it should intervene in such cases in the future. "This is a tricky question, because we are not the moral truth of what is appropriate to share or not," he says. "That's going to require some follow-up." Even transparent science may need some gatekeeping.

An open letter is now being prepared that asks the Danish parent institutions to take action. If you feel the open letter deserves your support, feel free to sign it.

However, the harm has already been done. Although the authors have since password-protected the dataset, and although the data will be removed should OKC file a legal complaint, dozens of people have already downloaded the original version (if only to check whether their username shows up). Undoubtedly, the dataset will eventually show up in its original form on one or several dubious websites. To quote Emily Gorcenski one last time: "This incident does not represent open science; it represents a security breach, enabled by a lack of countermeasures, and should be handled and treated as such."

Update: Aarhus Universitet has now released an official statement regarding the incident:

Update 2: A list of all usernames in the dataset has now emerged.

Sources: Open Differential Psychology, Motherboard/VICE, Vox, Emily Gorcenski's Blog, Oliver Keyes' Blog.

Picture credit: Twitter snapshots taken by Oliver Keyes.