Public datasets like Duke University’s DukeMTMC are often used to train, test, and fine-tune machine learning algorithms that make their way into production, sometimes with controversial results. It’s an open secret that biases in these datasets could negatively impact the predictions made by an algorithm, for example causing a facial recognition system to misidentify a person. But a recent study coauthored by researchers at Princeton reveals that computer vision datasets, particularly those containing images of people, present a range of ethical problems.
Generally speaking, the machine learning community now recognizes mitigating the harms associated with datasets as an important goal. But these efforts could be more effective if they were informed by an understanding of how datasets are used in practice, the coauthors of the report say. Their study analyzed nearly 1,000 research papers citing three prominent datasets: DukeMTMC, Labeled Faces in the Wild (LFW), and MS-Celeb-1M. The analysis also covered the datasets’ derivatives and the models trained on them. The top-level finding is that the creation of derivatives and models, together with a lack of clarity around licensing, introduces major ethical concerns.
DukeMTMC, LFW, and MS-Celeb-1M contain up to millions of images curated to train object- and people-recognizing algorithms. DukeMTMC draws from surveillance footage captured on Duke University’s campus in 2014, while LFW has photos of faces scraped from various Yahoo News articles. MS-Celeb-1M, meanwhile, which was released by Microsoft in 2016, comprises roughly 10 million facial photos of about 100,000 different people.
Problematically, two of the datasets — DukeMTMC and MS-Celeb-1M — were used by corporations tied to mass surveillance operations. Worse still, all three contain at least some people who didn’t give their consent to be included, despite Microsoft’s insistence that MS-Celeb-1M featured only “celebrities.”
In response to blowback, the creators of DukeMTMC and MS-Celeb-1M took down their respective datasets, while the University of Massachusetts, Amherst team behind LFW updated its website with a disclaimer prohibiting “commercial applications.” However, according to the Princeton study, these retractions fell short of making the datasets unavailable and actively discouraging their use.
The coauthors found that offshoots of MS-Celeb-1M and DukeMTMC containing the entire original datasets remain publicly accessible. MS-Celeb-1M, while taken down by Microsoft, survives on third-party sites like Academic Torrents. Twenty GitHub repositories host models trained on MS-Celeb-1M. And both MS-Celeb-1M and DukeMTMC have been used in over 120 research papers 18 months after the datasets were retracted.
The point isn’t that ethically problematic datasets shouldn’t be retracted. The critical work that led to retractions is invaluable. The point is that the creators could have handled the retractions better and we need other approaches going forward so retractions aren’t needed.
— Arvind Narayanan (@random_walker) August 9, 2021
The retractions present another challenge, according to the study: a lack of license information. While the DukeMTMC license can be found in GitHub repositories of derivatives, the coauthors were only able to recover the MS-Celeb-1M license — which prohibits the redistribution of the dataset or derivatives — from an archived version of its now-defunct website.
Derivatives and licenses
Creating new datasets from subsets of original datasets can serve a valuable purpose, for example enabling new AI applications. But altering the compositions with annotations and post-processing can lead to unintended consequences, raising responsible use concerns, the Princeton researchers note.
For example, a derivative of DukeMTMC — DukeMTMC-ReID, a “person re-identification benchmark” — has been used in research projects for “ethically dubious” purposes. Multiple derivatives of LFW label the original images with sensitive attributes including race, gender, and attractiveness. SMFRD, a spin-off of LFW, adds face masks to its images — potentially violating the privacy of those who wish to conceal their face. And several derivatives of MS-Celeb-1M align, crop, or “clean” images in a way that might impact certain demographics.
Derivatives, too, expose the limitations of licenses, which are meant to dictate how datasets may be used, derived from, and distributed. MS-Celeb-1M was released under a Microsoft Research license agreement, which specifies that users may “use and modify [the] corpus for the limited purpose of conducting non-commercial research.” However, the legality of using models trained on MS-Celeb-1M data remains unclear. As for DukeMTMC, it was made available under a Creative Commons license, meaning it can be shared and adapted as long as (1) attribution is given, (2) it’s not used for commercial purposes, (3) derivatives are shared under the same license, and (4) no additional restrictions are added to the license. But as the Princeton coauthors note, there are many possible ambiguities in a “non-commercial” designation for a dataset, such as whether and how nonprofits and governments may use the dataset.
To address these and other ethical issues with AI datasets, the coauthors recommend that dataset creators be precise in license language about how datasets can be used and prohibit potentially questionable uses. They also advocate ensuring licenses remain available even if, like in the case of MS-Celeb-1M, the website hosting the dataset becomes unavailable.
Beyond this, the Princeton researchers say that creators should continuously steward a dataset, actively examine how it may be misused, and update its license, documentation, or access restrictions as necessary. They also suggest that dataset creators use “procedural mechanisms” to control derivative creation, for example by requiring explicit permission before anyone creates a derivative.
There’s a much-needed and active line of work on mitigating ML dataset harms. The main implication of our findings for this work is the difficulty of anticipating ethical impacts at dataset creation time. We advocate that datasets should be “stewarded” throughout their lifecycle.
— Arvind Narayanan (@random_walker) August 9, 2021