Preserving Privacy While Sharing Data
Differential privacy can safeguard personal information when data is being shared, but it requires a high level of expertise.
As organizations increasingly seek to exploit data, both for internal use and for sharing with partners in digital ecosystems, they face more laws mandating stronger consumer privacy protections. Unfortunately, traditional approaches to safeguarding confidential information can fail spectacularly, exposing organizations to litigation, regulatory penalties, and reputational risk.
Since the 1920s, statisticians have developed a variety of methods to protect the identities and sensitive details of individuals whose information is collected. But recent experience has shown that even when names, Social Security numbers, and other identifiers are removed, a skilled hacker can take the redacted records, combine them with publicly available information, and reidentify individual records or reveal sensitive information, such as the travel patterns of celebrities or government officials.
The problem, computer scientists have discovered, is that the more information an organization releases, the more likely it is that personally identifiable information can be uncovered, no matter how well those details are protected. It turns out that protecting privacy and publishing accurate and useful data are inherently in opposition.
In an effort to tackle this dilemma, computer scientists have developed a mathematical approach called differential privacy (DP), which works by making that trade-off explicit: To ensure that privacy is protected, some accuracy in the data has to be sacrificed. What’s more, DP gives organizations a way to measure and control the trade-off. Many researchers now regard DP as the gold standard for privacy protection, allowing users to release statistics or create new data sets while controlling the degree to which privacy may be compromised.
How Differential Privacy Works
Invented in 2006, DP works by adding small errors, called statistical noise, either to the underlying data or to the statistical results computed from it. In general, more noise produces more privacy protection — and results that are less accurate. While statistical noise has been used for decades to protect privacy, what makes DP a breakthrough technology is the way it assigns a numerical value to the loss of privacy that occurs each time the information is released. Organizations can control how much statistical noise to add to the data and, as a result, how much accuracy they’re willing to trade to ensure greater privacy.1
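To make the mechanics concrete, the sketch below applies the Laplace mechanism, one of the most common DP building blocks, to a simple count query in Python. The data set and epsilon values are purely illustrative and are not drawn from any of the systems described in this article.

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Return a differentially private count of records matching `predicate`.

    A count query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy. Smaller epsilon = more noise = more privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: commute distances in miles for a small group of residents.
commutes = [12.4, 3.1, 25.0, 8.7, 14.2, 30.5, 6.3]

# A strict privacy budget (epsilon = 0.5) buys strong protection but noisier results;
# a looser budget (epsilon = 5.0) is more accurate but leaks more about individuals.
for eps in (0.5, 5.0):
    noisy = dp_count(commutes, lambda miles: miles > 10, epsilon=eps)
    print(f"epsilon={eps}: noisy count of long commutes = {noisy:.1f}")
```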
The U.S. Census Bureau developed the first data product to use DP in 2008. Called OnTheMap, it provides detailed salary and commuting statistics for different geographical areas. It can be used, for instance, to determine how many people living in, say, Montclair, New Jersey, commute to work in lower Manhattan, along with their average age, earnings, race, and the industry in which they work. To prevent the information from being used to identify a single commuter, where they work, and how much they earn, DP adds noise to the original data by changing the number of people who live and work in each census block.
Since DP’s introduction, the Census Bureau has used it for its release of the 2020 census, and the Internal Revenue Service and the U.S. Department of Education now use DP to publish statistics on college-graduate incomes. More than 20 companies have said they have deployed or are considering using DP, including Apple, Google, Meta, Microsoft, and Uber.
A controversy arose last year when the Census Bureau used DP to protect the census data used by states to draw legislative and congressional districts. All the records in the file were synthetic, generated by a statistical model created and protected using DP. Demographers and social scientists objected to the use of DP, warning that so much noise would be added that the results might be useless. Alabama and 16 other states sued in April 2021 to block the move, saying that DP “would make accurate redistricting at the local level impossible.” But in June 2021, a three-judge panel denied the lawsuit’s key requests, and Alabama dropped its lawsuit in September 2021.2
DP’s ability to adjust the level of privacy protection or loss is both its strength and its weakness. For the first time, privacy practitioners have a way to quantify the risk that comes with the disclosure of confidential data. On the other hand, it forces data owners to confront the inconvenient truth that privacy risk can be adjusted but not eliminated.
This truth has often been ignored by lawmakers on both sides of the Atlantic. Privacy regulations generally aim to safeguard information that’s personally identifiable — anything that makes it possible to isolate the details about an individual — and policy makers typically write these rules in black-and-white terms: Either the information is protected or it isn’t. DP demonstrates that data privacy is much more complicated.
Experience has shown that any data about individuals is potentially identifiable if it is combined with enough additional information. For example, researchers at the University of Texas identified Netflix subscribers by combining public IMDb movie ratings with an “anonymized” data set that Netflix had released of movies its subscribers watched and rated, showing that individual records could be reidentified and linked to specific subscribers. The company was sued under the Video Privacy Protection Act and settled the class-action lawsuit for $9 million.
DP must be applied to all the information that is associated in any way with an individual, not just that which is personally identifiable. This makes it possible to control how much data is released — and how much privacy is lost — based on an organization’s unique needs and what it considers to be its threshold for privacy.
Three Different Approaches to DP
Privacy researchers have developed three distinct models for using DP.
The trusted curator model. An organization that uses confidential data applies noise to the statistical results it publishes for wider consumption. This is the approach used by the Census Bureau to publish privacy-protected information, such as its OnTheMap product.
The trusted curator model can protect both data that is published and data that is used within an organization. In 2018, Uber created a DP system for internal research that included data about riders and drivers, trip logs, and information the company collects to improve the customer experience. DP enabled Uber’s analysts to evaluate the performance of their systems without seeing details about individual riders and their trips.
DP-protected synthetic microdata. This is a second approach available to organizations that apply the trusted curator model. The organization creates a statistical model of the original data and then applies DP to that model to create a new, privacy-protected model, which is then used to generate individual records. Each synthetic microdata record might contain a person’s age, education level, and income; analyzed in aggregate, the records produce statistical results similar to the original data, but no record exactly matches an actual individual.
The advantage of microdata is that it can be distributed or repeatedly re-analyzed with no additional privacy loss. But it is difficult to create accurate microdata records that have more than a few columns of data, and they can’t be readily linked with other record-level data sets because the protected data lacks identifiers such as names or Social Security numbers.
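As a rough illustration of this approach, the sketch below treats the “statistical model” as nothing more than a cross-tabulation of a few categorical columns, adds Laplace noise to the cell counts, and samples synthetic records from the result. The column names, records, and epsilon value are hypothetical, and real systems use far more sophisticated models.

```python
import numpy as np
from collections import Counter

def dp_synthetic_records(records, epsilon, n_synthetic):
    """Build a DP-protected histogram over the records, then sample synthetic
    records from it.

    Each person contributes to exactly one histogram cell, so adding Laplace
    noise with scale 1/epsilon to every cell count satisfies epsilon-DP.
    """
    counts = Counter(records)
    cells = list(counts)
    noisy = np.array([counts[c] for c in cells], dtype=float)
    noisy += np.random.laplace(scale=1.0 / epsilon, size=len(cells))
    noisy = np.clip(noisy, 1e-9, None)       # negative counts make no sense
    probs = noisy / noisy.sum()              # the noisy histogram becomes the model
    draws = np.random.choice(len(cells), size=n_synthetic, p=probs)
    return [cells[i] for i in draws]

# Hypothetical confidential records: (age band, education level, income band).
confidential = [
    ("30-39", "college", "50-75k"),
    ("30-39", "college", "75-100k"),
    ("40-49", "high school", "25-50k"),
    ("20-29", "college", "25-50k"),
]
synthetic = dp_synthetic_records(confidential, epsilon=1.0, n_synthetic=10)
print(synthetic[:3])
```

Because the noisy histogram itself satisfies DP, any number of synthetic records can be drawn from it without further privacy loss, which is the property that makes microdata attractive for repeated analysis.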
The local model. Statistical noise is added to each data record as it is collected and before it’s sent to analysts (either internal or external). Google used this method to produce statistics about users of its Chrome web browser — including information about users’ home pages, visited sites, and the various processes their computers were running — as a way to improve its ability to block malware without collecting sensitive information. But Google eventually abandoned the tool because “there’s just too much noise,” a former Google researcher said at the time. Instead, the company moved to a more complicated approach that combined anonymous mixing and the trusted curator model.
Overall, the trusted curator model works best for organizations like the Census Bureau that are working with data they already have. The local model is attractive for organizations that have previously held off on collecting data because of privacy concerns.
Apple, for example, wanted to learn what text people typed when they used emoji — such as whether people entered “heart” or “love” for the heart emoji — and used the local model to protect the privacy of users. With this method, an organization can say that it’s applying privacy-protecting technology to data before it’s collected.
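One simple way to picture the local model is the classic randomized-response technique, one of the earliest local DP mechanisms. The sketch below is a toy version: the emoji survey framing, parameters, and simulated population are invented for illustration, and the production systems that Google and Apple describe are considerably more elaborate.

```python
import math
import random

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Perturb one person's yes/no answer on their own device before it is sent.

    With probability e^epsilon / (e^epsilon + 1) the true answer is reported;
    otherwise it is flipped. The ratio of report probabilities is bounded by
    e^epsilon, which is the local differential privacy guarantee.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_answer if random.random() < p_truth else not true_answer

def estimate_rate(reports, epsilon):
    """Debias the aggregated noisy reports to estimate the true share of 'yes' answers."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Hypothetical scenario: 10,000 users, 30% of whom truly typed "love" for the heart emoji.
random.seed(0)
truth = [random.random() < 0.30 for _ in range(10_000)]
reports = [randomized_response(t, epsilon=1.0) for t in truth]
print(f"Estimated share who typed 'love': {estimate_rate(reports, epsilon=1.0):.3f}")
```

The analyst never sees any individual’s true answer, only the flipped reports, yet the aggregate estimate remains usable when the population is large enough, which is why the local model suits data an organization would otherwise decline to collect.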
So Is DP Ready for Business?
At this stage, DP is still a young technology and can be used only in limited circumstances, mainly for numerical statistics that rely on confidential data, such as the geographic statistics used in the OnTheMap application. DP doesn’t work well (yet) for protecting text, photos, voice, or video.
Because DP has a steep learning curve, those interested in the technology should start small, with well-defined pilot projects. For instance, a local utility that was asked to share customer delinquency records could provide a DP-protected data set indicating the number of households on each block that are most likely to become delinquent, without identifying which ones. An emergency assistance program could then use the data to narrowly target outreach to the blocks with the greatest risk of delinquency instead of blanketing the entire region.
DP can also be used to create privacy-protected microdata, though this approach is limited to data with only a small number of variables. For instance, Google responded to the pandemic by publishing COVID-19 “Community Mobility Reports,” which showed the number of people moving daily between homes, offices, grocery stores, transit stations, and other locations. It converted the microdata — individual locations recorded as latitude and longitude coordinates (that is, records with two columns) — into six general location categories and used DP to obscure the number of people in each category.
Companies considering DP should begin by consulting with or hiring an expert with advanced academic credentials in computer science or a similar field. (LinkedIn has hired doctoral-level privacy experts to develop its audience engagement statistics.) The most reliable information on the technology is found in highly technical academic papers, and some job postings reflect this by requiring applicants to have published technical papers or developed publicly available DP code. Attempting to use DP now without this kind of expertise is likely to lead to mistakes.
With an expert in DP on hand, an organization is in a better position to evaluate currently available DP tools, both commercial and open source, in order to determine which will best meet the needs of the use case in mind. Companies should ask: Is the technology designed to protect data that is already on hand, or information that is newly collected? If it’s existing data, does it need to protect statistical results, or record-level microdata? What training, educational materials, or support does the vendor provide?
In the near term, DP may still be too complex for most organizations. However, they can improve their privacy protections today by adopting some of the principles underlying the technology, such as adding statistical noise to their data products, even if they lack the ability to precisely measure the actual trade-off between privacy and accuracy.
References
1. While we will not explore the mathematics of DP here, readers who wish to know more are directed to C.M. Bowen and S. Garfinkel, “The Philosophy of Differential Privacy,” Notices of the American Mathematical Society 68, no. 10 (November 2021): 1727-1739; and A. Wood, M. Altman, A. Bembenek, et al., “Differential Privacy: A Primer for a Non-Technical Audience,” Vanderbilt Journal of Entertainment and Technology Law 21, no. 1 (fall 2018): 209-276.
2. For a discussion of the controversy involving the deployment of DP and the 2020 U.S. Census, see S. Garfinkel, “Differential Privacy and the 2020 U.S. Census,” MIT Case Studies in Social and Ethical Responsibilities of Computing (winter 2022), mit-serc.pubpub.org.