[Blog] Using big data for research and policy analysis: how to deal with legal and ethical concerns?

Published 29 January 2020
By Rebecca Mignot-Mahdavi

*Asser researcher Rebecca Mignot-Mahdavi to take part in a webinar and discuss the challenges posed by the emerging use of big data for socio-legal research and policy analysis. ©Shutterstock*

In a free webinar, co-organised with The Hague Centre for Strategic Studies, Asser researcher Rebecca Mignot-Mahdavi will discuss the challenges posed by the emerging use of big data for socio-legal research and policy analysis: “All data should be treated as potentially ‘personally identifiable information’ and every possible data combination as potentially sensitive”.

“Just as Duchamp repurposed a found object to create art, scientists can now repurpose found data to create research”. Through this analogy, Matthew Salganik, Professor of Sociology at Princeton University and author of the insightful Bit by Bit: Social Research in the Digital Age, describes the process through which social scientists reuse big data created and collected by private companies and governments for research purposes.

Identifying relevant Big Data sources for social research
Big data can be defined as the practice of accumulating extremely large amounts of information from a variety of sources, and processing that information using algorithms and statistical analysis. Big data sources include, of course, social media posts, but are not limited to them.

Researchers can also be interested in other sources of big data. Think of data collected in the physical world through digital devices, providing indications about users’ preferences, beliefs, ideas, or even behavioural patterns.

Finally, governmental organisations also collect unprecedented amounts of data to create administrative records. Although such records have always attracted social scientists, big data has allowed governments to boost their collection and analysis, making these records an even more attractive tool for studying social phenomena.

Legal and ethical challenges

Independence: taking distance from given data or collecting your data yourself

Most of the time, big data used in social research is generated by private companies or governmental organs. Maintaining the research project’s independence and trustworthiness thus requires conducting a laborious but essential preliminary inquiry to understand as much as possible about the people who created the data and how they did it. This is the first essential step to “repurpose” the data, as Salganik calls it.

Another possibility is for researchers to create their own data. Instead of analysing existing data collections, social scientists can develop tools or digital devices and offer them to users, who will in return provide the data sought. For this to succeed, the researcher will generally have to create a tool that is also valuable to its users, so that they have an incentive to use it. Such value can be created, for instance, through the practical usefulness of the device, its educational dimension, its promotion of inclusiveness, its capacity to raise citizens’ voices, and so on.

Privacy: Strengthened anonymisation process

Disclosure of personal information has always been prohibited. In the digital age, however, the risk of such disclosure is more acute. Information about our behaviours and ideas that would never have been collected before can now be captured and recorded. As a result, disclosure can cause not only physical harm but also social, economic, or psychological harm, and in more intense ways than before.

One could argue that anonymisation has always made it possible to address privacy concerns in empirical social research, and could still perform this function despite the increased risks. However, traditional anonymisation processes prove insufficient to prevent this increased risk of harm. Why is that? The danger with big data is that some pieces of information might not in themselves appear to be “personally identifiable information”, or PII (i.e. data that could potentially be used to identify a particular person), but might allow re-identification when taken together.
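
To illustrate how combinations of seemingly harmless attributes can single people out, here is a minimal Python sketch. It is an illustration only, not a method described by Salganik or used in any particular project: the records, the column names (`postcode`, `birth_year`, `device`) and the helper functions are all hypothetical. For every combination of columns, it counts how many records are unique, that is, shared with no other record in the data set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: no names or addresses, only "harmless" attributes.
records = [
    {"postcode": "2517", "birth_year": 1985, "device": "phone-a"},
    {"postcode": "2517", "birth_year": 1985, "device": "phone-b"},
    {"postcode": "2517", "birth_year": 1990, "device": "phone-a"},
    {"postcode": "2518", "birth_year": 1985, "device": "phone-a"},
    {"postcode": "2518", "birth_year": 1990, "device": "phone-b"},
    {"postcode": "2518", "birth_year": 1990, "device": "phone-b"},
]


def unique_record_count(rows, columns):
    """Count rows whose values on `columns` are shared with no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)


def disclosure_report(rows):
    """Map every non-empty combination of columns to its number of unique rows."""
    columns = sorted(rows[0])
    report = {}
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            report[combo] = unique_record_count(rows, combo)
    return report


if __name__ == "__main__":
    for combo, n_unique in disclosure_report(records).items():
        print(combo, n_unique)
```

In this toy data set no single attribute isolates anyone, each pair of attributes isolates two records, and all three attributes together isolate four of the six. A real assessment would also have to consider outside data sets that could be linked to the records, which this sketch does not attempt.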

Tackling ethical and legal challenges
To tackle this problem, Salganik suggests considering all data as potentially “personally identifiable information” and, I would like to add, every possible data combination as potentially sensitive. Starting from this premise, the researcher in the digital age should assess the risk of disclosure for each data type, whereas traditionally only the name, home address, birth date and sex would have been considered PII.

Furthermore, creating a data management plan before even collecting data is essential, in order to establish which data sets could be disclosed and which, on the contrary, would allow prohibited re-identification. Data management and protection plans also help to organise the safe storage of the data during and after the project, and to identify how non-PII will be made accessible (and thus beneficial) to the public for re-use. On the one hand, some data will never be disclosed (even when traditionally not considered PII); on the other hand, other data will be made open-access to serve future research projects.

Other ethical and legal challenges that have to be tackled when using big data for social research include:

dealing with changes in the algorithmic system (voluntary or inherent to machine-learning algorithms) when trying to accurately capture social evolutions through big data;
coping with the fact that behaviours captured through digital devices are not spontaneous but rather shaped by those devices;
managing the existence of “dirty” big data sources (when the interaction with the digital device is used in a way that does “not reflect real actions of interest to researchers”, as Salganik defines it).

Similar concerns emerge for think tanks when using big data to explain complex policy developments and forecast future social, political, and economic trends. Rebecca Mignot-Mahdavi will share her thoughts on these issues during a webinar on Thursday 30 January 2020 at 15:30, co-organised with The Hague Centre for Strategic Studies.

Sign up here to join the webinar ‘Understanding the knowledge-policy nexus in a complex and uncertain world: Opportunities and challenges of the data deluge’, co-organised with The Hague Centre for Strategic Studies. The webinar will discuss the benefits of data-driven methods and tools, and the ethical and legal challenges associated with their use. The webinar will also ask “How can young people interested in policy analysis and data analytics prepare themselves for a career in this fast-changing field?”

Rebecca Mignot-Mahdavi is part of the Asser research strand Human dignity and human security in international and European law, which adopts as its normative framework a human rights approach to contemporary global challenges in the fields of counter-terrorism, international and transnational crimes, challenges at sea, new technologies and artificial intelligence, non-discrimination and historical memory. It examines what it means to safeguard human dignity, also in relation to human security, in these areas.