Digital Data Collection through Data Donation

Financed by: HUN-REN

Project start: 01/11/2021

Project duration: 24 months


Research background and objectives

Survey research has dominated quantitative social science for the last 50-70 years. Researchers have always known the weaknesses of this method, but nothing has broken its hegemony as the best technique available. Recent changes, however, have challenged its leading role. One driver is the increasing difficulty of fieldwork and declining response rates; another is the emergence of new digital data sources. Some of this digital data is content users share, such as tweets, posts from places they like, or other reactions and interactions on social media. Digital data also includes the unintentionally generated data that the various devices we use collect about us and our lives (e.g., mobile phone location data). Digital data can already replace classic survey data in many areas. However, it is not evident that this should mean a complete paradigm shift in data collection in social science research. The survey method has advantages that cannot be replaced and that make it an invaluable data collection method. Combining the two data collection methods may overcome the weaknesses of each, and the right combination of survey and digital data may even yield new knowledge that is more than the sum of its parts. The main objective of our research is to create and test a methodological framework that allows for the effective conduct of mixed surveys in a changing digital data access environment.


The standard way of accessing digital data used to be through APIs, but social networking sites such as Facebook and Instagram have disabled these solutions. These changes have prompted the development of new digital data access models. One of the most promising new approaches is called data donation. GDPR obligations require large platform providers to offer users access to their data through "data download packages" (DDPs). In the data donation model, researchers invite users to share the digital data that the platform stores about them. The key benefit of partnering with users rather than companies is that it makes the data collection process more transparent for research participants. As this research approach is based on active collaboration with participants, it is easy to link this data collection with survey research. Combining the two data types is an ideal way to exploit their unique strengths and overcome their limitations.

Our research goal was a multi-platform data collection on a representative sample of Hungarian internet users. We planned to involve 500-800 people in the research. Multi-platform here means combining digital and survey data and collecting digital data from different sources: Facebook, Instagram, TikTok, Twitter, and Google. This data collection design is unique and novel; to our knowledge, no international project uses a multi-platform approach to collect social media data in parallel on a representative sample.


Data collection and data processing

In the first step of the research, we conducted a survey experiment in the spring of 2022 on a sample of 1000 people to investigate the factors that influence attitudes towards data sharing.

Building on the results of the preliminary research, we launched the data donation research in February 2023. The research was approved by the Research Ethics Committee of the Social Science Research Institute under Research Ethics No. 1-FOIG/130-37/2022. The data collection was led by the CSS-Recens group of TK, and the fieldwork was carried out by NRC. Research participants were asked to share their Facebook and Google (YouTube, search history, geolocation) data as specified in the research description. In addition, participants could also share their Instagram, TikTok and Twitter data. Participants who successfully shared their data also completed a 40-minute questionnaire at the end of the research. The data collection ended in June 2024. In total, data was collected from 758 participants. Facebook and some Google data was collected from all participants, while fewer participants shared Instagram and TikTok data. The technical description of the data collection is available in English at /uploads/files/Data_Collection_Process_20231004.pdf


The complex data processing work started after the data collection had finished. From the raw JSON files, we developed a standardised and anonymised SQL database capable of serving a variety of research needs. This dataset is continuously supplemented with additional external data (e.g. assigning metadata to YouTube videos).
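
As an illustration only, the step from raw JSON to a standardised SQL table can be sketched as follows. This is not the project's actual pipeline: the table layout, the field names ("timestamp", "type"), the salt and the helper functions are all hypothetical, and SQLite stands in for whatever SQL system is used. A salted hash only pseudonymises an identifier; real anonymisation would need further steps, such as the removal of free-text fields, which the sketch imitates by keeping only two fields per record.

```python
import hashlib
import json
import sqlite3


def pseudonymise(participant_id: str, salt: str) -> str:
    """Replace a participant identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + participant_id).encode("utf-8")).hexdigest()


def load_events(conn, raw_json, participant_id, platform, salt):
    """Flatten one donated JSON file into rows of a shared `events` table.

    Assumes (hypothetically) that the file is a JSON list of objects with
    "timestamp" and "type" keys; every other field is dropped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "participant TEXT, platform TEXT, timestamp INTEGER, event_type TEXT)"
    )
    pid = pseudonymise(participant_id, salt)
    rows = [
        (pid, platform, item["timestamp"], item["type"])
        for item in json.loads(raw_json)
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample data standing in for one participant's donated file.
conn = sqlite3.connect(":memory:")
sample = json.dumps([
    {"timestamp": 1675209600, "type": "search", "query": "free text, not stored"},
    {"timestamp": 1675213200, "type": "watch", "title": "free text, not stored"},
])
n_loaded = load_events(conn, sample, "participant-001", "google", salt="demo-salt")
```

Writing every platform into one table with a shared set of columns is one possible way to let the same query run across Facebook, Google and other sources; platform-specific tables would be an equally plausible design.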


The current state of the project

Although the research funding ended in November 2023, the project is not finished. The depth and size of the database created by the research are unparalleled internationally. Dissemination of the research data is ongoing on a variety of topics. The project data is used by, among others, Júlia Koltai's Lendület research group and Zoltán Kmetty's NKFIH-K research on Digital Political Footprints.

An important part of the dissemination of the research is the sharing of the anonymised data. The completed databases will be made available in the KDK repository.