If you were in a public Discord any time over the past decade, you weren’t just chatting with your friends—you were participating in a massive sociological experiment. According to 404 Media, a team of researchers at Federal University of Minas Gerais in Brazil scraped more than 2 billion Discord messages from public servers and published the anonymized data online. So hopefully you were very cordial in your messages, because they’re forever now.
The messages were published as part of the research group’s paper, “Discord Unveiled: A Comprehensive Dataset of Public Communication (2015 – 2024).” The exact tally is 2,052,206,308 messages from 4,735,057 users across 3,167 servers, all sent between Discord’s public launch in 2015 and 2024. In total, the researchers said, those servers account for about 10% of the platform’s open servers.
The researchers said they published the massive dataset to give scientists a sizable sample of human activity that could be used for other research. “Our dataset enables researchers to explore the impact of digital platforms on political discourse, the propagation of misinformation, and the development of effective moderation and regulation strategies tailored to such environments,” the paper authors wrote. The paper suggests potential applications such as discourse analysis, studying the relationship between social media and mental health, and training AI chatbots.
There’s almost certainly interesting information in the dataset, as Discord’s lax moderation approach makes it a particularly good place to look for the evolution of the very online. But it’s at least a little uncomfortable to know that this data was just scraped willy-nilly and published without users knowing or consenting to it.
The researchers did anonymize the data, a process that included replacing usernames with randomly generated pseudonyms, hashing and truncating user and message identifiers, and removing other potentially identifying features. But that process is often less effective than one might think. When conversations and series of messages can be pieced together, a person may be able to glean details that identify users.
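For readers curious what that kind of anonymization looks like in practice, here is a minimal Python sketch of the steps described above. It is not the researchers' code. The hash function (salted SHA-256), the truncation length, the pseudonym format, and the record field names are all assumptions made for illustration.

```python
import hashlib
import secrets

# Hypothetical settings: the hash, salt handling, and length are assumptions.
SALT = secrets.token_bytes(16)  # random secret, never published with the data
ID_LENGTH = 16                  # hex characters kept after truncation


def hash_and_truncate(identifier: str, salt: bytes = SALT, length: int = ID_LENGTH) -> str:
    """Return a truncated, salted SHA-256 digest of an identifier."""
    digest = hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
    return digest[:length]


class Pseudonymizer:
    """Map each username to a stable, randomly generated pseudonym."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}

    def pseudonym(self, username: str) -> str:
        # Generate a random name the first time a user is seen, then reuse it.
        if username not in self._names:
            self._names[username] = "user_" + secrets.token_hex(8)
        return self._names[username]


def anonymize(message: dict, names: Pseudonymizer) -> dict:
    """Return a copy of a message record with identifying fields replaced."""
    return {
        "message_id": hash_and_truncate(message["message_id"]),
        "author_id": hash_and_truncate(message["author_id"]),
        "author_name": names.pseudonym(message["author_name"]),
        # The message text itself is left untouched in this sketch.
        "content": message["content"],
    }
```

Note that the same user always maps to the same pseudonym and hashed ID. That consistency is what keeps conversations readable for research, and it is also what lets someone follow one person's messages across a server and look for identifying details in the text.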
Also, it’s not entirely clear that this project is kosher with Discord’s own rules. While the researchers argue that the messages are from public groups, 404 Media pointed out that Discord’s Terms of Service explicitly states, “Do not mine or scrape any data, content, or information available on or through Discord services”—a rule that has been in place since at least 2020.
“Scraping our services without our written consent is a violation of our Terms of Service and Community Guidelines. Discord is diligently investigating this activity and will take appropriate enforcement actions,” a Discord spokesperson confirmed.
“This is a serious matter, and we are committed to protecting the privacy and data of our users. Based on our initial investigation, we determined that user accounts accessed Discord servers that were discoverable and widely accessible and scraped data without our permission,” the spokesperson said. “It appears the researchers took steps to protect people’s identities, but this still violates our policies and we are fully investigating.”
If nothing else, the paper is a good reminder to watch what you say. You never know who might be listening (or, in this case, reading it a decade later).