Data Accessibility Update: X Takes a Tumble, TikTok Inches Forward

For researchers and civil society organizations who track online antisemitism and other forms of hate, access to data from social media platforms is essential to verify platforms’ claims and expose gaps in policy enforcement. Accurate and comprehensive data from platforms allows groups such as ADL to understand the extent and nature of online hate and extremist activity, the experiences of victims, and steps platforms are or are not taking to address harmful content. Without this kind of data access, the public is left with few options to validate tech companies’ assertions when reporting on hate speech and enforcement metrics. 

As part of ADL’s 2021 Online Antisemitism Report Card, ADL included a scorecard evaluating platforms for their commitment to researcher access to data. 

Since our initial report, however, some platforms have taken steps backwards in terms of data access. In 2023, two major mainstream platforms, Reddit and X, began monetizing their previously free access to platform data. This restricts the ability of interested parties, such as researchers and volunteer community moderators (e.g., Reddit mods), to monitor and take action against problematic speech.

User Data & APIs: How They Work

Social media platforms collect untold volumes of user data, which is sold to advertisers and used to maximize engagement on the platforms, for immense profit. This business model, sometimes called “surveillance capitalism,” means that tech companies have an incentive to collect granular personal data by maximizing the amount of time users spend watching, scrolling, liking, and clicking on their sites—even when the content is hateful or incendiary. But tech companies also have a responsibility to make online spaces just and equitable, protecting users from harmful content such as hate and harassment.  

Making platform data transparent and accessible to the public ensures that independent researchers, including from academia and civil society, can evaluate tech companies’ claims about enforcing their rules, such as how much violative content they remove or otherwise sanction. Data access also allows researchers to study online hate and harassment and to better understand the experiences of targets, not to mention extremist activity. Without this degree of transparency, platforms’ claims about moderation and user safety are impossible to verify.  

Community moderators and developers rely on data access tools as well, including third-party content moderation tools for sites like Reddit, Discord, and Twitch. Without access to Application Programming Interfaces (APIs), moderators are unable to access, report, and remove content violating rules for their particular online communities, such as subreddits or Discord channels. Restricting or monetizing data access hurts researchers, moderators, and ultimately the public that relies on social media for conversation, connection, and community.

APIs serve as intermediaries between different pieces of software, allowing them to communicate using a sort of common language. In the context of platform research, this typically entails researchers submitting requests for specific public data from the platform and having the requested data returned by the API. For example, to study the rate or nature of antisemitic posts over a given period, ADL might request posts containing the word “Jewish,” posted between October 7 and October 14, that have received at least 1,000 likes.  
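The kind of request described above can be sketched as a set of filters applied to post records. This is only an illustration under stated assumptions, not any platform's actual API: the `Post` record, its field names, the `matches_query` helper, the sample posts, and the year 2023 are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    # Hypothetical record shape; real platform APIs name these fields differently.
    text: str
    posted_on: date
    likes: int


def matches_query(post: Post, keyword: str, start: date, end: date, min_likes: int) -> bool:
    """Return True only if the post satisfies every filter in the request."""
    return (
        keyword.lower() in post.text.lower()
        and start <= post.posted_on <= end
        and post.likes >= min_likes
    )


# Invented sample data standing in for what an API might return.
sample_posts = [
    Post("A post about Jewish history", date(2023, 10, 9), 2500),
    Post("A post about Jewish holidays", date(2023, 10, 20), 4000),  # outside the date window
    Post("A post about cooking", date(2023, 10, 10), 9000),          # keyword missing
    Post("Another Jewish community post", date(2023, 10, 12), 300),  # too few likes
]

# The year is an assumption; the request in the text names only the month and days.
results = [
    p for p in sample_posts
    if matches_query(p, "Jewish", date(2023, 10, 7), date(2023, 10, 14), 1000)
]
print(len(results), "post(s) matched")
```

In practice the platform applies these filters on its own servers and returns only the matching posts, which is why search capability is one of the criteria scored later in this article.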

A popular way of explaining APIs is to compare them to a server at a restaurant. The technical infrastructure of the platform where data is stored, structured, and made visible for interactions is like the restaurant’s kitchen – the general public is not allowed in. But, if you know how to place an order correctly (by putting in a “call” to the API), the API can go back into the kitchen for you with your request and bring you out what you need. 

Official API access from platforms is critical to research because it is generally considered the most complete and reliable source of data about social media content and user engagement. While there are other means of collecting data, including purchasing data from third parties and “scraping” it from the internet, these methods often require more time and effort. They can also be blocked by anti-automation measures such as CAPTCHA, leading to incomplete data or a total lack of access at scale. Although scraping public data is considered lawful, it generally violates platforms’ terms of service. Platforms prohibit scraping for several reasons: they have a financial incentive to retain sole access to datasets of user behavior at scale (to sell to advertisers and improve the platform), and prohibiting scraping keeps their servers from being overwhelmed by requests and protects user privacy. To extend the earlier metaphor, walking into the kitchen yourself is generally frowned upon by the restaurant; while you might be able to get something to eat before you’re kicked out, it may be more like a smattering of ingredients than a full meal.

Scoring Platforms on Data Accessibility 

Official API access is not a one-size-fits-all solution: the utility of platform APIs is affected by a number of other factors. Our updated Data Accessibility Scorecard evaluates the extent and functionality of each major platform’s data access for researchers, using several criteria:

  • The primary question is who has access: are APIs open to the general public, to researchers generally, only to specific types of researchers, or only to those organizations (frequently corporations) that can afford them?

  • What data is accessible, and can it be easily queried via search?  

  • Can live data and/or random samples of platform content be acquired?  

  • What usage limits (also known as rate limits or quotas) cap the amount of data that can be requested?  

  • And does the platform publish clear and thorough documentation meant to aid in the proper use of their API? 

The updated Data Accessibility Scorecard addresses these questions, assigning each platform a composite score based on five categories: groups provided API access, extent of API monetization, available data, search capability, and rate limits (see footnote 1). These scores provide an overview of the state of data accessibility on eight major platforms in 2023, and largely paint a dire picture.

Reddit 

Reddit’s API provides access to public data, increased rate limits for researchers (although even after several weeks, Reddit had not responded to ADL’s requests for researcher status), search functionality with a number of filters, a rate limit of 100 requests per minute with no further daily quota (for comparison, TikTok limits API usage to 1,000 requests a day), and even access to live data in the form of Live Threads. Despite these helpful features, Reddit charges third parties for higher request rates, largely in response to concerns that OpenAI and other AI companies are using Reddit content as training data without compensation. Some press speculated that Reddit made this move to boost revenue in advance of an expected IPO.
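To put the two rate limits quoted above on the same footing, the per-minute figure can be converted to a theoretical daily ceiling. This is a rough sketch, assuming a client that sends requests at the maximum sustained rate for a full 24 hours, which real research workloads rarely do; the function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def daily_ceiling(requests_per_minute: int) -> int:
    """Theoretical maximum requests per day at a sustained per-minute limit."""
    return requests_per_minute * MINUTES_PER_DAY


def min_seconds_between_requests(requests_per_minute: int) -> float:
    """Spacing a client would need between requests to stay under the limit."""
    return 60 / requests_per_minute


reddit_daily = daily_ceiling(100)  # 100 requests per minute, no daily quota
tiktok_daily = 1_000               # 1,000 requests per day, as stated in the text

print(reddit_daily)                       # 144000
print(reddit_daily // tiktok_daily)       # 144
print(min_seconds_between_requests(100))  # 0.6
```

Under these assumptions, the per-minute limit allows up to 144 times as many requests in a day as a 1,000-request daily cap.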

Reddit does not charge non-commercial researchers for API access, nor for moderation bots, though such bots require time and technical knowledge to implement. However, monetization has severely affected popular third-party tools that made moderation at scale more manageable for Reddit’s volunteer moderators who lack the necessary hours and coding expertise. The problem is not the mere act of monetizing the Reddit API but the pricing, which Apollo CEO Christian Selig estimated at roughly 72 times the cost of data from Imgur, a platform similar to Reddit.

In response to this change, Reddit moderators launched strikes and blackouts of popular subreddits. And while Reddit made some concessions for accessibility-focused apps, the platform otherwise forced through its changes. Third-party apps that once aided Reddit's volunteer moderators have been severely affected by the high cost of API access: Apollo has shuttered entirely, while Pushshift, through a partnership with Reddit, has become available to moderators only. Pushshift has also been beset with technical hiccups and obstacles such as single-use access tokens that prevent the automation of moderation. While Reddit’s API now provides the most accessible data among mainstream platform APIs, its lower score in 2023 versus 2021 reflects the effect of this monetization scheme on the community moderators Reddit relies upon to identify and take action against violative content.

YouTube 

YouTube offers a data API to the general public with additional data access for academic researchers. Its strengths include retroactive search functionality, detailed and clear documentation, and access to videos, comments, recommended videos, and metadata for videos and user pages. Its main drawback is its rate limit, especially for civil society researchers (academic researchers are eligible for higher limits as part of its higher education program).  

The public Data API offers 10,000 “units” a day, but search requests are heavily weighted, costing 100 units each and returning only 10 videos per search. A researcher using the API only for searches would therefore be limited to 100 searches per day, which puts YouTube’s rate limits below even TikTok’s 1,000 search requests per day, with fewer results per search. Unlike TikTok, YouTube offers an opportunity for these limits to be expanded for certain approved academic researchers. YouTube should also add the ability to track comment threads beyond immediate replies (that is, beyond one level deep). The features available in this API are expansive, but higher rate limits are necessary for non-academic researchers.
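The quota arithmetic in the paragraph above can be worked through in a few lines. The figures are the ones stated in the text; the variable names are illustrative, and the calculation assumes the entire daily quota is spent on search requests.

```python
DAILY_QUOTA_UNITS = 10_000  # units available per day on the public Data API
UNITS_PER_SEARCH = 100      # cost of one search request
VIDEOS_PER_SEARCH = 10      # results returned per search, as described in the text

searches_per_day = DAILY_QUOTA_UNITS // UNITS_PER_SEARCH
videos_per_day = searches_per_day * VIDEOS_PER_SEARCH

print(searches_per_day)  # 100
print(videos_per_day)    # 1000
```

At these rates, a researcher relying on search alone could surface at most 1,000 videos per day, and any other API calls made that day would reduce that number further.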

Discord 

Discord, a chat platform for online gamers and other communities, consists of private channels created by users. Accordingly, there is no truly “public” data. Researchers must already be members of a given channel to access data. The tools described in the Discord API documentation are largely geared towards integration with bots and moderation (with options such as setting up auto-moderation rules and pinning messages) rather than research applications. While the API features a high rate limit, it lacks search functionality, hampering its utility for research: channels or particular posts would need to first be located manually by a researcher, rather than discovered through search queries.

TikTok 

In 2023, TikTok launched its Research API, initially for academic researchers in the United States, and then in Europe. Although the API can return data on videos, comments, and user pages, and there is a search feature in the API with a number of filters, several other features are lacking or missing. For example, only 1,000 comments may be collected per post; there is no way of filtering by Stitches to or Duets with a video (the TikTok version of reply threads); and follower lists are not available despite being public on the app and on the web. TikTok still does not provide access to civil society researchers, including ADL. Moreover, the requirements for approved researchers are onerous: they must refresh data every fifteen days and share outputs with TikTok seven days in advance of publication (this is a significant improvement over the original requirement of 30 days). The platform has also still not made available an API tracking content moderation actions, which was promised alongside the data API in July 2022, nor indicated when (or if) it will. But the most serious limitation of the TikTok API is its rate limit of only 1,000 requests for data per day, which lags behind similar platform API offerings, with no possibility for expanded access. This low rate limit curtails the scale of research possible on the platform.

Meta (Facebook and Instagram) 

In 2016, Facebook purchased CrowdTangle, a tool for tracking trends and how posts perform on the platform. From 2017 onwards, especially after the 2018 Cambridge Analytica scandal, Facebook began to roll back CrowdTangle access. By 2022, Meta had removed CrowdTangle support entirely.

In late 2023, Meta released its Public Content Library and API, available to researchers affiliated with a “qualified academic institution or a qualified research institution.” The documentation for this API and its features is clear, and the “Get Code” option on the web application is especially useful for researchers with less coding experience; it is a feature other platforms would do well to adopt. Meta’s rate limit of 60 requests/minute is fairly generous, though the cap of 500,000 posts a week could hinder some research.

The primary concern about Meta’s new Content Library and API is what is not available. Foremost, comments and stories cannot be collected for either platform, leaving large swaths of data on user activity inaccessible. WhatsApp is absent from the API, likely due to the platform’s more private nature. Finally, all tags of individual users are excluded and anonymized. While this may be a useful step in protecting the privacy of most individual users, it presents a serious obstacle to studying activity surrounding public figures such as politicians and celebrities. 

Twitch 

Like Discord, Twitch operates primarily as a secondary platform for online gamers where creators can livestream gameplay. Also like Discord, Twitch’s API is better suited to content creators and moderators than to researchers. Video recordings of completed streams made public by their creators may be obtained, but no further data is available. Live video streams and chat logs, arguably the two most important aspects of this streaming platform, are inaccessible by default. This data is available only when streamers grant express permission to researchers to collect it and approve their software for doing so; otherwise, the Twitch API is limited to video recordings. The previous version of this scorecard weighted reporting and moderation tools more heavily, whereas in this update we prioritize accessibility of raw data, which accounts for the change in Twitch’s rating.

X

Once the gold standard of official data APIs, X has gated access to data behind steep subscription fees. In March 2023, X announced the details of its revised API access tiers, officially ending a 17-year era of free API access. Previously, the general public could retrieve up to 2 million tweets per month and access a random 1% of the live “firehose” stream of all tweets for free, and academic researchers could retrieve up to 10 million tweets per month without charge. Now, the free tier of API access includes just write-only access (e.g., posting tweets), making the API unsuitable for researchers.

The $100/month “Basic” tier includes only 10,000 post reads per month. The next tier up costs $5,000/month, with access to 1 million posts per month (a tenth of what was previously available to researchers for free). Only at this tier do search and the live stream of tweets become available. Further access is available via the Enterprise tier, with plans from $42,000/month to $210,000+/month. Such sums are beyond the reach of most researchers, rendering access to X cost-prohibitive to all but companies and organizations with deep pockets.
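The tier prices above can be translated into an approximate cost per post retrieved. This is a back-of-the-envelope sketch using only the figures quoted in this section, assuming each tier's monthly allowance is used in full; the dictionary keys and variable names are illustrative labels, not official tier names.

```python
# (monthly price in USD, posts readable per month), as quoted in the text.
tiers = {
    "basic": (100, 10_000),
    "next_tier_up": (5_000, 1_000_000),
}

# Effective price per post if the whole monthly allowance is used.
cost_per_post = {name: price / posts for name, (price, posts) in tiers.items()}

# Share of the former free academic allowance (10 million tweets per month)
# that the $5,000 tier now provides.
former_academic_allowance = 10_000_000
share_of_former_access = tiers["next_tier_up"][1] / former_academic_allowance

for name, cost in cost_per_post.items():
    print(f"{name}: ${cost:.3f} per post")
print(f"share of former academic allowance: {share_of_former_access:.0%}")
```

On these figures, the Basic tier works out to about one cent per post and the next tier to about half a cent, for a volume that is one tenth of what academic researchers could previously retrieve at no cost.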

Footnotes

1. Available data and rate limits were scored based on the best free tier of data access for the platforms with paid API access options.