Introducing Hatebase: the world’s largest online database of hate speech

Predicting genocide is an almost impossible task, largely because early, actionable data is so scarce. There’s no chi-squared test or Monte Carlo method for reliably placing societies along a spectrum from homogeneous to homicidal. This is both because the extermination of entire populations has become a relatively rare occurrence (thanks to the ever-increasing internationalization of human rights, law, media, and trade) and because those societies which do succeed at systematized annihilation are often equally resourceful at hiding evidence of their crimes.

In the information-rich twenty-first century, good data remains the Achilles’ heel of genocide studies.

At the Sentinel Project for Genocide Prevention, we’re tackling this problem on two fronts. First, to improve our data intake, we’ve begun direct field work through our situations of concern (SOCs). Earlier this month, staff from the Sentinel Project were in Kenya during the contested presidential elections, monitoring tensions in urban hubs such as Nairobi and Mombasa as well as in known regional conflict zones such as the Tana River District.

Our second strategy has been to improve the tools with which we parse and prioritize data, whether from the field, from mainstream media or from social networks. To this end, the Sentinel Project recently partnered with my own organization, Mobiocracy, on the development of Hatebase, an authoritative, multilingual, usage-based repository of structured hate speech which data-driven NGOs can use to better contextualize conversations from known conflict zones.

Photo: Sheila Steele

Hatebase is available to casual users through a Wikipedia-like web interface, and to developers through an authenticating API. Although the core of Hatebase is its community-edited vocabulary of multilingual hate speech, a critical concept in Hatebase is regionality: users can associate hate speech with geography, thus building a parallel dataset of “sightings” which can be monitored for frequency, localization, migration, and transformation.

For instance, an organization monitoring several simultaneous theaters of operation might integrate location-based Hatebase data into its monitoring software to assign additional real-time “weight” to specific conflict zones, providing guidance on how to best redeploy limited resources. For genocide monitoring organizations in particular, regional hate speech is a widely recognized indicator of elevated risk.
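
As a rough illustration of that kind of weighting, here is a minimal Python sketch. It assumes sightings arrive as simple (term, region) records. The `Sighting` class, the `region_weights` function, and the placeholder terms and regions are all hypothetical and are not part of the Hatebase API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One observed use of a vocabulary term in a region (hypothetical record shape)."""
    term: str
    region: str


def region_weights(sightings):
    """Return each region's share of all sightings, as a value between 0 and 1."""
    counts = Counter(s.region for s in sightings)
    total = sum(counts.values())
    if total == 0:
        return {}  # no sightings, so no region gets any weight
    return {region: n / total for region, n in counts.items()}


# Placeholder data: three sightings in one region, one in another.
sample = [
    Sighting("term_a", "region_x"),
    Sighting("term_b", "region_x"),
    Sighting("term_a", "region_x"),
    Sighting("term_a", "region_y"),
]

for region, weight in sorted(region_weights(sample).items()):
    print(region, weight)
```

A real monitoring system would probably also account for population, overall message volume, and how recent each sighting is. This sketch only shows the basic idea of turning sighting counts into a relative weight per region.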

There are some weaknesses inherent in a solely vocabulary-based approach to linguistic analysis. Innocuous language, when localized, can adopt a sinister secondary meaning (e.g. “cockroaches,” meaning Tutsis in Rwanda), and threats can be communicated without the need for easily identified keywords (“their days are numbered”). Despite these limitations, Hatebase can provide a layer of relevance which complements other context-based information sources, not unlike traffic congestion layered onto a city map.
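
Both weaknesses are easy to see in a toy keyword matcher. The Python sketch below is an assumption-laden illustration, not how Hatebase itself works: `find_terms` is a hypothetical helper that only handles single-word, lowercase ASCII terms, and the one-word vocabulary is taken from the example above.

```python
import re


def find_terms(text, vocabulary):
    """Return the vocabulary terms that appear as whole words in text.

    Toy matcher: lowercases the text and splits on anything that is not
    an ASCII letter or apostrophe, so it only handles single-word terms.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words & set(vocabulary))


vocabulary = {"cockroaches"}

# A threat with no listed keyword goes undetected.
print(find_terms("Their days are numbered.", vocabulary))

# An innocuous sentence is flagged because the word alone carries no context.
print(find_terms("We found cockroaches in the kitchen.", vocabulary))
```

The first call finds nothing and the second finds a match, which is the opposite of what an analyst would want. That is why regional context, such as where and how often a term is sighted, matters as much as the vocabulary itself.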

In the months ahead, we’ll be adding data attributes, visualizations, and end-user functionality to Hatebase, with a particular focus on strengthening the API in accordance with our commitment to partnership-based innovation. Our hope is that other individuals, groups, and organizations will embrace this collaborative model by leveraging Hatebase data in their own applications.

hatebase.org
