I’m often asked whether the world is a more hateful place today than it was yesterday. Are current levels of intolerance and xenophobia statistically meaningful – waypoints on the road to some grim Mad Max world of hypertribalism? It’s a difficult question to answer without first understanding how Hatebase collects and categorizes data from public sources such as social media.

To perform a comparative analysis, such as between one time period and another, you have to devise strategies for canceling out artifacts in the data. This process of de-artifacting is alternately called “data wrangling” or “data rationalization,” and is a mix of statistical manipulation and creative data modeling.

At first glance, the variety of artifacts present in any large international dataset can appear daunting. Hatebase, for example, contains a massive amount of incident data and there are significant disparities in data collection between economically robust regions and countries with nascent technological infrastructure. Being able to minimize the impact of human, technological, and environmental bias is therefore an inevitable and important part of any data analyst’s skill set, as we’ll see when we look at some of these artifacts in greater detail.

The granularity artifact

Every few minutes, Hatebase ventures out onto the internet to query a public data source with a term from its multilingual lexicon. At present throughput, this means that the system automatically cycles through the entire lexicon of several thousand terms over the span of a few days. Another way of saying this is that the minimum sample granularity for analysis is roughly a week – while you can query a specific day, there’s only a ~20% chance that the term you’re interested in was checked on that day, so it’s easy to mistake a lack of data for a genuine absence of hate speech.

To better understand the importance of granularity, imagine that you want to calculate the number of daylight hours for some point on the Earth. You would need at least 24 hours to make a reasonable estimate, since a single hour’s measurement will tell you only whether it’s night or day during that hour. (And ideally you’d want to sample much longer than 24 hours to get a better understanding of variance due to the tilt of the Earth’s axis.) This is the challenge of granularity, which is more pronounced the more that you narrow your focus.

If instead of querying the usage of a single term, you query the usage of all English language terms, the impact of this artifact is reduced because presumably some English terms will have been checked on any given day.
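
The effect of widening a query can be roughed out numerically. The sketch below assumes, purely for illustration, that each term has an independent ~20% chance of being checked on any given day; the real scheduler may well cycle through the lexicon in a fixed order, in which case coverage would climb faster. The function name and parameters are hypothetical and are not part of any Hatebase API.

```python
def coverage_probability(window_days: int, terms: int = 1,
                         daily_check_rate: float = 0.2) -> float:
    """Chance that at least one of `terms` terms was checked in the window.

    Assumes each term is checked independently with probability
    `daily_check_rate` on each day (an illustrative simplification).
    """
    if window_days < 0 or terms < 1:
        raise ValueError("window_days must be >= 0 and terms must be >= 1")
    if not 0.0 <= daily_check_rate <= 1.0:
        raise ValueError("daily_check_rate must be between 0 and 1")
    # Probability that every term was skipped on every day of the window.
    missed_every_time = (1.0 - daily_check_rate) ** (window_days * terms)
    return 1.0 - missed_every_time


if __name__ == "__main__":
    for days in (1, 3, 7, 14):
        print(f"{days:>2}-day window, one term: {coverage_probability(days):.0%}")
    print(f" 1-day window, 50 terms: {coverage_probability(1, terms=50):.2%}")
```

Under these assumptions, both lengthening the window and adding terms push coverage toward certainty, which is why a broad query is far less likely to return a misleading empty result than a narrow one.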

The volume artifact

Hatebase ingests approximately 10,000 unique datapoints per day, which is the capacity ceiling for the amount of data which HateBrain, our natural language processing (NLP) engine, can analyze. Ingesting more data, while tempting, would lead to a situation where the database begins to accrue “data debt” – an ever-increasing backlog of data that can never be analyzed because it’s buried beneath more recent data. The solution to a capacity ceiling is to increase processing power (e.g. through a parallel processing architecture), and this is indeed a “nice to have” on Hatebase’s product roadmap. For now, however, 10,000 datapoints per day is a healthy amount of raw data to work with.

Of course, compared with the amount of data published by public sources like Twitter (which generates approximately 500,000,000 tweets per day), 10,000 datapoints represent just a fraction of all online conversations, which results in a potential volume artifact – whatever percentage of ingested data HateBrain assesses to be hate speech, the percentage in non-ingested data may be arbitrarily higher or lower, depending on how representative Hatebase’s sample is.

To illustrate, imagine that you want to estimate the alphabetical distribution of surnames in the New York City phone book, but you can sample only 1,000 names. The law of large numbers should give you some comfort that a random sample of 1,000 names will more or less match the distribution across all of New York – but this will still be an estimate rather than an exact distribution.

As with the granularity artifact above, the best strategy for counteracting a volume artifact is to broaden your search parameters and thus increase the sample size, since an analysis of more vocabulary over a longer period of time will generally cause crests and troughs in the data to cancel each other out.
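
To put a rough number on how much a larger sample helps, the sketch below uses the standard normal-approximation margin of error for a proportion. It assumes a simple random sample, which ingested data may not be, so treat the result as a lower bound on uncertainty rather than a guarantee. The function name and the example figures are hypothetical.

```python
import math


def margin_of_error(proportion: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sampled proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); z = 1.96
    corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


if __name__ == "__main__":
    # Made-up example: 5% of sampled datapoints are classified as hate speech.
    for n in (1_000, 10_000, 70_000):
        moe = margin_of_error(0.05, n)
        print(f"n = {n:>6}: 5% plus or minus {moe:.2%}")
```

Because the margin shrinks with the square root of the sample size, a tenfold increase in data narrows the uncertainty by only about a factor of three, which is one reason to widen both the vocabulary and the date range rather than just one of them.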

The geolocation artifact

Every piece of incident data in Hatebase (which we call a “sighting”) has two critical attributes: time and place. Ingested data which can’t be geolocated is ignored by Hatebot, our intake engine, because data without the context of date and location isn’t actionable.

Our geolocation engine, HateMap, uses some innovative strategies to successfully geotag 10-15% of ingested data, which is significantly higher than native geolocation from any of the public data sources we use, but this still tosses out 85-90% of ingested data, resulting in a potential geolocation artifact.

As with the aforementioned artifacts, the geolocation artifact is most pronounced in smaller sample sizes, so broadening the scope of your query is the most effective way to minimize it.

The evolutionary artifact

At the time of writing, HateBrain is at v2.2. New versions of HateBrain are released as we iteratively refine our NLP code, but data classified with previous versions of HateBrain will reflect different (and probably less efficient) categorization. Further, there are some gaps in our historical data where previous, buggier versions of HateBrain failed and required months of iteration and improvement to resume ingesting data. This is a problem common to all complicated data collection systems – machines break, or are improved, and this arbitrarily impacts the amount of data collected.

A good analogy would be healthcare analysts trying to assess relative levels of mental illness across historical time periods, which is a challenge because the definitions of various pathologies vary across editions of the DSM (Diagnostic and Statistical Manual of Mental Disorders).

The best strategy to minimize the impact of older classifications is to run multiple parallel queries as a means of identifying gaps in the data. For instance, if you’re querying sightings of religious hate speech in 2016, try also querying ethnicity-related hate speech or gender-based hate speech during the same time period, which will give you a ratio. Then run the same queries for different time periods and see if the ratio is more or less the same. The larger your sample size, the more likely these ratios are to align, since while vocabulary is always being added to Hatebase, the distribution of types of vocabulary more or less remains the same.

If you do encounter an artificially higher or lower result, you’ll need to apply a correction, inflating or deflating your result based on a ratio you have greater confidence in. When this happens, it’s important to disclose this correction in your analysis so that readers are aware of the margin of error.
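
The ratio check and correction described above can be sketched in a few lines. This is one possible reading of the approach, not a prescribed method: the function names are hypothetical, the counts are invented for illustration, and it assumes the reference category (here, ethnicity-related sightings) was itself collected reliably in the suspect period.

```python
def category_ratio(target_count: int, reference_count: int) -> float:
    """Ratio of sightings in the target category to a reference category."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return target_count / reference_count


def ratio_drift(trusted_ratio: float, suspect_ratio: float) -> float:
    """Relative change of the suspect ratio versus the trusted one (0.0 = same)."""
    if trusted_ratio <= 0:
        raise ValueError("trusted_ratio must be positive")
    return (suspect_ratio - trusted_ratio) / trusted_ratio


def corrected_count(reference_count: int, trusted_ratio: float) -> float:
    """Estimate the target count from the reference count and a trusted ratio."""
    return reference_count * trusted_ratio


if __name__ == "__main__":
    # Invented counts: (religion sightings, ethnicity sightings) per period.
    trusted = category_ratio(400, 1000)   # period you have confidence in
    suspect = category_ratio(120, 900)    # period with a suspected gap
    print(f"trusted ratio {trusted:.2f}, suspect ratio {suspect:.2f}")
    print(f"drift: {ratio_drift(trusted, suspect):+.0%}")
    print(f"corrected religion estimate: {corrected_count(900, trusted):.0f}")
```

If you publish a figure produced this way, report the raw count, the trusted ratio, and the corrected estimate together so readers can judge the adjustment for themselves.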

Technological and cultural artifacts

If you’re aware of the “streetlight effect”, you’re aware that data visibility does not equate to data accuracy. (For those unfamiliar with this observational bias, also known as “the drunkard’s search,” a policeman asks a man who is searching for his keys beneath a streetlight whether he’s sure this is where the keys went missing, and the man replies, no, he lost his keys in the park, but it’s easier to look where the light is brightest.)

The predominant language in Hatebase’s lexicon is English, and thus the overwhelming majority of sightings identified by HateBrain are also in English. Without additional context, one might therefore conclude that English speakers are more inclined toward hate speech than, say, Icelandic speakers. However, considering that Iceland has a tenth of a percent of the number of Internet users to be found in the United States (which is itself but one of many English-speaking countries), it’s evident that population and technology are skewing the data.

Similarly, many of the public data sources which Hatebase consults tend to be more popular in North America and Western Europe than in, say, Moldova, adding to an artificial skew toward English-language content. Gender can also play a role in skewing results geographically, since some countries have a higher percentage of male Internet users than female ones, which can result in a lower percentage of gender-based hate speech.

The best way to correct for these biases is to use compensatory ratios based on known geographical disparities in Internet access, platform adoption, income level, gender, etc. In general, cross-cultural analyses will always have a significantly higher margin of error than time-based analyses within a single geographical area.
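
The simplest compensatory ratio is a per-capita rate: sightings divided by the number of Internet users in the region. The sketch below shows only that one adjustment; the function name is hypothetical and both the sighting counts and the user counts are invented, so substitute figures from a source you trust. Platform adoption, income, and gender would each need a further correction of their own.

```python
def sightings_per_million_users(sightings: int, internet_users: int) -> float:
    """Normalize a raw sighting count by the size of the online population."""
    if internet_users <= 0:
        raise ValueError("internet_users must be positive")
    return sightings * 1_000_000 / internet_users


if __name__ == "__main__":
    # Invented figures: (raw sightings, Internet users) for two regions.
    regions = {
        "large region": (50_000, 250_000_000),
        "small region": (60, 300_000),
    }
    for name, (sightings, users) in regions.items():
        rate = sightings_per_million_users(sightings, users)
        print(f"{name}: {sightings} raw sightings, {rate:.0f} per million users")
```

In this made-up example the raw counts differ by nearly three orders of magnitude, yet the per-capita rates are identical, which is exactly the kind of skew that population and technology introduce.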

Some generally useful strategies for data rationalization

A few strategies are shared by several of the artifacts discussed above:

  • Establish a baseline – When you’re unsure whether no results imply a null result or incomplete data, try to establish a baseline using other data types or date ranges. If a result stands out as unusually high or unusually low, baseline analyses can reveal whether your sample size is truly representative.
  • Maximize your sampling – Where possible, use longer date ranges, which will provide better aggregate results than shorter ones.
  • Avoid comparing countries and languages – If you do need to analyze cross-linguistic data, try to compensate for known technological, cultural, and socioeconomic disparities.
  • Look for correlations – Hate speech rarely occurs in a vacuum. If you see a spike that resists correction and which you suspect to be an actual increase in hate speech in a specific region during a specific period of time, look for correlating “real world” events which may explain it, such as a contentious election, a racially-charged legal decision or police action, an act of terrorism, a precipitous drop in GNP, a dramatic change in population movement, etc.
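
As one way to put the baseline strategy into practice, the sketch below scores a result against comparable periods using a z-score (how many standard deviations it sits from the baseline mean). The function name, the weekly counts, and any cutoff you choose for "unusual" are assumptions made for illustration; a small baseline like this one gives only a rough signal.

```python
import statistics


def z_score(value: float, baseline: list[float]) -> float:
    """How many sample standard deviations `value` lies from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two observations")
    spread = statistics.stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation, so a z-score is undefined")
    return (value - statistics.mean(baseline)) / spread


if __name__ == "__main__":
    # Invented weekly sighting counts for comparable earlier weeks.
    baseline_weeks = [100, 110, 90, 105, 95]
    for observed in (102, 140, 0):
        print(f"{observed:>3} sightings: z = {z_score(observed, baseline_weeks):+.1f}")
```

A strongly negative score for a week with zero sightings suggests incomplete data rather than a true null result, while a strongly positive score is a cue to look for a correlating real-world event before treating the spike as genuine.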

A good example of rationalized, statistical data analysis can be found in the paper “A Quantitative Approach to Understanding Online Antisemitism” by Joel Finkelstein (Princeton University), Savvas Zannettou (Cyprus University of Technology), Barry Bradlyn (University of Illinois at Urbana-Champaign) and Jeremy Blackburn (University of Alabama at Birmingham). (In the interest of full disclosure, Hatebase was one of the data providers for this research.)

Still stuck? Let us know what you’re trying to accomplish with our data and we’ll try to provide some high-level guidance.