At the Sentinel Project, we are big advocates of making the data we create openly available to anybody who wants access to it. Opening up our data allows members of the public to learn from it, create data visualizations, gain new insights, or create mashups with other openly available data sets.

In this blog post, you will learn how to easily access and manipulate the two main flows of data that we are making available.

The software that makes it very easy to get started is OpenRefine (formerly called Google Refine). This data manipulation tool was originally developed at Google, which later stopped working on it and handed it over to the open-source community. Go ahead and install it; it runs on Windows, Linux and Mac.

Before you run it, you need to decide which data you will be using and build the URL you will need to access it. The Sentinel Project's two main streams of data are both available in JSON format through a URL-based API.

Threatwiki

Threatwiki is our genocide risk tracking and visualization platform, which helps monitor communities at risk of genocide around the world (more details about the tool are in the launch article). The data is a list of events, researched and found by our research analysts, that are chosen because they indicate a threat to the community and fit into our Stages of Genocide Model. We previously used this data set to create a visualization of the persecution against the Baha’i community in Iran.

There are three kinds of data:

  • Datapoints: this is the main type of data. Datapoints contain the events themselves, which can be further sorted by description, genocide stage, location, tags, event date, etc.
    • All the datapoints of the Iran situation of concern (the same API URL we used to build our visualization; notice that it follows the format /api/datapoint/soc/Name_of_situation_of_concern):
      http://threatwiki.thesentinelproject.org/api/datapoint/soc/Iran,%20Islamic%20Republic%20of
    • All the datapoints under the genocide stage Extermination:
      http://threatwiki.thesentinelproject.org/api/datapoint/stage/Extermination
  • Situations of Concern (SOC): the countries or regions that we are currently gathering data on
    • A list of all the situations of concern:
      http://threatwiki.thesentinelproject.org/api/soc
  • Tags: each datapoint gets tagged in order to simplify filtering among them
    • A list of all the tags that are used to classify datapoints in the Myanmar situation of concern:
      http://threatwiki.thesentinelproject.org/api/tag/soc/Myanmar

A full list of all the possible URLs is available on the GitHub project page:
https://github.com/thesentinelproject/threatwiki_node/wiki/API-Usecases
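
As a rough illustration of how the Threatwiki URLs above are put together, here is a minimal Python sketch. The helper names `threatwiki_url` and `fetch_json` are made up for this example and are not part of Threatwiki; only the URL patterns come from the list above. The actual download is left commented out, since it assumes the API is reachable and still responds as described.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://threatwiki.thesentinelproject.org/api"

def threatwiki_url(*parts):
    """Join path segments into a Threatwiki API URL, percent-encoding each one."""
    # Commas are kept as-is, matching the example URLs; spaces become %20.
    return "/".join([BASE] + [quote(part, safe=",") for part in parts])

def fetch_json(url):
    """Download a URL and decode its JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)

iran_url = threatwiki_url("datapoint", "soc", "Iran, Islamic Republic of")
print(iran_url)
# datapoints = fetch_json(iran_url)  # uncomment to actually call the API
```

The same helper covers the other patterns, for example `threatwiki_url("tag", "soc", "Myanmar")` or `threatwiki_url("soc")`.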

Hatebase

Hatebase, launched in March, is the world’s largest online database of hate speech. On top of being a catalog of hate speech terms, it also tracks usage of hate speech, either submitted manually by our users or collected automatically by a bot that scans geo-located tweets for hate speech terms. All this data is also available for free.

In order to query the API, you first need to obtain an API key.

Once this is done, the Hatebase site has a page with instructions for querying the Hatebase API.

For your convenience, here are a few examples using the two main data sets of Hatebase (keep in mind that you are limited to 100 queries a day on Hatebase):

  • Vocabulary: the hate speech words that are contained in the database
    • All the terms in English
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/vocabulary/json/language%3Deng
    • All the terms in French about ethnicity
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/vocabulary/json/language%3Dfra%7Cabout_ethnicity%3D1
    • All the vocabulary (if there are more than 1,000 words, you will need to use the pagination option)
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/vocabulary/json/
  • Sightings: usage of the hate speech terms, either observed by our users or found on Twitter
    • All the sightings between 2013-07-01 and 2013-07-13
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/sightings/json/start_date%3D2013-07-01%7Cend_date%3D2013-07-13
    • All the sightings in Mexico (country code MX)
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/sightings/json/country%3DMX
    • All the sightings (each page provides 1,000 records; increment the page number to get as many sightings as you want)
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/sightings/json/page%3D1
      http://api.hatebase.org/v3-0/YOUR_API_KEY_HERE/sightings/json/page%3D2
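
The Hatebase URLs above all follow one pattern: filters are written as key=value pairs, joined with "|", and the whole filter string is percent-encoded ("=" becomes %3D and "|" becomes %7C). Here is a small Python sketch of that pattern. The function names `hatebase_url` and `sightings_pages` are invented for this example and are not part of any official Hatebase client; the sketch only builds URLs and does not call the API.

```python
from urllib.parse import quote

BASE = "http://api.hatebase.org/v3-0"

def hatebase_url(api_key, dataset, **filters):
    """Build a query URL for the 'vocabulary' or 'sightings' data set."""
    # Join filters as key=value pairs separated by "|", then percent-encode
    # the whole string so "=" becomes %3D and "|" becomes %7C.
    query = "|".join(f"{key}={value}" for key, value in filters.items())
    return f"{BASE}/{api_key}/{dataset}/json/{quote(query, safe='')}"

def sightings_pages(api_key, last_page):
    """Yield one sightings URL per page, from page 1 up to last_page."""
    for page in range(1, last_page + 1):
        yield hatebase_url(api_key, "sightings", page=page)

print(hatebase_url("YOUR_API_KEY_HERE", "vocabulary",
                   language="fra", about_ethnicity=1))
for url in sightings_pages("YOUR_API_KEY_HERE", 2):
    print(url)
```

Given the limit of 100 queries a day mentioned above, it is worth generating only the pages you actually need.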

 

Using OpenRefine

After installing OpenRefine and launching it, it opens a page in your browser.

  1. Click on Create Project -> Web Addresses. That’s where you put the URL of the data you want to obtain and manipulate, from either Hatebase or Threatwiki.
  2. Choose to parse the data as JSON files.
  3. In the preview, select the part of the JSON data that corresponds to a record.
  4. Choose a project name at the top right and click Create Project.
  5. Your data is displayed in a table (Excel-style) format.

With the Export button at the top right of the page, you can export the data to other file formats (such as Excel) that allow you to analyze it in other software.
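
If you would rather script this step than click through OpenRefine, the same idea (one JSON record per table row) can be sketched in Python with the standard library. The function name `records_to_csv`, the sample records and their field names are all invented for illustration; the real API responses have their own fields, and nested values would need extra handling.

```python
import csv
import io
import json

def records_to_csv(records):
    """Turn a list of flat JSON objects into CSV text, one row per record."""
    # Collect every key that appears, in first-seen order, for the header row.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    # restval="" leaves a cell empty when a record lacks that column.
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Hypothetical sample data, standing in for a downloaded API response.
sample = json.loads(
    '[{"title": "Example event A", "stage": "Extermination"},'
    ' {"title": "Example event B", "location": "Somewhere"}]'
)
print(records_to_csv(sample))
```

The resulting text can be saved to a .csv file and opened in a spreadsheet program.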

You can also use OpenRefine to manipulate the data directly. There are tons of resources out there on how to use OpenRefine. You can filter the data, sort it, change the name of columns, get a list of all the values available in a column, transform the data using a set of scripts, etc.

I’ve made this short video to show you quickly the kind of manipulation you could do with OpenRefine.

I hope this blog post helped you understand how to obtain data through our API. Don’t hesitate to write to us at techteam@thesentinelproject.org and let us know how you use the data!

UPDATE March 12th 2014: Someone created a PHP wrapper for the Hatebase API; it is available here: https://github.com/awelters/hatebase