Mapping climate policy: navigating context-specific language (Part I)
Finding all climate policies relevant to a particular topic is not a simple task. It is doubly difficult where language and terminology carry political sensitivity. The Climate Policy Radar app already contains these policies, so selecting text based on keywords is an obvious first step. However, because language is often highly context-dependent, we usually need to combine multiple keywords to gain a representative overview. Even the phrase "most vulnerable" is controversial in and of itself. Some people, including many academics, find it a useful shorthand for a host of different risk factors; others consider it disempowering (even offensive) because of its allusion to "weakness". We need to be very deliberate in choosing our keywords, so that we help our users find what they need without accidentally entrenching historic discrimination or alienating them.
To help us learn more about how climate terminology changes from country to country, we recently developed a mapping tool.
Searchable world map in CPR’s ‘Labs’ environment
Building a mapping tool
The basic idea of this tool is to 1) count how many paragraphs in our dataset mention a given keyword and then 2) plot the results on a world map. Luckily, all of CPR's data already has country information attached. From a purely technical standpoint, that makes building maps a breeze.
If you’re interested in the technical details, the code is all open source. We recently released a public version of this data on Hugging Face, so we start by loading it in. The database contains the full text of thousands of climate documents, including national policies, international agreements, project reports from climate finance institutions and so on. The whole dataset is split up into millions of paragraphs, so searching all the text can take quite a while. To speed this up, we load the paragraphs into a DuckDB database, which lets us run fast searches in SQL with just a few lines of code. We then use Geopandas to create country maps from the search results.
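The real pipeline uses DuckDB; as a rough sketch of the counting step, here is the same SQL shape using Python's built-in sqlite3 module. The table layout and example rows below are invented purely for illustration:

```python
import sqlite3

# Stand-in for the DuckDB database: a tiny in-memory table of paragraphs.
# Schema and rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (geography TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO paragraphs VALUES (?, ?)",
    [
        ("USA", "Hurricanes have intensified along the Gulf coast."),
        ("JPN", "Typhoons regularly make landfall in autumn."),
        ("USA", "New funding for hurricane recovery programmes."),
    ],
)

def keyword_counts(conn, keyword):
    """Count paragraphs per geography whose text mentions the keyword."""
    sql = """
        SELECT geography, COUNT(*) AS n
        FROM paragraphs
        WHERE LOWER(text) LIKE '%' || LOWER(?) || '%'
        GROUP BY geography
        ORDER BY n DESC
    """
    return conn.execute(sql, (keyword,)).fetchall()

print(keyword_counts(conn, "hurricane"))  # [('USA', 2)]
```

The per-country counts can then be joined to country geometries (e.g. with Geopandas) for plotting.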
Finally, to make all of this interactive, we create a Streamlit app. This again only takes a few lines, and it means that colleagues on our team who don’t code can play around to their heart’s content.
Before using this tool, it’s worth emphasising that we’re simply counting paragraphs. As a consequence, a country with 20 unambitious laws will likely rank above a country with a single short but ambitious law. In other words, we are not measuring anything like policy quality here. Even for policy attention this is a very rough proxy: we should be very careful not to over-interpret our results.
Mapping how climate terminology differs across the globe
Just to see if it works, let’s start by creating a map for the keyword “hurricanes”:
Number of paragraphs containing ‘hurricanes’ from 110 geographies
Great! This gives us lots of results, especially in coastal countries, which makes sense as these are most likely to experience hurricanes.
The United States is heavily over-represented, while Asian countries seem to be missing: there are no results for China, and only a handful in the notoriously storm-hit Philippines.
Watch what happens if we search for “typhoons” instead. Clearly this is the favoured term in East Asia, especially Japan and the Philippines:
Number of paragraphs containing ‘typhoons’ from 39 geographies
We face a similar challenge in language when mapping policies about those most affected by climate change. Who is vulnerable and how they are described varies greatly by country. Importantly, while the difference between “hurricane” and “typhoon” is rather uncontroversial — we want to capture both, but people are unlikely to take offence at either — the opposite is true for terminology about groups of people most impacted by climate change.
Tail-end distributions are key
Climate impacts tend to be worse for groups that are already marginalised or disadvantaged. This makes climate vulnerability deeply entwined with some very personal issues, such as health, ethnicity, gender and religion. Let’s focus on Indigenous groups in this section. Indigenous groups are often marginalised and directly reliant on nature, which means they are among the most affected by climate change. They are also a particularly difficult test case. Oral traditions are an important part of how many Indigenous cultures share knowledge, and they don’t show up in our data, which is based on written documents. Moreover, while our dataset is extensive, it mostly covers national and international organisations, where Indigenous Peoples are often underrepresented.
To get us started, let’s again try a simple search first:
Number of paragraphs containing ‘indigenous’ and ‘native’
Just searching for “indigenous” and “native” is already getting us a lot of hits, but there are a few unexpected things going on here.
It’s surprising to me that Peru is such a large outlier. Most of this seems to be caused by a few large projects in the Peruvian Amazon. One project description in particular stands out as it has over 500 mentions of the word "indigenous". The project's budget of over $100 million is primarily intended for preserving and promoting biodiversity, but it seems to be quite wide-ranging. Many of the sub-projects involve Indigenous Peoples — a quick glance shows cooks promoting Indigenous culinary practices, efforts to establish better forest monitoring on Indigenous lands and projects generating new income for Indigenous communities from sustainable cocoa plantations. So this project is an outlier, but it does seem highly relevant at least.
More worryingly, many countries with large Indigenous populations show up at roughly the same rate as countries where Indigenous issues are much less salient. There are three reasons for this.
One large reason is simply a lack of data.
Below is a map of the complete CPR dataset. As you can see, some countries have published much more text than others. Worryingly, many of the countries where we have the least data are also the most affected by climate change. Solving that problem is much larger than this blog post, but we can at least try to account for it a little by calculating the relative importance — that is, dividing the raw count per country by the total number of paragraphs in our dataset from that same country.
A map of the complete CPR dataset
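The normalisation itself is simple; a minimal sketch with invented counts:

```python
# Relative importance: divide each country's keyword-paragraph count by
# the total number of paragraphs we hold for that country.
# All numbers below are invented purely for illustration.
raw_counts = {"PER": 520, "CAN": 310, "TCD": 4}
total_paragraphs = {"PER": 40_000, "CAN": 120_000, "TCD": 800}

relative = {
    geo: raw_counts.get(geo, 0) / total
    for geo, total in total_paragraphs.items()
}
```

A country with little data overall can now still rank highly if the keyword makes up a large share of what it has published.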
The second reason is that “indigenous” and “native” can also refer to indigenous plant or animal species. To select for impacted people, we should add terms like “community”, “people” and “group”. Combining such different terms can quickly become quite a chore to write out. Luckily, our tool accepts regex input, so we can construct complex queries without typing out every possible combination and alternative spelling:
(native|indigenous)(-|\s)(communit(y|ies)|people(s)?|group(s)?|organi(s|z)ation(s)?)
This should get us a lot closer to what we want, though you'll notice that we're getting only about a third as many results, so it's worth checking if we need to add even more terms.
Number of paragraphs containing ‘native’, ‘indigenous’, ‘communities’, ‘people’, ‘groups’ and ‘organisations’
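Patterns like this are easy to sanity-check on a few handwritten examples with Python's re module before scaling up to millions of paragraphs. The example texts below are invented:

```python
import re

# A pattern along the lines of the query above (illustrative;
# assumes a case-insensitive search).
pattern = re.compile(
    r"(native|indigenous)(-|\s)"
    r"(communit(y|ies)|people(s)?|group(s)?|organi(s|z)ation(s)?)",
    re.IGNORECASE,
)

texts = [
    "Support for indigenous communities in the highlands.",
    "Restoring native forest species.",  # plants, not people: no match
    "Consultation with Native Peoples and tribal leaders.",
]
matches = [bool(pattern.search(t)) for t in texts]  # [True, False, True]
```

Note how the second text, about plant species, is correctly excluded.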
The most important reason is that indigenous groups are referred to by a whole host of terms. In Canada, for example, First Nations is often the preferred terminology. There is a good reason for this: it emphasises that indigenous populations were there long before European settlers arrived.
However, as you can see below, this term is relatively unique to Canada. Other governments have referred to indigenous groups in other ways, such as Native Peoples (mainly USA) or Aboriginal (mainly Australia and sometimes also Canada).
Number of paragraphs containing ‘first nation(s)’ from 22 geographies
We shouldn’t just pay attention to the words of governments though. CPR’s data also includes policy documents and opinions written by indigenous groups themselves. In many cases, such documents use terms like “ancestral lands”. Although we are interested in impacted people, not lands, including such terms is still crucial if we want to give a complete view. After all:
a) we are talking about communities with distinct cultures and therefore distinct language;
b) these communities have often been marginalised and excluded, meaning some communities are less inclined or less able to use language conforming to the “mainstream”, especially if…
c) this mainstream language comes from governments that have historically been hostile to Indigenous ways of being, knowing and doing.
Number of paragraphs containing ‘ancestral’ from 68 geographies
Finally, there are terms which refer to specific Indigenous Peoples. There are many, many different Indigenous Peoples (the number depends a bit on your definition, which is contested, but this Wikipedia list, for example, has hundreds of entries). Most of these Peoples are never mentioned in our data, but some are. Māori, for example, are mentioned hundreds of times, exclusively in New Zealand; the Nordic countries mention Sámi; Waorani are mentioned by Ecuador and Quechua across South America, and so on. We will need to do some further unpacking here, as our list is not complete. In many cases, the English name for an Indigenous People is also the name for their language, although we might not care about that difference in this context.
For now, let’s create a final map of relative frequency, combining general terms for Indigenous Peoples as well as terms specific to different geographies and communities. To make adding lists of keywords a bit easier, we make sure that terms can be entered as a comma separated list. The query then becomes:
(native|indigenous)(-|\s)(communit(y|ies)|people(s)?|group(s)?|organi(s|z)ation(s)?|m(a|e)n|wom(a|e)n|child(ren)?|elders|leader(s)?), first(-|\s)nation(s)?, native(-|\s)american(s)?, aboriginal(s)?, adivasi, janajati, s(á|aa)mi, inuit, m(a|ā)ori, torres strait islander(s)?, quechua, kichwa, aymara, berber, maasai, twa, batwa, pygm(y|ies), miwaguno, waorani, tribe(s)?, tribal, ancestor(s)?, ancestral
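The comma-splitting step can be sketched in a few lines; here we combine a small subset of the terms above into one case-insensitive pattern (the helper and example texts are our own illustration, not the app's actual code):

```python
import re

# Split a comma-separated keyword list into individual regex terms,
# then combine them into a single alternation.
query = r"first(-|\s)nation(s)?, aboriginal(s)?, m(a|ā)ori"
terms = [t.strip() for t in query.split(",") if t.strip()]
combined = re.compile("|".join(f"(?:{t})" for t in terms), re.IGNORECASE)

texts = [
    "Partnership with First Nations communities",
    "Māori representation in climate planning",
    "Investment in coastal cities",
]
hits = [bool(combined.search(t)) for t in texts]  # [True, True, False]
```

Wrapping each term in a non-capturing group `(?:...)` keeps the alternation well-formed even when terms contain their own groups.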
Final map of relative frequency
Many of these individual search terms don’t add a whole lot of results, but together they do make a noticeable difference. Almost a third of the results here are coming from our long list of added terms.
Or, to put it in data science terms: it is tempting to ignore the tail-end of the distribution, as it introduces a lot of edge cases and unknowns for relatively few results. However, for issues such as this one, where diversity of opinion and perspective is crucial, the tail-end matters just as much: that is often where we find the perspectives of marginalised groups themselves. It's also an argument for doing this work as a non-profit: it lets us focus on what has an impact, rather than simply on what's lucrative.
The second instalment of this blog series will cover how we invite and navigate diverse perspectives.