Analyzing raw OONI data, a case study

The goal of this notebook is to explain some of the common workflows that can be adopted when performing analysis of OONI data. This will be done within the context of a specific case study and will focus on the analysis of Web Connectivity data.

We will be focusing on answering the following 2 research questions:

It can be useful, before you dive into more extensive analysis, to get a sense of what you are likely to find in the data by using the Measurement Aggregation Toolkit (MAT). For example, you can pick a certain country and plot the anomalies with a per-domain breakdown (it's often helpful to limit the domains to the categories that are most relevant, so as to focus on interesting insights).

In doing so, you will understand whether there is anything interesting to investigate in the country in question at all, and it will also help you identify some examples of interesting sites that you might want to investigate further.

It's also possible to use the same API the MAT relies on to download the anomaly/confirmed/failure/ok breakdowns for use in your own analysis or plotting tooling. Depending on the type of analysis you need to do, this might be sufficient; however, keep in mind that the anomaly flag is susceptible to false positives.
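For example, here is a minimal sketch of querying the aggregation API (the same backend the MAT uses) with requests; the parameter and field names below reflect the public API at api.ooni.io, but are best verified against the current API documentation:

import requests

# Query the OONI aggregation API for a per-day breakdown of
# anomaly/confirmed/failure/ok counts for web_connectivity in Russia.
params = {
    "probe_cc": "RU",
    "test_name": "web_connectivity",
    "since": "2022-02-23",
    "until": "2022-03-17",
    "axis_x": "measurement_start_day",
}
resp = requests.get("https://api.ooni.io/api/v1/aggregation", params=params)
resp.raise_for_status()
for row in resp.json()["result"]:
    print(row["measurement_start_day"], row["anomaly_count"], row["confirmed_count"])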

It's also useful, while you are performing the analysis, to refer to OONI Explorer to inspect the measurements that present anomalies, so as to be able to identify patterns that you can use to further improve your detection heuristics.

At a high level the workflow we are going to look at is the following:

[Figure: high-level overview of the analysis workflow]

Downloading the data

Once you have gotten a feel for the data, it's time to download the raw dataset.

We offer a tool called oonidata (currently in beta; make sure you have at least v0.2.3), which can be installed by running:

pip install oonidata

To download all the OONI data used in this example notebook, run the following command (you will need at least 38GB of free disk space):

oonidata sync --start-day 2022-02-23 --end-day 2022-03-17 --probe-cc RU --test-name web_connectivity --output-dir ooni-russia-data

OONI Explorer utility functions

Below are a couple of utility functions that are useful when dealing with measurements. They take a dataframe row and return (or print) the corresponding OONI Explorer URL. This makes it easy to jump into OONI Explorer and inspect the raw measurement, to better understand what is going on.
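For example, a minimal version of such a helper might look like the following; it assumes the dataframe row has report_id and input fields, as produced by the metadata extraction described below:

from urllib.parse import quote

def explorer_url(row):
    # Build the OONI Explorer URL for a single measurement from its
    # report_id and input (the measured URL).
    return "https://explorer.ooni.org/measurement/{}?input={}".format(
        row["report_id"], quote(row["input"], safe="")
    )

def print_explorer_url(row):
    print(explorer_url(row))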

Extracting metadata from raw measurements

The OONI raw data is very rich, but for most analysis use-cases you just need a subset of the fields or some value that is derived from them.

Below are functions that will extract all the metadata we care about from the web_connectivity test.
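As a rough sketch, such a function takes a raw measurement dict and flattens the handful of fields we need; the selection of fields below is just an illustrative subset of what the full notebook extracts:

def extract_wc_metadata(msmt):
    # Flatten a raw web_connectivity measurement into a single flat record.
    # Field names follow the raw measurement format.
    test_keys = msmt.get("test_keys") or {}
    return {
        "measurement_start_time": msmt.get("measurement_start_time"),
        "report_id": msmt.get("report_id"),
        "input": msmt.get("input"),
        "probe_cc": msmt.get("probe_cc"),
        "probe_asn": msmt.get("probe_asn"),
        "blocking": test_keys.get("blocking"),
        "accessible": test_keys.get("accessible"),
    }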

Parsing raw files on disk, filtering and transforming them

Below are functions that list the files on disk matching a search query and return an iterator over the raw measurement dicts.

These functions are then called by either msmt_to_csv or get_msmt_df, which respectively write the processed data to a CSV file or load it into memory as a pandas DataFrame.
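A rough sketch of what such an iterator might look like, assuming the measurements were downloaded with oonidata sync and are therefore stored as gzip-compressed JSONL files somewhere under the output directory:

import gzip
import json
from pathlib import Path

def iter_raw_measurements(data_dir, test_name="web_connectivity"):
    # Recursively find the gzip-compressed JSONL files written by
    # `oonidata sync` and yield one raw measurement dict per line.
    for path in sorted(Path(data_dir).glob("**/*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as in_file:
            for line in in_file:
                msmt = json.loads(line)
                if msmt.get("test_name") == test_name:
                    yield msmt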

It's generally recommended, when you are dealing with very large datasets, to write the minimised form of the data to a file on disk so that you don't have to re-parse everything if your notebook crashes.

In this example the ~38GB compressed raw dataset is minimised down to a 7.7GB CSV file. The minimisation process took about 1.5 hours on a fairly fast machine.

Once the data is minimised, loading the 7.7GB file back into memory is fast.

Here we do the actual conversion to CSV.

We then load the CSV file into memory as a pandas DataFrame for further analysis.

When dealing with websites, we generally want to look at the data from a domain-centric perspective. This allows us to group together URLs that share the same domain but have different paths.

Since the raw dataset doesn't include the domain, we add this column here.
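A sketch of these two steps, assuming the minimised CSV has an input column holding the measured URL (the filename is just a placeholder):

import pandas as pd
from urllib.parse import urlparse

# Load the minimised measurements back into memory...
df = pd.read_csv("ooni-russia-data-minimised.csv")

# ...and derive the domain from the measured URL, so that URLs with the
# same domain but different paths can be grouped together.
df["domain"] = df["input"].apply(lambda url: urlparse(url).netloc)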

Hunting for blocking fingerprints

We can have very high confidence that the blocking is intentional (and not caused by transient network failures) when it fits into the following classes:

The first two classes, though, are susceptible to false positives, because sometimes the IP returned in a DNS query can differ based on geographical location (think CDNs) and sometimes the content of a webpage can vary from request to request (think the homepage of a news site).

On the other hand, once we find a blocking fingerprint, we can claim with great confidence that access to that particular site is being restricted. For example we might notice that when a site is blocked on a particular network, the DNS query always returns a given IP address, or we might know that the HTTP title for a blockpage is always "Access to this website is denied".

Our goal now is to come up with some heuristics that will allow us to hunt for these blockpage fingerprints in the big dataset we have available.

Same title, but different page

One heuristic we can apply to spot blockpages relies on the observation that a blockpage looks exactly the same across many different sites. Based on this fairly simple intuition, we can look for blockpage fingerprints by counting the number of distinct domains that share the same HTTP title tag.
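A sketch of this counting with pandas, assuming the extracted metadata includes an http_title column alongside the domain column we added above:

# Count how many distinct domains share the same HTTP title. Titles that
# appear across many unrelated domains are good blockpage candidates.
title_domain_counts = (
    df.groupby("http_title")["domain"]
    .nunique()
    .sort_values(ascending=False)
)
print(title_domain_counts.head(20))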

As we can see in the breakdown below, all these blockpage fingerprints look fairly suspicious and are quite likely to be an indication of blocking. Some of them, however, might be signs of server-side blocking (ex. geoblocking or DDoS prevention). This is why, to obtain a high degree of accuracy, it's best to investigate these manually and add them to a fingerprint database.

This is a shared effort amongst censorship research projects; for example, you can find a repository of known blocking fingerprints maintained by the Citizen Lab here: https://github.com/citizenlab/filtering-annotations

Once we have confirmed that a fingerprint is known to implement blocking, we can use it to determine which domains are being restricted.

DNS level interference

We can use a similar heuristic for DNS-level interference. The assumption is the same: when we see one IP being mapped to multiple hostnames, it's an indication that the IP is potentially being used to implement blocking.
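The counting mirrors what we did for the titles, here assuming the extracted metadata has a dns_answer column with the IPs returned in DNS responses (the column name is illustrative):

# Count how many distinct domains resolve to the same IP address. An IP
# shared by many unrelated domains is either a CDN edge or a blockpage host.
ip_domain_counts = (
    df.groupby("dns_answer")["domain"]
    .nunique()
    .sort_values(ascending=False)
)
print(ip_domain_counts.head(20))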

In this case, we need to be careful of false positives that might be caused by the use of CDNs, as these host multiple sites on the same IPs. In the sections below we look at techniques we can adopt to further reduce these false positives.

We are going to make use of an IP-to-ASN database for some of our heuristics. In particular we are going to download the one from db-ip, which has a fairly permissive license and is compatible with the MaxMind database format.
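A sketch of looking up the ASN of an IP with the maxminddb library; the filename depends on the db-ip release you download, and the record keys assume the MaxMind GeoLite2-ASN conventions:

import maxminddb

# Open the db-ip ASN database (filename depends on the downloaded release).
asn_db = maxminddb.open_database("dbip-asn-lite.mmdb")

def lookup_asn(ip):
    # Return the (ASN, AS organisation name) pair for an IP address.
    record = asn_db.get(ip) or {}
    return (
        record.get("autonomous_system_number"),
        record.get("autonomous_system_organization"),
    )

print(lookup_asn("188.186.157.49"))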

DNS inconsistency false positive removal

To understand whether we are looking at a real blocking IP or not, we can use the following heuristics:

  1. Does the IP in question have a PTR record pointing to something that looks like a blockpage (ex. a hostname that is related to the ISP)?
  2. What information can we get about the IP by doing a whois lookup?
  3. Is the ASN of the IP the same as the network where the measurement was collected?
  4. Do we get a valid TLS certificate for one of the domains in question when doing a TLS handshake and specifying the SNI?

Using these four conditions, we are generally able to understand whether it is in fact a blocking IP or not.
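Sketches of checks 1, 3 and 4 using only the standard library plus the lookup_asn helper defined above (check 2, the whois lookup, is easiest to do manually or by shelling out to the whois command); these are illustrative rather than the exact code used in the notebook:

import socket
import ssl

def ptr_record(ip):
    # 1. Reverse DNS lookup: a PTR record pointing at the ISP's own
    #    infrastructure is a strong hint that the IP serves a blockpage.
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

def same_asn_as_probe(ip, probe_asn):
    # 3. Compare the ASN of the returned IP with the network on which the
    #    measurement was collected (probe_asn is a string such as "AS25086").
    asn, _ = lookup_asn(ip)
    return asn is not None and "AS{}".format(asn) == probe_asn

def valid_tls_cert_for(ip, hostname):
    # 4. Perform a TLS handshake against the IP, sending the domain as SNI,
    #    and check whether the certificate we get back is valid for it.
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((ip, 443), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname):
                return True
    except (ssl.SSLError, OSError):
        return False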

True positive example

In the following example we can see that the IP 188.186.157.49:

  1. Has a PTR record pointing to k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru
  2. The whois record shows it's owned by the ISP
  3. The AS network name is the same as the measured network
  4. We get a certificate with a common name "*.dom.ru" (i.e. it's not valid for sci-hub.se)

This gives us a strong indication that it is in fact a blockpage IP.

False positive example

In the following example we can see that the IP 188.114.97.7:

  1. Doesn't have a PTR record
  2. The whois record shows it's owned by Cloudflare
  3. The ASN is not the same as the measured network
  4. We get a valid certificate for mastodon.cloud when doing a TLS handshake

We can conclude that this is most likely a false positive.

We can then rinse and repeat this process multiple times, until we have divided all of these anomalous IPs into those confirmed to be associated with blocking and those that are false positives.

Similarly we can do this for the HTML titles.

Putting it all together

We can then proceed to automate the detection on the full dataset. Our goal is to recompute the blocking feature for each individual measurement based on our improved heuristics.

In addition to the previously discussed DNS and HTTP based blocking, we are also going to classify blocking that happens at other layers of the network stack.

Specifically, we are going to be using the following identifiers for the various ways in which blocking might occur:

DNS

HTTP

These are all blocking types related to plaintext HTTP requests:

TLS

These are all blocking types related to TLS:

TCP/IP

This is when blocking is implemented by targeting the IP address of the host:
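As a purely illustrative sketch of what this re-annotation might look like, here is a simplified per-row classifier; the column names, failure strings and returned labels are hypothetical stand-ins for the identifiers defined above, and confirmed_titles/confirmed_ips are the fingerprint sets built in the previous sections:

def recompute_blocking(row, blockpage_titles, blockpage_ips):
    # Hypothetical sketch: re-annotate one measurement using our improved
    # heuristics. Column names and labels are illustrative only.
    if row["dns_answer"] in blockpage_ips:
        return "dns.blockpage"
    if row["http_title"] in blockpage_titles:
        return "http.blockpage"
    if row["failure"] == "connection_reset_during_tls_handshake":
        return "tls.connection_reset"
    if row["failure"] == "connection_refused":
        return "tcp.connection_refused"
    return "ok"

df["blocking_recalc"] = df.apply(
    lambda row: recompute_blocking(row, confirmed_titles, confirmed_ips), axis=1
)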

Let's see on how many networks we were able to confirm the blocking of sites.

And let's check how many sites were confirmed to be blocked based on our fingerprints.
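Assuming the re-annotated column is blocking_recalc as above and that confirmed blocking is indicated by the fingerprint-based labels, a rough sketch of both counts:

# Hypothetical sketch: count networks and domains with confirmed blocking,
# based on the re-annotated blocking_recalc column.
confirmed_labels = {"dns.blockpage", "http.blockpage"}  # illustrative labels
confirmed = df[df["blocking_recalc"].isin(confirmed_labels)]
print("networks with confirmed blocking:", confirmed["probe_asn"].nunique())
print("domains confirmed blocked:", confirmed["domain"].nunique())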

From the perspective of presenting the data and digging deeper into the blocking of specific sites, since the data has so many dimensions, it's often useful to restrict your analysis to a subset of some of the axes.

Common choices are to use a subset of all the domains or a subset of all the networks.

In this example we are going to pick some domains that have very good testing coverage and are highly relevant.

Let's start off by looking at the ways in which sites are blocked across the networks we have selected as having enough measurements. To make the data easier to look at, we are going to fix the domain.

As we can see above, the means through which blocking is implemented varies significantly across the different ISPs. On some of them, we can also see that the block is not being implemented at all.

We can use the above chart to navigate our exploration of individual measurements on a per-ISP basis.

Through the above function, we can now plot a chart that shows the blocking of a certain domain on a certain ISP over time. In doing so we can determine whether the methods through which the blocking is happening are consistent or whether there is some variation.

A stable signal that doesn't show many different ways through which the block is implemented (which can be the case when the root cause is a transient network failure) gives you higher confidence in the data.

Here we can see that the block is happening through a connection reset most of the time. The only outliers are caused by what are very likely old versions of the probe (in many cases you may want to exclude older probe versions from your analysis, if you have enough data).

The only case that probably deserves further investigation is the OK measurement on the 16th. Let's find it and open it in OONI Explorer.

In two of the OK measurements, it looks like there are different IP addresses returned in the DNS query. Let's inspect the measurement to see if it's a false negative.

Nope, it doesn't look like it. When looking at the blocked measurements, we can see that the IP used is always "151.101.12.81". This means it's quite likely that the blocking, implemented by closing the connection with a RST packet, is also matching on the endpoint.

Let's look at the other potentially false negative measurements.

This looks like an actual false negative, which is caused by our blockpage detection heuristics not being good enough.

Let's add this fingerprint to our fingerprint DB and re-annotate the measurements.
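In code terms this just means adding the newly discovered title to our set of confirmed fingerprints and re-running the annotation step; the names below reuse the hypothetical ones from the earlier sketches:

# Add the newly discovered blockpage title to the fingerprint set and
# recompute the blocking annotation for every measurement.
confirmed_titles.add(new_blockpage_title)
df["blocking_recalc"] = df.apply(
    lambda row: recompute_blocking(row, confirmed_titles, confirmed_ips), axis=1
)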

At this point we would iterate the process of filtering out any additional false positives and false negatives, until we are quite confident that we have eliminated most of the outliers (or have come up with an explanation as to why we are seeing them).

Once this process is done, it might be desirable to create a CSV export of the cleaned data in preparation for publication-ready charts (ex. through tools like Tableau).

Since charting tools generally work best with data where the values you need to plot are in the cells and the columns indicate the category of the value, we will reshape the data using the pivot_table function. This basically takes the values of blocking_recalc and turns them into columns; the value of the cells, in this case, is always going to be one. It's generally quite easy to do further aggregation and grouping inside the charting tool itself.
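A rough sketch of this reshaping, with the column names and output filename as placeholders:

# Turn each blocking_recalc value into its own column with a cell value of 1,
# then export the reshaped data for use in a charting tool.
export_df = df.pivot_table(
    index=["measurement_start_time", "probe_asn", "domain"],
    columns="blocking_recalc",
    values="report_id",
    aggfunc=lambda values: 1,
).reset_index()
export_df.to_csv("ooni-russia-blocking-export.csv", index=False)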