This post deals with the technical implementation of a tool to analyse the links between articles on Wikipedia.
Another article focussed on the results of this project will be available soon.
The source code for this project is available here.
The purpose of this project was:
The key part of this project was extracting the link graph formed by internal links between articles on Wikipedia. To get this data I used the XML dumps which are regularly generated by Wikimedia. These weigh in at about 17GB compressed and 60GB when uncompressed (for English Wikipedia) and include the text from all articles (but not images or other media).
To parse the XML dump I used a two-pass process. As the parsing stage only needs to be run once per XML dump I didn't place a huge emphasis on speed. While all the parsing can be done in a single pass, I found normalizing the link graph to be more difficult and more error prone that way. The output of this parsing stage is a compact TSV file representing the link graph. These TSV files are about 850MB-1GB depending on the options used and are then used for the data analysis.
The code for parsing can be found in parse.rs with some more detail provided in the comments.
The first pass over the data identifies all valid articles and redirects, and recursively resolves the redirects until they point at an actual article. At this stage the title of the page is checked to identify whether it is a non-article page (it might be a file, image, discussion, template etc.). Additionally 'list of', 'index of', 'table of' and disambiguation pages are skipped.
Wikipedia uses a lot of special tags to classify pages and there are often multiple ways of tagging the same thing or slight variations of tags that produce slightly different results. In this step I didn’t try to catch every edge case and instead aimed to cover the vast majority of cases.
At the end of this pass we have a list of valid pages, a map of page name to page index (an integer used to compactly represent the page) and a map of page redirects.
The second pass parses the article contents for any outgoing links. Similarly to the first pass, we want to skip some of the links. I skipped the infobox that is commonly found on the right hand side of Wikipedia pages and contains general information as well as geographic location (for example on this page). The reasoning for this was that the information was too general and led to a large number of links.
Within articles the links typically come in the form `[[article name#optional_anchor|display name]]`, so I used a regex to find this structure and then extracted the article name and normalized it. On Wikipedia, article names are case insensitive in the first letter but case sensitive otherwise, so I normalized to an uppercase first letter. Next I check if the link points to a valid article (if it's a redirect, the redirected-to article is taken) and if so add it to the links graph.
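The extraction and normalization steps above can be sketched roughly like this (a dependency-free sketch using plain string scanning rather than the regex the project actually uses; all function names here are mine):

```rust
// Wikipedia titles are case insensitive in the first letter only, so
// normalize to an uppercase first letter after stripping any
// "#anchor" / "|display name" suffix.
fn normalize_title(raw: &str) -> String {
    let name = raw.split(|c| c == '#' || c == '|').next().unwrap_or("").trim();
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

// Scan wikitext for [[...]] pairs and return the normalized target titles.
fn extract_links(wikitext: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = wikitext;
    while let Some(start) = rest.find("[[") {
        rest = &rest[start + 2..];
        match rest.find("]]") {
            Some(end) => {
                links.push(normalize_title(&rest[..end]));
                rest = &rest[end + 2..];
            }
            None => break,
        }
    }
    links
}

fn main() {
    let text = "See [[graph theory|graphs]] and [[breadth-first search#History]].";
    println!("{:?}", extract_links(text));
    // → ["Graph theory", "Breadth-first search"]
}
```

The resolved title would then be looked up against the valid-article map from the first pass before being added to the graph.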
In this pass the links graph is represented as an adjacency list of either incoming links to an article or outgoing links from an article (see below). The graph itself is identical in either case, however some operations are far faster on one representation than the other: for example, finding the outgoing link tree from a node is much faster on an outgoing-link representation than an incoming-link one. In the analysis stage I used both representations for this reason.
The output from the second pass is a map of article names to article indices and a graph. These are then joined and saved to disk as a TSV file which is kept and re-used for the analysis stage.
After parsing the wikipedia dataset and trimming it down there are about 5.3 million articles (nodes in the graph) and 120 million internal links (edges in the graph). With the most common graph representations the storage cost would be:

Adjacency list: `O(V + E)`, requiring ~1GB of RAM (using 32 bit integer indices in the list)

Adjacency matrix: `O(V^2)`, requiring ~4TB of RAM (using bit arrays)

Incidence matrix: `O(V * E)`, requiring ~75TB of RAM (using bit arrays)
Clearly an adjacency list is the best fit for such a sparse graph. To help keep the graph as compact as possible and make lookups fast I assigned an int to each article name (which I named the article index). The article index is the lookup index in the adjacency list, and each row in the list is a list of the article indices the node is connected to.
In terms of Rust data structures this is a vector of vectors. In Rust both vectors and strings are implemented as fat pointers, so they consist of 3 fields (all of size `usize`): a pointer to the data, a capacity and a length.

A diagram of the graph storage structure:

With capacity (C) set to `6x10^6`, size (S) of `~5.35x10^6` and a 64 bit `usize` (64 bit machine), memory usage for the main graph vec is:

`C * 3 * sizeof(usize)` = 144MB

plus the per-article link lists (22.6 being the average links per article):

`S * 22.6 * sizeof(index)` = 483MB (`u32` index) or 996MB (`u64` index)
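The fat pointer layout behind these numbers is easy to verify (a minimal sketch; the `Vec<Vec<u32>>` shape is taken from the description above):

```rust
use std::mem::size_of;

fn main() {
    // A Vec is a fat pointer: pointer + capacity + length, i.e. 3 usizes
    // (24 bytes on a 64 bit machine), regardless of the element type.
    assert_eq!(size_of::<Vec<u32>>(), 3 * size_of::<usize>());

    // The adjacency list itself: one inner Vec of article indices per article.
    let mut graph: Vec<Vec<u32>> = vec![Vec::new(); 4];
    graph[0].extend_from_slice(&[1, 2]); // article 0 links to articles 1 and 2
    graph[1].push(3);
    println!("outgoing links of article 0: {:?}", graph[0]); // → [1, 2]
}
```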
There are two operations I do that require graph search:
By 'step groups' I mean the set of articles first reached at each step of the search. To illustrate, for a subset of a graph looking like this, evaluated at the root node a:
```
  a       (Step 0, group: [a])
 / \
b   c     (Step 1, group: [b, c])
 \
  d       (Step 2, group: [d])
```
The link graph does not have differing (or negative) weights, so an appropriate algorithm for both of these operations is breadth first search (BFS). BFS has a time complexity of `O(V + E)` and a space complexity of `O(V)`, which is good for this application considering the sizes of our E and V. Additionally, using BFS we can easily produce the step groups by saving the vertex frontier at the end of each BFS step.
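A minimal sketch of BFS that produces the step groups by keeping each frontier (my own illustrative code, not the project's exact implementation):

```rust
/// Breadth-first search over an adjacency list, returning the "step groups":
/// the articles first reached at each step from `root`.
fn bfs_step_groups(graph: &[Vec<u32>], root: u32) -> Vec<Vec<u32>> {
    let mut visited = vec![false; graph.len()];
    visited[root as usize] = true;
    let mut groups = vec![vec![root]];
    // Expand one whole frontier at a time; the frontier at the end of
    // each step *is* that step's group.
    loop {
        let mut next = Vec::new();
        for &node in groups.last().unwrap() {
            for &link in &graph[node as usize] {
                if !visited[link as usize] {
                    visited[link as usize] = true;
                    next.push(link);
                }
            }
        }
        if next.is_empty() {
            break;
        }
        groups.push(next);
    }
    groups
}

fn main() {
    // The example graph from above: a=0 links to b=1 and c=2, b links to d=3.
    let graph = vec![vec![1, 2], vec![3], vec![], vec![]];
    println!("{:?}", bfs_step_groups(&graph, 0)); // → [[0], [1, 2], [3]]
}
```

For shortest paths the same search simply stops as soon as the target article appears in a frontier.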
For finding the shortest path it would also be possible to use iterative deepening depth first search (IDDFS), which has time complexity `O(b^d)` for branching factor `b` and depth of shortest solution `d`. The Wikipedia link graph is not well balanced, so `d` would differ wildly depending on your starting (and ending) point; I stuck with BFS as it is more likely to have a better average-case and worst-case runtime.
The relevant specs of the machine used for performance testing are:
If you're interested in looking at the code, it's available in analyze.rs.
For benchmarking I used a helpful tool called `perf`, which logs the performance counter stats from the CPU while running a program. I benchmarked the `step-groups` option as this is by far the most intensive in terms of processing and memory pressure. I ran 3 iterations for the top 100 pages with the most incoming links and averaged the results. I also skipped logging the performance counters for the first 10 seconds of each iteration as that time is mostly spent loading the dataset from disk. To get more detailed logs I used the `perf stat -vvv` option as this provides much deeper insight into what is happening in the CPU.
To see the performance difference between using 32 bit and 64 bit indices I benchmarked both and found that the 32 bit index was around 20% faster. Even with 32 bit indices the working set is much larger than the cache of my CPU (more info below). In general there should be good spatial and temporal locality for memory accesses within individual rows of the adjacency list however the rows themselves will be accessed in a fairly random order and hence have poor locality. Also unfortunately due to the large use of heap-allocated structures there is an extra indirection step when accessing rows which would cause TLB and cache pressure. Due to these constraints I had expected that overall performance would be limited by the cache + TLB size and RAM throughput + latency.
As expected, when benchmarking I saw that the processor was wasting a lot of time on cache (and TLB) misses and handling page faults. To try and fix this I modified the code for checking whether an article had already been visited, as this is a hot area of code (it is run for every graph edge) and so contributes a lot to cache pressure. Previously I used a `Vec<bool>`, which actually allocates a byte per element (in rustc 1.45.0), so I changed this to a bit array accessed in `usize` chunks. This takes more instructions but reduces the size of the lookup table from ~42MB to ~5.2MB. In practice using a bit array drastically reduced the number of page faults (from 479K down to 563) and reduced cache + TLB misses, for a runtime saving of about 9%. After this optimization the IPC of the application improved but is still low at 0.55, with the processor stalled in the backend much of the time, indicating that it is waiting on cache/RAM as expected.
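The `Vec<bool>`-to-bit-array change can be sketched like this (a minimal standalone version; names are mine and the real code lives in analyze.rs):

```rust
/// A visited-set packed 1 bit per article into usize chunks, instead of
/// the byte per element that Vec<bool> allocates.
struct BitArray {
    chunks: Vec<usize>,
}

impl BitArray {
    const BITS: usize = usize::BITS as usize;

    fn new(len: usize) -> Self {
        // Round up so the last partial chunk is included.
        BitArray { chunks: vec![0; (len + Self::BITS - 1) / Self::BITS] }
    }

    fn get(&self, i: usize) -> bool {
        (self.chunks[i / Self::BITS] >> (i % Self::BITS)) & 1 == 1
    }

    fn set(&mut self, i: usize) {
        self.chunks[i / Self::BITS] |= 1 << (i % Self::BITS);
    }
}

fn main() {
    let mut visited = BitArray::new(130);
    visited.set(42);
    assert!(visited.get(42));
    assert!(!visited.get(43));
    // 130 bits fit in 3 usize chunks on a 64 bit machine.
    println!("chunks: {}", visited.chunks.len()); // → chunks: 3
}
```

The extra shift and mask instructions are cheap compared to the cache misses they avoid, which is why this trade pays off here.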
These are the `perf` dumps comparing `Vec<bool>` and a bit array. With the packed bit array:
An optimization I thought of but didn't attempt would be to use some space in the article index to store the links list length, ie: `32 bit article index = [Length (8 bits) | Sub index (24 bits)]`. The sub index is simply a number counting up from 0 used to address an article within its length group. For length 0 there's no need to store or check a links array, and for lengths greater than 254, `0xFF` is stored in the length field. With a 24 bit sub index each length group can store 16.7 million articles, which is more than enough for the Wikipedia link graph.
The advantage of this approach is that the links are stored directly in the array with no overhead from storing fat pointers for the Rust vectors. Given the distribution of links within the graph, with 254 bins about 99% of the graph can be stored this way, and any remaining articles with more than 254 links are handled in a vector as before. With this approach we shave off almost all of the memory overhead from storing the vector fat pointers, saving about 30% of RAM usage for the graph. Additionally the initial lookup table is much smaller and more cache + TLB friendly.
The disadvantages of this approach are that there would need to be more pre-processing of the link graph and this method does not handle runtime graph changes well. For this specific use case these disadvantages are fine as we pre-process anyway and graph updates do not happen at runtime.
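The packed layout could be sketched like this (only the field layout is from the text above; the helper names and constants are mine):

```rust
// Proposed packed index layout: [ length (8 bits) | sub index (24 bits) ]
const SUB_BITS: u32 = 24;
const SUB_MASK: u32 = (1 << SUB_BITS) - 1;

fn pack(length: u32, sub_index: u32) -> u32 {
    // Lengths of 255 or more fall back to the 0xFF overflow bin,
    // which would be handled with an ordinary Vec as before.
    (length.min(0xFF) << SUB_BITS) | (sub_index & SUB_MASK)
}

fn unpack(packed: u32) -> (u32, u32) {
    (packed >> SUB_BITS, packed & SUB_MASK)
}

fn main() {
    let idx = pack(3, 123_456);
    assert_eq!(unpack(idx), (3, 123_456));
    // Each length group can address 2^24 ≈ 16.7 million articles.
    assert_eq!(SUB_MASK as u64 + 1, 1 << 24);
    println!("packed: {:#010x}", idx);
}
```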
The `analyze step-groups` option is multithreaded on a per-root-article basis using a thread pool. A per-article split for the threads avoids any shared mutable state, meaning that locking is avoided during calculations, which minimizes the synchronization overhead. The graph itself is shared but immutable; however, each thread requires its own copy of the visited-article bit array (as it is locally mutable per thread), which means extra memory usage and cache + TLB pressure.
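The sharing pattern can be sketched with std scoped threads (the project uses a thread pool; this only illustrates the immutable-shared-graph, per-thread-visited-state split, and the names are mine):

```rust
use std::thread;

// Count reachable articles from `root` using a per-thread visited array.
fn count_reachable(graph: &[Vec<u32>], root: u32) -> usize {
    let mut visited = vec![false; graph.len()]; // locally mutable per thread
    visited[root as usize] = true;
    let mut stack = vec![root];
    let mut count = 1;
    while let Some(node) = stack.pop() {
        for &link in &graph[node as usize] {
            if !visited[link as usize] {
                visited[link as usize] = true;
                count += 1;
                stack.push(link);
            }
        }
    }
    count
}

fn main() {
    let graph: Vec<Vec<u32>> = vec![vec![1, 2], vec![3], vec![], vec![]];
    let roots = [0u32, 1, 2, 3];
    let graph_ref = &graph; // shared immutably across threads: no locking
    thread::scope(|s| {
        for chunk in roots.chunks(2) {
            s.spawn(move || {
                for &root in chunk {
                    println!("root {}: {} reachable", root,
                             count_reachable(graph_ref, root));
                }
            });
        }
    });
}
```

Because nothing mutable is shared, the only synchronization cost is joining the threads at the end of the scope.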
I ran `analyze step-groups --use-most-linked 1000` with different numbers of threads to see how it scaled:
On my CPU scaling starts out well at 2-4 threads, quite close to the best case, however after this the curve begins to flatten out. Beyond 6 threads there is still some improvement, with 12 threads ~15% faster than 6 threads even though the CPU only has 6 cores (but 12 hardware threads). This improvement is because the CPU can swap hardware threads and keep doing useful work while the original thread waits on resources (eg memory or an ALU). In particular this is great for hiding the latency of loads and stores to RAM, which I expect to be the limiting factor in this case.
To check this I looked at the `perf stat -ddd` output for 6 threads:
and 12 threads:
It looks like in practice the limiting factor is keeping the execution units fed with new instructions. This seems to mainly be due to icache misses, which nearly double with the extra threads. A single icache miss can cause a pipeline stall for hundreds of clock cycles, so even though the percentage of L1 icache load misses is low, a small change can have a large effect. With the extra threads there is extra cache pressure at every level, so it's expected that misses become more frequent. Particularly for the Zen 2 microarchitecture, a smaller L1 icache and unified L2 and L3 caches (ie storing both data and instructions) mean that cached instructions at the higher levels must also compete with the extra pressure from cached data.
I wanted to look at the memory usage of the whole program to see how it compared to what my mental model expected when writing it. Unlike C you don’t explicitly malloc memory in Rust, but it should still be relatively clear when you are allocating memory (provided you know a bit about Rust). I was interested to see if the various overheads in Rust (eg fat pointers) differed greatly compared to what I’d expect a C implementation would use. This comparison is of course a tad unfair as Rust has a much larger and more useful standard library than C and to achieve the same thing in C I’d either need to write the bits I needed or pull in external libraries that would need to malloc too.
For memory analysis I used `valgrind --tool=massif` together with a release build that included debug symbols. This monitors heap usage over time and identifies the source of heap allocations. It dumps an output file which can then be analysed using `ms_print`.

The graph of memory usage over time from `ms_print` has the Y axis as memory usage and the X axis as the number of instructions executed. For the same `analyze step-groups` operation used earlier the heap memory usage looks like this:
[massif graph: heap usage climbs steeply to a plateau of ~1.198GB as the dataset loads, then stays level for the rest of the run; X axis spans 0 to ~183 Gi instructions executed]
So after loading the graph and associated data from the TSV, memory usage levels off. Looking at a snapshot taken after the usage levels off, `massif` identifies that the useful heap usage is about 1.1GB, with an additional 150MB of overhead for managing the heap.
As in the earlier calculations:

`C = 6 * 10^6` (capacity of the main graph vector)

`S = ~5.3 * 10^6` (number of articles)

`sizeof(usize) = 8`

`L = 19` (average article name length in bytes)

`N = 22.6` (average links per article)
Articles (the `Article` struct has only one element, `links`):

`sizeof(u32) * N * S` = 479MB

`40.26% (501,305,756B) 0x1DA4D8: alloc::vec::Vec<T>::reserve (vec.rs:500)`
Article name to article index lookup table:

`next_power_2(C * 8/7) * (3 * sizeof(usize) + 4 + 1)` = 243MB

`22.23% (276,824,080B) 0x153C81: with_capacity<alloc::string::String,u32> (map.rs:239)`
Outer graph vector:

`C * 3 * sizeof(usize)` = 144MB

`11.57% (144,000,000B) 0x153CF1: with_capacity<wikipedia_analysis::parse::Article> (vec.rs:358)`
String storage (just article names):

`S * L` = 107MB

`08.39% (104,446,343B) 0x154119: to_string (string.rs:2235)`
Article index to article name lookup table:

`S * sizeof(usize)` = 42.4MB

`03.44% (42,834,688B) 0x154798: generate_index_lookup_table (main.rs:426)`
So notably there is some unexpected overhead in the article link vectors. I did some looking around and found that when there are 1, 2 or 3 elements in the vector, the capacity is set to 4. For any other length (including 0) the size and capacity of the vector match. Using the histogram of the number of pages with each link count, we can calculate the overhead for these 3 cases (which waste 3, 2 and 1 elements respectively):
Short links vector overhead = `3 * sizeof(u32) * 834x10^3 + 2 * sizeof(u32) * 568x10^3 + 1 * sizeof(u32) * 427x10^3` = 15MB
This extra 15MB plugs the hole in the vector size calculation. I'm not sure if this behavior is a bug or expected in Rust. I believe it's expected, as the collect function checks the iterator for a size hint which is then used to set the vector capacity. It may be that the hint is incorrect for small sizes (which is an allowed scenario according to the docs), or otherwise it may be a bug in the vector sizing logic. I haven't looked into the exact cause (yet).
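This is easy to observe directly (the exact capacities may vary between rustc versions; the measurements above were taken on rustc 1.45.0):

```rust
fn main() {
    // Collect small iterators into Vecs and compare length vs capacity.
    // On rustc 1.45.0 lengths 1-3 reportedly received capacity 4,
    // wasting up to 3 elements per vector.
    for n in 0..5u32 {
        let v: Vec<u32> = (0..n).collect();
        println!("len = {}, capacity = {}", v.len(), v.capacity());
    }
}
```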
For the article name to article index lookup table I used the Rust standard library hashmap. Rust uses an implementation of swisstable for its hashmap, which has a very low overhead of 1 byte per entry. Hash tables also require some spare capacity to avoid the poor performance caused by collisions, and for swisstable this load factor is targeted at 7/8 = 87.5%. So it's expected that to set the size of the table it will calculate the required backing capacity by:

`capacity_overhead = 8/7 * desired_capacity`
However swisstable also uses power of two sized hash arrays to improve lookup efficiency, as it's much cheaper to calculate a power-of-2 modulo (it's just a bitwise AND). It will then take the next power of two equal to or greater than `capacity_overhead`. In this case `capacity_overhead = 6.86x10^6`, so `storage_capacity = 2^23`. Then with the size of the `String` fat pointer key, `u32` value and 1 byte of overhead, I expected the memory usage to be:

`storage_capacity * (sizeof(usize) * 3 + sizeof(u32) + 1)` = 243MB
In practice the size is actually `storage_capacity * (sizeof(usize) * 4 + 1)` = 276MB, which indicates that the `u32` is padded out to `usize` somewhere. Having a look through the hashtable code nothing stood out to me as the cause, but it seems to be because the key and value are stored as a tuple in the table. Looking at the Rust language reference, there are no guarantees about tuple storage layout, so I suspect the extra padding comes from there.
In total there was about 5% of extra memory usage in the main data structures that I didn't expect. Overall I think that's pretty reasonable and isn't enough for me to worry about in most cases.
Overall I found Rust to be a great fit for this project. The strong type system let me write and refactor code with much more confidence than C and C++, giving the same sort of 'if it compiles it's probably fine' feeling as Haskell. For this specific project I didn't find Rust to be much less productive than Python either, thanks in part to the large package ecosystem available at crates.io, good compiler hints and great docs. This was the first nontrivial project I've completed in Rust and I'm definitely keen to use it more in the future.