| Wikipedia link graph tool

| A tool to parse and analyse the link graph formed between articles on Wikipedia

| Date: 18 November 2019
| Tags: rust

This post deals with the technical implementation of a tool to analyse the links between articles on Wikipedia.

Another article focussed on the results of this project will be available soon.

The source code for this project is available here.


# Motivation

My motivation for this project was twofold:

  • I was inspired by the Wiki Game and wanted to find out how connected the Wikipedia link graph is.
  • I thought this project was a good application for Rust. I have previously implemented it in Python and C++, so I wanted to try it in Rust as a point of comparison and to help learn the language.

# Technical overview

# Parsing Wikipedia

The key part of this project was extracting the link graph formed by internal links between articles on Wikipedia. To get this data I used the XML dumps which are regularly generated by Wikimedia. These weigh in at about 17GB compressed and 60GB when uncompressed (for English Wikipedia) and include the text from all articles (but not images or other media).

To parse the XML dump I used a two-pass process. As the parsing stage only needs to be run once per XML dump I didn’t place a huge emphasis on speed. All of the parsing can be done in a single pass, but I found that approach more difficult and more error prone when it came to normalizing the link graph. The output of this parsing stage is a compact TSV file representing the link graph. These TSV files are about 850MB-1GB depending on the options used and are then used for the data analysis.

The code for parsing can be found in parse.rs with some more detail provided in the comments.

# First pass

The first pass over the data identifies all valid articles and redirects, and recursively resolves the redirects until they point at an actual article. At this stage the title of the page is checked to identify whether it is a non-article page (it might be a file, image, discussion, template, etc.). Additionally ‘list of’, ‘index of’, ‘table of’ and disambiguation pages are skipped.

Wikipedia uses a lot of special tags to classify pages and there are often multiple ways of tagging the same thing or slight variations of tags that produce slightly different results. In this step I didn’t try to catch every edge case and instead aimed to cover the vast majority of cases.

At the end of this pass we have a list of valid pages, a map of page name to page index (an integer used to compactly represent the page) and a map of page redirects.
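As a rough sketch, the data coming out of the first pass looks something like the following. The names here are illustrative rather than the exact identifiers in parse.rs, and the hop limit on redirect resolution is my own safeguard against redirect cycles rather than something described above.

    use std::collections::HashMap;

    // Illustrative sketch of the first-pass output; the real identifiers live in parse.rs.
    struct FirstPassOutput {
        // Valid article titles, in article-index order.
        titles: Vec<String>,
        // Article title -> article index (the compact integer id).
        title_to_index: HashMap<String, u32>,
        // Redirect title -> target title.
        redirects: HashMap<String, String>,
    }

    // Follow a redirect chain until it lands on a non-redirect title.
    // The hop limit guards against redirect cycles (an assumption of this sketch).
    fn resolve_redirect<'a>(redirects: &'a HashMap<String, String>, mut title: &'a str) -> Option<&'a str> {
        for _ in 0..16 {
            match redirects.get(title) {
                Some(target) => title = target.as_str(),
                None => return Some(title),
            }
        }
        None // probable redirect cycle, give up
    }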

# Second pass

The second pass parses the article contents for any outgoing links. Similarly to the first pass, we want to skip some of the links. I skipped the infobox commonly found on the right-hand side of Wikipedia pages, which contains general information such as geographic location (for example on this page). The reasoning was that this information is too general and led to a large number of links.

Within articles the links typically come in the form [[article name#optional_anchor|display name]], so I used a regex to find this structure and then extracted the article name and normalized it. On Wikipedia, article names are case insensitive in their first letter but case sensitive otherwise, so I normalized to an uppercase first letter. Next I check whether the link points to a valid article (if it’s a redirect, the redirected-to article is taken) and if so add it to the link graph.
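A sketch of this extraction, assuming the regex crate; the pattern and function names are illustrative and simplify what parse.rs actually handles.

    use regex::Regex;

    // Pull the target article names out of [[article name#anchor|display name]] links,
    // dropping the optional anchor and display name. Simplified compared to real wikitext.
    fn extract_links(wikitext: &str) -> Vec<String> {
        let link_re = Regex::new(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]").unwrap();
        link_re
            .captures_iter(wikitext)
            .map(|cap| normalize_title(cap[1].trim()))
            .collect()
    }

    // Article titles are case insensitive only in their first letter,
    // so normalize by upper-casing the first character.
    fn normalize_title(title: &str) -> String {
        let mut chars = title.chars();
        match chars.next() {
            Some(first) => first.to_uppercase().chain(chars).collect(),
            None => String::new(),
        }
    }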

In this pass the link graph is represented as an adjacency list of either incoming links to an article or outgoing links from an article (see below). The graph itself is identical in either case, however it is much faster to, for example, find the outgoing link tree from a node using an outgoing-link representation than an incoming-link representation. In the analysis stage I used both representations for this reason.

The output from the second pass is a map of article names to article indices and a graph. These are then joined and saved to disk as a TSV file which is kept and re-used for the analysis stage.

# Storing the graph

After parsing the wikipedia dataset and trimming it down there are about 5.3 million articles (nodes in the graph) and 120 million internal links (edges in the graph). With the most common graph representations the storage cost would be:

  • Adjacency lists are O(V + E) and would require ~1GB of RAM (using 32 bit integer indices in the list)
  • Adjacency matrices are O(V^2) and would require ~4TB of RAM (using bit arrays)
  • Incidence matrices are O(V * E) and would require ~75TB of RAM (using bit arrays)

Clearly an adjacency list is the best fit for such a sparse graph. To help keep the graph as compact as possible and make lookups fast I assigned an int to each article name (which I named the article index). The article index is the lookup index in the adjacency list, and each row in the list is a list of the article indices the node is connected to.

In terms of Rust data structures this is a vector of vectors. In Rust both vectors and strings are implemented as fat pointers, so they consist of 3 fields (all of size usize):

  • A pointer to heap-allocated storage
  • Current size of the object
  • Capacity of the heap allocated space (before requiring re-allocation)

A diagram of the graph storage structure:
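Sketched in Rust types, the same structure looks like this (the Article struct with its single links field matches the memory breakdown later in this post; the other names are illustrative):

    use std::collections::HashMap;

    // One adjacency-list row: the article indices this article links to
    // (or is linked from, depending on which representation was generated).
    struct Article {
        links: Vec<u32>,
    }

    // The graph itself, indexed by article index. Both the outer Vec and each
    // inner Vec carry a (pointer, size, capacity) header as described above.
    type Graph = Vec<Article>;

    // Article name -> article index, used to resolve titles to rows.
    type NameToIndex = HashMap<String, u32>;

    fn links_of(graph: &Graph, index: u32) -> &[u32] {
        &graph[index as usize].links
    }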

With capacity (C) set to 6x10^6, size (S) of ~5.35x10^6 and a 64 bit usize (64 bit machine), memory usage for the main graph vec:

  • Vector overhead = C * 3 * sizeof(usize) = 144MB
  • Data size = S * 22.6 * sizeof(index) = 483MB (u32 index) or 966MB (u64 index)

# Graph search algorithms

There are two operations I do that require graph search:

  • Finding the shortest path between two articles
  • Finding the step groups

By step groups I mean the set of articles first reached at each step of the search from a root article. To illustrate this for a subset of a graph looking like this, evaluated from the root node a:

    a       (Step 0, group: [a])
   / \
  b   c     (Step 1, group: [b, c])
       \
        d   (Step 2, group: [d])

The link graph does not have differing (or negative) weights, so an appropriate algorithm for both of these operations is breadth first search (BFS). BFS has a time complexity of O(V + E) and a space complexity of O(V), which is good for this application considering the sizes of our E and V. Additionally, using BFS we can easily produce the step groups by saving the vertex frontier at the end of each BFS step.
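A sketch of BFS producing the step groups over a plain Vec<Vec<u32>> adjacency list; the frontier saved at the end of each step becomes that step’s group. The visited bookkeeping here is the Vec<bool> version that gets swapped for a bit array in the optimization section below.

    // Breadth-first search from `root`, returning the article indices first
    // reached at each step (step 0 is the root itself).
    fn bfs_step_groups(graph: &[Vec<u32>], root: u32) -> Vec<Vec<u32>> {
        let mut visited = vec![false; graph.len()];
        visited[root as usize] = true;

        let mut groups = Vec::new();
        let mut frontier = vec![root];
        while !frontier.is_empty() {
            let mut next = Vec::new();
            for &article in &frontier {
                for &link in &graph[article as usize] {
                    if !visited[link as usize] {
                        visited[link as usize] = true;
                        next.push(link);
                    }
                }
            }
            groups.push(frontier); // this step's group
            frontier = next;       // newly discovered articles form the next frontier
        }
        groups
    }

Finding the shortest path between two articles falls out of the same search: stop as soon as the target article is discovered and the current step count is the path length.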

For finding the shortest path it would also be possible to use iterative deepening depth first search (IDDFS) which has time complexity O(b ^ d) for branching factor b and depth of shortest solution d. The Wikipedia link graph is not well balanced so b and d would differ wildly depending on your starting (and ending) point, so I stuck with BFS as it is more likely to have a better average-case and worst-case runtime.

# Performance optimization

The relevant specs of the machine used for performance testing are:

  • AMD Ryzen 3600 (6 cores, 12 threads, 32MB L3 cache)
  • 16GB (2x8GB dual channel) 3200MHz RAM (C14)
  • 1TB Corsair MP510 NVMe SSD as main system drive (w/ swap allocated to this)
  • Kubuntu 20.04
  • Rustc 1.45.0

If you’re interested in looking at the code, it’s available in analyze.rs.

For benchmarking I used a helpful tool called perf, which logs the performance counter stats from the CPU while running a program. I benchmarked the step-groups option as it is by far the most intensive in terms of processing and memory pressure. I ran 3 iterations for the top 100 pages with the most incoming links and averaged the results. I also skipped logging the performance counters for the first 10 seconds of each iteration as that time is mostly spent loading the dataset from disk. To get more detailed logs I used the perf stat -ddd option, as this provides much deeper insight into what is happening in the CPU.

To see the performance difference between 32 bit and 64 bit indices I benchmarked both and found the 32 bit index to be around 20% faster. Even with 32 bit indices the working set is much larger than the cache of my CPU (more info below). In general there should be good spatial and temporal locality for memory accesses within individual rows of the adjacency list, however the rows themselves are accessed in a fairly random order and hence have poor locality. Unfortunately, due to the heavy use of heap-allocated structures, there is also an extra indirection step when accessing rows, which causes TLB and cache pressure. Due to these constraints I had expected overall performance to be limited by the cache + TLB size and RAM throughput + latency.

As expected, when benchmarking I saw that the processor was wasting a lot of time on cache (and TLB) misses and handling page faults. To try to fix this I modified the code for checking whether an article had already been visited, as this is a hot area of code (it runs for every graph edge) and so contributes a lot to cache pressure. Previously I used a Vec<bool>, which actually allocates a byte per element (in rustc 1.45.0), so I changed this to a bit array accessed in usize chunks. This takes more instructions but reduces the size of this lookup table from ~42MB to ~5.2MB. In practice using a bit array drastically reduced the number of page faults (from 479K down to 563) and reduced cache + TLB misses, for a runtime saving of about 9%. After this optimization the IPC of the application improved but is still low at 0.55, with the processor stalled in the backend much of the time, indicating that it is waiting on cache/RAM as expected earlier.
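A minimal sketch of the bit-array replacement (the names are illustrative): one bit per article, packed into usize chunks.

    // Visited set packed one bit per article, stored in usize-sized chunks.
    const BITS: usize = std::mem::size_of::<usize>() * 8;

    struct BitSet {
        bits: Vec<usize>,
    }

    impl BitSet {
        fn new(len: usize) -> Self {
            BitSet { bits: vec![0; (len + BITS - 1) / BITS] }
        }

        fn set(&mut self, index: usize) {
            self.bits[index / BITS] |= 1 << (index % BITS);
        }

        fn get(&self, index: usize) -> bool {
            self.bits[index / BITS] & (1 << (index % BITS)) != 0
        }
    }

The interface mirrors how the Vec<bool> was indexed, so swapping it in is mostly a matter of replacing the reads and writes in the BFS inner loop.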

These are the perf dumps comparing Vec<bool> and a bit array. With the packed bit array:

 Performance counter stats for './target/release/wikipedia-analysis analyze --input ./datasets/incoming-no-ignore.tsv step-groups --use-most-linked 100 -j1' (3 runs):

         61,651.86 msec task-clock                #    0.860 CPUs utilized            ( +-  0.03% )
             6,540      context-switches          #    0.106 K/sec                    ( +-  0.58% )
                22      cpu-migrations            #    0.000 K/sec                    ( +- 31.11% )
               563      page-faults               #    0.009 K/sec                    ( +-  0.06% )
   258,200,769,370      cycles                    #    4.188 GHz                      ( +-  0.05% )  (33.32%)
     1,827,589,039      stalled-cycles-frontend   #    0.71% frontend cycles idle     ( +-  4.77% )  (33.33%)
   195,234,782,900      stalled-cycles-backend    #   75.61% backend cycles idle      ( +-  0.15% )  (33.34%)
   141,912,845,871      instructions              #    0.55  insn per cycle
                                                  #    1.38  stalled cycles per insn  ( +-  0.13% )  (33.35%)
    37,482,288,741      branches                  #  607.967 M/sec                    ( +-  0.15% )  (33.36%)
       911,335,037      branch-misses             #    2.43% of all branches          ( +-  0.15% )  (33.36%)
    36,428,900,470      L1-dcache-loads           #  590.881 M/sec                    ( +-  0.34% )  (33.35%)
    16,504,944,023      L1-dcache-load-misses     #   45.31% of all L1-dcache hits    ( +-  0.11% )  (33.34%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       639,734,331      L1-icache-loads           #   10.377 M/sec                    ( +- 10.71% )  (33.34%)
         9,192,254      L1-icache-load-misses     #    1.44% of all L1-icache hits    ( +- 14.27% )  (33.33%)
     4,393,749,645      dTLB-loads                #   71.267 M/sec                    ( +-  0.06% )  (33.33%)
       804,150,797      dTLB-load-misses          #   18.30% of all dTLB cache hits   ( +-  0.13% )  (33.32%)
            16,698      iTLB-loads                #    0.271 K/sec                    ( +- 35.74% )  (33.32%)
             5,418      iTLB-load-misses          #   32.44% of all iTLB cache hits   ( +- 10.57% )  (33.31%)
     7,393,124,069      L1-dcache-prefetches      #  119.917 M/sec                    ( +-  0.10% )  (33.31%)
   <not supported>      L1-dcache-prefetch-misses

           71.6933 +- 0.0210 seconds time elapsed  ( +-  0.03% )

With the Vec<bool>:

 Performance counter stats for './target/release/wikipedia-analysis analyze --input ./datasets/incoming-no-ignore.tsv step-groups --use-most-linked 100 -j1' (3 runs):

         67,612.61 msec task-clock                #    0.871 CPUs utilized            ( +-  0.12% )
             6,812      context-switches          #    0.101 K/sec                    ( +-  1.24% )
                34      cpu-migrations            #    0.001 K/sec                    ( +- 28.36% )
           479,482      page-faults               #    0.007 M/sec                    ( +-  0.00% )
   283,020,187,493      cycles                    #    4.186 GHz                      ( +-  0.14% )  (33.32%)
    16,700,789,883      stalled-cycles-frontend   #    5.90% frontend cycles idle     ( +-  0.65% )  (33.32%)
   185,523,834,828      stalled-cycles-backend    #   65.55% backend cycles idle      ( +-  0.29% )  (33.33%)
   111,599,071,243      instructions              #    0.39  insn per cycle
                                                  #    1.66  stalled cycles per insn  ( +-  0.16% )  (33.34%)
    38,918,555,995      branches                  #  575.611 M/sec                    ( +-  0.03% )  (33.34%)
       903,514,023      branch-misses             #    2.32% of all branches          ( +-  0.07% )  (33.35%)
    38,193,886,423      L1-dcache-loads           #  564.893 M/sec                    ( +-  0.25% )  (33.35%)
    15,997,466,672      L1-dcache-load-misses     #   41.88% of all L1-dcache hits    ( +-  0.12% )  (33.35%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
     1,332,080,498      L1-icache-loads           #   19.702 M/sec                    ( +-  1.91% )  (33.34%)
        13,750,618      L1-icache-load-misses     #    1.03% of all L1-icache hits    ( +-  0.19% )  (33.34%)
     7,815,259,013      dTLB-loads                #  115.589 M/sec                    ( +-  0.10% )  (33.33%)
     1,443,854,789      dTLB-load-misses          #   18.47% of all dTLB cache hits   ( +-  0.09% )  (33.33%)
            19,083      iTLB-loads                #    0.282 K/sec                    ( +- 31.28% )  (33.32%)
             8,380      iTLB-load-misses          #   43.91% of all iTLB cache hits   ( +- 12.44% )  (33.32%)
     5,252,210,634      L1-dcache-prefetches      #   77.681 M/sec                    ( +-  0.21% )  (33.32%)
   <not supported>      L1-dcache-prefetch-misses

           77.6594 +- 0.0775 seconds time elapsed  ( +-  0.10% )

# A possible graph structure optimization

An optimization I thought of but didn’t attempt would be to use some space in the article index to store the length of the links list, i.e. 32 bit article index = [Length (8 bits) | Sub index (24 bits)]. The sub index is simply a number counting up from 0 used to address an article within its length group. For length 0 there’s no need to store or check a links array, and for lengths greater than 254, 0xFF is stored in the length field. With a 24 bit sub index each length group can hold 16.7 million articles, which is more than enough for the Wikipedia link graph.
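A sketch of how such a packed index could be split apart and rebuilt, following the layout described above (this was never implemented, so the names and masks here are purely illustrative):

    // 32 bit packed article index: top 8 bits = number of links,
    // bottom 24 bits = position within that length group.
    // A length field of 0xFF means "255 or more links, kept in a Vec as before".
    const SUB_BITS: u32 = 24;
    const SUB_MASK: u32 = (1 << SUB_BITS) - 1;
    const LEN_OVERFLOW: u32 = 0xFF;

    fn pack(len: u32, sub_index: u32) -> u32 {
        (len.min(LEN_OVERFLOW) << SUB_BITS) | (sub_index & SUB_MASK)
    }

    fn unpack(packed: u32) -> (u32, u32) {
        (packed >> SUB_BITS, packed & SUB_MASK) // (length, sub index)
    }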

The advantage of this approach is that the links are stored directly in the array, with no overhead from the fat pointers of the Rust vectors. Given the distribution of link counts within the graph, with 254 bins about 99% of the graph can be stored this way, and the remaining articles with more than 254 links are handled in a vector as before. With this approach we shave off almost all of the memory overhead from storing the vector fat pointers, saving about 30% of the RAM usage for the graph. Additionally the initial lookup table is much smaller and more cache + TLB friendly.

The disadvantages of this approach are that there would need to be more pre-processing of the link graph and this method does not handle runtime graph changes well. For this specific use case these disadvantages are fine as we pre-process anyway and graph updates do not happen at runtime.

# Multithreading performance

The analyze step-groups option is multithreaded on a per-root-article basis using a thread pool. Splitting the work per article avoids any shared mutable state, meaning that locking is avoided during the calculations and synchronization overhead is minimized. The graph itself is shared but immutable; however, each thread needs its own copy of the visited-article bit array (as it is locally mutable), which means extra memory usage and cache + TLB pressure.
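A sketch of this split, reusing the bfs_step_groups sketch from earlier; the real code uses a thread pool, so the static chunking and names here are a simplification of my own.

    use std::sync::Arc;
    use std::thread;

    // Hand each worker thread its own batch of root articles. The graph is shared
    // read-only through an Arc; the visited array is allocated inside
    // bfs_step_groups, so each thread gets its own mutable copy and no locks are
    // needed while searching.
    fn run_step_groups(graph: Arc<Vec<Vec<u32>>>, roots: Vec<u32>, threads: usize) {
        let chunk_size = ((roots.len() + threads - 1) / threads).max(1);
        let mut handles = Vec::new();
        for batch in roots.chunks(chunk_size) {
            let graph = Arc::clone(&graph);
            let batch = batch.to_vec();
            handles.push(thread::spawn(move || {
                for root in batch {
                    let _groups = bfs_step_groups(graph.as_slice(), root);
                }
            }));
        }
        for handle in handles {
            handle.join().unwrap();
        }
    }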

I tested analyze step-groups --use-most-linked 1000 with different numbers of threads to see how it scaled:

On my CPU scaling starts out well at 2-4 threads, quite close to the best-case scaling, however after this the curve begins to flatten out. Beyond 6 threads there is still some improvement, with 12 threads ~15% faster than 6 threads even though the CPU only has 6 cores (but 12 hardware threads). This improvement comes from the CPU being able to swap hardware threads and keep doing useful work while the original thread waits on resources (e.g. memory or the ALUs). In particular this is great for hiding the latency of loads and stores to RAM, which I expect to be the limiting factor in this case.

To check this I looked at the perf stat -ddd output for 6 threads:

 Performance counter stats for './target/release/wikipedia-analysis analyze --input ./datasets/incoming-no-ignore.tsv step-groups --use-most-linked 1000 -j6' (3 runs):

        744,101.86 msec task-clock                #    5.474 CPUs utilized            ( +-  0.52% )
            78,002      context-switches          #    0.105 K/sec                    ( +-  2.06% )
             3,204      cpu-migrations            #    0.004 K/sec                    ( +-  4.27% )
           345,299      page-faults               #    0.464 K/sec                    ( +- 13.37% )
 3,060,567,532,200      cycles                    #    4.113 GHz                      ( +-  0.45% )  (33.33%)
    56,316,461,768      stalled-cycles-frontend   #    1.84% frontend cycles idle     ( +- 17.88% )  (33.33%)
 2,215,138,496,711      stalled-cycles-backend    #   72.38% backend cycles idle      ( +-  0.56% )  (33.33%)
 1,453,069,355,503      instructions              #    0.47  insn per cycle
                                                  #    1.52  stalled cycles per insn  ( +-  0.22% )  (33.34%)
   386,946,467,158      branches                  #  520.018 M/sec                    ( +-  0.22% )  (33.34%)
     9,223,896,852      branch-misses             #    2.38% of all branches          ( +-  0.29% )  (33.34%)
   357,996,252,870      L1-dcache-loads           #  481.112 M/sec                    ( +-  0.23% )  (33.33%)
   171,020,520,856      L1-dcache-load-misses     #   47.77% of all L1-dcache hits    ( +-  0.19% )  (33.33%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
     7,545,464,019      L1-icache-loads           #   10.140 M/sec                    ( +-  1.11% )  (33.33%)
       101,512,452      L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  1.07% )  (33.33%)
    38,975,018,990      dTLB-loads                #   52.379 M/sec                    ( +-  7.17% )  (33.33%)
     8,095,336,896      dTLB-load-misses          #   20.77% of all dTLB cache hits   ( +-  0.30% )  (33.33%)
         1,658,838      iTLB-loads                #    0.002 M/sec                    ( +- 11.12% )  (33.33%)
           533,663      iTLB-load-misses          #   32.17% of all iTLB cache hits   ( +- 11.53% )  (33.33%)
    78,033,948,371      L1-dcache-prefetches      #  104.870 M/sec                    ( +-  0.19% )  (33.33%)
   <not supported>      L1-dcache-prefetch-misses

           135.924 +- 0.888 seconds time elapsed  ( +-  0.65% )

and 12 threads:

 Performance counter stats for './target/release/wikipedia-analysis analyze --input ./datasets/incoming-no-ignore.tsv step-groups --use-most-linked 1000 -j12' (3 runs):

      1,163,747.72 msec task-clock                #   10.430 CPUs utilized            ( +-  0.54% )
           191,772      context-switches          #    0.165 K/sec                    ( +-  8.75% )
               214      cpu-migrations            #    0.000 K/sec                    ( +- 44.42% )
           808,013      page-faults               #    0.694 K/sec                    ( +-  9.43% )
 4,788,696,023,567      cycles                    #    4.115 GHz                      ( +-  0.67% )  (33.33%)
 1,734,895,465,882      stalled-cycles-frontend   #   36.23% frontend cycles idle     ( +-  0.34% )  (33.33%)
   185,488,043,644      stalled-cycles-backend    #    3.87% backend cycles idle      ( +-  2.99% )  (33.33%)
 1,441,285,498,579      instructions              #    0.30  insn per cycle
                                                  #    1.20  stalled cycles per insn  ( +-  0.18% )  (33.33%)
   383,446,707,529      branches                  #  329.493 M/sec                    ( +-  0.17% )  (33.33%)
     9,304,654,376      branch-misses             #    2.43% of all branches          ( +-  0.76% )  (33.34%)
   344,970,641,223      L1-dcache-loads           #  296.431 M/sec                    ( +-  0.38% )  (33.34%)
   172,042,432,738      L1-dcache-load-misses     #   49.87% of all L1-dcache hits    ( +-  0.08% )  (33.34%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    11,085,809,043      L1-icache-loads           #    9.526 M/sec                    ( +-  7.11% )  (33.34%)
       197,428,946      L1-icache-load-misses     #    1.78% of all L1-icache hits    ( +-  4.56% )  (33.34%)
    51,027,176,794      dTLB-loads                #   43.847 M/sec                    ( +-  0.36% )  (33.34%)
     8,815,361,448      dTLB-load-misses          #   17.28% of all dTLB cache hits   ( +-  0.56% )  (33.33%)
           844,078      iTLB-loads                #    0.725 K/sec                    ( +- 22.67% )  (33.33%)
           588,067      iTLB-load-misses          #   69.67% of all iTLB cache hits   ( +- 16.04% )  (33.33%)
    78,581,985,187      L1-dcache-prefetches      #   67.525 M/sec                    ( +-  0.08% )  (33.33%)
   <not supported>      L1-dcache-prefetch-misses

            111.58 +- 1.45 seconds time elapsed  ( +-  1.30% )

It looks like in practice the limiting factor is keeping the execution units fed with new instructions. This seems to be mainly due to icache misses, which nearly double with the extra threads. A single icache miss can cause a pipeline stall for hundreds of clock cycles, so even though the percentage of L1 icache load misses is low, a small change can have a large effect. With the extra threads there is more cache pressure at every level, so it is expected that misses become more frequent. In particular, the Zen 2 microarchitecture has a smaller L1 icache and unified (i.e. storing both data and instructions) L2 and L3 caches, which means the cached instructions at higher levels must also fight the extra pressure from cached data.


# Total program memory usage

I wanted to look at the memory usage of the whole program to see how it compared to the mental model I had while writing it. Unlike C, you don’t explicitly malloc memory in Rust, but it should still be relatively clear when you are allocating (provided you know a bit about Rust). I was interested to see whether the various overheads in Rust (e.g. fat pointers) differed greatly from what I’d expect a C implementation to use. This comparison is of course a tad unfair, as Rust has a much larger and more useful standard library than C; to achieve the same thing in C I’d either need to write the bits I needed myself or pull in external libraries that would also need to malloc.

For memory analysis I used valgrind --tool=massif together with a release build that included debug symbols. This monitors heap usage over time and identifies the source of heap allocations. This dumps an output file which can then be analysed using ms_print.

The graph of memory usage over time from ms_print plots memory usage on the Y axis against the number of instructions executed on the X axis. For the same analyze step-groups operation used earlier the heap memory usage looks like this:

    GB
1.198^                 #
     |                 #:@@:::::::@:::::::::::::::::@@:::@::::::@:::::::@:::::
     |                 #:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |                @#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |              :@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |             ::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |            @::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |          @:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |         @@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |       ::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |      @::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |     :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     |   :::@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
     | ::: :@::@@:@::@@#:@ :: :: :@:::::: ::: ::::: @ :::@::::::@:::::::@:::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   183.0

So after loading the graph and associated data from the TSV, memory usage levels off. Looking at the period after it levels off, massif identifies about 1.1GB of useful heap usage, with an additional 150MB of overhead for managing the heap.

# Usage breakdown

As in the earlier calculations:

  • Initialized capacity of data structures C = 6 * 10^6
  • Actual number of articles S = ~5.3 * 10^6
  • sizeof(usize) = 8
  • Average length of article name L = 19
  • Average number of links (for full dataset) N = 22.6
  • Number of usize sized fields in a fat pointer (for String and Vec) = 3

Articles (Article struct has only one element, links: Vec<u32>):

  • Expected = sizeof(u32) * N * S = 479MB
  • Actual = 40.26% (501,305,756B) 0x1DA4D8: alloc::vec::Vec<T>::reserve (vec.rs:500)

Article name to article index lookup table:

  • Expected = next_power_2(C * 8/7) * (3 * sizeof(usize) + 4 + 1) = 243MB
  • Actual = 22.23% (276,824,080B) 0x153C81: with_capacity<alloc::string::String,u32> (map.rs:239)
  • (See below for an explanation to the expected calculation)

Outer graph vector (Vec<Article>):

  • Expected = C * 3 * sizeof(usize) = 144MB
  • Actual = 11.57% (144,000,000B) 0x153CF1: with_capacity<wikipedia_analysis::parse::Article> (vec.rs:358)

String storage (just article names):

  • Expected = S * L = 107MB
  • Actual = 08.39% (104,446,343B) 0x154119: to_string (string.rs:2235)

Article index to article name lookup table (Vec<&String>):

  • Expected = S * sizeof(usize) = 42.4MB
  • Actual = 03.44% (42,834,688B) 0x154798: generate_index_lookup_table (main.rs:426)

So notably there is some unexpected overhead in the article link vectors. I did some looking around and found that when there are 1, 2, or 3 elements in the vector the capacity is set to 4. For any other length (including 0) the size and capacity of the vector match. Using the histogram of the number of pages with each link count we can calculate the overhead for these 3 cases (which have 3, 2 and 1 wasted elements respectively):

 Short links vector overhead =
     3 * sizeof(u32) * 834 * 10^3
   + 2 * sizeof(u32) * 568 * 10^3
   + 1 * sizeof(u32) * 427 * 10^3
   ≈ 15MB

This extra 15MB plugs the hole in the vector size calculation. I’m not sure if this behavior is a bug or expected in Rust. I believe it’s expected, as the collect function checks the iterator for a size hint which is then used to set the vector capacity. It may be that the hint is inexact for small sizes (which is an allowed scenario according to the docs), or otherwise it may be a bug in the vector sizing logic. I haven’t looked into the exact cause (yet).

For the article name to article index lookup table I used the Rust standard library hashmap. Rust uses an implementation of SwissTable for its hashmaps, which has a very low overhead of 1 byte per entry. Hash tables also require some spare capacity to avoid poor performance due to collisions; for SwissTable the target load factor is 7/8 = 87.5%. So to size the table it first scales the desired capacity up to maintain this load factor: capacity_overhead = 8/7 * desired_capacity

However, SwissTable also uses power-of-two-sized hash arrays to improve lookup efficiency, as a power-of-two modulo is much cheaper to calculate (it is just a bitwise AND). It then takes the next power of two equal to or greater than capacity_overhead. In this case capacity_overhead = 6.86x10^6, so storage_capacity = 2^23. Then, with the size of the String fat pointer key, the u32 value and 1 byte of overhead per entry, I expected the memory usage to be: storage_capacity * (sizeof(usize) * 3 + sizeof(u32) + 1) = 243MB

In practice the size is actually: storage_capacity * (sizeof(usize) * 4 + 1) = 276MB, which indicates that the u32 is padded out to a usize somewhere. Having a look through the hashtable code nothing stood out to me as the cause, but it seems to be because the key and value are stored as a tuple in the table. The Rust language reference makes no guarantees about tuple storage layout, so I suspect the extra padding comes from there.
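A quick worked check of the two estimates above, just re-doing the arithmetic with the constants already discussed:

    fn main() {
        let desired_capacity: usize = 6_000_000;
        // SwissTable targets a 7/8 load factor, then rounds the bucket count up to a
        // power of two: 6,857,142 -> 2^23 = 8,388,608.
        let buckets = (desired_capacity * 8 / 7).next_power_of_two();
        let expected_per_entry: usize = 3 * 8 + 4 + 1; // String fat pointer + u32 value + 1 control byte
        let actual_per_entry: usize = 4 * 8 + 1;       // u32 padded to a usize in the (String, u32) tuple
        println!("expected: {} MB", buckets * expected_per_entry / 1_000_000); // ~243
        println!("actual:   {} MB", buckets * actual_per_entry / 1_000_000);   // ~276
    }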

In total there was about 5% of extra memory usage in the main data structures that I didn’t expect. Overall I think that’s pretty reasonable and isn’t enough for me to worry about in most cases.

# Thoughts on Rust

Overall I found Rust to be a great fit for this project. The strong type system let me write and refactor code with much more confidence than C or C++, giving the same sort of ‘if it compiles it’s probably fine’ feeling as Haskell. For this specific project I didn’t find Rust to be much less productive than Python either, thanks in part to the large package ecosystem available at crates.io, good compiler hints and great docs. This was the first nontrivial project I’ve completed in Rust and I’m definitely keen to use it more in the future.