Rion Correia$^{1,2,3}$, Nathan Ratkiewicz$^{1}$, Alain Barrat$^{4,5}$ and Luis M. Rocha$^{1,3,*}$
Weighted graphs are often used to capture distance associations among a set of node variables: every edge is a positive real number denoting a distance between the linked nodes. These networks are useful as knowledge graphs for big data inference, e.g. in drug interaction extraction from social media [2] or automated fact-checking [4]. We have shown that such networks obtained from real-World data are not typically metric, but are rather semi-metric [5, 6]. This means that the triangle inequality of Euclidean geometry is not observed for every edge in the graph: the shortest distance between at least two nodes in the graph is not the direct edge between them, but rather an indirect path via other nodes. When this occurs, we say that the edge is semi-metric. The computation of shortest paths (e.g. via Dijkstra's algorithm [7]) is isomorphic to a particular transitive closure we refer to as the metric closure and which enforces the triangle inequality on the closed graph: if an edge in the original graph is semi-metric, its weight gets replaced by the value of the shortest indirect path between the nodes it links, enforcing that the resulting (closed) graph is metric [6].
Interestingly, there is an invariant sub-graph of the original graph that does not change with the closure computation and which is sufficient to compute all shortest paths. We refer to this subgraph as the metric backbone of a complex network [8]. It contains all metric edges of the original graph: those that already observe the triangle inequality prior to computing the metric closure. The size of the backbone subgraph, in relation to the size of the original graph, defines the amount of redundancy in the network: edges not on this backbone are superfluous in the computation of shortest paths, as well as all network measures derived from shortest paths (e.g. efficiency, betweenness centrality). It has been shown that there is typically massive redundancy of this form in graphs obtained from various types of data ranging from topical spaces of large document corpora [5, 9, 10] to brain connectome and functional networks [8, 11-13]. For instance, the knowledge graph of more than 3 million concepts extracted from Wikipedia used for automated fact-checking is 98% semi-metric [4]. This means that its backbone contains only 2% of the original edges, which are sufficient to compute all shortest-paths of the original graph used to infer factual associations.
Epidemiology of infectious diseases is one of the fields in which network science has led to the most concrete advances [14]. The structure of the contact network on which the disease propagates plays a crucial role, with heterogeneous networks strongly favoring the spread[15]. We use six of the SocioPatterns datasets [16] from offices, schools and hospitals, which yield duration of contact between pairs of individuals (via wearable sensors), and have been used in models of epidemic spread to evaluate containment policies [17, 18]. We show that the SocioPatterns contact networks are very redundant, with the proportion of semi-metric edges ranging from 50-91%---four of the six networks with larger than 80% redundancy. This means that all shortest paths can be computed with fewer than 50 to 9% of the edges (which comprise the metric backbone). The figure shows the primary school contact network [1] and its metric backbone, which contains only 9% of the edges in the original network. Importantly, the social structure of the contact network is preserved in the backbone subnetwork---in the figure, the community structure of the backbone subgraph was recomputed after edge removal, but main student communities (10 classrooms), and community pairs (5 grades) did not change from those in original network, neither did the teacher nodes. The preservation of community structure in the metric backbone is observed in all of the six SocioPatterns networks. We will show that alternative methods for removing edges (e.g. thresholding) destroy the community structure of the original networks after removal of much fewer edges than those removed with the computation of the metric backbone. Finally, we will also compare epidemic spread simulations, with the SI and SIR models, using the original and backbone networks, for different spreading rates.