Rion Correia$^{1,2,3}$, Nathan Ratkiewicz$^{1}$, Alain Barrat$^{4,5}$ and Luis M. Rocha$^{1,3,*}$
Weighted graphs are often used to capture distance associations among a set of node variables: every edge is a positive real number denoting a distance between the linked nodes. These networks are useful as knowledge graphs for big data inference, e.g. in drug interaction extraction from social media [2] or automated fact-checking [4]. We have shown that such networks obtained from real-World data are not typically metric, but are rather semi-metric [5, 6]. This means that the triangle inequality of Euclidean geometry is not observed for every edge in the graph: the shortest distance between at least two nodes in the graph is not the direct edge between them, but rather an indirect path via other nodes. When this occurs, we say that the edge is semi-metric. The computation of shortest paths (e.g. via Dijkstra's algorithm [7]) is isomorphic to a particular transitive closure we refer to as the metric closure and which enforces the triangle inequality on the closed graph: if an edge in the original graph is semi-metric, its weight gets replaced by the value of the shortest indirect path between the nodes it links, enforcing that the resulting (closed) graph is metric [6].
Interestingly, there is an invariant sub-graph of the original graph that does not change with the closure computation and which is sufficient to compute all shortest paths. We refer to this subgraph as the metric backbone of a complex network [8]. It contains all metric edges of the original graph: those that already observe the triangle inequality prior to computing the metric closure. The size of the backbone subgraph, in relation to the size of the original graph, defines the amount of redundancy in the network: edges not on this backbone are superfluous in the computation of shortest paths, as well as all network measures derived from shortest paths (e.g. efficiency, betweenness centrality). It has been shown that there is typically massive redundancy of this form in graphs obtained from various types of data ranging from topical spaces of large document corpora [5, 9, 10] to brain connectome and functional networks [8, 11-13]. For instance, the knowledge graph of more than 3 million concepts extracted from Wikipedia used for automated fact-checking is 98% semi-metric [4]. This means that its backbone contains only 2% of the original edges, which are sufficient to compute all shortest-paths of the original graph used to infer factual associations.
Epidemiology of infectious diseases is one of the fields in which network science has led to the most concrete advances [14]. The structure of the contact network on which the disease propagates plays a crucial role, with heterogeneous networks strongly favoring the spread[15]. We use six of the SocioPatterns datasets [16] from offices, schools and hospitals, which yield duration of contact between pairs of individuals (via wearable sensors), and have been used in models of epidemic spread to evaluate containment policies [17, 18]. We show that the SocioPatterns contact networks are very redundant, with the proportion of semi-metric edges ranging from 50-91%---four of the six networks with larger than 80% redundancy. This means that all shortest paths can be computed with fewer than 50 to 9% of the edges (which comprise the metric backbone). The figure shows the primary school contact network [1] and its metric backbone, which contains only 9% of the edges in the original network. Importantly, the social structure of the contact network is preserved in the backbone subnetwork---in the figure, the community structure of the backbone subgraph was recomputed after edge removal, but main student communities (10 classrooms), and community pairs (5 grades) did not change from those in original network, neither did the teacher nodes. The preservation of community structure in the metric backbone is observed in all of the six SocioPatterns networks. We will show that alternative methods for removing edges (e.g. thresholding) destroy the community structure of the original networks after removal of much fewer edges than those removed with the computation of the metric backbone. Finally, we will also compare epidemic spread simulations, with the SI and SIR models, using the original and backbone networks, for different spreading rates.
References:
[1] Stehle, J., et al., PLoS One, 2011. 6(8): p. e23176.
[2] Correia, R.B., L. Li, and L.M. Rocha, Pac Symp Biocomput, 2016. 21: p. 492-503.
[3] Jacomy, M., et al., PLoS One, 2014. 9(6): p. e98679.
[4] Ciampaglia, G.L., et al., PloS one, 2015. 10(6): p. e0128193.
[5] Rocha, L.M., in Soft Computing Agents: A New Perspective for Dynamic Information Systems, V.Loia, Editor. 2002, IOS Press. p. 137-163.
[6] Simas, T. and L.M. Rocha, Network Science, 2015. 3(2): p. 227-268.
[7] Dijkstra, E.W., Numerische mathematik, 1959. 1(1): p. 269-271.
[8] Simas, T., O. Sporns, and L.M. Rocha, 2018: p. Submitted.
[9] Rocha, L.M., et al. 2005, IEEE Press. p. 565-571.
[10] Simas, T. and L.M. Rocha. in Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 03. 2012. IEEE Computer Society.
[11] Simas, T. and J. Suckling, Frontiers in Neuroscience, 2016. 10.
[12] Peeters, S., et al., NeuroImage: Clinical, 2015. 9: p. 607-616.
[13] Kalavri, V., T. Simas, and D. Logothetis, Proceedings of the VLDB Endowment, 2016. 9(9): p. 672-683.
[14] Pastor-Satorras, R., et al., Reviews of Modern Physics, 2015. 87(3): p. 925-979.
[15] Barrat, A., M. Barthelemy, and A. Vespignani, Dynamical Processes on Complex Networks. 2008: Cambridge University Press. 368.
[16] SocioPatterns: data-driven social dynamics and human activity. 2017: http://www.sociopatterns.org.
[17] Gemmetto, V., A. Barrat, and C. Cattuto, BMC Infectious Diseases, 2014. 14.
[18] Gauvin, L., et al., arXiv:1501.02758, 2015.
The Sociopatterns project is a collaborative effort to gather data about collocational encounters between individuals in a variety of settings. The project is primarily geared toward utilizing interdisciplinary approaches and expertise to model the transmission of disease. The available data sets range in date from 2008 to 2015, and were collected from a number of different environments: a primary school [1], a high school [2], a hospital [3], a workplace [4], a conference [5], and even a museum exhibit specifically dedicated to the project [6]. The data was collected using Radio-Frequency Identification (RFID) devices whose range was limited such that contact was only recorded when individuals were within 1-1.5 meters and facing one another.
In each of the data sets there is a contact record file where each row consists of two IDs of individuals, as well as a the time, measured in the number of seconds since midnight on the day the experiment was started. Multiple contacts were possible during the same time frame, and are represented in their individual rows.
For each Socio Patterns dataset, contact sequences between individuals $i$ and $j$ were grouped into different time windows resulting in a symmetric graph $R_w$. These graphs can be represented as adjacency matrices $R_w$ where entries $r_{ij}$ denote the number of time windows individuals $i$ and $j$ were measured together. For $r_{ii}$, the diagonal values of the adjacency matrix, we consider two options. In the first, $r_{ii}$ denotes the total number of time windows $w$ in which individual $i$ was observed in the experiment, thus accounting for the potential of an individual to influence spread by the amount of time he or she is present in the experiment. The second focuses on the strength of social interaction, whereby $r_{ii} = \sum_{j \neq i} r_{ij}$; in other words, in the latter case, $r_{ii}$ denotes the total number of time windows $w$ in which individual $i$ was measured in a social interaction with any other individual $j$. We refer to these two forms of normalizing interactions as the time and social normalization, respectively.
To obtain a normalized strength of association between individuals, we computed proximity graphs, $P_w$ for each time-window resolution $w$. Thus, the entries of the adjacency matrix $P_w$ of a proximity graph are given by:
\begin{equation} \label{eq_proximity} p_{ij} = \frac{r_{ij}}{r_{ii} + r_{jj} - r_{ij}}, \quad \forall_{i,j} \in R_w \end{equation}
where $p_{ij} \in [0,1]$; $p_{ij}=0$ for individuals $i$ and $j$ who never had any contact and $p_{ij}=1$ when they were always together when observed in the experiment. The proximity is the standard Jaccard measure [6,7], but other proximity measures can be considered [8]. We also compute distance graphs $D_w$ for the same time-window resolutions, using the map:
\begin{equation} d_{ij} = \varphi(p_{ij}) = \frac{1}{p_{ij}} -1 , \quad \forall_{i,j} \in D_w \quad . \end{equation}
Further, we compute the metric closure $D^{C}_{w}$ of the distance graphs, which is isomorphic to a specific transitive closure of the proximity graph [8]. The metric closure is equivalent to computing the shortest path between every pair of nodes in the distance graph. Thus, $d^{C}_{ij}$ is the length (sum of distance weights) of the shortest path between individuals $i$ and $j$ in the original distance graph $D_{w}$.
Semi-metric edges in $D_w$ violate the triangle inequality, $d_{ij} \leq d_{ik} + d_{kj}$ for some third individual $k$. Thus, distance edge in $D^{C}_w$ is smaller than in original distance graph $D_w$–denoting the existence of an indirect path of shorter length. Metric edges, on the other hand, are closure-invariant: $d_{ij} = d^{c}_{ij}$, and constitute the metric backbone, $D^{b}_{w}$ subgraph: the edges that have no shorter indirect paths and which are sufficient to compute all shortest paths in $D^{C}$. To measure the amount of semi-metricity in $D_{w}$, we compute two additional semi-metric graphs $S_w$ and $B_w$ [8,9], whereby $s_{ij}$ and $b_{ij}$ values are calculated as:
\begin{equation} s_{ij} = \frac{d_{ij}}{d^{C}_{ij}} \quad, \quad b_{ij} = \frac{\overline{d_{i}}}{d^{C}_{ij}} \quad . \end{equation}
$s_{ij}$ is computed for every edge originally present in the $D_w$ graph. For metric edges, $s_{ij}=1$, and for semi-metric edges $s_{ij}>1$. Some edges are not present in the original $D_w$ graph ($d_{ij}=\infty$), but are present in the closed graph $D^C_w$–individuals with no direct contacts measured in the data, but for whom there is an indirect contact path. In these cases $s_{ij} = \infty$, thus we also compute a second measure of semi-metric behavior, $b_{ij}$, by computing the mean distance to neighbors of $i$ which can be reached directly with finite distance: $\overline{d_{i}}$.
See distance closure background theory for additional details.
We want to quantify how much the metric backbone is able to preserve the social structure captured by the original network – also in comparison to alternative backbones obtained by removing the same number of edges: one removing the largest distance edge weights (threshold) and the other removing random edges. To compare the module similarity between two networks we use measures from set and information theory. After computing community detection (e.g. using the original metalabels or Louvain modularity) on the original graph, we identify a set of modules $A_i$, each comprised of a subset of nodes of the graph. After computing community detection on the backbone graph we similarly identify a set of modules $B_i$. Further, $i=1,\dots m_A$ and $j=1 \dots m_B$. Note the number of modules in each network, $m_A$ and $m_B$, are not necessarily equal, as each modularity algorithm may return a different number of modules for the original and backbone networks.
The subsethood of $A_i$ in $B_j$ and the subsethood of $B_j$ in $A_i$ can then be computed as
\begin{equation} S(A_i,B_j) = \frac{ |A_i \cap B_j| }{ |A_i| } \quad \text{and} \quad S(B_j,A_i) = \frac{ |A_i \cap B_j| }{ |B_j| } \quad , \end{equation}
where often $S(A_i,B_j) \neq S(B_j,A_i)$. We then construct probability distribution associated with each module as:
\begin{equation} \mathcal{A}_i = \big\{ (S(A_i,B_1), \dots, S(A_i,B_{m_B}) \big\} \quad \text{and} \quad \mathcal{B}_j = \big\{ (S(B_j,A_1), \dots, S(B_j,A_{m_A}) \big\} \quad , \end{equation}
These probability distributions represent how much modules $A_i$ and $B_j$ are dispersed into all modules $B_1 \dots B_{m_B}$ and $A_1 \dots A_{m_A}$, respectively. Therefore, we compute the overall dispersion of nodes from modules of one network into the modules of the backbone, and vice versa, using Shannon’s entropy $H$:
\begin{equation} h_{A \to B} = \frac{ \sum_{i}^{m_A} H(\mathcal{A}_i) }{ m_A \cdot \log_2(m_B) } \quad , \quad h_{B \to A} = \frac{ \sum_{j}^{m_B} H(\mathcal{B}_j) }{ m_B \cdot \log_2(m_A) } \quad . \end{equation}
Thus, $h_{A \to B}$ captures how much the modules of the original graph were on average dispersed into the modules of the backbone, and $h_{B \to A}$ the other way around. The smaller these normalized values are, the more the community structure of one network is preserved in the other.
To obtain a value that captures module similarity in both directions, we also compute a normalized measure based on the Jaccard proximity between all modules in original graph and all modules in backbone graph:
\begin{equation} y_{AB} = \frac{ \sum_{i}^{m_A} \sum_{j}^{m_B} P(A_i,B_j) }{ \sqrt{ m_A \cdot m_B} } \quad \text{where} \quad P(A_i,B_j) = \frac{ |A_i \cap B_j| }{ |A_i \cup B_j| } \quad . \end{equation}
In this case, the larger the values, the more similar is the community structure between original and backbone graph.
Another measure based on the Jaccard proximity is computed to compare each module in original graph, to its most similar counterpart in the graph ($J_{A \to B}$) and vice-versa ($J_{B \to A}).
\begin{equation} J_{A \to B} = \frac{ \sum_i^{m_A} \max_j^{m_B} P(A_i,B_j) }{ m_A } \quad , \quad J_{B \to A}= \frac{ \sum_j^{m_B} \max_i^{m_A} P(A_i,B_j) }{ m_B } \end{equation}
Finally, to also compute a measure of similarity that considers “nestedness”, we compute a recently proposed measure of modularity similarity between graphs named $\text{clusim}$ [10].
For all these measures, the tables in the results below compare how much the backbones capture two alternative ways of looking at the community structure of the original network: the metalabels originally assigned to nodes, and the inferred community detection on the original network. Depending on the network, sometimes the metalabels are the most accurate parsing of the social structure, but in other cases it is the community detection that is a better representative of the actual social structure.
References:
[1] Juliette Stehlé, Nicolas Voirin, Alain Barrat, Ciro Cattuto, Lorenzo Isella, Jean-François Pinton, Marco Quaggiotto, Wouter Van den Broeck, Corinne Régis, Bruno Lina, and et al. High-resolution measurements of face-to-face contact patterns in a primary school. PLoS ONE, 6(8):e23176, Aug 2011.
[2] Julie Fournet and Alain Barrat. Contact patterns among high school students. PLoS ONE, 9(9):e107878, Sep 2014.
[3] Philippe Vanhems, Alain Barrat, Ciro Cattuto, Jean-François Pinton, Nagham Khanafer, Corinne Régis, Byeul-a Kim, Brigitte Comte, and Nicolas Voirin. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE, 8(9):e73970, Sep 2013.
[4] Mathieu Génois, Christian L. Vestergaard, Julie Fournet, André Panisson, Isabelle Bon-marin, and Alain Barrat. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science, 3(3):326–347,Sep 2015.
[5] Lorenzo Isella, Juliette Stehlé, Alain Barrat, Ciro Cattuto, Jean-François Pinton, and Wouter Van den Broeck. What’s in a crowd? analysis of face-to-face behavioral networks. Journal of Theoretical Biology, 271(1):166–180, Feb 2011.
[6] Paul Jaccard. Distribution de la flore alpine dans le bassin des dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[7] G. Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic, 1994.
[8] Tiago Simas and Luis M. Rocha. Distance closures on complex networks. Network Science ,3:227–268, 6 2015.
[9] Luis M Rocha. Semi-metric behavior in document networks and its application to recommendation systems. Soft Computing Agents: A New Perspective for Dynamic Information Systems, 83:137, 2002.
[10] Alexander Gates, Ian Wood, William Hetrick, Yong-Yeol Ahn. On comparing clusterings: an element-centric framework unifies overlaps and hierarchy, arXiv preprint arXiv:1706.06136, 2017.
Results on the following networks can be accessed on their own Jupiter Notebook. For each, there are a computation and a results notebook. For reproducibility purposes, the former contains necessary code for the computation of each formula, the latter contains only resulting images and tables.
Network | Nodes | Edges | Metric (backbone size) | Semi-metric | ||
---|---|---|---|---|---|---|
Primary School | 242 | 8,317 | 790 (9.50%) | 7,527 (90.50%) | summary | computations |
High School | 327 | 5,818 | 603 (10.36%) | 5,215 (89.64%) | summary | computations |
Workplace | 232 | 4,274 | 745 (17.43%) | 3,529 (82.57%) | summary | computations |
Exhibit\* | 200 | 714 | 356 (49.86%) | 358 (50.14%) | summary | |
Hospital | 75 | 1,139 | 217 (19.05%) | 922 (80.95%) | summary | |
Hypertext | 113 | 2,196 | 308 (14.03%) | 1,888 (85.97%) | summary |
RBC is supported by CAPES Foundation Grant No. 18668127; Instituto Gulbenkian de Ciência (IGC); and Indiana University Precision Health to Population Health (P2P) Study. LMR was partially funded by the National Institutes of Health, National Library of Medicine Program, grant 01LM011945-01, by a Fulbright Commission fellowship, and by NSF-NRT grant 1735095 "Interdisciplinary Training in Complex Networks and Systems". The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.