Network science has provided many insights into the organization of complex systems. Much of its success comes from its ability to capture the organization of multivariate interactions as networks or graphs, without requiring explicit dynamical rules for node variables. As the field matures, however, there is a need to move from understanding complex systems to controlling them. This is particularly true in systems biology and medicine, where increasingly accurate models of biochemical regulation have been produced, and in the study of social networks, where models of epidemic spread inform pandemic management. We have contributed to this goal with two mathematical concepts that allow us to remove different forms of redundancy in networks: 1) distance closures, and 2) canalization via schema re-description. The first concept allows us to infer the invariant subgraph that is sufficient to compute all shortest paths in a weighted graph: the distance backbone. This work has demonstrated that many networks across different domains are massively redundant: most of their edges are not necessary to compute shortest paths (e.g. 90% of edges in some brain networks). Removing this redundancy facilitates computation and the discovery of important pathways in many applications. It also simplifies, and indeed enables, the characterization of information transmission and dynamics on large biochemical, brain, social, technological, and knowledge networks (the latter including applications to automatic fact-checking), which are otherwise too large to study analytically [Correia et al, 2022; Correia, Barrat, and Rocha, 2022; Simas, Correia, and Rocha, 2022; Teixeira et al, 2020; Simas and Rocha, 2015; Ciampaglia et al, 2015; Simas and Rocha, 2012; Rocha, 2002].
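As a rough illustration of the distance backbone idea, the sketch below keeps only those edges whose weight equals the shortest-path distance between their endpoints. It assumes the metric case (path length is the sum of edge weights), an undirected simple graph, and strictly positive weights; the function names `shortest_distances` and `metric_backbone` are hypothetical and are not taken from any published code.

```python
import heapq

def shortest_distances(adj, source):
    """Dijkstra: shortest-path distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def metric_backbone(edges):
    """Return the edges (u, v, w) that no indirect path can beat.

    Assumption: undirected simple graph, positive weights, and path
    length defined as the sum of edge weights (the metric closure).
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    cache = {}  # one Dijkstra run per source node
    backbone = []
    for u, v, w in edges:
        if u not in cache:
            cache[u] = shortest_distances(adj, u)
        # The edge is kept when the direct distance is already the shortest.
        if cache[u][v] >= w:
            backbone.append((u, v, w))
    return backbone

edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 3)]
print(metric_backbone(edges))  # [('a', 'b', 1), ('b', 'c', 1)]
```

In this toy graph the edge between `a` and `c` is redundant, because the path through `b` is shorter. With floating-point weights, the comparison would likely need a small tolerance.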
The prime example of a Document Network (DN) is the World Wide Web (WWW). But many other types of such networks exist: Wikipedia, electronic health records, bibliographic databases containing scientific publications, social media platforms, as well as databases of datasets used in scientific endeavors. Each of these databases possesses several distinct relationships among documents, and between documents and the semantic tags or indices that classify them. For instance, documents in the WWW are related via a hyperlink network, while documents in bibliographic databases are related by citation and collaboration networks. Furthermore, documents can be related to semantic tags such as keywords used to describe their content. Given these relations, we can compute distance functions (typically via co-occurrence measures) amongst documents and/or semantic tags, thus creating associative, weighted networks between these items, whose weights denote stronger or weaker co-associations. We have used such distance and proximity networks for inference and discovery of protein-protein and drug-drug interactions, health and comorbidity risks, gender and age biases, automatic fact-checking, pharmacokinetic parameters in drug interaction and adverse reaction studies, protein sequence and structure prediction, functional annotation of transcription data, enzyme annotation publications, etc. (see publications listed below). We have also used distance and proximity graphs to uncover modules or clusters in the network that may be associated with a particular topic or community of interest. We have applied clustering methods to biomedical, social, knowledge, scientific co-authorship, and citation networks, among others (see our bibliome informatics and adaptive web projects for more details). We have also used information-theoretical approaches to classify documents of interest in probabilistic graphs of citation co-occurrence in scientific citation networks [Kolchinsky et al, 2010].
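To make the step from co-occurrence to a weighted network concrete, here is a minimal sketch that builds a distance graph between semantic tags from a document-to-tags mapping. It assumes Jaccard proximity as the co-occurrence measure and the map d = 1/p - 1 to turn proximity into distance; both are illustrative choices among several possible ones, and `tag_distance_graph` is a hypothetical name.

```python
from itertools import combinations

def tag_distance_graph(doc_tags):
    """Build a weighted distance graph between tags from co-occurrence.

    doc_tags maps each document id to the set of tags attached to it.
    Assumptions: proximity is the Jaccard measure
    |docs(a) & docs(b)| / |docs(a) | docs(b)|, and distance is
    1/proximity - 1. Tag pairs that never co-occur get no edge
    (their distance would be infinite).
    """
    # Invert the mapping: tag -> set of documents carrying that tag.
    docs_of = {}
    for doc, tags in doc_tags.items():
        for tag in tags:
            docs_of.setdefault(tag, set()).add(doc)

    distances = {}
    for a, b in combinations(sorted(docs_of), 2):
        both = len(docs_of[a] & docs_of[b])
        if both == 0:
            continue
        either = len(docs_of[a] | docs_of[b])
        # 1/p - 1 with p = both/either, written to avoid rounding error.
        distances[(a, b)] = (either - both) / both
    return distances

doc_tags = {
    "d1": {"x", "y"},
    "d2": {"x", "y"},
    "d3": {"x", "z"},
}
print(tag_distance_graph(doc_tags))  # {('x', 'y'): 0.5, ('x', 'z'): 2.0}
```

Tags that always appear together get distance 0, and the distance grows without bound as co-occurrence becomes rarer.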
We study the non-metric network topologies that arise in weighted graphs obtained from real-world data (e.g. co-occurrence statistics). In particular, we have developed measures to extract the graph edges that most violate the triangle inequality: semi-metric associations (the edges that are removed to reveal the distance backbones described above). Our working hypothesis is that strong semi-metric associations can be used to identify trends, items with a higher probability of co-occurring in the future, as well as the dynamics of such networks in general. This methodology has been successfully applied to networks of biochemical and biomedical entities, published documents, recommender systems for digital libraries at the Los Alamos National Laboratory, web search and recommendation by the givealink.org project, networks of felons obtained from intelligence records, and gene networks (see publications below). This work has been pursued in the Identification of Interests, Trends and Dynamics in Document Networks Project as well as in a Los Alamos Homeland Security LDRD DR project, “Advanced Knowledge Integration (LDRD Reserve)”, to discover latent associations in social networks (internal report available). Some of this work was also funded by an NSF grant from the Human and Social Dynamics program, with Eliot Smith and Rob Goldstone, which received some media attention for being one of the first to look at Facebook data.
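One simple way to quantify how strongly an edge violates the triangle inequality is the ratio between its direct distance and the shortest indirect distance between its endpoints. The sketch below computes that ratio for every edge using the Floyd-Warshall algorithm. It assumes an undirected graph with strictly positive weights and sum-of-weights path length; `semi_metric_ratios` is a hypothetical name, and this ratio is only one of several measures that could be used.

```python
def semi_metric_ratios(edges):
    """Map each edge (u, v) to direct_distance / shortest_path_distance.

    A ratio of 1.0 means the edge is metric (no indirect path is
    shorter); a ratio above 1.0 means the edge is semi-metric.
    Assumptions: undirected graph, strictly positive weights.
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, w in edges:
        dist[u][v] = dist[v][u] = min(dist[u][v], w)

    # Floyd-Warshall: relax every pair through every intermediate node.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k

    return {(u, v): w / dist[u][v] for u, v, w in edges}

edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 3)]
ratios = semi_metric_ratios(edges)
# Rank edges from the strongest violation to the weakest.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])  # (('a', 'c'), 1.5)
```

Floyd-Warshall is cubic in the number of nodes, so this form suits only small graphs; for large sparse networks, repeated single-source shortest-path searches would scale better.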
We have developed a stochastic model of vertex aging in networks, to better predict network growth [Simas and Rocha, 2008]. Real-world networks display a cut-off in their power-law node degree distributions that is not expected under the canonical Barabási-Albert model. Amaral et al had shown that this cut-off behavior can be modeled computationally with vertex aging. We produced a mathematical model of vertex aging that allows accurate predictions of the equilibrium point of active vertices and relates network growth to the probability of aging.
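The following toy simulation illustrates the general idea of growth with vertex aging. It is not the model of [Simas and Rocha, 2008]; it is a simplified sketch under stated assumptions: one vertex is added per step, each active vertex becomes inactive independently with probability `p_age` per step, and the new vertex attaches to one active vertex chosen with probability proportional to degree plus one. The name `grow_with_aging` and all parameters are hypothetical.

```python
import random

def grow_with_aging(steps, p_age, seed=0):
    """Grow a network one vertex per step, retiring active vertices at random.

    Assumptions (illustrative only): the network starts as a single
    edge 0-1; inactive vertices never receive new links; attachment
    weight is degree + 1 so that isolated vertices can still be chosen.
    Returns (degree, active): a dict of vertex degrees and the set of
    vertices still active at the end.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}
    active = {0, 1}
    for new in range(2, steps + 2):
        # Aging: each active vertex retires with probability p_age.
        active = {v for v in sorted(active) if rng.random() >= p_age}
        degree[new] = 0
        if active:
            pool = sorted(active)
            weights = [degree[v] + 1 for v in pool]
            target = rng.choices(pool, weights=weights)[0]
            degree[target] += 1
            degree[new] = 1
        active.add(new)
    return degree, active

degree, active = grow_with_aging(steps=2000, p_age=0.05, seed=42)
print(len(active), max(degree.values()))
```

In this toy version the expected number of active vertices follows A → (1 - p_age)·A + 1, so it settles near 1/p_age. Because vertices stop accumulating links once they retire, the largest degrees stay far below what pure preferential attachment would produce, which is the qualitative cut-off effect described above.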
Luis M. Rocha (PI)
Paulo Navarro Costa
Rion Brattig Correia
Felipe Xavier Costa
Andreia Sofia Teixeira