Collective Social Behavior in Health-Related Problems

The recent availability of large-scale population data from web searches and social media now allows us to study collective social behavior on a global scale. Using various sources of large-scale social data, such as electronic health records, social media, web searches, public forums, and mobile application data, we are working to understand the causes of, and solutions for, various health-related problems. These range from understanding human patterns of reproduction and interest in sex at a global scale to uncovering adverse drug reactions [Correia, Wood, Bollen & Rocha, 2020; Correia et al, 2019].

Public health surveillance methods using social media, web search, and electronic health record data provide an increasingly detailed, large-scale, and fine-grained record of the behavior of a considerable fraction of the world’s population. Biomedical, public health, and government researchers and agents can leverage this data to observe human behavior directly at high temporal and geographic resolution. Indeed, the ability to study humans as their own model organism via this type of big data is now a more reasonable prospect for biomedical research than ever before [Correia, Wood, Bollen & Rocha, 2020].

Defined by the National Institutes of Health as Big Data, social media and web search data provide large-scale measurements of individual psychology and collective behavior, which enhance the ability of public health surveillance methods to uncover new patient-stratification principles and unknown disease correlations. Leveraging these kinds of data constitutes a novel opportunity to obtain real-time, accurate measures of relevant epidemiological signals. The potential of using data from Twitter, Instagram, Facebook, Reddit, and other sources for discovery and prediction of Drug-Drug Interactions (DDI), Adverse Drug Reactions (ADR), and even sudden death in epilepsy has been demonstrated by our group and others [Correia, Wood, Bollen & Rocha, 2020; Wood et al, 2022]. See our work on Public health monitoring using social media, electronic health records, and other unconventional data sources.

Network of term proximity built from Instagram timelines that mentioned a drug used to treat depression. Term co-occurrence at one-week window resolution; largest connected component and weight >= 0.05 shown. From: [Correia, Li & Rocha, 2016]

Video summarizing our use of computational social science to solve a long-standing sociobiology question about human reproduction cycles [Wood et al, 2017].

We have also pioneered the use of spectral methods from multivariate time-series analysis and literature mining in the sentiment analysis of social media data. This eigenmood methodology was first used to study cycles of human reproduction at the planetary level [Wood et al, 2017]. Together with collaborators, we have been generalizing and expanding it to extract and characterize public mood factors associated with excess mortality in COVID-19. We are working to demonstrate that: a) regular language usage obscures sentiment analysis of social media, but can be removed to reveal the precise components of sentiment that are associated with phenomena of interest [ten Thij, Wood, Rocha, & Bollen, 2019]; and b) such precise extraction of components of public sentiment is useful to characterize and predict excess mortality and other biomedical phenomena that affect human populations [Correia, Wood, Bollen & Rocha, 2020].
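As a rough illustration of the spectral idea described above, the following sketch applies a singular value decomposition to a small synthetic matrix of sentiment scores (rows are time points, columns are mood dimensions) and removes the leading component, which here stands in for regular language usage. The data, the function name `remove_leading_component`, and the choice to drop exactly one component are assumptions made for illustration; this is not the published eigenmood pipeline.

```python
import numpy as np

def remove_leading_component(X, k=1):
    """Return X with its k largest singular components removed.

    X is a (time points x mood dimensions) matrix. Treating the leading
    components as "regular language usage" is an assumption of this sketch.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_residual = s.copy()
    s_residual[:k] = 0.0           # zero out the k dominant components
    return (U * s_residual) @ Vt   # rebuild the matrix from what remains

# Synthetic data: a strong constant baseline plus a weak weekly signal.
rng = np.random.default_rng(0)
t = np.arange(56)                                    # 8 weeks of daily values
baseline = np.outer(np.ones(t.size), [5.0, 4.0, 3.0])
weekly = 0.3 * np.outer(np.sin(2 * np.pi * t / 7), [1.0, -1.0, 0.5])
X = baseline + weekly + 0.01 * rng.standard_normal((t.size, 3))

residual = remove_leading_component(X, k=1)
```

In this toy setting the baseline dominates the raw matrix, while the residual is driven mostly by the weekly pattern. With real data, deciding which components to discard would need to be justified empirically.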

We currently analyze various web search and social media sources, such as Google Trends, Wikipedia, Twitter, Facebook, ChaCha, Reddit, and the Epilepsy Foundation public forums, and have focused on studying depression, epilepsy, and opioid abuse, as well as other health-related problems [Correia, Wood, Bollen & Rocha, 2020] such as human reproduction [Wood et al, 2017] and even automated online fact-checking [Ciampaglia et al, 2015]. See publications below for additional details on all these threads.




Yearly classification results on the US House of Representatives. Classification based on 3,000 textual features extracted from House floor speeches. From: [Correia, Chan, & Rocha, 2015]

Discourse Polarization in the US Congress

Congressional politics in the United States has become increasingly polarized across the aisle in recent decades. However, common estimates of polarization, which are based on roll call votes or bill cosponsorship data, tell us little about the lawmakers' agendas and values.

We address this issue by studying U.S. House floor speeches using text mining and machine learning techniques. Our results show that predicting party affiliation from textual features improves with more recent speeches, suggesting an intensification of polarized discourse. Moreover, polarization is more pronounced in some topics than in others [Correia, Chan, & Rocha, 2015].

We also show that building knowledge networks from feature relations offers a preliminary route to the study of policy agendas and values. These findings will facilitate future analyses of the use of framing devices in political communication, such as "dog whistles".




Detecting conflict in social unrest using Instagram

Public protests and civil disobedience have been a recurring means to change the political status quo via social activism. After the introduction of mobile communication and the adoption of social media, it has become possible to obtain and measure real-time, large-scale quantitative data about social unrest situations.

Occasionally, social activism can degenerate into unrest and violence. In such conflict situations, protests can transition from peaceful to violent, including riots that damage property and clashes with police. Here we address the question of whether the build-up in tension in such protest activities can be identified and ultimately predicted using social media data. We collected data from the social media platform Instagram related to the 2014 clashes in Ferguson and Hong Kong.

We collected public Instagram posts that matched our event-specific hashtags through the service’s API. Only posts geo-located within the protest area were kept. Posts were curated and annotated for traces of violence or tension build-up. We divided the geographical area into a two-dimensional grid of rectangular cells and aggregated the data in 15-minute intervals. We analyzed this space-time data using the Singular Value Decomposition (SVD). Our goal was to identify the (time- and space-) singular vectors most correlated with tension build-up or the onset of violence.
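The aggregation and decomposition steps described above can be sketched as follows. The post coordinates, bounding box, grid size, and the helper name `bin_posts` are all hypothetical and chosen only for illustration; the sketch is not the pipeline used in the study.

```python
import numpy as np

def bin_posts(posts, bbox, n_rows, n_cols, n_bins, bin_minutes=15):
    """Count posts per (grid cell, time bin).

    posts: iterable of (lat, lon, minutes_since_start) tuples.
    bbox:  (lat_min, lat_max, lon_min, lon_max) of the protest area.
    Returns a (n_rows * n_cols, n_bins) matrix of counts.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    counts = np.zeros((n_rows * n_cols, n_bins))
    for lat, lon, minute in posts:
        # Keep only posts inside the area and inside the time window.
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        b = int(minute // bin_minutes)
        if not 0 <= b < n_bins:
            continue
        r = min(int((lat - lat_min) / (lat_max - lat_min) * n_rows), n_rows - 1)
        c = min(int((lon - lon_min) / (lon_max - lon_min) * n_cols), n_cols - 1)
        counts[r * n_cols + c, b] += 1
    return counts

# Hypothetical posts: (latitude, longitude, minutes since start of collection).
posts = [
    (0.10, 0.10, 2), (0.12, 0.15, 7), (0.11, 0.13, 16),
    (0.60, 0.70, 20), (0.62, 0.71, 31), (0.61, 0.72, 33),
    (0.63, 0.74, 40),
    (1.50, 0.20, 5),   # outside the bounding box, so it is dropped
]
counts = bin_posts(posts, bbox=(0.0, 1.0, 0.0, 1.0), n_rows=2, n_cols=2, n_bins=3)

# Cell with the most posts overall (a simple density measure).
densest_cell = int(counts.sum(axis=1).argmax())

# Columns of U are space singular vectors; rows of Vt are time singular vectors.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
```

Each singular vector could then be correlated with the annotated tension or violence labels. That step is omitted here because the annotation format is not specified.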

Our results indicate that it is possible to pinpoint the location of the main gatherings solely by calculating cell density. Furthermore, some singular vectors characterize the dynamics of increased social conflict well. In this paper, we describe and visualize the dynamics of social conflict in Ferguson and Hong Kong.

Our work demonstrates that current geo-tagged social media posts can be an accurate source of data to predict tension build-up and ultimately violence in social unrest situations. This method could be useful for journalists, human-rights agencies, and government organizations.

(left) Instagram posts over Hong Kong. (right) Instagram posts over Ferguson (MO). Violent posts shown in red.

Funding

Project partially funded by




Project Members (Current and Former)

Luis Rocha (PI)

Johan Bollen

Rion Brattig Correia

Joana Gonçalves-Sá

Lang Li

Wendy Miller

Aehong Min

Ian B Wood




Selected Project Publications