Mining Real-Time Social Media Big Data to Monitor HIV: Development and Ethical Issues

Abstract: Social “big data” holds information with wide-ranging implications for addressing issues along the HIV care continuum. Social big data refers to information from social media and online platforms on which individuals and communities create, share, and discuss content. One in four people worldwide, or over a billion people, are publicly documenting their activities, intentions, moods, opinions, and social interactions on these sites. They are doing so with increasing volume and velocity, including 400 million “tweets” per day on Twitter and 4.75 billion content items shared per day on Facebook. With an increasing number of these platforms supporting access to publicly available user data, social big data analysis is a promising new approach for obtaining organic observations of behavior that can be used to monitor and predict real-world public health problems, such as HIV incidence. New tools, such as social data, are needed to supplement existing HIV data collection methods. In preliminary research, our team developed the first approach for identifying, from social big data (>550 million tweets), psychological and behavioral characteristics associated with HIV diagnoses. Because groups at the highest risk for HIV (e.g., minority populations) are the fastest-growing groups of Twitter users, and because social media users have been found to publicly share personal information, we identified and collected tweets suggesting HIV risk behaviors (e.g., drug use and high-risk sexual behaviors) and modeled them alongside CDC statistics on HIV diagnoses. We found a significant positive relationship between HIV-related tweets and county-level HIV cases, controlling for socioeconomic status measures and other variables. However, this approach is not currently scalable for use by HIV researchers and public health organizations.
Although public health agencies are interested in mining social data to address HIV, current tools are not accessible to most health scientists because they require advanced computer science expertise. For example, analyzing 500 million tweets a day requires expertise in big data engineering, advanced machine learning, natural language processing, and artificial intelligence. A single platform for mining social data, designed and tested by and for HIV researchers, could have a significant impact on HIV prevention, testing, and treatment. We seek to create a single automated platform that collects social media data; identifies, codes, and labels tweets that suggest HIV-related behaviors; and ultimately predicts regional HIV incidence. Because of the potential ethical issues associated with mining people’s data, we also seek to interview staff at local and regional HIV organizations and participants affected by HIV to gain their perspectives on the ethical issues associated with this approach. The software developed under this application will be shared with HIV researchers and health care workers as an additional tool for combating the spread of HIV.

Project Number: 1R01AI32030-01 (2017); 1R56AI125105-01A1 (2016)