Background. A number of systems use data from the Internet (such as news, social media, and crowd-sourced reports) and other digital sources (e.g., cell phones and wearable devices) to monitor disease spread, assess population attitudes towards vaccines, and improve understanding of the interaction between population behavioral changes and health. Beyond the challenge of extracting public health signals from the noise inherent in these data sources, the data carry significant biases because individuals from different locations, age groups, and racial/ethnic backgrounds are not equally represented. Although several publications have discussed the limitations of these data sources, no project has developed a rigorous and comprehensive approach to systematically investigate these limitations and explore mitigation strategies.
Specific Aims. We seek to improve methods for automated inference of key demographic traits of Twitter users, including age, race/ethnicity, and gender. We will then use these tools to assess the quality and representativeness of health information provided by users on Twitter, and to examine how the public discusses personal health on Twitter. Through this research, we seek to improve the way researchers use Twitter as a means of learning about, and eventually improving, the health of the American public.
Funding. This project is funded by the Robert Wood Johnson Foundation.