Social Media Mining Toolkit (SMMT)

  • Tekumalla, Ramya (Department of Computer Science, Georgia State University)
  • Banda, Juan M. (Department of Computer Science, Georgia State University)
  • Received : 2020.03.16
  • Accepted : 2020.05.22
  • Published : 2020.05.28


The use of social media data for research has grown dramatically within the biomedical community. In PubMed alone, nearly 2,500 publication entries since 2014 deal with analyzing social media data from Twitter and Reddit. However, the vast majority of those works do not share their code or data for replicating their studies. With minimal exceptions, the few that do place the burden on the researcher to figure out how to fetch the data, how best to format it, and how to create automatic and manual annotations on the acquired data. To address this pressing issue, we introduce the Social Media Mining Toolkit (SMMT), a suite of tools that encapsulates the cumbersome details of acquiring, preprocessing, annotating, and standardizing social media data. The toolkit is intended to let researchers focus on answering research questions rather than on the technical aspects of using social media data. By using a standard toolkit, researchers will be able to acquire, use, and release data in a consistent way that is transparent to everybody using the toolkit, thereby simplifying research reproducibility and accessibility in the social media domain.


