Supervised learning approaches require labeled training data to generate a predictive model. An essential precondition is that the training set is correctly labeled; in other words, it has to be free of labeling errors. Data generated on social media can be misleading, which makes this precondition hard to meet. Sadilek et al. (2012) noted that their observations are limited by how much users talk about their health and by the ability to identify those tweets in the flood of other types of messages. A decisive factor for this approach is therefore whether the classifier can analyze natural language. For example, “I’m sick of you.” matches the keyword “sick”, but “sick of” usually does not indicate someone’s health condition. With an appropriate, well-labeled training set, supervised approaches can in theory distinguish such cases.
Aramaki et al. (2011) experimented with several supervised approaches, including Naive Bayes, logistic regression, and the support vector machine (SVM). They reported that an SVM with a polynomial kernel was feasible in terms of both accuracy and training time. Their model used human-annotated tweets as the training set, tweets from 2008-2010 identified with keywords from an influenza corpus as test data, and statistics from Japan’s Infectious Disease Surveillance Center (IDSC) as validation data. This work is considered one of the first attempts to apply machine learning techniques to epidemic prediction using social media data. One issue with their model is that it did not consider the geo-location of tweets, although the IDSC data can be regionally specific. Spatial distribution is an essential characteristic of infectious diseases, so research in this field needs to take it into account unless the model operates on a global scale. The reported results show that the predicted values are highly correlated with the IDSC statistics.
However, their results also indicate that the SVM is sensitive to news on Twitter: the model failed to distinguish health-related news from messages about a user’s personal health condition.
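The keyword-matching problem described above (“sick of” matching the keyword “sick”) can be illustrated with a minimal sketch. The keyword list, the exclusion phrase, and both function names are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import re

# Hypothetical keyword list and exclusion phrases, for illustration only.
HEALTH_KEYWORDS = {"sick", "flu", "fever"}
NON_HEALTH_PHRASES = ("sick of",)


def matches_keyword(text: str) -> bool:
    """Naive filter: True if any health keyword appears as a whole word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & HEALTH_KEYWORDS)


def matches_keyword_with_exclusions(text: str) -> bool:
    """Same filter, after removing phrases assumed to be figurative."""
    lowered = text.lower()
    for phrase in NON_HEALTH_PHRASES:
        lowered = lowered.replace(phrase, " ")
    return matches_keyword(lowered)


if __name__ == "__main__":
    for tweet in ("I'm sick of you.", "I am sick with the flu."):
        print(tweet, matches_keyword(tweet), matches_keyword_with_exclusions(tweet))
```

A hand-written exclusion list like this does not scale to the variety of natural language, which is the motivation for training a classifier on labeled examples instead.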
Broniatowski et al. (2013) investigated a similar supervised classification model that overcomes this barrier by separating tweets that indicate influenza infection from those that indicate influenza awareness or concern. Their model used 269 health-related keywords and an SVM to filter relevant tweets, followed by two logistic regression models trained on human-annotated data to predict whether or not a person is infected with influenza. Statistical data from the US Centers for Disease Control and Prevention (CDC) served as the validation set. The two classifiers were estimated to have 67% and 74% precision, respectively, and 87% recall each. Unlike the previous work, this study applied geo-location filtering to the data and compared the results with CDC statistics, which are regionally specific. Although the model’s prediction converges with the CDC data at the peak, the CDC surveillance shows a steep increase in outpatient statistics at the beginning of the flu season, whereas the model predicts only a gradual increase in influenza prevalence. This may be because users did not discuss their health condition on Twitter.
Both studies discussed in this section adopted the supervised learning approach. They are bounded by the disease-related keywords that a potentially infected Twitter user has to mention. Despite this limitation, both works showed promising results in using social media data to forecast influenza seasons. In particular, the SVM outperformed the other methods.
Semi-supervised Learning
Semi-supervised learning is a machine learning technique that uses both labeled and unlabeled data to train a model.
Sadilek et al. (2012) adopted a semi-supervised cascade-based approach to learn a robust SVM classifier with a large area under the ROC curve (i.e., consistently high precision and high recall). This work used two different binary SVM classifiers, one that heavily penalizes false positives and another that heavily penalizes false negatives. Evaluation of the final SVM shows 0.98 precision and 0.97 recall. By applying two SVMs with different penalization strategies, this approach shows a notable improvement over previous methods: precision and recall are each at least 10 percentage points higher than the best estimates for the model developed by Broniatowski et al. (2013). Sadilek et al. (2012) also explored the spread of disease based on geo-location and the relationships between Twitter users. The results indicate that a person’s social ties are highly related to the spread of infectious diseases. This work used only conditional probabilities to investigate the effect of social relations on disease transmission, so further research with different methods is required.
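One common way to make a classifier penalize one type of error more than the other is a class-weighted hinge loss. The sketch below shows only that general idea; it is an assumption for illustration and not the formulation used by Sadilek et al. (2012), whose exact objective is not given here. The function name and the weights are hypothetical.

```python
def weighted_hinge_loss(scores, labels, pos_weight, neg_weight):
    """Sum of hinge losses, weighted by the true class of each example.

    labels are +1 (health-related) or -1 (other). A large pos_weight makes
    errors on true positives (false negatives) costly; a large neg_weight
    makes errors on true negatives (false positives) costly.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        weight = pos_weight if label == 1 else neg_weight
        total += weight * max(0.0, 1.0 - label * score)
    return total


if __name__ == "__main__":
    # One missed positive: true label +1, classifier score -0.5.
    missed_positive = ([-0.5], [1])
    # Heavily penalizing false negatives versus heavily penalizing false positives.
    print(weighted_hinge_loss(*missed_positive, pos_weight=10.0, neg_weight=1.0))
    print(weighted_hinge_loss(*missed_positive, pos_weight=1.0, neg_weight=10.0))
```

Under the first weighting the missed positive dominates the loss, so training is pushed toward high recall; under the second, the same mistake is cheap and training is pushed toward high precision instead.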
Unlike the above methods, unsupervised learning uses unlabeled data to discover interesting patterns. In some sense, it behaves more like human-level intelligence. Lim et al. (2017) proposed a method that uses an unsupervised machine learning model to discover latent infectious diseases without using predetermined disease attributes. The method employed SentiStrength, an unsupervised sentiment analysis tool based on NLP, together with a list of biomedical terms to determine whether a tweet contains symptoms, body parts, and a pain location (Lim, Tucker & Kumara, 2017). Like the methods discussed in the previous sections, this approach uses a list of keywords to filter social media messages, but with some improvements. Instead of matching disease-related keywords, it matches three types of expressions (symptoms, body parts, and pain locations) as well as the relationships among them. As a result, the model targets a more comprehensive collection of tweets. Lim et al. (2017) evaluated their model with the F1 score and reported a value of 0.724.
Computing the F1 score from the reported precision and recall, the two classifiers of Broniatowski et al. (2013) have F1 scores of approximately 0.76 and 0.80, and the model of Sadilek et al. (2012) has an F1 score of 0.97. By definition, an F1 score of 1 indicates perfect precision and recall. With an F1 score of 0.724, the unsupervised model of Lim et al. (2017) therefore did not show any advantage in terms of statistical evaluation. However, unsupervised approaches can predict disease outbreaks proactively because they do not rely on a training set. The traditional method adopted by the CDC relies on statistical data from hospitals, which can be time-consuming to collect. Consider a hypothetical scenario: a new infectious disease breaks out in a populous city. Users talk about their symptoms online, and the symptoms are unlike those of any known disease. In theory, the unsupervised model can discover this pattern and combine it with spatial analysis to generate a prediction. It should be able to predict the outbreak before the CDC or other disease surveillance agencies release their reports.
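As a check on the comparison above, the F1 score (the harmonic mean of precision and recall) can be recomputed from the precision and recall figures quoted earlier for each model. Recomputed this way, the values come to roughly 0.76 and 0.80 for the two classifiers of Broniatowski et al. (2013) and 0.97 for Sadilek et al. (2012). The function name and the dictionary labels are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the text for each model.
reported = {
    "Broniatowski et al. (2013), classifier 1": (0.67, 0.87),
    "Broniatowski et al. (2013), classifier 2": (0.74, 0.87),
    "Sadilek et al. (2012)": (0.98, 0.97),
}

if __name__ == "__main__":
    for name, (precision, recall) in reported.items():
        print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```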
Several other studies also address this problem; this review discussed only a small, representative subset of methods. The primary challenge in using social media to predict disease outbreaks is NLP. Machine learning can produce promising models that outperform traditional methods such as keyword filtering systems. The three types of learning methods have different advantages and limitations. Earlier studies that adopted supervised or semi-supervised techniques demonstrated feasible solutions: by surveilling social media, machine learning models can monitor disease outbreaks, with the SVM being the most popular choice. More recent research that applied unsupervised learning provides a more proactive approach. However, further study is needed to improve its performance.
The models discussed in the above sections did not consider the nature of the infectious disease, such as its mode of transmission or incubation period. Additional studies that incorporate such features could develop more complex models. A method that also includes spatial analysis could make predictions on a global scale. For example, once a model discovers a potential outbreak in a region, applying travel pattern analysis for that area in combination with the disease’s characteristics would make it possible to produce a prediction on a global scale. Such models would have a powerful impact on pandemic prevention. Current models are also limited in language: they apply NLP only to English, and additional languages require further investigation.