
We introduce the Diner classifier, a system built from three stage modules.

The first stage is disease news filtering, which separates disease or health related
news headlines from both disease related and non-disease related corpora. The second
stage is outbreak news filtering, which picks outbreak news headlines out of the
disease related headlines extracted in the first stage, using classifiers such as SVM,
random forest and naive Bayes. The third stage classifies the news headlines into three
categories: incidence reporting headlines, trend reporting headlines and future
predicting headlines. This chapter explains the basic architecture of the Diner
classification model and gives an overview of the modules used, the input sources and
the execution methodology. Experiments and results are presented in a later chapter.

The first stage uses TF-IDF (term frequency-inverse document frequency), in which
unigram, bigram and trigram frequencies are considered, to find disease related news in
the non-disease related corpus. TF-IDF is an information retrieval approach that
measures the importance of a word or set of words in a corpus; weights are assigned to
the important n-grams. For the second stage, features are extracted from the disease
related corpora using natural language processing. A naive Bayes model gives the
polarity of a sentence, which can be positive or negative, and this is used to get
outbreak news headlines. Several classifier models are used, such as SVM, random forest
and naive Bayes. Training data has been collected from various disease related and
non-disease related news corpora. A part-of-speech (POS) tagger, a natural language
processing tool, is used to find health related news headlines about past and future
events; these are then used to train a classifier model that separates three classes:
incidence reporting headlines, trend reporting headlines and future prediction
reporting headlines. Naive Bayes, SVM and random forest models have been tried for this
as well. This chapter describes the input sources, the methodologies and stages used,
the preparation of the training data sets, the feature engineering for our classifier
and the algorithm used.
[Figure: Disease related corpus and non-disease related corpus -> features development
-> features extraction -> classifier model]

INPUT SOURCES:-

The Diner classifier takes two sets of inputs: training datasets and candidate sources.
Various methods are used to collect inputs from the sources. The training data sources
and the methods of preparation are given below.

TRAINING DATA:

Disease related and non-disease related corpora are used to build the training data.
They are taken in the ratio 20:80. The sources and the preparation of data from both
are listed in the following subsections:-

Disease related corpus:

Various disease related corpora are publicly available. They directly serve as positive
instances for classification. The major disease related corpora considered are listed
below:

HEALTHMAP:- This is a publicly available online disease news repository developed at
Boston Children's Hospital. It contributes 10 percent of the disease outbreak news. Its
Disease Daily news repository is used to get outbreak news.

CDC:- The Centers for Disease Control and Prevention contributes 5 percent of the
disease outbreak news. Only recent outbreak news from 2015 and 2016 is considered for
the training data.

IDSP:- The Integrated Disease Surveillance Programme is a project that provides health
and disease outbreak information for India. Its monthly and weekly reports are
collected manually. It provides health news for villages, districts and states, and
contributes 5 percent of our training data.

WIKIPEDIA:- Wikipedia is used to get the list of disease names, including emergent
disease names, which are then used to find disease related news.

NEWS CORPUS WITH DISEASE NAMES:- Other news corpora have been used to collect more
disease names, since a more complete list of disease names helps to find more disease
related headlines. The system also needs information extracted from other news corpora
such as biomedical and clinical news sources.
To extract all these disease corpora, crawlers and scrapers are used to aggregate
disease news articles. The crawler is run every week for a month to aggregate news
headlines related to disease or health.

[Figure: Flow of the Diner classifier. List of disease names and news headlines H from
the non-disease related corpus -> headlines containing any disease name -> features
extraction using TF-IDF, joined by headlines from the disease related corpus -> disease
related headlines -> naive Bayes, semantic analysis of the sentence, rule based approach
and cosine similarity matching -> Ho: disease outbreak news headlines or trend reporting
headlines -> classifier models. Non-outbreak disease related headlines -> POS tagging
and semantic analysis -> incidence reporting headlines and future prediction reporting
headlines -> classifier models.]

NON DISEASE CORPUS:- A news repository in which headlines from various news sources are
crawled and stored daily is used. It contributes 80 percent of the training data used
for classification.

Modules Used:

The system works in three modules:-

1. Fetching disease related news headlines

2. Isolating disease outbreak news from disease related headlines

3. Classifying the disease related news articles into three classes: incidence reporting
headlines, trend reporting headlines and future predicting headlines.

PREPROCESSING :-

CORPUS CLEANING:- News headlines may contain irrelevant data such as Unicode characters
and delimiters, so these are removed during the corpus cleaning stage. Disease names are
extracted from PubMed and Wikipedia, but emergent disease names are taken from the
disease related news corpora that publish outbreak news. Ambiguities in disease names
are resolved to some extent by collecting the different names used for a particular
disease.
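
The cleaning step can be illustrated with a minimal Python sketch. The function name
clean_headline and the exact set of delimiter characters are assumptions made for
illustration; the chapter does not list which characters were removed.

```python
import re
import unicodedata


def clean_headline(text: str) -> str:
    """Strip non-ASCII characters and delimiter symbols from a headline."""
    # Decompose accented characters, then drop everything outside ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace delimiter characters with a space (assumed delimiter set).
    no_delims = re.sub(r"[|\t;:~_#*]+", " ", ascii_text)
    # Collapse repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", no_delims).strip()


if __name__ == "__main__":
    print(clean_headline("Ebola\u2019s spread | WHO  update"))
```

Dropping characters outright is the simplest choice; a real pipeline might instead map
typographic quotes and dashes to ASCII equivalents before filtering.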

FETCHING DISEASE RELATED NEWS HEADLINES:-

This stage first collects the disease names from the news corpora. The news headlines
from the non-disease related corpora are then matched against these disease names, and
headlines containing any disease name are extracted. The extracted headlines are used
for feature development and extraction. Those features help to find more disease
related headlines that might not contain any disease name.
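
A minimal Python sketch of the name-matching step is given here. It assumes
case-insensitive whole-word matching, which the chapter does not specify, and the
function name headlines_with_disease_names and the sample data are hypothetical.

```python
import re


def headlines_with_disease_names(headlines, disease_names):
    """Return the headlines that mention at least one disease name."""
    if not disease_names:
        return []
    # Longest names first so multi-word names win over their substrings.
    names = sorted(disease_names, key=len, reverse=True)
    # One case-insensitive pattern with word boundaries (assumed matching rule).
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [h for h in headlines if pattern.search(h)]


if __name__ == "__main__":
    sample = ["Dengue cases rise in Delhi", "Stock markets fall", "Two die of swine flu"]
    print(headlines_with_disease_names(sample, ["dengue", "swine flu", "malaria"]))
```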
[Figure: Disease and non-disease related headlines -> disease headlines containing any
disease name -> counting TF-IDF -> if confidence >= 40, include in disease related
headlines; otherwise discard]

FEATURE DEVELOPMENT AND EXTRACTION:-

This stage uses term frequency-inverse document frequency (TF-IDF), which weights each
term in either a binary or a non-binary way.

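IN BINARY WEIGHTAGE:-

If a term appears in a document it is weighted as 1, else it is weighted as 0:

$$w_{t,d} = \begin{cases} 0, & tf_{t,d} = 0 \\ 1, & tf_{t,d} > 0 \end{cases}$$

where $tf_{t,d}$ is the frequency of term $t$ in document $d$.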
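IN NON BINARY WEIGHTAGE:-

$$w_{t,d} = tf_{t,d} \cdot \log \frac{|D|}{|\{ d' \in D : tf_{t,d'} > 0 \}|}$$

where $tf_{t,d}$ is the frequency of term $t$ in document $d$ and $|D|$ is the number of
documents in the corpus $D$.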

TF-IDF does not take the frequency of an n-gram as the only deciding factor. It also
considers that commonly used words such as "the", "an" and "is" should contribute less,
so they are assigned less weight when features are extracted.
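
The two weighting schemes can be made concrete with a short Python sketch. It assumes
the common TF-IDF form (raw term count times the log of total documents over documents
containing the term); the function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def binary_weight(term, doc_tokens):
    """Binary weighting: 1 if the term occurs in the document, else 0."""
    return 1 if term in doc_tokens else 0


def tfidf_weight(term, doc_tokens, corpus):
    """Non-binary weighting: term frequency times inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    # Number of documents in the corpus that contain the term.
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


if __name__ == "__main__":
    corpus = [["the", "flu", "outbreak"], ["the", "market"], ["the", "flu", "flu"]]
    for term in ("the", "flu", "outbreak"):
        print(term, tfidf_weight(term, corpus[0], corpus))
```

A word such as "the" that occurs in every document gets weight 0 because the logarithm
of 1 is 0, which is the down-weighting of common words described above.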

The features are extracted using a bag-of-words approach in which unigrams, bigrams and
trigrams are considered. The headlines that contain any disease name serve as the input
from which features in the form of unigrams, bigrams and trigrams are extracted. Those
with a confidence of 40 percent or more are considered important features. Confidence
here is the proportion of lines that contain the feature, so features that occur in at
least 40 percent of all lines are used to find more disease related headlines.

A few of the extracted features are: fall ill, hospital, reported with cases/death,
suspected with, cases or dies/death, cases detected, outbreaks, flu, sickened, died of,
disorder, illness, vaccine, uneasiness, patient, doctor, injuries, spread, virus, people
died from, is positive, etc.
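
The confidence filter described above can be sketched in Python as follows. Whitespace
tokenisation and lower-casing are assumptions, and the names ngrams and frequent_ngrams
are hypothetical; only the 40 percent threshold and the n-gram sizes come from the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(headlines, min_confidence=0.40, max_n=3):
    """Keep n-grams that occur in at least min_confidence of the headlines."""
    if not headlines:
        return set()
    doc_freq = Counter()
    for headline in headlines:
        tokens = headline.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        # Count each n-gram once per headline (proportion of lines, not raw count).
        doc_freq.update(seen)
    total = len(headlines)
    return {gram for gram, count in doc_freq.items() if count / total >= min_confidence}


if __name__ == "__main__":
    sample = [
        "two died of flu",
        "man died of malaria",
        "flu cases detected",
        "hospital reports flu cases",
        "child died of dengue",
    ]
    print(sorted(frequent_ngrams(sample)))
```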
ISOLATING DISEASE OUTBREAK NEWS HEADLINES:-

To train the classifier to separate disease related news headlines into outbreak and
non-outbreak news headlines, the outbreak headlines are first filtered from the
non-outbreak news. Four approaches were used for that:-

1. Semantic approach
2. Cosine similarity
3. Naive Bayes
4. Rule based approach
SEMANTIC APPROACH:- It uses sentence level feature extraction from the disease related
corpora to get disease outbreak headlines. The headlines crawled from the disease
related corpora are fed into a system that finds the most frequently occurring terms in
them. A few examples of extracted terms are: outbreak, cases rise in, reemerges,
emerging in, increase in number of cases, at risk, spreading, widespread, grow, plague,
influenza, strikes, reaches, etc. The top 40 features are considered.

COSINE SIMILARITY:- In text retrieval, cosine similarity measures how similar two
strings or sentences are. Each text is represented as a non-zero vector. If A and B are
two such vectors with n components, their cosine similarity is

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

The similarity measure ranges from -1 to 1, where 1 indicates that the two vectors point
in exactly the same direction, 0 indicates that they are decorrelated and -1 indicates
that they are exactly opposite. Because the vectors here hold term frequencies, which
are never negative, the value cannot be negative. So the range 0 to 1 is used to measure
how similar two headlines are when looking for outbreak disease news. Headlines whose
similarity is closer to 1 have many words in common.

The cosine similarity measure is therefore used to find headlines in the non-disease
related corpus that are quite similar to the headlines collected from the outbreak
disease related corpus. This helped to get more outbreak news from the non-disease
related corpus, which contributes to the training set of the classifier that separates
outbreak headlines from non-outbreak headlines.
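
As an illustration, the cosine similarity of two headlines over raw term-frequency
vectors can be computed with the Python sketch below. Whitespace tokenisation and
lower-casing are assumptions, and the example headlines are invented.

```python
import math
from collections import Counter


def cosine_similarity(headline_a: str, headline_b: str) -> float:
    """Cosine similarity of two headlines using term-frequency vectors."""
    vec_a = Counter(headline_a.lower().split())
    vec_b = Counter(headline_b.lower().split())
    # Dot product over the terms the two headlines share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty headline has no direction
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("dengue outbreak in delhi", "dengue outbreak in mumbai"))
```

In this example three of the four words are shared, so the similarity is 3 / (2 * 2) =
0.75; two headlines with no words in common score 0.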
NAIVE BAYES:-

Naive Bayes in NLP gives the polarity value of a sentence. Polarity here tells whether
the semantics of a sentence are positive or negative.

Headlines with negative polarity that contain any disease name are considered outbreak
news. Examples of headlines examined in this step are "No cases of ebola suspected in 2
days" and "drop in chikungunya cases yesterday".

RULE BASED APPROACH:-

The outbreak headlines extracted using the first approach might not actually be disease
outbreak news. Rules are therefore defined to filter such headlines out of the outbreak
disease news.

1. Some headlines report historic outbreaks.

"Scientists are studying the reason behind outbreak happened in 2015"

This headline gets included in the outbreak news if the semantic approach is used, but
it should not be. This was resolved by a rule: if an outbreak headline contains words
like "a year ago", "last year", "years", "months ago" or "months", or contains an
integer value that does not denote the current year, it is excluded from the outbreak
news.

2. Some outbreak news might talk about a hypothesis, not an actual outbreak.

"Billions of people could have died if ebola virus could spread through communication"

Headlines that seem to be outbreak headlines but are actually hypotheses are filtered by
a rule: if an outbreak headline contains any modal verb, it is not actually outbreak
news. In the example above, the modal verb "could" is identified by the part-of-speech
(POS) tagger.

3. Some disease related news contains special keywords.

"4 cases of malaria suspected yesterday"

Headlines containing keywords like "today", "yesterday", "days" or "morning", or
containing any particular day of the week, are considered disease outbreak news.
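
The three rules can be sketched together in Python. This is an illustration under
several assumptions: modal verbs are detected with a fixed word list rather than a POS
tagger, "integer value" is narrowed to four-digit years so that counts such as "4 cases"
are not treated as dates, and the rules are applied in the order listed. The function
name apply_rules and its return labels are hypothetical.

```python
import re

HISTORIC_PHRASES = ("a year ago", "last year", "months ago")
HISTORIC_WORDS = {"years", "months"}
# Stand-in for the MD tag of a POS tagger (assumed word list).
MODAL_VERBS = {"could", "may", "might", "must", "should", "would"}
RECENT_WORDS = {
    "today", "yesterday", "days", "morning", "monday", "tuesday",
    "wednesday", "thursday", "friday", "saturday", "sunday",
}


def apply_rules(headline: str, current_year: int) -> str:
    """Return 'not_outbreak', 'outbreak' or 'undecided' for a candidate headline."""
    text = headline.lower()
    tokens = set(re.findall(r"[a-z]+|\d+", text))
    # Rule 1: historic outbreaks (time phrases or a year other than the current one).
    if any(p in text for p in HISTORIC_PHRASES) or tokens & HISTORIC_WORDS:
        return "not_outbreak"
    years = [int(t) for t in tokens if re.fullmatch(r"(?:19|20)\d{2}", t)]
    if any(year != current_year for year in years):
        return "not_outbreak"
    # Rule 2: hypotheses signalled by modal verbs.
    if tokens & MODAL_VERBS:
        return "not_outbreak"
    # Rule 3: recency keywords mark current outbreak news.
    if tokens & RECENT_WORDS:
        return "outbreak"
    return "undecided"


if __name__ == "__main__":
    print(apply_rules("4 cases of malaria suspected yesterday", 2016))
```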
DATE EXTRACTION:- Information about the outbreak news can be useful for many purposes.
Knowing the date or duration of an outbreak is important and may help immigrants make
decisions. The date on which the news was published is taken as the date of the
outbreak, on the assumption that the outbreak happened on that day or only a few days
before.

CHOICE OF CLASSIFIER:-

In total, 756 outbreak news headlines and 1145 non-outbreak news headlines were
collected. After collecting training data for both classes, features are extracted to
build the feature vector, which is fed into the classifier model for learning. To decide
which classifier to use, the evaluation metric is computed after training each of the
classifier models.

EVALUATION METRIC

The evaluation metric is the accuracy of the classifier, i.e. the number of headlines it
classifies correctly out of the total number of input headlines given to it. Overfitting
is guarded against with 10-fold cross validation, in which 9 folds are used to train the
classifier and the remaining one is used for testing. The classifier that gives the best
accuracy is taken to be the best trained. The F1 score is also considered when measuring
performance.
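
The 10-fold split and the accuracy measure can be sketched in plain Python as follows.
The helper names k_fold_indices and accuracy are hypothetical, and the split is
sequential for simplicity; in practice the headlines would normally be shuffled first so
that every fold contains both classes.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def accuracy(y_true, y_pred) -> float:
    """Fraction of headlines whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    # 756 outbreak + 1145 non-outbreak headlines = 1901 samples.
    for train, test in k_fold_indices(756 + 1145, k=10):
        print(len(train), len(test))
```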

F1 SCORE:- It is the harmonic mean of recall and precision. Precision is the ratio of
relevant instances among the retrieved instances, while recall is the ratio of relevant
instances retrieved out of all relevant instances.
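
A small Python sketch shows how precision, recall and the F1 score relate for a binary
outbreak / non-outbreak labelling. The function name and the toy labels are
illustrative and are not results from this work.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: relevant among retrieved. Recall: retrieved among relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```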

TRAINING OF CLASSIFIER:-

The classifier has to be trained so that it can accurately classify the test data, so a
feature vector has to be built. The feature vector is constructed by combining sentence
and semantic features.
The semantic features discussed here are unigrams and bigrams, so the classifier is
trained on both sentences and features. The top k unigrams and bigrams are found for
both the outbreak and the non-outbreak news. Stop words, features common to both
classes and words that are not in the file of features are ignored, i.e. they are given
zero weight. The remaining features are weighted according to whether or not they are
top k features: the top k features of the outbreak disease class are given a high weight
of 10^6, and the features of the non-outbreak disease class are given a weight of 1. The
classifier is trained on the sum of the keyword weights in a sentence.

BINARY CLASSIFICATION

If Sum >= 1000000    // Outbreak news

Else                // Non-outbreak news

where Sum is the sum of the weights of all relevant keywords in a sentence, with the
high weight assigned to keywords of the outbreak disease class.

Various binary classifier models have been tried for this, such as SVM, random forest
and naive Bayes. The support vector machine gives the highest accuracy, so it is used
for classification.
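
The weighted keyword sum and its threshold can be illustrated with the Python sketch
below. It covers unigram features only and applies the threshold directly, whereas the
chapter feeds the sum to a trained classifier; the feature sets and function names are
invented for the example.

```python
OUTBREAK_WEIGHT = 10**6   # weight of a top-k outbreak feature
NON_OUTBREAK_WEIGHT = 1   # weight of a non-outbreak feature


def keyword_sum(headline, outbreak_features, non_outbreak_features):
    """Sum the weights of the relevant keywords in a headline."""
    total = 0
    for token in headline.lower().split():
        in_outbreak = token in outbreak_features
        in_non_outbreak = token in non_outbreak_features
        if in_outbreak and in_non_outbreak:
            continue  # features common to both classes get zero weight
        if in_outbreak:
            total += OUTBREAK_WEIGHT
        elif in_non_outbreak:
            total += NON_OUTBREAK_WEIGHT
        # any other word (stop words, unknown words) contributes nothing
    return total


def classify(headline, outbreak_features, non_outbreak_features):
    """Label a headline by comparing its keyword sum with the threshold."""
    total = keyword_sum(headline, outbreak_features, non_outbreak_features)
    return "outbreak" if total >= OUTBREAK_WEIGHT else "non-outbreak"


if __name__ == "__main__":
    fo = {"outbreak", "spreading", "cases"}   # hypothetical outbreak features
    fn = {"vaccine", "study", "cases"}        # hypothetical non-outbreak features
    print(classify("Dengue outbreak reported in city", fo, fn))
```

Because the two weights differ by six orders of magnitude, a single outbreak keyword
dominates any realistic number of non-outbreak keywords in a headline.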

CLASSIFICATION OF DISEASE RELATED NEWS HEADLINES:-

For the final stage of the proposed model, i.e. the classification of disease related
news headlines into incidence reporting, trend reporting and future predicting
headlines, the outbreak disease headlines already serve as the trend reporting
headlines. The incidence reporting headlines and future prediction headlines still need
to be isolated so that the classifier can be trained accordingly.

Filtering of incidence reporting headlines and future prediction headlines from
non-outbreak news headlines:-
The two approaches used are:

1. POS tagging
The part-of-speech tagger is a natural language processing tool that tags the words of
a sentence with the parts of speech they belong to. Example:

hypertension may cause diabetes

POS tagger result:- hypertension/NN may/MD cause/VB diabetes/NN

where NN stands for noun, MD for modal verb and VB for the base form of a verb.

Part-of-speech tagging has been very useful for identifying most past or future health
events. The VBD and VBN tags help to find most past health events in the non-outbreak
disease related news, while the MD tag is used to find most future predicting
headlines.
MD verbs include may, must, should and would.

TAG    MEANING                  CLASS
VBD    Verb, past tense         Incidence reporting
VBN    Verb, past participle    Incidence reporting
MD     Modal verb               Future predicting

2. Semantic features
The semantic structure of the sentences is used. All facts and studies are included in
the future predicting headlines. Headlines containing features like "linked to",
"causes", "tied to", "could" and "says" are classified as future predicting disease
related news headlines. Examples:
ebola is linked with heart attacks
malaria causes fever
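
The two approaches above can be combined in a short Python sketch. It takes headlines
that are already POS tagged as (word, tag) pairs, so no tagger is needed to run it; the
tags in the examples were written by hand in Penn Treebank style. The precedence (modal
verbs, then semantic cues, then past-tense tags), the cue "linked with" taken from the
example headline, and the function name are assumptions.

```python
FUTURE_CUES = ("linked to", "linked with", "causes", "tied to", "could", "says")
PAST_TAGS = {"VBD", "VBN"}


def classify_non_outbreak(tagged_headline):
    """Map a POS-tagged headline to 'future_prediction', 'incidence_reporting' or None."""
    tags = {tag for _, tag in tagged_headline}
    # Pad with spaces so cues match whole words only.
    text = " " + " ".join(word.lower() for word, _ in tagged_headline) + " "
    # Modal verbs (MD) indicate future predicting headlines.
    if "MD" in tags:
        return "future_prediction"
    # Semantic cues for facts and studies also indicate future predicting headlines.
    if any(f" {cue} " in text for cue in FUTURE_CUES):
        return "future_prediction"
    # Past tense or past participle verbs indicate incidence reporting.
    if tags & PAST_TAGS:
        return "incidence_reporting"
    return None


if __name__ == "__main__":
    example = [("hypertension", "NN"), ("may", "MD"), ("cause", "VB"), ("diabetes", "NN")]
    print(classify_non_outbreak(example))
```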

EVALUATION METRIC AND TRAINING DATA:-

There are 756 trend reporting headlines, 503 future prediction headlines and 642
incidence reporting headlines used for training the classifier model. Various
classifiers such as random forest, SVM and naive Bayes have been tried on the training
data. Only unigrams are considered for building the feature vector. The weights are
assigned in three ranges. The evaluation metric considered here is accuracy only, not
recall or precision.

ALGORITHM

U: Unigrams
B : Bigrams
T : Trigrams
C : .75* Hf
R : Predefined Rules
Fd : Features for disease related
Fo: Features for outbreak disease related news data

INPUT:

Hd : Disease related headlines
Hn : Non-disease related headlines
Dn : List of disease names

OUTPUT:

On : Non-outbreak news headlines
Od : Outbreak disease related news headlines
Hf : Headlines containing any disease name
Hp : Incidence reporting headlines
Hfp : Future prediction reporting headlines

For each h in Hn
    For each d in Dn
        If d in h
            Hf = Hf U h
            Hd = Hd U h
For each h in Hf
    Find count of (U, B, T)
    If count >= C
        Fd = Fd U (U, B, T)
For each h in Hn
    If (Fd in h) and (h not in Hd)
        Hd = Hd U h

For each h in Hd
    Fo = Fo U (U, B, T)
    If Fo in h
        Od = Od U h
    Else if (0.2 < Cosine_Similarity(h, Od) < 1)
        Od = Od U h
    Else if ((Polarity(h) > 0) and (Fo in h))
        Od = Od U h
    Else if (h satisfies R)
        Od = Od U h
    Else
        On = On U h

For each s in On
    data = Pos_tag(s)
    For each pair in data
        If pair[1] == MD
            Hfp = Hfp U s
        Else if pair[1] == VBN or pair[1] == VBD
            Hp = Hp U s

Return On, Od, Hf, Hp, Hfp

ALGORITHM TO EXTRACT AND TRAIN CLASSIFIER MODEL

Z : Unigrams

INPUT:
Sum = 0
On : Non-outbreak news headlines
Od : Outbreak disease related news headlines
OUTPUT:
Fo : Unigrams as features for outbreak disease news headlines
Fn : Unigrams as features for non-outbreak disease news headlines
Fu : F-score for unigrams

For each S in Od
    Find Z
    If count(Z) > .65
        Fo = Fo U Z
For each S in On
    Find Z
    If count(Z) > .65
        Fn = Fn U Z

For each S in On
    Token = Tokenize(S)
    If (Token in Fo and Token not in Fn)
        Sum = Sum + 1000000
    If Sum >= 1000000
        Classify in outbreak disease related headlines
    Else
        Classify in non-outbreak disease related headlines
Calculate FScore
Return FScore
