Comparing RandomForestClassifier and Naive Bayes in 2024

Natural language processing (NLP) is one of the most exciting artificial intelligence (AI) technologies today and is widely used across industries and functions. Text recognition and classification problems are popular among data scientists, yet practitioners often find it hard to choose the right machine learning model for a text classification problem.

This write-up therefore compares the performance of two popular machine learning techniques on a multi-output text classification problem: the RandomForest classifier and the Multinomial Naive Bayes classifier. Each model was applied separately to the multi-output classification of messages received during disasters, with the goal of predicting the category or categories a message belongs to.

The dataset came from ‘Figure8’ and includes two .csv files: ‘message.csv’ and ‘categories.csv’. The ‘message.csv’ file contains the columns message (the raw message received during a disaster), message_id, and genre (social media, news, or direct text message), while ‘categories.csv’ contains the message_id and the categories each message belongs to. There are 36 possible categories, including medical related, search and rescue, request help, and military. The two files were cleaned and preprocessed into a single dataframe used for model preparation and building.

Distribution of Message Categories

Feature Engineering:

As part of model building, each message had to be processed and its important features extracted. A tokenize function was written to apply the necessary text-processing steps to any given message. These steps are:

Normalization: converts all letters to lowercase and removes punctuation marks.

Tokenization: breaks each document/row/sentence into tokens (words) and returns a list of all the tokens.

Lemmatization: reduces each word to its base form (for example, a verb to its root) and strips surrounding whitespace.

Removal of stopwords: drops the very common, uninformative words (the stop words) found in the tokenized documents.
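The four steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual function: the stop-word set and the lemma lookup table are tiny made-up stand-ins, and a real implementation would more likely rely on a full stop-word list and a proper lemmatizer such as NLTK's WordNetLemmatizer.

```python
import re

# Tiny illustrative stand-ins (assumptions, not from the original work)
STOP_WORDS = {"a", "an", "the", "is", "are", "we", "and", "of", "to", "in"}
LEMMAS = {"needs": "need", "needed": "need", "houses": "house", "flooded": "flood"}


def tokenize(text):
    """Return a list of cleaned tokens for one message."""
    # Normalization: lowercase, then replace anything that is not a letter or digit with a space
    normalized = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Tokenization: split on whitespace
    tokens = normalized.split()
    # Lemmatization (toy lookup) and whitespace stripping
    lemmas = [LEMMAS.get(tok, tok).strip() for tok in tokens]
    # Stop-word removal
    return [tok for tok in lemmas if tok not in STOP_WORDS]


print(tokenize("We need water! The houses are flooded."))
# → ['need', 'water', 'house', 'flood']
```

Keeping all steps inside one function is convenient because the function can be handed directly to a vectorizer as its tokenizer.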

Thus the tokenize function returns for each message a list of words that have undergone all the processing stages. Further, important features were extracted using the following techniques:

CountVectorizer: builds a document-term matrix in which each entry is the number of times a token occurs in a message (0 if the token does not occur).

Term Frequency-Inverse Document Frequency (TF-IDF) transformer: rescales those counts so that each entry is the term frequency multiplied by the inverse document frequency, then normalizes each row.

By the end of the feature extraction stage, the independent variable (the messages) is in matrix form, while the dependent variable is a matrix of 1s and 0s over the 36 categories in the dataset. Both are then ready for modelling.
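To make the two feature-extraction steps concrete, here is a small sketch using scikit-learn's CountVectorizer and TfidfTransformer on two made-up messages (they are not from the disaster dataset). It shows that CountVectorizer stores raw counts, so a repeated word gets a value above 1, and that the TF-IDF step rescales each row to unit length by default.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up messages, for illustration only
docs = ["water water food", "need food"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix of raw token counts

# Column order of the matrix, recovered from the fitted vocabulary
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

tfidf = TfidfTransformer()  # use_idf=True and L2 row normalization by default
weights = tfidf.fit_transform(counts)

print(vocab)             # → ['food', 'need', 'water']
print(counts.toarray())  # first row is [1 0 2]: 'water' occurs twice
print(weights.toarray().round(2))
```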

Similarity of categories using a seaborn heatmap

Model Building:

To train a machine learning model for text classification, a pipeline was designed that applies all the transformations described above, followed by a multi-output classifier wrapped around the chosen estimator. Since the goal is to compare two machine learning techniques, the pipeline was built once per estimator: ‘model1’ uses the RandomForest classifier and ‘model2’ uses Multinomial NB.

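from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
# Set one estimator per run: RandomForestClassifier() for model1, MultinomialNB() for model2
estimator = RandomForestClassifier()
pipeline = Pipeline([
    ('transformer', Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer())
    ])),
    ('clf', MultiOutputClassifier(estimator))
])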

A grid search was set up to run the pipeline through every combination of the parameters provided and find the best-performing combination for each model, which was then used to train that model. The parameters, the grid search calls, and the best-performing values selected for each method are shown below.

parameters = {
    'transformer__vect__max_features': [5000, 3000, 1000],
    'transformer__vect__ngram_range': ((1, 1), (1, 2)),
    'transformer__tfidf__use_idf': (True, False)
}

Using RandomForestClassifier:

# pipeline built with estimator = RandomForestClassifier()
model1 = GridSearchCV(pipeline, param_grid=parameters)
model1.fit(X_train, Y_train)

Using Multinomial Naive Bayes:

# pipeline built with estimator = MultinomialNB()
model2 = GridSearchCV(pipeline, param_grid=parameters)
model2.fit(X_train, Y_train)

Here ‘X_train’ and ‘Y_train’ are the training sets (80% of the dataset) for the independent and dependent variables respectively. The parameters are: max_features, which limits the CountVectorizer vocabulary to the most frequent tokens; ngram_range, which sets how many consecutive words can be combined into a single feature; and use_idf, which tells the TF-IDF transformer whether to apply the inverse document frequency (the idf part of tf-idf) or not.
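The nested parameter names can be easier to follow on a toy example. The sketch below rebuilds the same pipeline shape on eight made-up messages with two made-up label columns, using MultinomialNB and the default tokenizer so that it runs quickly on its own. The data, the reduced grid, and cv=2 are assumptions chosen for illustration, and the 80/20 train/test split is skipped because the toy set is so small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages and two made-up label columns: [water, fire]
X_toy = [
    "we need clean water", "the building is on fire",
    "no drinking water left", "fire spreading near the school",
    "water supply is cut", "smoke and fire in the market",
    "please send water bottles", "forest fire close to village",
]
Y_toy = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

pipeline = Pipeline([
    ("transformer", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("clf", MultiOutputClassifier(MultinomialNB())),
])

# Each key walks the nesting: outer step, inner step, then the parameter name
param_grid = {
    "transformer__vect__ngram_range": [(1, 1), (1, 2)],
    "transformer__tfidf__use_idf": [True, False],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(X_toy, Y_toy)

print(search.best_params_)
print(search.predict(["we have no water"]))  # one row, one column per label
```

The double underscore is how scikit-learn addresses a parameter of a nested step, which is why every key starts with the outer step name ‘transformer’.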

Best performing:

model1.best_params_
{'transformer__tfidf__use_idf': False,
 'transformer__vect__max_features': 1000,
 'transformer__vect__ngram_range': (1, 2)}

model2.best_params_
{'transformer__tfidf__use_idf': False,
 'transformer__vect__max_features': 1000,
 'transformer__vect__ngram_range': (1, 1)}

Results:

For the RandomForest classifier, the best parameters are: use_idf set to False, meaning only the term frequency is used, so a word's weight depends solely on how often it appears in a message, without down-weighting words that are common across all messages; max_features of 1000, the smallest of the values tested; and an ngram_range of (1, 2), meaning single words and two-word combinations. MultinomialNB returns the same best parameters except for the ngram_range, which is (1, 1), meaning single words only.

The models were then used to predict on the test data, and their performance metrics were compared. An evaluate_model function was written that takes in the dataframe, y_pred, y_test, and the column labels, and returns a dataframe of the key score metrics (f1-score, accuracy, precision, and recall) for each category.

Among these metrics, precision seems to be the most suitable one here for choosing between the two models. Precision is the number of true positives divided by the sum of true and false positives, so it measures how many of the messages a model assigns to a category (the predicted 1s) actually belong to it. Favouring it helps minimize false positives, and thus incorrect predictions of disaster categories. A comparative plot of the two models' precision is shown below.
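As a worked illustration of precision computed per category, the sketch below counts true and false positives column by column. The function name precision_per_category and the toy labels are hypothetical; this is not the evaluate_model function from the original work, only a small stand-in for its precision part.

```python
def precision_per_category(y_true, y_pred, labels):
    """Precision = TP / (TP + FP), computed separately for each category column."""
    scores = {}
    for j, label in enumerate(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 0)
        # If the model never predicts this category, precision is undefined;
        # this sketch reports 0.0 in that case (an arbitrary choice)
        scores[label] = tp / (tp + fp) if (tp + fp) else 0.0
    return scores


# Made-up true labels and predictions for four messages and two categories
labels = ["water", "fire"]
y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
y_pred = [[1, 0], [1, 1], [1, 1], [0, 0]]

scores = precision_per_category(y_true, y_pred, labels)
print(scores)  # water: 2 TP, 1 FP → 2/3; fire: 1 TP, 1 FP → 0.5
```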

Model performance comparison

Discussion:

The performance plot shows that the RandomForest classifier performs better for most of the categories in this multi-output classification problem. This assumes the default of 10 trees (the n_estimators default in scikit-learn releases before 0.22). If more trees lead to better performance, a further grid search over the number of trees may improve the RandomForest results for this problem. The Naive Bayes model, by contrast, seems to perform better for categories with more training data, such as ‘aid-related’, ‘direct_report’, ‘water’, and ‘medical_related’.
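A search over the number of trees could be expressed with one extra grid key, sketched below. The candidate values are illustrative assumptions, not values tested in this work, and the default tokenizer is used so the snippet stands alone.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("transformer", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# "clf" is the pipeline step, "estimator" is the model wrapped by
# MultiOutputClassifier, and "n_estimators" is the forest's number of trees.
# The candidate values below are illustrative only.
forest_grid = {"clf__estimator__n_estimators": [10, 50, 100]}

# Sanity check: every key must be a real parameter of this pipeline
assert all(key in pipeline.get_params() for key in forest_grid)
```

Merging forest_grid into the existing parameters dictionary would multiply the number of fits by three, so the search time grows accordingly.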

Applying the models also sheds a different light on building machine learning models. While the precision values appear lower for categories with larger training data size, the models seem to predict such categories more reliably on new text. For instance, the text ‘Our house is being raided by gun shots and we hear the sound of bombs dropping’ correctly returns Military as the category, while the text ‘With the way things are going our house is going to be razed and the fire will not stop here’ does NOT return Fire as the appropriate category. Check the notebook for more on this.

Conclusions and Recommendations:

This write-up has been able to show a test case of utilizing RandomForest Classifier and Naive Bayes classifier in a multioutput classification of texts. The results can be summarized as follows:

1. Using grid search with a machine learning model is helpful in choosing the best parameters, although it can be time-consuming. Here the search showed:

  1. A choice of 1000 for max_features in the CountVectorizer, among 1000, 3000, and 5000
  2. A preference not to use inverse document frequency in the TF-IDF transformer
  3. An ngram_range that varies depending on the model

2. While the RandomForest classifier gives better performance in this multi-output classification for categories where the available training data is small, Multinomial Naive Bayes performs better for categories with more training data.

3. The more training data a category has, the more reliably the models identify it on new text, but this does not translate into higher performance metric values.

Further work will compare the recall and f1-score of the two models to see whether the trade-off between the share of predictions that are relevant (precision) and the share of relevant messages that are found (recall) gives further insight into the models' performance. Also, a grid search on the RandomForest classifier will help in choosing the best-performing number of features.

To learn more about this work and the web app built with the machine learning model, check the GitHub repository here

About the Author:

Chijioke Idoko has a bachelor's degree in physics and a master's degree in geophysics. At the time of publication of this post, he is completing a Nanodegree in Data Science from Udacity. He has extensive experience in computational modelling and has peer-reviewed publications in scientific journals.