Lab 7
Introduction
Spam-bombing, or spamming, is the practice of sending unwanted messages that often carry an even seedier purpose, such as a scam. As most of us are aware, there is a target population that is particularly vulnerable. Spamming takes many forms, arriving through SMS messages as well as email. Using machine learning it is possible to classify messages, whether emails or SMS, as spam or non-spam (ham in this case). For such classification tasks, support vector classifiers (SVC) are a good alternative, since these problems are of medium complexity and the datasets are not of excessively large dimensionality or size. The spam filter was traditionally a task assigned to Naive Bayes classifiers, as these were commonly used for text-based tasks.
It is important to remember that SVMs, and more specifically for this report SVCs, are discriminative models. This means that they separate data points, or instances, into the classes they belong to. In the simplest case of a linear separation, this is done by finding the optimum separating line, or hyperplane. As seen in class, the optimum hyperplane maximizes the margin, that is, the distance to the support vectors of each class.
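As a quick refresher (this is the standard textbook hard-margin formulation, not something specific to this lab's data), the problem seen in class can be written as:

\min_{w, b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \,(w^\top x_i + b) \ge 1, \; i = 1, \dots, n

Since the margin equals 2 / \lVert w \rVert, minimizing \lVert w \rVert maximizes the margin; the instances for which the constraint holds with equality are the support vectors.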
Report
DO NOT FORGET TO INCLUDE YOUR IMAGES IN THE REPORT AND ATTACH YOUR CODE AND PDF!
For this report you will receive a dataset consisting of a significant number of instances that have previously been labeled as either spam or non-spam (ham). This data can be downloaded from the class website here. Once you have downloaded the file, there are preliminary steps you need to complete. As stated in the EDA module, before starting your classification task you must carry out proper exploratory data analysis and identify the following: what is the ratio of spam to non-spam instances? This must not only be discussed in the report; the corresponding figure must also be plotted. So do not forget to import the proper Python library.
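A minimal sketch of that EDA step is shown below. The file name "spam_dataset.csv" and the column names "label" and "message" are assumptions; adjust them (and the separator or header settings) to the file you actually downloaded.

import pandas as pd
import matplotlib.pyplot as plt

# Load the labeled messages -- file name and column names are assumed here
df = pd.read_csv("spam_dataset.csv", names=["label", "message"])

# Ratio of spam to non-spam (ham) instances
counts = df["label"].value_counts()
print(counts)
print("spam ratio:", counts.get("spam", 0) / len(df))

# Bar plot of the class distribution for the report
counts.plot(kind="bar", title="Spam vs. ham counts")
plt.xlabel("class")
plt.ylabel("count")
plt.show()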
Going back to the data, your file will consist of two columns: one with the labels and another with the messages. You will need to generate features from this text. It is important to find the presence of common words, because later we will use words to differentiate spam from non-spam. For this purpose we use Counter, so from collections import Counter. With the information from Counter, report the top 15 words from spam and from non-spam messages and plot them. In the report, what can you say about these words? What kinds of words are they? Are they complex words; for example, are they names, places, things, or connectors?
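One possible way to get those counts is sketched below. It assumes the df from the EDA sketch above, and it uses a very naive whitespace tokenization; you may want to lowercase, strip punctuation, or remove stop words first.

from collections import Counter

spam_words = Counter()
ham_words = Counter()

# Count word occurrences separately for spam and ham messages
for label, message in zip(df["label"], df["message"]):
    tokens = str(message).lower().split()  # naive tokenization
    if label == "spam":
        spam_words.update(tokens)
    else:
        ham_words.update(tokens)

print("Top 15 spam words:", spam_words.most_common(15))
print("Top 15 ham words:", ham_words.most_common(15))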
In a previous lesson we talked about methods including bag of words, vectorization, and embeddings. For this lab we will use the scikit-learn implementation: sklearn.feature_extraction.text.CountVectorizer. We use such approaches to create new features from the data: from the documents we build feature vectors. This is part of feature engineering.
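Fitting the vectorizer on the message column could look like the sketch below; the variable names are carried over from the earlier sketches and are assumptions, not requirements.

from sklearn.feature_extraction.text import CountVectorizer

# Turn each message into a vector of word counts (bag-of-words representation)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["message"].astype(str))

print(X.shape)  # (number of messages, vocabulary size)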
Regarding the model construction phase, your report must describe, as indicated below, the results from training the model using the default parameters. You must include the following metrics: precision, recall, and accuracy. It is also necessary to provide the confusion matrix. Generating the classification report is acceptable as well.
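As a reference for computing those metrics, a sketch is shown below. It assumes you already have true test labels y_test and predictions y_pred (these names are assumptions), with the labels encoded as 0/1 so that spam is the positive class.

from sklearn import metrics

# y_test and y_pred are assumed to exist from your evaluation step
print("accuracy :", metrics.accuracy_score(y_test, y_pred))
print("precision:", metrics.precision_score(y_test, y_pred))
print("recall   :", metrics.recall_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))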
Afterwards, you must vary the values of C and gamma in the training loop. If you think changing the kernel would make a difference, go ahead. You must train the model and measure the indicated metrics (precision, recall, and accuracy). You can do one of the following: 1) include the test set inside this training loop, or 2) take the model that achieves the highest performance metrics and evaluate it on the test split. Your discussion will change depending on your choice, and you must explain what you did. Both approaches are acceptable for this exercise. Finally, report the confusion matrix for the best model and explain what it represents in terms of how your model is predicting messages.
Finally, tell us: is your model good at discriminating between spam and non-spam? Is it sending non-spam to your spam list? Which is worse?
Model Construction and Parameter Tuning
For model construction we will use SVC, so we must import several modules: from sklearn import model_selection, metrics, svm.
Clearly, prior to this you must also have imported feature_extraction. According to your EDA you have two classes, spam and non-spam (ham), but your SVC will need the labels to be binary. So go ahead and recode these classes. This is good practice; remember to always check for it. You cannot imagine how many times models fail simply because the label is not the right type of variable.
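One simple way to do that recoding is sketched below, again assuming the df and column names from the earlier sketches.

# Map the text labels to a binary target: ham -> 0, spam -> 1
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# Sanity check: make sure no label was left unmapped
assert df["label_num"].isna().sum() == 0

y = df["label_num"].values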
Now split the dataset; this is why we loaded model_selection. Please read about how to use model_selection.train_test_split(). It is recommended that you split the data into roughly 70% training and 30% testing.
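A 70/30 split could then be obtained as below, where X and y are the feature matrix and binary labels built in the earlier sketches; the random_state value is only for reproducibility, and stratify is optional.

from sklearn import model_selection

# 70% training, 30% testing; stratify keeps the spam/ham ratio in both splits
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)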
At this point you should start training your SVC.
Using the information from the documentation you can construct your SVM classifier. An initial approach is to use the default parameters and measure the performance metrics, including the confusion matrix. For the report, use the default parameters on the training set and measure these metrics. Ask yourself: can these values be improved? Start exploring the C and gamma parameters. For both, create a variable such as C=(x,y,z) and use it within a training loop for sklearn.svm.SVC().
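A sketch of the baseline model and of one such training loop is shown below. The specific C and gamma values are placeholders, not recommendations, and evaluating inside the loop on the test split corresponds to option 1 described above.

from sklearn import svm, metrics

# Baseline: default parameters
clf = svm.SVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("default SVC accuracy:", metrics.accuracy_score(y_test, y_pred))

# Simple exploration of C and gamma (placeholder values)
C_values = (0.1, 1, 10)
gamma_values = (0.01, 0.1, 1)

for C in C_values:
    for gamma in gamma_values:
        clf = svm.SVC(C=C, gamma=gamma)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(f"C={C}, gamma={gamma}, "
              f"accuracy={metrics.accuracy_score(y_test, y_pred):.3f}, "
              f"precision={metrics.precision_score(y_test, y_pred):.3f}, "
              f"recall={metrics.recall_score(y_test, y_pred):.3f}")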
Further tuning: the real tuning is done with Grid Search. As its name implies, it searches through a grid of parameters to find the best combination. Looking at the documentation, you will find how to properly set up your search for the best combination of C, gamma, and kernel.
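A possible grid-search setup is sketched below. The grid values are placeholders to widen or narrow based on your earlier exploration, and cv=5 runs 5-fold cross-validation on the training split.

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Placeholder grid of C, gamma, and kernel values
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
}

grid = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("best cross-validation score:", grid.best_score_)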