Solution to Spambase Take-Home Lab ================================== **Exercise 1.** *Getting and preparing the data.* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin by downloading the file to your local machine: .. code-block:: bash # download zip file $ wget https://archive.ics.uci.edu/static/public/94/spambase.zip # unpack file $ unzip spambase.zip # check files $ ls -l -rwx------ 1 jstubbs jstubbs 702942 May 22 2023 spambase.data -rwx------ 1 jstubbs jstubbs 6429 May 22 2023 spambase.DOCUMENTATION -rwx------ 1 jstubbs jstubbs 3566 May 22 2023 spambase.names -rw-rw-r-- 1 jstubbs jstubbs 125537 Feb 15 10:18 spambase.zip $ file spambase.data spambase.data: CSV text $ head spambase.data 0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,1 0.21,0.28,0.5,0,0.14,0.28,0.21,0.07,0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0,1.59,0,0.43,0.43,0,0,0,0,0,0,0,0,0,0,0,0,0.07,0,0,0,0,0,0,0,0,0,0,0,0,0.132,0,0.372,0.18,0.048,5.114,101,1028,1 0.06,0,0.71,0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0,1.16,0.06,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.06,0,0,0.12,0,0.06,0.06,0,0,0.01,0.143,0,0.276,0.184,0.01,9.821,485,2259,1 Note that the file has no header row. Let's add that: .. code-block:: bash # open the file in an editor, e.g., vim, and paste the following line at the top "word_freq_make,word_freq_address,word_fre_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,word_freq_will,word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Class" Now, read the file into a pandas DataFrame: .. code-block:: python3 >>> import pandas as pd >>> data = pd.read_csv("spambase.csv") We can check and print the number of rows and columns using different methods; e.g., the ``info()`` funcion: .. code-block:: python3 >>> data.info() RangeIndex: 4601 entries, 0 to 4600 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 word_freq_make 4601 non-null float64 1 word_freq_address 4601 non-null float64 . . . There are 4,601 rows and 58 columns. **Exercise 2.** *Data Exploration.* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, we compute standard statistics for each of the columns in the dataset, including: count, mean, standard deviation, min and max. .. code-block:: python3 >>> data.describe() .. figure:: ./images/spambase-describe.png :width: 1000px :align: center Next, we determine if there are any duplicate rows in the data set. If there are any duplicate rows, we remove them. .. code-block:: python3 # look for duplicate entries in the data >>> data.duplicated().sum() 391 >>> data = data.drop_duplicates() Next, we determine if there are any missing values in the dataset. There are different ways to do this. One way it to look at the output of ``data.info()`` -- it shows that all columns contain 4,601 non-null rows, the number of total rows in the dataset. Alternatively, here is a one-liner that provides a True/False response: .. code-block:: python3 >>> data.isnull().values.any() False Finally, we determine how many rows are spam and how many are not spam. We know this is controlled by the ``Class`` column. Again, there are different techniques. For example, we could use a filter to check the number of values for each class label: .. code-block:: python3 >>> data[data['Class'] == 0] . . . [2531 rows x 58 columns] >>> data[data['Class'] == 1] . . . [1679 rows x 58 columns] We see there are 2,531 non-spam and 1,679 spam rows. We could also create a count plot to visualize this: .. code-block:: python3 >>> import seaborn as sns >>> sns.countplot(data=data,x='Class') >>> plt.show() .. figure:: ./images/spambase-countplot.png :width: 1000px :align: center **Exercise 3.** *Split and Fit.* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We split the data into training and test datasets using the ``train_test_split()`` function. To make sure our split is reproducible, we use the ``random_state`` parameter, and to ensure that it maintains roughly the proportion of spam and non-spam emails we use the ``stratify`` parameter: .. code-block:: python3 >>> from sklearn.model_selection import train_test_split >>> X = data.drop('Class',axis=1) >>> y = data['Class'] >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1) With the data split, we can train the model. .. code-block:: python3 >>> from sklearn.linear_model import SGDClassifier >>> clf = SGDClassifier(loss="perceptron", alpha=1.0, random_state=1) >>> clf.fit(X_train, y_train) **Exercise 4.** *Validation and Assessment.* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We evaluate our model and the test data: .. code-block:: python3 >>> from sklearn.metrics import accuracy_score >>> accuracy_test=accuracy_score(y_test, clf.predict(X_test)) >>> print('Accuracy on test data is : {:.2}'.format(accuracy_test)) Accuracy on test data is : 0.78 as well as on the training data: .. code-block:: python3 >>> accuracy_train=accuracy_score(y_train, clf.predict(X_train)) >>> print('Accuracy on train data is : {:.2}'.format(accuracy_train)) Accuracy on train data is : 0.77 We plot a confusion matrix for our model: .. code-block:: python3 >>> from sklearn.metrics import ConfusionMatrixDisplay >>> cm_display = ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Blues,normalize=None) .. figure:: ./images/spambase-confusion.png :width: 1000px :align: center This shows that the model predicted that spam emails were non-spam 136 times and predicted that non-spam emails were spam 153 times. The model also correctly predicted 606 non-spam emails and 368 spam emails.