---
library_name: transformers
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
datasets:
- AbdulHadi806/mail_spam_ham_dataset
pipeline_tag: text-classification
---

# Model Card for Model ID

Text classification model for spam detection (deep learning project).

## Model Details

### Model Description

Model developed for the "Deep Learning with Python" course project.

- **Developed by:** Diavila Rostaing Engandzi
- **Model type:** Binary text classification
- **Language(s) (NLP):** English
- **Finetuned from model:** DistilBERT

### Model Sources

- **Demo:** https://huggingface.co/picket-cliff/deepl-project

## Uses

The model is intended to be used to filter spam from emails. Clone the demo repository and run its `app.py` file to see the model in action.

## Training Details

### Training Data

A subset of the `email_data.csv` dataset: a benchmark dataset for email classification containing around 5,000 emails labeled as either "ham" or "spam". To evaluate the model, the data was split into training and test sets (80/20 split).

#### Preprocessing

Deep learning models cannot process raw text; they require numerical tensors. We used the Hugging Face `DistilBertTokenizer`.

1. **Sub-word tokenization:** Instead of splitting on spaces (which struggles with typos and rare words), DistilBERT uses WordPiece tokenization. An out-of-vocabulary word is broken into known sub-words, preventing the model from encountering unknown tokens.
2. **Special tokens:** The tokenizer automatically prepends the `[CLS]` (classification) token to the start of the sequence and appends the `[SEP]` (separator) token at the end. The final hidden state corresponding to the `[CLS]` token is what the model uses for the binary classification decision.
3. **Truncation and padding:** Transformer models require fixed-size input matrices for batch processing. Based on the length distribution from our EDA, we set `max_length = 128`.
   - Sequences longer than 128 tokens were truncated.
   - Sequences shorter than 128 tokens were padded with the `[PAD]` token (ID 0).
4. **Attention masks:** To prevent the model from performing self-attention over meaningless padding tokens, the tokenizer generates an `attention_mask` (an array of 1s for real tokens and 0s for padding).

## Evaluation

Results were obtained by training on the training set and then evaluating the model on the held-out test set. Results are compared against a baseline (dummy classifier) for reference.

### Testing Data, Factors & Metrics

#### Testing Data

The held-out 20% test split of the dataset described above.

#### Metrics

Accuracy and F1 score (macro and weighted averages).

### Results

Evaluated on the 80/20 split, the model achieved:

- Accuracy: 99.10%
- Macro average F1-score: 0.98
- Weighted average F1-score: 0.99

For reference, the dummy classifier achieved 86.6% accuracy.

#### Summary

The model substantially outperforms the dummy baseline (99.10% vs. 86.6% accuracy), with strong F1 scores under both macro and weighted averaging.
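The macro and weighted F1 averages reported above can be sketched in plain Python. This is a minimal illustration of how the two averages differ, not the actual evaluation code (which would typically use a library such as scikit-learn); the `f1_scores` helper and the example labels are hypothetical.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro and weighted averages for 'ham'/'spam' labels."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = (2 * precision * recall / (precision + recall)
                        if precision + recall else 0.0)
    # Macro: unweighted mean over classes; weighted: mean weighted by support.
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted

# Illustrative labels only — not the real test set.
per_class, macro, weighted = f1_scores(
    ["ham", "ham", "spam", "spam"],
    ["ham", "spam", "spam", "spam"],
)
```

The macro average treats "ham" and "spam" equally, while the weighted average scales each class's F1 by its support; with an imbalanced dataset (the dummy baseline's 86.6% accuracy suggests a ham-heavy class distribution), the two can diverge, which is why the card reports both.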