VerifiedPrompts committed on
Commit 9315bd7 · verified · 1 Parent(s): a0999a9

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,91 +1,39 @@
- 📘 Training Report: Mistakes & Lessons from Context Classification Project (cntxt-class-final)
- 🧠 Project Summary
- The primary objective of this project was to develop a prompt context detector capable of classifying input prompts into one of three categories:
- "has context"
- "Intent is unclear, Please input more context"
- "missing platform, audience, budget, goal"
- Initially, the project used a text-to-text model (flan-t5-base). To optimize for speed, cost-efficiency, and stability, the approach was switched to a text-classification model (distilbert-base-uncased).
- 🔴 All Mistakes Made (Chronological)
- 🔹 Phase 1: Using flan-t5-base (Seq2Seq)
- | Mistake | Description | Fix / Outcome |
- | --- | --- | --- |
- | ❌ No NLTK awareness | The AutoTrain UI used ROUGE/BLEU by default, which triggered an NLTK error (punkt_tab). | Switched the task to classification to avoid the dependency and the error. |
- | ❌ Default metric usage | Failed to disable the default metrics, leading to crashes during training. | Removed the metrics, then moved to a classification task where they did not apply. |
- | ❌ Used full dataset immediately | Attempted to train on 200,000 rows from the very beginning, wasting significant cost and time. | Later adopted smaller "burn-in" runs of e.g. 50,000 rows (see the sketch below). |
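To make the burn-in idea concrete, here is a minimal sketch using the datasets library (the dataset id matches this repo, but the subset size and seed are illustrative):

```python
from datasets import load_dataset

# Burn-in strategy: shuffle, then train on a 50k-row subset before
# committing compute to the full 200k-row dataset.
ds = load_dataset("VerifiedPrompts/cntxt-class-final", split="train")
burn_in = ds.shuffle(seed=42).select(range(50_000))
print(len(burn_in))  # 50000
```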
- 🔹 Phase 2: Switching to Text Classification
- | Mistake | Description | Fix |
- | --- | --- | --- |
- | ❌ Used string labels directly | AutoTrain treated string labels as raw Value types, which lack the .names attribute, causing crashes (see the check below). | Attempted to convert the labels to integer representations. |
- | ❌ Mapped string labels to ints | Despite mapping string labels to integers, AutoTrain still expected a ClassLabel type, not raw integers. | The crash persisted due to the incorrect label type. |
- | ❌ Assumed ClassLabel auto-casting | Believed the AutoTrain UI would automatically detect and cast string labels to ClassLabel. | The AutoTrain UI does not auto-cast; it uses raw Value types for labels. |
- | ❌ Tried using dataset.py logic with CSVs | Expected Hugging Face to interpret dataset.py scripts during training when using CSV-based repositories. | AutoTrain ignores custom scripts like dataset.py when datasets are provided as raw CSVs. |
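The Value-vs-ClassLabel mismatch described above can be seen directly; a hypothetical check (the CSV file name is illustrative):

```python
from datasets import load_dataset

# Load a raw CSV, the way the AutoTrain UI's loader effectively does.
ds = load_dataset("csv", data_files="train.csv")["train"]

# A string label column arrives as Value("string"); it has no .names
# attribute, which is exactly the AttributeError described here.
print(ds.features["label"])                    # Value(dtype='string', id=None)
print(hasattr(ds.features["label"], "names"))  # False
```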
- 🔹 Phase 3: Attempted Fixes
- | Mistake | Description | Fix |
- | --- | --- | --- |
- | ❌ Repeated uploads of CSVs | Continuously uploaded new versions of CSVs, mistakenly believing the label formatting was incorrect. | Realized the core issue was the expected label type (ClassLabel), not just formatting. |
- | ❌ Tried casting in-place | Attempted to use cast_column(...) within AutoTrain UI-compatible ZIPs. | This approach does not work unless the dataset is pre-processed using the datasets library. |
- | ❌ Trusted UI log errors too literally | Error messages in the UI logs sometimes pointed to the wrong column (e.g., label, token). | Verified that the columns themselves were correct; the actual issue was the label's data type. |
- | ❌ Spent hours debugging CSVs | Dedicated extensive time to debugging CSV files, which were ultimately found to be correctly formatted. | The problem was the label type; the solution was to push the dataset with ClassLabel via datasets.push_to_hub(). |
- 🔹 Final Fix (Success)
- | Fix | Description |
- | --- | --- |
- | ✅ Used the datasets library | The label column was correctly cast to ClassLabel using the datasets library. |
- | ✅ Used DatasetDict + push_to_hub() | Ensured the dataset was formatted as a DatasetDict so AutoTrain could properly read .names (see the sketch below). |
- | ✅ Pushed the dataset to VerifiedPrompts/cntxt-class-final | This finally resolved the AttributeError and allowed the training process to proceed. |
- | ✅ Used distilbert-base-uncased | Selected a lightweight, T4-friendly model, which completed training successfully. |
- | ✅ Set up AutoTrain with clean splits + logging | The entire training pipeline ran smoothly with healthy logs, indicating a correct configuration. |
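A minimal sketch of that final fix, assuming CSVs with "text" and "label" columns (the file names are illustrative):

```python
from datasets import load_dataset

# Illustrative file names; columns assumed to be "text" and "label".
raw = load_dataset("csv", data_files={"train": "train.csv",
                                      "validation": "valid.csv"})

# class_encode_column converts the string label column to ClassLabel,
# giving the dataset the .names attribute that AutoTrain expects.
encoded = raw.class_encode_column("label")
print(encoded["train"].features["label"].names)

# Pushing through the datasets library preserves the ClassLabel feature.
encoded.push_to_hub("VerifiedPrompts/cntxt-class-final")
```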
- 🔍 Root Causes of Failure
- The primary root causes of the challenges encountered were:
- 1. AutoTrain UI's CSV loader limitations: the loader is highly literal. It performs no automatic type casting (e.g., from string to ClassLabel) and does not interpret custom Python scripts like dataset.py within CSV-based repositories.
- 2. Strict label type expectation: AutoTrain expects the label column to be of type ClassLabel, not raw integers or strings. The .names error arises when the system expects class names (provided by ClassLabel) but receives a raw Value type instead.
- 3. Dataset format dependency: the correct label type (ClassLabel) must be defined in the dataset's own features; it cannot be inferred or cast during the AutoTrain UI's processing.
- ✅ Lessons Learned
- These experiences provided crucial insights for future projects:
- | Lesson | Impact |
- | --- | --- |
- | Always push classification datasets via datasets.push_to_hub() | Ensures proper label casting and avoids common data-type issues during AutoTrain ingestion. |
- | Never assume the AutoTrain UI reads dataset.py | Custom dataset logic defined in dataset.py is ignored when using CSV-based repositories in the AutoTrain UI. |
- | Set ClassLabel early and test with .features["label"] | Explicitly defining and verifying the ClassLabel type guarantees compatibility and avoids runtime errors (see the check below). |
- | Start with 50k rows for burn-in runs | Smaller subsets for initial training runs significantly reduce compute cost and time during experimentation. |
- | Prefer distilbert for text classification on a T4 | distilbert models are lightweight, cost-effective, and generally avoid tokenizer-related issues on T4 GPUs. |
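The ClassLabel check from the table above amounts to a single assertion; a sketch, assuming the dataset has already been pushed to the Hub:

```python
from datasets import ClassLabel, load_dataset

ds = load_dataset("VerifiedPrompts/cntxt-class-final", split="train")

# If this assertion fails, AutoTrain will crash on .names at train time.
assert isinstance(ds.features["label"], ClassLabel), ds.features["label"]
print(ds.features["label"].names)
```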
+ ---
+ library_name: transformers
+ tags:
+ - autotrain
+ - text-classification
+ base_model: distilbert/distilbert-base-uncased
+ widget:
+ - text: "I love AutoTrain"
+ datasets:
+ - VerifiedPrompts/cntxt-class-final
+ ---
+ 
+ # Model Trained Using AutoTrain
+ 
+ - Problem type: Text Classification
+ 
+ ## Validation Metrics
+ loss: 0.0
+ f1_macro: 1.0
+ f1_micro: 1.0
+ f1_weighted: 1.0
+ precision_macro: 1.0
+ precision_micro: 1.0
+ precision_weighted: 1.0
+ recall_macro: 1.0
+ recall_micro: 1.0
+ recall_weighted: 1.0
+ accuracy: 1.0
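For completeness, a hypothetical inference call against the uploaded weights. The model repo id is a guess assembled from the username and project name in training_params.json below; this card does not state it:

```python
from transformers import pipeline

# Repo id is an assumption (username + AutoTrain project name).
clf = pipeline("text-classification",
               model="VerifiedPrompts/autotrain-8z0a6-ohqum")
print(clf("Run ads for my shoe brand on Instagram, $500 budget, Gen Z"))
# e.g. [{'label': 'has context', 'score': ...}] — output is illustrative
```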
checkpoint-22500/config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "distilbert/distilbert-base-uncased",
+   "_num_labels": 3,
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Intent is unclear, Please input more context",
+     "1": "has context",
+     "2": "missing platform, audience, budget, goal"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Intent is unclear, Please input more context": 0,
+     "has context": 1,
+     "missing platform, audience, budget, goal": 2
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.0",
+   "vocab_size": 30522
+ }
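The id2label/label2id maps above are how the three ClassLabel names survive into the trained model; a small sketch of reading them back (the local path is illustrative):

```python
from transformers import AutoConfig

# Illustrative local path to the checkpoint directory above.
cfg = AutoConfig.from_pretrained("checkpoint-22500")

# transformers parses the JSON id2label keys back into ints.
print(cfg.id2label[1])               # "has context"
print(cfg.label2id["has context"])   # 1
```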
checkpoint-22500/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6cdd9c518a84c5978bbcabd1ab1471f2bf99b2037cca35e40e439f5035bb528
+ size 267835644
checkpoint-22500/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9608059a4e05af8de926fecb80612e2bdacc98e8849cfe550f782489f49db2f
+ size 535733434
checkpoint-22500/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b12d01f98c4bd57244722a532a4fef3ced73279bc6c66e1769c9f41674dbe5ab
+ size 14244
checkpoint-22500/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95e0c874c4b32ab2996744d1e36277aaf7f5b82859a8468bb4b4eba2319905f3
+ size 1064
checkpoint-22500/trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-22500/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbd2abe463b6fb49523cf5f62c2c06d6c102108d469f3e3f49abe3ffc7808caf
+ size 5368
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "distilbert/distilbert-base-uncased",
+   "_num_labels": 3,
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Intent is unclear, Please input more context",
+     "1": "has context",
+     "2": "missing platform, audience, budget, goal"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Intent is unclear, Please input more context": 0,
+     "has context": 1,
+     "missing platform, audience, budget, goal": 2
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.0",
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6cdd9c518a84c5978bbcabd1ab1471f2bf99b2037cca35e40e439f5035bb528
+ size 267835644
runs/May27_07-55-36_r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf/events.out.tfevents.1748332538.r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf.69.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3efb6d1f8497b53d4fee8e3b4140d694f256f034479932c4c8be3dd6dadf3ea1
- size 505745
+ oid sha256:8f6ee793f588b5524becafcc1fc194c5327606a4e5e948e17927e02ffae242c4
+ size 586073
runs/May27_07-55-36_r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf/events.out.tfevents.1748336058.r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf.69.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c1a4aaae518ee46210470e64b1ff86bcff4eec3e23d7b40cb26fbbf1c7e92b9c
+ size 936
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbd2abe463b6fb49523cf5f62c2c06d6c102108d469f3e3f49abe3ffc7808caf
+ size 5368
training_params.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "data_path": "VerifiedPrompts/cntxt-class-final",
+   "model": "distilbert/distilbert-base-uncased",
+   "lr": 5e-05,
+   "epochs": 3,
+   "max_seq_length": 128,
+   "batch_size": 8,
+   "warmup_ratio": 0.1,
+   "gradient_accumulation": 1,
+   "optimizer": "adamw_torch",
+   "scheduler": "linear",
+   "weight_decay": 0.0,
+   "max_grad_norm": 1.0,
+   "seed": 42,
+   "train_split": "train",
+   "valid_split": "validation",
+   "text_column": "text",
+   "target_column": "label",
+   "logging_steps": -1,
+   "project_name": "autotrain-8z0a6-ohqum",
+   "auto_find_batch_size": false,
+   "mixed_precision": "fp16",
+   "save_total_limit": 1,
+   "push_to_hub": true,
+   "eval_strategy": "epoch",
+   "username": "VerifiedPrompts",
+   "log": "tensorboard",
+   "early_stopping_patience": 5,
+   "early_stopping_threshold": 0.01
+ }
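These parameters line up with the checkpoint-22500 directory above; a back-of-envelope check (the implied row count is an inference, not stated anywhere in this repo, and assumes a single process):

```python
# 22,500 optimizer steps over 3 epochs at batch size 8 with gradient
# accumulation 1 implies roughly 60k training rows per epoch.
epochs, batch_size, grad_accum = 3, 8, 1
total_steps = 22_500                      # from the checkpoint name
steps_per_epoch = total_steps / epochs    # 7500.0
implied_rows = steps_per_epoch * batch_size * grad_accum
print(steps_per_epoch, implied_rows)      # 7500.0 60000.0
```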
vocab.txt ADDED
The diff for this file is too large to render. See raw diff