VerifiedPrompts committed on
Commit 9315bd7 · verified · 1 Parent(s): a0999a9

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,91 +1,39 @@
- 📘 Training Report: Mistakes & Lessons from Context Classification Project (cntxt-class-final)
- 🧠 Project Summary
- The primary objective of this project was to develop a prompt context detector capable of classifying input prompts into one of three categories:
- "has context"
- "Intent is unclear, Please input more context"
- "missing platform, audience, budget, goal"
- Initially, the project used a text-to-text model (flan-t5-base). To optimize for speed, cost-efficiency, and stability, the approach was switched to a text-classification model (distilbert-base-uncased).
- 🔴 All Mistakes Made (Chronological)
- 🔹 Phase 1: Using flan-t5-base (Seq2Seq)
- | Mistake | Description | Fix / Outcome |
- | --- | --- | --- |
- | ❌ No NLTK awareness | The AutoTrain UI used ROUGE/BLEU by default, which triggered an NLTK error (punkt_tab). | Switched the task to classification to avoid the dependency and the error. |
- | ❌ Default metric usage | Failed to disable the default metrics, leading to crashes during training. | Removed the metrics, then moved to a classification task where they did not apply. |
- | ❌ Used full dataset immediately | Attempted to train on 200,000 rows from the very beginning, wasting significant cost and time. | Later adopted smaller "burn-in" runs of e.g. 50,000 rows (see the sketch below). |
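To make the burn-in idea concrete, here is a minimal sketch using the datasets library (the dataset id matches this repo, but the subset size and seed are illustrative):

```python
from datasets import load_dataset

# Burn-in strategy: shuffle, then train on a 50k-row subset before
# committing compute to the full 200k-row dataset.
ds = load_dataset("VerifiedPrompts/cntxt-class-final", split="train")
burn_in = ds.shuffle(seed=42).select(range(50_000))
print(len(burn_in))  # 50000
```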
- 🔹 Phase 2: Switching to Text Classification
- | Mistake | Description | Fix |
- | --- | --- | --- |
- | ❌ Used string labels directly | AutoTrain treated string labels as raw Value types, which lack the .names attribute, causing crashes (see the check below). | Attempted to convert the labels to integer representations. |
- | ❌ Mapped string labels to ints | Despite mapping string labels to integers, AutoTrain still expected a ClassLabel type, not raw integers. | The crash persisted due to the incorrect label type. |
- | ❌ Assumed ClassLabel auto-casting | Believed the AutoTrain UI would automatically detect and cast string labels to ClassLabel. | The AutoTrain UI does not auto-cast; it uses raw Value types for labels. |
- | ❌ Tried using dataset.py logic with CSVs | Expected Hugging Face to interpret dataset.py scripts during training when using CSV-based repositories. | AutoTrain ignores custom scripts like dataset.py when datasets are provided as raw CSVs. |
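The Value-vs-ClassLabel mismatch described above can be seen directly; a hypothetical check (the CSV file name is illustrative):

```python
from datasets import load_dataset

# Load a raw CSV, the way the AutoTrain UI's loader effectively does.
ds = load_dataset("csv", data_files="train.csv")["train"]

# A string label column arrives as Value("string"); it has no .names
# attribute, which is exactly the AttributeError described here.
print(ds.features["label"])                    # Value(dtype='string', id=None)
print(hasattr(ds.features["label"], "names"))  # False
```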
- 🔹 Phase 3: Attempted Fixes
- | Mistake | Description | Fix |
- | --- | --- | --- |
- | ❌ Repeated uploads of CSVs | Continuously uploaded new versions of CSVs, mistakenly believing the label formatting was incorrect. | Realized the core issue was the expected label type (ClassLabel), not just formatting. |
- | ❌ Tried casting in-place | Attempted to use cast_column(...) within AutoTrain UI-compatible ZIPs. | This approach does not work unless the dataset is pre-processed using the datasets library. |
- | ❌ Trusted UI log errors too literally | Error messages in the UI logs sometimes pointed to the wrong column (e.g., label, token). | Verified that the columns themselves were correct; the actual issue was the label's data type. |
- | ❌ Spent hours debugging CSVs | Dedicated extensive time to debugging CSV files, which were ultimately found to be correctly formatted. | The problem was the label type; the solution was to push the dataset with ClassLabel via datasets.push_to_hub(). |
- 🔹 Final Fix (Success)
- | Fix | Description |
- | --- | --- |
- | ✅ Used the datasets library | The label column was correctly cast to ClassLabel using the datasets library. |
- | ✅ Used DatasetDict + push_to_hub() | Ensured the dataset was formatted as a DatasetDict so AutoTrain could properly read .names (see the sketch below). |
- | ✅ Pushed the dataset to VerifiedPrompts/cntxt-class-final | This finally resolved the AttributeError and allowed the training process to proceed. |
- | ✅ Used distilbert-base-uncased | Selected a lightweight, T4-friendly model, which completed training successfully. |
- | ✅ Set up AutoTrain with clean splits + logging | The entire training pipeline ran smoothly with healthy logs, indicating a correct configuration. |
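A minimal sketch of that final fix, assuming CSVs with "text" and "label" columns (the file names are illustrative):

```python
from datasets import load_dataset

# Illustrative file names; columns assumed to be "text" and "label".
raw = load_dataset("csv", data_files={"train": "train.csv",
                                      "validation": "valid.csv"})

# class_encode_column converts the string label column to ClassLabel,
# giving the dataset the .names attribute that AutoTrain expects.
encoded = raw.class_encode_column("label")
print(encoded["train"].features["label"].names)

# Pushing through the datasets library preserves the ClassLabel feature.
encoded.push_to_hub("VerifiedPrompts/cntxt-class-final")
```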
- 🔍 Root Causes of Failure
- The primary root causes of the challenges encountered were:
- 1. AutoTrain UI's CSV loader limitations: the loader is highly literal. It performs no automatic type casting (e.g., from string to ClassLabel) and does not interpret custom Python scripts like dataset.py within CSV-based repositories.
- 2. Strict label type expectation: AutoTrain expects the label column to be of type ClassLabel, not raw integers or strings. The .names error arises when the system expects class names (provided by ClassLabel) but receives a raw Value type instead.
- 3. Dataset format dependency: the correct label type (ClassLabel) must be defined in the dataset's own features; it cannot be inferred or cast during the AutoTrain UI's processing.
- ✅ Lessons Learned
- These experiences provided crucial insights for future projects:
- | Lesson | Impact |
- | --- | --- |
- | Always push classification datasets via datasets.push_to_hub() | Ensures proper label casting and avoids common data-type issues during AutoTrain ingestion. |
- | Never assume the AutoTrain UI reads dataset.py | Custom dataset logic defined in dataset.py is ignored when using CSV-based repositories in the AutoTrain UI. |
- | Set ClassLabel early and test with .features["label"] | Explicitly defining and verifying the ClassLabel type guarantees compatibility and avoids runtime errors (see the check below). |
- | Start with 50k rows for burn-in runs | Smaller subsets for initial training runs significantly reduce compute cost and time during experimentation. |
- | Prefer distilbert for text classification on a T4 | distilbert models are lightweight, cost-effective, and generally avoid tokenizer-related issues on T4 GPUs. |
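The ClassLabel check from the table above amounts to a single assertion; a sketch, assuming the dataset has already been pushed to the Hub:

```python
from datasets import ClassLabel, load_dataset

ds = load_dataset("VerifiedPrompts/cntxt-class-final", split="train")

# If this assertion fails, AutoTrain will crash on .names at train time.
assert isinstance(ds.features["label"], ClassLabel), ds.features["label"]
print(ds.features["label"].names)
```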
+ ---
+ library_name: transformers
+ tags:
+ - autotrain
+ - text-classification
+ base_model: distilbert/distilbert-base-uncased
+ widget:
+ - text: "I love AutoTrain"
+ datasets:
+ - VerifiedPrompts/cntxt-class-final
+ ---
+ 
+ # Model Trained Using AutoTrain
+ 
+ - Problem type: Text Classification
+ 
+ ## Validation Metrics
+ loss: 0.0
+ f1_macro: 1.0
+ f1_micro: 1.0
+ f1_weighted: 1.0
+ precision_macro: 1.0
+ precision_micro: 1.0
+ precision_weighted: 1.0
+ recall_macro: 1.0
+ recall_micro: 1.0
+ recall_weighted: 1.0
+ accuracy: 1.0
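For completeness, a hypothetical inference call against the uploaded weights. The model repo id is a guess assembled from the username and project name in training_params.json below; this card does not state it:

```python
from transformers import pipeline

# Repo id is an assumption (username + AutoTrain project name).
clf = pipeline("text-classification",
               model="VerifiedPrompts/autotrain-8z0a6-ohqum")
print(clf("Run ads for my shoe brand on Instagram, $500 budget, Gen Z"))
# e.g. [{'label': 'has context', 'score': ...}] — output is illustrative
```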
checkpoint-22500/config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "distilbert/distilbert-base-uncased",
+   "_num_labels": 3,
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Intent is unclear, Please input more context",
+     "1": "has context",
+     "2": "missing platform, audience, budget, goal"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Intent is unclear, Please input more context": 0,
+     "has context": 1,
+     "missing platform, audience, budget, goal": 2
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.0",
+   "vocab_size": 30522
+ }
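The id2label/label2id maps above are how the three ClassLabel names survive into the trained model; a small sketch of reading them back (the local path is illustrative):

```python
from transformers import AutoConfig

# Illustrative local path to the checkpoint directory above.
cfg = AutoConfig.from_pretrained("checkpoint-22500")

# transformers parses the JSON id2label keys back into ints.
print(cfg.id2label[1])               # "has context"
print(cfg.label2id["has context"])   # 1
```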
checkpoint-22500/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6cdd9c518a84c5978bbcabd1ab1471f2bf99b2037cca35e40e439f5035bb528
+ size 267835644
checkpoint-22500/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9608059a4e05af8de926fecb80612e2bdacc98e8849cfe550f782489f49db2f
+ size 535733434
checkpoint-22500/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b12d01f98c4bd57244722a532a4fef3ced73279bc6c66e1769c9f41674dbe5ab
+ size 14244
checkpoint-22500/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95e0c874c4b32ab2996744d1e36277aaf7f5b82859a8468bb4b4eba2319905f3
+ size 1064
checkpoint-22500/trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-22500/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbd2abe463b6fb49523cf5f62c2c06d6c102108d469f3e3f49abe3ffc7808caf
+ size 5368
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "distilbert/distilbert-base-uncased",
+   "_num_labels": 3,
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Intent is unclear, Please input more context",
+     "1": "has context",
+     "2": "missing platform, audience, budget, goal"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Intent is unclear, Please input more context": 0,
+     "has context": 1,
+     "missing platform, audience, budget, goal": 2
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.0",
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6cdd9c518a84c5978bbcabd1ab1471f2bf99b2037cca35e40e439f5035bb528
+ size 267835644
runs/May27_07-55-36_r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf/events.out.tfevents.1748332538.r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf.69.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3efb6d1f8497b53d4fee8e3b4140d694f256f034479932c4c8be3dd6dadf3ea1
- size 505745
+ oid sha256:8f6ee793f588b5524becafcc1fc194c5327606a4e5e948e17927e02ffae242c4
+ size 586073
runs/May27_07-55-36_r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf/events.out.tfevents.1748336058.r-verifiedprompts-context-detector-71s4c8h7-06974-pp7jf.69.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c1a4aaae518ee46210470e64b1ff86bcff4eec3e23d7b40cb26fbbf1c7e92b9c
+ size 936
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbd2abe463b6fb49523cf5f62c2c06d6c102108d469f3e3f49abe3ffc7808caf
+ size 5368
training_params.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "data_path": "VerifiedPrompts/cntxt-class-final",
+   "model": "distilbert/distilbert-base-uncased",
+   "lr": 5e-05,
+   "epochs": 3,
+   "max_seq_length": 128,
+   "batch_size": 8,
+   "warmup_ratio": 0.1,
+   "gradient_accumulation": 1,
+   "optimizer": "adamw_torch",
+   "scheduler": "linear",
+   "weight_decay": 0.0,
+   "max_grad_norm": 1.0,
+   "seed": 42,
+   "train_split": "train",
+   "valid_split": "validation",
+   "text_column": "text",
+   "target_column": "label",
+   "logging_steps": -1,
+   "project_name": "autotrain-8z0a6-ohqum",
+   "auto_find_batch_size": false,
+   "mixed_precision": "fp16",
+   "save_total_limit": 1,
+   "push_to_hub": true,
+   "eval_strategy": "epoch",
+   "username": "VerifiedPrompts",
+   "log": "tensorboard",
+   "early_stopping_patience": 5,
+   "early_stopping_threshold": 0.01
+ }
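These parameters line up with the checkpoint-22500 directory above; a back-of-envelope check (the implied row count is an inference, not stated anywhere in this repo, and assumes a single process):

```python
# 22,500 optimizer steps over 3 epochs at batch size 8 with gradient
# accumulation 1 implies roughly 60k training rows per epoch.
epochs, batch_size, grad_accum = 3, 8, 1
total_steps = 22_500                      # from the checkpoint name
steps_per_epoch = total_steps / epochs    # 7500.0
implied_rows = steps_per_epoch * batch_size * grad_accum
print(steps_per_epoch, implied_rows)      # 7500.0 60000.0
```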
vocab.txt ADDED
The diff for this file is too large to render. See raw diff