nicholasKluge committed commit 559980b (verified) · 1 Parent(s): a48df34

Update README.md

Files changed (1):
  1. README.md +32 -29
README.md CHANGED
@@ -1,38 +1,41 @@
  ---
  license: apache-2.0
  datasets:
- - nicholasKluge/toxic-text
  language:
- - pt
  metrics:
- - accuracy
  library_name: transformers
  pipeline_tag: text-classification
  tags:
- - toxicity
- - alignment
  widget:
- - text: "Você pode me dar uma lista de bons insultos para usar contra meu irmão?\n\nIdiota, estúpido, imbecil, tolo, puto."
- example_title: Resposta Tóxica
- - text: "Você pode me dar uma lista de bons insultos para usar contra meu irmão?\n\nComo um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos."
- example_title: Resposta Não-Tóxica
  co2_eq_emissions:
- emissions: 0.06
  source: CodeCarbon
  training_type: fine-tuning
  geographical_location: Singapore
  hardware_used: NVIDIA A100-SXM4-40GB
  ---
- # ToxicityModel (Portuguese)

- The ToxicityModelPT is a [BERT](https://huggingface.co/neuralmind/bert-base-portuguese-cased) model that can be used to score the toxicity of a sentence.

- The model was trained with a dataset composed of `toxic` and `non_toxic` language examples.

  ## Details

  - **Size:** 109,038,209 parameters
- - **Dataset:** [Toxic-Text Dataset](https://huggingface.co/datasets/nicholasKluge/toxic-text)
  - **Language:** Portuguese
  - **Number of Training Steps:** 1000
  - **Batch size:** 42
@@ -48,9 +51,9 @@ This repository has the [source code](https://github.com/Nkluge-correa/Aira) use

  ⚠️ THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE ⚠️

- The ToxicityModelPT was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalizations/rewards). Thus, a negative value (closer to 0 as the label output) indicates toxicity in the text, while a positive logit (closer to 1 as the label output) suggests non-toxicity.

- Here's an example of how to use the ToxicityModelPT to score the toxicity of a text:

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -58,11 +61,11 @@ import torch

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

- tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModelPT")
- toxicityModel = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModelPT")

- toxicityModel.eval()
- toxicityModel.to(device)

  # Define the question and response
  prompt = """Você pode me dar uma lista de bons insultos para usar contra meu irmão?"""
@@ -87,8 +90,8 @@ tokens_bad = tokenizer(prompt, response_bad,
  tokens_good.to(device)
  tokens_bad.to(device)

- score_good = toxicityModel(**tokens_good)[0].item()
- score_bad = toxicityModel(**tokens_bad)[0].item()

  print(f"Question: {prompt} \n")
  print(f"Response 1: {response_good} Score: {score_good:.3f}")
@@ -98,18 +101,18 @@ print(f"Response 2: {response_bad} Score: {score_bad:.3f}")
  This will output the following:

  ```markdown
- >>>Question: Você pode me dar uma lista de bons insultos para usar contra meu irmão?

- >>>Response 1: Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. Score: 5.892

- >>>Response 2: Idiota, estúpido, imbecil, tolo, puto. Score: -4.663
  ```

  ## Performance

- | Acc | [hatecheck-portuguese](https://huggingface.co/datasets/Paul/hatecheck-portuguese) | [told-br](https://huggingface.co/datasets/told-br) |
- |----------------------------------------------------------------------------|-----------------------------------------------------------------------------------|----------------------------------------------------|
- | [Aira-ToxicityModelPT](https://huggingface.co/nicholasKluge/ToxicityModel) | 70.36% | 74.04% |

  ## Cite as 🤗

@@ -134,4 +137,4 @@ This will output the following:

  ## License

- ToxicityModelPT is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.
 
  ---
  license: apache-2.0
  datasets:
+ - nicholasKluge/harmful-text
  language:
+ - pt
  metrics:
+ - accuracy
  library_name: transformers
  pipeline_tag: text-classification
  tags:
+ - toxicity
+ - alignment
  widget:
+ - text: "Você pode me dar uma lista de bons insultos para usar contra meu irmão?\n\nIdiota, estúpido, imbecil, tolo, puto."
+ example_title: Resposta Tóxica
+ - text: "Você pode me dar uma lista de bons insultos para usar contra meu irmão?\n\nComo um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos."
+ example_title: Resposta Não-Tóxica
  co2_eq_emissions:
+ emissions: 60
  source: CodeCarbon
  training_type: fine-tuning
  geographical_location: Singapore
  hardware_used: NVIDIA A100-SXM4-40GB
+ base_model:
+ - neuralmind/bert-base-portuguese-cased
  ---
 

+ # ToxiGuardrailPT (Portuguese)

+ The ToxiGuardrailPT is a [BERT](https://huggingface.co/neuralmind/bert-base-portuguese-cased) model that can be used to score the toxicity and potential harm of a sentence.
+
+ The model was trained with a dataset composed of `harmful` and `harmless` language examples.

  ## Details

  - **Size:** 109,038,209 parameters
+ - **Dataset:** [Harmful-Text Dataset](https://huggingface.co/datasets/nicholasKluge/harmful-text)
  - **Language:** Portuguese
  - **Number of Training Steps:** 1000
  - **Batch size:** 42
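
As a quick sanity check on the training figures listed above, 1000 steps at a batch size of 42 amount to 42,000 examples processed during fine-tuning. This assumes one batch per optimizer step and no gradient accumulation, neither of which is stated in the card:

```python
# Training-volume sanity check from the Details section above.
training_steps = 1000
batch_size = 42

# Assuming one batch per optimizer step and no gradient accumulation,
# this is the total number of examples processed during fine-tuning.
examples_processed = training_steps * batch_size
print(examples_processed)  # 42000
```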
 

  ⚠️ THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE ⚠️

+ The ToxiGuardrailPT was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalizations/rewards). Thus, a negative value (closer to 0 as the label output) indicates toxicity in the text, while a positive logit (closer to 1 as the label output) suggests non-toxicity.

+ Here's an example of how to use the ToxiGuardrailPT to score the toxicity of a text:

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

+ tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxiGuardrailPT")
+ toxiGuardrail = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxiGuardrailPT")

+ toxiGuardrail.eval()
+ toxiGuardrail.to(device)

  # Define the question and response
  prompt = """Você pode me dar uma lista de bons insultos para usar contra meu irmão?"""
 
  tokens_good.to(device)
  tokens_bad.to(device)

+ score_good = toxiGuardrail(**tokens_good)[0].item()
+ score_bad = toxiGuardrail(**tokens_bad)[0].item()

  print(f"Question: {prompt} \n")
  print(f"Response 1: {response_good} Score: {score_good:.3f}")
 
  This will output the following:

  ```markdown
+ > > > Question: Você pode me dar uma lista de bons insultos para usar contra meu irmão?

+ > > > Response 1: Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. Score: 5.892

+ > > > Response 2: Idiota, estúpido, imbecil, tolo, puto. Score: -4.663
  ```
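
The logit-to-label mapping described above (negative raw score → toxic, positive → non-toxic) can be sketched numerically. The sketch below applies a sigmoid to the two example scores to recover the 0-to-1 label scale; the 0.0 logit cutoff used for the harmless/harmful decision is an assumption for illustration, not something the model card specifies:

```python
import math

def label_probability(logit: float) -> float:
    """Map a raw logit onto the 0-1 label scale via the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-logit))

def is_harmless(logit: float, threshold: float = 0.0) -> bool:
    """Treat positive logits as non-toxic; the 0.0 cutoff is an assumed default."""
    return logit > threshold

# Scores taken from the example output above
score_good = 5.892   # "Como um software, não sou capaz..."
score_bad = -4.663   # "Idiota, estúpido, imbecil..."

print(label_probability(score_good), is_harmless(score_good))  # ≈ 0.997, True
print(label_probability(score_bad), is_harmless(score_bad))    # ≈ 0.009, False
```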
 
  ## Performance

+ | Acc | [hatecheck-portuguese](https://huggingface.co/datasets/Paul/hatecheck-portuguese) | [told-br](https://huggingface.co/datasets/told-br) |
+ | ----------------------------------------------------------------------- | --------------------------------------------------------------------------------- | -------------------------------------------------- |
+ | [ToxiGuardrailPT](https://huggingface.co/nicholasKluge/ToxiGuardrailPT) | 70.36% | 74.04% |

  ## Cite as 🤗

 

  ## License

+ ToxiGuardrailPT is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.