thebajajra committed
Commit 1e7e15c · verified · 1 Parent(s): 65f0700

Update README.md

Files changed (1): README.md +26 -1
@@ -142,8 +142,33 @@ RexBERT-large was trained in **three phases**:
  ## Data Overview

+ - **Dataset:** [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
  - **Domain mix:**
- - **Data quality:**
+
+ We identified 9 domains that overlap with e-commerce; each contains a significant number of relevant tokens but required filtering. The domains and their filtered sizes are listed below.
+ | Domain | Size (GB) |
+ |---|---|
+ | Hobby | 114 |
+ | News | 66 |
+ | Health | 66 |
+ | Entertainment | 64 |
+ | Travel | 52 |
+ | Food | 22 |
+ | Automotive | 19 |
+ | Sports | 12 |
+ | Music and Dance | 7 |
+
+ Additionally, 6 more domains overlapped almost completely with e-commerce and were taken directly from FineFineWeb.
+ | Domain | Size (GB) |
+ |---|---|
+ | Fashion | 37 |
+ | Beauty | 37 |
+ | Celebrity | 28 |
+ | Movie | 26 |
+ | Photo | 15 |
+ | Painting | 2 |
+
+ By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. Finer-grained filtering within each domain is therefore required to extract only the e-commerce-specific lines. We accomplish this by training lightweight per-domain classifiers that distinguish e-commerce content from non-e-commerce content.
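The per-domain filtering step described in the README text can be illustrated with a small sketch. The source does not say which classifier, features, or training data were used, so everything below is an assumption made for illustration: a bag-of-words multinomial naive Bayes model written with the standard library only, a hypothetical `NaiveBayesFilter` class and `filter_lines` helper, and a handful of made-up labeled lines for a single domain.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesFilter:
    """Binary multinomial naive Bayes (assumed model, not the authors').

    Label 1 means e-commerce text, label 0 means everything else.
    """

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = {0: 0, 1: 0}
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens, label):
        # Log prior plus Laplace-smoothed log likelihood of each known token.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # skip words never seen in training
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def is_ecommerce(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, 1) > self._log_score(tokens, 0)


# Hypothetical labeled lines for one domain; real training data would be
# far larger and drawn from that domain's web text.
fashion_train = [
    ("Buy this leather jacket now with free shipping", 1),
    ("Add to cart: cotton t-shirt, price $19.99, in stock", 1),
    ("Shop the summer sale and get 20% off all dresses", 1),
    ("The history of denim dates back to the 19th century", 0),
    ("Designers discussed sustainability at the annual conference", 0),
    ("A museum exhibit explores the evolution of footwear", 0),
]

# One lightweight classifier per domain, keyed by domain name.
classifiers = {
    "Fashion": NaiveBayesFilter().fit(
        [text for text, _ in fashion_train],
        [label for _, label in fashion_train],
    )
}


def filter_lines(domain, lines):
    """Keep only the lines that the domain's classifier marks as e-commerce."""
    clf = classifiers[domain]
    return [line for line in lines if clf.is_ecommerce(line)]


kept = filter_lines(
    "Fashion",
    [
        "Free shipping on every jacket, buy now",
        "The museum hosted a conference on the history of denim",
    ],
)
print(kept)
```

Keeping one model per domain lets each classifier learn that domain's own vocabulary for commercial versus informational text. A production pipeline would likely use a stronger model and a tuned decision threshold rather than this simple score comparison.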