VoyagerXvoyagerx committed on
Commit
f33c596
·
1 Parent(s): 407bf3f

Support DINOv2

Browse files
Files changed (7) hide show
  1. Tutorial.md +16 -7
  2. Tutorial_zh.md +23 -11
  3. app.py +23 -9
  4. configs/huggingface.yaml +6 -3
  5. images/samples.png +2 -2
  6. models/dinov2_model.py +310 -0
  7. visualize.py +1 -16
Tutorial.md CHANGED
@@ -1,4 +1,4 @@
1
- # Tutorial: EarthEmbeddingExplorer
2
 
3
  ## Background
4
 
@@ -31,7 +31,8 @@ The original tiles in Core-S2L2A are large (1068×1068 pixels), but most AI mode
31
  </div>
32
 
33
  ### Retrieval models
34
- The core of image retrieval is a family of models known as **CLIP (Contrastive Language-Image Pre-training)** [2]. We use its improved variants such as **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5].
 
35
 
36
  An analogy: when teaching a child, you show a picture of a glacier and say “glacier”. After seeing many examples, the child learns to associate the visual concept with the word.
37
 
@@ -47,11 +48,14 @@ The key property is: if an image matches a text description (or location), their
47
  <em>Figure 2: How CLIP-like models connect images and text.</em>
48
  </div>
49
 
50
- The three models we use differ in their encoders and training data:
 
 
51
 
52
  | Model | Encoder type | Training data |
53
  | :--- | :--- | :--- |
54
  | SigLIP | image encoder + text encoder | natural image–text pairs from the web |
 
55
  | FarSLIP | image encoder + text encoder | satellite image–text pairs |
56
  | SatCLIP | image encoder + location encoder | satellite image–location pairs |
57
 
@@ -62,8 +66,8 @@ The three models we use differ in their encoders and training data:
62
  </div>
63
 
64
  In EarthEmbeddingExplorer:
65
- 1. We precompute embeddings for ~22k globally distributed satellite images using SigLIP, FarSLIP, and SatCLIP.
66
- 2. When you provide a query (text like a satellite image of glacier, an image, or a location such as (-89, 120)), we encode the query into an embedding using the corresponding encoder.
67
  3. We compare the query embedding with all image embeddings, visualize similarities on a map, and show the top-5 most similar images.
68
 
69
  ## System architecture
@@ -128,6 +132,7 @@ We thank the following open-source projects and datasets that made EarthEmbeddin
128
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
129
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
130
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
 
131
 
132
  **Datasets:**
133
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable datasets for Earth observation by ESA
@@ -137,11 +142,13 @@ We are grateful to the research communities and organizations that developed and
137
  ## Contributors
138
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
139
  - [Weijie Wu](https://github.com/go-bananas-wwj)
140
- - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
 
 
141
 
142
  ## Roadmap
 
143
  - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
144
- - [ ] Support DINOv2 Embedding model and embedding datasets.
145
  - [ ] Support FAISS for faster similarity search.
146
  - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
147
 
@@ -160,3 +167,5 @@ We warmly welcome new contributors!
160
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
161
 
162
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
 
 
 
1
+ # EarthEmbeddingExplorer
2
 
3
  ## Background
4
 
 
31
  </div>
32
 
33
  ### Retrieval models
34
+ Image retrieval builds on two families of models: **CLIP (Contrastive Language-Image Pre-training)** [2] and the self-supervised **DINOv2** [7]. We use CLIP's improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5], along with **DINOv2** for purely visual similarity search [7].
35
+
36
 
37
  An analogy: when teaching a child, you show a picture of a glacier and say “glacier”. After seeing many examples, the child learns to associate the visual concept with the word.
38
 
 
48
  <em>Figure 2: How CLIP-like models connect images and text.</em>
49
  </div>
50
 
51
+ DINOv2, on the other hand, is a self-supervised vision model that learns rich visual representations without requiring paired text data. It excels at capturing visual patterns and can be used for image-to-image similarity search.
52
+
53
+ The four models we use differ in their encoders and training data:
54
 
55
  | Model | Encoder type | Training data |
56
  | :--- | :--- | :--- |
57
  | SigLIP | image encoder + text encoder | natural image–text pairs from the web |
58
+ | DINOv2 | image encoder only | web-scale natural images (self-supervised) |
59
  | FarSLIP | image encoder + text encoder | satellite image–text pairs |
60
  | SatCLIP | image encoder + location encoder | satellite image–location pairs |
61
 
 
66
  </div>
67
 
68
  In EarthEmbeddingExplorer:
69
+ 1. We precompute embeddings for ~250k globally distributed satellite images using SigLIP, DINOv2, FarSLIP, and SatCLIP.
70
+ 2. When you provide a query (text like "a satellite image of glacier", an image, or a location such as (-89, 120)), we encode the query into an embedding using the corresponding encoder.
71
  3. We compare the query embedding with all image embeddings, visualize similarities on a map, and show the top-5 most similar images.
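Step 3 is a normalized dot product. A minimal numpy sketch of the comparison, using made-up 4-d embeddings (real SigLIP/DINOv2 embeddings are much higher-dimensional):

```python
import numpy as np

# Hypothetical embeddings: 8 images and 1 query, dimension 4.
rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(8, 4))
query = rng.normal(size=(4,))

# Normalize so that a dot product equals cosine similarity.
image_embeds /= np.linalg.norm(image_embeds, axis=1, keepdims=True)
query /= np.linalg.norm(query)

similarities = image_embeds @ query          # one score per image
top5 = np.argsort(similarities)[-5:][::-1]   # indices of the 5 best matches

print(top5, similarities[top5])
```

The same scores can be plotted per image location to produce the similarity map described above.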
72
 
73
  ## System architecture
 
132
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
133
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
134
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
135
+ - [DINOv2](https://huggingface.co/facebook/dinov2-large) - Self-supervised vision transformer
136
 
137
  **Datasets:**
138
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable datasets for Earth observation by ESA
 
142
  ## Contributors
143
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
144
  - [Weijie Wu](https://github.com/go-bananas-wwj)
145
+ - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
146
+ - [Mikolaj Czerkawski](https://mikonvergence.github.io/)
147
+ - [Konstantin Klemmer](https://konstantinklemmer.github.io/)
148
 
149
  ## Roadmap
150
+ - [x] Support the DINOv2 embedding model and embedding datasets.
151
  - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
 
152
  - [ ] Support FAISS for faster similarity search.
153
  - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
154
 
 
167
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
168
 
169
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
170
+
171
+ [7] Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
Tutorial_zh.md CHANGED
@@ -30,11 +30,11 @@ The original satellite images in Core-S2L2A are large (1068x1068 pixels), but AI
30
  </div>
31
 
32
  ### Retrieval models
33
- The core technology of image retrieval is an AI model called **CLIP (Contrastive Language-Image Pre-training)** [2]; we use its improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5].
34

35
  Imagine teaching a child to recognize objects. You show them a picture of a glacier and say “glacier”. After seeing many glacier photos and hearing the word, the child learns to associate what glaciers look like with the word “glacier”.
36

37
- SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale: having learned from millions of image-text or image-location pairs, they understand the relationship between images and text/locations.
38
  - They use an image encoder to convert an **image** into a mathematical representation (a string of numbers) called an **embedding**.
39
  - They also use a text/location encoder to convert **text** or a **geographic location (latitude-longitude coordinates)** into a similar representation (an embedding).
40
 
@@ -46,10 +46,13 @@ SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale. Having learned
46
  <em>Figure 2: How CLIP-like models connect images and text/locations.</em>
47
  </div>
48
 
49
- The three models we use have the following architectures and training data:
 
 
50
  | Model | Encoder type | Training data |
51
  | :--- | :--- | :--- |
52
  | SigLIP | image encoder + text encoder | natural image-text pairs from the web |

53
  | FarSLIP | image encoder + text encoder | satellite image-text pairs |
54
  | SatCLIP | image encoder + location encoder | satellite image-location pairs |
55
 
@@ -60,8 +63,8 @@ SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale. Having learned
60
  </div>
61
 
62
  In EarthExplorer:
63
- 1. We convert globally uniformly sampled satellite images into these mathematical embeddings using the image encoders of SigLIP, FarSLIP, and SatCLIP.
64
- 2. When you enter a query, which can be text (e.g. a satellite image of glacier), an image (e.g. a picture of a glacier), or a geographic location (-89, 120), we convert it into an embedding with the corresponding encoder.
65
  3. We then compare your query embedding with the embeddings of all satellite images, visualize the similarities on a map, and show the 5 most similar images.
66
 
67
 
@@ -117,13 +120,9 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
117
 
118
  ## Limitations
119

120
- Although EarthExplorer has great application potential, it also has limitations. The SigLIP model was trained mainly on natural images from the web (photos of people, cats and dogs, cars, everyday objects) rather than on satellite imagery. This mismatch between training data and deployment data means the model may struggle with specific scientific terms or with distinctive geographic features that are rare in ordinary web photos. The FarSLIP model also retrieves poorly for textual descriptions of atypical remote-sensing objects, such as 'an image of face'.
121
-
122
- Future work could use other AI models trained specifically on Earth observation data to improve retrieval accuracy.
123

124
- ## Future work
125
- - Combine time-series imagery to enable global change monitoring
126
- - Add different Earth foundation models and compare their retrieval performance
127
 
128
  ## Acknowledgements
129
  We thank the following open-source projects and datasets that made EarthExplorer possible:
@@ -132,6 +131,7 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
132
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
133
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
134
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model

135

136
  **Datasets:**
137
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable Earth observation datasets by ESA
@@ -142,6 +142,16 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
142
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
143
  - [Weijie Wu](https://github.com/go-bananas-wwj)
144
  - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
 
145
 
146
  ## References
147
  [1] Francis, A., & Czerkawski, M. (2024). Major TOM: Expandable Datasets for Earth Observation. IGARSS 2024.
@@ -155,3 +165,5 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
155
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
156
 
157
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
 
 
 
30
  </div>
31
 
32
  ### Retrieval models
33
+ The core technologies of image retrieval include **CLIP (Contrastive Language-Image Pre-training)** [2] and **DINOv2 (self-supervised vision Transformers)** [7]. We use CLIP's improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5], along with **DINOv2** for purely visual similarity search [7].
34

35
  Imagine teaching a child to recognize objects. You show them a picture of a glacier and say “glacier”. After seeing many glacier photos and hearing the word, the child learns to associate what glaciers look like with the word “glacier”.
36

37
+ CLIP models work in a similar way, but at a much larger scale.
38
  - They use an image encoder to convert an **image** into a mathematical representation (a string of numbers) called an **embedding**.
39
  - They also use a text/location encoder to convert **text** or a **geographic location (latitude-longitude coordinates)** into a similar representation (an embedding).
40
 
 
46
  <em>Figure 2: How CLIP-like models connect images and text/locations.</em>
47
  </div>
48
 
49
+ DINOv2, on the other hand, is a self-supervised vision model that learns rich visual representations without requiring paired text data. It excels at capturing visual patterns and can be used for image-to-image similarity search.
50
+
51
+ The four models we use have the following architectures and training data:
52
  | Model | Encoder type | Training data |
53
  | :--- | :--- | :--- |
54
  | SigLIP | image encoder + text encoder | natural image-text pairs from the web |
55
+ | DINOv2 | image encoder only | natural images from the web (self-supervised) |
56
  | FarSLIP | image encoder + text encoder | satellite image-text pairs |
57
  | SatCLIP | image encoder + location encoder | satellite image-location pairs |
58
 
 
63
  </div>
64
 
65
  In EarthExplorer:
66
+ 1. We convert ~250k globally uniformly sampled satellite images into these mathematical "embeddings" using the image encoders of SigLIP, DINOv2, FarSLIP, and SatCLIP.
67
+ 2. When you enter a query, which can be text (e.g. "a satellite image of glacier"), an image (e.g. a picture of a glacier), or a geographic location (-89, 120), we convert it into an embedding with the corresponding encoder.
68
  3. We then compare your query embedding with the embeddings of all satellite images, visualize the similarities on a map, and show the 5 most similar images.
69
 
70
 
 
120
 
121
  ## Limitations
122

123
+ Although EarthExplorer has great application potential, it also has limitations. The SigLIP model was trained mainly on "natural images" from the web (photos of people, cats and dogs, cars, everyday objects) rather than on satellite imagery. This mismatch between training data and deployment data means the model may struggle with specific scientific terms or with distinctive geographic features that are rare in ordinary web photos.


124

125
+ The FarSLIP model retrieves poorly for textual descriptions of atypical remote-sensing objects, such as 'an image of face'.
 
 
126
 
127
  ## Acknowledgements
128
  We thank the following open-source projects and datasets that made EarthExplorer possible:

131
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
132
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
133
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
134
+ - [DINOv2](https://huggingface.co/facebook/dinov2-large) - Self-supervised vision Transformer
135
 
136
  **Datasets:**
137
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable Earth observation datasets by ESA
 
142
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
143
  - [Weijie Wu](https://github.com/go-bananas-wwj)
144
  - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
145
+ - [Mikolaj Czerkawski](https://mikonvergence.github.io/)
146
+ - [Konstantin Klemmer](https://konstantinklemmer.github.io/)
147
+
148
+ ## Roadmap
149
+ - [x] Support the DINOv2 embedding model and embedding datasets.
150
+ - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
151
+ - [ ] Support FAISS for faster similarity search.
152
+ - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
153
+
154
+ We warmly welcome new contributors!
155
 
156
  ## References
157
  [1] Francis, A., & Czerkawski, M. (2024). Major TOM: Expandable Datasets for Earth Observation. IGARSS 2024.
 
165
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
166
 
167
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
168
+
169
+ [7] Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
app.py CHANGED
@@ -12,6 +12,7 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
12
  from models.siglip_model import SigLIPModel
13
  from models.satclip_model import SatCLIPModel
14
  from models.farslip_model import FarSLIPModel
 
15
  from models.load_config import load_and_process_config
16
  from visualize import format_results_for_gallery, plot_top5_overview, plot_location_distribution, plot_global_map_static, plot_geographic_distribution
17
  from data_utils import download_and_process_image, get_esri_satellite_image, get_placeholder_image
@@ -29,6 +30,19 @@ config = load_and_process_config()
29
  print("Initializing models...")
30
  models = {}
31
32
  # SigLIP
33
  try:
34
  if config and 'siglip' in config:
@@ -396,10 +410,10 @@ def get_initial_plot():
396
  # Use FarSLIP as default for initial plot, fallback to SigLIP
397
  df_vis = None
398
  img = None
399
- if 'FarSLIP' in models and models['FarSLIP'].df_embed is not None:
400
- img, df_vis = plot_global_map_static(models['FarSLIP'].df_embed)
401
  # fig = plot_global_map(models['FarSLIP'].df_embed)
402
- elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
403
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
404
  return gr.update(value=img, visible=True), [img], df_vis, gr.update(visible=False)
405
 
@@ -519,9 +533,9 @@ def reset_to_global_map():
519
  """Reset the map to the initial global distribution view"""
520
  img = None
521
  df_vis = None
522
- if 'FarSLIP' in models and models['FarSLIP'].df_embed is not None:
523
- img, df_vis = plot_global_map_static(models['FarSLIP'].df_embed)
524
- elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
525
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
526
 
527
  return gr.update(value=img, visible=True), [img], df_vis
@@ -609,8 +623,8 @@ with gr.Blocks(title="EarthEmbeddingExplorer") as demo:
609
  <a href="https://www.modelscope.cn/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.cn-xGPU-624aff"></a>
610
  <a href="https://www.modelscope.ai/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.ai-CPU-624aff"></a>
611
  <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer"><img src="https://img.shields.io/badge/Open in HF Space-CPU-FFD21E"></a>
612
- <a href="https://modelscope.cn/studios/VoyagerX/EarthExplorer/file/view/master/Tutorial.md?status=1"> <img src="https://img.shields.io/badge/Tutorial-📖-007bff"> </a>
613
- <a href="https://www.modelscope.cn/learn/3958"> <img src="https://img.shields.io/badge/中文教程-📖-007bff"> </a>
614
  </div>
615
 
616
  """)
@@ -637,7 +651,7 @@ with gr.Blocks(title="EarthEmbeddingExplorer") as demo:
637
  search_btn = gr.Button("Search by Text", variant="primary")
638
 
639
  with gr.TabItem("Image Search") as tab_image:
640
- model_selector_img = gr.Dropdown(choices=["SigLIP", "FarSLIP", "SatCLIP"], value="FarSLIP", label="Model")
641
 
642
  gr.Markdown("### Option 1: Upload or Select Image")
643
  image_input = gr.Image(type="pil", label="Upload Image")
 
12
  from models.siglip_model import SigLIPModel
13
  from models.satclip_model import SatCLIPModel
14
  from models.farslip_model import FarSLIPModel
15
+ from models.dinov2_model import DINOv2Model
16
  from models.load_config import load_and_process_config
17
  from visualize import format_results_for_gallery, plot_top5_overview, plot_location_distribution, plot_global_map_static, plot_geographic_distribution
18
  from data_utils import download_and_process_image, get_esri_satellite_image, get_placeholder_image
 
30
  print("Initializing models...")
31
  models = {}
32
 
33
+ # DINOv2
34
+ try:
35
+ if config and 'dinov2' in config:
36
+ models['DINOv2'] = DINOv2Model(
37
+ ckpt_path=config['dinov2'].get('ckpt_path'),
38
+ embedding_path=config['dinov2'].get('embedding_path'),
39
+ device=device
40
+ )
41
+ else:
42
+ models['DINOv2'] = DINOv2Model(device=device)
43
+ except Exception as e:
44
+ print(f"Failed to load DINOv2: {e}")
45
+
46
  # SigLIP
47
  try:
48
  if config and 'siglip' in config:
 
410
  # Use FarSLIP as default for initial plot, fallback to SigLIP
411
  df_vis = None
412
  img = None
413
+ if 'DINOv2' in models and models['DINOv2'].df_embed is not None:
414
+ img, df_vis = plot_global_map_static(models['DINOv2'].df_embed)
415
  # fig = plot_global_map(models['FarSLIP'].df_embed)
416
+ elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
417
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
418
  return gr.update(value=img, visible=True), [img], df_vis, gr.update(visible=False)
419
 
 
533
  """Reset the map to the initial global distribution view"""
534
  img = None
535
  df_vis = None
536
+ if 'DINOv2' in models and models['DINOv2'].df_embed is not None:
537
+ img, df_vis = plot_global_map_static(models['DINOv2'].df_embed)
538
+ elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
539
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
540
 
541
  return gr.update(value=img, visible=True), [img], df_vis
 
623
  <a href="https://www.modelscope.cn/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.cn-xGPU-624aff"></a>
624
  <a href="https://www.modelscope.ai/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.ai-CPU-624aff"></a>
625
  <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer"><img src="https://img.shields.io/badge/Open in HF Space-CPU-FFD21E"></a>
626
+ <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer/blob/main/Tutorial.md"> <img src="https://img.shields.io/badge/Tutorial-📖-007bff"> </a>
627
+ <a href="https://modelscope.cn/studios/VoyagerX/EarthExplorer/file/view/master/Tutorial_zh.md?status=1"> <img src="https://img.shields.io/badge/中文教程-📖-007bff"> </a>
628
  </div>
629
 
630
  """)
 
651
  search_btn = gr.Button("Search by Text", variant="primary")
652
 
653
  with gr.TabItem("Image Search") as tab_image:
654
+ model_selector_img = gr.Dropdown(choices=["SigLIP", "FarSLIP", "SatCLIP", "DINOv2"], value="FarSLIP", label="Model")
655
 
656
  gr.Markdown("### Option 1: Upload or Select Image")
657
  image_input = gr.Image(type="pil", label="Upload Image")
configs/huggingface.yaml CHANGED
@@ -2,11 +2,14 @@ siglip:
2
  ckpt_path: "hf"
3
  model_name: "ViT-SO400M-14-SigLIP-384"
4
  tokenizer_path: "hf"
5
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/siglip/SigLIP_grid_sample_center_384x384_243k.parquet"
6
  farslip:
7
  ckpt_path: "hf"
8
  model_name: "ViT-B-16"
9
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/farslip/FarSLIP_grid_sample_center_384x384_243k.parquet"
10
  satclip:
11
  ckpt_path: "hf"
12
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/satclip/SatCLIP_grid_sample_center_384x384_243k.parquet"
 
 
 
 
2
  ckpt_path: "hf"
3
  model_name: "ViT-SO400M-14-SigLIP-384"
4
  tokenizer_path: "hf"
5
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/siglip/SigLIP_grid_sample_center_384x384_244k.parquet"
6
  farslip:
7
  ckpt_path: "hf"
8
  model_name: "ViT-B-16"
9
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/farslip/FarSLIP_grid_sample_center_384x384_244k.parquet"
10
  satclip:
11
  ckpt_path: "hf"
12
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/satclip/SatCLIP_grid_sample_center_384x384_244k.parquet"
13
+ dinov2:
14
+ ckpt_path: "hf"
15
+ embedding_path_224: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/dinov2/DINOv2_grid_sample_center_224x224_249k_MajorTOM.parquet"
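For illustration, here is a minimal sketch of how a consumer might read the `dinov2` entry once the YAML is parsed into a dict. The fallback across both key names is an assumption: this file defines `embedding_path_224`, while app.py reads `embedding_path` via `config['dinov2'].get('embedding_path')`.

```python
# Dict literal mirroring the parsed YAML above (as load_and_process_config might return it).
config = {
    "dinov2": {
        "ckpt_path": "hf",
        "embedding_path_224": "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/dinov2/DINOv2_grid_sample_center_224x224_249k_MajorTOM.parquet",
    },
}

dinov2_cfg = config.get("dinov2", {})
ckpt_path = dinov2_cfg.get("ckpt_path", "hf")
# Hypothetical fallback: try `embedding_path` first, then `embedding_path_224`.
embedding_path = dinov2_cfg.get("embedding_path") or dinov2_cfg.get("embedding_path_224")

print(ckpt_path, embedding_path)
```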
images/samples.png CHANGED

Git LFS Details

  • SHA256: 122e7e4c21b01fc14325ce794d5286c0e1abbd6ae3c42cf102907c7e209df65e
  • Pointer size: 132 Bytes
  • Size of remote file: 2.78 MB

Git LFS Details

  • SHA256: f1aa5f91807c95124130f5d37e6e2e9f7095d56ccc1d808c27754bd455983aaf
  • Pointer size: 132 Bytes
  • Size of remote file: 5.63 MB
models/dinov2_model.py ADDED
@@ -0,0 +1,310 @@
1
+ import torch
2
+ from transformers import AutoImageProcessor, AutoModel
3
+ import numpy as np
4
+ import pandas as pd
5
+ import pyarrow.parquet as pq
6
+ import torch.nn.functional as F
7
+ from PIL import Image
8
+ import os
9
+
10
+ class DINOv2Model:
11
+ """
12
+ DINOv2 model wrapper for Sentinel-2 RGB data embedding and search.
13
+
14
+ This class provides a unified interface for:
15
+ - Loading DINOv2 models from local checkpoint or HuggingFace
16
+ - Encoding images into embeddings
17
+ - Loading pre-computed embeddings
18
+ - Searching similar images using cosine similarity
19
+
20
+ The model processes Sentinel-2 RGB data by normalizing it to true-color values
21
+ and generating feature embeddings using the DINOv2 architecture.
22
+ """
23
+
24
+ def __init__(self,
25
+ ckpt_path="./checkpoints/DINOv2",
26
+ model_name="facebook/dinov2-large",
27
+ embedding_path="./embedding_datasets/10percent_dinov2_encoded/all_dinov2_embeddings.parquet",
28
+ device=None):
29
+ """
30
+ Initialize the DINOv2Model.
31
+
32
+ Args:
33
+ ckpt_path (str): Path to local checkpoint directory or 'hf' for HuggingFace
34
+ model_name (str): HuggingFace model name (used when ckpt_path='hf')
35
+ embedding_path (str): Path to pre-computed embeddings parquet file
36
+ device (str): Device to use ('cuda', 'cpu', or None for auto-detection)
37
+ """
38
+ self.device = device if device else ("cuda" if torch.cuda.is_available() else "cpu")
39
+ self.model_name = model_name
40
+ self.ckpt_path = ckpt_path
41
+ self.embedding_path = embedding_path
42
+
43
+ self.model = None
44
+ self.processor = None
45
+ self.df_embed = None
46
+ self.image_embeddings = None
47
+
48
+ # Define the RGB bands for Sentinel-2 (B04, B03, B02)
49
+ self.bands = ['B04', 'B03', 'B02']
50
+ self.size = None
51
+
52
+ self.load_model()
53
+ if self.embedding_path is not None:
54
+ self.load_embeddings()
55
+
56
+ def load_model(self):
57
+ """Load DINOv2 model and processor from local checkpoint or HuggingFace."""
58
+ print(f"Loading DINOv2 model from {self.ckpt_path}...")
59
+ try:
60
+ if self.ckpt_path == 'hf':
61
+ # Load from HuggingFace
62
+ print(f"Loading from HuggingFace: {self.model_name}")
63
+ self.processor = AutoImageProcessor.from_pretrained(self.model_name)
64
+ self.model = AutoModel.from_pretrained(self.model_name)
65
+ elif self.ckpt_path.startswith('ms'):
66
+ # Load from ModelScope
67
+ import modelscope
68
+ self.processor = modelscope.AutoImageProcessor.from_pretrained(self.model_name)
69
+ self.model = modelscope.AutoModel.from_pretrained(self.model_name)
70
+ else:
71
+ self.processor = AutoImageProcessor.from_pretrained(self.ckpt_path)
72
+ self.model = AutoModel.from_pretrained(self.ckpt_path)
73
+
74
+ self.model = self.model.to(self.device)
75
+ self.model.eval()
76
+
77
+ # Extract the input size from the processor settings
78
+ if hasattr(self.processor, 'crop_size'):
79
+ self.size = (self.processor.crop_size['height'], self.processor.crop_size['width'])
80
+ elif hasattr(self.processor, 'size'):
81
+ if isinstance(self.processor.size, dict):
82
+ self.size = (self.processor.size.get('height', 224), self.processor.size.get('width', 224))
83
+ else:
84
+ self.size = (self.processor.size, self.processor.size)
85
+ else:
86
+ self.size = (224, 224)
87
+
88
+ print(f"DINOv2 model loaded on {self.device}, input size: {self.size}")
89
+ except Exception as e:
90
+ print(f"Error loading DINOv2 model: {e}")
91
+
92
+ def load_embeddings(self):
93
+ """Load pre-computed embeddings from parquet file."""
94
+ print(f"Loading DINOv2 embeddings from {self.embedding_path}...")
95
+ try:
96
+ if not os.path.exists(self.embedding_path):
97
+ print(f"Warning: Embedding file not found at {self.embedding_path}")
98
+ return
99
+
100
+ self.df_embed = pq.read_table(self.embedding_path).to_pandas()
101
+
102
+ # Pre-compute image embeddings tensor
103
+ image_embeddings_np = np.stack(self.df_embed['embedding'].values)
104
+ self.image_embeddings = torch.from_numpy(image_embeddings_np).to(self.device).float()
105
+ self.image_embeddings = F.normalize(self.image_embeddings, dim=-1)
106
+ print(f"DINOv2 Data loaded: {len(self.df_embed)} records")
107
+ except Exception as e:
108
+ print(f"Error loading DINOv2 embeddings: {e}")
109
+
110
+ def normalize_s2(self, input_data):
111
+ """
112
+ Normalize Sentinel-2 RGB data to true-color values.
113
+
114
+ Converts raw Sentinel-2 reflectance values to normalized true-color values
115
+ suitable for the DINOv2 model.
116
+
117
+ Args:
118
+ input_data (torch.Tensor or np.ndarray): Raw Sentinel-2 image data
119
+
120
+ Returns:
121
+ torch.Tensor or np.ndarray: Normalized true-color image in range [0, 1]
122
+ """
123
+ return (2.5 * (input_data / 1e4)).clip(0, 1)
124
+
125
+ def encode_image(self, image, is_sentinel2=False):
126
+ """
127
+ Encode an image into a feature embedding.
128
+
129
+ Args:
130
+ image (PIL.Image, torch.Tensor, or np.ndarray): Input image
131
+ - PIL.Image: RGB image
132
+ - torch.Tensor: Image tensor with shape [C, H, W] (Sentinel-2) or [H, W, C]
133
+ - np.ndarray: Image array with shape [H, W, C]
134
+ is_sentinel2 (bool): Whether to apply Sentinel-2 normalization
135
+
136
+ Returns:
137
+ torch.Tensor: Normalized embedding vector with shape [embedding_dim]
138
+ """
139
+ if self.model is None or self.processor is None:
140
+ print("Model not loaded!")
141
+ return None
142
+
143
+ try:
144
+ # Convert to PIL Image if needed
145
+ if isinstance(image, torch.Tensor):
146
+ if is_sentinel2:
147
+ # Sentinel-2 data: [C, H, W] -> normalize -> PIL
148
+ image = self.normalize_s2(image)
149
+ # Convert to [H, W, C] and then to numpy
150
+ if image.shape[0] == 3: # [C, H, W]
151
+ image = image.permute(1, 2, 0)
152
+ image_np = (image.cpu().numpy() * 255).astype(np.uint8)
153
+ image = Image.fromarray(image_np, mode='RGB')
154
+ else:
155
+ # Regular RGB tensor: [H, W, C] or [C, H, W]
156
+ if image.shape[0] == 3: # [C, H, W]
157
+ image = image.permute(1, 2, 0)
158
+ image_np = (image.cpu().numpy() * 255).astype(np.uint8)
159
+ image = Image.fromarray(image_np, mode='RGB')
160
+ elif isinstance(image, np.ndarray):
161
+ if is_sentinel2:
162
+ image = self.normalize_s2(image)
163
+ # Assume [H, W, C] format
164
+ if image.max() <= 1.0:
165
+ image = (image * 255).astype(np.uint8)
166
+ else:
167
+ image = image.astype(np.uint8)
168
+ image = Image.fromarray(image, mode='RGB')
169
+ elif isinstance(image, Image.Image):
170
+ image = image.convert("RGB")
171
+ else:
172
+ raise ValueError(f"Unsupported image type: {type(image)}")
173
+
174
+ # Process image
175
+ inputs = self.processor(images=image, return_tensors="pt")
176
+ pixel_values = inputs['pixel_values'].to(self.device)
177
+
178
+ # Generate embeddings
179
+ with torch.no_grad():
180
+ if self.device == "cuda":
181
+ # with torch.amp.autocast('cuda'): # disable amp as the official embedding is float32
182
+ outputs = self.model(pixel_values)
183
+ else:
184
+ outputs = self.model(pixel_values)
185
+
186
+ # Get embeddings: average across sequence dimension
187
+ last_hidden_states = outputs.last_hidden_state
188
+ image_features = last_hidden_states.mean(dim=1)
189
+
190
+ # # Get embeddings: Use pooler_output (1024-d) to match pre-computed embeddings
191
+ # # If pooler_output is not available, use CLS token (first token)
192
+ # if hasattr(outputs, 'pooler_output') and outputs.pooler_output is not None:
193
+ # image_features = outputs.pooler_output
194
+ # else:
195
+ # # Use CLS token (first token in sequence)
196
+ # last_hidden_states = outputs.last_hidden_state
197
+ # image_features = last_hidden_states[:, 0, :] # [batch_size, hidden_dim]
198
+
199
+ # Normalize
200
+ image_features = F.normalize(image_features, dim=-1)
201
+
202
+ return image_features
203
+
204
+ except Exception as e:
205
+ print(f"Error encoding image: {e}")
206
+ import traceback
207
+ traceback.print_exc()
208
+ return None
209
+
210
+ def search(self, query_features, top_k=5, top_percent=None, threshold=0.0):
211
+ """
212
+ Search for similar images using cosine similarity.
213
+
214
+ Args:
215
+ query_features (torch.Tensor): Query embedding vector
216
+ top_k (int): Number of top results to return
217
+ top_percent (float): If set, use top percentage instead of top_k
218
+ threshold (float): Minimum similarity threshold
219
+
220
+ Returns:
221
+ tuple: (similarities, filtered_indices, top_indices)
222
+ - similarities: Similarity scores for all images
223
+ - filtered_indices: Indices of images above threshold
224
+ - top_indices: Indices of top-k results
225
+ """
226
+ if self.image_embeddings is None:
227
+ print("Embeddings not loaded!")
228
+ return None, None, None
229
+
230
+ try:
231
+ # Ensure query_features is float32 and on correct device
232
+ query_features = query_features.float().to(self.device)
233
+
234
+ # Normalize query features
235
+ query_features = F.normalize(query_features, dim=-1)
236
+
237
+ # Cosine similarity
238
+ similarity = (self.image_embeddings @ query_features.T).squeeze()
239
+ similarities = similarity.detach().cpu().numpy()
240
+
241
+ # Handle top_percent
242
+ if top_percent is not None:
243
+ k = int(len(similarities) * top_percent)
244
+ if k < 1:
245
+ k = 1
246
+ threshold = np.partition(similarities, -k)[-k]
247
+
248
+ # Filter by threshold
249
+ mask = similarities >= threshold
250
+ filtered_indices = np.where(mask)[0]
251
+
252
+ # Get top k
253
+ top_indices = np.argsort(similarities)[-top_k:][::-1]
254
+
255
+ return similarities, filtered_indices, top_indices
256
+
257
+ except Exception as e:
258
+ print(f"Error during search: {e}")
259
+ return None, None, None
260
+
261
+
262
+ # Legacy class for backward compatibility
263
+ class DINOv2_S2RGB_Embedder(torch.nn.Module):
264
+ """
265
+ Legacy embedding wrapper for DINOv2 and Sentinel-2 data.
266
+
267
+ This class is kept for backward compatibility with existing code.
268
+ For new projects, please use DINOv2Model instead.
269
+ """
270
+
271
+ def __init__(self):
272
+ """Initialize the legacy DINOv2_S2RGB_Embedder."""
273
+ super().__init__()
274
+
275
+ # Load the DINOv2 processor and model from Hugging Face
276
+ self.processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
277
+ self.model = AutoModel.from_pretrained('facebook/dinov2-base')
278
+
279
+ # Define the RGB bands for Sentinel-2 (B04, B03, B02)
280
+ self.bands = ['B04', 'B03', 'B02']
281
+
282
+ # Extract the input size from the processor settings
283
+ self.size = self.processor.crop_size['height'], self.processor.crop_size['width']
284
+
285
+ def normalize(self, input):
286
+ """
287
+ Normalize Sentinel-2 RGB data to true-color values.
288
+
289
+ Args:
290
+ input (torch.Tensor): Raw Sentinel-2 image tensor
291
+
292
+ Returns:
293
+ torch.Tensor: Normalized true-color image
294
+ """
295
+ return (2.5 * (input / 1e4)).clip(0, 1)
296
+
297
+ def forward(self, input):
298
+ """
299
+ Forward pass through the model to generate embeddings.
300
+
301
+ Args:
302
+ input (torch.Tensor): Input Sentinel-2 image tensor with shape [C, H, W]
303
+
304
+ Returns:
305
+ torch.Tensor: Embedding vector with shape [embedding_dim]
306
+ """
307
+ model_input = self.processor(self.normalize(input), return_tensors="pt")
308
+ outputs = self.model(model_input['pixel_values'].to(self.model.device))
309
+ last_hidden_states = outputs.last_hidden_state
310
+ return last_hidden_states.mean(dim=1).cpu()
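The `top_percent` branch of `DINOv2Model.search()` above derives the similarity threshold from the k-th largest score via `np.partition`. A standalone sketch with made-up scores:

```python
import numpy as np

# np.partition places the k-th largest score at position -k;
# that value becomes the similarity threshold.
similarities = np.array([0.1, 0.9, 0.4, 0.7, 0.3, 0.8])
top_percent = 0.5                     # keep the top 50% of images

k = max(1, int(len(similarities) * top_percent))
threshold = np.partition(similarities, -k)[-k]
filtered_indices = np.where(similarities >= threshold)[0]

print(threshold, filtered_indices)    # → 0.7 [1 3 5]
```

This avoids a full sort when only the cutoff value is needed; the top-k indices are still obtained separately with `np.argsort`.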
visualize.py CHANGED
@@ -143,21 +143,6 @@ def plot_geographic_distribution(df, scores, threshold, lat_col='centre_lat', lo
143
  ax.add_feature(cfeature.LAND, facecolor='lightgray', alpha=0.2)
144
  ax.add_feature(cfeature.COASTLINE, linewidth=0.5, alpha=0.5)
145
 
146
- # # 1. Plot Background (All points, sampled) to provide context
147
- # if len(df) > 40000:
148
- # df_bg = df.sample(40000)
149
- # else:
150
- # df_bg = df
151
- # ax.scatter(
152
- # df_bg[lon_col],
153
- # df_bg[lat_col],
154
- # s=1,
155
- # c='lightgrey',
156
- # alpha=0.3,
157
- # transform=ccrs.PlateCarree(),
158
- # label='All Samples',
159
- # )
160
-
161
  # 2. Plot Search Results with color map
162
  label_text = f'Top {threshold * 1000:.0f}‰ Matches'
163
  sc = ax.scatter(
@@ -165,7 +150,7 @@ def plot_geographic_distribution(df, scores, threshold, lat_col='centre_lat', lo
165
  df_filtered[lat_col],
166
  c=df_filtered['score'],
167
  cmap='Reds',
168
- s=0.3,
169
  alpha=0.8,
170
  transform=ccrs.PlateCarree(),
171
  label=label_text,
 
143
  ax.add_feature(cfeature.LAND, facecolor='lightgray', alpha=0.2)
144
  ax.add_feature(cfeature.COASTLINE, linewidth=0.5, alpha=0.5)
145
 
146
  # 2. Plot Search Results with color map
147
  label_text = f'Top {threshold * 1000:.0f}‰ Matches'
148
  sc = ax.scatter(
 
150
  df_filtered[lat_col],
151
  c=df_filtered['score'],
152
  cmap='Reds',
153
+ s=0.35,
154
  alpha=0.8,
155
  transform=ccrs.PlateCarree(),
156
  label=label_text,