VoyagerXvoyagerx committed on
Commit
f33c596
·
1 Parent(s): 407bf3f

Support DINOv2

Browse files
Files changed (7) hide show
  1. Tutorial.md +16 -7
  2. Tutorial_zh.md +23 -11
  3. app.py +23 -9
  4. configs/huggingface.yaml +6 -3
  5. images/samples.png +2 -2
  6. models/dinov2_model.py +310 -0
  7. visualize.py +1 -16
Tutorial.md CHANGED
@@ -1,4 +1,4 @@
1
- # Tutorial: EarthEmbeddingExplorer
2
 
3
  ## Background
4
 
@@ -31,7 +31,8 @@ The original tiles in Core-S2L2A are large (1068×1068 pixels), but most AI mode
31
  </div>
32
 
33
  ### Retrieval models
34
- The core of image retrieval is a family of models known as **CLIP (Contrastive Language-Image Pre-training)** [2]. We use its improved variants such as **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5].
 
35
 
36
  An analogy: when teaching a child, you show a picture of a glacier and say “glacier”. After seeing many examples, the child learns to associate the visual concept with the word.
37
 
@@ -47,11 +48,14 @@ The key property is: if an image matches a text description (or location), their
47
  <em>Figure 2: How CLIP-like models connect images and text.</em>
48
  </div>
49
 
50
- The three models we use differ in their encoders and training data:
 
 
51
 
52
  | Model | Encoder type | Training data |
53
  | :--- | :--- | :--- |
54
  | SigLIP | image encoder + text encoder | natural image–text pairs from the web |
 
55
  | FarSLIP | image encoder + text encoder | satellite image–text pairs |
56
  | SatCLIP | image encoder + location encoder | satellite image–location pairs |
57
 
@@ -62,8 +66,8 @@ The three models we use differ in their encoders and training data:
62
  </div>
63
 
64
  In EarthEmbeddingExplorer:
65
- 1. We precompute embeddings for ~22k globally distributed satellite images using SigLIP, FarSLIP, and SatCLIP.
66
- 2. When you provide a query (text like a satellite image of glacier, an image, or a location such as (-89, 120)), we encode the query into an embedding using the corresponding encoder.
67
  3. We compare the query embedding with all image embeddings, visualize similarities on a map, and show the top-5 most similar images.
68
 
69
  ## System architecture
@@ -128,6 +132,7 @@ We thank the following open-source projects and datasets that made EarthEmbeddin
128
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
129
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
130
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
 
131
 
132
  **Datasets:**
133
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable datasets for Earth observation by ESA
@@ -137,11 +142,13 @@ We are grateful to the research communities and organizations that developed and
137
  ## Contributors
138
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
139
  - [Weijie Wu](https://github.com/go-bananas-wwj)
140
- - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
 
 
141
 
142
  ## Roadmap
 
143
  - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
144
- - [ ] Support DINOv2 Embedding model and embedding datasets.
145
  - [ ] Support FAISS for faster similarity search.
146
  - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
147
 
@@ -160,3 +167,5 @@ We warmly welcome new contributors!
160
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
161
 
162
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
 
 
 
1
+ # EarthEmbeddingExplorer
2
 
3
  ## Background
4
 
 
31
  </div>
32
 
33
  ### Retrieval models
34
+ Image retrieval builds on two families of models: **CLIP (Contrastive Language-Image Pre-training)** [2] and the self-supervised **DINOv2** [7]. We use CLIP's improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5], along with **DINOv2** for purely visual similarity search [7].
35
+
36
 
37
  An analogy: when teaching a child, you show a picture of a glacier and say “glacier”. After seeing many examples, the child learns to associate the visual concept with the word.
38
 
 
48
  <em>Figure 2: How CLIP-like models connect images and text.</em>
49
  </div>
50
 
51
+ DINOv2, on the other hand, is a self-supervised vision model that learns rich visual representations without requiring paired text data. It excels at capturing visual patterns and can be used for image-to-image similarity search.
52
+
53
+ The four models we use differ in their encoders and training data:
54
 
55
  | Model | Encoder type | Training data |
56
  | :--- | :--- | :--- |
57
  | SigLIP | image encoder + text encoder | natural image–text pairs from the web |
58
+ | DINOv2 | image encoder only | web-scale natural images (self-supervised) |
59
  | FarSLIP | image encoder + text encoder | satellite image–text pairs |
60
  | SatCLIP | image encoder + location encoder | satellite image–location pairs |
61
 
 
66
  </div>
67
 
68
  In EarthEmbeddingExplorer:
69
+ 1. We precompute embeddings for ~250k globally distributed satellite images using SigLIP, DINOv2, FarSLIP, and SatCLIP.
70
+ 2. When you provide a query (text like "a satellite image of glacier", an image, or a location such as (-89, 120)), we encode the query into an embedding using the corresponding encoder.
71
  3. We compare the query embedding with all image embeddings, visualize similarities on a map, and show the top-5 most similar images.
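Step 3 is a normalized dot product. A minimal numpy sketch of the comparison, using made-up 4-d embeddings (real SigLIP/DINOv2 embeddings are much higher-dimensional):

```python
import numpy as np

# Hypothetical embeddings: 8 images and 1 query, dimension 4.
rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(8, 4))
query = rng.normal(size=(4,))

# Normalize so that a dot product equals cosine similarity.
image_embeds /= np.linalg.norm(image_embeds, axis=1, keepdims=True)
query /= np.linalg.norm(query)

similarities = image_embeds @ query          # one score per image
top5 = np.argsort(similarities)[-5:][::-1]   # indices of the 5 best matches

print(top5, similarities[top5])
```

The same scores can be plotted per image location to produce the similarity map described above.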
72
 
73
  ## System architecture
 
132
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
133
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
134
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
135
+ - [DINOv2](https://huggingface.co/facebook/dinov2-large) - Self-supervised vision transformer
136
 
137
  **Datasets:**
138
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable datasets for Earth observation by ESA
 
142
  ## Contributors
143
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
144
  - [Weijie Wu](https://github.com/go-bananas-wwj)
145
+ - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
146
+ - [Mikolaj Czerkawski](https://mikonvergence.github.io/)
147
+ - [Konstantin Klemmer](https://konstantinklemmer.github.io/)
148
 
149
  ## Roadmap
150
+ - [x] Support the DINOv2 embedding model and embedding datasets.
151
  - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
 
152
  - [ ] Support FAISS for faster similarity search.
153
  - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
154
 
 
167
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
168
 
169
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
170
+
171
+ [7] Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
Tutorial_zh.md CHANGED
@@ -30,11 +30,11 @@ The original satellite images in Core-S2L2A are large (1068x1068 pixels), but AI
30
  </div>
31
 
32
  ### Retrieval models
33
- The core technology of image retrieval is an AI model called **CLIP (Contrastive Language-Image Pre-training)** [2]; we use its improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5].
34

35
  Imagine teaching a child to recognize objects. You show them a picture of a glacier and say “glacier”. After seeing many glacier photos and hearing the word, the child learns to associate what glaciers look like with the word “glacier”.
36

37
- SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale: having learned from millions of image-text or image-location pairs, they understand the relationship between images and text/locations.
38
  - They use an image encoder to convert an **image** into a mathematical representation (a string of numbers) called an **embedding**.
39
  - They also use a text/location encoder to convert **text** or a **geographic location (latitude-longitude coordinates)** into a similar representation (an embedding).
40
 
@@ -46,10 +46,13 @@ SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale. Having learned
46
  <em>Figure 2: How CLIP-like models connect images and text/locations.</em>
47
  </div>
48
 
49
- The three models we use have the following architectures and training data:
 
 
50
  | Model | Encoder type | Training data |
51
  | :--- | :--- | :--- |
52
  | SigLIP | image encoder + text encoder | natural image-text pairs from the web |

53
  | FarSLIP | image encoder + text encoder | satellite image-text pairs |
54
  | SatCLIP | image encoder + location encoder | satellite image-location pairs |
55
 
@@ -60,8 +63,8 @@ SigLIP/FarSLIP/SatCLIP work in a similar way, but at a much larger scale. Having learned
60
  </div>
61
 
62
  In EarthExplorer:
63
- 1. We convert globally uniformly sampled satellite images into these mathematical embeddings using the image encoders of SigLIP, FarSLIP, and SatCLIP.
64
- 2. When you enter a query, which can be text (e.g. a satellite image of glacier), an image (e.g. a picture of a glacier), or a geographic location (-89, 120), we convert it into an embedding with the corresponding encoder.
65
  3. We then compare your query embedding with the embeddings of all satellite images, visualize the similarities on a map, and show the 5 most similar images.
66
 
67
 
@@ -117,13 +120,9 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
117
 
118
  ## Limitations
119

120
- Although EarthExplorer has great application potential, it also has limitations. The SigLIP model was trained mainly on natural images from the web (photos of people, cats and dogs, cars, everyday objects) rather than on satellite imagery. This mismatch between training data and deployment data means the model may struggle with specific scientific terms or with distinctive geographic features that are rare in ordinary web photos. The FarSLIP model also retrieves poorly for textual descriptions of atypical remote-sensing objects, such as 'an image of face'.
121
-
122
- Future work could use other AI models trained specifically on Earth observation data to improve retrieval accuracy.
123

124
- ## Future work
125
- - Combine time-series imagery to enable global change monitoring
126
- - Add different Earth foundation models and compare their retrieval performance
127
 
128
  ## Acknowledgements
129
  We thank the following open-source projects and datasets that made EarthExplorer possible:
@@ -132,6 +131,7 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
132
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
133
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
134
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model

135

136
  **Datasets:**
137
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable Earth observation datasets by ESA
@@ -142,6 +142,16 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
142
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
143
  - [Weijie Wu](https://github.com/go-bananas-wwj)
144
  - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
 
145
 
146
  ## References
147
  [1] Francis, A., & Czerkawski, M. (2024). Major TOM: Expandable Datasets for Earth Observation. IGARSS 2024.
@@ -155,3 +165,5 @@ The raw imagery of MajorTOM Core-S2L2A is very large (about 23TB), stored in **Parquet
155
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
156
 
157
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
 
 
 
30
  </div>
31
 
32
  ### Retrieval models
33
+ The core technologies of image retrieval include **CLIP (Contrastive Language-Image Pre-training)** [2] and **DINOv2 (self-supervised vision Transformers)** [7]. We use CLIP's improved variants **SigLIP (Sigmoid Language-Image Pre-training)** [3], **FarSLIP (Fine-grained Aligned Remote Sensing Language Image Pretraining)** [4], and **SatCLIP (Satellite Location-Image Pretraining)** [5], along with **DINOv2** for purely visual similarity search [7].
34

35
  Imagine teaching a child to recognize objects. You show them a picture of a glacier and say “glacier”. After seeing many glacier photos and hearing the word, the child learns to associate what glaciers look like with the word “glacier”.
36

37
+ CLIP models work in a similar way, but at a much larger scale.
38
  - They use an image encoder to convert an **image** into a mathematical representation (a string of numbers) called an **embedding**.
39
  - They also use a text/location encoder to convert **text** or a **geographic location (latitude-longitude coordinates)** into a similar representation (an embedding).
40
 
 
46
  <em>Figure 2: How CLIP-like models connect images and text/locations.</em>
47
  </div>
48
 
49
+ DINOv2, on the other hand, is a self-supervised vision model that learns rich visual representations without requiring paired text data. It excels at capturing visual patterns and can be used for image-to-image similarity search.
50
+
51
+ The four models we use have the following architectures and training data:
52
  | Model | Encoder type | Training data |
53
  | :--- | :--- | :--- |
54
  | SigLIP | image encoder + text encoder | natural image-text pairs from the web |
55
+ | DINOv2 | image encoder only | natural images from the web (self-supervised) |
56
  | FarSLIP | image encoder + text encoder | satellite image-text pairs |
57
  | SatCLIP | image encoder + location encoder | satellite image-location pairs |
58
 
 
63
  </div>
64
 
65
  In EarthExplorer:
66
+ 1. We convert ~250k globally uniformly sampled satellite images into these mathematical "embeddings" using the image encoders of SigLIP, DINOv2, FarSLIP, and SatCLIP.
67
+ 2. When you enter a query, which can be text (e.g. "a satellite image of glacier"), an image (e.g. a picture of a glacier), or a geographic location (-89, 120), we convert it into an embedding with the corresponding encoder.
68
  3. We then compare your query embedding with the embeddings of all satellite images, visualize the similarities on a map, and show the 5 most similar images.
69
 
70
 
 
120
 
121
  ## Limitations
122

123
+ Although EarthExplorer has great application potential, it also has limitations. The SigLIP model was trained mainly on "natural images" from the web (photos of people, cats and dogs, cars, everyday objects) rather than on satellite imagery. This mismatch between training data and deployment data means the model may struggle with specific scientific terms or with distinctive geographic features that are rare in ordinary web photos.


124

125
+ The FarSLIP model retrieves poorly for textual descriptions of atypical remote-sensing objects, such as 'an image of face'.
 
 
126
 
127
  ## Acknowledgements
128
  We thank the following open-source projects and datasets that made EarthExplorer possible:

131
  - [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) - Vision Transformer model for image-text alignment
132
  - [FarSLIP](https://github.com/NJU-LHRS/FarSLIP) - Fine-grained satellite image-text pretraining model
133
  - [SatCLIP](https://github.com/microsoft/satclip) - Satellite location-image pretraining model
134
+ - [DINOv2](https://huggingface.co/facebook/dinov2-large) - Self-supervised vision Transformer
135
 
136
  **Datasets:**
137
  - [MajorTOM](https://github.com/ESA-PhiLab/MajorTOM) - Expandable Earth observation datasets by ESA
 
142
  - [Yijie Zheng](https://voyagerxvoyagerx.github.io/)
143
  - [Weijie Wu](https://github.com/go-bananas-wwj)
144
  - [Bingyue Wu](https://brynn-wu.github.io/Brynn-Wu)
145
+ - [Mikolaj Czerkawski](https://mikonvergence.github.io/)
146
+ - [Konstantin Klemmer](https://konstantinklemmer.github.io/)
147
+
148
+ ## Roadmap
149
+ - [x] Support the DINOv2 embedding model and embedding datasets.
150
+ - [ ] Increase the geographical coverage (sample rate) to 1.2% of the Earth's land surface.
151
+ - [ ] Support FAISS for faster similarity search.
152
+ - [ ] What features do you want? Leave an issue [here](https://huggingface.co/spaces/ML4Sustain/EarthExplorer/discussions)!
153
+
154
+ We warmly welcome new contributors!
155
 
156
  ## References
157
  [1] Francis, A., & Czerkawski, M. (2024). Major TOM: Expandable Datasets for Earth Observation. IGARSS 2024.
 
165
  [5] Klemmer, K. et al. (2025). SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. AAAI 2025.
166
 
167
  [6] Czerkawski, M., Kluczek, M., & Bojanowski, J. S. (2024). Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space. arXiv preprint arXiv:2412.05600.
168
+
169
+ [7] Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
app.py CHANGED
@@ -12,6 +12,7 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
12
  from models.siglip_model import SigLIPModel
13
  from models.satclip_model import SatCLIPModel
14
  from models.farslip_model import FarSLIPModel
 
15
  from models.load_config import load_and_process_config
16
  from visualize import format_results_for_gallery, plot_top5_overview, plot_location_distribution, plot_global_map_static, plot_geographic_distribution
17
  from data_utils import download_and_process_image, get_esri_satellite_image, get_placeholder_image
@@ -29,6 +30,19 @@ config = load_and_process_config()
29
  print("Initializing models...")
30
  models = {}
31
32
  # SigLIP
33
  try:
34
  if config and 'siglip' in config:
@@ -396,10 +410,10 @@ def get_initial_plot():
396
  # Use FarSLIP as default for initial plot, fallback to SigLIP
397
  df_vis = None
398
  img = None
399
- if 'FarSLIP' in models and models['FarSLIP'].df_embed is not None:
400
- img, df_vis = plot_global_map_static(models['FarSLIP'].df_embed)
401
  # fig = plot_global_map(models['FarSLIP'].df_embed)
402
- elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
403
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
404
  return gr.update(value=img, visible=True), [img], df_vis, gr.update(visible=False)
405
 
@@ -519,9 +533,9 @@ def reset_to_global_map():
519
  """Reset the map to the initial global distribution view"""
520
  img = None
521
  df_vis = None
522
- if 'FarSLIP' in models and models['FarSLIP'].df_embed is not None:
523
- img, df_vis = plot_global_map_static(models['FarSLIP'].df_embed)
524
- elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
525
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
526
 
527
  return gr.update(value=img, visible=True), [img], df_vis
@@ -609,8 +623,8 @@ with gr.Blocks(title="EarthEmbeddingExplorer") as demo:
609
  <a href="https://www.modelscope.cn/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.cn-xGPU-624aff"></a>
610
  <a href="https://www.modelscope.ai/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.ai-CPU-624aff"></a>
611
  <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer"><img src="https://img.shields.io/badge/Open in HF Space-CPU-FFD21E"></a>
612
- <a href="https://modelscope.cn/studios/VoyagerX/EarthExplorer/file/view/master/Tutorial.md?status=1"> <img src="https://img.shields.io/badge/Tutorial-📖-007bff"> </a>
613
- <a href="https://www.modelscope.cn/learn/3958"> <img src="https://img.shields.io/badge/中文教程-📖-007bff"> </a>
614
  </div>
615
 
616
  """)
@@ -637,7 +651,7 @@ with gr.Blocks(title="EarthEmbeddingExplorer") as demo:
637
  search_btn = gr.Button("Search by Text", variant="primary")
638
 
639
  with gr.TabItem("Image Search") as tab_image:
640
- model_selector_img = gr.Dropdown(choices=["SigLIP", "FarSLIP", "SatCLIP"], value="FarSLIP", label="Model")
641
 
642
  gr.Markdown("### Option 1: Upload or Select Image")
643
  image_input = gr.Image(type="pil", label="Upload Image")
 
12
  from models.siglip_model import SigLIPModel
13
  from models.satclip_model import SatCLIPModel
14
  from models.farslip_model import FarSLIPModel
15
+ from models.dinov2_model import DINOv2Model
16
  from models.load_config import load_and_process_config
17
  from visualize import format_results_for_gallery, plot_top5_overview, plot_location_distribution, plot_global_map_static, plot_geographic_distribution
18
  from data_utils import download_and_process_image, get_esri_satellite_image, get_placeholder_image
 
30
  print("Initializing models...")
31
  models = {}
32
 
33
+ # DINOv2
34
+ try:
35
+ if config and 'dinov2' in config:
36
+ models['DINOv2'] = DINOv2Model(
37
+ ckpt_path=config['dinov2'].get('ckpt_path'),
38
+ embedding_path=config['dinov2'].get('embedding_path'),
39
+ device=device
40
+ )
41
+ else:
42
+ models['DINOv2'] = DINOv2Model(device=device)
43
+ except Exception as e:
44
+ print(f"Failed to load DINOv2: {e}")
45
+
46
  # SigLIP
47
  try:
48
  if config and 'siglip' in config:
 
410
  # Use FarSLIP as default for initial plot, fallback to SigLIP
411
  df_vis = None
412
  img = None
413
+ if 'DINOv2' in models and models['DINOv2'].df_embed is not None:
414
+ img, df_vis = plot_global_map_static(models['DINOv2'].df_embed)
415
  # fig = plot_global_map(models['FarSLIP'].df_embed)
416
+ elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
417
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
418
  return gr.update(value=img, visible=True), [img], df_vis, gr.update(visible=False)
419
 
 
533
  """Reset the map to the initial global distribution view"""
534
  img = None
535
  df_vis = None
536
+ if 'DINOv2' in models and models['DINOv2'].df_embed is not None:
537
+ img, df_vis = plot_global_map_static(models['DINOv2'].df_embed)
538
+ elif 'SigLIP' in models and models['SigLIP'].df_embed is not None:
539
  img, df_vis = plot_global_map_static(models['SigLIP'].df_embed)
540
 
541
  return gr.update(value=img, visible=True), [img], df_vis
 
623
  <a href="https://www.modelscope.cn/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.cn-xGPU-624aff"></a>
624
  <a href="https://www.modelscope.ai/studios/VoyagerX/EarthExplorer"><img src="https://img.shields.io/badge/Open in ModelScope.ai-CPU-624aff"></a>
625
  <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer"><img src="https://img.shields.io/badge/Open in HF Space-CPU-FFD21E"></a>
626
+ <a href="https://huggingface.co/spaces/ML4Sustain/EarthExplorer/blob/main/Tutorial.md"> <img src="https://img.shields.io/badge/Tutorial-📖-007bff"> </a>
627
+ <a href="https://modelscope.cn/studios/VoyagerX/EarthExplorer/file/view/master/Tutorial_zh.md?status=1"> <img src="https://img.shields.io/badge/中文教程-📖-007bff"> </a>
628
  </div>
629
 
630
  """)
 
651
  search_btn = gr.Button("Search by Text", variant="primary")
652
 
653
  with gr.TabItem("Image Search") as tab_image:
654
+ model_selector_img = gr.Dropdown(choices=["SigLIP", "FarSLIP", "SatCLIP", "DINOv2"], value="FarSLIP", label="Model")
655
 
656
  gr.Markdown("### Option 1: Upload or Select Image")
657
  image_input = gr.Image(type="pil", label="Upload Image")
configs/huggingface.yaml CHANGED
@@ -2,11 +2,14 @@ siglip:
2
  ckpt_path: "hf"
3
  model_name: "ViT-SO400M-14-SigLIP-384"
4
  tokenizer_path: "hf"
5
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/siglip/SigLIP_grid_sample_center_384x384_243k.parquet"
6
  farslip:
7
  ckpt_path: "hf"
8
  model_name: "ViT-B-16"
9
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/farslip/FarSLIP_grid_sample_center_384x384_243k.parquet"
10
  satclip:
11
  ckpt_path: "hf"
12
- embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/satclip/SatCLIP_grid_sample_center_384x384_243k.parquet"
 
 
 
 
2
  ckpt_path: "hf"
3
  model_name: "ViT-SO400M-14-SigLIP-384"
4
  tokenizer_path: "hf"
5
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/siglip/SigLIP_grid_sample_center_384x384_244k.parquet"
6
  farslip:
7
  ckpt_path: "hf"
8
  model_name: "ViT-B-16"
9
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/farslip/FarSLIP_grid_sample_center_384x384_244k.parquet"
10
  satclip:
11
  ckpt_path: "hf"
12
+ embedding_path: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/satclip/SatCLIP_grid_sample_center_384x384_244k.parquet"
13
+ dinov2:
14
+ ckpt_path: "hf"
15
+ embedding_path_224: "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/dinov2/DINOv2_grid_sample_center_224x224_249k_MajorTOM.parquet"
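For illustration, here is a minimal sketch of how a consumer might read the `dinov2` entry once the YAML is parsed into a dict. The fallback across both key names is an assumption: this file defines `embedding_path_224`, while app.py reads `embedding_path` via `config['dinov2'].get('embedding_path')`.

```python
# Dict literal mirroring the parsed YAML above (as load_and_process_config might return it).
config = {
    "dinov2": {
        "ckpt_path": "hf",
        "embedding_path_224": "hf://ML4Sustain/EarthEmbeddings/uniform_sample_250k/dinov2/DINOv2_grid_sample_center_224x224_249k_MajorTOM.parquet",
    },
}

dinov2_cfg = config.get("dinov2", {})
ckpt_path = dinov2_cfg.get("ckpt_path", "hf")
# Hypothetical fallback: try `embedding_path` first, then `embedding_path_224`.
embedding_path = dinov2_cfg.get("embedding_path") or dinov2_cfg.get("embedding_path_224")

print(ckpt_path, embedding_path)
```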
images/samples.png CHANGED

Git LFS Details

  • SHA256: 122e7e4c21b01fc14325ce794d5286c0e1abbd6ae3c42cf102907c7e209df65e
  • Pointer size: 132 Bytes
  • Size of remote file: 2.78 MB

Git LFS Details

  • SHA256: f1aa5f91807c95124130f5d37e6e2e9f7095d56ccc1d808c27754bd455983aaf
  • Pointer size: 132 Bytes
  • Size of remote file: 5.63 MB
models/dinov2_model.py ADDED
@@ -0,0 +1,310 @@
1
+ import torch
2
+ from transformers import AutoImageProcessor, AutoModel
3
+ import numpy as np
4
+ import pandas as pd
5
+ import pyarrow.parquet as pq
6
+ import torch.nn.functional as F
7
+ from PIL import Image
8
+ import os
9
+
10
+ class DINOv2Model:
11
+ """
12
+ DINOv2 model wrapper for Sentinel-2 RGB data embedding and search.
13
+
14
+ This class provides a unified interface for:
15
+ - Loading DINOv2 models from local checkpoint or HuggingFace
16
+ - Encoding images into embeddings
17
+ - Loading pre-computed embeddings
18
+ - Searching similar images using cosine similarity
19
+
20
+ The model processes Sentinel-2 RGB data by normalizing it to true-color values
21
+ and generating feature embeddings using the DINOv2 architecture.
22
+ """
23
+
24
+ def __init__(self,
25
+ ckpt_path="./checkpoints/DINOv2",
26
+ model_name="facebook/dinov2-large",
27
+ embedding_path="./embedding_datasets/10percent_dinov2_encoded/all_dinov2_embeddings.parquet",
28
+ device=None):
29
+ """
30
+ Initialize the DINOv2Model.
31
+
32
+ Args:
33
+ ckpt_path (str): Path to local checkpoint directory or 'hf' for HuggingFace
34
+ model_name (str): HuggingFace model name (used when ckpt_path='hf')
35
+ embedding_path (str): Path to pre-computed embeddings parquet file
36
+ device (str): Device to use ('cuda', 'cpu', or None for auto-detection)
37
+ """
38
+ self.device = device if device else ("cuda" if torch.cuda.is_available() else "cpu")
39
+ self.model_name = model_name
40
+ self.ckpt_path = ckpt_path
41
+ self.embedding_path = embedding_path
42
+
43
+ self.model = None
44
+ self.processor = None
45
+ self.df_embed = None
46
+ self.image_embeddings = None
47
+
48
+ # Define the RGB bands for Sentinel-2 (B04, B03, B02)
49
+ self.bands = ['B04', 'B03', 'B02']
50
+ self.size = None
51
+
52
+ self.load_model()
53
+ if self.embedding_path is not None:
54
+ self.load_embeddings()
55
+
56
+ def load_model(self):
57
+ """Load DINOv2 model and processor from local checkpoint or HuggingFace."""
58
+ print(f"Loading DINOv2 model from {self.ckpt_path}...")
59
+ try:
60
+ if self.ckpt_path == 'hf':
61
+ # Load from HuggingFace
62
+ print(f"Loading from HuggingFace: {self.model_name}")
63
+ self.processor = AutoImageProcessor.from_pretrained(self.model_name)
64
+ self.model = AutoModel.from_pretrained(self.model_name)
65
+ elif self.ckpt_path.startswith('ms'):
66
+ # Load from ModelScope
67
+ import modelscope
68
+ self.processor = modelscope.AutoImageProcessor.from_pretrained(self.model_name)
69
+ self.model = modelscope.AutoModel.from_pretrained(self.model_name)
70
+ else:
71
+ self.processor = AutoImageProcessor.from_pretrained(self.ckpt_path)
72
+ self.model = AutoModel.from_pretrained(self.ckpt_path)
73
+
74
+ self.model = self.model.to(self.device)
75
+ self.model.eval()
76
+
77
+ # Extract the input size from the processor settings
78
+ if hasattr(self.processor, 'crop_size'):
79
+ self.size = (self.processor.crop_size['height'], self.processor.crop_size['width'])
80
+ elif hasattr(self.processor, 'size'):
81
+ if isinstance(self.processor.size, dict):
82
+ self.size = (self.processor.size.get('height', 224), self.processor.size.get('width', 224))
83
+ else:
84
+ self.size = (self.processor.size, self.processor.size)
85
+ else:
86
+ self.size = (224, 224)
87
+
88
+ print(f"DINOv2 model loaded on {self.device}, input size: {self.size}")
89
+ except Exception as e:
90
+ print(f"Error loading DINOv2 model: {e}")
91
+
92
+ def load_embeddings(self):
93
+ """Load pre-computed embeddings from parquet file."""
94
+ print(f"Loading DINOv2 embeddings from {self.embedding_path}...")
95
+ try:
96
+ if not os.path.exists(self.embedding_path):
97
+ print(f"Warning: Embedding file not found at {self.embedding_path}")
98
+ return
99
+
100
+ self.df_embed = pq.read_table(self.embedding_path).to_pandas()
101
+
102
+ # Pre-compute image embeddings tensor
103
+ image_embeddings_np = np.stack(self.df_embed['embedding'].values)
104
+ self.image_embeddings = torch.from_numpy(image_embeddings_np).to(self.device).float()
105
+ self.image_embeddings = F.normalize(self.image_embeddings, dim=-1)
106
+ print(f"DINOv2 Data loaded: {len(self.df_embed)} records")
107
+ except Exception as e:
108
+ print(f"Error loading DINOv2 embeddings: {e}")
109
+
110
+ def normalize_s2(self, input_data):
111
+ """
112
+ Normalize Sentinel-2 RGB data to true-color values.
113
+
114
+ Converts raw Sentinel-2 reflectance values to normalized true-color values
115
+ suitable for the DINOv2 model.
116
+
117
+ Args:
118
+ input_data (torch.Tensor or np.ndarray): Raw Sentinel-2 image data
119
+
120
+ Returns:
121
+ torch.Tensor or np.ndarray: Normalized true-color image in range [0, 1]
122
+ """
123
+ return (2.5 * (input_data / 1e4)).clip(0, 1)
124
+
125
+ def encode_image(self, image, is_sentinel2=False):
126
+ """
127
+ Encode an image into a feature embedding.
128
+
129
+ Args:
130
+ image (PIL.Image, torch.Tensor, or np.ndarray): Input image
131
+ - PIL.Image: RGB image
132
+ - torch.Tensor: Image tensor with shape [C, H, W] (Sentinel-2) or [H, W, C]
133
+ - np.ndarray: Image array with shape [H, W, C]
134
+ is_sentinel2 (bool): Whether to apply Sentinel-2 normalization
135
+
136
+ Returns:
137
+ torch.Tensor: Normalized embedding vector with shape [embedding_dim]
138
+ """
139
+ if self.model is None or self.processor is None:
140
+ print("Model not loaded!")
141
+ return None
142
+
143
+ try:
144
+ # Convert to PIL Image if needed
145
+ if isinstance(image, torch.Tensor):
146
+ if is_sentinel2:
147
+ # Sentinel-2 data: [C, H, W] -> normalize -> PIL
148
+ image = self.normalize_s2(image)
149
+ # Convert to [H, W, C] and then to numpy
150
+ if image.shape[0] == 3: # [C, H, W]
151
+ image = image.permute(1, 2, 0)
152
+ image_np = (image.cpu().numpy() * 255).astype(np.uint8)
153
+ image = Image.fromarray(image_np, mode='RGB')
154
+ else:
155
+ # Regular RGB tensor: [H, W, C] or [C, H, W]
156
+ if image.shape[0] == 3: # [C, H, W]
157
+ image = image.permute(1, 2, 0)
158
+ image_np = (image.cpu().numpy() * 255).astype(np.uint8)
159
+ image = Image.fromarray(image_np, mode='RGB')
160
+ elif isinstance(image, np.ndarray):
161
+ if is_sentinel2:
162
+ image = self.normalize_s2(image)
163
+ # Assume [H, W, C] format
164
+ if image.max() <= 1.0:
165
+ image = (image * 255).astype(np.uint8)
166
+ else:
167
+ image = image.astype(np.uint8)
168
+ image = Image.fromarray(image, mode='RGB')
169
+ elif isinstance(image, Image.Image):
170
+ image = image.convert("RGB")
171
+ else:
172
+ raise ValueError(f"Unsupported image type: {type(image)}")
173
+
174
+ # Process image
175
+ inputs = self.processor(images=image, return_tensors="pt")
176
+ pixel_values = inputs['pixel_values'].to(self.device)
177
+
178
+ # Generate embeddings
179
+ with torch.no_grad():
180
+ if self.device == "cuda":
181
+ # with torch.amp.autocast('cuda'): # disable amp as the official embedding is float32
182
+ outputs = self.model(pixel_values)
183
+ else:
184
+ outputs = self.model(pixel_values)
185
+
186
+ # Get embeddings: average across sequence dimension
187
+ last_hidden_states = outputs.last_hidden_state
188
+ image_features = last_hidden_states.mean(dim=1)
189
+
190
+ # # Get embeddings: Use pooler_output (1024-d) to match pre-computed embeddings
191
+ # # If pooler_output is not available, use CLS token (first token)
192
+ # if hasattr(outputs, 'pooler_output') and outputs.pooler_output is not None:
193
+ # image_features = outputs.pooler_output
194
+ # else:
195
+ # # Use CLS token (first token in sequence)
196
+ # last_hidden_states = outputs.last_hidden_state
197
+ # image_features = last_hidden_states[:, 0, :] # [batch_size, hidden_dim]
198
+
199
+ # Normalize
200
+ image_features = F.normalize(image_features, dim=-1)
201
+
202
+ return image_features
203
+
204
+ except Exception as e:
205
+ print(f"Error encoding image: {e}")
206
+ import traceback
207
+ traceback.print_exc()
208
+ return None
209
+
210
+ def search(self, query_features, top_k=5, top_percent=None, threshold=0.0):
211
+ """
212
+ Search for similar images using cosine similarity.
213
+
214
+ Args:
215
+ query_features (torch.Tensor): Query embedding vector
216
+ top_k (int): Number of top results to return
217
+ top_percent (float): If set, use top percentage instead of top_k
218
+ threshold (float): Minimum similarity threshold
219
+
220
+ Returns:
221
+ tuple: (similarities, filtered_indices, top_indices)
222
+ - similarities: Similarity scores for all images
223
+ - filtered_indices: Indices of images above threshold
224
+ - top_indices: Indices of top-k results
225
+ """
226
+ if self.image_embeddings is None:
227
+ print("Embeddings not loaded!")
228
+ return None, None, None
229
+
230
+ try:
231
+ # Ensure query_features is float32 and on correct device
232
+ query_features = query_features.float().to(self.device)
233
+
234
+ # Normalize query features
235
+ query_features = F.normalize(query_features, dim=-1)
236
+
237
+ # Cosine similarity
238
+ similarity = (self.image_embeddings @ query_features.T).squeeze()
239
+ similarities = similarity.detach().cpu().numpy()
240
+
241
+ # Handle top_percent
242
+ if top_percent is not None:
243
+ k = int(len(similarities) * top_percent)
244
+ if k < 1:
245
+ k = 1
246
+ threshold = np.partition(similarities, -k)[-k]
247
+
248
+ # Filter by threshold
249
+ mask = similarities >= threshold
250
+ filtered_indices = np.where(mask)[0]
251
+
252
+ # Get top k
253
+ top_indices = np.argsort(similarities)[-top_k:][::-1]
254
+
255
+ return similarities, filtered_indices, top_indices
256
+
257
+ except Exception as e:
258
+ print(f"Error during search: {e}")
259
+ return None, None, None
260
+
261
+
262
+ # Legacy class for backward compatibility
263
+ class DINOv2_S2RGB_Embedder(torch.nn.Module):
264
+ """
265
+ Legacy embedding wrapper for DINOv2 and Sentinel-2 data.
266
+
267
+ This class is kept for backward compatibility with existing code.
268
+ For new projects, please use DINOv2Model instead.
269
+ """
270
+
271
+ def __init__(self):
272
+ """Initialize the legacy DINOv2_S2RGB_Embedder."""
273
+ super().__init__()
274
+
275
+ # Load the DINOv2 processor and model from Hugging Face
276
+ self.processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
277
+ self.model = AutoModel.from_pretrained('facebook/dinov2-base')
278
+
279
+ # Define the RGB bands for Sentinel-2 (B04, B03, B02)
280
+ self.bands = ['B04', 'B03', 'B02']
281
+
282
+ # Extract the input size from the processor settings
283
+ self.size = self.processor.crop_size['height'], self.processor.crop_size['width']
284
+
285
+ def normalize(self, input):
286
+ """
287
+ Normalize Sentinel-2 RGB data to true-color values.
288
+
289
+ Args:
290
+ input (torch.Tensor): Raw Sentinel-2 image tensor
291
+
292
+ Returns:
293
+ torch.Tensor: Normalized true-color image
294
+ """
295
+ return (2.5 * (input / 1e4)).clip(0, 1)
296
+
297
+ def forward(self, input):
298
+ """
299
+ Forward pass through the model to generate embeddings.
300
+
301
+ Args:
302
+ input (torch.Tensor): Input Sentinel-2 image tensor with shape [C, H, W]
303
+
304
+ Returns:
305
+ torch.Tensor: Embedding vector with shape [embedding_dim]
306
+ """
307
+ model_input = self.processor(self.normalize(input), return_tensors="pt")
308
+ outputs = self.model(model_input['pixel_values'].to(self.model.device))
309
+ last_hidden_states = outputs.last_hidden_state
310
+ return last_hidden_states.mean(dim=1).cpu()
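The `top_percent` branch of `DINOv2Model.search()` above derives the similarity threshold from the k-th largest score via `np.partition`. A standalone sketch with made-up scores:

```python
import numpy as np

# np.partition places the k-th largest score at position -k;
# that value becomes the similarity threshold.
similarities = np.array([0.1, 0.9, 0.4, 0.7, 0.3, 0.8])
top_percent = 0.5                     # keep the top 50% of images

k = max(1, int(len(similarities) * top_percent))
threshold = np.partition(similarities, -k)[-k]
filtered_indices = np.where(similarities >= threshold)[0]

print(threshold, filtered_indices)    # → 0.7 [1 3 5]
```

This avoids a full sort when only the cutoff value is needed; the top-k indices are still obtained separately with `np.argsort`.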
visualize.py CHANGED
@@ -143,21 +143,6 @@ def plot_geographic_distribution(df, scores, threshold, lat_col='centre_lat', lo
143
  ax.add_feature(cfeature.LAND, facecolor='lightgray', alpha=0.2)
144
  ax.add_feature(cfeature.COASTLINE, linewidth=0.5, alpha=0.5)
145
 
146
- # # 1. Plot Background (All points, sampled) to provide context
147
- # if len(df) > 40000:
148
- # df_bg = df.sample(40000)
149
- # else:
150
- # df_bg = df
151
- # ax.scatter(
152
- # df_bg[lon_col],
153
- # df_bg[lat_col],
154
- # s=1,
155
- # c='lightgrey',
156
- # alpha=0.3,
157
- # transform=ccrs.PlateCarree(),
158
- # label='All Samples',
159
- # )
160
-
161
  # 2. Plot Search Results with color map
162
  label_text = f'Top {threshold * 1000:.0f}‰ Matches'
163
  sc = ax.scatter(
@@ -165,7 +150,7 @@ def plot_geographic_distribution(df, scores, threshold, lat_col='centre_lat', lo
165
  df_filtered[lat_col],
166
  c=df_filtered['score'],
167
  cmap='Reds',
168
- s=0.3,
169
  alpha=0.8,
170
  transform=ccrs.PlateCarree(),
171
  label=label_text,
 
143
  ax.add_feature(cfeature.LAND, facecolor='lightgray', alpha=0.2)
144
  ax.add_feature(cfeature.COASTLINE, linewidth=0.5, alpha=0.5)
145
 
146
  # 2. Plot Search Results with color map
147
  label_text = f'Top {threshold * 1000:.0f}‰ Matches'
148
  sc = ax.scatter(
 
150
  df_filtered[lat_col],
151
  c=df_filtered['score'],
152
  cmap='Reds',
153
+ s=0.35,
154
  alpha=0.8,
155
  transform=ccrs.PlateCarree(),
156
  label=label_text,