import gradio as gr


def build_page():
    with gr.Column(elem_id="about-page-content-wrapper"):
        # --- Section 1: About OpenHands Index ---
        gr.HTML(
            """
            <h2>About OpenHands Index</h2>
            <p>
                OpenHands Index is a comprehensive leaderboard that tracks the performance of AI coding agents across multiple software engineering benchmarks. It provides a unified view of agent capabilities in areas like code generation, bug fixing, repository-level tasks, and complex reasoning challenges. The index makes it easy to compare agents' performance in an apples-to-apples manner across diverse evaluation scenarios.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 2: Why OpenHands Index? ---
        gr.HTML(
            """
            <h2>Why OpenHands Index?</h2>
            <p>
                Software engineering benchmarks are scattered across different platforms and evaluation frameworks, making it difficult to compare agent performance holistically. Agents may excel at one type of task but struggle with others, so understanding the true capabilities of coding agents requires comprehensive evaluation across multiple dimensions.
            </p>
            <br>
            <p>
                OpenHands Index fills this gap by providing a unified leaderboard that aggregates results from diverse software engineering benchmarks. It helps developers and researchers identify which agents best suit their needs, while providing standardized metrics for comparing agent performance across tasks like repository-level editing, multimodal bug fixing, and from-scratch library implementation.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 3: What Does OpenHands Index Include? ---
        gr.HTML(
            """
            <h2>What Does OpenHands Index Include?</h2>
            <p>
                OpenHands Index aggregates results from 6 key benchmarks for evaluating AI coding agents:
            </p>
            <ul class="info-list">
                <li><strong>SWE-bench</strong>: Repository-level bug fixing from real GitHub issues</li>
                <li><strong>Multi-SWE-bench</strong>: Issue resolution across multiple programming languages beyond Python</li>
                <li><strong>SWE-bench Multimodal</strong>: Bug fixing with visual context</li>
                <li><strong>SWT-bench</strong>: Generating tests that reproduce real-world bug reports</li>
                <li><strong>Commit0</strong>: Building Python libraries from scratch against specifications and unit tests</li>
                <li><strong>GAIA</strong>: General AI assistant tasks requiring reasoning and tool use</li>
            </ul>
            <p>
                Plus: comprehensive leaderboards showing performance across models, agents, and configurations.
            </p>
            <p>
                Learn more at <a href="https://github.com/OpenHands/OpenHands" target="_blank" class="primary-link-button">github.com/OpenHands/OpenHands</a>
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 4: Understanding the Leaderboards ---
        gr.HTML(
            """
            <h2>Understanding the Leaderboards</h2>
            <p>
                The OpenHands Index Overall Leaderboard provides a high-level view of agent performance and efficiency:
            </p>
            <ul class="info-list">
                <li><strong>Overall score</strong>: A macro-average across all benchmarks (equal weighting)</li>
                <li><strong>Overall cost</strong>: Average cost per task in USD, aggregated across benchmarks with reported cost</li>
            </ul>
            <p>
                Individual benchmark pages provide:
            </p>
            <ul class="info-list">
                <li>Detailed scores and metrics for that specific benchmark</li>
                <li>Cost breakdowns per agent</li>
                <li>Links to submission details and logs</li>
            </ul>
            """
        )
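
        # A minimal sketch (an illustration only, not the leaderboard's actual
        # aggregation code) of the "Overall score" described above: an
        # equal-weight macro-average, so every benchmark counts the same
        # regardless of how many tasks it contains.
        def overall_score(benchmark_scores: dict[str, float]) -> float:
            """Macro-average over per-benchmark scores, equal weight each."""
            return sum(benchmark_scores.values()) / len(benchmark_scores)

        # e.g. overall_score({"SWE-bench": 0.53, "GAIA": 0.41}) -> 0.47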
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 5: Scoring & Aggregation ---
        gr.HTML(
            """
            <h2>Scoring &amp; Aggregation</h2>
            <p>
                OpenHands Index provides transparent, standardized evaluation metrics:
            </p>
            <h3>Scores</h3>
            <ul class="info-list">
                <li>Each benchmark returns an average score based on per-task performance</li>
                <li>All scores are aggregated using macro-averaging (equal weight per benchmark)</li>
                <li>Metrics vary by benchmark (e.g., resolve rate, pass@1, accuracy)</li>
            </ul>
            <h3>Cost</h3>
            <ul class="info-list">
                <li>Costs are reported in USD per task</li>
                <li>Benchmarks without cost data are excluded from cost averages</li>
                <li>In scatter plots, agents without cost data are clearly marked</li>
            </ul>
            <p>
                <em>Note: Cost values reflect API pricing at evaluation time and may vary based on provider, infrastructure, and usage patterns.</em>
            </p>
            """
        )
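
        # A companion sketch (also an assumption, not the production code) of
        # the cost rule described above: benchmarks that did not report cost
        # are carried as None and simply excluded from the average.
        def overall_cost(cost_per_task: dict[str, float | None]) -> float | None:
            """Average USD cost per task over benchmarks that reported cost."""
            reported = [c for c in cost_per_task.values() if c is not None]
            return sum(reported) / len(reported) if reported else None

        # e.g. overall_cost({"SWE-bench": 1.20, "GAIA": None}) -> 1.20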
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 6: Submitting Results & Accessing Raw Data ---
        gr.HTML(
            """
            <h2>Submitting Results &amp; Accessing Raw Data</h2>
            <h3>How to Submit Your Agent Results</h3>
            <p>
                To submit your agent's evaluation results to the OpenHands Index:
            </p>
            <ol class="info-list">
                <li>Run your agent on the supported benchmarks (SWE-bench, Multi-SWE-bench, SWE-bench Multimodal, SWT-bench, Commit0, GAIA)</li>
                <li>Format your results according to the data structure documented in the repository</li>
                <li>Submit a pull request to <a href="https://github.com/OpenHands/openhands-index-results" target="_blank" class="primary-link-button">github.com/OpenHands/openhands-index-results</a></li>
                <li>Your submission should include:
                    <ul>
                        <li><code>metadata.json</code> with agent information, model used, and evaluation details</li>
                        <li><code>scores.json</code> with benchmark results and scores</li>
                    </ul>
                </li>
            </ol>
            <h3>Accessing Raw Results</h3>
            <p>
                All raw evaluation results displayed on this leaderboard are publicly available at:
            </p>
            <p>
                <a href="https://github.com/OpenHands/openhands-index-results" target="_blank" class="primary-link-button">github.com/OpenHands/openhands-index-results</a>
            </p>
            <p>
                The repository contains:
            </p>
            <ul class="info-list">
                <li>Complete metadata for each agent submission</li>
                <li>Detailed benchmark scores and metrics</li>
                <li>Evaluation dates and configurations</li>
                <li>Model and cost information</li>
            </ul>
            <p>
                You can clone the repository, analyze the data, or use it for your own research and comparisons.
            </p>
            """
        )
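
        # Hypothetical shapes for the two submission files named above. The
        # authoritative schema is documented in the openhands-index-results
        # repository; every field name here is a placeholder for illustration.
        _example_metadata = {  # contents of metadata.json (illustrative only)
            "agent_name": "my-agent",
            "model": "example-model-v1",
            "evaluation_date": "2025-01-15",
        }
        _example_scores = {  # contents of scores.json (illustrative only)
            "SWE-bench": {"score": 0.53, "cost_per_task_usd": 1.20},
            "GAIA": {"score": 0.41, "cost_per_task_usd": None},
        }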
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 7: Acknowledgements ---
        gr.HTML(
            """
            <h2>Acknowledgements</h2>
            <p>
                The OpenHands Index leaderboard interface and visualization components are adapted from the
                <a href="https://huggingface.co/spaces/allenai/asta-bench-leaderboard" target="_blank" class="primary-link-button">AstaBench Leaderboard</a>
                developed by the Allen Institute for AI. We thank the AstaBench team for their excellent work in creating
                a clear and effective leaderboard design that we have customized for the software engineering domain.
            </p>
            <p>
                Key aspects adapted from AstaBench include:
            </p>
            <ul class="info-list">
                <li>Macro-averaging methodology for computing overall scores from category-level averages</li>
                <li>Interactive data visualization and filtering components</li>
                <li>Leaderboard UI structure and styling</li>
            </ul>
            <p>
                We have extended and modified this foundation to support software engineering benchmarks and the
                specific requirements of evaluating AI coding agents.
            </p>
            """
        )
        gr.Markdown("---", elem_classes="divider-line")

        # --- Section 8: Citation ---
        gr.HTML(
            """
            <h2>Citation</h2>
            <p>
                If you use OpenHands or reference the OpenHands Index in your work, please cite:
            </p>
            <pre class="citation-block">
@misc{openhands2024,
  title={OpenHands: An Open Platform for AI Software Developers as Generalist Agents},
  author={OpenHands Team},
  year={2024},
  howpublished={https://github.com/OpenHands/OpenHands}
}</pre>
            """
        )
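

# A possible way to mount this page, shown as a sketch for local preview; the
# Space's real app wiring (tabs, theming, other pages) lives elsewhere, so
# treat this block as an assumption rather than the actual entry point.
if __name__ == "__main__":
    with gr.Blocks() as demo:
        build_page()
    demo.launch()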