Article

AutoGEEval: A Multimodal and Automated Evaluation Framework for Geospatial Code Generation on GEE with Large Language Models

1
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
2
Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan 430079, China
3
School of Resource and Environmental Sciences, Wuhan University, Wuhan 430079, China
4
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(7), 256; https://doi.org/10.3390/ijgi14070256
Submission received: 15 May 2025 / Revised: 27 June 2025 / Accepted: 29 June 2025 / Published: 30 June 2025

Abstract

Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline—from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs—including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models—revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.

1. Introduction

General-purpose code refers to a set of program instructions written in languages like Python, C++, or Java, applied to tasks such as data processing, network communication, and algorithm implementation [1,2]. Programming allows users to translate logical intentions into executable tasks on computers [3]. With the rise of Transformer-based large language models (LLMs), models like GPT-4o, DeepSeek, Claude, and LLaMA have shown exceptional performance in code generation, benefiting from exposure to large-scale training data and advanced contextual understanding and generative capabilities [4]. These models enable users to generate code from natural language instructions, lowering the programming barrier [5]. Building on this foundation, domain-specific code generation models—such as DeepSeek Coder [6], Qwen2.5-Coder [7], and Code LLaMA [8]—have further improved accuracy and robustness through targeted training. However, model-generated code often faces issues like syntax errors, incorrect function calls, and missing dependencies, which impact its executability and logical consistency. This issue, known as “code hallucination,” remains a challenge [9]. To quantify model performance and guide iterative improvement, researchers have developed benchmark suites such as HumanEval [10], MBPP [11], and LiveCodeBench [12], which enable automated evaluation based on execution success rates and related metrics [13,14,15].
With the rapid expansion of high-resolution remote sensing and spatiotemporal data, the demand for customized geospatial analysis tools in the geosciences has significantly increased. In response to this demand, cloud platforms such as Google Earth Engine (GEE) have emerged, offering geoscientific functionalities through JavaScript and Python interfaces [16,17]. Unlike traditional GIS tools that rely on graphical user interfaces, GEE utilizes concise coding to automate complex workflows, supporting tasks such as remote sensing preprocessing, index computation, and time-series change detection [18]. Its “copy-paste-run” sharing mechanism has facilitated the widespread adoption of geospatial analytical methods, signifying the specialization of general-purpose code in the geosciences and gradually shaping the modern paradigm of “geospatial code as analysis” [19,20].
However, writing code on the GEE platform requires not only basic programming skills but also solid knowledge of geospatial analysis. This includes familiarity with core objects and operators (such as ‘ee.Image’ and ‘ee.FeatureCollection’), remote sensing datasets (e.g., Landsat, MODIS), spatial information concepts (e.g., coordinate systems and geographic projections), and methods for processing and integrating multisource data. As a result, the learning curve for GEE programming is significantly steeper than for general-purpose coding, and users without a geospatial background often encounter substantial barriers in practice [21,22]. In this context, leveraging LLMs to generate GEE code has emerged as a promising approach to lowering the entry barrier and enhancing development efficiency [23]. It was not until October 2024, with the publication of two systematic evaluation papers, that “geospatial code generation” was formally proposed as an independent research task. These studies extended the general NL2Code paradigm into natural language to geospatial code (NL2GeospatialCode), providing a theoretical foundation for the field [24,25]. Since then, research in this area has advanced rapidly, with several optimization strategies emerging: the CoP strategy uses prompt chaining to guide task decomposition and generation [26]. Geo-FuB [27] and GEE-OPs [22] build functional semantic and function invocation knowledge bases, respectively, enhancing accuracy through retrieval-augmented generation (RAG); GeoCode-GPT, a Codellama variant fine-tuned on geoscientific code corpora, became the first LLM dedicated to this task [21]. However, due to limited training resources, geospatial code accounts for only a small fraction of pretraining data. As a result, models are more prone to “code hallucination” in geospatial code generation tasks than in general domains [24,25]. Typical issues include function invocation errors, object type confusion, missing filter conditions, loop structure errors, semantic band mapping errors, type mismatch errors, invalid type conversions, and missing required parameters. Figure 1 illustrates some of these error types, with additional error types provided in Appendix A. These issues severely compromise code executability and the reliability of analytical results. Therefore, establishing a systematic evaluation framework for geospatial code generation is essential. It not only helps to clarify the performance boundaries of current models in geospatial tasks but also provides theoretical and practical support for developing future high-performance, low-barrier geospatial code generation models [26].
At present, a few studies have begun exploring evaluation mechanisms for geospatial code generation tasks. Notable efforts include GeoCode-Bench [24] and GeoCode-Eval [21] proposed by Wuhan University, and the GeoSpatial-Code-LLMs dataset developed by Wrocław University of Science and Technology [25]. GeoCode-Bench primarily relies on multiple-choice, true/false, and open-ended questions, focusing mainly on the understanding of textual knowledge required for code construction. Code-related tasks rely on expert manual scoring, which increases evaluation costs, introduces subjectivity, and limits reproducibility. Similarly, GeoCode-Eval depends on human evaluation and emphasizes complex test cases, but lacks systematic testing of basic functions and common logical combinations, which hinders a fine-grained analysis of model capabilities. The GeoSpatial-Code-LLMs dataset attempts to introduce automated evaluation mechanisms, but currently supports only limited data types, excluding multimodal data such as imagery, vector, and raster formats, and contains only about 40 samples. There is an urgent need to develop an end-to-end, reproducible, and unit-level evaluation benchmark that supports automated assessment and encompasses diverse multimodal geospatial data types.
In response to the aforementioned needs and challenges, this study proposes AutoGEEval, an automated evaluation framework for GEE geospatial code generation tasks based on LLMs, as shown in Figure 2. The framework comprises three key components: the AutoGEEval-Bench test suite (Figure 2a, Section 2), the Submission Program (Figure 2b, Section 3.1), and the Judge Program (Figure 2c, Section 3.2). It supports multimodal data types and unit-level assessment, implemented via GEE’s Python API (earthengine-api, version 1.3.1). The Python interface, running locally in environments such as Jupyter Notebook (version 6.5.4) and PyCharm (version 2023.1.2), removes reliance on the GEE web editor, aligns better with real-world practices, and is more widely adopted than the JavaScript version. It also enables automated error detection and feedback by capturing console outputs and runtime exceptions, facilitating a complete evaluation workflow. In contrast, the JavaScript interface is tied to GEE’s online platform, which restricts automation.
The main contributions of this study are summarized as follows:
  • We design, implement, and open-source AutoGEEval, the first automated evaluation framework for geospatial code generation on GEE using LLMs. The framework supports end-to-end automation of test execution, result verification, and error type analysis across multimodal data types at the unit level.
  • We construct and release AutoGEEval-Bench, a geospatial code benchmark comprising 1325 unit-level test cases spanning 26 distinct GEE data types.
  • We conduct a comprehensive evaluation of 18 representative LLMs across four categories—including GPT-4o, DeepSeek-R1, Qwen2.5-Coder, and GeoCode-GPT—by measuring execution pass rates for geospatial code generation tasks. In addition, we analyze model accuracy, resource consumption, execution efficiency, and error type distributions, providing insights into current limitations and future optimization directions.
The remainder of this paper is structured as follows: Section 2 describes the construction of the AutoGEEval-Bench test suite. Section 3 outlines the AutoGEEval evaluation framework, detailing the design and implementation of the Submission and Judge Programs. Section 4 presents the experimental design, including the evaluated models, experimental setup, and evaluation metrics. Section 5 reports a systematic analysis of the results, and Section 6 discusses the experimental findings on geospatial code generation by large language models. Section 7 concludes by summarizing the contributions, identifying current limitations, and proposing directions for future research.

2. AutoGEEval-Bench

AutoGEEval-Bench (Figure 2a) is built from the official GEE function documentation and contains 1325 unit test cases, automatically generated via the LLM-based Self-Design framework and covering 26 GEE data types. This section details the test case definition, design approach, construction method, and final results.

2.1. Task Definition

Unit-level testing ($T_{unit}$) is designed to evaluate a model’s ability to understand the invocation semantics, parameter structure, and input–output specifications of each API function provided by the platform. The goal is to assess whether the model can generate a syntactically correct and semantically valid function call based on structured function information, such that the code executes successfully and produces the expected result. This task simulates one of the most common workflows for developers—“consulting documentation and writing function calls”—and serves as a capability check at the finest behavioral granularity. Each test case corresponds to a single, independent API function and requires the model to generate executable code that correctly invokes the function with appropriate inputs and yields the expected output.
Let F denote the set of functions provided in the public documentation of the Earth Engine platform.
F = \{ f_1, f_2, \ldots, f_N \}, \quad f_i \in \mathrm{GEE\_API}
The task of each model under evaluation is to generate a syntactically correct and executable code snippet $C_i$ within the Earth Engine Python environment.
T_{unit}: f_i \mapsto C_i
Define a code executor $\mathrm{Exec}(\cdot)$, where $y_i$ denotes the result object returned after executing the code snippet $C_i$.
\mathrm{Exec}(C_i) = y_i
Let $A_i$ denote the expected output (ground-truth answer). The evaluation metric is defined based on the comparison between $y_i$ and $A_i$, where the symbol “=” may represent strict equality, approximate equality for floating-point values, set containment, or other forms of semantic equivalence.
\mathrm{unit}(y_i, A_i) = \begin{cases} 0, & \text{if } \mathrm{Exec}(C_i) = A_i \\ 1, & \text{otherwise} \end{cases}
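To make the unit-level task concrete, the following minimal Python sketch mirrors the loop from $f_i$ to the comparison with $A_i$. It is an illustrative assumption rather than the released AutoGEEval code: the callable `generate_code`, the fixed function name `candidate_function`, and the strict equality check are hypothetical placeholders (the actual framework applies type-specific matching, as described in Section 3.2).

```python
import ee

ee.Initialize()  # assumes Earth Engine authentication is already configured

def evaluate_unit(function_header, parameters, expected_answer, generate_code):
    """Hypothetical unit-level check: one API function, one generated call.

    `generate_code` is assumed to wrap an LLM and return Python source that
    defines a function named `candidate_function` matching `function_header`.
    """
    source = generate_code(function_header)       # T_unit: f_i -> C_i
    namespace = {"ee": ee}
    exec(source, namespace)                       # define the generated function
    result = namespace["candidate_function"](**parameters)  # Exec(C_i) = y_i
    if hasattr(result, "getInfo"):
        result = result.getInfo()                 # resolve server-side objects
    return 0 if result == expected_answer else 1  # 0 = pass, per the indicator above
```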

2.2. Structural Design

All test cases are generated by the flagship LLM Qwen2.5-Max, developed by Alibaba, using predefined prompts and reference data, and subsequently verified by human experts (see Section 2.3 for details). Each complete test case consists of six components: the function header, reference code snippet (Reference_code), parameter list (Parameters_list), output type (Output_type), output path (Output_path), and the expected answer (Expected_answer). Let the set of unit test cases be denoted as
Q = \{ q_1, q_2, \ldots, q_n \}
Each test case $q_i$ is defined as a six-tuple:
q_i = \langle H_i, R_i, P_i, T_i, O_i, A_i \rangle
The meaning of each component is defined as follows (a hypothetical example is sketched after this list):
  • $H_i$ (FunctionHeader): Function declaration, including the ‘def’ statement, function name, parameter list, and a natural language description of the function’s purpose. It serves as the semantic prompt to guide the language model in generating the complete function body.
  • $R_i$ (ReferenceCode): Reference code snippet, representing the intended logic of the function. It is generated by Qwen2.5-Max based on a predefined prompt and is executed by human experts to obtain the standard answer. During the testing phase, this component is completely hidden from the model, which must independently produce a functionally equivalent implementation based solely on $H_i$.
  • $P_i$ (ParameterList): Parameter list, specifying the concrete values to be injected into the function during testing, thereby constructing a runnable execution environment.
  • $T_i$ (OutputType): Output type, indicating the expected data type returned by the function, used to enforce format constraints on the model’s output. Examples include numeric values, Boolean values, dictionaries, and layer objects.
  • $O_i$ (OutputPath): Output path, specifying where the execution result of the generated code will be stored. The testing system retrieves the model’s output from this path.
  • $A_i$ (ExpectedAnswer): Expected answer, the correct output obtained by executing the reference code with the given parameters. It serves as the ground-truth reference for evaluating the accuracy of the model’s output.
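As an illustration only, a unit test case following this six-tuple structure could be represented as a Python dictionary like the one below; the function, parameter values, and paths are hypothetical and are not drawn from AutoGEEval-Bench.

```python
# Hypothetical test case following the (H, R, P, T, O, A) structure;
# all values are illustrative, not taken from AutoGEEval-Bench.
test_case = {
    "Function_header": (
        "def array_length(values):\n"
        '    """Return the dimension lengths of an ee.Array built from values."""'
    ),
    "Reference_code": (
        "def array_length(values):\n"
        "    return ee.Array(values).length().getInfo()"
    ),
    "Parameters_list": {"values": [1, 2, 3, 4]},
    "Output_type": "ee.List",
    "Output_path": "./answers/array_length.json",
    "Expected_answer": [4],
}
```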

2.3. Construction Methodology

The unit test cases are constructed based on the official GEE reference documentation, specifically the Client Libraries section, available at https://developers.google.com/earth-engine/apidocs (accessed on 29 June 2025), which includes a total of 1374 functions. Each function page provides the full function name, a description of its functionality, usage examples, return type, parameter names and types, and parameter descriptions. Some pages include sample code demonstrating function usage, while others do not. Prior to constructing the test cases, we manually executed all functions to validate their operability. This process revealed that 49 functions were deprecated or non-functional due to version updates, and were thus excluded. The final set of valid functions incorporated into the unit test suite includes 1325 functions. We extracted relevant information from each function page and organized it into a JSON structure. A corresponding prompt template was then designed (see Figure 3) to guide the LLM in parsing the structured documentation and automatically generating unit-level test items.
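The structured records extracted from the documentation can be pictured as follows; the field names are an assumed schema shown only for illustration, not the exact JSON layout used in AutoGEEval.

```python
# Assumed structure of one extracted documentation record (illustrative only).
doc_record = {
    "function_name": "ee.Array.length",
    "description": "Returns the length of each dimension of the array.",
    "return_type": "ee.List",
    "parameters": [
        {"name": "array", "type": "ee.Array", "description": "The input array."}
    ],
    "example_code": "ee.Array([1, 2, 3]).length()",
}
```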
After initial generation, all test cases were manually verified by a panel of five experts with extensive experience in GEE usage and geospatial code development. The verification process ensured that each test task reflects a valid geospatial analysis need, has a clear and accurate problem definition, and is configured with appropriate test inputs. Any test case exhibiting execution errors or incomplete logic was revised and corrected by the experts based on domain knowledge. For test cases that execute successfully and produce the expected results, the output is stored at the specified ‘output_path’ and serves as the ground-truth answer for that item. During the testing phase, the Judge Program retrieves the reference result from this path and compares it against the model-generated output to compute consistency-based accuracy metrics.

2.4. Construction Results

The distribution and proportion of each data type in AutoGEEval-Bench are detailed in Table 1.
The 26 GEE data types covered in AutoGEEval-Bench can be broadly categorized into two groups. The first group consists of text-based formats, such as dictionaries, arrays, lists, strings, and floating-point numbers. The second group includes topology-based formats, such as geometries, imagery, and GeoJSON structures. Representative unit test cases from AutoGEEval-Bench are presented in this paper. Figure 4 showcases a typical test case involving text-based GEE data types using ‘ee.Array’, while Figure 5 illustrates a task related to topology-based data types using ‘ee.Image’. Additional test cases are provided in Appendix B.

3. Submission and Judge Programs

The AutoGEEval framework relies on two main components during evaluation: the Submission Program, which generates and executes code based on tasks in AutoGEEval-Bench, and the Judge Program, which compares the model’s output to the correct answers. This section outlines the workflow design of both programs.

3.1. Submission Program

The overall workflow of the Submission Program is illustrated in Figure 2b and consists of three main tasks: answer generation, execution, and result saving. In the answer generation stage, the system utilizes a prompt template to guide the target LLM to respond to each item in AutoGEEval-Bench sequentially. The model generates code based solely on the function header, from which it constructs the corresponding function body. During the execution stage, the execution module reads the parameter list and substitutes the specified values into the formal parameters of the generated code. The code is then executed within the Earth Engine environment. Finally, the execution result is saved to the specified location and file name, as defined by the output path. It is important to note that the prompt is carefully designed to instruct the model to output only the final answer, avoiding any extraneous or irrelevant content. The detailed prompt design is shown in Figure 6.
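A minimal sketch of this three-stage workflow is given below under stated assumptions: the model is reached through a generic `query_llm` callable, results are serialized as JSON, and the generated function is assumed to keep the name declared in the header. It is not the released Submission Program.

```python
import json
import ee

ee.Initialize()

PROMPT_TEMPLATE = (
    "Complete the following Google Earth Engine Python function. "
    "Return only the code, with no explanations.\n\n{header}"
)

def run_submission(test_case, query_llm):
    """Generate, execute, and save the answer for one AutoGEEval-Bench item."""
    # 1. Answer generation: the model sees only the function header.
    prompt = PROMPT_TEMPLATE.format(header=test_case["Function_header"])
    generated_code = query_llm(prompt)

    # 2. Execution: inject the parameter list into the generated function.
    namespace = {"ee": ee}
    exec(generated_code, namespace)
    func_name = test_case["Function_header"].split("(")[0].replace("def", "").strip()
    result = namespace[func_name](**test_case["Parameters_list"])
    if hasattr(result, "getInfo"):
        result = result.getInfo()  # pull server-side values back to the client

    # 3. Result saving: write to the path declared in the test case.
    with open(test_case["Output_path"], "w") as f:
        json.dump(result, f)
```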

3.2. Judge Program

The overall workflow of the Judge Program is illustrated in Figure 2c. Its primary function is to read the execution results from the specified ‘Output_path’, select the appropriate evaluation logic based on the declared ‘Output_type’, and compare the model’s output against the ‘Expected_answer’. The core challenge of the Judge Program lies in accurately assessing correctness across different output data types. As shown in Table 1, AutoGEEval-Bench defines 26 categories of GEE data types. However, many of these types share overlapping numerical representations. For example, although ‘ee.Array’, ‘ee.ConfusionMatrix’, and ‘ee.ArrayImage’ are different in type, they are all expressed as arrays in output. Similarly, ‘ee.Dictionary’, ‘ee.Blob’, and ‘ee.Reducer’ are represented as dictionary-like structures at runtime. Furthermore, ‘ee.Geometry’, ‘ee.Feature’, and ‘ee.FeatureCollection’ all serialize to the GeoJSON format, while both ‘ee.String’ and ‘ee.Boolean’ are represented as strings. Given these overlaps, the Judge Program performs unified categorization based on the actual value representation—such as arrays, dictionaries, GeoJSON, or strings—and applies corresponding matching strategies to ensure accurate and fair evaluation across diverse GEE data types. AutoGEEval summarizes the value representations and matching strategies for each GEE data type in Table 2.
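The value-representation-based matching described above can be sketched as follows. The grouping and tolerance are assumptions for illustration; in AutoGEEval the branch is selected from the declared ‘Output_type’ and the strategies in Table 2, whereas this sketch simply infers the representation from the expected answer.

```python
import json
import math

def _almost_equal(a, b, tol):
    """Recursive comparison with a numeric tolerance (illustrative)."""
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(_almost_equal(x, y, tol) for x, y in zip(a, b))
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b

def judge(output_path, expected_answer, tol=1e-6):
    """Hypothetical judge: dispatch on the value representation of the output."""
    with open(output_path) as f:
        result = json.load(f)

    # Array-like values (e.g., ee.Array, ee.ConfusionMatrix): element-wise comparison.
    if isinstance(expected_answer, list):
        return _almost_equal(result, expected_answer, tol)
    # Dictionary-like and GeoJSON values (e.g., ee.Dictionary, ee.Geometry): key-wise comparison.
    if isinstance(expected_answer, dict):
        return (isinstance(result, dict)
                and result.keys() == expected_answer.keys()
                and all(_almost_equal(result[k], expected_answer[k], tol)
                        for k in expected_answer))
    # Numbers: approximate equality; strings and Booleans: strict equality.
    if isinstance(expected_answer, (int, float)):
        return math.isclose(result, expected_answer, abs_tol=tol)
    return result == expected_answer
```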

4. Experiments

Beyond correctness checking, the framework supports the automated monitoring of resource usage and execution efficiency. In the experimental evaluation, we assessed various models, including general-purpose, reasoning-enhanced, code generation, and geospatial-specific models. This section covers the model selection, experimental setup, evaluation metrics, and runtime cost considerations.

4.1. Evaluated Models

The models evaluated in this study are selected from among the most advanced and widely adopted LLMs as of April 2025. All selected models have either undergone peer review or have been publicly released through open-source or open-access channels. The aim is to align with the growing user preference for end-to-end, easy-to-use models and to provide informative references for both practical application and academic research. It is important to note that optimization strategies such as prompt engineering, RAG, and agent-based orchestration are not included in this evaluation. These strategies do not alter the core model architecture, and their effectiveness is highly dependent on specific design choices, often resulting in unstable performance. Moreover, they are typically tailored for specific downstream tasks and were not originally intended for unit-level testing, making their inclusion in this benchmark neither targeted nor meaningful. Additionally, such strategies often involve complex prompts that consume a large number of tokens, thereby compromising the fairness and efficiency of the evaluation process.
The evaluated models span four categories: (1) general-purpose non-reasoning LLMs, (2) general-purpose reasoning-enhanced LLMs, (3) general-domain code generation models, and (4) task-specific code generation models tailored for geospatial applications. For some models, multiple publicly available parameter configurations are evaluated. Counting different parameter versions as independent models, a total of 18 models are assessed. Detailed specifications of the evaluated models are provided in Table 3.

4.2. Experimental Setup

In terms of hardware configuration and parameter settings, a local computing device equipped with 32 GB RAM and an RTX 4090 GPU was used. During model inference, open-source models with parameter sizes not exceeding 16 B were deployed locally using the Ollama tool; for larger open-source models and proprietary models, inference was conducted via their official API interfaces to access cloud-hosted versions.
For parameter settings, the generation temperature was set to 0.2 for non-reasoning models to enhance the determinism and stability of outputs. For reasoning-enhanced models, following existing research practices, no temperature was specified, preserving the models’ native inference capabilities. In addition, the maximum output token length for all models was uniformly set to 4096 to ensure complete responses and prevent truncation due to excessive length.
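As an illustrative assumption of how these settings translate into an inference call, the snippet below queries a locally hosted model through Ollama's OpenAI-compatible endpoint; the endpoint address, model tag, and prompt are examples rather than the exact configuration used in the experiments.

```python
from openai import OpenAI

# Example only: Ollama exposes an OpenAI-compatible API at this default address.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

prompt = "Complete the following Google Earth Engine Python function: ..."  # placeholder

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",   # example local model tag
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,            # non-reasoning models: deterministic, stable outputs
    max_tokens=4096,            # uniform cap to avoid truncation
)
generated_code = response.choices[0].message.content
```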
Time consumption and task descriptions for each phase are provided in Table 4.

4.3. Evaluation Metrics

This study evaluates the performance of LLMs in geospatial code generation tasks along four dimensions: accuracy metrics, image metrics, resource consumption metrics, and operational efficiency metrics, supplemented by a unified ranking scheme and error type logs.

4.3.1. Accuracy Metrics

This study adopts pass@n as the primary accuracy metric [38]. It measures the probability that a correct answer is generated at least once within n independent attempts for the same test case. This is a widely used standard for evaluating both the correctness and stability of model outputs. Given the known hallucination issue in LLMs—where inconsistent or unreliable results may be produced for identical inputs—a single generation may not be representative. Therefore, we evaluate the models under three configurations, n = 1, 3, 5, to enhance the robustness and credibility of the assessment.
\mathrm{pass}@n = 1 - \frac{C_n}{N}
where $N$ is the total number of generated samples and $C_n$ is the number of incorrect samples among them.
In addition, we introduce the coefficient of variation (CV) to assess the stability of the pass@1, pass@3, and pass@5 scores. This metric helps to evaluate the variability in model performance across multiple generations, serving as an indirect indicator of the severity of hallucination.
CV = \frac{\sigma}{\mu}
where $\sigma$ is the standard deviation and $\mu$ is the mean. A smaller $CV$ indicates higher stability in model performance.
To more comprehensively evaluate model behavior, we further introduce the stability-adjusted accuracy (SA), which integrates both accuracy and stability into a single metric. Specifically, a higher pass@5 score (accuracy) and a lower CV score (stability) result in a higher SA score. The calculation is defined as
SA = \frac{\mathrm{pass}@5}{1 + CV}
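A small sketch of how these three quantities might be computed from per-case attempt records follows; the attempt data are hypothetical, and the per-case "at least once in n attempts" reading of pass@n is an interpretation of the description above rather than the framework's exact implementation.

```python
import statistics

def pass_at_n(outcomes, n):
    """Fraction of test cases with at least one correct answer in the first n attempts.

    `outcomes` is a list of per-case attempt records, e.g. [[True, False, ...], ...]
    (hypothetical data; an interpretation of the pass@n definition above).
    """
    failed = sum(1 for attempts in outcomes if not any(attempts[:n]))
    return 1 - failed / len(outcomes)

# Hypothetical records for three test cases, five generation attempts each.
outcomes = [
    [True, True, False, True, True],
    [False, False, True, False, True],
    [False, False, False, False, False],
]

scores = [pass_at_n(outcomes, n) for n in (1, 3, 5)]
cv = statistics.pstdev(scores) / statistics.mean(scores)  # CV = sigma / mu
sa = scores[-1] / (1 + cv)                                # SA = pass@5 / (1 + CV)
print(scores, round(cv, 3), round(sa, 3))
```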

4.3.2. Image Metrics

In GEE, ‘ee.Image’, ‘ee.ImageCollection’, ‘ee.Geometry’, and ‘ee.FeatureCollection’ are object types whose outputs are evaluated here as raster data on a per-pixel basis. For these data types, the outputs of model-generated code are compared against ground-truth reference images using pixel-wise comparison. This study employs three quantitative metrics—mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM)—to assess image similarity.
MAE [39] measures the average absolute difference between corresponding pixel values in the generated and reference images, providing a direct indication of pixel-level reconstruction accuracy.
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| I_{pred}(i) - I_{true}(i) \right|
Here, $I_{pred}(i)$ and $I_{true}(i)$ denote the pixel values at location $i$ in the predicted image and the ground-truth reference image, respectively, and $N$ represents the total number of pixels in the image.
PSNR [40] quantifies the level of distortion between two images and is commonly used to assess the quality of image compression or reconstruction. A higher PSNR value indicates that the predicted image is closer in quality to the original reference. The PSNR is computed as follows:
PSNR = 10 \log_{10} \left( \frac{R^2}{MSE} \right)
Here, $R$ denotes the maximum possible pixel value in the image, typically 255 for 8-bit images. $MSE$ refers to the mean squared error, defined as follows:
MSE = \frac{1}{N} \sum_{i=1}^{N} \left( I_{pred}(i) - I_{true}(i) \right)^2
SSIM [41] evaluates image quality by comparing luminance, contrast, and structural information between images, providing a perceptual similarity measure that aligns more closely with human visual perception. The SSIM is computed as follows:
SSIM(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
Here, $\mu_x$ and $\mu_y$ denote the mean pixel values of images $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ represent the variances; $\sigma_{xy}$ is the covariance between $x$ and $y$. Constants $C_1$ and $C_2$ are used to stabilize the division and prevent numerical instability.
A test case is considered passed only if all three criteria are simultaneously satisfied: MAE ≤ 0.01, PSNR ≥ 50, and SSIM ≥ 0.99.
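The three metrics and the joint pass criterion can be reproduced with NumPy as in the sketch below; it is a plain re-implementation of the formulas above (using a single global SSIM window rather than the sliding-window variant), not the framework's own code, and the 8-bit data range is an assumption.

```python
import numpy as np

def image_test_passed(pred, true, data_range=255.0):
    """Pass only if MAE <= 0.01, PSNR >= 50, and SSIM >= 0.99 simultaneously."""
    pred = np.asarray(pred, dtype=np.float64)
    true = np.asarray(true, dtype=np.float64)

    mae = np.mean(np.abs(pred - true))
    mse = np.mean((pred - true) ** 2)
    psnr = np.inf if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

    # Global (single-window) SSIM with the usual stabilizing constants.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = pred.mean(), true.mean()
    var_x, var_y = pred.var(), true.var()
    cov_xy = ((pred - mu_x) * (true - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return mae <= 0.01 and psnr >= 50 and ssim >= 0.99
```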

4.3.3. Resource Consumption Metrics

Resource consumption metrics measure the computational resources and cost required for a model to complete the testing tasks. This study considers three key metrics:
  • Token Consumption (Tok.): Refers to the average number of tokens required to complete each unit test case. For locally deployed models, this metric reflects hardware resource usage; for commercial models, token consumption directly correlates with monetary cost. Most mainstream APIs charge based on the number of tokens processed (typically per 1 million tokens), and pricing varies significantly across models. As of April 2025, GPT-4 Turbo is priced at USD 10.00/1M tokens, Claude 3 Opus at USD 15.00/1M tokens, DeepSeek-Coder at USD 0.60/1M tokens, and Qwen2-72B at USD 0.80/1M tokens. Therefore, token usage is a critical indicator of both inference cost and model accessibility.
  • Inference Time (In.T): Refers to the average response time (in seconds) required by the model to generate each test case. This metric reflects latency and response efficiency, both of which directly impact user experience.
  • Code Lines (Co.L): Measures the number of core executable lines of code generated by the model, excluding comments, natural language explanations, and auxiliary prompts. Compared to token count, code line count provides a more accurate assessment of the model’s actual code generation capability, filtering out token inflation caused by unnecessary text in the reasoning process.

4.3.4. Operational Efficiency Metrics

Operational efficiency metrics are used to assess a model’s accuracy per unit of resource consumption, thereby reflecting its cost-effectiveness. This study defines inference efficiency, token efficiency, and code line efficiency based on three resource dimensions: time, token usage, and code structure. It is important to note that, to ensure comparability and fairness across models in terms of generation attempts and to reduce the variance caused by random sampling, all resource consumption metrics reported in this study are averaged over five generations. Therefore, pass@5 is uniformly adopted as the reference accuracy metric in all efficiency calculations.
  • Inference Efficiency (In.T-E): Inference efficiency refers to the average accuracy achieved by a model per unit time, calculated as the ratio of accuracy to average inference time (in seconds). This metric evaluates the model’s ability to balance response speed and output quality. The shorter the inference time, the higher the accuracy achieved per unit time, indicating a more efficient utilization of computational resources and better interactive performance.
\mathrm{Inference\ Efficiency} = \frac{\mathrm{pass}@5}{\mathrm{Inference\ Time}}
  • Token Efficiency (Tok.-E): Token efficiency measures the accuracy achieved per unit of token consumption, calculated as the ratio of accuracy to the average number of tokens used. This metric reflects the economic efficiency of the generation process and effectively supports cross-model comparisons in terms of cost–performance.
\mathrm{Token\ Efficiency} = \frac{\mathrm{pass}@5}{\mathrm{Token\ Consumption}}
  • Code Line Efficiency (Co.L-E): Code line efficiency refers to the accuracy achieved per line of core executable code, emphasizing the structural compactness and effectiveness of the generated logic. Unlike tokens, code lines exclude natural language explanations and prompt-related content, offering a more direct reflection of the model’s ability to produce high-quality, executable code for geospatial tasks. This metric is of particular value to developers, especially when evaluating code generation efficiency in practical engineering deployments.
\mathrm{Code\ Line\ Efficiency} = \frac{\mathrm{pass}@5}{\mathrm{Code\ Lines}}

4.3.5. Rank

To facilitate systematic evaluation across multiple dimensions, we introduce a unified ranking scheme for all key performance indicators and apply it consistently in the subsequent result analysis.
For accuracy metrics, we adopt the pass@n metric (n = 1, 3, 5) as the primary indicator of model accuracy. The corresponding ranking is denoted as P_Rank, with higher scores indicating better performance and thus a higher rank. We further compute the coefficient of variation (CV) across pass@1, pass@3, and pass@5 to capture the stability of accuracy. The ranking based on CV is denoted as C_Rank, where lower CV values correspond to higher rankings. For the stability-adjusted accuracy metric (SA), the ranking is denoted as S_Rank, and a higher SA value leads to a higher rank. These three rankings are used in Table 6 (Section 5.1) and Table 9 (Section 5.3) to support comparative analysis.
For operational efficiency metrics, we independently rank three dimensions: token efficiency (Tok.-E), inference time efficiency (In.T-E), and code line efficiency (Co.L-E), denoted as T_Rank, I_Rank, and Co_Rank, respectively. The overall efficiency rank, E_Rank, is derived by computing the average of these three ranks and ranking the result. These efficiency-related rankings are also presented in Table 9 (Section 5.3).
Finally, to evaluate overall model performance, we define a composite ranking metric, Total_Rank, which integrates accuracy, efficiency, and stability. It is calculated by averaging the rankings of P_Rank, S_Rank, and E_Rank, followed by ranking the averaged score. This comprehensive ranking is used in Table 9 (Section 5.3) to compare models across all performance dimensions.
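For illustration, the composite Total_Rank can be computed as in the following sketch; the model names and rank values are hypothetical placeholders, not results from Table 9.

```python
import pandas as pd

# Hypothetical per-model ranks on the three aggregated dimensions (lower = better).
df = pd.DataFrame({
    "model": ["Model A", "Model B", "Model C"],
    "P_Rank": [1, 9, 7],
    "S_Rank": [2, 8, 9],
    "E_Rank": [1, 5, 12],
})

# Total_Rank: average the three ranks, then rank the averaged score.
df["Total_Rank"] = (
    df[["P_Rank", "S_Rank", "E_Rank"]].mean(axis=1).rank(method="min").astype(int)
)
print(df)
```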

4.3.6. Error Type

To support qualitative analysis of model performance, AutoGEEval incorporates an automated error detection mechanism based on GEE runtime errors, designed to record the types of errors present in generated code. These error categories are as follows (a classification sketch is given after the list):
  • Syntax Errors: These refer to issues in the syntactic structure of the code that prevent successful compilation, such as missing parentheses, misspellings, or missing module imports. Such errors are typically flagged in the GEE console as ‘SyntaxError’.
  • Parameter Errors: These occur when the code is syntactically correct but fails to execute due to incorrect or missing parameters. Parameters often involve references to built-in datasets, band names, or other domain-specific knowledge in geosciences. Common error messages include phrases like “xxx has no attribute xx”, “xxx not found”, or prompts indicating missing required arguments. These errors often arise during parameter concatenation or variable assignment.
  • Invalid Answers: These refer to cases where the code executes successfully, but the output is inconsistent with the expected answer or the returned data type does not match the predefined specification.
  • Runtime Errors: Timeouts often result from infinite loops or large datasets, causing the computation to exceed 180 s and be terminated by the testing framework. These errors are usually due to logical flaws, such as incorrect conditionals or abnormal loops. On the GEE platform, they are displayed as “timeout 180 s”.
  • Network Errors: These occur when the GEE system returns an Internal Server Error, persisting after three retries under stable network conditions. Such errors are caused by Google server rate limits or backend timeouts, not by model code syntax or logic. On the GEE platform, these are displayed as HTTP 500 errors, while client errors are shown as HTTP 400 codes.
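A rule-based classifier over the captured console output could look like the sketch below; the matched substrings follow the descriptions above and are assumptions about the exact messages emitted by GEE, and the default bucket for unmatched execution failures is likewise an assumption.

```python
def classify_error(executed_ok, answer_matches, console_output=""):
    """Assign one error category to a test run (illustrative rules only)."""
    if executed_ok:
        return None if answer_matches else "Invalid Answer"
    msg = console_output.lower()
    if "syntaxerror" in msg:
        return "Syntax Error"
    if "timeout 180" in msg or "timed out" in msg:
        return "Runtime Error"
    if "internal server error" in msg or "http 500" in msg:
        return "Network Error"
    if "has no attribute" in msg or "not found" in msg or "missing required" in msg:
        return "Parameter Error"
    return "Parameter Error"  # assumed default bucket for other execution failures
```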

5. Results

Building on the evaluation metrics outlined in Section 4.3, this section presents a systematic analysis of the evaluation results based on the AutoGEEval framework and the AutoGEEval-Bench test suite. The analysis focuses on four key dimensions: accuracy metrics, resource consumption metrics, operational efficiency metrics, and error type logs.

5.1. Accuracy

The evaluation results for accuracy-related metrics across all models are presented in Table 5.
The stacked bar chart of execution accuracy across models is shown in Figure 7. As observed, increasing the number of generation attempts generally improves accuracy, indicating that multiple generations can partially mitigate hallucination in model-generated code. However, a visual analysis reveals that although both pass@3 and pass@5 increase the number of generations by two rounds compared to the previous level, the green segment (representing the improvement from pass@3 to pass@5) is noticeably shorter than the orange segment (representing the improvement from pass@1 to pass@3). This suggests a significant diminishing return in accuracy gains with additional generations. Quantitative analysis results are presented in Figure 8. The average improvement in pass@3 is 12.88%, ranging from 4.38% to 21.37%. In contrast, the average improvement in pass@5 is only 3.81%, with a range of 1.24% to 6.93%. This pattern highlights a clear diminishing marginal effect in improving accuracy through additional generations. It suggests that while early rounds of generation can substantially correct errors and enhance accuracy, the potential for improvement gradually tapers off in later rounds, reducing the value of further sampling. Therefore, future research should focus on enhancing performance during the initial generation rounds, rather than relying on incremental gains from additional sampling, in order to improve generation efficiency and accuracy more effectively.
The bubble chart displaying the pass@n scores and relative rankings of all models is shown in Figure 9. Several key observations can be made:
  • Model performance ranking: The dark blue bubbles, representing general-purpose non-reasoning models, generally occupy higher ranks, outperforming the red general-purpose reasoning models and pink general-purpose code generation models. The light blue bubble representing the geospatial code generation model GeoCode-GPT is positioned in the upper-middle tier, with an average rank of 7.33 among the 18 evaluated models.
  • Performance variation within the DeepSeek family: DeepSeek-V3 (average rank 1.33), DeepSeek-V3-0324 (average rank 3.67), and DeepSeek-R1 (average rank 4.00) all rank among the top-performing models, demonstrating strong performance. However, DeepSeek-Coder-V2 performs poorly, ranking last (average rank 18.00), indicating that it lacks sufficient capability for GEE code generation tasks.
  • Inconsistent performance across model versions: Surprisingly, DeepSeek-V3-0324, an optimized version of DeepSeek-V3, performs worse in GEE code generation, suggesting that later updates may not have specifically targeted improvements in this domain, potentially leading to performance degradation.
  • Performance of different parameter versions within the same model: Significant differences are observed across parameter configurations of the same model. For instance, Qwen-2.5-Coder-32B (average rank 8.33) outperforms its 7B (rank 14.00) and 3B (rank 15.67) variants. Similarly, within the Qwen-2.5 family, the 32B version (rank 12.33) ranks notably higher than the 7B (rank 15.33) and 3B (rank 17.00) versions. In addition, GPT-4o (rank 9.33) also outperforms GPT-4o-mini (rank 12.00).
  • Performance gain of GeoCode-GPT-7B: GeoCode-GPT-7B (average rank 7.33) outperforms its base model Code-Llama-7B (rank 9.50), indicating effective fine-tuning for GEE code generation tasks. However, the improvement is modest, possibly due to GeoCode-GPT’s training covering a broad range of geospatial code types (e.g., ARCPY, GDAL), thus diluting its specialization in the GEE-specific domain.
  • Category-wise performance analysis: Among the categories, the best-performing general-purpose non-reasoning LLM is DeepSeek-V3 (rank 1.33), the top general-purpose reasoning model is DeepSeek-R1 (rank 4.00), and the best general-purpose code generation model is Qwen-2.5-Coder-32B (rank 8.33).
  • Underwhelming performance of the GPT series: The GPT series shows relatively weak performance. Specifically, GPT-4o (rank 9.33) and GPT-4o-mini (rank 12.00) are both outperformed by models from the DeepSeek, Claude, and Gemini families, as well as by GeoCode-GPT-7B. Even the GPT-series reasoning model o3-mini only marginally surpasses GeoCode-GPT-7B by less than one rank.
Figure 9. LLM pass@n ranking bubble chart. The x-axis represents the pass@1 scores, the y-axis represents the pass@3 scores, and the size of the bubbles corresponds to the pass@5 scores. Different colors represent different LLM types, as shown in the legend. The bold and underlined numbers beside the model names indicate the average ranking of the model under the pass@1, pass@3, and pass@5 metrics. The bold and underlined numbers in red represent the highest-ranking model within each LLM category.
To assess the stability of accuracy for the evaluated LLMs, we performed metric slicing, and summarize the results in Table 6. Models with green shading indicate that both P_Rank and C_Rank are higher than S_Rank, suggesting that these models exhibit strong stability, with high overall rankings and robust consistency. Examples include DeepSeek-V3 and DeepSeek-V3-0324. Models with orange shading indicate that P_Rank is lower than both S_Rank and C_Rank. Although these models achieve high P_Rank, their poor stability leads to lower S_Rank scores. Typical examples include Gemini-2.0-pro, DeepSeek-R1, o3-mini, and QwQ-32B. Most of these are reasoning models, reflecting that poor stability is one of the current performance bottlenecks for reasoning-oriented LLMs. Models with blue shading indicate that P_Rank is higher than both S_Rank and C_Rank. Although P_Rank is not particularly high, these models demonstrate good stability and achieve relatively better rankings, making them more robust in scenarios where stability is crucial. Representative models include Claude3.7-Sonnet, Qwen2.5-Coder-32B, GPT-4o, GPT-4o-mini, and Qwen-2.5-7B.

5.2. Resource Consumption

The evaluation results for resource consumption are presented in Table 7. This study provides visual analyses of token consumption, inference time, and the number of core generated code lines.
The bar chart of the average token consumption for GEE code generation across all LLMs is shown in Figure 10. The results show that the general non-reasoning, general code generation, and geospatial code generation model categories exhibit relatively similar levels of token consumption, while the general reasoning models consume significantly more tokens—approximately 6 to 7 times higher on average than the other three categories. This finding provides a useful reference for users in estimating token-based billing costs when selecting a model. It suggests that, for the same GEE code generation task, general reasoning models will incur 6 to 7 times the cost compared to general non-reasoning, general code generation, and geospatial code generation models.
The lollipop chart of inference time consumption for GEE code generation across LLMs is shown in Figure 11. In terms of inference methods, models using the API call approach (circles) exhibit longer inference times compared to those using local deployment (squares). This may be due to network latency and limitations in the computing resources of remote servers. From a model category perspective, general reasoning models (orange) generally require more inference time than other types. However, o3-mini is an exception—its inference latency is even lower than that of the locally deployed DeepSeek-Coder-V2, indicating that its server-side computational resources may have been optimized accordingly. In addition, the average inference time per unit test case for DeepSeek-R1 and QwQ-32B reaches as high as 78.3 s and 44.68 s, respectively—2 to 40 times longer than other models—indicating that these two models are in urgent need of targeted optimization for inference latency.
The token consumption metric reflects not only the length of the generated code but also includes the model’s reasoning output and the length of the prompt template, thereby representing more of the reasoning cost than the actual size of the generated code itself. To more accurately measure the structural length of the model’s output code, we excluded the influence of prompt- and reasoning-related content and used the total number of generated lines of code (including both comments and executable lines) as the evaluation metric. The results are shown in Figure 12. As observed, GeoCode-GPT-7B (average: 11.79 lines), DeepSeek-Coder-V2 (10.06), Qwen2.5-Coder-3B (9.11), and Claude3.7-Sonnet (8.98) rank among the highest in terms of code length. This may be attributed to excessive generated comments or more standardized code structures that automatically include formal comment templates, thereby increasing the overall line count. Additionally, a noteworthy phenomenon is observed within the Qwen2.5-Coder family: models with larger parameter sizes tend to generate shorter code. For example, the Qwen2.5-Coder-32B model has an average code length of 5.79 lines, which is significantly shorter than its 7B (7.06) and 3B (9.11) versions. This result contradicts conventional expectations and may suggest that larger models possess stronger capabilities in code compression and refinement, or that their output formatting is subject to stricter constraints and optimizations during training.

5.3. Operational Efficiency

The operational efficiency results for each model are presented in Table 8.
According to the results shown in Table 9, DeepSeek-V3, Gemini-2.0-pro, and DeepSeek-V3-0324 consistently rank at the top across all three dimensions and demonstrate excellent overall performance. All three are commercial models, making them suitable for API-based deployment. In contrast, models such as Code-Llama-7B, Qwen2.5-Coder-32B, and GPT-4o do not rank as highly in terms of P_Rank and S_Rank, but their strong performance in E_Rank makes them well-suited for local deployment (the first two) or for scenarios requiring high generation efficiency (GPT-4o). By comparison, although models like DeepSeek-R1, GeoCode-GPT-7B, o3-mini, and Claude3.7-Sonnet perform well in terms of accuracy and stability, their low E_Rank scores lead to less favorable overall rankings, indicating a need to improve generation efficiency in order to optimize their total performance.

5.4. Error Type Logs

The types of errors encountered by each model during GEE code generation are summarized in Table 10, revealing an overall consistent error pattern across models. Parameter errors occur at a significantly higher rate than invalid answers, while syntax errors, runtime errors, and network errors appear only sporadically and at extremely low frequencies. This suggests that the core challenge currently faced by models in GEE-based geospatial code generation lies in the lack of domain-specific parameter knowledge, including references to platform-integrated datasets, band names, coordinate formats, and other geoscientific details. As such, there is an urgent need to augment training data with domain-relevant knowledge specific to the GEE platform and to implement targeted fine-tuning. Meanwhile, the models have demonstrated strong stability in terms of basic syntax, code structure, and loop control, with related errors being extremely rare. This indicates that their foundational programming capabilities are largely mature. Therefore, future optimization efforts should shift toward enhancing domain knowledge rather than further reinforcing general coding skills.

6. Discussion

Using the AutoGEEval framework, this study evaluates 18 large language models (LLMs) in GEE code generation across four key dimensions: accuracy, resource consumption, operational efficiency, and error types. This section summarizes the findings and explores future research directions for LLMs in geospatial code generation.
The results show that multi-round generation mitigates hallucinations and improves stability, but the marginal gains diminish with more rounds, particularly from pass@3 to pass@5. This highlights the need to prioritize early-stage generation quality over additional iterations. Future work should focus on enhancing initial code generation accuracy to prevent error propagation. A promising approach is incorporating reinforcement learning-based adaptive mechanisms with early feedback to optimize early outputs and reduce reliance on post-correction. Cross-round information sharing may also enhance stability.
Our study shows that general-purpose reasoning models consume significantly more resources, leading to higher computational costs and slower response times, averaging 2 to 40 times longer than non-reasoning models. Despite this, their performance does not exceed, and in some cases is inferior to, non-reasoning models, resulting in low cost-efficiency. Future research on geospatial code generation with reasoning models should focus on integrating model compression and optimization techniques, such as quantization, distillation, and hardware acceleration (e.g., GPU/TPU), to improve inference speed and efficiency, particularly in resource-limited edge computing environments.
Our analysis reveals that parameter errors are the most common, with syntax and network errors being relatively rare. This indicates that most models have achieved maturity in basic syntax and code execution. However, the lack of domain-specific knowledge required by the GEE platform (e.g., dataset paths, band names, coordinate formats) remains a key limitation, highlighting the need for domain-specific fine-tuning to improve model performance in geospatial tasks.
We observed that models from the same company can show significant performance variability. For example, while OpenAI’s GPT series consistently maintains stability, DeepSeek models vary widely—DeepSeek-V3 excels in accuracy and stability, while DeepSeek-Coder-V2 ranks lowest. This underscores the importance of data-driven model selection, emphasizing that model choice should rely on rigorous testing and comparative analysis, not brand reputation. For model selection, the overall ranking indicator (Total_Rank), which combines accuracy (P_Rank), stability (S_Rank), and efficiency (E_Rank), is recommended. Models such as DeepSeek-V3, offering high accuracy and efficiency, are well-suited for high-performance, high-frequency API deployment. In contrast, models like Claude3.7-Sonnet, with a focus on accuracy and stability, are better suited for scientific and engineering tasks requiring consistent outputs.
Model size alone does not determine performance. For example, Qwen2.5-Coder-32B excels in accuracy and efficiency compared to its 7B and 3B counterparts but underperforms in code simplicity and stability. This indicates that, for specific tasks, fine-tuning and output formatting are more crucial than model size. Future research should focus on task-specific adaptation and fine-tuning, integrating model size, task alignment, and output formatting to optimize efficiency.

7. Conclusions

This study presents AutoGEEval, the first automated evaluation framework designed for geospatial code generation tasks on the GEE platform. Implemented via the Python API, the framework supports unit-level, multimodal, and end-to-end evaluation across 26 GEE data types. It consists of three core components: the constructed benchmark (AutoGEEval-Bench) with 1325 unit test cases; the Submission Program, which guides LLMs to generate executable code via prompts; and the Judge Program, which automatically verifies output correctness, resource consumption, and error types. Using this framework, we conducted a comprehensive evaluation of 18 representative LLMs, spanning general-purpose, reasoning-enhanced, code generation, and geospatial-specialized models. The results reveal performance gaps, code hallucination phenomena, and trade-offs in code quality, offering valuable insights for the optimization of future geospatial code generation technologies.

7.1. Significance and Contributions

This study is the first to establish a dedicated evaluation system for LLMs in geospatial code generation tasks, addressing key gaps in current tools that lack geospatial coverage, granularity, and automation. Through the proposed AutoGEEval framework, we achieved the systematic evaluation of multimodal GEE data types and API function call capabilities, advancing the automated transformation from natural language to geospatial code. Compared to existing methods that rely heavily on manual scoring, AutoGEEval offers high automation, standardization, and reproducibility, substantially reducing evaluation costs and improving efficiency. The framework supports comprehensive tracking and quantitative analysis of code correctness, inference efficiency, resource consumption, and error types, providing a clear indicator system and real-world entry points for model refinement. Moreover, the constructed benchmark AutoGEEval-Bench, covering 1325 test cases and 26 GEE data types, is both scalable and representative, serving as a valuable public resource for future research on intelligent geospatial code generation. Overall, this work advances the transformation of geospatial code generation from an engineering tool into a quantifiable scientific problem, and provides a methodological reference and practical blueprint for interdisciplinary AI model evaluation paradigms.

7.2. Limitations and Future Work

Despite the representativeness of the proposed unit-level evaluation framework, several limitations remain, and future work can explore multiple directions for further enhancement. Currently, the evaluation tasks focus on single-function unit tests, and although 1325 use cases are included, coverage remains limited. Future expansions could include a broader test set, especially under boundary conditions and abnormal inputs, to evaluate model robustness under extreme scenarios. Additionally, introducing function composition and cross-API test cases will allow for the assessment of model capabilities in handling complex logical structures. The current 26 GEE data types could also be expanded using modality-based classification strategies to achieve a more balanced and comprehensive benchmark. In terms of evaluation metrics, the current system primarily centers on execution correctness. Future extensions could incorporate multi-dimensional evaluation criteria, including code structural complexity, runtime efficiency, and resource usage. Given the continual evolution of LLMs, a valuable next step would be to build an open, continuous evaluation platform that includes economic cost dimensions and releases “cost-effectiveness leaderboards”, thereby driving community development and enhancing the visibility and influence of geospatial code generation research.

Author Contributions

Conceptualization, Huayi Wu and Shuyang Hou; methodology, Huayi Wu and Shuyang Hou; software, Zhangxiao Shen, Haoyue Jiao and Shuyang Hou; validation, Zhangxiao Shen, Jianyuan Liang and Yaxian Qing; formal analysis, Shuyang Hou and Zhangxiao Shen; investigation, Xu Li, Xiaopu Zhang and Shuyang Hou; resources, Jianyuan Liang, Huayi Wu, Zhipeng Gui, Xuefeng Guan and Longgang Xiang; data curation, Jianyuan Liang and Shuyang Hou; writing—original draft preparation, Shuyang Hou and Zhangxiao Shen; writing—review and editing, Shuyang Hou, Zhangxiao Shen and Huayi Wu; visualization, Shuyang Hou; supervision, Huayi Wu, Zhipeng Gui, Xuefeng Guan and Longgang Xiang; project administration, Shuyang Hou; funding acquisition, Zhipeng Gui. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 41971349. The APC was funded by the same source.

Data Availability Statement

The experimental data used in this study can be downloaded from https://github.com/szx-0633/AutoGEEval (accessed on 29 June 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The error types that may occur when large language models generate GEE platform code are shown in Figure A1.
Figure A1. Common error types in geospatial code generation with LLMs.

Appendix B

Figure A2 and Figure A3 present representative test cases from AutoGEEval-Bench, illustrating tasks involving text-based and topology-based GEE data types, respectively.
Figure A2. Unit test example with text-based type.
Figure A3. Unit test example with topology-based type. The non-English characters in the imagery are determined by the map API used. When retrieving a specific region, the place names are displayed in the official language of that region. This does not affect the readability of the figure.

References

  1. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A. Competition-level code generation with alphacode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef] [PubMed]
  2. Popat, S.; Starkey, L. Learning to code or coding to learn? A systematic review. Comput. Educ. 2019, 128, 365–376. [Google Scholar] [CrossRef]
  3. Bonner, A.J.; Kifer, M. An overview of transaction logic. Theor. Comput. Sci. 1994, 133, 205–265. [Google Scholar] [CrossRef]
  4. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. arXiv 2024, arXiv:2406.00515. [Google Scholar]
  5. Wang, J.; Chen, Y. A review on code generation with llms: Application and evaluation. In Proceedings of the 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), Beijing, China, 18–19 November 2023; pp. 284–289. [Google Scholar]
  6. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K. DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196. [Google Scholar]
  7. Hui, B.; Yang, J.; Cui, Z.; Yang, J.; Liu, D.; Zhang, L.; Liu, T.; Zhang, J.; Yu, B.; Lu, K. Qwen2.5-Coder Technical Report. arXiv 2024, arXiv:2409.12186. [Google Scholar]
  8. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
  9. Rahman, M.M.; Kundu, A. Code Hallucination. arXiv 2024, arXiv:2407.04831. [Google Scholar]
  10. Li, D.; Murr, L. HumanEval on Latest GPT Models—2024. arXiv 2024, arXiv:2402.14852. [Google Scholar]
  11. Yu, Z.; Zhao, Y.; Cohan, A.; Zhang, X.-P. HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation. arXiv 2024, arXiv:2412.21199. [Google Scholar]
  12. Jain, N.; Han, K.; Gu, A.; Li, W.-D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv 2024, arXiv:2403.07974. [Google Scholar]
  13. Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the bleu: How should we assess quality of the code generation models? J. Syst. Softw. 2023, 203, 111741. [Google Scholar] [CrossRef]
  14. Liu, J.; Xie, S.; Wang, J.; Wei, Y.; Ding, Y.; Zhang, L. Evaluating language models for efficient code generation. arXiv 2024, arXiv:2408.06450. [Google Scholar]
  15. Zhou, S.; Alon, U.; Agarwal, S.; Neubig, G. Codebertscore: Evaluating code generation with pretrained models of code. arXiv 2023, arXiv:2302.05527. [Google Scholar]
  16. Capolupo, A.; Monterisi, C.; Caporusso, G.; Tarantino, E. Extracting land cover data using GEE: A review of the classification indices. In Proceedings of the Computational Science and Its Applications—ICCSA 2020, Cagliari, Italy, 1–4 July 2020; pp. 782–796. [Google Scholar]
  17. Tamiminia, H.; Salehi, B.; Mahdianpari, M.; Quackenbush, L.; Adeli, S.; Brisco, B. Google Earth Engine for geo-big data applications: A meta-analysis and systematic review. ISPRS J. Photogramm. Remote Sens. 2020, 164, 152–170. [Google Scholar] [CrossRef]
  18. Ratti, C.; Wang, Y.; Ishii, H.; Piper, B.; Frenchman, D. Tangible User Interfaces (TUIs): A novel paradigm for GIS. Trans. GIS 2004, 8, 407–421. [Google Scholar] [CrossRef]
  19. Zhao, Q.; Yu, L.; Li, X.; Peng, D.; Zhang, Y.; Gong, P. Progress and trends in the application of Google Earth and Google Earth Engine. Remote Sens. 2021, 13, 3778. [Google Scholar] [CrossRef]
  20. Mutanga, O.; Kumar, L. Google earth engine applications. Remote Sens. 2019, 11, 591. [Google Scholar] [CrossRef]
  21. Hou, S.; Shen, Z.; Zhao, A.; Liang, J.; Gui, Z.; Guan, X.; Li, R.; Wu, H. GeoCode-GPT: A large language model for geospatial code generation. Int. J. Appl. Earth Obs. Geoinf. 2025, 104456. [Google Scholar] [CrossRef]
  22. Hou, S.; Liang, J.; Zhao, A.; Wu, H. GEE-OPs: An operator knowledge base for geospatial code generation on the Google Earth Engine platform powered by large language models. Geo-Spat. Inf. Sci. 2025, 1–22. [Google Scholar] [CrossRef]
  23. Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Chen, H.; Lippitt, C.D. Google Earth Engine and artificial intelligence (AI): A comprehensive review. Remote Sens. 2022, 14, 3253. [Google Scholar] [CrossRef]
  24. Hou, S.; Shen, Z.; Liang, J.; Zhao, A.; Gui, Z.; Li, R.; Wu, H. Can large language models generate geospatial code? arXiv 2024, arXiv:2410.09738. [Google Scholar]
  25. Gramacki, P.; Martins, B.; Szymański, P. Evaluation of Code LLMs on Geospatial Code Generation. arXiv 2024, arXiv:2410.04617. [Google Scholar]
  26. Hou, S.; Jiao, H.; Shen, Z.; Liang, J.; Zhao, A.; Zhang, X.; Wang, J.; Wu, H. Chain-of-Programming (CoP): Empowering Large Language Models for Geospatial Code Generation. arXiv 2024, arXiv:2411.10753. [Google Scholar] [CrossRef]
  27. Hou, S.; Zhao, A.; Liang, J.; Shen, Z.; Wu, H. Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models. Knowl.-Based Syst. 2025, 319, 113624. [Google Scholar] [CrossRef]
  28. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.J.; Welihinda, A.; Hayes, A.; Radford, A. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar]
  29. Menick, J.; Lu, K.; Zhao, S.; Wallace, E.; Ren, H.; Hu, H.; Stathas, N.; Such, F.P. GPT-4o Mini: Advancing Cost-efficient Intelligence; Open AI: San Francisco, CA, USA, 2024. [Google Scholar]
  30. Anderson, I. Comparative Analysis Between Industrial Design Methodologies Versus the Scientific Method: AI: Claude 3.7 Sonnet. Preprints 2025. [Google Scholar]
  31. Team, G.R.; Abeyruwan, S.; Ainslie, J.; Alayrac, J.-B.; Arenas, M.G.; Armstrong, T.; Balakrishna, A.; Baruch, R.; Bauza, M.; Blokzijl, M. Gemini robotics: Bringing ai into the physical world. arXiv 2025, arXiv:2503.20020. [Google Scholar]
  32. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  33. Yang, A.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Huang, H.; Jiang, J.; Tu, J.; Zhang, J.; Zhou, J. Qwen2.5-1M Technical Report. arXiv 2025, arXiv:2501.15383. [Google Scholar]
  34. Arrieta, A.; Ugarte, M.; Valle, P.; Parejo, J.A.; Segura, S. o3-mini vs DeepSeek-R1: Which One is Safer? arXiv 2025, arXiv:2501.18438. [Google Scholar]
  35. Zheng, C.; Zhang, Z.; Zhang, B.; Lin, R.; Lu, K.; Yu, B.; Liu, D.; Zhou, J.; Lin, J. Processbench: Identifying process errors in mathematical reasoning. arXiv 2024, arXiv:2412.06559. [Google Scholar]
  36. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  37. Zhu, Q.; Guo, D.; Shao, Z.; Yang, D.; Wang, P.; Xu, R.; Wu, Y.; Li, Y.; Gao, H.; Ma, S. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv 2024, arXiv:2406.11931. [Google Scholar]
  38. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
  39. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  41. Wang, Z.; Wu, G.; Sheikh, H.R.; Simoncelli, E.P.; Yang, E.-H.; Bovik, A.C. Quality-aware images. IEEE Trans. Image Process. 2006, 15, 1680–1689. [Google Scholar] [CrossRef]
Figure 1. Common error types in geospatial code generation with LLMs. The figure categorizes typical errors into three components: the error type, the user’s geospatial code prompt to the LLM, and the erroneous code generated by the model. In each example, red highlights the incorrect code while green indicates the corrected version.
Figure 2. AutoGEEval framework structure. The diagram highlights AutoGEEval-Bench (a), the Submission Program (b), and the Judge Program (c). Blue represents documentation, orange denotes language models, green represents prompts, and purple indicates evaluation methods and metrics. The black solid arrows represent the main workflow of each stage while the gray dashed arrows indicate the primary data flow.
Figure 3. Prompt for unit test construction.
Figure 4. Unit test example with text-based type ‘ee.Array’.
Figure 5. Unit test example with topology-based type ‘ee.Image’.
Figure 6. Prompt for submission program.
Figure 7. Stacked bar chart of pass@n metrics. The blue represents the pass@1 value, the orange represents the improvement of pass@3 over pass@1, and the green represents the improvement of pass@5 over pass@3. The text on the bars indicates the absolute scores for pass@1, pass@3, and pass@5, respectively.
Figure 8. Stacked bar chart of pass@3 and pass@5 improvement ratios.
Figure 10. Average token consumption across LLMs. Blue indicates general non-reasoning models, orange indicates general reasoning models, green represents code generation models, and yellow represents geospatial code generation models.
Figure 11. Average inference time comparison of LLMs.
Figure 12. Average lines of generated GEE code per model.
Table 1. Distribution of GEE output types in AutoGEEval-Bench.

Output_Type | Description | Count | Percentage
ee.Array | Multi-dimensional array for numbers and pixels | 118 | 8.91%
ee.ArrayImage | Image constructed from multidimensional arrays | 30 | 2.26%
ee.Blob | Binary large object storage (e.g., files/models) | 1 | 0.08%
ee.BOOL | Boolean logic value (True/False) | 38 | 2.87%
ee.Classifier | Machine learning classifier object | 12 | 0.91%
ee.Clusterer | Clustering algorithm processor | 6 | 0.45%
ee.ConfusionMatrix | Confusion matrix of classification results | 4 | 0.30%
ee.Date | Date and time format data | 9 | 0.68%
ee.DateRange | Object representing a range of dates | 5 | 0.38%
ee.Dictionary | Key-value data structure | 63 | 4.75%
ee.Element | Fundamental unit of a geographic feature | 3 | 0.23%
ee.ErrorMargin | Statistical object for error margins | 1 | 0.08%
ee.Feature | Single feature with properties and shape | 21 | 1.58%
ee.FeatureCollection | Collection of geographic features | 41 | 3.09%
ee.Filter | Object representing data filtering conditions | 37 | 2.79%
ee.Geometry | Geometric shapes (point, line, polygon, etc.) | 146 | 11.02%
ee.Image | Single raster image data | 224 | 16.91%
ee.ImageCollection | Collection of image data objects | 17 | 1.28%
ee.Join | Method for joining datasets | 6 | 0.45%
ee.Kernel | Convolution kernel for spatial analysis | 22 | 1.66%
ee.List | Ordered list data structure | 68 | 5.13%
ee.Number | Numeric data | 194 | 14.64%
ee.PixelType | Pixel type definition | 10 | 0.75%
ee.Projection | Coordinate system projection information | 15 | 1.13%
ee.Reducer | Aggregation and reduction functions | 60 | 4.53%
ee.String | String-type data | 174 | 13.13%
Overall | Total | 1325 | 100.00%
Table 2. Summary of value representations and evaluation strategies for GEE data types.

GEE Data Type | Value Representation | Testing Logic
ee.Array, ee.ConfusionMatrix, ee.ArrayImage | Small-scale array | Use getInfo to convert to a NumPy array and compare each element with expected_answer.
ee.Image, ee.ImageCollection | Large-scale array | Download the image as a NumPy array and perform pixel-wise comparison; for large images, apply center sampling with a tolerance of 0.001. Merge all images into one and evaluate as a single image.
ee.List | List | Convert to a Python list via getInfo and compare each element.
ee.String, ee.BOOL | String | Convert to a Python string via getInfo and compare directly. Boolean values are also treated as strings.
ee.Number | Floating-point number | Convert to a Python float via getInfo and compare with the answer.
ee.Dictionary, ee.Blob, ee.Reducer, ee.Filter, ee.Classifier, ee.Clusterer, ee.PixelType, ee.Join, ee.Kernel, ee.ErrorMargin, ee.Element, ee.Projection | All dictionary keys | Convert to a Python dictionary via getInfo and compare key-value pairs.
ee.Date, ee.DateRange | Dictionary ‘value’ field | Use getInfo to obtain a dictionary, extract the ‘value’ field (timestamp in milliseconds) and compare numerically.
ee.Geometry, ee.Feature, ee.FeatureCollection | GeoJSON | Convert to GeoJSON using getInfo and compare geometric consistency with Shapely; for Features, extract geometry before comparison.
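To make the per-type strategies in Table 2 concrete, the following is a minimal sketch of an answer checker built on the GEE Python API, NumPy, and Shapely. It is illustrative only: the function name check_result, the variable expected_answer, and the tolerance constant are our own naming choices, the dispatch via isinstance is a simplification (the Judge Program presumably dispatches on the test case's declared output type), and the ee.Image/ee.ImageCollection branch (pixel download with center sampling) is omitted for brevity.

```python
import ee
import numpy as np
from shapely.geometry import shape

ee.Initialize()  # assumes Earth Engine credentials are already configured

TOLERANCE = 1e-3  # numeric tolerance, matching the 0.001 used for sampled images in Table 2


def check_result(result, expected_answer):
    """Compare a GEE object produced by generated code with the reference answer,
    following the per-type rules summarized in Table 2 (illustrative sketch)."""
    info = result.getInfo()  # resolve the server-side object into a native Python value

    if isinstance(result, ee.Number):
        return abs(float(info) - float(expected_answer)) <= TOLERANCE
    if isinstance(result, ee.String):
        return str(info) == str(expected_answer)
    if isinstance(result, ee.Array):
        return np.allclose(np.array(info), np.array(expected_answer), atol=TOLERANCE)
    if isinstance(result, ee.List):
        return info == list(expected_answer)
    if isinstance(result, (ee.Date, ee.DateRange)):
        # getInfo() yields a dictionary; per Table 2, the 'value' field holds a millisecond timestamp
        return info["value"] == expected_answer
    if isinstance(result, (ee.Geometry, ee.Feature)):
        geojson = info["geometry"] if info.get("type") == "Feature" else info
        return shape(geojson).equals(shape(expected_answer))
    # Dictionary-like objects (ee.Dictionary, ee.Reducer, ee.Filter, ...): compare key-value pairs.
    return info == expected_answer
```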
Table 3. Information of evaluated LLMs. “N/A” indicates that the parameter size of the model was not publicly released at the time of publication and is therefore marked as unknown.

Model Type | Model Name | Developer | Size | Year
General Non-Reasoning | GPT-4o [28] | OpenAI | N/A | 2024
General Non-Reasoning | GPT-4o-mini [29] | OpenAI | N/A | 2024
General Non-Reasoning | Claude3.7-Sonnet [30] | Anthropic | N/A | 2025
General Non-Reasoning | Gemini-2.0-pro [31] | Google | N/A | 2025
General Non-Reasoning | DeepSeek-V3 [32] | DeepSeek | 671B | 2024
General Non-Reasoning | DeepSeek-V3-0324 [32] | DeepSeek | 685B | 2025
General Non-Reasoning | Qwen-2.5 [33] | Alibaba | 3B, 7B, 32B | 2024
General Reasoning | o3-mini [34] | OpenAI | N/A | 2025
General Reasoning | QwQ-32B [35] | Alibaba | 32B | 2025
General Reasoning | DeepSeek-R1 [36] | DeepSeek | 671B | 2025
General Code Generation | DeepSeek-Coder-V2 [37] | DeepSeek | 16B | 2024
General Code Generation | Qwen2.5-Coder [7] | Alibaba | 3B, 7B, 32B | 2024
General Code Generation | Code-Llama-7B [8] | Meta | 7B | 2023
Geospatial Code Generation | GeoCode-GPT-7B [21] | Wuhan University | 7B | 2024
Table 4. Time allocation across experimental stages.

Stages | Time Spent (hours)
AutoGEEval-Bench Construction | 35
Expert Manual Revision | 50
Model Inference and Code Execution | 445
Evaluation of Model Responses | 270
Total (All Stages) | 800
Table 5. Accuracy evaluation results. The values in parentheses under pass@3 represent the improvement over pass@1, and the values in parentheses under pass@5 represent the improvement over pass@3.

Model | pass@1 (%) | pass@3 (%) | pass@5 (%) | CV | SA
General Non-Reasoning
GPT-4o | 59.02 | 63.62 (+4.60) | 65.36 (+1.74) | 0.097 | 59.58
GPT-4o-mini | 55.02 | 60.68 (+4.66) | 61.43 (+0.75) | 0.104 | 55.63
Claude3.7-Sonnet | 63.92 | 66.72 (+2.80) | 67.92 (+1.20) | 0.059 | 64.14
Gemini-2.0-pro | 65.36 | 75.09 (+9.73) | 77.28 (+2.19) | 0.154 | 66.95
DeepSeek-V3 | 71.55 | 75.25 (+3.70) | 76.91 (+1.66) | 0.070 | 71.90
DeepSeek-V3-0324 | 65.28 | 71.92 (+6.64) | 73.51 (+1.59) | 0.112 | 66.11
Qwen-2.5-3B | 33.58 | 39.32 (+5.74) | 41.43 (+2.11) | 0.189 | 34.83
Qwen-2.5-7B | 49.36 | 54.49 (+5.13) | 56.38 (+1.89) | 0.125 | 50.14
Qwen-2.5-32B | 54.42 | 60.00 (+5.58) | 62.04 (+2.04) | 0.123 | 55.25
General Reasoning
o3-mini | 56.98 | 68.91 (+11.93) | 71.02 (+2.11) | 0.198 | 59.30
QwQ-32B | 53.74 | 64.83 (+9.09) | 68.83 (+4.00) | 0.219 | 56.45
DeepSeek-R1 | 60.23 | 72.68 (+12.45) | 76.68 (+4.00) | 0.215 | 63.14
General Code Generation
DeepSeek-Coder-V2 | 31.40 | 38.11 (+6.71) | 40.75 (+2.64) | 0.229 | 33.14
Qwen2.5-Coder-3B | 46.49 | 54.34 (+7.85) | 57.36 (+3.02) | 0.190 | 48.22
Qwen2.5-Coder-7B | 51.25 | 57.66 (+6.41) | 60.91 (+3.25) | 0.159 | 52.57
Qwen2.5-Coder-32B | 61.28 | 64.08 (+2.80) | 65.21 (+1.13) | 0.060 | 61.50
Code-Llama-7B | 56.98 | 64.00 (+7.02) | 66.42 (+2.42) | 0.142 | 58.15
Geospatial Code Generation
GeoCode-GPT-7B | 58.58 | 65.34 (+6.76) | 68.53 (+3.19) | 0.145 | 59.84
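For readers reproducing Table 5, the snippet below shows the standard unbiased pass@k estimator from Chen et al. [38], which underpins the pass@1/pass@3/pass@5 columns, together with a coefficient-of-variation helper. Treat the CV definition (standard deviation over mean of per-run pass rates) as an assumption; the exact formula used by AutoGEEval is defined earlier in the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. [38]): the probability that at least
    one of k samples drawn from n generations is correct, given that c of the
    n generations pass the unit test."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coefficient_of_variation(values) -> float:
    """CV = standard deviation / mean, applied here to per-run pass rates (assumed)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return std / mean if mean else float("nan")


# Example: 5 generations for one test case, 2 of which execute correctly.
print(pass_at_k(5, 2, 1))  # 0.4
print(pass_at_k(5, 2, 3))  # 0.9
print(pass_at_k(5, 2, 5))  # 1.0
```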
Table 6. Ranking of the models under pass@5, CV, and SA metrics. The table is sorted by S_Rank, reflecting the accuracy ranking of the models with the inclusion of stability factors, rather than solely considering accuracy. Categories 1, 2, 3, and 4 correspond to general non-reasoning models, general reasoning models, general code generation models, and geospatial code generation models, respectively.

Category | Model | pass@5 | CV | SA | P_Rank | C_Rank | S_Rank
1 | DeepSeek-V3 | 76.91 | 0.07 | 71.9 | 2 | 3 | 1
1 | Gemini-2.0-pro | 77.28 | 0.154 | 66.95 | 1 | 11 | 2
1 | DeepSeek-V3-0324 | 73.51 | 0.112 | 66.11 | 4 | 6 | 3
1 | Claude3.7-Sonnet | 67.92 | 0.059 | 64.14 | 8 | 1 | 4
2 | DeepSeek-R1 | 76.68 | 0.215 | 63.14 | 3 | 16 | 5
3 | Qwen2.5-Coder-32B | 65.21 | 0.06 | 61.5 | 11 | 2 | 6
4 | GeoCode-GPT-7B | 68.53 | 0.145 | 59.84 | 7 | 10 | 7
1 | GPT-4o | 65.36 | 0.097 | 59.58 | 10 | 4 | 8
2 | o3-mini | 71.02 | 0.198 | 59.3 | 5 | 15 | 9
3 | Code-Llama-7B | 66.42 | 0.142 | 58.15 | 9 | 9 | 10
2 | QwQ-32B | 68.83 | 0.219 | 56.45 | 6 | 17 | 11
1 | GPT-4o-mini | 61.43 | 0.104 | 55.63 | 13 | 5 | 12
1 | Qwen-2.5-32B | 62.04 | 0.123 | 55.25 | 12 | 7 | 13
3 | Qwen2.5-Coder-7B | 60.91 | 0.159 | 52.57 | 14 | 12 | 14
1 | Qwen-2.5-7B | 56.38 | 0.125 | 50.14 | 16 | 8 | 15
3 | Qwen2.5-Coder-3B | 57.36 | 0.19 | 48.22 | 15 | 14 | 16
1 | Qwen-2.5-3B | 41.43 | 0.189 | 34.83 | 17 | 13 | 17
3 | DeepSeek-Coder-V2 | 40.75 | 0.229 | 33.14 | 18 | 18 | 18
Table 7. Evaluation results for resource consumption. For the QwQ-32B model using API calls, due to the provider’s configuration, only “streaming calls” are supported. In this mode, token consumption cannot be tracked, so it is marked as N/A.

Model | Inference Method | Tok. (tokens) | In.T (s) | Co.L (lines)
General Non-Reasoning
GPT-4o | API call | 210 | 3.31 | 7.77
GPT-4o-mini | API call | 208 | 7.63 | 5.86
Claude3.7-Sonnet | API call | 265 | 11.72 | 8.98
Gemini-2.0-pro | API call | 223 | 24.55 | 5.2
DeepSeek-V3 | API call | 190 | 8.87 | 4.86
DeepSeek-V3-0324 | API call | 204 | 16.32 | 6.82
Qwen-2.5-3B | Local deployment | 186 | 2.58 | 4.12
Qwen-2.5-7B | Local deployment | 197 | 3.88 | 6.28
Qwen-2.5-32B | API call | 205 | 5.63 | 6.6
General Reasoning
o3-mini | API call | 1083 | 7.40 | 6.93
QwQ-32B | API call | N/A | 44.68 | 5.64
DeepSeek-R1 | API call | 1557 | 78.30 | 5.32
General Code Generation
DeepSeek-Coder-V2 | Local deployment | 285 | 8.39 | 10.06
Qwen2.5-Coder-3B | Local deployment | 240 | 2.51 | 9.11
Qwen2.5-Coder-7B | Local deployment | 224 | 3.76 | 7.06
Qwen2.5-Coder-32B | API call | 198 | 5.50 | 5.79
Code-Llama-7B | Local deployment | 256 | 3.05 | 3.58
Geospatial Code Generation
GeoCode-GPT-7B | Local deployment | 253 | 4.05 | 11.79
Table 8. Evaluation results for operational efficiency. For the QwQ-32B model using API calls, due to the provider’s configuration, only “streaming calls” are supported. In this mode, token consumption cannot be tracked, so it is marked as N/A.

Model | Inference Method | Tok.-E | In.T-E | Co.L-E
General Non-Reasoning
GPT-4o | API call | 0.311 | 19.746 | 7.77
GPT-4o-mini | API call | 0.295 | 8.052 | 5.86
Claude3.7-Sonnet | API call | 0.256 | 5.796 | 8.98
Gemini-2.0-pro | API call | 0.347 | 3.148 | 5.2
DeepSeek-V3 | API call | 0.405 | 8.670 | 4.86
DeepSeek-V3-0324 | API call | 0.360 | 4.504 | 6.82
Qwen-2.5-3B | Local deployment | 0.223 | 16.060 | 4.12
Qwen-2.5-7B | Local deployment | 0.286 | 14.530 | 6.28
Qwen-2.5-32B | API call | 0.303 | 11.019 | 6.6
General Reasoning
o3-mini | API call | 0.066 | 9.597 | 6.93
QwQ-32B | API call | N/A | 1.541 | 5.64
DeepSeek-R1 | API call | 0.049 | 0.979 | 5.32
General Code Generation
DeepSeek-Coder-V2 | Local deployment | 0.143 | 8.39 | 10.06
Qwen2.5-Coder-3B | Local deployment | 0.239 | 2.51 | 9.11
Qwen2.5-Coder-7B | Local deployment | 0.272 | 3.76 | 7.06
Qwen2.5-Coder-32B | API call | 0.329 | 5.50 | 5.79
Code-Llama-7B | Local deployment | 0.259 | 3.05 | 3.58
Geospatial Code Generation
GeoCode-GPT-7B | Local deployment | 0.271 | 4.05 | 11.79
Table 9. Rank-based comparative evaluation of models. The table is sorted by Total_Rank in ascending order. If models share the same average rank, they are assigned the same ranking (e.g., both DeepSeek-V3 and Code-Llama-7B are ranked 1 in E_Rank). Yellow highlights indicate the top 12 models in E_Rank, green for the top 12 in S_Rank, and orange for the top 12 in P_Rank. Gray highlights mark the bottom 6 models across E_Rank, S_Rank, and P_Rank. Categories 1, 2, 3, and 4 correspond to general non-reasoning models, general reasoning models, general code generation models, and geospatial code generation models, respectively.

Category | Model | T_Rank | I_Rank | Co_Rank | E_Rank | S_Rank | P_Rank | Total_Rank
1 | DeepSeek-V3 | 1 | 11 | 2 | 1 | 1 | 2 | 1
1 | Gemini-2.0-pro | 3 | 16 | 3 | 4 | 2 | 1 | 2
1 | DeepSeek-V3-0324 | 2 | 15 | 7 | 6 | 3 | 4 | 3
3 | Code-Llama-7B | 11 | 2 | 1 | 1 | 10 | 9 | 4
3 | Qwen2.5-Coder-32B | 4 | 8 | 6 | 3 | 6 | 11 | 4
1 | GPT-4o | 5 | 3 | 14 | 5 | 8 | 10 | 6
2 | DeepSeek-R1 | 17 | 18 | 4 | 15 | 5 | 3 | 6
4 | GeoCode-GPT-7B | 10 | 4 | 17 | 13 | 7 | 7 | 8
2 | o3-mini | 16 | 10 | 9 | 14 | 9 | 5 | 9
1 | Claude3.7-Sonnet | 12 | 13 | 15 | 17 | 4 | 8 | 10
1 | Qwen-2.5-32B | 6 | 9 | 11 | 7 | 13 | 12 | 11
1 | GPT-4o-mini | 7 | 12 | 8 | 8 | 12 | 13 | 12
2 | QwQ-32B | 18 | 17 | 5 | 16 | 11 | 6 | 12
3 | Qwen2.5-Coder-7B | 9 | 5 | 13 | 9 | 14 | 14 | 14
1 | Qwen-2.5-7B | 8 | 7 | 12 | 10 | 15 | 16 | 15
3 | Qwen2.5-Coder-3B | 13 | 1 | 16 | 11 | 16 | 15 | 16
1 | Qwen-2.5-3B | 14 | 6 | 10 | 12 | 17 | 17 | 17
3 | DeepSeek-Coder-V2 | 15 | 14 | 18 | 18 | 18 | 18 | 18
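The Table 9 caption notes that models sharing the same average rank receive the same ranking. The sketch below illustrates that tie-aware aggregation step, under the assumption (consistent with the values in the table) that Total_Rank is obtained by averaging E_Rank, S_Rank, and P_Rank and then re-ranking the averages; the five-model subset and its sub-ranks are taken from Table 9.

```python
import pandas as pd

# Sub-ranks for a subset of models (values copied from Table 9).
ranks = pd.DataFrame(
    {
        "E_Rank": [1, 4, 6, 1, 3],
        "S_Rank": [1, 2, 3, 10, 6],
        "P_Rank": [2, 1, 4, 9, 11],
    },
    index=["DeepSeek-V3", "Gemini-2.0-pro", "DeepSeek-V3-0324",
           "Code-Llama-7B", "Qwen2.5-Coder-32B"],
)

avg = ranks.mean(axis=1)                          # average of the three sub-ranks
total_rank = avg.rank(method="min").astype(int)   # tied averages share the same rank
print(total_rank.sort_values())
# -> DeepSeek-V3 1, Gemini-2.0-pro 2, DeepSeek-V3-0324 3,
#    Code-Llama-7B 4 and Qwen2.5-Coder-32B 4 (tie), matching Table 9.
```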
Table 10. Error type distribution in GEE code generation across models.

Model | Parameter Error (%) | Invalid Answer (%) | Syntax Error (%) | Runtime Error (%) | Network Error (%)
General Non-Reasoning
GPT-4o | 72.21 | 26.58 | 1.02 | 0.19 | 0.00
GPT-4o-mini | 75.88 | 22.29 | 1.49 | 0.30 | 0.04
Claude3.7-Sonnet | 65.81 | 31.92 | 1.76 | 0.22 | 0.29
Gemini-2.0-pro | 55.71 | 37.15 | 7.01 | 0.02 | 0.11
DeepSeek-V3 | 72.75 | 26.29 | 0.37 | 0.14 | 0.45
DeepSeek-V3-0324 | 79.40 | 19.86 | 0.43 | 0.08 | 0.23
Qwen-2.5-3B | 83.72 | 8.38 | 7.90 | 0.00 | 0.00
Qwen-2.5-7B | 83.44 | 12.60 | 3.96 | 0.00 | 0.00
Qwen-2.5-32B | 78.47 | 18.65 | 2.75 | 0.11 | 0.02
General Reasoning
o3-mini | 67.79 | 30.02 | 1.84 | 0.09 | 0.26
QwQ-32B | 85.68 | 13.01 | 1.11 | 0.01 | 0.19
DeepSeek-R1 | 85.04 | 14.62 | 0.19 | 0.00 | 0.15
General Code Generation
DeepSeek-Coder-V2 | 84.47 | 10.62 | 4.78 | 0.00 | 0.13
Qwen2.5-Coder-3B | 75.26 | 12.54 | 12.20 | 0.00 | 0.00
Qwen2.5-Coder-7B | 84.76 | 14.42 | 0.63 | 0.03 | 0.16
Qwen2.5-Coder-32B | 79.19 | 19.96 | 0.43 | 0.19 | 0.23
Code-Llama-7B | 80.01 | 18.47 | 1.37 | 0.01 | 0.14
Geospatial Code Generation
GeoCode-GPT-7B | 77.21 | 9.54 | 13.14 | 0.03 | 0.08
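The categories in Table 10 correspond to distinct failure modes when generated code is compiled, executed against GEE, and checked. The sketch below shows one possible bucketing based on raised exceptions; it is not the framework's actual logic. The run() entry point, the check callable (for example, a checker like the one sketched after Table 2), and the keyword heuristic that separates parameter misuse from other runtime failures are all assumptions introduced for illustration.

```python
import ee


def classify_outcome(generated_code: str, check) -> str:
    """Bucket one executed test case into Table 10-style categories (illustrative only).
    `check` compares the produced GEE object against the expected answer."""
    namespace = {"ee": ee}
    try:
        # Compilation failures surface as SyntaxError before anything runs.
        exec(compile(generated_code, "<candidate>", "exec"), namespace)
        result = namespace["run"]()  # assumes the task asks for a `run()` entry point
    except SyntaxError:
        return "Syntax Error"
    except (ConnectionError, TimeoutError, OSError):
        return "Network Error"
    except ee.EEException as exc:
        # GEE reports misuse of function arguments as an EEException; a crude keyword
        # heuristic separates parameter misuse from other server-side runtime failures.
        msg = str(exc).lower()
        return "Parameter Error" if ("argument" in msg or "parameter" in msg) else "Runtime Error"
    except Exception:
        return "Runtime Error"
    return "Passed" if check(result) else "Invalid Answer"
```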
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
