Article

Scaling Automated Programming Assessment Systems

Department of Applied Computing, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(4), 942; https://doi.org/10.3390/electronics12040942
Submission received: 12 December 2022 / Revised: 22 January 2023 / Accepted: 10 February 2023 / Published: 13 February 2023
(This article belongs to the Special Issue Advanced Web Applications)

Abstract

The first automated assessment of student programs was reported more than 60 years ago, but this topic remains highly relevant among computer science researchers and teachers. In the last decade, several factors have contributed to the popularity of this approach, such as the development of massive online courses, where large numbers of students can hardly be assessed manually, the COVID-19 pandemic with a strong online presence and physical relocation of students, and the ever-increasing shortage of personnel in the CS field. Modern Automated Programming Assessment Systems (APASs) are implemented as web applications. For such web applications, especially those that support immediate (on-demand) program assessment and feedback, it can be quite a challenge to implement the various system modules in a secure and scalable manner. Over the past six years, we have developed and actively deployed “Edgar”—a state-of-the-art APAS that enables immediate program evaluation and feedback in any programming language (SQL, C, Java, etc.). In this article, we look at the APAS web application architecture with a focus on scalability issues. We review fundamental features such as dynamic analysis and untrusted code execution, as well as more complex cases such as static analysis and plagiarism detection, and we summarize the lessons learned over the previous six years of research. We identify scalability challenges, show how they have been addressed in APAS Edgar, and then propose general architectural solutions, building blocks and patterns to address those challenges.

1. Introduction

Courses in the areas of computer science (CS) and computer engineering (CE) are delivered at many levels: primary and secondary schools, universities, online courses, internally at companies, etc. Regardless of the level, an essential part of the educational process is the timely assessment of learning. In CS and CE, this means reviewing the students’ program code. Through assessment and feedback, learners receive information about the progress of their education. On the other hand, teachers gain insight into the overall teaching process in addition to detailed information about individual learners. However, providing a manual assessment of computer code is difficult. Furthermore, teachers and other faculty members are outnumbered by the growing number of CS and CE students, and the gap seems to be widening as teachers face staff shortages. This has naturally led to the development of (semi-)automated tools for code assessment. Although manual evaluation of program code is still considered the gold standard, automated evaluation has certain advantages over manual evaluation: speed, availability, objectivity, and consistency of criteria. A system developed for such a purpose is called an Automated Programming Assessment System (APAS). There are several high-quality APASs in use today [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] and many of them are actively being developed and improved (a more detailed overview can be found in Section 6). Recent studies also show a growing interest in exploring automated assessment of program features other than functionality, such as quality, behavior, readability, and security, as well as novel assessment methods to improve feedback [24]. The literature in this domain typically focuses on isolated aspects of the process, and we note a lack of literature on overall APAS software development and architecture. Modern APASs are implemented as web applications that must support many parallel users, which requires a modular, distributed, and scalable architecture. Web application scalability has attracted considerable attention in both the technical and academic literature (for example, [25,26,27,28,29]), but it has not yet been addressed in the challenging and unique context of automated programming assessment systems. Our paper provides a contribution in that regard.
Over the past six years, we have been developing an APAS named Edgar [30] and using it daily in teaching and student program code assessment for many undergraduate and graduate courses in CS and CE. For example, in the past academic year, 2021/22, we administered more than 57,000 exams using APAS Edgar, which contained approximately 340,000 questions from 19 different courses. Edgar can assess code in an arbitrary programming language (e.g., C, Java, Python, and SQL), supports many different question types (multiple-choice multi-correct questions, programming questions, free text, etc.) and exam types (fixed and random tests with mixed question types, peer assessment, etc.), and provides rich logging and monitoring capabilities. Edgar also provides certain Learning Management System (LMS) features such as adaptive exercises and tutorials. Edgar is designed in a generic way so that it can be used by anyone and is released under the MIT license [30].
In this paper, we present the lessons we have learned over the six years of developing and revising this modular software, with particular attention to the resource-intensive parts of an APAS that must be designed in a scalable way. We identify and comment on APAS modules in general terms, followed by examples of good and bad practices we have learned over the years. The architecture of modules for executing untrusted code is covered in detail, as we make the case for immediate (synchronous) assessment/feedback, which places a heavy load on the system. In addition, we find the static analysis module and the plagiarism detection module challenging, especially if they are generic, which is our goal. In this general spirit, we present proposals for the system architecture of these modules and the initial results of implementation efforts.
The remainder of this paper is organized as follows: Section 2 addresses the first part of our contribution and introduces the APAS modules. We address the second part of our contribution in Section 3 and Section 4, where, for each module, we provide a general analysis and architectural recommendations, followed by a subsection on the implementation of these recommendations in APAS Edgar, with examples and comments on lessons learned. Section 3 elaborates on the core modules for data storage and dynamic analysis. Section 4 deals with the advanced modules: the static analysis module and the plagiarism detection module. Section 5 briefly discusses the remaining core modules that do not require significant resources. Section 3, Section 4 and Section 5 comment on the modules mainly through Edgar’s prism, while Section 6 provides a broader view using information from the available literature. Finally, Section 7 summarizes the results of our research.

2. APAS Modules

In this section, we present and comment on the modules that can be found in a typical APAS ([24,31,32,33]). In our earlier work, we defined the necessary feature set of a comprehensive APAS [23]. This previous work focused on the functional aspects of the system—on the behavior and functionalities that an APAS must provide. Here, we take a different perspective and examine non-functional requirements with a focus on performance and scalability. While functional requirements can be satisfied on virtually any architecture, non-functional requirements have a direct impact on the system architecture.
Following an analysis of the information that is currently available on several of the notable APAS systems ([5,9,14,18,34]), and taking into account Edgar’s architecture and our own APAS development expertise, we have identified ten common modules, arranged according to their topic and scalability requirements. Figure 1 shows the proposed modules, grouped by topic, with solid lines indicating core modules that are mandatory for an APAS system. Dashed rectangles group optional, more advanced modules. The shaded modules are those that we believe pose scalability challenges and that we will explore later in this paper. Note that we distinguish between “standalone” programming languages (such as C, Java, and Python) that are compiled and run (or interpreted) in the OS environment, and “managed” programming languages (for example, SQL, HTML, CSS, and MongoDB query language) that are executed in the context of a hosting environment such as an RDBMS. Program code in all “standalone” languages can be processed using a generic schema, while “managed” programming languages typically require a custom scoring pipeline. Section 3.2 discusses this topic in more detail.
In the following sections, we will explain these modules in detail, point out challenges, and propose architectural solutions. In this context, Edgar serves as an example of a typical APAS and as a template for a more detailed discussion of the proposed architectural solution. As a result, we arrive at general principles and recommendations for the architectural design of APAS.
Since APAS systems can execute and evaluate code, they naturally evolve to support various e-learning scenarios, but we do not cover that aspect here. Edgar has tutorials and adaptive exercises, and even an unassessed public exam can be defined as learning material, but these features are beyond the scope of this paper [23].

3. Core APAS Modules

3.1. (Polyglot) Structured Data Storage and Management

As with any web application, the fundamental part of the system is a database management system, typically a relational DBMS. DBMSs are also typical single-point-of-failure components, unless one chooses a distributed peer-to-peer database management system (e.g., Riak), which sacrifices strong consistency [35]. We did not consider such an option, nor are we aware of any APAS that uses such a data store. Of course, there is a wealth of work in the literature and industry on scaling databases, which we will not repeat here, except for a few summarizing thoughts:
  • Vertical scaling of databases eventually reaches the point where additional resources (CPU, memory, etc.) no longer improve performance,
  • Horizontal database scaling can be achieved either by (asynchronous) replication (e.g., PostgreSQL’s streaming replication, where the logical log is streamed and replayed on the replica server) or by using out-of-the-box master-slave database management systems such as MongoDB, which provide N replicas, but only for read operations, and
  • Write operations with strong consistency guarantees will always be subject to the CAP theorem [35] and can hardly scale horizontally.
Therefore, if write operations (transactions) are not excessive and, more importantly, if the transactions are short, the database theoretically should not be a bottleneck or cause problems in the system. Long-running transactions can be treated separately and handled with asynchronous execution in a queue or precomputation (see Section 3.2.1).
In general, it can be said that the selection of consistency guarantees, and thus of database technology, depends on the needs of the application. Although relational databases with customizable isolation levels are undoubtedly the best general option, other technologies should be considered that might be more appropriate for specific system components. For instance, fast NoSQL stores might be utilized for transient and unstructured data such as session data and logs, and graph databases could be used for plagiarism detection (discussed in Section 4.2). By splitting the data into “less important” transient data and transactional data (i.e., onto two database management systems), the workload is divided between two different servers that can be allocated and scaled separately. In general, data can be split into an arbitrary number of databases and database management systems, where each DBMS is best suited for the assigned workloads. This pattern, known as polyglot persistence [36], does have the disadvantage of increasing complexity, as there is more than one database to manage and back up, but in our experience, it is well worth the cost.

Structured Data Storage in APAS Edgar

In Edgar, a polyglot persistence pattern was used from the beginning by using PostgreSQL and MongoDB DBMSs. Edgar makes extensive use of logging, as we believe any APAS should, especially during the exam. Every action is logged, from retrieving questions to uploading files and navigating the questions. Even the code evaluation results that are displayed to the student after each code run are logged. This is very handy when it comes to possible student complaints or troubleshooting code. We store all log data, other “less important” transient data, and even session data in a NoSQL MongoDB database. For example, ongoing exam answers are stored in MongoDB and moved to the PostgreSQL relational database after submission. This is similar to large web stores that store shopping cart data in NoSQL stores and transfer it to a relational database after payment is made. Storing session data in persistent storage (not in memory) has two major advantages: (1) web applications can be seamlessly restarted while students are working without interrupting their sessions, and (2) web applications can scale horizontally without using consistent hashing or sticky sessions.
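To illustrate both advantages, the following minimal sketch keeps sessions in MongoDB instead of process memory, assuming an Express-based Node.js application with the express-session and connect-mongo packages; the connection string and TTL are illustrative, and Edgar’s actual session handling may differ in detail.

```typescript
import express from "express";
import session from "express-session";
import MongoStore from "connect-mongo";

const app = express();

// Sessions live in MongoDB, not in process memory: any web-server instance
// behind the load balancer can serve any request, and a restart (e.g., a
// mid-exam deployment) does not log students out.
app.use(
  session({
    secret: process.env.SESSION_SECRET ?? "change-me",
    resave: false,
    saveUninitialized: false,
    store: MongoStore.create({
      mongoUrl: "mongodb://localhost:27017/apas", // hypothetical connection string
      ttl: 8 * 60 * 60,                           // keep sessions for 8 hours
    }),
  })
);

app.listen(3000);
```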
Regarding the requirement for short RDBMS transactions, this held in our case with the exception of the exam generation process. Edgar employs a relatively rich semantic model for the exam definition (described in [23]), so the process of instantiating an exam for a certain student is complicated and lengthy, requiring over 300 lines of SQL code in the chk_get_test_instance and gen_test_instance functions ([37]). To prevent integrity violations, such as creating more exam instances than allowed, exam instantiation must also be carried out with high consistency. We had serious performance problems when we imposed a serializable isolation level on this process while many students were simultaneously creating their exams. Such demanding tasks are often best handled either by the asynchronous message broker (queue) paradigm, where jobs are queued for serial execution, or by computing and storing the results in advance (analogous to the concept of a materialized view or pre-aggregation in OLAP systems). The disadvantage of the first strategy is that it complicates the overall architecture and likely adds new components such as message broker software (e.g., RabbitMQ or Apache Kafka). The disadvantage of the second strategy is that it is less universal and cannot be used in all circumstances. For example, the result of a student’s code execution cannot be calculated in advance. In our case, however, the latter strategy was applicable, and Edgar now has a “forward generate” button that allows teachers to create exams after they have been defined and then save them so that students can “pick them up” later without performance problems, since reading does not block reading (or writing) in RDBMSs that support snapshot isolation protocols.

3.2. Dynamic Analysis—Untrusted Code Execution

The most widely used and very efficient method for evaluating programming code is so-called dynamic analysis ([24,31]). In dynamic analysis, the program is treated as a black box, and its behavior under various inputs is examined. The program is compiled and executed, and various test inputs are fed into the standard input. Finally, the outputs are observed and compared to the expected results. Most, if not all, APASs use dynamic analysis as the dominant method of code evaluation.
What distinguishes APAS web applications from other mainstream web applications is the untrusted code execution paradigm. Students submit their code, which can be faulty, malicious, overreaching in terms of resources, never-ending, etc. In other words, it poses a set of distinct security and performance challenges, which we address in the following subsections.

3.2.1. Sandboxing Untrusted Code

Untrusted code must be compiled and executed outside the web application process, in a sandbox. The sandbox is a security mechanism that separates the running program from the rest of the system. Additionally, the sandbox should provide constraints on resources such as processor time, memory used, and disk space. We recognize two families of programs: compiled and interpreted languages, such as C, Java or Python, and database query languages. The latter can be elegantly handled with transactions—each program (query) is always rolled back. In this process, one simply must:
  • Begin transaction,
  • Execute an arbitrary number of DML (select, update, insert, delete) or even DDL (create and alter table, create index, etc.) statements,
  • Retrieve the temporary results by querying the tables or system catalog, and
  • Rollback transaction.
Resource constraints are handled through the SET SESSION STATEMENT_TIMEOUT command, which limits the execution time, and the result row count can be limited by injecting a LIMIT/FIRST/TOP N clause or a similar mechanism.
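The following minimal sketch illustrates this rollback-based sandbox for PostgreSQL using the Node.js pg driver; the connection string, timeout, and row limit are illustrative assumptions rather than Edgar’s actual configuration.

```typescript
import { Client } from "pg";

// Evaluate an untrusted SQL submission inside a transaction that is always
// rolled back, so the shared evaluation database is never mutated.
async function runSandboxedSql(studentSql: string): Promise<unknown[]> {
  const client = new Client({ connectionString: process.env.PG_RUNNER_URL }); // hypothetical env var
  await client.connect();
  try {
    await client.query("BEGIN");
    // Resource limit, scoped to this transaction: abort any statement
    // running longer than 2.5 s.
    await client.query("SET LOCAL statement_timeout = 2500");
    // The student's DML/DDL executes against transaction-local state only.
    const result = await client.query(studentSql);
    // Truncate the observable result to keep the payload bounded.
    return result.rows.slice(0, 1000);
  } finally {
    // Regardless of success or failure, undo everything the student did.
    await client.query("ROLLBACK").catch(() => undefined);
    await client.end();
  }
}
```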
On the other hand, programs in standalone programming languages must be compiled (which can also be a security risk—see the example of a valid C program with infinite compilation time in [38]) and executed. For interpreted languages, there is only the second step. In general, a sandbox environment can be achieved for this purpose through three mechanisms: user-level restrictions (Linux access controls and resource constraints), process-level restrictions (mandatory access controls and process jails), and virtualization (JVM virtualization, virtual machines, and containers). More on this topic can be found in a recent study [24], which also states that although most tools still rely on user permissions, jails, and JVM security mechanisms, several new tools are adopting containerization, particularly through Docker containers. Linux containers (for example, Docker as a popular option) are probably the best option in terms of features, complexity, and slight performance overhead. For safe code execution, APAS Edgar relies on Judge0 [38]—an online judge system that wraps isolate [39] for code compilation and execution. Isolate is a battle-tested software used in the International Olympiads in Informatics that leverages Linux kernel features (namespaces, control groups) similar to Docker.
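As an illustration of how a runner service can hand a single test case to Judge0, the sketch below posts one submission to Judge0’s REST API; the endpoint, language ID, and resource limits follow Judge0’s public documentation but are assumptions here, not Edgar’s actual settings.

```typescript
// Shape of the fields we read from a Judge0 submission response.
interface JudgeResult {
  stdout: string | null;
  status: { id: number; description: string }; // e.g., "Accepted", "Wrong Answer"
  time: string | null;                          // CPU time in seconds
  memory: number | null;                        // memory in kilobytes
}

async function runTestCase(
  sourceCode: string,
  stdin: string,
  expectedOutput: string
): Promise<JudgeResult> {
  // wait=true blocks until the sandboxed run finishes; an asynchronous
  // callback URL is the alternative used under heavier load.
  const response = await fetch(
    "http://judge0.internal:2358/submissions?base64_encoded=false&wait=true", // hypothetical host
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        source_code: sourceCode,
        language_id: 50,      // C (GCC) in the public Judge0 language list
        stdin,
        expected_output: expectedOutput,
        cpu_time_limit: 2,    // seconds
        memory_limit: 128000, // kilobytes
      }),
    }
  );
  return (await response.json()) as JudgeResult;
}
```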

3.2.2. Performance Considerations

In a typical web application, when a request hits the web server, the server-side code routes the request, usually fetches data from the database, and performs light computations to calculate and format the result before sending it back. All these events usually happen in milliseconds, while student code execution is on the order of seconds. For instance, Figure 2 shows the execution times of C programming language code for the two batches of midterm exams in the “Introduction to programming” course. Three servers executed the code: edrunner and edrunner2 with the same characteristics (32 CPU cores, 118 GiB RAM) and westeros (16 CPU cores, 48 GiB RAM), with load-balancing weights of 30%, 30%, and 40%, respectively. As expected, the first two have the same median, and westeros performed somewhat worse. It is also visible that the tail latency is considerable, as there are long-running jobs clustered around five seconds and sporadic peaks. These data show that executing untrusted code in an arbitrary programming language incurs a massive processing burden that should be offloaded from the main web application, either via an independent message broker or by a separate stateless service which can be scaled horizontally and placed behind a load balancer (Figure 3).
In general, it is necessary to distinguish two code execution paradigms:
  • Delayed (asynchronous execution) paradigm—code is submitted, queued for execution, and results (as well as feedback) are computed seconds, minutes, or even hours after the submission. This paradigm is more commonly used when submissions are “projects”, usually submitted as a zipped folder containing multiple files. This paradigm is much easier to implement but can hardly support e-learning scenarios with rapid feedback. Some APASs use only this paradigm (e.g., Jack [1]).
  • Immediate (synchronous execution and feedback) paradigm—the code is processed as soon as it is submitted, and feedback is retrieved within seconds. This paradigm is more difficult to implement, but it enables a much better user experience for students and allows for a broader range of usage scenarios. For example, it can be used to support e-learning systems (not just exam evaluation).
Even with separate infrastructure, the code execution service can be overwhelmed with requests, especially if the immediate paradigm is employed. Therefore, it is advisable to implement additional client-side throttling and/or a server-side pushback mechanism to protect these services. More details and examples are given in the following section.

3.2.3. Dynamic Analysis in APAS Edgar

Edgar leverages the immediate execution paradigm: every student can run the code multiple times during the exam and receive feedback before submission. It is also used in tutorials, adaptive exercises, open exams, etc.
In APAS Edgar, any code execution is abstracted and separated from the main application via outer stateless “runner” web services. Thus, there is an outer code-runner service that executes all compiled/interpreted languages, a pg-runner service for SQL code execution, and a mongo-runner service for MongoDB map/reduce queries. Figure 3 shows the architecture for this part of the system. The generic code-runner balances the load between N stateless Judge0 instances. It is horizontally scalable, and instances can be added and removed dynamically.
The code-runners are lightweight components that route and load balance requests and do not perform any heavy lifting themselves, and therefore do not have to be replicated. Although such an architecture is scalable, this does not mean that it is always resistant to excessive loads. Students can still flood the code-runner with requests. If no additional servers join the code execution cluster, the system (Judge0) will queue requests, and response latencies will increase to values that break the immediate paradigm. Figure 4 shows one such incident in the “Object-oriented programming” course during the COVID-19 pandemic. Approximately 700 students had three hours to complete their assignments in the Java programming language. Before the pandemic, students were divided into three groups, starting at 09:00, 12:00, and 15:00 h. However, since students were not in the physical laboratory classrooms, but writing from home, they were allowed to break the 09:00–12:00–15:00 schedule and take the exam at an arbitrary time. The maximum three-hour rule for the exam duration still applied. In this incident, the system failed to give timely responses due to the following two reasons:
  • Only two servers were used, just as in the previous successful winter semester, when the same number of students took similar exams—but in the C language. Java is much more resource intensive than C. The compiler is slower, and the execution of the generated bytecode on the JVM is significantly slower than that of the machine code generated by the GCC compiler. On average, C code questions are executed in approximately 1 s and Java code questions in approximately 2.5 s.
  • The disruption of the 09:00–12:00–15:00 exam schedule allowed students to load the system unevenly, with most students opting for the last three hours from 15:00 to 18:00 (some of them probably sought experience from colleagues).
However, it is possible to protect the system from such failures at two levels:
  • Client-side throttling—if resources are insufficient to scale the code runner cluster, a throttling mechanism can be imposed on clients to prevent them from flooding the server with requests. After the incident, the pace concept was introduced in APAS Edgar. The pace is defined via an array of seconds that specifies how many seconds a student must wait before being able to execute the code. For instance, pace1 = [0, 0, 5, 10, 17 × 10⁶] states that one must wait:
    0 s before running the code for the first time,
    0 s before running the code for the second time,
    5 s before running the code for the third time,
    10 s before running the code for the fourth time, and
    17 × 10⁶ s before running the code for the fifth time, i.e., there is effectively no fifth time.
  • Obviously, one can use the pace feature to limit the maximum number of runs. Pace can be used (and is combined) at two levels:
    Question pace—the pace array is applied to each individual question;
    Exam pace—the pace array is applied with regard to the cumulative number of runs (of all questions).
  • When using two paces (question and exam), the one with the longer wait time wins (a minimal code sketch of this rule is given after this list). To illustrate, let us say we have an exam with two questions, where question_pace = [0, 5, 10, 15] and exam_pace = [0, 0, 30, 40]. The initial wait time is 0, i.e., the student does not have to wait to run the question. Student runs:
    Question 1 for the first time, now we get P = max(5, 0) = 5 s;
    Question 1 for the second time, now we get P = max(10, 30) = 30 s;
    Question 2 for the first time, now we get P = max(0, 40) = 40 s.
  • Server-side pushback—in this approach, the server protects itself from excessive load by responding with a 503 Service Unavailable message. Clients can interpret this message as a pushback and introduce a cool-down period that allows the server to recover from the heavy load.
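A minimal sketch of the pace-combination rule follows, assuming that the index into each pace array is the number of runs already performed and that the last array element is reused once the run count exceeds the array length (Edgar’s exact overflow rule may differ).

```typescript
// Seconds to wait according to one pace array, given how many runs were
// already performed; the last element is reused beyond the array length.
function waitSeconds(pace: number[], runsSoFar: number): number {
  if (pace.length === 0) return 0;
  return pace[Math.min(runsSoFar, pace.length - 1)];
}

// The stricter (longer) of the question-level and exam-level waits wins.
function effectiveWait(
  questionPace: number[],
  examPace: number[],
  questionRuns: number, // previous runs of this question
  examRuns: number      // previous runs across the whole exam
): number {
  return Math.max(
    waitSeconds(questionPace, questionRuns),
    waitSeconds(examPace, examRuns)
  );
}

// With question_pace = [0, 5, 10, 15] and exam_pace = [0, 0, 30, 40]:
// after 2 runs of this question and 2 runs overall, the next wait is
// max(question_pace[2], exam_pace[2]) = max(10, 30) = 30 s.
console.log(effectiveWait([0, 5, 10, 15], [0, 0, 30, 40], 2, 2)); // 30
```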
Edgar’s code-runner service implements one such self-protecting algorithm, depicted in Figure 5. When a request arrives, the SubmittedJobs array is checked for overflow—if the array is shorter than the MAX_ACTV_SUMISSIONS parameter (by default set to 150), the job is placed in the SubmittedJobs array with its newly assigned timer and passed along to the Judge0 cluster. Judge0 will call back the code-runner service when the job is done, using the URL given on submission. If this callback does not occur within SUBMISSION_QUEUE_TIMEOUT seconds (by default set to 60), the job is timed out, removed from the SubmittedJobs array, and a 503 Service Unavailable response is sent. If the response arrives in time, the job is removed from SubmittedJobs, and a successful 200 response is sent with the evaluation results. During heavy load, the SubmittedJobs array will start to fill, and once it reaches its configured capacity, newly arrived requests are placed in the WaitingRoom priority queue instead, where they “wait” for a spot to open in the SubmittedJobs array. Again, a timer is attached to each job, and if the job spends more than WAITING_ROOM_TIMEOUT seconds in the queue (by default set to 5 s), it is timed out, resulting in a 503 response. The queue is prioritized to filter the more important submissions. The most important submission, with priority = 0, is the “exam submission”, which occurs when the student submits the exam for evaluation. For all other test submissions issued by students, the submission ordinal number is assigned as the priority. Each student may issue tens, sometimes hundreds of submissions. In this way, under high load, a student who has already received a lot of feedback is more likely to be denied further feedback than a student who has used (or, more likely, abused) the system less. When teachers test a question at design time, their submissions are assigned a constant low priority of 1000.
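The following simplified sketch captures the essence of this self-protecting flow; identifier names are tidied for readability, error handling is omitted, and the actual code-runner service contains considerably more bookkeeping.

```typescript
// Illustrative defaults mirroring the parameters described above.
const MAX_ACTIVE_SUBMISSIONS = 150;
const SUBMISSION_QUEUE_TIMEOUT_MS = 60_000;
const WAITING_ROOM_TIMEOUT_MS = 5_000;

interface Job {
  id: string;
  priority: number; // 0 = exam submission, ordinal for test runs, 1000 = teacher
  respond: (httpStatus: 200 | 503, body?: unknown) => void;
  timer?: ReturnType<typeof setTimeout>;
}

const submittedJobs = new Map<string, Job>();
const waitingRoom: Job[] = []; // kept sorted by ascending priority

function submit(job: Job): void {
  if (submittedJobs.size < MAX_ACTIVE_SUBMISSIONS) {
    dispatch(job);
  } else {
    // Saturated: park the job; a lower priority number is served first.
    job.timer = setTimeout(() => {
      const idx = waitingRoom.indexOf(job);
      if (idx !== -1) waitingRoom.splice(idx, 1);
      job.respond(503); // pushback: the client should cool down
    }, WAITING_ROOM_TIMEOUT_MS);
    waitingRoom.push(job);
    waitingRoom.sort((a, b) => a.priority - b.priority);
  }
}

function dispatch(job: Job): void {
  job.timer = setTimeout(() => {
    submittedJobs.delete(job.id);
    job.respond(503); // Judge0 did not call back in time
  }, SUBMISSION_QUEUE_TIMEOUT_MS);
  submittedJobs.set(job.id, job);
  // ...forward the job to the Judge0 cluster here (omitted)...
}

// Invoked via the callback URL when Judge0 finishes a job.
function onJudgeCallback(jobId: string, evaluation: unknown): void {
  const job = submittedJobs.get(jobId);
  if (!job) return; // already timed out
  clearTimeout(job.timer);
  submittedJobs.delete(jobId);
  job.respond(200, evaluation);
  const next = waitingRoom.shift(); // a spot has opened up
  if (next) {
    clearTimeout(next.timer);
    dispatch(next);
  }
}
```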
Figure 6 shows an example of the use of both mechanisms in APAS Edgar. Client-side throttling imposes one-minute intervals between runs, and if the system is overloaded, the runs respond with the message Service Unavailable. The student can still submit the exam; the submission is given a priority of zero and placed in the SubmittedJobs array. Of course, despite all these mechanisms, the exam evaluation can still fail if the code execution cluster has very limited resources. In this case, it is essential to store the submission in the database so that it can be recovered later.
For SQL code execution, a different technology must be used because dynamic testing via string comparison is not satisfactory. There are attempts to evaluate in such a manner, such as Moodle’s CodeRunner plugin [16], but such an approach has several serious drawbacks (it is not possible to allow for different row/column ordering or to test DDL statements, there is tight coupling with the string output format of SQLite, etc. [23]).
Dynamic testing, in this context, is performed by running the correct query (the solution provided by the teacher) and the student’s query and comparing the result sets in a configurable way (Figure 7). Using the prefix-suffix mechanism, it is also possible to test DDL statements such as ALTER TABLE and CREATE INDEX. More details can be found in our previous work [23].
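A minimal sketch of such a configurable result-set comparison is given below; the two options shown (row-order sensitivity and column-name matching) are only a subset of the configuration described in [23], and the code is illustrative rather than Edgar’s actual implementation.

```typescript
type Row = Record<string, unknown>;

interface CompareOptions {
  orderSensitive: boolean;    // must row order match exactly?
  columnNamesMatter: boolean; // compare column names or only positional values?
}

// Canonical textual form of one row, respecting the comparison options.
function rowKey(row: Row, opts: CompareOptions): string {
  const entries = Object.entries(row);
  const parts = opts.columnNamesMatter
    ? entries.map(([k, v]) => `${k}=${String(v)}`)
    : entries.map(([, v]) => String(v));
  return parts.join("|");
}

function resultSetsEqual(teacher: Row[], student: Row[], opts: CompareOptions): boolean {
  if (teacher.length !== student.length) return false;
  const a = teacher.map((r) => rowKey(r, opts));
  const b = student.map((r) => rowKey(r, opts));
  if (!opts.orderSensitive) {
    a.sort(); // order-insensitive comparison: compare sorted canonical rows
    b.sort();
  }
  return a.every((key, i) => key === b[i]);
}
```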
The Edgar system features a dedicated pg-runner service which runs SQL queries against a cluster of replicated PostgreSQL databases (Figure 4). In terms of performance, SQL queries are orders of magnitude faster than compile-and-run programs. Execution speed is dominated by input/output operations. Program code executed in the sandbox must be saved to disk, compiled (with the resulting binary saved to disk), and then executed, with both compilers and compiled programs read from disk and the output saved to disk and then read back. Additionally, all of this is repeated for N test cases. SQL queries, in contrast, are received by a relational engine that is already running and executed against a database that might even be heavily cached in memory. We use a timeout of 2.5 s for SQL queries, but if they take longer than 200 ms, it usually means the query is not efficient, typically a Cartesian product. Pg-runner also truncates results to 1000 rows, since we do not usually write assignments that return relations with such high cardinality. One exception to this practice is the GIS database used in the “Advanced databases” graduate course, because GIS functions are much more resource demanding. While performance should not be a problem, SQL query evaluation faces a different problem—isolation. SQL statements that are part of a transaction are interleaved. When dealing only with SELECT statements, this is not a problem, especially in RDBMSs that use snapshot isolation protocols, such as PostgreSQL, where reads do not block reads or writes and vice versa. However, writes do block/lock writes, and evaluation of such assignments could be problematic. One such incident occurred during the second year of Edgar’s usage, after thousands of student assignments had already been successfully evaluated. The assignment called for a solution of two UPDATE statements which could be arbitrarily ordered. To boost performance, the teacher’s code and the student’s code were run in parallel, using two threads. This “feature” led to a specific situation where, very rarely, both queries ended up on the same database instance, leading to a deadlock. It was also very difficult to track down and reproduce this intermittent error. A simple restructuring of the code to execute the two queries serially solved the problem. In theory, this effect is still possible, but only if two different students synchronize their queries on the same question within milliseconds. Of course, a larger number of replicas further reduces the probability of this happening (we use 10 replicas).
Another possible solution to the isolation problem is for each student to have their own dedicated database copy. In PostgreSQL, database replicas can be elegantly created with just one statement from the template database. However, a mechanism for creating and deleting these personal databases would need to be established. Additionally, connection pooling could no longer be used for the pg-runner service because the connections would have to be “personal”. With over 500 students in the “Databases” undergraduate course, we decided on the more economical solution of 10 identical databases on a single server, which has proven successful over the past five years.
As a side note, it might be tempting to transfer the evaluation to the client computers. This is hard to achieve while keeping the generic approach to computer languages, but it is feasible if constrained to a single technology (e.g., the Java ecosystem). Our colleagues attempted this approach with Maven scripts that dynamically tested code. The students disassembled the scripts and decompiled jars to extract hidden tests, and even developed a tool to automate this process. Server-side evaluation and grading is the only consistent and secure approach to this problem.

3.3. Students’ Exam Application

The students’ exam application can have scalability issues when hundreds of students are simultaneously taking an exam. The first hotspot occurs when the exam is started and the initial HTML content is sent over the network. The sheer amount of traffic can cause issues, such as exceeding the number of worker connections or open file descriptors. At the DevOps level, this is handled by standard techniques such as custom system configuration, horizontal scaling, load-balancing entry points, and caching [40]. At the software level, it is beneficial to offload the traffic for various libraries to public CDN networks and to deliver the exam content incrementally, in stages. The Single Page Application (SPA) paradigm is suitable for the students’ exam application. Such a generic SPA setup is depicted in Figure 8. The initial request (solid line) commencing the exam is served through:
CDN—delivering various web libraries (fonts, JS libs, etc.), and
APAS server delivering the custom JavaScript SPA web application.
When the application is loaded in the client browser, this initial request is followed by a series of smaller requests (dashed line) fetching exam metadata and the first question. Additional questions are individually retrieved later as the students continue the exam. Questions should be cached to minimize network traffic and increase usability. Progressive web application technologies are useful in this scenario for tasks such as caching and offline work. Once the student goes through the entire exam and caches all questions, the student’s client can work without the server, except for the rapid feedback on programming questions, which must take place on the server side.
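A minimal, framework-agnostic sketch of such per-question caching is shown below; the endpoint URL is hypothetical, and Edgar’s Angular implementation differs in detail.

```typescript
// Cache each fetched question so revisiting it does not hit the server again.
const questionCache = new Map<number, Promise<unknown>>();

function getQuestion(examId: string, ordinal: number): Promise<unknown> {
  if (!questionCache.has(ordinal)) {
    // Store the promise itself so concurrent requests for the same question
    // are coalesced into a single network call.
    questionCache.set(
      ordinal,
      fetch(`/api/exam/${examId}/question/${ordinal}`) // hypothetical endpoint
        .then((res) => res.json())
        .catch((err) => {
          questionCache.delete(ordinal); // allow a retry after a failure
          throw err;
        })
    );
  }
  return questionCache.get(ordinal)!;
}
```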
In APAS Edgar, the exam application is implemented exactly as described, using the Angular 13 single-page application framework. Question caching has proven beneficial in situations where hundreds of students are simultaneously writing an exam at the faculty while connected to only a few wireless access points.

4. Optional Advanced APAS Modules

4.1. Static Analysis

While dynamic analysis checks program functionality and occasionally efficiency (CPU vs. memory), static analysis is used to assess program structure, complexity, and even readability and style. In static analysis, the program is not executed but parsed to construct an AST (Abstract Syntax Tree) representation, which is then typically matched against the correct AST. Static analysis has a wider purpose than dynamic analysis—it can be used to evaluate and grade, provide feedback, detect plagiarism, provide software metrics, etc. On the other hand, it is far less straightforward and is still a subject of active research [41,42,43]. In the context of scalability, static analysis is also much more resource demanding and slower than dynamic analysis. There are numerous tools for static analysis that focus on different aspects (e.g., security) of programs. To maintain the generic nature of an APAS, one would need to integrate multiple tools to support “all” programming languages. Some tools are faster than others; notably, SonarQube [44], the best-known tool in this domain, is notoriously slow: even the processing of tiny programs can take more than 10 s. Other tools, such as PMD [45] or ESLint [46], are very fast but support fewer programming languages and assessment features.

Static Analysis in APAS Edgar

In APAS Edgar, a project is currently underway to develop an optional standalone static analysis module that integrates multiple static analysis tools and merges the results into a unified format (e.g., Microsoft’s SARIF format [47]). Figure 9 shows the working architecture of the proposed module. To support immediate feedback and e-learning scenarios, our goal is to provide the analysis in less than “a few seconds”, which is a challenge with SonarQube. Be that as it may, a client can specify the acceptable waiting time as part of the request, which enables a more or less detailed analysis, supporting different usage scenarios. Additionally, our plan is to support partial results, so that the client can retrieve and display the results from, e.g., two out of three static analysis tools, and display the complete results of all three later when they have all finished. This will allow for a much better user experience.
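The sketch below illustrates how findings from several tools could be merged into a single SARIF-shaped log; only a small subset of SARIF 2.1.0 properties is shown, and the unified Finding structure is an assumption made for illustration, not the module’s final format.

```typescript
// Unified finding produced by any of the integrated static analysis tools.
interface Finding {
  tool: string;   // e.g., "ESLint", "PMD", "SonarQube"
  ruleId: string;
  message: string;
  file: string;
  line: number;
}

// Merge findings into a minimal SARIF 2.1.0-shaped log (one run per tool).
function toSarif(findings: Finding[]): object {
  const byTool = new Map<string, Finding[]>();
  for (const f of findings) {
    if (!byTool.has(f.tool)) byTool.set(f.tool, []);
    byTool.get(f.tool)!.push(f);
  }
  return {
    version: "2.1.0",
    runs: [...byTool.entries()].map(([tool, toolFindings]) => ({
      tool: { driver: { name: tool } },
      results: toolFindings.map((f) => ({
        ruleId: f.ruleId,
        message: { text: f.message },
        locations: [
          {
            physicalLocation: {
              artifactLocation: { uri: f.file },
              region: { startLine: f.line },
            },
          },
        ],
      })),
    })),
  };
}
```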

4.2. Plagiarism Detection

According to the renowned Fraud Triangle framework, developed by the criminologist D. Cressey in the 1970s, there are three conditions that lead to fraud: motivation, opportunity, and rationalization [48]. In our experience, when a student has a motivation to plagiarize the code and has the opportunity to do so without getting caught, or can come up with a justification for such behavior, they are more likely to commit such fraud. APASs certainly contribute significantly to the opportunity component of plagiarism, especially when exams are taken in non-supervised environments such as homes (which was the only option during the COVID-19 pandemic), and probably, to a lesser extent, to the other two (“everybody is doing it, I’d be disadvantaged to play fair”, etc.). Research shows the presence of cheating in the academic environment [49], and computer science students are no exception. APASs bring a new dimension to the methodology and technology of fraud, but also to fraud detection. The digital trail that remains when students take exams is the basis for machine detection of fraud.
It is hard to define what constitutes plagiarism, especially if one aspires to maintain the generic nature of the software. For instance, is it plagiarism to copy code from lectures, or from the lectures of some other course? Or publicly available and properly licensed code from some online repository? Certainly, code should be compared with the code of other students taking the exam, but the extent of the data used for plagiarism detection should be configurable: the entire history of that exam (previous academic years), course materials, online source repositories, Google searches, etc. The next issue is the level of detail used in the comparison, or, in other words, the comparison algorithms. Simple assignments with short solutions have a high similarity by definition [22,23] (see Figure 10 and Figure 11 in the following section). The approach based on string similarity algorithms to compare code is not universally applicable. In general, code similarity can be evaluated at the level of simple string similarity, or with more advanced static analysis algorithms, suitable for more extensive programs, that parse the code and generate tree structures that are compared to identify suspicious pairs [24]. Such static analysis checkers would, on the other hand, produce too many false positives for short code.
As a general guideline, plagiarism detection is very resource-intensive and should be implemented as a separate service with its own database. These resource requirements are even more pronounced if one tries to detect plagiarism in a near real-time synchronous paradigm—something we have not seen in any APAS, nor in the literature. We comment on this in more detail and propose a generic plagiarism detection module architecture in the following section.

Plagiarism Detection in APAS Edgar

Figure 10 shows a dendrogram produced in APAS Edgar for the fourth laboratory exercise in the course “Introduction to programming”. The normalized Levenshtein distance is used to construct the dendrogram. The lab had 751 students divided into six groups taking different questions in two-hour slots from 8 AM to 9 PM. For the most part, dissimilarity is below 0.3, i.e., programs are 70%+ similar. However, some do stand out, for instance, the pair 8616 + 9049, manually marked with a red rectangle. Details for that pair are shown in Figure 11.
Currently, APAS Edgar provides integrated plagiarism support only for string comparison algorithms, while static analysis is not integrated into the web application; however, teachers can download source code submissions for review, formatted as input to the free tool JPlag [50]. The string similarity approach is faster, but even then, computing the similarity between 700 submissions requires approximately 0.5 million comparisons. Currently, this computation is performed by teachers on demand, which slows down the web application. Furthermore, Edgar’s web application is implemented in Node.js, which does not handle CPU-intensive tasks well: such tasks block the event loop and effectively monopolize the web application server. Another objection to the current implementation relates to usability. Teachers must manually trigger plagiarism detection after each exam. In addition, each exam is scanned individually; no data are related across students and exams at the course level. In this regard, we propose another solution that we are currently working on:
  • The plagiarism detection module should be a configurable stand-alone service that continuously scans for potential cheaters and creates an integrated “plagiarism profile” for each student enrolled in the course.
  • Profiles, as well as standardized reports, will be available to teachers on demand, but the module will also allow notifications to be sent on high fraud scores.
  • The configuration will include the definition of the algorithms, the scope of the corpus used for plagiarism detection (exam, course, previous years, assigned datasets, internet sources), and the working hours of the module (e.g., 00–06) or CPU thresholds (e.g., 10 min of less than X% CPU).
  • Whether the plagiarism profile is visible to the student will also be configurable. An interesting research question is whether such profiles, which include a clear message that the code is being checked for plagiarism, help reduce plagiarism (and to what extent).
In other words, from a usability standpoint, in the proposed scheme a teacher will simply decide whether to continuously run plagiarism detection in their course, set the parameters, and review the results or be notified periodically by email.
Furthermore, because APAS Edgar uses the immediate feedback paradigm, it is able to check for plagiarism even before the student submits the exam. This innovative approach has the great advantage of detecting fraudulent students and catching them in the act. To the best of our knowledge (addressed in the Discussion section), none of the existing APASs detect fraudulent behavior during the exam. During the exam, students typically rework and rerun their answers until they receive positive feedback from the system. Once students receive positive feedback, they usually hurry to complete the other assignments, etc., until the final submission. Fraudulent or dishonest students deviate from this typical behavior and leave behind suspicious activity patterns, for example:
  • Multiple consecutive runs of the correct code, with the code changing slightly—the student is not motivated to change the successful code except maybe to cover their tracks by changing variable names, etc.
  • Correct code, followed by incorrect code—again, similar pattern, probably the result of covering tracks.
  • Incorrect code, followed by the significantly different correct code—the student might have acquired the correct solution and pasted the correct code, etc.
Such near real-time analysis is very resource demanding and must be detached from the main web application to avoid compromising the writing of the exam. Every time code is executed, it should be compared to all other code runs from other students, if possible. The first code run must be compared to 0 others, the second to 1, the third to 2, etc. This adds up to (m*n)*((m*n) − 1)/2, i.e., O((n*m)²) comparisons, where n is the number of students and m is the average number of code runs. For example, if 500 students write an exam and each student runs the code for a question an average of 10 times before turning it in, the total number of comparisons is approximately 12.5 million. If a single comparison takes, say, 50 ms, that equates to 173.6 h of processing time. Clearly, if the comparison is exhaustive, the processing time will escalate quickly, and the user will likely have to choose between thoroughness and timeliness if the comparison begins to lag behind the code runs.
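To make the cost of a single comparison concrete, the sketch below shows a normalized Levenshtein distance and the naive pairwise loop; the proposed worker cluster would batch and parallelize this work rather than running it in one process.

```typescript
// Normalized Levenshtein distance in [0, 1]; 0 means identical code.
function normalizedLevenshtein(a: string, b: string): number {
  const m = a.length, n = b.length;
  if (m === 0 && n === 0) return 0;
  // Two-row dynamic programming keeps memory proportional to one row.
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = new Array<number>(n + 1);
    curr[0] = i;
    for (let j = 1; j <= n; j++) {
      curr[j] = Math.min(
        prev[j] + 1,                                  // deletion
        curr[j - 1] + 1,                              // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
    prev = curr;
  }
  return prev[n] / Math.max(m, n);
}

// Every new code run is compared against all previously seen runs:
// k runs in total cost k * (k - 1) / 2 comparisons.
function pairwiseComparisons(runs: string[]): number {
  let comparisons = 0;
  for (let i = 1; i < runs.length; i++) {
    for (let j = 0; j < i; j++) {
      normalizedLevenshtein(runs[i], runs[j]);
      comparisons++;
    }
  }
  return comparisons; // equals runs.length * (runs.length - 1) / 2
}
```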
Figure 12 shows the prototype of the visualization application that is currently in development. Students are on the x-axis, and time is on the y-axis. Red dots on the horizontal lines represent code runs, and gray lines represent similarities above the selected threshold.
The plagiarism detection module should be self-contained and have its own database for storage and retrieval. We propose the architecture in Figure 13, which makes this module usable by any system. The module must allow course configuration (minimal data on students, algorithms, boundaries, etc.) and optional uploading of corpus data (lecture material, code from previous years, etc.). The corpus could also be enriched by scraping assigned URLs, e.g., students in our department have maintained a public server of solved assignments.
The plagiarism detection module should support two discovery modes:
  • Offline or queued plagiarism detection, which continuously searches for plagiarism during the configured working hours, and
  • Near real-time plagiarism detection. APAS systems should only forward the code to the module’s API, which can be done even after they have responded to their clients, so there is minimal load on the production APAS. The main service stores only the code, to conserve resources and remain available to receive new code runs. A separate job generator, operating in near real time, expands the newly received data into jobs that are queued for processing. A worker cluster then retrieves the jobs, computes code similarities, and stores them back. Such a setup of stateless workers facilitates scaling because most of the work is processed by the workers. Simultaneously, a subset of the data about processed jobs is replicated asynchronously to a graph database to support graph queries for elegant detection of cheating patterns—the suspicious patterns listed above can all be expressed as Cypher queries.

5. Non-Resource-Intensive APAS Modules

The remaining modules from Figure 1, denoted with a white background, do not place a significant load on the system. The teacher application has orders of magnitude fewer users than the students’ application. In APAS Edgar, it is currently implemented as a classic multi-page application, but we plan to convert it to an SPA, mainly because of the large question-editing form, which contains several CodeMirror components that lag on initialization when the page is refreshed.
Exam statistics and visualization are marked as optional in Figure 1, although one could argue that they are necessary in any APAS. Teachers who want to improve their courses should reflect on all exams and periodically prune and refresh the question catalog (e.g., see Figure 6 in [23]). In addition, statistics should be made available to students in a clear and informative manner. Figure 14 shows the teacher’s exam analytics overview report with selectable boxplot diagrams, and Figure 15 shows the student statistics page.
Binary storage should be implemented as a separate service to enable horizontal scaling. Initially, APAS Edgar stored binary data (e.g., images, documents, and zip files uploaded by students) in the file system. As the system grew, the web application was horizontally scaled, and for a while we had to use an NFS share so that data uploaded on one web server would be visible (downloadable) on another. Finally, binary storage was extracted into a separate service using the open-source, high-performance, S3-compatible storage MinIO, leaving file system storage as an option [51].
In terms of logging and monitoring, we should distinguish the following:
  • Logging and monitoring exam writing process, which must be custom developed within the APAS. This entails logging students’ actions during the exam, network client information, etc., and a corresponding monitoring application that teachers can use during the exam to monitor and gain insight about the process.
  • Logging and collecting/monitoring distributed web application logs. For this part, APAS Edgar used the ELK stack, but we have since downscaled (to reduce the complexity of the system) to a custom solution that stores the logs in the MongoDB database and makes them visible in the web application menu.
  • Logging and monitoring operating systems (CPU, memory, disk, network) of all servers in the APAS ecosystem. For this, we use the Prometheus [52] and Grafana [53] combination. Besides out-of-the-box OS metrics, we have added metrics to both pg-runner and code-runner so that we can monitor important exam parameters in real time, such as the SubmittedJobs and WaitingRoom queues (a minimal sketch of exposing such metrics is given below).
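A minimal sketch of exposing such queue metrics from a Node.js runner service with the prom-client library is given below; the metric names and port are illustrative, not Edgar’s actual configuration.

```typescript
import express from "express";
import client from "prom-client";

// Gauges that mirror the code-runner's internal queues; Prometheus scrapes
// the /metrics endpoint and Grafana visualizes the values during exams.
const submittedJobsGauge = new client.Gauge({
  name: "code_runner_submitted_jobs",
  help: "Jobs currently submitted to the Judge0 cluster",
});
const waitingRoomGauge = new client.Gauge({
  name: "code_runner_waiting_room",
  help: "Jobs parked in the waiting-room priority queue",
});

// Elsewhere in the service, update the gauges whenever the queues change, e.g.:
// submittedJobsGauge.set(submittedJobs.size);
// waitingRoomGauge.set(waitingRoom.length);

const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(9100);
```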

6. Discussion and Related Work

We reviewed the existing APAS software and literature with regard to the APAS modules identified in Section 2. The APASs reviewed and the characteristics described are listed in alphabetical order in Table 1. The columns Storage and Dynamic analysis correspond to the core modules described in Section 3, while the columns Static analysis and Plag detection correspond to the optional modules described in Section 4. The additional columns Supported PL and Response further describe each specific APAS.
The list of APASs considered is mostly based on a recent paper [24] that surveys the state of the art in the automated assessment of CS assignments. Starting with the 30 tools reviewed in that paper, we ended up with 21 web-based APASs. Seven tools were excluded as we were only interested in web-based platforms: two web services, two cloud-based services, one Moodle plugin, one tool, and one Java library. Moodle is evaluated together with its plugins Virtual Programming Lab (VPL) and CodeRunner.
Detailed information about the architecture of some of the APASs we studied, including the type of storage used, was difficult to find in publications, demo versions of the systems, or on the projects’ websites. For features for which we could not find enough data to clarify whether and how they are supported, we used a blank (meaning “No Information”). Structured data, such as those about students, teachers, courses, enrollments, or exams, are usually stored in relational databases (primarily MySQL and PostgreSQL). In contrast, unstructured data, e.g., logs of submitted student solutions or received feedback, are usually stored in a file system, either on a computer in a local network or in the cloud.
In the context of dynamic analysis, the most common and effective form of evaluating the functionality of a program, we distinguish the following methods: output comparison (OC) and the use of testing tools or unit testing (UT). The more traditional output comparison checks whether a program works according to specified requirements by repeatedly executing the program with a predefined set of inputs and comparing the resulting output with the expected output. The use of tools to test the functionality of the program, which has been widely accepted in professional software development and eventually adopted in APASs, applies to almost all programming languages and application domains. Among the most widely used is the xUnit family of language-specific testing frameworks (e.g., CUnit for C and JUnit for Java). The main drawback of the output comparison method is that even a single character difference between the expected and student output causes the solution to be evaluated as completely wrong. Compared to the output comparison (OC) method, unit testing (UT) offers higher evaluation precision and the ability to evaluate classes, methods, and even statements. As can be seen from Table 1, the prevailing method of dynamic analysis used in today’s APASs is output comparison, although several APASs support both methods (CodeOcean, Submitty, and Web-CAT).
Static code analysis and plagiarism detection are typically handled by external tools, such as CheckStyle [54] (used in Web-CAT, Testovid, and TestMyCode) or JPlag [50], or achieved with custom scripts focused mainly on static code analysis. The scripts are implemented using different scripting languages, and the common restriction is that they must return a grade and/or feedback. Unfortunately, the columns for static analysis and plagiarism detection for many analyzed APASs were left empty due to the already mentioned reasons—the absence of a publication describing the system’s features or of a publicly available version of the software to test. Most of these tools support multiple programming languages, whereas the rest target C, Java, or Python.
Regarding feedback, which is essential for student learning and assessment, we distinguish between two types: immediate and delayed. Delayed feedback is a consequence of clumsy legacy system architectures rather than the intended and desired behavior of an APAS. It is usually implemented so that students submit their solutions, which are evaluated at regular intervals (the usual frequency is one hour), generating feedback that is subsequently available to students. An exception is the Testovid system, which provides immediate feedback for some tests and delayed feedback for others.
In conclusion, to the best of our knowledge, there is no integrated generic static analysis solution as proposed in Section 4.1., which would allow instant feedback in an immediate paradigm. Additionally, no near real-time plagiarism detection tool (especially a standalone one) exists that leverages exam logs and runtime code execution results to suggest foul play, as outlined in Section 4.2.

7. Conclusions

Automated Programming Assessment Systems are widely used in computer science and engineering education and are becoming increasingly important. However, APASs are implemented as web applications, and special care must be taken to ensure their availability and scalability, especially when using a synchronous (instant feedback) paradigm.
After six years of experience in developing and using APAS Edgar, we have identified a set of mandatory and optional APAS modules with respect to their non-functional requirements and present them in this paper. The presented modules are grouped by topic and scalability requirements. We have listed the most common pitfalls, commented on them, and proposed scalable architectures as well as architectural building blocks and patterns for these modules. Two optional advanced modules stand out in terms of complexity and resource requirements: static analysis and the plagiarism detection module. There are many implementations for these two optional advanced modules in the literature and in existing APASs. Nevertheless, we advocate implementing them in a more generic form so that they can be used by third parties. Finally, we propose generic architectures for these modules, comment on the challenges, and outline our plans to address them in our future implementation.

Author Contributions

Conceptualization, I.M. and L.B.; methodology, I.M. and L.B.; software, I.M.; validation, L.B., I.M. and M.H.; writing—original draft preparation, I.M., L.B. and M.H.; writing—review and editing, I.M., L.B. and M.H.; supervision, I.M.; project administration, I.M.; funding acquisition, I.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Striewe, M. An architecture for modular grading and feedback generation for complex exercises. Sci. Comput. Program. 2016, 129, 35–47. [Google Scholar] [CrossRef]
  2. Pietrikova, E.; Juhár, J.; Šťastná, J. Towards automated assessment in game-creative programming courses. In Proceedings of the 2015 13th International Conference on Emerging eLearning Technologies and Applications (ICETA), Stary Smokovec, Slovakia, 26–27 November 2016. [Google Scholar] [CrossRef]
  3. Paiva, J.C.; Leal, J.P.; Queirós, R.A. Enki: A pedagogical services aggregator for learning programming languages. In Proceedings of the Innovation and Technology in Computer Science Education Conference, ITiCSE, Arequipa, Peru, 11–13 July 2016; pp. 332–337. [Google Scholar] [CrossRef]
  4. Krugel, J.; Hubwieser, P.; Goedicke, M.; Striewe, M.; Talbot, M.; Olbricht, C.; Schypula, M.; Zettler, S. Automated measurement of competencies and generation of feedback in object-oriented programming courses. In Proceedings of the 2020 IEEE Global Engineering Education Conference (EDUCON), Porto, Portugal, 27–30 April 2020; pp. 329–338. [Google Scholar] [CrossRef]
  5. Petit, J.; Roura, S.; Carmona, J.; Cortadella, J.; Duch, A.; Gimenez, O.; Mani, A.; Mas, J.; Rodriguez-Carbonella, E.; Rubio, A.; et al. Jutge.org: Characteristics and Experiences. IEEE Trans. Learn. Technol. 2018, 11, 321–333. [Google Scholar] [CrossRef]
  6. Enstrom, E.; Kreitz, G.; Niemela, F.; Soderman, P.; Kann, V. Five years with Kattis—Using an automated assessment system in teaching. In Proceedings of the 2011 Frontiers in Education Conference (FIE), Rapid City, SD, USA, 12–15 October 2011; pp. 1–6. [Google Scholar] [CrossRef]
  7. Liu, X.; Woo, G. Applying Code Quality Detection in Online Programming Judge. In Proceedings of the 2020 5th International Conference on Intelligent Information Technology, Hanoi, Vietnam, 19–22 February 2020; pp. 56–60. [Google Scholar] [CrossRef]
  8. Yu, Y.; Tang, C.; Poon, C. Enhancing an automated system for assessment of student programs using the token pattern approach. In Proceedings of the 2017 IEEE 6th International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Hong Kong, China, 12–14 December 2017; pp. 406–413. [Google Scholar] [CrossRef]
  9. Peveler, M.; Tyler, J.; Breese, S.; Cutler, B.; Milanova, A. Submitty: An Open Source, Highly-Configurable Platform for Grading of Programming Assignments (Abstract Only). In Proceedings of the 48th ACM Technical Symposium on Computer Science Education, Seattle, WA, USA, 8–11 March 2017; p. 641. [Google Scholar]
  10. Pärtel, M.; Luukkainen, M.; Vihavainen, A.; Vikberg, T. Test my code. Int. J. Technol. Enhanc. Learn. 2013, 5, 271–283. [Google Scholar] [CrossRef]
  11. Vesin, B.; Klasnja-Milicevic, A.; Ivanovic, M. Improving testing abilities of a programming tutoring system. In Proceedings of the 2013 17th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 11–13 October 2013; pp. 669–673. [Google Scholar] [CrossRef]
  12. Bez, J.L.; Tonin, N.A.; Rodegheri, P.R. URI Online Judge Academic: A tool for algorithms and programming classes. In Proceedings of the 2014 9th International Conference on Computer Science & Education, Vancouver, BC, Canada, 22–24 August 2014; pp. 149–152. [Google Scholar] [CrossRef]
  13. Hu, Y.; Ahmed, U.Z.; Mechtaev, S.; Leong, B.; Roychoudhury, A. Re-factoring based program repair applied to programming assignments. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; pp. 388–398. [Google Scholar] [CrossRef]
  14. Edwards, S.H.; Perez-Quinones, M.A. Web-CAT: Automatically grading programming assignments. In Proceedings of the 13th Annual Conference on Innovation and Technology in Computer Science Education, Madrid, Spain, 30 June–2 July 2008; p. 328. [Google Scholar] [CrossRef]
  15. Robinson, P.E.; Carroll, J. An online learning platform for teaching, learning, and assessment of programming. In Proceedings of the IEEE Global Engineering Education Conference (EDUCON), Athens, Greece, 25–28 April 2017; pp. 547–556. [Google Scholar] [CrossRef]
  16. Lobb, R.; Harlow, J. Coderunner: A tool for assessing computer programming skills. ACM Inroads 2016, 7, 47–51. [Google Scholar] [CrossRef]
  17. Vander Zanden, B.; Berry, M.W. Improving Automatic Code Assessment. J. Comput. Sci. Coll. 2013, 29, 162–168. [Google Scholar]
  18. Brito, M.; Goncalves, C. Codeflex: A web-based platform for competitive programming. In Proceedings of the 14th Iberian Conference on Information Systems and Technologies (CISTI), Coimbra, Portugal, 19–22 June 2019; pp. 19–22. [Google Scholar] [CrossRef]
  19. Benetti, G.; Roveda, G.; Giuffrida, D.; Facchinetti, T. Coderiu: A cloud platform for computer programming e-learning. In Proceedings of the IEEE 17th International Conference on Industrial Informatics (INDIN), Helsinki, Finland, 22–25 July 2019; pp. 1126–1132. [Google Scholar] [CrossRef]
  20. Staubitz, T.; Klement, H.; Teusner, R.; Renz, J.; Meinel, C. CodeOcean—A versatile platform for practical programming excercises in online environments. In Proceedings of the IEEE Global Engineering Education Conference (EDUCON), Abu Dhabi, United Arab Emirates, 10–13 April 2016; pp. 314–323. [Google Scholar] [CrossRef]
  21. Edwards, S.H.; Murali, K.P. CodeWorkout: Short programming exercises with built-in data collection. In Proceedings of the Innovation and Technology in Computer Science Education, ITiCSE, Bologna, Italy, 3–5 July 2017; pp. 188–193, Part F1286. [Google Scholar] [CrossRef]
  22. Duch, P.; Jaworski, T. Dante—Automated assessments tool for students’ programming assignments. In Proceedings of the 2018 11th International Conference on Human System Interaction (HIS’18), Gdansk, Poland, 4–6 July 2018; IEEE: New York, NY, USA, 2018; pp. 162–168. [Google Scholar] [CrossRef]
  23. Mekterovic, I.; Brkic, L.; Milasinovic, B.; Baranovic, M. Building a Comprehensive Automated Programming Assessment System. IEEE Access 2020, 8, 81154–81172. [Google Scholar] [CrossRef]
  24. Paiva, J.C.; Leal, J.P.; Figueira, Á. Automated Assessment in Computer Science Education: A State-of-the-Art Review. ACM Trans. Comput. Educ. 2022, 22, 1–40. [Google Scholar] [CrossRef]
  25. Shivakumar, S.K. Architecting High Performing, Scalable and Available Enterprise Web Applications. In Architecting High Performing, Scalable and Available Enterprise Web Applications; Morgan Kaufmann: Burlington, MA, USA, 2014; pp. 1–265. [Google Scholar] [CrossRef]
  26. Ullah, F.; Babar, M.A. On the scalability of Big Data Cyber Security Analytics systems. J. Netw. Comput. Appl. 2022, 198, 103294. [Google Scholar] [CrossRef]
  27. Altalhi, A.H.; Al-Ghamdi, A.A.-M.; Ullah, Z.; Saleem, F. Developing a framework and algorithm for scalability to evaluate the performance and throughput of CRM systems. Intell. Autom. Soft Comput. 2016, 23, 149–152. [Google Scholar] [CrossRef]
  28. Zhu, J.; Patros, P.; Kent, K.B.; Dawson, M. Node.js scalability investigation in the cloud. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering (CASCON 2018), Markham, ON, Canada, 29–31 October 2018; pp. 201–212. Available online: https://researchcommons.waikato.ac.nz/handle/10289/12862 (accessed on 2 December 2022).
  29. Xie, J.; Yu, F.R.; Huang, T.; Xie, R.; Liu, J.; Liu, Y. A Survey on the Scalability of Blockchain Systems. IEEE Netw. 2019, 33, 166–173. [Google Scholar] [CrossRef]
  30. Edgar-Group—GitLab. Available online: https://gitlab.com/edgar-group/ (accessed on 2 December 2022).
  31. Combéfis, S. Automated Code Assessment for Education: Review, Classification and Perspectives on Techniques and Tools. Software 2022, 1, 3–30. [Google Scholar] [CrossRef]
  32. Souza, D.M.; Felizardo, K.R.; Barbosa, E.F. A systematic literature review of assessment tools for programming assignments. In Proceedings of the 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), Dallas, TX, USA, 5–6 April 2016; pp. 147–156. [Google Scholar] [CrossRef]
  33. Keuning, H.; Jeuring, J.; Heeren, B. A Systematic Literature Review of Automated Feedback Generation for Programming Exercises. ACM Trans. Comput. Educ. 2018, 19, 1–43. [Google Scholar] [CrossRef]
  34. Croft, D.; England, M. Computing with CodeRunner at Coventry University: Automated summative assessment of Python and C++ code. In Proceedings of the 4th Conference on Computing Education Practice (CEP’20), Durham, UK, 9 January 2020. [Google Scholar] [CrossRef]
  35. Gilbert, S.; Lynch, N. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM Sigact News 2002, 33, 51–59. [Google Scholar] [CrossRef]
  36. Sadalage, P.J.; Fowler, M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence; Pearson Education: London, UK, 2012. [Google Scholar]
  37. Mekterović, I.; Brkić, L. Edgar’s Database Schema Dump v.1.3.7. 2017. Available online: https://gitlab.com/edgar-group/edgar/-/blob/master/db/db-schema/dumps/edgar-schema-dump-v1-3-7.sql (accessed on 9 February 2023).
  38. Dosilovic, H.Z.; Mekterovic, I. Robust and scalable online code execution system. In Proceedings of the 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 28 September–2 October 2020; pp. 1627–1632. [Google Scholar] [CrossRef]
  39. Mareš, M.; Blackham, B. A New Contest Sandbox. Olymp. Inform. 2012, 6, 113–119. Available online: https://ioi.te.lv/oi/pdf/INFOL094.pdf (accessed on 23 November 2022).
  40. Varnish HTTP Cache—Varnish HTTP Cache. Available online: https://varnish-cache.org/ (accessed on 29 November 2022).
  41. Baldoni, R.; Coppa, E.; D’Elia, D.C.; Demetrescu, C.; Finocchi, I. A survey of symbolic execution techniques. ACM Comput. Surv. 2018, 51, 1–39. [Google Scholar] [CrossRef]
  42. Do, L.N.Q.; Wright, J.R.; Ali, K. Why Do Software Developers Use Static Analysis Tools? A User-Centered Study of Developer Needs and Motivations. IEEE Trans. Softw. Eng. 2020, 48, 835–847. [Google Scholar] [CrossRef]
  43. Rahman, A.; Rahman, R.; Parnin, C.; Williams, L. Security Smells in Ansible and Chef Scripts. ACM Trans. Softw. Eng. Methodol. 2021, 30, 1–31. [Google Scholar] [CrossRef]
  44. Code Quality and Code Security|SonarQube. Available online: https://www.sonarqube.org/ (accessed on 6 December 2022).
  45. PMD. Available online: https://pmd.github.io/ (accessed on 6 December 2022).
  46. Find and Fix Problems in Your JavaScript Code—ESLint—Pluggable JavaScript Linter. Available online: https://eslint.org/ (accessed on 6 December 2022).
  47. SARIF Home. Available online: https://sarifweb.azurewebsites.net/ (accessed on 19 January 2023).
  48. Sutherland, E.H.; Cressey, D.R.; Luckenbill, D.F. Principles of Criminology. 1992. Available online: https://books.google.com/books?hl=hr&lr=&id=JVB3AAAAQBAJ&oi=fnd&pg=PP1&ots=mOPdm7zPB9&sig=64VSdr3R8j_ksaO3yykzJI4cIkw (accessed on 6 December 2022).
  49. Albluwi, I. Plagiarism in Programming Assessments. ACM Trans. Comput. Educ. 2019, 20, 1–28. [Google Scholar] [CrossRef]
  50. Prechelt, L.; Malpohl, G.; Philippsen, M. Finding Plagiarisms among a Set of Programs with JPlag. Available online: http://www.jplag.de (accessed on 30 November 2022).
  51. MinIO|High Performance, Kubernetes Native Object Storage. Available online: https://min.io/ (accessed on 1 December 2022).
  52. Prometheus—Monitoring System & Time Series Database. Available online: https://prometheus.io/ (accessed on 1 December 2022).
  53. Grafana: The Open Observability Platform|Grafana Labs. Available online: https://grafana.com/ (accessed on 1 December 2022).
  54. Checkstyle—Checkstyle 10.5.0. Available online: https://checkstyle.sourceforge.io/ (accessed on 8 December 2022).
Figure 1. Typical APAS modules. Solid lines outline core modules and dashed lines group optional modules. Shaded modules represent scalability challenges.
Figure 2. Code execution times for programs written in the C programming language on a midterm exam in the course “Introduction to programming”, executed across three servers. The total median time is 0.95 s.
Figure 3. APAS Edgar’s code runner architecture: outsourced code runners abstract the programming language/paradigm and balance the load.
Figure 4. Java code execution times during the “Object-oriented programming” exam scalability incident in March 2020. Submissions queued, and evaluation times escalated and timed out.
Figure 5. APAS Edgar’s server-side pushback: code-runner internals showing the WaitingRoom priority queue used when there are more than MAX_ACTV_SUBMISSIONS active submissions. The diagram shows successful (200) and Service Unavailable (503) response outcomes.
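The pushback outcome shown in Figure 5 can be approximated with an Express-style endpoint such as the sketch below. This is a simplified illustration rather than Edgar's actual code-runner implementation: the WaitingRoom priority queue is omitted (requests over the limit are rejected immediately), and the capacity limit and the evaluateSubmission placeholder are assumptions.

```typescript
import express from "express";

const app = express();
app.use(express.json());

const MAX_ACTV_SUBMISSIONS = 50; // illustrative capacity limit
let activeSubmissions = 0;

// Placeholder for the real evaluation pipeline (compile, run, compare output).
async function evaluateSubmission(code: string): Promise<{ score: number }> {
  return { score: 0 };
}

app.post("/run", async (req, res) => {
  if (activeSubmissions >= MAX_ACTV_SUBMISSIONS) {
    // Server-side pushback: refuse new work instead of letting queues explode.
    res.status(503).json({ error: "Pushback (service unavailable)" });
    return;
  }
  activeSubmissions++;
  try {
    const result = await evaluateSubmission(req.body.code);
    res.status(200).json(result); // the successful (200) outcome from Figure 5
  } catch {
    res.status(500).json({ error: "Evaluation failed" });
  } finally {
    activeSubmissions--;
  }
});

app.listen(3000);
```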
Figure 6. Exam in APAS Edgar with both client-side throttling and server-side pushback. The button that previously read Run has turned into a disabled countdown (59 s). Additionally, the last run resulted in a “Not executed” message with the hint “Pushback (service unavailable)”.
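The client-side throttling from Figure 6 amounts to disabling the run control and counting down before re-enabling it. The browser-side sketch below is generic (the element id "run-btn" and the 60 s cooldown are assumptions) and is not Edgar's front-end code.

```typescript
const RUN_COOLDOWN_S = 60; // assumed cooldown between two code runs

// Disables the Run button and shows a countdown until the next run is allowed.
function throttleRunButton(button: HTMLButtonElement): void {
  const originalLabel = button.textContent ?? "Run";
  let remaining = RUN_COOLDOWN_S;
  button.disabled = true;

  const timer = window.setInterval(() => {
    remaining--;
    button.textContent = `${remaining} s`; // e.g., "59 s" as in Figure 6
    if (remaining <= 0) {
      window.clearInterval(timer);
      button.textContent = originalLabel;
      button.disabled = false;
    }
  }, 1000);
}

// Usage: throttle immediately after a run request is sent.
document.getElementById("run-btn")?.addEventListener("click", (e) => {
  throttleRunButton(e.currentTarget as HTMLButtonElement);
  // ...send the submission to the code runner here...
});
```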
Figure 7. The list of SQL query comparison options in APAS Edgar.
Figure 8. The diagram shows the student exam SPA. The initial request is served through both the CDN and the APAS server. Subsequent requests (dashed) are delayed and are served only by the APAS server. Question one is cached.
Figure 9. Proposed future high-level static analyzer service architecture. From the scalability perspective, it is beneficial that these tools are stateless and can thus be horizontally scaled. The entire module is on the dedicated server(s) to avoid a negative performance impact on the main application.
Figure 10. Code similarity dendrogram produced in APAS Edgar for the fourth laboratory in the course “Introduction to programming” during the academic year 2022/2023. The normalized Levenshtein distance is used as a similarity measure. There were six different questions. Note that most dissimilarities are below 0.3, i.e., most submissions are 70%+ similar.
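For reference, the normalized Levenshtein distance used in Figures 10 and 11 can be computed with a standard dynamic-programming routine such as the sketch below; whitespace is stripped before the distance is taken, and the result lies in [0, 1], where 0 means identical submissions.

```typescript
// Classic dynamic-programming Levenshtein edit distance (single-row variant).
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];    // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,       // deletion
        dp[j - 1] + 1,   // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalized distance in [0, 1]; whitespace is ignored, as in Figure 11.
function normalizedCodeDistance(codeA: string, codeB: string): number {
  const a = codeA.replace(/\s+/g, "");
  const b = codeB.replace(/\s+/g, "");
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 0 : levenshtein(a, b) / maxLen;
}
```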
Figure 11. Various code similarity measures for the 8616 + 9049 pair from Figure 10. The middle section outlines the differences. Whitespace is ignored when the distance is calculated.
Figure 12. Near real-time plagiarism detection visualization client in development. Students (IDs) are on the ordinate axis, and code run time is on the abscissa. Dots represent code runs (red—incorrect; green—correct). Lines represent similarities above the selected threshold. Similarities can be selected and drilled down to show only a subset of connected dots.
Figure 13. The diagram of the proposed plagiarism detection module architecture. A document database is used as the primary data store, and a subset of data is reflected in the graph database to support graph queries for fraud patterns. Separate worker clusters are needed for near real-time plagiarism detection and offline (queued) plagiarism detection.
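To illustrate what a graph query for fraud patterns from Figure 13 might look like, the sketch below issues a Cypher query through the neo4j-driver package; the connection details, node labels, relationship types, property names, and the two-minute window are entirely hypothetical and do not describe the proposed module's actual schema.

```typescript
import neo4j from "neo4j-driver";

// Connection details and the graph schema below are assumptions for illustration.
const driver = neo4j.driver("bolt://localhost:7687", neo4j.auth.basic("neo4j", "password"));

// Find pairs of students whose exam submissions are highly similar AND were
// submitted within two minutes of each other -- a typical foul-play pattern.
const FRAUD_PATTERN = `
  MATCH (a:Student)-[:SUBMITTED]->(s1:Submission),
        (b:Student)-[:SUBMITTED]->(s2:Submission),
        (s1)-[sim:SIMILAR_TO]->(s2)
  WHERE sim.score >= $threshold
    AND abs(s1.ts - s2.ts) < 120000
  RETURN a.id AS studentA, b.id AS studentB, sim.score AS similarity
  ORDER BY similarity DESC
`;

async function findSuspiciousPairs(threshold: number) {
  const session = driver.session();
  try {
    const result = await session.run(FRAUD_PATTERN, { threshold });
    return result.records.map(r => ({
      studentA: r.get("studentA"),
      studentB: r.get("studentB"),
      similarity: r.get("similarity"),
    }));
  } finally {
    await session.close();
  }
}
```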
Figure 14. Teacher’s exam analytics report for the Databases course. The final exam is chosen by clicking on the boxplot. The lower part shows the details of the chosen exam.
Figure 15. Student’s statistics dashboard in APAS Edgar. The header shows overall measures, the middle part shows statistics for the selected exam (note the dropdown), the left chart shows the population, and the right shows the score per question compared with the population. The bottom part shows the overall score distribution of all students in the course to the left and the score per exam compared with the population on the right.
Table 1. APASs reviewed in context of core and optional characteristics.

Tool | Refs | Storage | Dynamic Analysis | Static Analysis | Plag. Detection | Supported PL | Feedback
Arena | [2] | | UT | | | C, Java | Delayed
CloudCoder | [13] | MySQL | OC | | | C/C++, Java, Python, Ruby |
CodeAssessor | [17] | MySQL | OC | | | C/C++, Java, Python, Ruby | Instant
Codeflex | [18] | MySQL | OC | | | C++, C#, Java, Python | Instant
Coderiu | [19] | File Storage Service (FSS) | UT | | | Multi | Delayed
CodeOcean | [20] | PostgreSQL | OC and UT | | | Multi | Instant
CodeWorkout | [21] | MySQL | UT | | | C++, Java, Python, Ruby | Instant
Dante | [22] | Relational | OC | | | C | Delayed
Edgar | [23] | PostgreSQL, MongoDB | OC | | | Multi | Instant
Enki (Mooshak) | [3] | | OC | | | Multi | Instant
JACK | [1,4] | Relational | OC | | | Multi | Instant and Delayed
Jutge.org | [5] | | OC | | | Multi | Instant
Kattis | [6] | PostgreSQL, File system | OC | | | Multi | Instant
Moodle+VPL+CodeRunner | [15,16] | Any relational | OC | | ✓ in VPL | Multi | Instant
neoESPA | [7] | File system | OC | | | C/C++, Java, Python | Instant
PASS | [8] | | OC | | | C/C++, Java | Instant
Submitty | [9] | PostgreSQL | OC and UT | | | Multi | Instant
TestMyCode | [10] | Relational | UT | | | Multi | Instant
Testovid within Protus | [11] | | UT | | | Multi | Instant and Delayed
URI Online Judge | [12] | Relational | OC | | | Multi |
Web-CAT | [14] | Relational | OC and UT | | | Multi |