PyFlies: A Domain-Speciﬁc Language for Designing Experiments in Psychology

) Abstract: The majority of studies in psychology are nowadays performed using computers. In the past, access to good quality software was limited, but in the last two decades things have changed and today we have an array of good and easily accessible open-source software to choose from. However, experiment builders are either GUI-centric or based on general-purpose programming languages which require programming skills. In this paper, we investigate an approach based on domain-speciﬁc languages which enables a text-based experiment development using domain-speciﬁc concepts, enabling practitioners with limited or no programming skills to develop psychology tests. To investigate our approach, we created PyFlies, a domain-speciﬁc language for designing experiments in psychology, which we present in this paper. The language is tailored for the domain of psychological studies. The aim is to capture the essence of the experiment design in a concise and highly readable textual form. The editor for the language is built as an extension for Visual Studio Code, one of the most popular programming editors today. From the experiment description, various targets can be automatically produced. In this version, we provide a code generator for the PsychoPy library while generators for other target platforms are planned. We discuss the language, its concepts, syntax, some current limitations, and development directions. We investigate the language using a case study of the implementation of the Eriksen ﬂanker task.


Introduction
It is hard to imagine work in a modern psychology laboratory without the use of personal computers. These ubiquitous devices are becoming more and more capable each year, being able to present experimental stimuli as well as to record a participant's response with great accuracy. In the past, the development of psychological experiments required a great effort and a lot of low-level coding. Fortunately, things have changed in the last two decades and nowadays we have an array of freely available open-source software tools and libraries that make the process of developing psychological tests much easier.
Based on the experiment specification approach, we can divide the existing software into two categories: (1) GUI-based and (2) text/code-based software.
GUI-based experiment builders use a click/drag&drop interface to construct an experiment (e.g., OpenSesame [1], E-Prime [2], PsychoPy Builder [3], Lab.js [4], Gorilla [5]). This style is a good fit for practitioners without a programming background. Those builders are easy to learn by navigating the interface using a mouse and trying options, i.e., they are suitable for trial and error learning. They often offer a good overview of the experiment structure, although for the details the users usually have to drill down into the dialogs or other GUI elements.
Text-based builders use textual languages as a way to specify the experiment (e.g., PsychoPy [3], Expyriment [6], jsPsych [7], PEBL [8]). While they are initially not that easy to start compared with GUI builders, they offer some benefits. First, we can leverage all the development in text editing in recent decades (syntax highlighting, code completion, tooltips, quick fixes), which usually can be configured in contemporary editors without much effort. Second, standard text editing idioms can be used such as copy-paste, code commenting, etc. Third, the representation is the same as the storage format, which enables the usage of any text editor to read and edit the experiment. Fourth, plain text is perceived as a future-proof format. If the text-based builder ceases to be developed, we can still open and read experiments in the future using a plain text editor. Additionally, last but not least is a collaboration which is the key in today's development. Text-based syntaxes facilitate the usage of standard version control tools such as git [9] and mercurial [10], and services such as GitHub and GitLab to enable distributed collaboration and tracking of the history of changes. Indeed, distributed version control systems have started a revolution in the global collaboration in software development. The GitHub Octoverse report from 2020 (https://octoverse.github.com/, accessed on 1 July 2021) shows that GitHub has over 56 million developers who added over 1.9 billion contributions in the year 2020. The report shows that the numbers are increasing rapidly year after year. It is hard to achieve this level of collaboration with GUI-based experiment builders as their storage formats differ from the display format. While the user interacts with the tool using some form of graphical syntax, the experiment is stored in a format that the user is not familiar with (e.g., XML, JSON, or binary), which requires the development of custom tools for version control. This hampers the merging of concurrent changes and the investigation of the history of changes.
Textual languages can be divided into two categories: (1) General-Purpose Programming languages (GPLs) and (2) Domain-Specific Languages (DSLs). GPLs are programming languages that can be used to build general computer software (e.g., Python, JavaScript, C). The benefits of experiment builders based on GPLs are: (a) they have access to the full host language that offers unprecedented power and flexibility; (b) a user may use any additional library available for the language; (c) existing tools, debuggers, editors, etc., for the host language can be used in the context of experiment builder library as well. However, their usage requires programming skills. Additionally, the essence of the experiment design is not apparent from the implementation code. Conversely, DSLs are expressive languages tailored for a specific domain [11]. They use concepts from the target domain and can only be useful in that domain, i.e., they trade generality for expressiveness in a limited domain [12]. It is a common belief that DSL programs are easier to understand and maintain than GPL programs and that using DSLs is more efficient and effective. This has been empirically validated in several recent studies [13][14][15].
One of the recurring ideas in the history of DSLs is the involvement of non-programmers, or lay programmers [16]-domain experts who are not professional programmers but program in DSLs-directly in the development of a solution. This idea is also known as end-user development [17]. Although this level of involvement of domain experts is not always possible, it is important for the DSL to provide a syntax that can be read and comprehended by domain experts even if they do not type the program directly. A DSL that can be read by domain experts can serve not only as a design and implementation tool but also as a requirements elicitation medium.
DSLs have been successfully used in various domains. However, we find that in the domain of psychological test specification, while there exist numerous GUI-based builders, text-based builders are mostly based on general-purpose languages.
We believe that DSLs, if properly designed, have the potential to bridge the gap between GUI builders and GPL-based builders, enabling non-programmers to specify experiments while still using the benefits of text-based distributed collaboration and version control.
In the rest of the paper, we present PyFlies, an open-source DSL for implementing psychological experiments. The language design principles are oriented towards readability and ease of use by employing abstraction and first-class domain concepts. We have used PyFlies so far in educational settings. It is used to teach design of psychology experiments to students of medicine at the University of Pristina-Kosovska Mitrovica and students of psychology at the University of Novi Sad. It is also used to teach the design and implementation of domain-specific languages to students of computer science and software engineering at the University of Novi Sad. PyFlies is available as open-source software under the terms of the GPL 3.0 license. It is hosted at GitHub (https://github.com/pyflies, accessed on 1 July 2021).
The paper is organized as follows: Section 2 gives some theoretical background of the field of DSL. The PyFlies architecture and design principles are described in Section 3. In Section 4, PyFlies language abstract and concrete syntaxes are described. Code generators are described in Section 5. The results given in the form of a case study on the development of a real psychological experiment are given in Section 6. The discussion is given in Section 7. Section 8 presents the related work from the fields of DSL and psychology test builders. Section 9 concludes the paper.

Theoretical Background
This research is an application of the domain-specific language approach to the development of tests in psychology. Thus, in this section, we give a theoretical background of the DSL approach to lay the groundwork for our work.
Domain-specific languages (DSLs) are small, usually declarative languages that allow developers to write code at the appropriate level of abstraction using concepts from the domain of their expertise [18]. They provide an effective interface for the domain experts to specify solutions. DSLs can be regarded as a specialization of a more general approach called Model-Driven Engineering (MDE) [19,20]. DSLs put emphasis on domain specificity. DSLs are constrained to the given domain and can be used only to specify solutions in the given domain. This restriction makes DSLs more expressive and concise [12]. Furthermore, the DSL approach fosters the building of a community of domain experts who speak the same language [21].
In contrast, general-purpose languages can be used to create arbitrary software solutions. Thus, they use concepts of programming paradigms they conform to such as functions, classes, objects, parameters, methods, threads, calls, loops, conditions, etc. During the design and implementation of the solution, the concepts of the domain must be mapped to computing concepts of the GPL. This mapping is a manual process and the source of difficulties during the development and maintenance of the software [22]. DSLs make the mapping problem non-existent as they offer one-to-one mapping by directly operating in terms of domain concepts.
The artifact that conforms to the given DSL is called program or model. While some researchers see a difference between the two, we take the stance of the other group that treats both programs and models the same. To emphasize this equality, Kleppe suggested the term "mogram" in [23]. However, the suggested term has not seen a wider adoption so far. Thus, in this paper, we use the terms "program", "model", and "specification" as synonyms.
In the past, DSLs have been known as "little languages" [24], although depending on the domain, these languages do not have to be little in terms of the number of language constructs. However, as the focus of the language expands there is a danger that the language will lose the conciseness and expressiveness of DSL. Indeed, Fortran and Cobol started as DSLs in the area of scientific computing and business computing, respectively. However, they gradually evolved more generality over time [18,25]. We should point out that the domain specificity is not an absolute but a relative measure. Domain specificity is a matter of degree [12]. Additionally, when discussing whether a language is or is not a DSL, we must know what domain we are talking about. For example, JavaScript is regarded as a GPL, but if our domain is "web applications development", then the language could be considered as a DSL. Of course, this is just a conceived example of a too broad domain to be usable. In the context of this paper, we assume the domain of creating experiments in psychology.
Solutions based on DSLs are easier to maintain [26], validate and verify [27]. In general, whereas in GPLs AVOPT (Domain-specific Analysis, Verification, Optimization, Parallelization, and Transformation) techniques [12] are hard to achieve, in DSLs they are straightforward. Despite their proven benefits, DSLs have been seldom constructed in the past by smaller communities of developers. The reason often cited in the literature [16] is the cost of building and maintaining the language and the supporting tool-chain (parsers, editors, validators, generators, etc.). However, in the last two decades, a new breed of powerful language engineering tools has emerged which reduced this cost significantly and led to their uptake in DSLs applications. These tools have been developed as integrated environments for language engineering, fostering the fast development of languages and supporting services (editors, debuggers, visualizers, etc.). These tools are usually referred to as language workbenches, a term coined by Martin Fowler [28]. Well-known text-based representatives of this kind are xText [29], Spoofax [30], and Rascal [31]. In addition to the approach based on text parsing, there is an approach based on so-called "projectional editing" where the user interacts directly with an abstract representation of the model through the "projection surface" [32]. The advantage of this approach is that arbitrary types of concrete syntax can be supported (e.g., graphical, textual, tabular). Different concrete syntaxes can even be mixed in the same representation. Moreover, the model validation happens during editing so the interaction with the user is richer. The downside is that the tools that support this approach are more complex, and similarly to all GUI-based builders, the storage format differs from the presentation form so standard text-based version control tools and plain-text editors cannot be used. Probably the best freely available tool in this category is Meta-Programming System (MPS, https://www.jetbrains.com/opensource/mps/, accessed on 1 July 2021) [32].
The work presented in this paper builds on top of our previous research in the development of tools for building DSLs in the Python programming language [33,34]. These tools provide a fast turnaround and easy modification and evolution of the language's abstract and concrete syntax.
Each DSL consists of three ingredients: abstract syntax, one or more concrete syntaxes, and semantics [32]. The abstract syntax defines the language structure, its concepts, and their relationships (Section 4). The concrete syntax specifies how the language appears to the user and how the user can interact with the models. We can think of concrete syntaxes as interfaces to the users. Concrete syntaxes come in different forms, e.g., textual, graphical, table-based, dialog-based ones. In this paper, we use the textual syntax (Section 4). The semantics of the language defines the meaning of programs written in the given DSL. There exist different approaches to defining semantics [35]. In this paper, we use a pragmatic approach where we map our programs to the target platform whose semantics is already defined. This is achieved using source code generators (Section 5).

PyFlies Architecture
This section describes the architecture of PyFlies. To better understand the reasoning behind architectural decisions, in the first subsection, we present the design principles which were followed during the development of the language and the tool.

PyFlies Design Principles
The design of the PyFlies language was guided by the following principles:

1.
Readability: Researchers share experiments with their peers. Furthermore, it is a common belief that developers spend more time reading the code than writing it.
That means the experiment descriptions should be as easy to read and understand as possible. One of the principles of Python Zen is "readability counts", which is arguably one of the design principles that had the most impact on its adoption.

2.
Abstraction: PyFlies tries to hide as much of the implementation details as possible. Thus, in PyFlies experiment descriptions, most often, we use abstract terms. For example, instead of using explicit screen coordinates, we often use terms left, right, up, down, and let the compiler complete the mapping. These mappings, if needed, can be overridden. In different contexts, these abstract terms may mean different things. For example, a symbol left might be a position on the screen, a direction of the pointing arrow image stimulus, or a left arrow key on the keyboard. The code generator is capable of carrying out this context-aware mapping while the experimenter may still use the same term in all contexts. Another abstraction used in PyFlies is abstracting how exactly to carry out various steps in the experiment. The experimenter is more focused on providing information on what needs to be achieved than how to do it.

3.
Domain concepts: PyFlies language constructs are domain-specific. For example, instead of having concepts of general-purpose programming languages such as functions, classes, methods, etc., in PyFlies we use tests, screens, components, timings, condition tables, etc. In our experience, this reduces mental effort in designing and understanding the experiment.

4.
Modular architecture and extensibility: PyFlies follows a modular approach where target platform code generators can be developed as separate Python projects (Section 3). Moreover, components used in the experiment description are specified using a component description DSL, which will make it possible to open this language to experimenters and enable custom component specification in the future (Section 4.4).

5.
Best practice: The code generator should be written by developers that are knowledgeable about both PyFlies and the target platform. Thus, the current best practices (at least to the best of generator authors' knowledge) for the target platform are automatically provided for all PyFlies users. Since generators can be improved constantly, all PyFlies experiments undergo improvements for free. 6.
Focus on the experiment structure: PyFlies specifications are strongly focused on the experiment structure. Each experiment consists of screens and tests which are executed during the flow of the experiment. That is something often found in GUI builders but lacking in textual languages. Experiment [6], implemented as a Python library, is a notable text-based solution that employs this idea. Figure 1 shows the architecture of PyFlies and its modular design. PyFlies follows the standard compiler architecture that consists of:

Modular Architecture
• front-end (colored green in Figure 1)-parses input files, performs syntax and semantic checks and constructs the intermediate representation; back-end (colored red in Figure 1)-consists of a set of code generators for different target platforms.

Front-End
The core PyFlies project implements language front-end and the infrastructure for discovering and running generators. We use textX [33], a Python tool for the development of domain-specific languages based on Parsing-Expression Grammars formalism [36] and built on top of the Arpeggio parser [34]. TextX parses the input PyFlies files, performs syntax checks based on the provided grammar, and produces intermediate in-memory graph representation, which is further checked for semantics conformance. This intermediate representation is further transformed to the target run-time platform by the selected back-end (code generator).

Back-End
PyFlies modular design enables the development of generators as separate projects. By installing Python generator projects, using the standard Python pip package manager, we make them available to the PyFlies core. We directly utilize textX's generators and languages registration and discovery support (http://textx.github.io/textX/latest/ registration/, accessed on 1 July 2021). This enables listing and using generators from the command line interface (CLI) by the textx command, but also discovering and using generators programmatically via textX registration API.
The PyFlies core provides two generators out-of-the-box: CSV (Comma Separated Values) and Log. These two generators are generic enough to be supported by the core itself. CSV generator produces a .csv file with the data from the expanded test table (Section 4.3), whereas the Log generator produces log files with detailed information about the flow of the experiment. Although the PyFlies core project implements these two generators, they are still registered, discovered, and run the same way as all other externally implemented generators. We discuss generators in more detail in Section 5.

Language Abstract and Concrete Syntaxes
In this section, we give a brief overview of the core language constructs. A detailed description of the language is available in the PyFlies documentation (https://pyflies. github.io/pyflies/, accessed on 1 July 2021).

Language Abstract Syntax
A simplified abstract syntax of the language is given in Figure 2. An abstract syntax of a language is also known as the meta-model in the modeling communities. The full meta-model of the PyFlies language is available in the project GitHub repository, https: //github.com/pyflies/pyflies/tree/main/docs/metamodel, accessed on 1 July 2021. The diagram shows that each PyFliesModel contains a Flow definition, one or more routines represented by the RoutineType concept, and zero or more Target definitions. Routine can either be a test, represented by the TestType concept, or a screen, represented by the ScreenType concept. Each test has a ConditionTable and zero or more instances of Component.
A more detailed description of each relevant concept together with its concrete syntax is given in the following subsections.

Tests
Test is the core concept of the PyFlies language. Each test consists of a condition table and trials' components specification.
The definition of Test concrete syntax in textX grammar language is given in Listing 1. TestType grammar rule specifies test definition that starts with keyword test followed by attribute name matched by the textX built-in rule ID. It is further followed by the test body that starts with the literal string match "{" (line 2). The body of the test starts with condition table specification which is matched by the ConditionsTable grammar rule (Section 4.3). After the condition table, we have optional variable assignments matched by the rule VariableAssignment. At the end, we have component conditions specifications represented by textX attribute component_cond and matched by the rule ComponentsCondition. The closing curly brace ends the test body.  correct -> sound ( freq 500) for 300 15 error -> sound ( freq 1000) for 300 16 } The second part of the test definition specifies trial phases with component specifications.
PyFlies divides each trial into three phases: fix, exec, and error/correct. The fix phase is a fixture phase during which a fixture stimulus is shown to the subject. After the fix phase, we have the exec phase that is the main phase of the trial and the only mandatory phase (all others are optional). Depending on the subject's response, the third phase can be either correct, if the response was correct, or error, if the response was incorrect. We provide the correct response in the definition of the input components (e.g., correct property of the keyboard component is set to a symbol or a list of symbols describing the correct response).
We provide each component specification in the following form:

< boolean expression > -> < component spec ificat ions with timings >
The Boolean expression on the left side, when evaluated as true, enables the execution of components specified on the right side. There can be more components defined for each Boolean expression.
During the execution of each phase Boolean variables fix, exec, error, and correct will have a corresponding value. For example, exec will be true during the exec phase and false in all other phases. That enables creating Boolean expressions on the left side of each trial definition expression that matches a particular phase. For example:

fix -> cross () for 700
Will match the fixation phase and show a cross component for the duration of 700 ms in this phase.
The optional duration is specified after the keyword for. It is given as a PyFlies expression that evaluates to an integer interpreted as a duration in milliseconds. If the duration is not given the component runs until the end of the trial phase.
Most of the time a Boolean expression on the left side is just a phase variable but can be an arbitrary Boolean expression. For example: fix and parity == odd -> circle ( color green ) for 700 Will show a green circle in the fix phase if parity is odd.

Conditions Tables
Condition tables are used to describe trial variable values in a compact form. Each column of a condition table specifies a variable available in each test trial.
The heading row of the table defines trial variables, whereas the remaining rows specify variables' values and thus define the state of each trial. We could write a row for each trial with the exact values, thus defining the table in expanded form. Instead, we usually specify values using expressions that are evaluated during compilation yielding the fully expanded condition table. That brings more flexibility in extending the table, both column-wise and row-wise.
Listing 3 gives the definition of a condition table concrete syntax in textX. Condition-sTable uses a noskipws rule modifier as we want to control whitespaces ourselves in this rule. The condition table can be specified in either Org Mode [38] or markdown table format (https://www.w3schools.io/file/markdown-table/, accessed on 1 July 2021). At the beginning of the ConditionsTable rule (line 1), we have a match that consumes all whitespaces and comments (line 2) until the first pipe char (|) that starts the table. Line 3 defines the syntax of the table header with variable names. It starts with the pipe char, after which we have one or more (+=) variable name matched by the rule TableVarName. This variable assignment uses separator '|' textX repetition modifier during the matching (part ['|'] after TableVarName). Rules WSWithNL at line 8 and WSNoNL at line 9 match whitespaces with newline and whitespaces without a newline, respectively. The expression at line 4 consumes the header separator. The header separator is not preserved (i.e., no textX assignment is used). After the header, one or more condition specification is consumed (line 5) matched by the rule Condition (not given in the listing for brevity). Listing 4 shows two tables with three variables: color, direction and congruent. The first table is an expression-based form of a condition table. We use loop expressions in the first two columns and a Boolean expression in the third column. loop expressions will produce a row for each value from the collection it is called upon (in this case, of the list type). loop expressions are evaluated from left to right.
The second table is an expanded form of the first table. We see that we have each color from the list of colors iterated in the first column. For each color in the first column, a list of directions from the second column is iterated. And lastly, the third column's Boolean expression is evaluated in the context of each row.
Both styles of condition table are valid. We can use either, but the expression based form is usually shorter and more flexible.
PyFlies condition tables use markdown table syntax, which enables convenient editing and good readability while still being pure text. However, for the best user experience, an editor with markdown tables support should be used. PyFlies extension for Visual Studio Code (VS Code) editor offers this feature.
Besides, test definitions tables are also used in repeat with statements in the flow block (Section 4.6) where they represent variables' values for each repetition of the loop.

Components
The interaction with the experiment participants during the test is carried out through components. There are currently 11 components defined by the PyFlies. Additional components are easily specified.
We describe PyFlies components using a built-in DSL for components description. The description consists of a component name, documentation, and a set of properties with names, types, default values, and their documentation. A component can inherit other components.
A textX grammar for components definition is given in Listing 5. The optional abstract keyword (line 2) specifies components that can only be inherited but not referenced in tests. A component can inherit multiple other components. Inherited components are specified after the keyword extends (line 3). The component header is followed by its description after which the body is defined. The body of the component consists of zero or more parameters defined by the ParamType rule (line 6). The ParamType rule defines component parameters (line 10). It starts with the keyword param followed by the name attribute that is matched by the ID rule. The parameter type is specified after the colon and it can be a simple type or a list of types inside square brackets (line 12). Multiple types are used in a situation when the parameter may be one of several types (e.g., a symbol or a point, see Listing 6). The default parameter value is given after the char =. At the end of the definition, an optional description may be provided.  (20,20). This definition provides information for semantic checks of components' usage in test specifications. Override ' size ' to be of point type representing width and height . 0 for height means ' keep aspect ratio '. """ } We use the component description language internally in the core PyFlies project, but we plan to open this language to end-users, which will enable them to specify custom components. However, specifying and using new components will require users to extend the target generator they use as well.
Note that component descriptions do not specify their full semantics, e.g., they do not specify what components do and how they operate during the experiment run. That is entirely left to the code generator to decide. For example, there is no fundamental difference in the definition of visual components, such as circle or rectangle, and input components, such as mouse and keyboard. They are all defined just by their names, inherited components, and a set of properties. A consequence of this design decision is that we can introduce a new class of components in the future, either by the core project or the language users, without the need to change the core.

Screens
During the experiment, we often need to show messages to the subject, e.g., by describing what they are supposed to do or providing feedback about their performance. For this purpose, PyFlies provides screen language construct (see Listing 7 for an example). The screen definition is a simple construct that starts with the screen keyword, followed by the screen name and a block of text within curly braces. The text in the body of the screen statement is often plain text but may have dynamic parts. This dynamic expansion is implemented by passing the body text through the Jinja template engine [39], which is one of the most popular Python template engines available. In general, a template engine enables template-based code generation by mixing static text, which is left unchanged, and the text dynamically created by template language expressions [40]. In the example screen given in Listing 7, a dynamic part of the text is {{real_or_practice}} which renders the real_or_practice screen parameter, enabling the same screen to be used both for practice and real trials block. You will be presented with a digit . Press the LEFT key for odd and the RIGHT key for even digit .
Press ENTER key for the start . }

Flow
The flow part of the experiment specification defines the sequence of test executions and screen displays. The flow construct starts with the keyword flow followed by the body of the statement enclosed in curly braces (Listing 17).
The flow statement body contains a sequence of statements for executing tests, showing screens, or repeating a test or a statements block.
To show a screen you use show statement of the form: show Intro for 10000 The previous statement displays the Intro screen for 10 seconds. The for part is optional, and if not provided, the screen will be displayed until a keypress.
To execute a test, you use the execute statement of the form: execute Parity ( practice true , random true , some_param 42) Note the parameters given in the parentheses. Parameters can be specified in both execute and show statements and, if provided, can be referenced inside tests and screens. Parameters can be named arbitrarily but practice and random names have special treatment -the former is used to denote if data should be collected, while the latter is used to designate if the trials should be randomized.
To repeat a test or a block of statements multiple times, we use the repeat statement. This statement can apply to a test. For example: In Listing 8, the outer repeat loop will repeat as many times as there are elements in the image_types list used for the loop in the condition table (Section 4.3).

Targets
If a target code generator needs configuration, we can provide it in the target section. This section starts with a keyword target followed by the name of the target platform and the body enclosed in the curly braces. Within the target body, we specify key-value pairs where the key is a configuration variable name.
In Listing 9, we see a configuration, which configures background, a configuration parameter used to choose the background color, and a mapping between parities used in the experiment and the keys on the keyboard.

Code Generators
The PyFlies back-end consists of a set of target platform's source code generators, which are developed as separate projects. The PyFlies core discovers all registered generators at run-time using textX registration mechanism 3.4 based on Python setuptools (https://setuptools.readthedocs.io/, accessed on 1 July 2021) entry points concept. The entry point used for textX generators is textx_generators. Listing 10 shows the declarative registration of the PsychoPy generator using setup.cfg file. The generator is described in more detail in Section 5.2.  Each textX generator is a Python callable (Listing 11) registered using @generator Python decorator that is parameterized by the textX language name and the target platform name (line 1). The generator function accepts textX metamodel, the model created by textX from the input program file, output path, overwrite and debug flags, and the custom generator parameters (for more information, please see textX docs on generatorshttp://textx.github.io/textX/stable/registration/#textx-generators, accessed on 1 July 2021) (lines 2-3). The generator function is free to generate the target in whatever way it chooses. It is entirely left to the generator's authors. However, textX provides optional integration with the Jinja template engine (Jinja integration is supported through the textX-Jinja project-see http://textx.github.io/textX/stable/jinja/, accessed on 1 July 2021) that makes creating generators easier. The supported Jinja integration is used by all PyFlies generators described in this paper.   # call the Jinja based generator 16 t e x t x _ j i n j a _ g e n e r a t o r ( template_file , output_file , config , 17

overwrite , filters )
After the generator is registered, it is discovered by textX and can be used both programmatically through the textX registration API or by the textx CLI command. For example, listing all registered generators can be carried out by the textx list-generators command, as shown in Listing 12. Generators can be called by the textx generator command as demonstrated in Section 6.6. In the current version of PyFlies, we provide the two built-in generators and the generator for the PsychoPy library that is developed as a separate project. We describe those generators in the next subsections.

Built-In Generators
The main PyFlies project itself registers two general-purpose generators: CSV and Log (Figure 1).
The CSV code generator generates a Comma-Separated Values file for the first expanded table in the experiment flow. Due to repeat loops and the possibility of having multiple tests, there can be many different tables in the experiment flow. These CSV files can be used as input condition files for other experiment builders.
The Log generator generates a textual log file with detailed information about the experiment and its flow. It is important during test development as the test is given in a sort of "compact form", i.e., a condition table expands during run-time as explained in Section 4.3. Furthermore, due to repetitions and randomness, and the way components are instantiated during each trial, it might not be apparent to novice language users what are the actual steps the test will take. The log shows the actual steps that happen in chronological order. Thus, it serves as a sort of debugger which is very helpful during the initial stages of learning the language. It is a plain text file whose content is conveniently indented to take advantage of VS Code default code folding capabilities. Figure 3 shows the usage of a log file to reveal a particular trial of a particular test run. All generators are treated the same, so there are no fundamental differences between these two generators provided by the core project and any other generator developed for PyFlies (or for any textX language to be more precise).

PsychoPy Generator
The first target platform's code generator that is fully developed is the generator for the PsychoPy experiment library. We have chosen this library as our first target due to its maturity, flexibility, and the fact that it is supported on all major operating systems, and can achieve high levels of precision and accuracy [41]. PsychoPy is implemented in Python, which is another positive side as Python has become a de facto standard in scientific research in recent years.
The PsychoPy generator implementation consists of two files. The first file is a Python file (Listing 11) that is used to set up the Jinja template engine and to perform mapping of all PyFlies types and values to PsychoPy. The second file is the Jinja template (Listing 13) that is used to produce the final output experiment as a runnable Python script. The template consists of fixed parts (e.g., lines 1-4) that are written to the target file without change, and variable parts (e.g., lines 8-16) that are expanded by the template engine based on the information from the model. The target script can be run by either PsychoPy Runner or by the standard Python interpreter. A part of Python code produced by the PsychoPy generator for the Eriksen flanker experiment described in Section 6 is given in Listing 14. We can see how the variable parts of the Jinja template from Listing 13 were expanded in the final output code. For the full source code please see the experiment repository on GitHub (https://github.com/pyflies/ EriksenFlanker, accessed on 1 July 2021).

Listing 14.
A part of the generated PsychoPy code for the Jinja template given in Listing 13.

PyFlies Case Study
This section presents a case study on creating a real experiment. It is written in a form of a tutorial so that it could be easy to replicate.
The first step to perform is to install and setup PyFlies. The complete process of PyFlies installation is available in the Getting Started video tutorial (getting started video, https://www.youtube.com/watch?v=NVB2JHbCLY0, accessed on 1 July 2021).
For the experiment, we choose a variation of the Eriksen flanker task [42], inspired by the tutorial in [7].
In the Eriksen flanker task, visual stimuli are displayed to subjects instructed to press the corresponding key (e.g., the left key for a left-pointing arrow). In some trials, the surrounding visual field contains stimuli that might be in the same category as the target (e.g., same arrow orientations). These trials are said to be congruent. In other trials, the surrounding stimuli might belong to the opposite category than the target (e.g., opposite arrow orientations), thus flanking the target. These trials are incongruent. It has been consistently found, in multiple studies, that incongruent flanking items slow down RTs relative to congruent flankers [43]. Figure 4 shows the four stimuli where (a,b) are congruent, as the central arrow points in the same direction as the surrounding, and (c,d) are incongruent. The arrows usage is not mandatory in this test. We could use other combinations of target and flanking stimuli (letters, numbers, colored shapes, etc.). PyFlies enables easy tests variations.

Test Definition
In this part of the study, one should go to the EriksenFlanker folder on the left side, click the New File button and type in the name eriksen.pf. It is important to provide the pf file extension as it is used by the editor to recognize files as PyFlies specifications.
In the editor, one should provide the code from Listing 15. Instead of just copy/pasting, one should try to type the code to become comfortable the editor. You will notice that PyFlies provides code snippets. For example, when you start typing word test, you will be provided with the possibility to expand the test snippet. Press the TAB key to expand the snippet. Type the name of the test EriksenFlanker and press TAB again to go to the next field. Each code snippet can provide multiple fields/locations.

Screens Definition
To prepare the subject for each batch of trials, we define two screens given in Listing 16.
We have an intro screen that introduces the test to the subject. Our experiment will show this screen at the beginning before the practice trials run. The real screen will be shown after the practice run finish but before real trials start. These screens are plain text without any dynamically rendered parts. You will first do a practice run with 4 trials . A real run with 20 trial will be performed afterwards .
Press SPACE or touch / click the screen to continue } screen real { Real block ----------Now a real block of trials will be performed .
Press SPACE or touch / click the screen to continue }

Flow Definition
Up until this point, we have defined all elements of our experiment. Now we need to define the flow of the execution. In this experiment, the flow is simple: we show the introduction screen, execute practice trials, then show the real screen and lastly execute real trials where the data will be collected.
The flow of the experiment is given in Listing 17. In line 2, we use the show statement to show the intro screen. After the screen ends (terminates by a press on the keyboard), the test is executed in practice mode. Note the parameters of the test: practice set to true, and random set to true. It will run the test in practice mode with randomized trials. After practice trials end, we display the real screen (line 6), giving information to the subject that the real test is about to begin. In line 8, we execute the same EriksenFlanker test, but now we do not specify the practice parameter so it defaults to false. We passed an additional parameter repeats set to 5 as we wanted our four trials from the test's condition table to be repeated five times.

Target Configuration
The last piece of the puzzle is the configuration of the target generator. It is optional, and we can leave it out if the default settings of the target code generator suit our experiment. In our experiment, we use directions in the keyboard component. Left and right arrow keys are called left and right in the target PsychoPy library, so we do not need any mapping.
In case we would like to use other keys and, for example, background color, we could use a configuration similar to the one given in Listing 9.

Generating and Running the Experiment
At this point, the experiment description is complete. The final step is to produce the runnable program. This is achieved by running the target code generator over the PyFlies model file. Since we have PsychoPy generator installed in our Python virtual environment, we call the generator: $ textx generate eriksen . pf --target psychopy --overwrite Generating psychopy target from models : / home / igor / Eriks enFlan ker / eriksen . pf Creating / home / igor / Eriks enFlan ker / eriksen . py Done . Files created / overwritten / skipped = 1/0/0 The --overwrite flag instructs the generator to overwrite the target file if it already exists. From the output, we can see that the target Python script eriksen.py is produced. This script is our experiment implemented to use the PsychoPy library. We ran this experiment as any other Python script: $ python eriksen . py The experiment will run as instructed in our PyFlies model, and the data from the real block of trials will be stored in the data folder. The data will contain all the relevant information about each trial.

More Examples
The source code repository at GitHub has several complete examples (https://github. com/pyflies/pyflies/tree/main/examples, accessed on 1 July 2021). An example of how to create a block of trials and counterbalancing is also provided. There is also an example and a video tutorial for Posner cueing task (https://www.youtube.com/watch?v=Fm_XBnqyGfI, accessed on 1 July 2021) [45] which provide more insight into the process of producing and using tests with PyFlies.

Discussion
In this section, we discuss the limitations of the current approach and implementation. We also give some ideas for further development directions and improvements.

Calling Target Platform Code
For a DSL to be successful, it must be effective and efficient in its intended domainthat is, experiments that are considered to fit the domain should be expressible with the language in an optimal way. This can be expressed as the coverage of the domain [46]. DSL might cover too little of the domain, making some valid experiments impossible to define, or more than needed, making the language unnecessarily complex.
There is also a typical trade-off between generality and completeness. PyFlies language should be general enough to support different target platforms but detailed enough to enable a broader set of features. In other words, PyFlies is limited to a set of features common to possible target experiment platforms.
A usual approach to remedy this issue is to enable extending PyFlies definitions at prescribed places using the target general-purpose language. For example, we could provide means to call Python functions and use Python expressions at specific places in the experiment. Although this would make PyFlies more flexible, as experimenters could utilize a huge ecosystem of Python libraries, it would hamper portability between experiment platforms. Nevertheless, we find that the pragmatism of providing the ability to call into the target platform code outweighs the portability problem, so we plan to support it in the future PyFlies versions.
Another feasible approach is to use PyFlies components, which are abstract enough to enable making components with target-specific semantics. As we have already mentioned before, PyFlies component DSL can be exposed to end-users and generator authors. That would make it possible to use target-specific components in the experiment design.

Unavailability of PyFlies Features on Target Platforms
Depending on the target platform flexibility, there is always a danger that some PyFlies features cannot be mapped to target platform features, i.e., the feature set of PyFlies is not a subset of the target platform feature set.
In this case, the only option is to warn the experimenter that the feature is not available and that the experiment description should be altered to avoid the feature.

Pre-Evaluation of PyFlies Expressions
All expressions were pre-evaluated during compilation, and a generator obtained the final values. This is fine for non-random expressions, but random expressions (e.g., choose or shuffle subexpressions) are the problem as values must be generated at run-time to be truly (pseudo)random. To support the run-time generation of random values, and the ability to call into target-specific code, expressions should be translated to the target platform.
This feature is particularly important for defining timing values such as inter-stimuli intervals (ISI) where the user would like to implement a particular approach in choosing random values (e.g., using 50 ms steps so that the ISI is an integer number of 60 Hz screen refreshes, or using a Gaussian distribution of values). This could also be beneficial for custom experiment designs where different randomizations and selection of conditions can be specified.
One way to implement expression mapping is to require each target to supply mapping for each PyFlies type/operation. That might be relaxed to be just a recommendation, in which case PyFlies compiler might pre-calculate all subexpressions which are not available in the target generator. For example, providing just mapping for choose would be enough to support random run-time generation in simple cases where only choose is used, but for example in 1..10 choose + 10 the target is required to support + operation mapping.
We can execute an analysis of expressions and issue warnings if some part of an expression could be translated but is not due to the non-availability of translation for operations in the other parts of the expression.
Another consideration is in which case expression translation should be used. For example, loop expression for table expansion should stay pre-evaluated to have a stable predetermined number of trials for a test. Conversely, component parameter values, duration, time reference, etc. could be made translatable.

Additional Generators
One of the benefits of having DSL with code generators is to achieve experiment portability across a wide range of experiment platforms. For this, code generators for multiple platforms should be implemented. Our current plan is to provide at least one generator for a web-based platform.
In the current version, we have implemented a generator for PsychoPy. One direct way for PyFlies to target the web is to support PsychoPy Builder format as the output (https://www.psychopy.org/psyexp.html, accessed on 1 July 2021). The PsychoPy Builder format is a textual XML-based format from which both Python and JavaScript code can be generated. By implementing the PyFlies code generator for PsychoPy Builder XML target, we can receive both Python and JavaScript support while the target code quality is guaranteed as the code generator is created by PsychoPy experts.
Additionally, as code generators can be implemented as separate independent projects, they can be contributed by other developers who are knowledgeable on particular target platforms.

Timing Considerations
Since PyFlies is not a run-time platform, the timing accuracy and precision can be as good as what can be achieved by the chosen target platform. Nevertheless, it is still a code generator's responsibility to provide the best timing performance possible with the target experiment platform. One of the improvements for achieving higher precision with visual stimuli, due to the limitation of the display's refresh rate, would be an option to specify durations in ticks instead of milliseconds [47].
An overview of the timing performance of various experiment run-time platforms can be found in [48].

Better Support in Editors
For the PyFlies language adoption and positive overall user experience, editor support is of great importance. The current VS Code extension offers code snippets and syntax highlighting that helps in the development. However, editor support could be further extended.
First, we could provide better syntax and semantic checks with an explanation of what the user is doing wrong and a suggestion of how the problem might be fixed. Our plan to support this is by implementing Language Server Protocol (LSP) [49] for PyFlies.
LSP is an open protocol started by Microsoft to separate language cleverness from the editors and IDEs. This way, a single PyFlies LSP server would contain all the knowledge to perform operations such as auto-complete, go to definition, or documentation on hover. This server would serve multiple editors and development environments, reducing the effort to support new editors.
Another feature that would help writing condition tables is displaying an expanded version of the table in hovering popups.
Although collaborative editing in PyFlies is supported by using text-based version control systems such as git, we could further improve interactivity by implementing a web-based editor. In web-based editing, the experiment is stored and edited collaboratively in the cloud. With the code generator for the web platforms, this would make the whole process available online, lowering the barrier to use the tool. All the user would need, to both author and deploy experiments, is an account on the cloud service.

Initial Feedback
In further work, we plan to perform an analysis of the language and the approach by performing a controlled experiment with the users. However, we do have some anecdotal evidence on the usability and the general feel of the language that we gathered from our students during their course assignments and from psychologists with whom we were discussing language and its features. We observed that users that are already familiar with some programming language and general code editing were able to pick up the language quickly after seeing our video materials and completing some training from our side. Their general feedback was positive in comparison to their experience in using general-purpose languages. They observed that they were able to specify experiments quicker with fewer errors. However, users that were mostly accustomed to graphical builders experienced a steeper learning curve as they had to become familiar with standard code editing idioms.

Related Work
In this section, we present and analyze related work. We organize this section into two subsections. DSLs have been successfully used in various domains. To show the versatility of the approach, the first subsection presents some DSL-based research from different domains and compares them with our work from different viewpoints. The second subsection gives an overview of relevant work in the field of psychology test builders.

Domain-Specific Languages
Kosar et al. [21] present a unique hand rehabilitation platform RehabHand based on DSL and code generation techniques. The language uses a simple textual syntax and enables therapists to write rehabilitation exercises in natural, domain-specific terminology and share them with patients. The exercise is then translated to source code which is uploaded to various rehabilitation devices. The approach regarding the language itself is similar to ours. The semantics of the language is described, similarly to our work, as a source code generator. However, the technology used to develop a source code generator is Xtend [50], a general-purpose programming language for Java Virtual Machine. The language itself and the supporting language services are developed using a Java-based language workbench xText [50].
The work of [51] presents a RobotML, a DSL to design, simulate and deploy robotic applications. The language syntax is graphical and the language defines not only abstractions from the domain of robotics, but also component-based architectures. The authors report that, although the development time has not significantly decreased they observed multiple advantages: (a) more time is spent on the design than on dealing with low-level details, (b) the architecture is made explicit, (c) switching to a new target platform is much easier. Similar to our work, this language is meant to be used by domain experts. However, RobotML uses graphical notation while PyFlies uses textual.
Visser presents a case study in DSL engineering in [46]. The author has designed and built WebDSL, a technical DSL for web applications. In the development of the language, several DSLs for language engineering have been used. SDF [52] has been used for syntax definition while Stratego/XT [53] has been used for code generation. Since this is a technical language, it is oriented towards software developers.
Johanson and Hasselbring [15] present an empirical study of a non-technical, i.e., application domain DSL, by evaluating the Sprat Ecosystem DSL [54], which is a DSL for specifying high-performance marine ecosystem simulation experiments, for its effectiveness and efficiency. The results show that the participants' correctness point score was increased by 61-63% compared with the GPL-based solution and their time spent on the tasks was reduced by 31-56%. Furthermore, the Ecosystem DSL receives higher user ratings than the GPL-based solution concerning quality characteristics such as simplicity of use and maintainability of solutions. DSL analyzed in this study is, similarly to ours, targeted towards scientists.
Pajić et al. [55] present a specification of a domain-specific modeling language for the extraction of event logs from enterprise resource planning (ERP) systems. The models defined with this language are automatically validated and transformed into SQL code. Contrary to our solution, the language presented in this work defines a graphical syntax by which the domain experts can create custom complex queries to obtain event logs. Similar to our solution, the language is oriented towards users with little programming skill.
In paper [56], another technical DSL is presented whose aim is to facilitate the development of database-oriented business applications. Similar to our work, the presented language uses textual syntax but it is built using xText [50]. The language is based on entities and services concepts. From the model specification, a fully functional web application can be generated.
A graph layouting library and DSL for its configuration are presented in [57]. This DSL is a technical declarative language whose aim is to ease the configuration of a complex graph layouting library. It is implemented in textX [33].
Another technical DSL is ALAS [58]. This language is designed to support the development of mobile intelligent agents which are deployed to Siebog multi-agent system [59]. ALAS facilitates interoperability by enabling translation of agents' specifications to target GPL supported by the node where the agent runs.
In the paper [60] a music programming and live coding HMusic DSL is presented. This language is built as a so-called internal DSL [28]. These kinds of languages are embedded into GPL languages by utilizing features of the host language. HMusic is embedded in the Haskell functional programming language. Similar to ours, this language belongs to the application domain category. However, internal languages are constrained by their host language so the syntax cannot be tailored to users' needs. Furthermore, the usage of the language still requires programming skills as the users operate with the host language and its tool-chain.

Psychology Test Builders
Previously, we introduced a classification of psychological test builders based on the GUI-based, GPL-based, and DSL approaches. In this section, we focus on the GUI-based ones as they naturally operate on a domain-specific level using domain concepts but focusing on graphical syntaxes and the text-based DSL approach which we use in this paper.

GUI-Based
OpenSesame [1] is an established free and open-source software. It is written in the Python programming language and provides a usual GUI-based means for experiment creation. It has a palette of objects and the experiment can be created by drag&drop operation. It provides great flexibility enabling the writing of snippets of Python code where necessary. The stimuli presentation can be delegated to various backends. It is extensible so additional features may be provided through plugins. Save format can be a plain text which resembles a DSL so it could be used as a basis for text-based version control and collaboration. Although direct editing is possible, OpenSesame files seem too verbose and not optimized for it. E-Prime [61] is a tool for quickly creating psychological experiments using a palette of objects and drag&drop operations. The tool features millisecond accurate stimulus presentation and response time, various stimuli objects such as sounds, images, videos, text, buttons. It supports integration with various hardware devices. However, E-Prime is not free open-source software and it runs on Windows operating system only.
Lab.js [4] is a browser-based, free and open-source experiment builder. Experiments can be created by writing code using a high-level JavaScript library or using an online graphical interface. Experiments comprise HTML, CSS, and JavaScript files that can be downloaded, run locally, or uploaded to cloud-based services. It directly supports concepts such as screens, flow, page, loop, sequence. Similar to Lab.js is Gorilla [5]. It is also an online browser-based GUI experiment builder with the aim to provide a complete integrated SaaS (Software as a Service) solution where users can build, evolve, manage and deploy their experiments.
PsychoPy [3] is another popular free and open-source solution. It started as a GPLbased solution in the Python programming language but in recent years the project team developed a GUI-builder called PsychoPy Builder which can be used to quickly create experiments. From the GUI-based design, a code for the PsychoPy library can be generated automatically or the design can be saved as an XML-based save file format. PyFlies in this version provide a source code generator for the PsychoPy library (Section 5.2).

DSL-Based
The most relevant psychology builders to our work are those that use textual DSLs built from scratch. To the best of our knowledge, in this category we have PEBL [8], DMDX [62], and PsyToolkit [63] The current offering in the area of DSL for psychology tests seems to lack the desired level of abstraction and readability. PEBL [8], while being flexible and capable of providing support for a wide range of experiments, demonstrated by a large battery of tests, is more on the GPL side of the languages' spectrum with concepts similar to those found in other general-purpose programming languages. That makes PEBL specification more verbose and harder to comprehend.
PsyToolkit [63] is probably the most similar to our work. It is not free and opensource but it is free to use. It provides a browser-based editor and DSLs for creating questionnaires and reaction-time experiments. There is a library of free examples which can be downloaded from the project website, investigated, and run through the provided free service. While the questionnaire DSL looks simple and easy to learn, the reaction-time experiments DSL lacks abstraction and ease of use. Contrary to PyFlies, it is defined in a more imperative style and the table of conditions can only be specified in expanded form (i.e., all entries must be provided). This makes experiments unnecessarily verbose.
DMDX [62] is another popular DSL with GUI front-end developed in recent years [64]. DMDX language is large. According to [64], it comprises 414 keywords with 224 synonyms and 75 different branching keywords. The syntax of DMDX is designed more than three decades ago and seems optimized not for humans but machines. That makes DMDX usage hard without a specialized GUI builder.
GUI-based builders cannot effectively utilize the uptake in text-based distributed collaboration and version control. Furthermore, they might not appeal to users who become used to modern text editors and text-editing idioms but would like to work on a higher level of abstraction.
On the other hand, GPL-based builders require programming skills and operate on the level of general-purpose programming concepts. This hinders the readability and comprehensibility as the experimental design is not apparent by looking at the code.
To the best of our knowledge, very few builders which belong to the DSL crowd are either closer to general-purpose languages (PEBL [8]), too low-level and not optimized for domain expert's usage (DMDX [62]), or could be improved in terms of abstraction and conciseness (PsyToolkit [63]).
As a possible move in the right direction, this paper presented PyFlies, a free and open-source domain-specific language for the specification of experiments in psychology. PyFlies is used for several years in educational settings. However, we believe that it could be useful for psychology practitioners as well.
Our aim with the language was to capture the experiments' essence in a concise and readable form, without the technical clutter, usually introduced by target execution platforms.
To support a gentle learning curve, we provide several examples, full documentation, and a video tutorial series.
We provide a Visual Studio Code extension for a better user experience, especially in the condition tables editing domain. The extension provides code snippets and syntax highlighting, which makes the authoring of new experiments a pleasant activity. Further work on the editor is targeted towards better semantic checks with explanations, automatic code fixes, Language Server Protocol support, web-oriented editing, etc.
The PyFlies language is based on Python programming language and features a modular architecture where generators can be developed as separate projects. In this early version, we provide a code generator for the PsychoPy target. We plan to develop code generators for other platforms as well.
We are aware that PyFlies has limitations. In the previous section, we discussed the current shortcomings of the approach and implementation and gave some ideas for further work we plan to do. Some limitations are inherent due to the DSL approach, whereas the others are current implementation issues that will be improved in future versions.
In further work, we plan to perform an analysis of the language and the approach by performing a controlled experiment with the users where we would measure the time taken to build a test, difficulties in using the language, and VS Code extension. We also plan to carry out a language usability evaluation using some of the established conceptual frameworks (e.g., Use-Me [65]).
PyFlies is a free and open-source project which is developed by the community. It is hosted at GitHub Section 1, and it is provided under the terms of the GPL 3.0 license. Everyone is welcome to contribute code, documentation, tests, bug reports, etc. We hope that this paper will motivate researchers with programming experience, who are knowledgeable in experimental software, to implement additional code generators.