1. Introduction
In the design of the regional development plan of the Gorenjska region, the brainstorming process plays an important role in the initial steps, with the goal of providing innovative ideas that would benefit the community [
1,
2,
3,
4]. New tools such as ChatGPT [
2] present the opportunity to enhance group brainstorming sessions with the aid of AI. In the present study, we did not use the ChatGPT interface directly but considered the integration of the existing software tools for brainstorming (the “kresilnik” tool in our case) with the OpenAI application programming interface (API), which enables advanced use of the generative pretrained transformer (GPT) models. In the previous cycle of designing the regional development plan of the Gorenjska region [
3], the tool “kresilnik” was used to gather, categorize, and prioritize the innovative ideas in the initial stage of designing the regional development plan. In the present research, we propose the concept of hybrid brainstorming sessions as well as the concept of realizing hybrid brainstorming tools.
Classical brainstorming sessions will not be the same since the invention of ChatGPT [
5,
6,
7,
8,
9,
10]. This invention has had a profound impact on the methodology of brainstorming [
11,
12]; therefore, it was our intention to provide a novel hybrid framework for the process of generating ideas.
Involving a group of experts in the design of the regional development plan is important in order to apply collective intelligence [
13] and innovative ideas [
14,
15,
16]. There are cases where collective intelligence outperforms a single expert, for example, in the field of radiology [
17]. Novel collaborative information systems should, therefore, leverage the potential of collective intelligence [
18,
19]. However, in the application of AI, one should consider the ethical aspects [
20] that might occur, as well as the technical issues; for example, if unethical ideas generated by AI would be presented for consideration by the expert group. The proposed methodology, which will be described in this article, provides a framework for incorporating the Large Language Models (LLM) of AI into the process of group brainstorming. We expect that the expert group will evaluate the generated ideas and that the decision-making process will still be under the control of the expert group. Nonetheless, we anticipate that the proposed methodology will harness the capabilities of AI to generate innovative ideas and enhance the overall ideation process during the brainstorming phase.
To test the feasibility of augmenting the “kresilnik” brainstorming tool, a software system design was defined that enabled the integration of the tool with the OpenAI–GPT-3.5–turbo model via a web socket over a secure API-key encrypted connection.
In our previous research [
2], we estimated the usefulness of ChatGPT without direct access to API as well as the Ayoa [
21] tool, which enables AI-supported brainstorming. Preliminary research showed that AI tools can generate useful ideas in the complex topic of regional developmental planning. However, in order to have better control, integration within the custom-made “kresilnik” tool was needed in order to fine-tune the output of the model with the variation in the temperature parameter, in our case, as well as to provide an appropriate prompt within the API call.
An important developmental aspect was the integration of several different LLM such as CLAUDE, Bard, and ChatGPT [
22]. As shown in our previous research [
23], the integration of multiple cloud-based systems, which conform to the Koložvari–Škraba condition [
23], could provide better results than if one would apply only a single cloud-based LLM AI system. Regarding the general approach to the hybrid brainstorming process, the proposed methodology could be applied with the integration of other LLM and AI systems in order to boost the classical brainstorming process.
The emphasis of the study was on the process of generating innovative ideas rather than on decision-making. Nevertheless, in our previous research [
2], we have also considered the application of AI in the field of decision-making with promising results, which might be an interesting topic for further research.
In each brainstorming session, we started with the initial question or call for ideas. When using OpenAI-GPT-3.5-turbo API, an appropriate prompt should be formed.
The initial question that was posed to the OpenAI-GPT-3.5-turbo model was the following: “In the period up to 2034, with what activities will we take advantage of the strengths and eliminate the weaknesses of the Gorenjska region in the field of human resources development? Please give one idea according to the principles of brainstorming”. This initial question was similar to the one posed to the group of human participants in 2018, except the year was changed from 2027 to 2034, and the statement “Please give one idea according to the principles of brainstorming” was added, which was also in the instructions given to the participants of the real committee in 2018. Therefore, one could consider that the same question was posed here.
The question above, in English, was analyzed by the OpenAI tokenizer [
24] of the original Slovenian language, yielding 84 tokens and 203 characters. The tokens were marked with different colors, as shown in
Figure 1.
“The GPT family of models process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens and excel at producing the next token in a sequence of tokens” [
24].
We should also mention that this initial question was stated in the Slovenian language, which is exceptional. The real (i.e., human) group included eight members, who generated 95 ideas in 24 min. In our experiments, sets of 95 ideas were generated with different GPT temperature parameters.
Ideas generated by the OpenAI-GPT-3.5-turbo model were examined by the generation of the word cloud at different model temperatures. The temperature governs the randomness and, thus, the creativity of the responses. LLM predicts the next best word when the initial prompt is provided, one word at a time. The model assigns a probability to each word in the model vocabulary and picks among these words. With a temperature of 0, the variation in the selection of words is small; the algorithm tries to pick the word with the highest probability. A higher temperature would result in the selection of words with a slightly lower probability, which would lead to more variation, randomness, and creativity [
25]. If one wants to experiment and create many variations quickly, a high temperature is better.
With the “kresilnik” system, 8 × 95 = 760 ideas were generated at eight different temperatures of the GPT-3.5-turbo model (0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, and 1.75) in order to examine the functioning of the GPT-3.5-turbo model and the appropriateness of the results generated at different temperatures. We examined how different temperatures influenced the innovativeness of the ideas and whether the ideas might be applicable to the design of Gorenjska’s regional development plan.
In order to distinguish between the process of generating ideas by the human group and by AI, the entropy of the generated ideas can be used. In our previous research, the time needed by participants to generate ideas was recorded, enabling us to compare the generation process of humans versusAI. We also observed the frequency distributions of the entropy H as well as the correlation between the length of the generated ideas and the time needed, and the corresponding distributions.
The present research made a unique comparison between the human group and the AI system in the process of generating innovative ideas. The results will enable the development of novel information systems, enhance the methodology of brainstorming, and contribute to a means of detecting human-generated and artificially generated ideas.
The main original contributions of the study are the following: (a) a definition of the modified hybrid brainstorming process, where the human expert group is supported by AI; (b) the technical specifications of the novel design of the hybrid human-OpenAI system for generating ideas; (c) an analysis of the effects of variations in the temperature parameter of the OpenAI-GPT-3.5-turbo algorithm on the generated ideas in terms of entropy, the number of characters, their latency, and distribution; (d) a comparison of the ideas generated by the group of human experts and the ideas generated by AI; (e) a confirmation of the significant differences between the group of human experts and AI in terms of the entropy.
3. Results
The left part of
Figure 5 shows the word cloud for 95 ideas generated by Gorenjska’s regional development committee in 2018. On the right, the corresponding histogram of the top 20 words is shown with frequency on the x-axis. One can observe that the real committee proposed considering the “elderly”, while the ideas generated by the GPT API did not emphasize this topic. One should also consider that the topic of “human resources” could be understood specifically to be for the Gorenjska region in a particular timespan, i.e., 2018. The word cloud and corresponding histogram for the top 20 keywords presented in
Figure 5 can be applied as a reference point to compare the ideas generated by the OpenAI GPT API.
Figure 6 and
Figure 7 show the word clouds for 95 ideas generated by the OpenAI GPT API at temperatures of 0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, and 1.75. Word clouds are shown in the first column, with corresponding word histograms showing the frequencies of particular words in the right column. The image in the top left in
Figure 6 is for a temperature of 0; the figure on the right shows the corresponding word histogram for the top 20 words at a temperature of 0 etc., and the subfigures layout is similar in
Figure 7. At the start, when the temperature was low, several keywords were emphasized. The larger the text, the greater the frequency of those specific keywords. With increasing temperatures, the importance of the keywords became more evenly distributed, which might be better observed in corresponding word histograms. Interestingly, the word cloud and corresponding word histogram extracted meaningful keywords even at a temperature of 1.75, where a significant proportion of the generated text was somewhat random. Here, a word cloud with corresponding word histogram might be considered as an appropriate filtering method to extract helpful suggestions from the OpenAI GPT API.
One could observe that at a temperature of 1.5, the GPT API sometimes returns “hallucinational” results [
34], providing some English text phrases such as “Our instrumentalization however, may call for introducing mechanisms that bolostequally sitintparenthood and We”, some Chinese simplified text such as “此公海冬请OA原图帝劣因 不只有斢果”, fragments of computer code such as “if(this.href.indexOf(’#ghost’)>-1){bref=location.href.split(’#’)[0];anchor=csv”, and apparently random characters such as “ingmethodamaupzerðrrazmislekitoliko”.
Nevertheless, some ideas generated at the temperature of 1.5 were still impressively innovative and original, such as “Creating events and study programs at Gorenjska public schools that focus on the 21st century with the aim of creating a new generation force that can stay in the region and promote the sustainability and development of the wider community.”
At a temperature of 1.75, some of the generated ideas were barely useful. Here, the level of hallucination was very high, mixing different languages and apparently random strings of characters. An example of such ideas is: “Opening a local training center or support system for small and medium-sized enterprises. A joint program would be developed as IQSEJA” (“Odpiranje lokalnega izobraževalnega centra ali podpornega sistema za mala in srednja podjetja.Razviti bi bil skupni program kot IQSEJA-lta karton kurslarī bu verstandrevalktu Karivgenlassalirdizilani na ovo prodručb iz voulksi teridas tyukeućih. Toringg treneniiisisk…”).
The following case was similar: “We should take advantage of Gorenjska mountains because of winte123” (“Izkoristiti bi morali vrline gorenjsKIH gor zaradi zims123_k_o nivojsanju griDen motena na lv2cjvh rjugin vsxt-g mq523 kvula.zros bo232 @hm5276 zag917ki sod zb GAIM327-LKO.(Naše verige besedic ”---ccc---, “---rjuga(r/m/i.P)###imetnik bokeškrb --- z (#mrzlice)).EN: COVID KILLS". Lemma=GIBREL 326Q|#Y;;ISO-Sr(FSK/T31089/O2384ISR, osalz4M332%)”).
One could observe that the starting text of the idea, printed in italic, could be useful; however, the following text is somewhat random. The system could also generate emojis such as
,
, and
at a temperature of 1.75, e.g., “(TES(DSV29tkzaapmaEn
zypoBenBAcrahslAdSh | Chelah!
| Neveljavana Id
)”.
However, the randomness of the proposed ideas might trigger positive associations for the interacting subjects, providing new, innovative ideas.
At a temperature of 1.75, also some useful ideas might be generated, such as “Establishment of ‘mondene’ Gorenjska market place”, “Digital knowledge for Gorenjska competitiveness. To enable the introduction of informatics as a more comprehensive subject already in primary schools and the planned orientation of educated computer scientists…” followed once more by random text.
One could apply a filter after the generation of ideas that would extract useful text. Here, the real human subjects in the brainstorming session might filter out the good parts of the text. Certainly, the algorithm could be useful here, but it would be somewhat challenging to extract the good parts. Here, the possibility of the application of AI with the aid of human actors could be usefully exploited.
At a temperature of 0, the system returned ideas that were partly or even completely the same (printed in italic in the example below), although some parts of the idea could still be different: “The establishment of a regional centre for career development, which will provide free counseling and education for various target groups, from school and university students to the unemployed and employed, and connect employers with job seekers.
The establishment of a regional centre for career development and education, which will connect educational institutions, companies and other organizations and provide training, mentoring and opportunities to gain experience for young and experienced workers”.
However, at a temperature of 0, all 95 ideas started with the “The establishment of a regional center…” This might have been suggested by the previous regional development plans, which may have included such ideas.
One could observe that the ideas are somewhat repetitive, and their innovativeness might be questionable. All members of the real committees might cover all needed expert areas. However, if one compared the ideas produced by the smaller expert group to the set of ideas generated by GPT, it is possible that a particular important expert area would lack proper coverage. Therefore, GPT could be perceived as a useful tool to help participants consider the wider scope of the initial brainstorming question.
The usefulness of the generated ideas is always subjective, since the final decision regarding the acceptability of a particular idea is determined by the regional development committee’s expert group. However, in order to estimate whether certain ideas generated by “kresilnik” would be worth considering, we analyzed the 95 ideas generated by the regional development committee’s expert group and the 95 ideas generated by “kresilnik” with the temperature set to T = 1. The following keywords from the newly generated ideas were not present within the set of 95 ideas generated by the regional development committee’s expert group (here, we present only the keywords so as not to burden the text with the complete ideas):
- -
Digital literacy;
- -
E-learning;
- -
Sustainable tourism;
- -
Green industry;
- -
Education on artificial intelligence, blockchain, and autonomous vehicles;
- -
Establish connections with foreign experts and institutions;
- -
Digital skills;
- -
Soft skills;
- -
Educational camps for youth.
Whether these topics should be included in the regional development plan could be considered by each individual citizen living in the region. Some of them, such as “green industry”, are within the EU’s strategic “Green Deal” development plan. This might be something that each citizen of the EU might be interested in including in the regional development plan of his/her region. The quality of the ideas is, therefore, hard to determine and it is in the domain of the regional development committee’s expert group. However, if we look at the suggestions, even the proposed keywords provide some meaningful input into the brainstorming session.
In order to illustrate the potential usefulness of the generated ideas, the following idea might be considered, which was generated at a temperature of T = 1.5 and was not included, even partially, by the real regional development committee’s expert group: “Inclusion of the population with different ethnic and cultural backgrounds in decision-making about projects and their implementation with the aim of increasing community, cooperation, tolerance, understanding and greater equality in opportunities for career development”.
The judgment as to whether such an idea might be considered within a regional development plan is up to each reader.
For the set of generated ideas, the entropy of each idea was determined by Equation (1). The image on the left in
Figure 8 shows the increase in the average entropy H computed by Equation (1). One can observe an increase in entropy at temperatures of 1.5 and 1.75. This was mostly due to the addition of languages other than Slovenian in the results as well as the lengthier text generated. The right-hand side of
Figure 8 shows the corresponding average time (in seconds) needed to generate a particular idea. At the temperature of 0.75, one could observe a steady and somewhat exponential increase in the time needed to generate a particular idea.
The average entropy of the ideas generated by the GPT API ranged from H = 4.247 bits up to H = 4.579 bits, while the entropy of a human committee was H = 4.567 bits. In general, one could conclude that the entropy H increased with the temperature, which was more apparent at temperatures higher than 1. With a correlation coefficient of r2 = 0.97, one could conclude that the ideas with higher entropy (i.e., those that are more innovative) take up more time to generate.
Figure 9 shows histograms of the entropy
H of the human group and the OpenAI-generated sets of ideas from temperatures of 0 (temp0) to 1.75 (temp1.75). On the y-axis, the absolute frequency of the
H bins is shown. One can see the different shapes of the distributions.
Table 1 shows the results of the Shapiro–Wilk test for the human group and the results of OpenAI for temperatures from 0 to 1.75 with a step of 0.25. The W statistics are a measure of how well the ordered and standardized sample quantiles fit the standard normal quantiles. One can observe that distribution of entropy at temperatures of 0.5, 1, and 1.25 fulfils the criterion of normality.
While the distribution of the entropy of ideas generated by the human group was not Gaussian (normal), this might be an indicator that the set of ideas was artificially generated.
The nonparametric Mann–Whitney U-test was conducted to assess the similarity between the entropy of the ideas (H) of the human group and the ideas generated by OpenAI. At the level of p = 0.001, the entropy of none of the AI-generated sets matched the entropy of the human group’s ideas (U-stat: temp0 = 2353.0, temp0.25 = 2341.0, temp0.5 = 2114.0, temp0.75 = 2228.0, temp1 = 1851.0, temp1.25 = 1809.0, temp1.5 = 1253.0, temp1.75 = 708.0).
Figure 10 shows the correlation between the length of the generated ideas as measured by the number of characters and the time needed to generate the ideas in seconds. The linear trendline is also shown, along with the value of r
2 and
p-value. The first correlation subplot is for the human group (r
2 = 0.11,
p = 0.001). The r
2 for T0 was 0.01 with
p = 0.330, which might be attributed to the fact that the results of the OpenAI API were the most deterministic and that the time needed to generate the idea was partly dependent on the network’s latency. The time taken to generate the ideas at T0 was also the shortest. A similar situation can be seen at T0.25; however, here, a clear trend with r
2 = 0.47 could be observed where, at
p = 0.000, approximately 47% of the variation in times needed to generate ideas can be explained by the length of ideas. For other temperatures from T0.5 to T1.75, higher correlation coefficients could be observed with the value of
p = 0.000.
If we observe the correlation plots, a distinction can be made between the process where the human group was involved and that of OpenAI. If we look further into the differences in the process of generating ideas, we can inspect the distribution of the time needed for generating ideas, which is shown in
Figure 11. The x-axis of the graphs shows the time needed to generate ideas, while the y-axis presents the absolute frequency. One can observe that the distribution of the time needed was not symmetrical for the human group. One could expect an exponential distribution of the times needed to generate ideas.
We have performed the Lilliefors test [
35,
36,
37] for the distribution of times needed to generate a particular idea of the human group with results: h = 0,
p = 0.0636, k = 0.1068, c = 0.1101. A value of h = 0 indicates that the exponential distribution is a reasonable model for the data at the significance level
p = 0.05 since the obtained
p-value of the null hypothesis test is 0.0636. Here k and c represent aspects of the Kolmogorov–Smirnov (KS) statistics, which are commonly used in goodness-of-fit tests.
Only the distribution of times needed to generate ideas by the human group are exponentially distributed. All other distributions could not be considered as exponential according to performed Lilliefors tests.
Table 2 shows the mean, standard deviation, skewness, and kurtosis for nine distributions shown in
Figure 11 for the human group and GPT algorithm at temperatures between 0 and 1.75 with a step of 0.25. In the GPT algorithm, a temperature value of approximately 1.2 might be considered the threshold at which the GPT algorithm goes from “talking sense to talking nonsense” [
38]. Skewness represents the measure of the asymmetry of distribution, while kurtosis represents the measure of the distribution’s tail thickness. Up to the temperature 1.25, the absolute value of skewness of the distribution of the times needed to generate a particular idea by the GPT algorithm is smaller than the one of the human group, which is S = 2.862. At temperatures 1.5 and 1.75, the distributions become more asymmetrical again with the skewness value of 3.186 and 1.806, respectively. Kurtosis is the highest for the human group, indicating that a significant number of times needed to generate ideas are in the tail of exponential distribution. This means that in a few cases longer times are needed to generate ideas. This might also be confirmed from practice, where participants might generate one idea in approximately 10 min.
With a combination of the correlations of the length of the ideas and the time taken with the distribution, the distinction between human and artificial processes might become more precise.
A similar situation might be observed if we consider the distributions of the number of characters in the generated ideas, which are shown in
Figure 12. Again, the distribution of the human group was distinctively asymmetrical; however, according to the Lilliefors test, it is not exponential.
Table 3 shows the mean, standard deviation, skewness, and kurtosis for nine distributions shown in
Figure 12 for the human group and the GPT algorithm at different temperatures.
Here, the number of characters is considered. Again, up to the temperature 1.25, the absolute value of skewness of the distribution of the number of characters in a particular idea generated by the GPT algorithm is smaller than the one of the human group, which is S = 1.751. At temperatures 1.5 and 1.75, the distributions become more asymmetrical again, with the skewness value of 2.745 and 1.842, respectively.
Considering the distributions in the normal range of temperature for the GPT algorithm (0–1.25), one could expect more symmetry in times and character distributions in GPT-generated ideas than in the human group.
Through a combined analysis of the correlations of time needed to generate ideas and the length of the ideas in addition to an analysis of the distribution of the time, length, and entropy, we may be able to distinguish whether the underlying process is human or artificial.
The distribution of entropy is one of the proposed metrics that could enable us to compare human-generated sets of ideas and the ideas generated by the OpenAI-GPT-3.5-turbo model. This is important for the identification of ideas generated by the human groups versus the ideas generated by AI algorithms. Entropy by itself is a proposed metric that can be used to estimate the variability of the ideas generated by the OpenAI-GPT-3.5-turbo model. It has been shown that entropy is positively correlated with temperature values greater than 1.
The length of the generated ideas and the time needed for generating the ideas are technical factors that determine the performance of the OpenAI-GPT-3.5-turbo model. Nevertheless, these two metrics could be used to identify the ideas or responses generated by the human group versus the ideas generated by the AI algorithms.
These metrics are not meant to evaluate the ideas themselves. Ideas are, in the first phase, evaluated by the human actors, who in this case were members of the regional development committee. The ideas were realized in the real world; for example, the “Kovačnica Business Incubator” (
https://kovacnica.si/en/, accessed on 21 September 2023) was built in 2023. However, the quality of the proposed idea will ultimately be estimated within the next few years by the citizens of the Gorenjska region as taxpayers. We should mention that the proposed hybrid framework, the development of which has been described here, was prepared for the next round of the regional development plan of the Gorenjska region, which will be supported by the “kresilnik” hybrid brainstorming tool.
With the rise of AI, identifying the process of generating ideas is important not only for determining the ideas’ origin [
39,
40,
41,
42,
43,
44,
45] but also for better understanding the process of human innovation.
5. Discussion
The proposed system design was successfully tested through the generation of 760 ideas on the topic of the regional development plan. An examination of the ideas by content indicated that these ideas might be useful for the development of the real regional plan. In further research, the hybrid process of generating ideas should be tested by real committees, where the OpenAI GPT API would be applied in combination with human agents. This might contribute to a better set of innovative ideas [
48,
49] for the creation of the regional development plan. Some ideas appear to be quite useful and, even more surprisingly, focused on the Gorenjska region.
Generating ideas at different temperatures might be useful for focusing the committee at lower temperatures. At lower temperatures, the major keywords with higher relative frequencies could be extracted. On the other hand, higher temperatures provide a greater diversity of keywords and might be used as the seed for the generation of further innovative ideas by the human participants.
The explainability of the ideas generated by the GPT models is a challenging task even for OpenAI as the creator of the algorithm [
22], which may encourage future research in the area of: “Interpretability, explainability, and calibration, to address the current nature of ‘black-box’ AI models” [
22]. Explainability should, therefore, be included as part of the functionality in the next versions of the “kresilnik” brainstorming tool.
One should be aware that AI tools might generate ethically questionable ideas, such as “automated attendance with facial recognition” [
2]. This might be an important challenge for the application of the proposed methodology; however, it is expected that all the generated ideas will be evaluated by the expert group of the regional development committee. One could expect that ethically questionable ideas would be ranked last by the committee members, who are expected to follow high ethical standards. In the hybrid form of brainstorming, each generated idea is reviewed by a human expert and pushed to the group only if it is acceptable by the human expert; certainly, it could also be edited and augmented by the expert and later pushed to the group in its edited form. Hybrid work should, therefore, also address the ethical aspects of the proposed ideas as expected by the members of regional development committees.
Word clouds with the corresponding word histograms might be used as filters for extracting the major topics addressed by the initial brainstorming question.
With an increase in the temperature of the GPT model, one could observe a distinct exponential increase of entropy as well as the time needed to generate ideas. This is important for the further development of computational capabilities; creativity comes at a computational cost.
The distribution of the entropy of the ideas generated by the human group was not Gaussian (normal), which might be a useful indicator showing that a set of ideas was artificially generated.
At a temperature of T = 0, the ideas were very narrow; in most cases, parts of the ideas were repeated. On the other hand, at T = 1.75, the ideas might generate random text. However, even at T = 1.75, useful ideas might be generated.
A comparison of the characteristics of the idea generation processes of the real committee and AI provides the possibility of distinguishing between human and AI, which might be useful in many areas.
Since the results of the study are based on statistical methods, further analysis and a repetition of the experiments should be conducted. The final test of the proposed hybrid human–AI system’s architecture would be the evaluation of the ideas that are proposed, at least partially, by AI; to implement them in the real world and for them to be recognized by the public as positive. This would be a topic for further research. Nevertheless, the proposed methodology should be applied and tested in similar situations. The development of the field of LLM of AI is extremely fast [
22]. We were able to apply and analyze the OpenAI-GPT-3.5-turbo algorithm. By the time of writing, new versions of the GPT algorithms have been provided, which should also be explored and compared. The proposed architecture could nonetheless be applied to newly developed models. It is important that the architecture is flexible and universal. In our case, the JavaScript API calls provide possible future compatibility with the new cloud-based LLM AI algorithms.
With the proposed technical framework, novel decision support systems [
50,
51] could be developed, providing the possibility of integrating human and AI in the process of generating innovative ideas by leveraging the rapid progress in the field of AI [
21].