1. Introduction
In this essay, the authors discuss conversational systems (also called chatbots) within natural language processing (NLP) in machine learning (ML), from the point of view of the philosophy of science. The authors’ position on the theory of how science operates is one of naturalism [1]. Hence, the objective of this essay is to evaluate research activities on conversational systems in light of this philosophical theory. This theory of knowledge is similar to the precept and example of the now-defunct logical empiricism, which viewed only verifiable statements as meaningful [1]. One’s understanding of the way the world functions, or the theory used to explain observations, may influence what is perceived. Just as the scientific community holds on to certain assumptions, as alluded to by Kuhn [2], the conversational systems community is not exempt from such assumptions. The assumptions central to naturalism are a collection of beliefs and values that are untested by scientific processes. They nevertheless give legitimacy to scientific systems and set the boundaries of investigation. One such basic assumption is that a random sample is representative of an entire population [3]. This essay offers two possible benefits: it summarizes improvements made in the science of developing conversational systems, and it suggests that certain practices, such as the peer-review system and the use of qualitative, less biased, and large datasets, will bring further improvements.
Conversational systems are software systems that use natural language to communicate with users, either through written text or spoken dialogue [4]. The development of conversational systems began in the 1960s, with Eliza being a product of such early studies [5]. This was a turning point in artificial intelligence (AI), the imitation of human intelligence by software or hardware. AI is different from logical reasoning, problem-solving, or symbol manipulation. However, some members of the AI community will agree that logic plays some role in the plethora of AI research areas [6]. Machine learning, which has become popular over the past few decades, is a subset of AI concerned with learning patterns for making predictions or performing specific tasks by using algorithms and statistical models without explicit programming [7]. Learning takes place during training, with the aim of generalizing to ‘unseen’ data while avoiding overfitting (memorization) [8]. Natural language processing systems can be trained using text corpora (large, structured sets of texts) [9]. Examples of chatbots include Apple’s Siri and Google Assistant.
Alime Chat is another chatbot, developed by Alibaba researchers, mainly for Chinese [10]. It was developed for customer service operations at Alibaba¹ and can handle about 85% of the total customer service operation [10]. It is a hybrid chatbot that leverages the capabilities of both information retrieval (IR) and machine learning generation models. Both approaches are categorized as data-driven because they rely mainly on data sources [11]. The generation approach synthesizes novel sentences, word by word, based on a dialogue history and a persona (if included) [12,13]. Meanwhile, the information retrieval approach retrieves stored information, such as documents, images, speech, and video, from repositories [9,11,14]. Alime Chat research and its related studies were selected because they mark new trends in problem solving for conversational systems, and many of these systems are also used in industry. Indeed, Alime Chat currently answers millions of customers’ questions per day at Alibaba.
When a philosophy of science outlook is taken on a given research subject, there are at least two possibilities. The first is to look at the research activities in the discipline being studied and evaluate the various philosophical theories proposed about the functioning of science and its epistemological status. The second is to adhere to a particular theory of how science operates and evaluate the discipline’s activities against that theory. We chose the second approach. The remainder of the essay comprises the methodological issues section, the exposition of the chosen studies section, and the summary and conclusion section. The methodological issues section summarizes the approach and some metrics used in conversational systems research, while the exposition section discusses some of the research activities from the point of view of the philosophy of science. Finally, the summary and conclusion section reiterates the main features of the discussion.
3. Exposition of the Chosen Studies
According to Thagard, when we can deduce statements, based on observation, from an occurrence, then a theory around such an occurrence is verifiable [19]. For example, researchers in conversational systems, including those behind Alime Chat, conduct several experiments and collect data by observation to make inferences [10,11]. Inference refers to the process of drawing conclusions, sometimes after a statistical analysis has been carried out. Statistical analysis is the evaluation of data for the purpose of inference [3]. There are three main types of inference: deduction, induction, and abduction. In deductive inference, what is inferred is necessarily true, given true premises. Induction and abduction, by contrast, are non-necessary inferences [20]. In a comparative study, the performances of two or more systems are assessed on defined metrics (such as BLEU or GLEU), and the better or worse system is established from the average outcome of several observations. Hence, although individual observations may show the system with the lower average outperforming the system with the higher average, this is not sufficient to question the preeminence of the better system. Such a case can merely be seen as an anomaly, because one or a few cases out of many are not enough to invalidate a position that rests on an average over many instances.
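The comparative procedure described above can be illustrated with a minimal sketch. The scores below are invented for illustration and are not taken from any cited study; the function name `compare_by_average` is likewise hypothetical. The sketch ranks two systems by their mean score and separately counts the individual observations in which the system with the lower average came out ahead.

```python
from statistics import mean

# Hypothetical per-observation metric scores (BLEU-like values in [0, 1]).
# These numbers are invented for illustration and come from no study.
system_a = [0.42, 0.39, 0.45, 0.41, 0.18, 0.44, 0.40, 0.43]
system_b = [0.31, 0.30, 0.33, 0.29, 0.27, 0.32, 0.30, 0.31]


def compare_by_average(scores_a, scores_b):
    """Return the mean of each system and how often B beats A case by case."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per observation")
    b_wins = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    return mean(scores_a), mean(scores_b), b_wins


mean_a, mean_b, anomalies = compare_by_average(system_a, system_b)
better = "A" if mean_a > mean_b else "B"
print(f"mean A = {mean_a:.3f}, mean B = {mean_b:.3f}, better on average: {better}")
print(f"observations where B beat A: {anomalies} of {len(system_a)}")
```

In this made-up data, system B wins one observation, yet the ranking by average is unchanged, which is the sense in which such a case is treated as an anomaly.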
Methods of inquiry require objectivity in their approach. Objectivity, whose value and attainability have been repeatedly criticized in the philosophy of science, is usually regarded as the basis of the authority of science or the reason for valuing science [21]. It prescribes that the components of science (such as methods and claims) should not be influenced by personal interests, community bias, or other similar factors [21]. Product objectivity and process objectivity are the two basic ways of understanding objectivity. Product objectivity holds that science’s theories, experimental results (e.g., BLEU scores), observations, and similar products constitute accurate representations of the world [21,22]. Process objectivity is multi-faceted and holds that science is objective to the extent that its processes and methods do not depend on the scientist’s individual bias or contingent social values [21]. An examination of the several conceptions of the ideal of objectivity is outside the scope of this essay. However, it has been argued that the facts of science are necessarily perspectival because of the apparatus and sociological factors involved [21]. Hence, given that full objectivity may not be deliverable, the conversational systems community plays a key role in describing what constitutes objectivity, which brings about trust in the science, as part of the social process. Indeed, Longino admitted that her analysis was not meant to be complete but to provide a starting framework from which the community of epistemologists (philosophers of the theory of knowledge) could fill in further details [22].
Objectivity is a value which, as mentioned earlier, has been criticized extensively in the philosophy of science. Willingness to let the facts determine our beliefs marks our objectivity. This is a position Longino does not seem to be averse to [22]. However, her account casts possible suspicion on what constitutes “the fact”, which suggests that this needs to be carefully considered. For example, she suggests that the data used in a research experiment (which count as facts in that study) also need to be checked for reliability [22]. Hence, checking that the authors have interpreted the data in a way that is free of subjectivity is an important function of the peer-review process [22]. Furthermore, Longino rightly noted that possible institutional bias can be identified in the post-publication stage of a given idea [22]. This means that scientific publication should not be seen as the end. Attempts to reproduce experiments, and subsequent use and modification by others, are equally essential and can eventually compensate for institutional bias [22].
Conversational systems research makes use of the scientific method, which has process objectivity as its basis [21,22]. As Longino pointed out, the scientific method is the use of non-arbitrary and non-subjective criteria for developing, accepting, and rejecting a scientific view [22]. Since objectivity itself may not be fully attainable, this has an impact on scientific methods and, again, makes the conversational systems community relevant to prescribing what constitutes the scientific method. This view is supported by Longino, who identified two shifts in perspective related to the scientific method, the second being made possible by refocusing on “science as practice”. In her work, she proposes that this involves subjecting hypotheses and background assumptions to varieties of conceptual criticism [22]. Her point that the objectivity of scientific methods is a function of both observational data and background assumptions lends credence to practices in the conversational systems community [22]. Usually, researchers provide the methods used in conducting experiments for scrutiny, to ensure their external and internal validity. Such information gives assurance to the conversational systems community about the objectivity of the results and the data used. Therefore, statistical analyses of such data can also be seen as objective. For example, Alime Chat researchers clearly stated the source of the data used, the architecture of the network, and the steps involved in producing the experiments [10]. This is also the case in a related study by Song et al. [11]. Furthermore, Longino observed that experiments based on unstable, quickly evolving assumptions lack objectivity [22]. Hence, results shaped by observer effects, which may unduly influence research, are not objective. Methods employed in research should be a collection of social processes (such as the peer-review process for scientific publishing), as argued in [22]. This view is similar to Kuhn’s position on the acceptance or rejection of a paradigm, which he argued should be a social process as much as a logical one [2].
In research on conversational systems, the type and size of the data used for training influence the quality of the systems created. For example, a small dataset used as the underlying corpus will produce poor performance compared to a large dataset [9,11]. Similarly, a biased dataset (whether stereotyped or partial) will be reflected in the performance of a conversational system, as was witnessed with Microsoft’s chatbot Tay, which posted racist comments and conspiracy theories online after being exposed to users who (intentionally or unintentionally) exploited the chatbot’s sensitivity by posting many such comments and theories [23]. After valuable discussion with the anonymous reviewers of this essay, we should add that it is, in general, difficult to create an unbiased dataset. Indeed, in machine learning, a bias is typically needed to learn anything at all. The most crucial issue, however, is to remove unwanted or harmful biases, such as racist, gendered, societally discriminatory, or hate-speech entries. An example of creating a less biased dataset (in the context of an insurance company) would be to take all inquiries made by all customers (not only in chats, but also by phone calls and physical visits) and randomly select a subset of them. Public fora, such as conferences, workshops, and journals, provide avenues for criticism of research and its constituent parts. It is also through such avenues that shared standards can be learned and responses to criticism given. Despite concerns about the peer-review process in scientific publishing (such as the unwarranted blocking of publications), it is considered a very useful system for evaluating the objectivity of research methods and claims made in scientific papers [22]. It is a useful filter that assesses whether an article conforms to generally agreed guidelines provided by the research community. The various articles on conversational systems cited in this essay were published in peer-reviewed journals, which means they had been subject to some critical evaluation by members of the scientific community before being published.
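The insurance-company example in the paragraph above, which pools inquiries from every channel and then draws a random subset, might be sketched as follows. The inquiry log, the channel names, and the function `sample_inquiries` are all hypothetical placeholders; the only technique shown is uniform sampling without replacement over the pooled records.

```python
import random

# Hypothetical inquiry log: the channel names and records are placeholders.
inquiries = (
    [{"channel": "chat", "id": i} for i in range(600)]
    + [{"channel": "phone", "id": i} for i in range(600, 900)]
    + [{"channel": "visit", "id": i} for i in range(900, 1000)]
)


def sample_inquiries(records, k, seed=0):
    """Draw k records uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(records, k)


subset = sample_inquiries(inquiries, k=200)

# Count how many sampled inquiries came from each channel.
counts = {}
for record in subset:
    counts[record["channel"]] = counts.get(record["channel"], 0) + 1
print(counts)
```

Because the sample is drawn from the pooled records rather than from chat logs alone, each channel is expected to appear roughly in proportion to its share of all inquiries, although any single draw will deviate somewhat.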
In refuting conjectures, Popper was opposed to the procedure of inference from many observations [24]. However, claims made in conversational systems research are usually based on evidence from observations. This approach raises the concern, expressed by Popper, of how many observations are sufficient to avoid refutation. Furthermore, Lipton categorically states that this approach cannot be taken as proof [25]. Although abduction may be considered in a philosophical debate, the nature of the problem or debate plays an important part in its application, and some even consider induction to be a special type of abduction [20]. While we must be careful when drawing conclusions from empirical data, it is generally accepted that examples help in clarifying arguments and in empirical confirmation, and can increase the probability of the conclusion or claim. For example, Alime Chat researchers repeated 2136 tests in order to validate the high performance obtained by their system. Although Popper may have disagreed with this approach, the willingness of the conversational systems community to confirm or disconfirm their position, based on sufficient evidence, suggests that it is a reasonable approach. The willingness of the community to change, based on active research, is one of the scientific criteria alluded to by Thagard [19]. Lakatos may have approved of this approach, since according to him blind commitment is as serious a crime as any [26]. Researchers in the area of conversational systems are not blindly committed to the claims or theories made, but make strong efforts to ascertain the facts by reproducing experiments and, in some cases, even advance the field by trying out new methods. For instance, to determine whether their hypothesized hybrid system was better, the Alime Chat researchers developed a new hybrid system and ran tests comparable to those run on the old systems [10]. Song et al. similarly compared five architectures, including a baseline [11].
Confirmation by verification is not the only approach applicable in conversational systems, though it is sufficient for those who believe a theory is scientific only if it is verifiable [19]. The condition for refuting a claim can also be used. Popper states that for a claim or theory to be considered scientific, one should present a condition under which it can be falsified or refuted [24,26]. Such a test can be applied to some of the claims made in the conversational systems research community. For example, in order to compare Alime Chat with another chatbot in production, the researchers conducted 878 experiments on each of the chatbots [10]. For their claim that Alime Chat was better to be falsified, the researchers argued, the other chatbot had to win by conversing better (when answering questions, as evaluated by humans) in a majority of cases.
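The falsification condition just described can be written down as a simple check. This is only a sketch under assumptions: the function name `claim_is_falsified` is hypothetical, "majority" is read here as a strict majority, and the win counts passed in are invented, since the actual outcomes are not reported in this essay. Only the total of 878 comes from the text.

```python
def claim_is_falsified(rival_wins, total_comparisons):
    """True if the rival won a strict majority of the comparisons."""
    if not 0 <= rival_wins <= total_comparisons:
        raise ValueError("rival_wins must lie between 0 and total_comparisons")
    return rival_wins > total_comparisons / 2


TOTAL = 878  # number of experiments per chatbot reported in the text

# Hypothetical outcomes; the actual win counts are not given here.
print(claim_is_falsified(300, TOTAL))  # rival wins a minority of cases
print(claim_is_falsified(440, TOTAL))  # rival wins a majority of cases
```

Stating the condition in advance in this way is what makes the claim refutable in Popper's sense: the outcome that would count against it is fixed before the human evaluation is run.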