An Efficient Information Retrieval System Using Evolutionary Algorithms

Information retrieval (IR) is a critical technique for web search as the number of web pages keeps growing. However, web users face major problems: retrieved documents that are unrelated to the user query (i.e., low precision), relevant documents that are not retrieved (i.e., low recall), and the need for acceptable retrieval time and minimum storage space. This paper proposes a novel advanced document-indexing method (ADIM) with an integrated evolutionary algorithm. The proposed information retrieval system (IRS) includes three main stages. The first stage (i.e., the advanced document indexing method) is preprocessing, which consists of two steps: dataset document reading and the advanced document indexing method (ADIM), resulting in a set of two tables. The second stage is the query searching algorithm, which produces a set of words or keywords and retrieves the related documents. The third stage (i.e., the searching algorithm) consists of two steps. The modified genetic algorithm (MGA) proposes new fitness functions using a cross-point operator with dynamic-length chromosomes and the adaptive function of the culture algorithm (CA). The proposed system ranks the documents most relevant to the user query by adding a simple parameter ( ∝ ) to the fitness function to guarantee convergence of the solution, and it retrieves the documents most relevant to the user by integrating the MGA with the CA to achieve the best accuracy. The system was simulated using a free dataset called WebKb, which contains World Wide Web pages of computer science departments at multiple universities. The dataset is composed of 8280 semi-structured HTML documents. Experimental results and evaluation measurements showed 100% average precision with 98.5236% average recall for 50 test queries, while the average response time was 00.46.74.78 milliseconds with 18.8 MB of memory space for document indexing. In comparison, the proposed work outperforms the approaches reported in the literature.


Introduction
The field of information retrieval (IR) was born in the 1950s out of necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. IR finds textual documents that satisfy a user's information needs from within a substantial number of documents, which are commonly stored on computers. Nowadays, millions of people engage in IR when they use web search engines and their e-mail. Additionally, an information retrieval system (IRS) covers other kinds of data and information problems beyond those specified in the core definition. Data that do not have a clear and semantically explicit structure are called unstructured data. This is the opposite of structured data, for example, a relational database of the sort that firms use to maintain product inventories and staff records. Practically, no data are truly "unstructured." This is true of all text data if one counts the latent linguistic structure of human languages. Most text has structure, such as headings, paragraphs, and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). The IRS is also used to facilitate searches over "semi-structured" information, such as finding a document where the title contains Java and the body contains threading [1]. One goal of the IRS, which has special importance, is to satisfy the requirements of users with the most relevant documents. Different techniques and models have been designed to achieve this important goal [2]. One of the main tasks in IR is classification, which is well suited to machine learning. Learning algorithms use examples, attributes, and values, which IRSs supply in great quantities. Such learning includes three areas: symbolic learning, neural networks, and evolutionary computation-based algorithms [3,4]. Some of the most widely used learning-based IR models are evolutionary computing algorithms such as genetic algorithms (GA) [3].
In addition, a universal repository of knowledge and culture in the digital world has allowed the direct sharing of ideas and information at an unpredictable rate. So, there is a need to access the digital world's data in the form of documents. These documents are useful for sharing information with every user, which is considered information retrieval. It is a computerized process of producing a relevance-ranked list of documents based on an inquirer's request [1]. Therefore, in recent years, using textual datasets to speed up the execution of critical tasks has been part of many people's and organizations' daily routines. Most of the datasets include a massive collection of documents from a variety of sources, where datasets could be research papers, articles, news, digital libraries, books, messages, e-mails, or web pages. Due to the continuous development of information technology, the amount of information (datasets) keeps growing. Hence, finding the most satisfactory information in databases is increasingly challenging [2]. Data organization is required to conduct data update and query operations effectively. Indexing is one of the adopted methods, and different indexing techniques are being explored for the content [3]. Indexing techniques are essential to improve the dataset's performance and security. However, information retrieval (IR) indices have a variety of drawbacks, including huge index sizes, the inability to accumulate an output search, and potential security risks.
This paper is further motivated by two main problems that web users still encounter in IRSs when trying to retrieve documents related to a query. The first is that many of the highly ranked retrieved documents are not related to the request of the user. The main information retrieval problem is determining a document's relevance to the user's requirements. Practically, this issue is considered a ranking problem that needs to be solved based on the matching between all documents and the user query [5,6]. The second problem is that many related documents are found in the dataset but are not retrieved [7][8][9][10][11][12][13]. The relevant documents can be found by generating an optimal query [14][15][16][17][18][19]. An initial query is used with random or estimated weights, and new queries are generated by modifying the weights. Based on the identified techniques, the new retrieval model is evaluated by comparing it with some existing models in a controlled environment. However, these approaches are still insufficient [20][21][22][23]. Therefore, we propose the advanced document indexing method (ADIM) as a preprocessing stage that takes less time to build and produces a smaller index. The modified genetic algorithm (MGA) is then integrated with the culture algorithm (CA) for the first time, and the two algorithms are adapted and developed to work in IRSs to retrieve relevant documents for user queries.
Thus, the new indexing method builds on our proposed methods, which can deal with huge datasets and respond to queries in near-linear time with little I/O overhead.
To the best of the authors' knowledge, the targeted results have not yet been achieved. The novel work of this paper contributes to solving these problems by:


• Proposing a novel indexing technique called the advanced document indexing method (ADIM), applied to large IRS-indexed files and joined with the modified GA and CA to retrieve documents relevant to the user queries;
• Reducing the amount of storage required for the produced ADIM;
• Modifying the genetic algorithm (MGA) and integrating it with the culture algorithm (CA) to retrieve relevant documents.
The remainder of the paper is structured as follows: Section 2 includes a background and literature review. Section 3 explains the details of the proposed system (methodology). Section 4 illustrates the proposed system implementation. Section 5 presents the discussion and analysis of the results. Finally, Section 6 summarizes the conclusions and recommendations for further study.

Background and Literature Review
IR is the process whereby a user's information need is converted into an actual list of citations to documents in storage that contain information useful to the user. The IRS stores and manages the information on documents. The system helps users find the useful information they need. The IRS does not explicitly return the information; instead, it returns the existence and location of the documents that might contain that information. The IRS includes the relevant documents that satisfy the user's information need and excludes irrelevant documents [24][25][26]. Figure 1 depicts the general structure of the IRS. In Figure 1, the user and the system communicate with each other through queries and the retrieved set of documents. The information need is expressed in the most natural form of communication, which is called a request. The automatic query formulation takes the request as input and gives the initial query as output: some or all words in the request are converted to query terms by a trivial algorithm. Relevance feedback takes the initial query and some retrieved relevant or irrelevant documents as input and outputs a successive query. The next subsections describe the IRS techniques and methods.

IRSs Models
An information retrieval (IR) model describes how documents and user queries are represented, as well as the retrieval method or process. Three models make up the fundamental IR models (i.e., Boolean, vector space, and probabilistic) [27][28][29]. Table 1 describes each of them with its limitations.

Boolean
It uses set theory and Boolean algebra. It has three operators (i.e., NOT, OR, and AND) to form a query.
 It fails to rank the result list of retrieved documents. Each document is linked to a certain group of words or keywords.
 User queries can also be expressed as keyword phrases with AND, OR, or NOT separating them. The Boolean function classifies a document based on its relevance.

Vector Space
It aims to order documents according to how closely each one resembles the user query. Documents and user queries are represented as vectors, and the angle between the two vectors is measured using a cosine function. The vector space model introduced a term-weight scheme named tf-idf weighting.
 It is required to compute weights. Weights include a term frequency factor (tf), which counts the number of times a term appears in a document or user-query text, as well as an inverse document frequency factor (idf), which is inversely related to the number of documents in which a term appears.
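The tf-idf weighting and cosine ranking described above can be sketched in a few lines. This is a minimal illustration, not the weighting used by any particular system: the function names (`idf_table`, `weight`, `cosine`, `rank`), the logarithmic idf form, and the pre-tokenised input are all assumptions made for the example.

```python
import math
from collections import Counter

def idf_table(documents):
    """Inverse document frequency for every term in the collection."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}

def weight(tokens, idf):
    """tf-idf vector of a token list; terms unknown to the collection are ignored."""
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, documents):
    """Document indices ordered from most to least similar to the query."""
    idf = idf_table(documents)
    q_vec = weight(query, idf)
    scores = [(cosine(q_vec, weight(doc, idf)), i) for i, doc in enumerate(documents)]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]

docs = [["operating", "system", "design"],
        ["database", "system"],
        ["genetic", "algorithm"]]
print(rank(["operating", "system"], docs))
```

A document sharing both query terms ranks above one sharing only the more common term, which is the behaviour the idf factor is meant to produce.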

Probabilistic
It orders documents according to how likely they are to be relevant to a given user query. Binary vectors d and q represent documents and user queries, respectively.
 For the probabilistic model, index term weight variables are all binary.

Indexed Techniques
There are several popular indexing methods for IR, such as signature files and inverted indexes. Table 2 gives the description and limitations of these techniques.

Indexing technique: Signature File
General description: Compared to the original file, it is significantly smaller. It also has a greater search rate.
Limitations: It is used only in security systems.

Indexing technique: Inverted Index
General description: Every document is described by a list of keywords, which depicts the contents of the document for retrieval purposes. Fast retrieval is obtained if one can invert those keywords. All the reference words are stored alphabetically in a file called an index file. For each keyword, a list of pointers to the characterized documents can be kept in the postings file.
Limitations: It is used by all commercial systems; it can occupy large spaces in memory and requires a high execution time.
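As a rough illustration of the inverted index described above, the sketch below maps each keyword to the documents that contain it. The function name `build_inverted_index`, whitespace tokenisation, and lowercasing are assumptions made for the example; the index file and the postings file are collapsed into one in-memory dictionary.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Keywords are kept in alphabetical order, as in an index file.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

index = build_inverted_index({1: "Java threading", 2: "java basics"})
print(index["java"])
```

Looking up a keyword then costs one dictionary access instead of a scan over every document, at the price of the extra memory noted in the limitations.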

Evolutionary Models (Search Techniques) Overview
There are various search methods (i.e., linear, brute force, and binary) [30,31]. The search techniques and their disadvantages are described in Table 3.

Linear
It is a fundamental method for discovering a certain word or keyword from a list of words or array that sequentially and individually verifies each element's presence.It is the least complicated technique.


It is very slow, particularly when dealing with an ordered list.

Brute Force
It is a well-known technique that enumerates all candidate solutions and checks whether each candidate satisfies the problem statement.


Despite its simplicity, it returns a solution when found.

Binary
It finds the position of a particular input value (i.e., the search key) within an array sorted by some key value. The given array needs to be arranged in ascending or descending order. The key value of the middle element of the sorted array is compared to the search key value. When both keys' values match, a matching item is discovered and its index is returned. Otherwise, the procedure is repeated on the sub-array to the left or right of the middle element, depending on whether the search key value is smaller or greater than the key value of the middle element.


If the left-over array is empty, the search key will not be in the array, and a specific string is provided, indicating that an array bit is missing.
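The binary search procedure in Table 3 can be sketched as follows. The function name `binary_search` is illustrative, the array is assumed to be sorted in ascending order, and returning -1 for a missing key is an assumption standing in for the "specific string" mentioned above.

```python
def binary_search(sorted_items, key):
    """Return the index of key in an ascending sorted list, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == key:
            return mid            # both key values match
        if sorted_items[mid] < key:
            low = mid + 1         # continue in the right sub-array
        else:
            high = mid - 1        # continue in the left sub-array
    return -1                     # the remaining sub-array is empty

words = ["algorithm", "genetic", "query", "search"]
print(binary_search(words, "query"))
```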

Genetic Algorithms (GA) Overview and Related Work
The group of mathematical models known as GA is based on the ideas of natural selection and evolution. This heuristic is frequently employed to produce useful solutions to optimization and search problems. GAs encode candidate solutions to a selected problem as chromosome-like data and evolve them using selection, recombination, and mutation.
Typically, a randomized set of chromosomes is used to start the GA process. These chromosomes represent the problem that needs to be solved. Depending on the characteristics of the problem, each chromosome's positions are encoded with numerical, alphabetic, or bit values. The positions denote genes; they change randomly throughout the evolution process. A population is the group of chromosomes present during an evolutionary stage. The evaluation function determines each chromosome's fitness. There are two main considerations during evaluation. In order to imitate population transformation and natural breeding, crossover is performed. Chromosomal selection favors the fittest chromosomes for survival and recombination. In the form of penalty functions or by utilizing various chromosome-coding schemes created expressly for a given problem, GAs can accommodate virtually any type of constraint. Figure 2 depicts a simple GA structure. Many publications have introduced advanced interactive genetic algorithms (IGA) to improve IR.
Lee et al. [32] presented sparse fitness evaluation with an interactive genetic algorithm (IGA) to reduce the user's burden. Clustering was adopted to split the population into multiple subpopulations. A representative individual was selected from each subpopulation, and the fitness values of the remaining individuals were determined based on the representative's value and their distance from it. Lee et al. applied sparse fitness using a population size of 100 and 10 clusters.
A different IGA using a paired comparison (PC-IGA) was implemented by Watanabe et al. [33]. It asks the user to compare two individuals and select the better one. The selected individual keeps competing until the best individual is obtained. The experiments ran for 7000 generations with different values of the fitness function (FF).
An IGA based on fuzzy logic was suggested by Sun et al. [34]. The paper presents fuzzy and stochastic fitness of individuals to replace the user's evaluation of each individual. It reduces the user's burden by using a fuzzy number for fitness allocation. Furthermore, Sun et al. suggested building a surrogate model to evaluate individual fitness in order to improve IEC or to replace user evaluation, with different ranking distances (i.e., 10 and 20) and different values of FF.
Additionally, Wang et al. [35] proposed an IGA using the support vector machine, which is used to construct a classifier that takes advantage of the examples initially selected by the user. The method reduced user fatigue when using a population size of 250. The experimental results reached 94% recall, and the execution took many hours.
Moreover, a model of adaptive learning evaluation to assess beauty in an evolutionary art system was proposed by Li et al. [36]. It extracts specific features from evolutionary images and real paintings. An accurate learning method is then selected and trained on these features to establish a model. Another strategy was adopted for modifying evolutionary operators in IEC. The population size was 100, whereas the mutation rate was 0.5, with a high error rate.

Culture Algorithms (CA) Overview
Another evolutionary-based method is called a "cultural algorithm" (CA), which uses culture as a motivation to store related data that may be accessed by everyone in the population throughout several generations [37][38][39]. Culture may be seen as a dynamic source of information that shapes the social norms that diverse groups of people adhere to. Culture, like human civilizations, evolves with time, yet it offers a starting point for analyzing and recording a person's behavior within a community. CAs were created to simulate how the cultural component changes with time as it learns and gains knowledge. The belief space serves as a knowledge conduit between the evolving generations in CAs, which can be seen as an extension of genetic algorithms.
According to this methodology, CAs can be applied in numerous fields to guide self-adaptation across evolutionary systems. CAs are used as one of the specific models for evolutionary computation. Re-engineering of commercial rule-based expert systems was carried out using CAs, and knowledge discovery systems utilizing decision trees adopted a top-down strategy for reducing networks' complexity and enhancing performance. Due to the inherent complexity of real-world issues, semantic networks are being used to depict the numerous relations that occur within various problem contexts [37,38]. Figure 3 depicts the CA components.
To the best of the authors' knowledge, none of the previous research has investigated using the genetic algorithm and culture algorithm together in IRS.

Methodology
The proposed IRS includes three stages. Stage 1 is the advanced document indexing method (ADIM), which is used to prepare the WebKb dataset to maintain the high performance of the IRS (the dataset requires preprocessing in order for the algorithm to work efficiently on accurate and reliable data). The second stage is query search processing, and the final stage is the evolutionary algorithms (i.e., genetic algorithm and culture algorithm) as an integration of two ML techniques. GAs are used to solve an optimization problem and are applied to reduce the overhead during classification. CAs are also used for classification by applying the evaluation stage (i.e., precision, recall, and accuracy). The next subsections explain these stages.

Advanced Document Indexing Method (ADIM) Stage
This stage is composed of two main steps: WebKb dataset reading and ADIM. The main steps of ADIM are shown in the flowchart in Figure 4.

WebKb Dataset Reading
This dataset contains World Wide Web pages obtained from computer science departments at several universities. WebKb is composed of 8280 semi-structured HTML documents. These documents are organized in seven directories: department, students, staff, faculty, projects, courses, and others, as shown in Table 4. Each directory includes five classes with the university names, as shown in Table 5. In Table 6, the most important tags are given a high weight to retrieve the relevant documents in the shortest amount of time. The title is a significant tag because it contains terms close to the request of the user query (UQ), appears only once, and is not repeated; hence, its weight is six. The weight of each header and sub-header (h1, h2, and h3) is five, the weight is four for an anchor tag, three for italics (I) and bold (B), and one for the body.
Then, the page-information table is created to find the total tag weight, given by Equation (1):
$$\text{Total-tag-weight} = W_{\text{tag}} + \sum W_{\text{tag}} \qquad (1)$$
where $W_{\text{tag}}$ is the weight of the tag in which the word appears.
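A small sketch may help make the tag weighting concrete. It uses the weights listed above (title 6, h1 to h3 5, anchor 4, italics and bold 3, body 1) and reads Equation (1) as the sum of the weights of all tags in which a word occurs on a page. That reading, the name `total_tag_weight`, and the weight of zero for unlisted tags are assumptions of this example, not a statement of the authors' implementation.

```python
# Tag weights as listed in the text; any tag not listed contributes 0 (assumption).
TAG_WEIGHTS = {"title": 6, "h1": 5, "h2": 5, "h3": 5, "a": 4, "i": 3, "b": 3, "body": 1}

def total_tag_weight(tags):
    """Sum the weights of the tags in which a word occurs on one page."""
    return sum(TAG_WEIGHTS.get(tag.lower(), 0) for tag in tags)

# A word found once in the title and once in the body of a page.
print(total_tag_weight(["title", "body"]))
```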

Query Search Processing Stage
The proposed system converts the user query string into a set of words stored in an array. It fetches each word and its page list from the word-information table, saving them in an array called query words (QW). The information of each word in the database is acquired, and the page list of each word is merged with the page lists of the other words of the entered query into one array called the ID list. The ID list is used as the initial generation of the GA. It may contain repeated IDs, which is useful in the initial generation because the most frequent IDs belong to the pages that contain more of the query words; this increases their probability in the population. Algorithm 2 describes the main steps of the query search algorithm (QS).

Algorithm 2. Query search algorithm (QS)
Begin
  Initialize the lists QW, X, and UQ.
  Loop // For i = 0 to UQ length
    Split the user query (string) into words and store them in X.
  Stop when UQ is completed.
  Loop // For each word W in X
    Obtain the word and its page list from the word-information table.
    Create QW (word, page list).
    ID-list [pages-list] = merge the page list of each word with the page lists of the other words of the entered query.
  Stop when the criteria are met.
  Return the ID-list array.
End
Algorithm 2 begins by initializing the lists QW, X, and UQ. It splits UQ into words and stores these words in the list X. For each word in X, it obtains the word and its page list from the word-information table, creates QW (word, page list), and merges the page list of each word with the page lists of the other words of the entered query to obtain the ID list (page list).
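The steps of Algorithm 2 can be sketched as follows. The word-information table is modelled as a plain dictionary from word to page list, and the names `query_search` and `word_table`, as well as the lowercasing of the query, are assumptions made for this illustration.

```python
def query_search(user_query, word_table):
    """Return QW (word -> page list) and the merged ID list for a query string."""
    words = user_query.lower().split()      # X: the words of the user query
    query_words = {}                        # QW: word and its page list
    id_list = []                            # merged page lists of all query words
    for word in words:
        pages = word_table.get(word, [])    # page list from the word-information table
        query_words[word] = pages
        id_list.extend(pages)               # repeated IDs are kept on purpose
    return query_words, id_list

table = {
    "operating": ["0-1-362", "0-3-53"],
    "system": ["0-1-362", "0-2-18"],
}
qw, ids = query_search("Operating System", table)
print(ids)
```

Page "0-1-362" appears twice in the resulting ID list because it contains both query words, which is the repetition the text describes as useful for the initial generation.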

Modified Genetic Algorithm (MGA)
The GA was modified and integrated with the CA to adapt it to IR optimization problems. It has several parameters that are implemented in various ways. The GA operators (i.e., initial generation, fitness function, parent selection, crossover, and mutation) are summarized in Algorithm 3. The following subsections describe the details of these operators.

Algorithm 3. Modified genetic algorithm (MGA)
Begin
  Create the initial generation.
  In the first generation, evaluate the FF for each chromosome.
  Loop
    Apply the following operators to create the next generation: parent selection, crossover, and mutation.
  Stop when the criteria are met.
  From the last generation, return the best chromosome's content.
End

A. Initial Generation
Each chromosome is represented as a set of pages using the page ID, which is a list of three integers, i.e., the university, directory, and page code.The chromosome length represents the number of shared pages between the query and QW, representing the ID list.

B. Fitness-Function
The GA produces several generations before reaching the best solution. Starting from the initial generation, these generations are obtained by applying the GA operators (i.e., selection, crossover, and mutation). The fitness function (FF) is used to assess performance, that is, to evaluate the relevance of documents to the user query. The FF is used in two GA operators: selection and mutation. The proposed FF for this system is shown in Equation (2).
where C is the chromosome, g is the gene (page), SP is the number of shared pages (pages that contain the words of the user query), SPT is the number of shared pages together (pages that contain the words of the user query together), WOT is the weight of tags, from 1 to 6 (HTML, head, sub-header (h1, h2, h3), and body), PID is the page ID (each page that contains the words of the user query together), and ∝ is a simple value that guarantees that all pages relevant to the user query remain at the first position of each chromosome, given by:

All equation parameters are detailed as follows.

SPT represents the number of shared pages that contain the words of the user query together. All pages from the initial generation are checked to find which pages contain the words of the user query together. Each joint appearance of the query words increases the page fitness value by 0.1, a value determined by experiment. In the following example, a chromosome consists of six shared pages, only three of which contain the words of the user query together. When the total weight and word count are computed from the page-information table, the FF value is 1100:

0-1-362, 0-3-53, 0-2-18, 0-1-383, 0-1-399, 0-4-45    FF = 1100

If the query words appear together in these pages, the FF value is increased by 0.1 for each appearance of the query. Suppose the query words appear four times in these three pages; the FF value is then computed as follows:
First appearance: 1100 + 0.1 = 1100.1
Second appearance: 1100.1 + 0.1 = 1100.2
Third appearance: 1100.2 + 0.1 = 1100.3
Fourth appearance: 1100.3 + 0.1 = 1100.4
This is the final FF value of the chromosome, and the chromosome becomes:

0-1-362, 0-3-53, 0-2-18, 0-1-383, 0-1-399, 0-4-45    FF = 1100.4

WOT represents an integer from 1 to 6, the weight of the tag in which the query words appear. For each word of the user query, WOT assigns a weight according to the tag in which it occurs. For example, if the query is "operating system" and these query words appear together in the title tag, the fitness increases by six because the weight of the title tag is six. If the query appears in the body of another page, the increase is one, as the weight of the body is one, and accordingly for other queries.
Suppose the FF value of a page that contains the query "operating system" is 1000 and the page contains this query in the title; this value is then increased by six:
FF = 1000 + 6 = 1006
Suppose this query appears in the body of another page; the FF value is then:
FF = 1000 + 1 = 1001
The WOT gives more importance to a query appearing in the title than to a query appearing in the body.
PID represents the page ID. For each page that contains all query words together, it is multiplied by a value of 0.5, determined by experiment, to ensure that the fitness of this page stays high.
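The two worked adjustments above (0.1 per joint appearance of the query words, and the tag weight WOT) can be reproduced with a short sketch. It covers only these two increments and is not the full fitness function of Equation (2); the function names and the rounding to one decimal place are assumptions made for the example.

```python
def add_joint_appearance_bonus(base_fitness, appearances, step=0.1):
    """Add `step` to the fitness for each joint appearance of the query words."""
    # Rounding to one decimal keeps the result free of floating-point noise.
    return round(base_fitness + step * appearances, 1)

def add_tag_weight(base_fitness, tag_weight):
    """Add the weight (1 to 6) of the tag in which the query appears."""
    return base_fitness + tag_weight

print(add_joint_appearance_bonus(1100, 4))   # four joint appearances
print(add_tag_weight(1000, 6))               # query found in the title
print(add_tag_weight(1000, 1))               # query found in the body
```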

C. Parent Selection
Parent selection is one of the three main operators of the GA and is used to produce the next generation. Parent selection is controlled by the FF. Selecting high-quality parents ensures a higher probability of copying the best individual to the next generation and producing better offspring. There are several selection methods. The first method is to select parents randomly without any restrictions; this type is simple. However, this method has some disadvantages, such as allowing irrelevant documents to be selected. The second method is tournament selection, which involves choosing a set of individuals randomly from the population. The third method is truncation selection, in which a threshold determines the fraction of the best individuals that are selected. The fourth method is Genitor selection, which works individual by individual, selecting offspring for birth according to linear ranking and choosing the current worst individual for replacement. The advantage of simple random sampling over proportional selection is that it allows some weaker solutions to survive the selection process. These solutions may include components that could prove useful in recombination. Furthermore, truncation selection may become stuck at local optima and fail to converge from the initially selected chromosomes.
Finally, the elitism method is used in the MGA for parent selection. This means that the best individual is copied into the next generation. This technique allows only strong individuals to participate in generating solutions and excludes weak individuals. The best individuals have the maximum fitness. This progressively improves the GA performance.
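Elitism as described above can be sketched as picking the fittest chromosomes and carrying them over unchanged. The function name `select_elite`, the `elite_count` parameter, and the example fitness values are hypothetical; the number of elites kept by the MGA is not stated in the text.

```python
def select_elite(population, fitness, elite_count=1):
    """Return the `elite_count` chromosomes with the highest fitness."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[:elite_count]

# Chromosomes are tuples of page IDs; the fitness values are illustrative.
scores = {
    ("0-1-362", "0-3-53"): 1100.4,
    ("0-2-18", "0-1-383"): 1006,
    ("0-1-399", "0-4-45"): 1001,
}
print(select_elite(list(scores), scores.get, elite_count=2))
```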

D. Crossover Operator (Single-Point)
A simple single-point crossover is used in the proposed system to exchange the right part of the first chromosome with the right part of the second chromosome. It starts by determining the cross-point (CP) value, which is calculated in Equation (4). CP splits the chromosome into two parts (left and right). Each gene on the first chromosome is then compared with the genes on the second chromosome. If the gene is also present on the second chromosome, it stays on the first chromosome and is not exchanged. However, if the gene is not found on the second chromosome, it can be exchanged with the corresponding gene of the second chromosome, and accordingly for the other genes. Ultimately, fitness values can increase or decrease according to the gene exchange. Algorithm 4 and Figure 6 illustrate the proposed single-point crossover operation.
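One possible reading of this crossover is sketched below. The cross-point `cp` is passed in as an argument rather than computed, the exchange is position-wise over the overlapping part of the two right-hand sides (so chromosomes of different lengths are handled), and the function name is illustrative; all three choices are assumptions of this sketch rather than details confirmed by the text.

```python
def single_point_crossover(parent_a, parent_b, cp):
    """Exchange right-part genes of parent_a that do not occur in parent_b."""
    child_a, child_b = list(parent_a), list(parent_b)
    genes_b = set(parent_b)
    limit = min(len(child_a), len(child_b))   # chromosomes may differ in length
    for i in range(cp, limit):
        if child_a[i] in genes_b:
            continue                          # shared gene: it stays where it is
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

a = ["0-1-362", "0-3-53", "0-2-18", "0-1-383"]
b = ["0-1-362", "0-4-45", "0-2-18", "0-1-399"]
print(single_point_crossover(a, b, cp=2))
```

In this example only the last gene is exchanged, because "0-2-18" occurs on both parents and therefore stays in place.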

E. Mutation Operator
To avoid local convergence, mutation is applied with a tiny probability (1% in the presented work). In mutation, one gene is selected randomly from the generated offspring and replaced with one from the search space. Figure 7 demonstrates this operator.
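The mutation operator can be sketched as follows. The 1% rate comes from the text; the function name `mutate`, the uniform choice of the gene and of its replacement, and the injectable random generator are assumptions made for the example.

```python
import random

def mutate(chromosome, search_space, rate=0.01, rng=random):
    """With probability `rate`, replace one random gene by one from the search space."""
    child = list(chromosome)
    if child and search_space and rng.random() < rate:
        position = rng.randrange(len(child))
        child[position] = rng.choice(search_space)
    return child

offspring = ["0-1-362", "0-3-53", "0-2-18"]
space = ["0-4-45", "0-1-399"]
print(mutate(offspring, space, rate=0.01, rng=random.Random(0)))
```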

F. Stopping Criteria
The MGA operation is iterative. Every iteration produces a generation, which consists of a group of solution chromosomes. The number of iterations was 15. Thus, the MGA stops either after 15 iterations or when the solution no longer improves. Improvement is measured using the predefined FF, so a lack of improvement means a lack of accuracy improvement. We assumed a threshold of 0.05 for the difference in accuracy between two consecutive generations; whenever the difference is smaller than this threshold, the process stops.
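The stopping rule can be expressed as a small predicate over the accuracy values recorded per generation. The limits of 15 iterations and 0.05 come from the text; the function name `should_stop`, the use of the absolute difference, and the `history` list are assumptions of this sketch.

```python
def should_stop(history, max_iterations=15, threshold=0.05):
    """Stop after max_iterations generations or when accuracy stops improving."""
    if len(history) >= max_iterations:
        return True
    if len(history) >= 2 and abs(history[-1] - history[-2]) < threshold:
        return True   # difference between two consecutive generations is too small
    return False

print(should_stop([0.50, 0.70]))   # still improving
print(should_stop([0.70, 0.72]))   # improvement below the threshold
```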

Culture Algorithm (MCA)
After the MGA has generated the initial population, the CA uses the training set for the adaptation process to achieve high-performance results. Algorithm 5 gives the details of this algorithm.

GA and CA Integration
We apply the ADIM (the preprocessing stage explained in Section 3.1), the QSA (the query search algorithm explained in Section 3.2), and the evolutionary algorithms. The MGA is used to select the best chromosome, as the data consist of high-dimensional features with seven different classes that are used to extract the prepared data. If the fitness function percentage is not acceptable, the CA parameters are changed and a new accuracy is calculated. "Not acceptable" means that the accuracy is less than the threshold. For this work, the threshold was set to the minimum accuracy achieved by other researchers, which was 90.31% [40]. If the results are accepted, they are saved for future comparison with the results obtained from other iterations. The following step examines the iteration number. If it is greater than the threshold, the result is stored and the process stops; otherwise, the process returns to the first step. Figure 8 shows the flowchart of the presented integration system. In the proposed work, the GA at the first stage was run for 15 generations. At each generation, the FF is examined. The proposed hybrid IRS algorithm is presented in Algorithm 5.

Algorithm 5. Integration between MGA and CA
  Read the WebKb dataset and apply ADIM (Algorithm 1) to obtain two tables, and split the data into two parts: a training set and a testing set.
  Loop
    For each training and testing set
      Create an initial generation.
    End for
    Apply CA to the current training set with a specific adaptations loop.
    Evaluate the performance of the results using the FF.

Implementation
The performance of the proposed system is evaluated using the following measures: accuracy, recall, precision, and F1-score. It is implemented in Visual Basic .NET 2019 and runs on a Core i7 CPU with 64-bit Windows 11. The following subsections explain the details of the implementation.
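For reference, precision, recall, and F1-score for a single query can be computed from the retrieved and relevant document sets using their standard definitions, as in the sketch below. The function name `evaluate` and the document IDs are illustrative; accuracy is left out because it also needs the number of irrelevant documents that were correctly not retrieved.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F1) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Two documents retrieved, both relevant, out of four relevant documents.
print(evaluate(["d1", "d2"], ["d1", "d2", "d3", "d4"]))
```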

ADIM Experimental Results and Memory Efficiency
The proposed method indexes all meaningful words and adds the required information. This information is contained within two tables; the first table stores words and their page lists. Each page list includes an ID list, in which every ID encodes three entities (directory, university, and page). This ID reduces the searching time, so the related documents are retrieved faster. The memory space is reduced compared to the traditional method by removing each word outside tags and each word attached to numbers or a special character, e.g., operating 565. The traditional method requires 2 bytes (2 × 8 = 16 bits) for each word in each read document. With 67,672 total words and 8280 documents, the required space is 67,672 × 8280 × 16 = 8,965,186,560 bits (about 1.12 GB), whereas the proposed system requires only 18.8 MB to store the data in memory, as shown in Figure 9. Therefore, as the first contribution, ADIM reduces the time and memory space of the documents index by removing special words, sentence delimiters, and stop words.
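The traditional-method figure can be checked with a short calculation. The sketch below assumes that the traditional index is a dense word-by-document matrix with 16 bits per cell; this reading is an assumption, but under it the product reproduces the figure 8,965,186,560 exactly, as a count of bits rather than megabytes. The constant names are illustrative.

```python
VOCABULARY_SIZE = 67_672   # total words reported for the dataset
DOCUMENT_COUNT = 8_280     # documents in the WebKb dataset
BITS_PER_CELL = 2 * 8      # 2 bytes per word-document entry

# Assumed model: one 16-bit cell for every (word, document) pair.
dense_bits = VOCABULARY_SIZE * DOCUMENT_COUNT * BITS_PER_CELL
dense_megabytes = dense_bits / 8 / 1_000_000   # decimal megabytes
adim_megabytes = 18.8                          # reported ADIM footprint

print(dense_bits)                              # 8965186560
print(round(dense_megabytes, 1))
print(round(dense_megabytes / adim_megabytes, 1))
```

Under this assumption the dense index is on the order of a gigabyte, so the reported 18.8 MB corresponds to a reduction of well over an order of magnitude.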

Query Length Producer and Results
Since a one-word query is not meaningful, 50 queries of two to four words were prepared specifically to test the proposed system, each with a specified number of related documents. Table 7 lists these queries with their lengths and numbers of related documents. These numbers are used to evaluate the accuracy and performance of the proposed system in discovering all the documents related to each query. Two main steps are applied: the first step (intersection of webpages with the entered query words) finds and references all the documents that have all the user query words; the second step (selection of shared pages only) filters these documents to select only those that contain the query words. The last step of the proposed system returns the documents relevant to the user query that have a high FF value in the last iteration (iteration 15). As an example, consider the following query:

Query: ANALYSIS CRYPTOGRAPHIC PROTOCOLS

This query has four documents of shared pages, but only one relevant document contains all the words of the query together.
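The intersection step can be illustrated with a small word-to-pages table. This is a sketch only: `shared_pages`, the toy index, and the page IDs `p1` to `p4` are hypothetical, and the sketch covers just the set intersection, not the subsequent fitness-based selection.

```python
def shared_pages(word_index, query):
    """Return the set of page IDs that contain every word of the query."""
    words = query.lower().split()
    page_sets = [set(word_index.get(word, ())) for word in words]
    if not page_sets:
        return set()
    return set.intersection(*page_sets)


# Toy words-information table: word -> pages list (hypothetical IDs).
toy_index = {
    "analysis": ["p1", "p2", "p3"],
    "cryptographic": ["p2", "p3"],
    "protocols": ["p3", "p4"],
}

result = shared_pages(toy_index, "ANALYSIS CRYPTOGRAPHIC PROTOCOLS")
print(result)
```

Because every query word must appear in a page's list, a word that is absent from the index yields an empty result, which matches the requirement that a shared page contains all the query words.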

Modified GA Procedure and Results
A modified GA is integrated with a CA in the proposed IRS (the second contribution). The modified GA is based on a proposed FF, a random selection mechanism, an elitism mechanism for parent selection with the proposed single-point crossover, and ordinary mutation with dynamic-length chromosomes. The optimal solution of the proposed system is achieved when the GA parameters are set to the values in Table 8.
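A compact sketch of these operators follows. It makes several assumptions that the text does not confirm: a chromosome is taken to be a list of document IDs, "dynamic length" is modeled by cutting each parent at its own point, elitism is modeled as carrying the fittest chromosome into the next generation, and the fitness used in the example (the number of relevant documents in a chromosome) is a hypothetical stand-in for the proposed FF. All function names are illustrative.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Cut each parent at its own point, so children may differ in length."""
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2


def mutate(chromosome, gene_pool, rate, rng):
    """Ordinary mutation: replace each gene with a random one at the given rate."""
    return [rng.choice(gene_pool) if rng.random() < rate else gene
            for gene in chromosome]


def next_generation(population, fitness, gene_pool, rng, elite=1, rate=0.05):
    """Keep the fittest chromosomes (elitism), fill the rest with offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    new_population = ranked[:elite]
    while len(new_population) < len(population):
        parent_a, parent_b = rng.sample(population, 2)   # random selection
        for child in single_point_crossover(parent_a, parent_b, rng):
            if len(new_population) < len(population):
                new_population.append(mutate(child, gene_pool, rate, rng))
    return new_population


rng = random.Random(0)
relevant = {"d1", "d2", "d3"}                      # hypothetical relevance set
gene_pool = ["d1", "d2", "d3", "d4", "d5", "d6"]
toy_fitness = lambda chromosome: len(set(chromosome) & relevant)
population = [["d4", "d5"], ["d1", "d6", "d5"],
              ["d2", "d3", "d4", "d6"], ["d5", "d6"]]
new_population = next_generation(population, toy_fitness, gene_pool, rng)
```

Every chromosome needs at least two genes for the cut points to be valid. The combined length of the two children always equals the combined length of the parents, while each child's own length can change.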

Experimental Result of Population Size and Discussion
The modified GA is evaluated with different population sizes (25, 50, and 75) by the recall measure using 20 queries. This list of queries is tested on the 8280 documents of the dataset with 15 iterations, as shown in Table 9. Table 10 shows the recall for these sizes, while the precision is always 100%; we observed that the recall with a population of 75 is better than with the other two sizes (25 and 50). Figure 10 depicts the number of retrieved relevant documents for these population sizes, and Figure 11 shows their accuracy. The effectiveness of the proposed system is evaluated based on the precision and recall measurements. Table 10 shows the test results of the proposed system for 50 different queries with different lengths. In Table 10, the results of the proposed system (ADIM with MGA integrated with CA) show that the average recall is 98.5%, the precision is 100%, and the response time is 0.467 milliseconds. These results indicate that the proposed system is efficient in retrieving documents relevant to user queries, satisfying the two important principles of accuracy and fast retrieval time.
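For reference, precision and recall for a single query can be computed from the retrieved and relevant document sets using the standard definitions. The function name and the toy document IDs below are illustrative, not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy query: every retrieved document is relevant, but one relevant
# document is missed.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3"],
    relevant=["d1", "d2", "d3", "d4"],
)
print(precision, recall)   # 1.0 0.75
```

The toy case mirrors the pattern reported above: precision stays at 100% when nothing irrelevant is returned, while recall drops below 100% as soon as one relevant document is not retrieved.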

Comparison with Other Studies
A comparison study was carried out between the proposed IRS, traditional IR, and related work. Table 11 shows that the proposed IRS requires less memory storage and has a shorter response time for document indexing, with high recall and precision values. These values demonstrate the effectiveness and performance of the proposed system.

Discussions and Analysis
Through several experiments, a set of evaluation measurements, and comparisons with other related works, the following results were obtained:
(a) The proposed ADIM reduced the document indexing time and the memory space by using simple processes: removing special words, sentence delimiters, and stop words.
(b) Fifty queries are a suitable number to test the proposed system in terms of performance and accuracy. In addition, the queries have different numbers of related documents (small, medium, and large).
(c) The suitable population size for the MGA is 75, which retrieves all the possible documents related to the user query or a number of results close to the number of these relevant documents.
(d) The CA applies the adaptation process when the resulting accuracy is not sufficient, which ensures the best accuracy and a high-performance system.
(e) According to the evaluation measurements and the comparison with other related works, the average recall is 98.5%, the precision is 100%, and the response time is 0.467 milliseconds.
(f) The GA operators work with a proposed FF, a random selection mechanism, an elitism mechanism for parent selection, single-point crossover, and ordinary mutation. The optimal solution of the proposed system is achieved with a population size of 75 at iteration 15.
(g) Displaying the documents relevant to the user query requires no separate ranking step because, in finding the optimal solution, the documents are already ordered according to their fitness value; this saves the time of the ranking step and guarantees that the most relevant document is displayed first.
Figure 12 demonstrates the concise and concrete results when executing the system along with the visualization method.
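Point (g) amounts to returning the documents in descending order of their final fitness value. A minimal sketch, assuming the fitness values of the last iteration are available as a mapping from document ID to fitness (the names and values below are hypothetical):

```python
def rank_by_fitness(fitness_by_document):
    """Return document IDs ordered from highest to lowest fitness."""
    return sorted(fitness_by_document, key=fitness_by_document.get, reverse=True)


final_fitness = {"doc_17": 0.42, "doc_03": 0.91, "doc_08": 0.67}
ranking = rank_by_fitness(final_fitness)
print(ranking)   # ['doc_03', 'doc_08', 'doc_17']
```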

Conclusions
MGA and CA were used in IRSs to solve two main problems that web users still encounter when trying to retrieve relevant documents. The first problem is the irrelevance of many highly ranked retrieved documents to the user's query, whereas the second is that relevant documents in the dataset are not retrieved. The major aim of the paper is to develop a novel evolutionary-based IRS that retrieves the documents relevant to user queries with the best performance in precision, recall, response time, and storage space compared to all the previous systems, reducing the document indexing time and memory space. The experimental results reached average values of 100% precision and 98.5236% recall for all the test queries, a response time of 00.46.74.78 ms (i.e., the retrieval process is fast), and 18.8 MB of required memory space. Furthermore, the modified CA showed the adaptability to work sufficiently with the developed operators, i.e., the FF, the random selection mechanism, the elitism mechanism for parent selection with the proposed CP, and the ordinary mutation technique. The optimal solution of the proposed system was achieved with a population size of 75 and 15 iterations, retrieving the documents most relevant to the user query without a ranking step. Moreover, this proposal saves the ranking step time and guarantees the most relevant documents to the user query, as the documents are ordered according to their fitness value. As future work, we suggest designing another type of document indexing that can be employed with different types of documents without any preprocessing and that reduces the required memory space. We also suggest applying the proposed system to a larger dataset.

Figure 1. The general structure of the IRS.
Figure 5 depicts this.

Figure 8. General flowchart of the presented integration system.

Figure 9. Memory space for document indexing.

Figure 12. The general execution of ADIM-MGA with integrated CA.

Table 1. General description and limitations of IRS models.

Table 2. General description and limitations of indexed techniques.

Table 3. General description and disadvantages of searching techniques.

Table 4. Directory names of WebKb dataset with a number of documents.

Table 5. Directory contents.

The search engines store indices of all the available documents in the documents index. This work proposes a method for document indexing (i.e., ADIM), as described in Algorithm 1, that reduces the required memory storage space during query processing. Document indexing fetches the documents' source codes and preprocesses them to create two tables: page information and word information. The page-information table includes the ID, page name, word count, and total-tags weight. The word-information table includes the words and their pages lists. The proposed documents indexing runs the following steps:

[...]
Else
Pages-information.Total_weight table = total_weight(doc.(i)),
Pages-information.Total count-word table = summation (W),
Words-information.Word table = W,
Words-information.Pages-list table = all web pages that contain the same W,
End If
4. End While
5. pages-information table (id, p-name, total-weight, and total-count-word),
6. words-information table (word, pages-list),
7. End For
8. End

Algorithm 1 starts by opening the source code of each webpage in the WebKb dataset and creates two tables. The first is the Pages-information.id table, which takes the directory, university, and page code from the dataset to facilitate access. Thereafter, it creates the Pages-information.page name table, which contains the name of each page. While the end of the document has not been reached, tokenization is performed by extracting the words (W) from the document; a word is removed if it is a special character, a stop word, or a sentence delimiter; otherwise, the total weight and the count of words for each web page are computed. Lastly, the second table, of word information, is created; it holds each word W and all the web pages that contain the same W. Hence, the output of this algorithm is these two tables.

The special-word table is constructed by removing stop words, special characters, and sentence delimiters. The HTML tag information table is constructed by removing preferred tags (i.e., HTML, head, sub-headers (h1, h2, h3), body) and generating the weight of the removed tags. Table 6 depicts documents with tags; a document tag represents a specific importance level. It contains the related essential information of the requested user query.
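The construction of the two tables can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Algorithm 1: the tag weights, the tiny stop-word list, the input shape (each page already split into `(tag, text)` segments), the page ID format, and all names are hypothetical choices made for illustration.

```python
STOP_WORDS = {"the", "a", "of", "and", "in"}                      # tiny illustrative list
TAG_WEIGHTS = {"head": 3, "h1": 2, "h2": 2, "h3": 2, "body": 1}   # assumed weights


def build_index(pages):
    """pages: {page_id: (page_name, [(tag, text), ...])} -> (pages_info, words_info)."""
    pages_info, words_info = {}, {}
    for page_id, (page_name, segments) in pages.items():
        total_weight = total_count = 0
        for tag, text in segments:
            for token in text.lower().split():
                word = token.strip(".,;:!?")          # drop sentence delimiters
                # Skip stop words and tokens with digits or special characters.
                if not word.isalpha() or word in STOP_WORDS:
                    continue
                total_weight += TAG_WEIGHTS.get(tag, 0)
                total_count += 1
                page_list = words_info.setdefault(word, [])
                if page_id not in page_list:
                    page_list.append(page_id)
        pages_info[page_id] = {
            "p_name": page_name,
            "total_weight": total_weight,
            "total_count_word": total_count,
        }
    return pages_info, words_info


toy_pages = {
    "d1-u1-p1": ("crypto.html", [
        ("head", "Cryptographic Protocols"),
        ("body", "The analysis of protocols. operating565"),
    ]),
}
pages_info, words_info = build_index(toy_pages)
```

In the toy page, the stop words and the token `operating565` are discarded, so only four words are counted and the words table keeps a single page entry per word.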

Table 7. List of queries with length and number of related documents.

Table 8. Modified genetic algorithm parameters.

Table 10. Experimental results of 50 queries.

Table 11. Comparison with traditional and other related work.