From Ranking Search Results to Managing Investment Portfolios: Exploring Rank-Based Approaches for Portfolio Stock Selection

Abstract: The task of investing in financial markets to make profits and grow one's wealth is not a straightforward task. Typically, financial domain experts, such as investment advisers and financial analysts, conduct extensive research on a target financial market to decide which stock symbols are worthy of investment. The research process used by those experts generally involves collecting a large volume of data (e.g., financial reports, announcements, news, etc.), performing several analytics tasks, and making inferences to reach investment decisions. The rapid increase in the volume of data generated for stock market companies makes performing thorough analytics tasks impractical given the limited time available. Fortunately, recent advancements in computational intelligence methods have been adopted in various sectors, providing opportunities to exploit such methods to address investment tasks efficiently and effectively. This paper aims to explore rank-based approaches, mainly machine-learning based, to address the task of selecting stock symbols to construct long-term investment portfolios. Relying on these approaches, we propose a feature set that contains various statistics indicating the performance of stock market companies that can be used to train several ranking models. For evaluation purposes, we selected four years of Saudi Stock Exchange data and applied our proposed framework to them in a simulated investment setting. Our results show that rank-based approaches have the potential to be adopted to construct investment portfolios, generating substantial returns and outperforming the gains produced by the Saudi Stock Market index for the tested period.


Introduction
Nowadays, financial markets (e.g., stock exchanges, currency markets, and commodity exchanges) play a major role in the global economy by reflecting countries' economic growth and stability [1,2]. The stock market is a type of financial market that provides an effective platform for listed companies and investment institutions to trade and exchange various types of securities (e.g., stocks, derivatives, and options). For listed companies in particular, stock markets can provide a way to realize fair share value, increase the potential of growing a company's capital, and provide liquidity for shareholders. For investors (both individuals and investment firms), stock markets provide a set of tangible opportunities to diversify investment portfolios and produce financial gains, while keeping a transparent environment [3].
However, making investments in financial markets is not an easy or straightforward task; it requires tremendous effort from financial analysts and investment advisers to study a target stock exchange in search of investment opportunities. Generally, domain experts perform extensive research on the companies listed in a stock market, which involves collecting a large volume of data (e.g., periodic financial reports, announcements, news, etc.), performing several data analytics tasks, and making inferences to reach investment decisions. With the rapid increase in data generated per company listed in those markets, the task of manually analyzing companies' financials gradually becomes much harder, especially when timely investment decisions are needed.
Fortunately, with the recent successes of computational intelligence methods that were adopted in a wide range of data analytics applications (particularly in the financial analytics services domain [4][5][6][7]), these methods can be exploited to assist financial analysts and stock market investors when analyzing companies and making informed investment decisions. This paper is an attempt to explore one category of these methods, rank-based approaches, and use them to select stock symbols from a target financial market and rank them according to their relevance to an investment plan. The rank-based approaches used in this work, known as learning to rank (LtR) algorithms [8], were originally proposed to search and retrieve textual content to be used in a variety of search applications (e.g., search engines, recommender systems, etc.).
The stock selection task, considered in this work, can be seen as a ranking task by nature. Therefore, in this paper, we advocate for adopting these methods for our task and formulate the investment task as a ranking problem. We also propose a set of features suitable for representing stock symbols in LtR methods. To examine the usefulness of our methods, we selected the Saudi Stock Exchange, one of the fastest-growing exchanges around the world, as a case study and performed our evaluations by creating several simulated long-term investment portfolios with an investment period of four years. The findings from our evaluations suggest that the rank-based approaches are very useful for long-term investments and for achieving substantial performance compared to the gains produced by the Saudi Stock Market's index.
To summarize, our work makes the following contributions. First, we created a new dataset consisting of many instances such that each instance is represented using a diverse set of features. This dataset allows us to experiment with LtR frameworks, specifically for financial analytics tasks. Secondly, we reformulated our investment and stock selection task as a ranking problem and conducted a comprehensive exploration of several rank-based methods (both learning-based and fusion-based). Lastly, we provided an evaluation framework and examined the usefulness of two performance measures used to evaluate the effectiveness of rank-based methods. Our examination shows which of these measures positively correlate with investment returns and indicate the real performance of rank-based methods.
The remainder of this paper is organized as follows. Section 2 provides a brief introduction to the topic of the paper by defining the research problem and examining prior work for stock market investment. Section 3 proposes our framework that applies rank-based approaches to stock investments and describes our dataset in a detailed way. In Section 4, we show an empirical evaluation of the proposed framework and provide analysis and discussions of the evaluation results. Finally, Section 5 summarizes the contributions of this paper and provides concluding remarks.

Background
Rank-based search systems, formally known as information retrieval systems, have been widely used in many applications, including web and multimedia searches [9,10], information filtering, task suggestions [11,12], question answering [13,14], and clinical decision support [15][16][17]. A typical search system works by accepting a search request provided by a user as an explicit query consisting of several keywords that define the user's information need. The search system then processes the search request and produces search results, usually as a list of retrievable items (e.g., webpages) that are ranked according to their estimated relevance to the user's query [18]. Consequently, the user will scroll down through the produced results list and consume some items, which may satisfy his or her needs.
Since the early adoption of search and retrieval systems, several rank-based approaches (i.e., retrieval models) have been proposed to rank search results. These approaches include some conventional models, such as the vector space model (VSM) [19], which ranks search results based on the cosine similarity scores between a query and a textual item, the language model (LM) [20], which ranks search results based on the similarities between a pair of language models for queries and items, and other probabilistic models such as BM25, which ranks results in decreasing order of their estimated relevance to queries [21]. However, due to the complexity of search and ranking tasks, relying solely on conventional ranking models may not be sufficient when building very effective systems to address users' needs (i.e., systems that can produce high-quality results). Incorporating other approaches, particularly learning-based methods such as learning to rank (LtR) [8], to combine various data sources (e.g., ranking models, user clicks, previous user interactions, and results of related queries) during retrieval time has been shown to be beneficial because it can boost the performance of these systems and improve the quality of search results [22][23][24]. This will ultimately increase user satisfaction by serving their needs.

Learning to Rank (LtR)
LtR is one category of machine learning (ML) methods that has been adapted for search and ranking problems. LtR methods learn by incorporating many parameters (i.e., features) that are extracted from each query and result-item pair (e.g., from a query and a potentially matching search result) [8]. The resulting models can then be used to rank search results for new search requests from users by predicting the relevance of items retrieved for a given query and ranking these items (i.e., results) accordingly. LtR methods are generally differentiated by the type of machine-learning methodology they rely on (e.g., regression trees, neural networks, or SVMs) and can also be differentiated by the type of loss function that they employ (i.e., a pointwise, pairwise, or listwise function) [25][26][27][28][29][30]. LtR methods that use a pointwise loss function are trained to predict the relevance of an item to a query without taking into consideration the inter-dependency between the items in the search results list. In other words, predicting one item's relevance to a query is based solely on that item's feature values, without considering its relationship to other items in the search results list. In contrast, both pairwise and listwise methods consider the inter-dependency; pairwise methods consider the relative order between two items in the search results list, whereas listwise methods consider the predicted relevance of an item with respect to all the other items in the list [8].
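To make the distinction between the three loss families concrete, the following minimal sketch computes one representative loss of each type for a single query. The specific choices (squared error for pointwise, a RankNet-style logistic loss for pairwise, and a ListNet-style top-one cross-entropy for listwise) are illustrative assumptions and are not necessarily the exact losses used by the learners considered later; the function names are hypothetical.

```python
import math


def pointwise_loss(scores, labels):
    """Mean squared error: each item is judged independently of the others."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """RankNet-style logistic loss over ordered pairs with different labels."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                # Small when the more relevant item i is scored well above j.
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def _softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_loss(scores, labels):
    """ListNet-style cross-entropy between top-one probability distributions."""
    p_true, p_pred = _softmax(labels), _softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))
```

With labels `[3, 1, 0]`, scores that respect the label order (e.g., `[2.0, 1.0, 0.0]`) yield lower pairwise and listwise losses than the reversed scores, whereas the pointwise loss only measures how close each score is to its own label.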
To learn an LtR model for a ranking task, a sample of training data that consists of the following sets is required:

• A set of queries, $Q = \{q_1, q_2, q_3, \ldots, q_m\}$, where each query $q_i$ represents a potential search request by a user.
• A set of retrievable items (or documents), $D = \{d_1, d_2, d_3, \ldots, d_n\}$, where each item $d_j$ can be a potential search result for query $q_i$.
• A set of relevance judgments, $R = \{r_{1,1}, r_{1,2}, r_{1,3}, \ldots, r_{2,1}, \ldots, r_{m,n}\}$, where each judgment $r_{i,j}$ is labeled by humans to indicate whether an item $j$ is relevant to a query $i$.
• A matrix of query-item pair features, $F$, where each row consists of a vector $f_{i,j} = \{f_1, f_2, f_3, \ldots, f_k\}$ that captures certain properties related to a query $i$ and item $j$.
Finally, each training instance $t_{i,j}$ (a pair of query $i$ and item $j$) in the dataset can be represented as a vector of the feature values $f_{i,j}$ and a label $r_{i,j}$. Once a model is trained using this data, it can be used to rank search results for new (unseen) queries by extracting feature values from each query-item pair and using these values to predict whether an item is relevant to a given query. The final ranking of all items will be based on the estimated relevance of these items to query $i$ [16].
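The training sample described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class name `Instance`, its field names, and the helper functions are hypothetical, and `score_fn` stands in for any trained model that maps a feature vector to a predicted relevance score.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    query_id: int    # index i of query q_i
    item_id: str     # identifier of item d_j
    features: tuple  # feature vector f_{i,j} = (f_1, ..., f_k)
    label: int       # relevance judgment r_{i,j}


def group_by_query(instances):
    """Collect the instances t_{i,j} that belong to each query i."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst.query_id].append(inst)
    return dict(groups)


def rank_items(instances, score_fn):
    """Return one query's item ids in decreasing order of predicted relevance."""
    ordered = sorted(instances, key=lambda t: score_fn(t.features), reverse=True)
    return [t.item_id for t in ordered]
```

For example, `rank_items(group_by_query(data)[1], score_fn=sum)` would rank the items of query 1 by the sum of their feature values, a placeholder scorer used only to show the call shape.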

Stock Selection as a Learning to Rank (LtR) Problem
The underlying problem that this work considers is concerned with assisting an investor or a financial analyst in building an investment portfolio that is represented by a set of stock symbols' shares selected from a target stock market (in our case, we will consider selecting stocks from the Saudi Stock Exchange). More specifically, we will focus on addressing how to select stock symbols from a given market in an effective way. Our main objective is to maximize the portfolio's returns over a specified time period. Assuming that we focus on long-term investments (i.e., no active management of a portfolio is needed), this period can be set to a single year.
Having discussed search and ranking problems, we propose reformulating our problem as a ranking task to potentially apply LtR methods. Ultimately, our goal is to build a ranking system that can rank a set of potential investment stock symbols (i.e., companies) based on their relevance to certain criteria (i.e., based on which stocks are expected to make positive returns at the end of a given period). Using this formulation, a user query can be defined as an implicit question (i.e., not in the textual form) of which stocks should be selected given a time frame (e.g., "I am at the start of the year 2020; which stocks will make high returns by the end of the year?"). The items in our formulation can be defined as the stock company symbols that are available for investment depending on the user query.
The features can also be defined as a vector of property values that captures certain aspects of a symbol for each query-stock symbol pair. For instance, if the user's query is to predict the most profitable stocks by the end of a given year (e.g., 2019 or 2020, where the prediction of each year is considered a distinct query), then the feature values will contain certain properties about a symbol (e.g., revenue, capital value, market value) to capture the period prior to and up to the time when the query is initiated (e.g., use the data collected up to the end of 2018 to predict stocks for 2019). Once the data are defined using the described formulation, LtR methods can be applied to train models, which can be applied to stock selection. In Section 3, we thoroughly describe our methodology for applying LtR to Saudi Stock data.

Related Work
Researchers and scientists have focused their attention on financial markets due to their importance in shaping the global economy and reflecting countries' economic wealth. A wide range of computational intelligence approaches have been developed to analyze markets' movements and assist in enhancing the task of investing in markets. Among machine-learning (ML) approaches, most of the proposed work focuses on developing models to assist investors and financial analysts with their long-term (i.e., passive) and short-term (i.e., active) stock market investments [4][5][6][7][31][32][33][34].
For instance, Chiang et al. [4] applied multi-layer perceptrons (MLPs) combined with particle swarm optimization to predict the next-day movements of U.S. market indices (e.g., NASDAQ and S&P 500) in a trading period. Predicting the movements of these stock market indices has been shown to be useful in deciding the entry and exit points for trading actions and leads to investment returns. Alsubaie et al. [5] explored several ML models, such as support vector machines (SVMs), MLPs, and naïve Bayes, and considered various sets of technical indicators. The authors used these models to simulate active trading actions in the Saudi Stock Exchange and showed that different ML models resulted in different investment returns, with naïve Bayes resulting in a higher performance than the other models. Alsulmi and Al-Shahrani [6] also explored applying several ML models, including long short-term memory networks (LSTMs) and random forests, to the task of investment and trading in the Saudi Stock Market. Their study's findings suggest that combining ML-based trading with a portfolio's risk management techniques is very useful for the trading task and has the potential to outperform the conventional buy-and-hold strategies adopted by many investors.
All of the approaches described above focus on exploiting various ML methods to invest and make financial returns by actively trading in the stock market (e.g., identifying entry and exit points of a stock and actively buying/selling the stock within short periods). However, other types of approaches attempt to analyze stock market data by relying on various ML and computational intelligence techniques to select and recommend sets of stock symbols that are suitable for constructing long- to medium-term investment portfolios. One example of this category is the active learning method introduced by Yan and Ling [7], which is also called prototype ranking and is based on clustering. The proposed approach learns a network model, mainly by utilizing two features (stock prices and the volume of traded shares) to select some of the potential stocks listed on NYSE and AMEX. The findings show that the approach is useful for stock selection and is comparable to other non-ML methods used for this task.
Yu et al. [31] introduced another stock selection method which relies on supervised ML with SVMs and principal component analysis (PCA). The method is used as a classifier rather than a ranker and is applied to predict the top stock symbols out of the 677 symbols listed in the Chinese A-share stock exchange; each symbol is represented by seven features that represent different ratios (e.g., earnings ability, cash ratios, and risk levels) and a target label. An analysis of this method indicates that it has the potential to identify top stocks from the target stock exchange. Yuan et al. [32] also explored several supervised ML models, such as SVMs, MLPs, and random forests, and used them for long-term investment and portfolio stock selection for the Chinese A-share stock market. The proposed method utilized a large number of features (mainly features related to the daily trading of stocks, including opening price, closing price, and volume) to predict which stock symbols are expected to perform the best. Similar to Yu et al. [31], the methods proposed in this study are used as classifiers, not rankers.
Other studies, such as Song et al. [33], explored using the LtR approach for stock selection, which is accomplished by defining the investment task as a ranking problem. Song et al. [33] used a set of statistics based on investor sentiment collected from news articles as features for training several LtR models. The method is applied for stock selection in the U.S. stock market by considering two investment strategies: long-only and long-short strategies. Findings from this work indicate the potential of LtR methods for this task, as they outperformed the S&P 500 index's returns for the considered testing period. Saha et al. [34] also proposed formulating the stock selection task as an LtR task by introducing an ML method that is based on relational graphs of market stocks. Although the method is applied in active daily trading and not long-term investments, empirical evidence from two U.S. markets (NASDAQ and NYSE) indicated its usefulness for the task of stock selection.
Our work in this paper shares some similarities with prior work, such as representing our investment task as a ranking problem. Nevertheless, our work is distinguished because we reformulate the task using the LtR framework by clearly defining queries and items and explaining how they are linked using the pairwise feature values. This allowed us to consider a more comprehensive list of LtR learners and to explore a new set of features representing each query and item pair. Moreover, we applied our methods in the context of the Saudi Stock Market, and to our knowledge, this work is the first to adopt these methods for such a stock exchange.

Materials and Methods
This section describes our methodology for applying the LtR framework to stock symbol selection. We first describe the proposed representation of our problem and then examine the data collection process used to gather the data for our approach. Afterwards, we discuss how to aggregate the collected data to generate learning features. Lastly, we discuss ways to apply model learning using several LtR algorithms.

Problem Representation
We represent our problem, which is concerned with selecting stock symbols to maximize a portfolio's returns, as a ranking problem. Therefore, as described in Section 2.2, we propose applying LtR methods to learn models for ranking stock symbols. We assume that users intend to build long-term investment portfolios and set the investment period to one year. Learning an LtR model for this task requires a set of queries $Q$ (i.e., a set of implicit questions of which company stocks to select for each year), a set of items $D$ (i.e., company stocks), a set of pairwise ground-truth labels $R$ (to indicate whether a company stock is relevant to a given year's query), and a set of pairwise feature values $f_{i,j}$ (to indicate certain statistics about a company stock $j$ for a given year's query $i$). Ultimately, each instance in our data will be a vector of pairwise feature values (for query $i$ and item $j$) and a target label $r_{i,j}$ (e.g., $f_{i,j} = [f_1, f_2, f_3, \ldots, f_k] \rightarrow r_{i,j}$). By applying an LtR algorithm to the provided data instances, we can train a model that predicts the relevance of stocks and ranks them accordingly. Next, we discuss our process for collecting and generating the data to build our model.

Data Collection
The target market we consider in this work is the Saudi Stock Market (Tadawul), which has over 200 listed companies. Tadawul is one of the fast-growing stock exchanges worldwide and has a market capitalization of over US$2.22 trillion (ranked 9th among the 67 members of the World Federation of Exchanges) [35]. One limitation is that no publicly available dataset is suitable for applying LtR methods to our target stock exchange. Therefore, part of our methodology is concerned with collecting data from several sources and aggregating data to generate a dataset suitable for training LtR models. Consequently, we developed a bot for crawling our required data, which occurs through two main tasks: acquiring a company stock's profile information and gathering each stock's annual financial results. We describe these two tasks in Sections 3.2.1 and 3.2.2. We implemented our bot in Java, relying on the jsoup parser [36] to fetch URLs, extract the required data properly, and further manipulate data.
In addition to the crawled data, our analysis will rely on the historical market data the Saudi Stock Market authority has released [37]. The data contain the daily trading information for all the listed stocks for the period we considered in this study. Section 3.2.3 provides more insights into this data, including the main parameters used.

Stocks' Profile Information
The market authority of Tadawul provides a profile for each company listed in the stock market. The profile presents detailed information about the company stock, including stock symbol code, listing name, sector, listing date, establishment date, and equity profile. Figure 1a-d show samples of company profile information provided on Tadawul's website.
Electronics 2022, 11, x FOR PEER REVIEW 7 of 22

Because we need the companies' profile information to generate some of our features, we run our bot on these profiles to extract a set of suitable HTML tags for the following attributes: symbol code, listing name, sector, listing date, paid-in capital, the number of issued shares, and paid-up value per share. In addition, we extract some statistics regarding the changes in a company's capital since its listing date in the market, as Figure 1d shows. This information will be processed later during the data aggregation stage to generate suitable features matched with a suitable query-item pair.

Stock Financial Results
In addition to companies' profiles, the market authority of Tadawul provides the financial results of the stock market's listed companies, which each company announces for several periods: three months, six months, nine months, and one year. The results include several attributes indicating the company's performance, such as revenues, net profits, and profits per share for a given period. Figure 2 shows a sample of the financial results provided on Tadawul's website. From these results, we select the annual results for each stock (revenue per year, profit/loss per year, profit/loss per share, etc.), and we run our bot to crawl their data by extracting their suitable HTML tags. As with the company profile data, we will use the crawled data for this part later to produce features and match them with query-item pairs.

Stocks' Historical Trading Data
The historical market data Tadawul's authority releases (through their EReference data service in [37]) include information about stocks' trading prices per day since their initial listing. Every instance of the data represents a trading day for a stock in the market. It includes several attribute values, such as stock company name, symbol code (each stock symbol's unique id), date, stock opening price, stock highest price, stock lowest price, stock closing price, and the volume of shares traded that day. Table 1 shows a sample of the historical trading data Tadawul provides. It is worth mentioning that these data are used to generate some feature values and to facilitate the process of producing the ground-truth labels for our training instances (which we will describe next).

Data Aggregation and Feature Generation
Having collected the data from several sources (i.e., companies' profiles, annual financial reports, and historical trading data), we now aggregate such data and use them to generate a dataset that is suitable for training LtR models. We produce feature values for each query-item pair such that for each query (i.e., each year included in our analysis), we produce a set of statistics for each company. These statistics are intended to indicate these companies' performance throughout a year (e.g., net profits, capital growth, and P/E ratios) [38] and differentiate companies. Additionally, these statistics can reflect the changes in companies' stocks from one year to another (increase in paid capital, change in market value, etc.). Overall, we generated a set of 15 features for each pair of query $i$ and item $j$. Table 2 presents these features along with a description of each one. We extract some of the considered features directly from the aggregated data (e.g., symbol code, sector, paid capital, and total net profits/loss) whereas we estimate other features, such as market value, net profits to capital (as a percentage), price-earnings (P/E) ratio [38], and price-earnings (P/E) indicators [39], by performing simple calculations using the extracted data or applying a financial analyst rule of thumb.

| Feature | Description |
| --- | --- |
| … | The difference (%) in paid-in capital between two consecutive years. |
| Capital growth frequency | The frequency of increases in a company's capitalization. |
| Market value growth (1 year) | Estimated by (market value in year $i$ − market value in year $i-1$). |
| Market value growth (3 years) | Estimated by (market value in year $i$ − market value in year $i-3$). |
| P/E ratio | Estimated by (share market price / profit per share). |
| P/E indicator | Indicator of whether a P/E ratio value is high, medium, or low, estimated by a financial analyst rule of thumb [39]. |
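The calculated features described above can be sketched with a few small helpers. This is an illustrative sketch rather than the procedure used to build the dataset: the function names are hypothetical, computing market value as share price times issued shares is an assumption (the text does not spell out the calculation), and expressing the paid-in capital difference relative to the earlier year is likewise an assumption. The P/E indicator is omitted because its thresholds come from the rule of thumb in [39].

```python
def market_value(share_price, issued_shares):
    """Assumed definition: share price multiplied by the number of issued shares."""
    return share_price * issued_shares


def capital_growth_pct(capital_now, capital_prev):
    """Difference (%) in paid-in capital between two consecutive years.

    Assumption: the difference is expressed relative to the earlier year.
    """
    return 100.0 * (capital_now - capital_prev) / capital_prev


def market_value_growth(values_by_year, year, lag=1):
    """Market value in `year` minus market value `lag` years earlier."""
    return values_by_year[year] - values_by_year[year - lag]


def pe_ratio(share_price, profit_per_share):
    """Share market price divided by profit per share (None if undefined)."""
    if profit_per_share == 0:
        return None
    return share_price / profit_per_share
```

Calling `market_value_growth(values, 2018, lag=3)` gives the three-year variant of the market value growth feature for the 2018 query, provided `values` holds an entry for 2015.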
In addition to generating features' data, we produced the set of relevance judgments, the ground-truth labels $R$, for each query $i$ and item $j$ (i.e., the year $i$ and a stock $j$). Fortunately, rather than relying on human feedback, we can estimate those labels by examining the historical daily trading information and whether a stock symbol generates a positive return for a given year. For instance, to estimate whether a stock symbol, $j$, is relevant to invest in for year $i$, we generate the label $r_{i,j} \in \{0, 1, 2, 3\}$ (0: not relevant, 1: potentially relevant, 2: definitely relevant, 3: highly relevant) by measuring the difference in the price of $j$ at the start of year $i$ and its end. If the difference indicates a growth in the stock price, $j$ is labeled with one of the relevance labels for year $i$; otherwise, it will be labeled as not relevant. It is worth noting that to simplify our task and for illustration purposes, we only considered four labels (three levels of relevance and one for non-relevance). Additionally, the distinction among these labels is defined by setting the threshold values $t_1$, $t_2$, and $t_3$, as Equation (1) shows. Later, in our evaluation section, we discuss the suggested values for these parameters.
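The labeling rule can be sketched as a simple threshold function on a stock's yearly price growth. The sketch below makes two assumptions that should be checked against Equation (1): growth is measured as the relative price change over the year, and each threshold is treated as an inclusive upper bound of the lower label. The default threshold values are the ones reported in the experimental settings (0%, 25%, and 50%); the function names are hypothetical.

```python
def yearly_return(start_price, end_price):
    """Relative price change of a stock between the start and end of a year."""
    return (end_price - start_price) / start_price


def relevance_label(growth, t1=0.0, t2=0.25, t3=0.50):
    """Map yearly growth to a graded label in {0, 1, 2, 3}.

    Assumption: growth <= t1 is not relevant (0); boundaries at t2 and t3
    fall into the lower label.
    """
    if growth <= t1:
        return 0  # not relevant: no growth in the stock price
    if growth <= t2:
        return 1  # potentially relevant
    if growth <= t3:
        return 2  # definitely relevant
    return 3      # highly relevant
```

Under these assumptions, a stock that rises from 100 to 130 over the year has a growth of 30% and receives label 2.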

LtR Model Learning
Once the features and labels are generated for each pair of $i$ and $j$, we reformat the data to prepare the resulting dataset for the LtR learning procedures. LtR frameworks, such as RankLib [40] and TF-Ranking [41], have a specific format for representing data instances such that each instance, a pair of query $i$ and item $j$, is represented as ($r_{i,j}$ qid:$i$ 1:$f_{i,j,1}$ 2:$f_{i,j,2}$ 3:$f_{i,j,3}$ ... $k$:$f_{i,j,k}$). Here, $r_{i,j}$ is a label indicating the relevance of item $j$ (a company stock in our case) for query $i$ (i.e., a year), qid is the query id, and 1, 2 through $k$ index the feature values for that pair. Now, we can apply LtR learning procedures to train a model for the stock selection task such that for a new unseen query (i.e., a new year), it predicts the stocks with the most potential positive investment returns by the end of that year. Training an LtR model involves deriving a function that maps the input space (i.e., data instances) to the output space (i.e., predictions) relying on the feature values provided by the input data. In the derivation of such a function, a loss function is needed to guide the learning process and measure the correctness of the produced predictions against the ground-truth labels. As described in Section 2.1, LtR algorithms are generally categorized according to their loss functions as pointwise, pairwise, or listwise (see [8] for a detailed review). Several algorithms have been proposed for LtR model learning, and in this work, we consider nine of these learners spanning various ML techniques (trees, boosting, neural networks, etc.) as well as various loss functions. We implement the considered algorithms using a recent version of the RankLib tool [40]. Table 3 lists these algorithms.
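The instance format described above can be produced with a short helper. This is a minimal sketch: the function names are hypothetical, the example feature values are made up for illustration, and sorting the lines so that instances of the same query are contiguous is an assumption about what such tools commonly expect rather than something stated in the text.

```python
def to_ltr_line(label, qid, features):
    """Format one instance as '<label> qid:<i> 1:<f1> 2:<f2> ... k:<fk>'."""
    feats = " ".join(f"{idx}:{value}" for idx, value in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"


def to_ltr_lines(instances):
    """Format (label, qid, features) triples, grouped by query id.

    Assumption: instances of the same query should appear contiguously.
    """
    ordered = sorted(instances, key=lambda inst: inst[1])
    return [to_ltr_line(label, qid, feats) for label, qid, feats in ordered]
```

For example, `to_ltr_line(2, 2018, [0.5, 1.0, 3])` yields `2 qid:2018 1:0.5 2:1.0 3:3`, where the query id is the year and the label is the graded relevance of the stock for that year.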
Table 3. The considered LtR algorithms with their corresponding ML methods and loss functions.

| LtR Algorithm | ML Method | Loss Function |
|---|---|---|
| Linear regression [25] | simple regression | pointwise |
| MART [26] | trees | pairwise |
| LambdaMART [27] | trees | listwise |
| LambdaRank [28] | neural network | listwise |
| Coordinate ascent [29] | optimization search | pointwise |
| RankBoost [30] | boosting | pairwise |
| Random forests [42] | trees | pointwise |
| RankNet [43] | neural network | pairwise |
| ListNet [44] | neural network | listwise |

In addition, we consider applying rank fusion methods that can produce ranked lists by combining the results from several LtR methods. Particularly, we examine two rank-based fusions, inverse square rank (ISR) [45] and reciprocal rank fusion (RRF) [46], which are defined by Equations (2) and (3) below.

$$\mathrm{ISRScore}(j) = N(j) \times \sum_{k} \frac{1}{R_k(j)^2} \tag{2}$$

$$\mathrm{RRFScore}(j) = \sum_{k} \frac{1}{R_k(j) + L} \tag{3}$$

ISRScore(j) and RRFScore(j) in the above equations represent the scores of an item j after we apply the corresponding fusion method to combine the ranked lists from several LtR methods. N(j) represents the number of ranked lists in which item j appears, R_k(j) represents the rank of item j in ranked list k, and L is a constant (it is usually set to 50).

Finally, to optimize the LtR learners' learning process, we rely on the normalized discounted cumulative gain (nDCG) [47]. It measures a ranked list's performance by utilizing items' graded relevance (i.e., it considers several levels of relevance, as in our case) rather than considering only binary relevance (e.g., relevant vs. not relevant), as in precision [48], recall [49], and F1 measures. nDCG works under the assumption that highly relevant items are more useful than marginally relevant items, which in turn are more useful than non-relevant items. Moreover, it favors highly relevant items appearing at the top of the ranked list and penalizes the score when they appear at the bottom. For query i, nDCG is measured at a specific ranking position k (i.e., the top k results) according to Equations (4) and (5), where iDCG is the ideal discounted cumulative gain computed for a ranked list of ideal items as defined in Equation (5), r_{i,j} is the degree of relevance of item j to query i, and log_2(j + 1) is the discounting factor.

Next, we describe our evaluation of the proposed approach relying on Saudi Stock Exchange data.
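A common formulation of nDCG that is consistent with the symbols used above (the relevance degree r_{i,j}, the log_2(j + 1) discount, and the cutoff k) can be sketched as follows. The exponential gain 2^{r_{i,j}} - 1 is an assumption here; a linear gain r_{i,j} is also widely used, and the cited reference [47] should be consulted for the exact form. In this sketch, j indexes positions in the ranked list for query i, and r^{*}_{i,j} (a symbol introduced only for this illustration) denotes the label at position j when the items are sorted by decreasing relevance.

```latex
\mathrm{nDCG@}k_i = \frac{\mathrm{DCG@}k_i}{\mathrm{iDCG@}k_i},
\qquad
\mathrm{DCG@}k_i = \sum_{j=1}^{k} \frac{2^{r_{i,j}} - 1}{\log_2(j + 1)},
\qquad
\mathrm{iDCG@}k_i = \sum_{j=1}^{k} \frac{2^{r^{*}_{i,j}} - 1}{\log_2(j + 1)}
```

Under this formulation, nDCG equals 1 when the predicted order matches the ideal order within the top k positions and decreases as highly relevant items move down the list.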

Results and Analysis
Having described our methods for applying LtR for the stock selection task, in this section, we evaluate these methods. We start by describing our setup for our experiments. Then, we report the results of evaluating LtR models' effectiveness and provide a case for applying these models when investing in the Saudi Stock Exchange. Finally, we analyze our results and provide further discussions.

Experimental Settings
The dataset consists of historical data for the Saudi Stock Market containing information about the companies listed in the market (excluding REITs and ETFs). We compiled the dataset using the procedures described in Section 3.3. The resulting dataset covers the period from 2013 to the end of 2021 (nine years) and includes 1437 instances such that each instance is represented by 15 features and a target label (ranging from 0 to 3). We set the thresholds t_1, t_2, and t_3 for labeling data instances, defined in Equation (1), to 0%, 25%, and 50%, respectively (we selected these values because they lead to an effective balancing of the data among the various labels and an effective grouping of the stocks based on their returns).
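The labeling step can be illustrated with a short Python sketch that maps an annual return to a graded label from 0 to 3 using the thresholds 0%, 25%, and 50%. Equation (1) is not reproduced in this section, so the boundary handling is an assumption: a return that equals a threshold is treated as meeting it, which agrees with the later description of highly relevant stocks as those "generating returns of at least 50%". The names `THRESHOLDS` and `relevance_label` are hypothetical.

```python
from bisect import bisect_right

# t_1, t_2, t_3 expressed as fractional annual returns (0%, 25%, 50%).
THRESHOLDS = (0.0, 0.25, 0.50)


def relevance_label(annual_return, thresholds=THRESHOLDS):
    """Return a graded label 0..3: the number of thresholds the return meets or exceeds.

    Assumption: boundaries are inclusive (a return of exactly 50% gets label 3).
    """
    return bisect_right(thresholds, annual_return)


# Made-up returns for illustration only.
for r in (-0.10, 0.10, 0.30, 0.75):
    print(f"{r:+.0%} -> label {relevance_label(r)}")
```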
We trained nine LtR models (described in Section 3.4) to select stock symbols for the last four years (2018, 2019, 2020, and 2021) in our dataset. For instance, to predict the top stock symbols for 2018 (i.e., rank the 167 stocks listed for that year), we trained our LtR models on the data for the period starting in 2013 and ending in 2017 (excluding any instances from 2018 through 2021). We did so to eliminate any potential learning bias and avoid overestimating these models' effectiveness. We did the same for 2019, 2020, and 2021 (e.g., to predict the top stocks for 2021, we trained with the instances for the period starting in 2013 and ending in 2020). Moreover, to fine-tune each learner, guide the learning process, and avoid overfitting, we randomly selected 10% of our training data and used it as a holdout validation set. Additionally, as described in Section 3.4, we used nDCG@10 as the main metric to optimize these learners on our dataset.
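The training protocol above can be sketched in Python as a walk-forward split: for each test year, only earlier years are used for training, and 10% of that training data is held out for validation. Whether the holdout was sampled per instance or per query is not stated, so sampling per instance is an assumption here, as are the function name `walk_forward_split`, the dictionary layout, and the fixed random seed.

```python
import random


def walk_forward_split(instances, test_year, first_year=2013, holdout=0.10, seed=0):
    """Split instances (dicts with a 'year' key) into (train, validation, test).

    Training and validation data come only from first_year .. test_year - 1,
    so no instance from the test year or later can leak into training.
    """
    past = [x for x in instances if first_year <= x["year"] < test_year]
    test = [x for x in instances if x["year"] == test_year]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(past)           # 'past' is a new list; the input is not modified
    n_val = max(1, round(holdout * len(past)))
    return past[n_val:], past[:n_val], test


# Toy data: 10 hypothetical symbols per year, 2013 through 2021.
data = [{"year": y, "symbol": s} for y in range(2013, 2022) for s in range(10)]
train, val, test = walk_forward_split(data, test_year=2018)
print(len(train), len(val), len(test))  # 45 5 10
```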
Finally, we performed two types of experiments, one to measure LtR models' effectiveness (i.e., the performance of these models) while they are used to rank stock symbols, and the second to examine these models' usefulness in constructing investment portfolios. For the first set of experiments, we report LtR models' effectiveness using two common measurements for search and ranking systems: precision@k [48], which relies on binary relevance and measures the proportion of items that are relevant in the top k results of a ranked list, and nDCG@k [47], which considers graded relevance and measures ranking effectiveness as defined in Equation (4).
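The two measures can be sketched in Python as follows. Both functions take the graded labels of the items in the order a model ranked them. Two details are assumptions rather than statements from the text: an item counts as relevant for precision when its label is at least 1, and DCG uses the exponential gain 2^r - 1. The function names are hypothetical.

```python
import math


def precision_at_k(labels, k, min_relevant=1):
    """Share of the top-k items whose graded label is at least min_relevant."""
    return sum(1 for r in labels[:k] if r >= min_relevant) / k


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the top-k labels (exponential gain assumed)."""
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(labels[:k], start=1))


def ndcg_at_k(labels, k):
    """DCG@k divided by the DCG@k of the ideal (descending-label) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


# Made-up labels of ten ranked stocks, best-ranked first.
ranked_labels = [3, 0, 2, 1, 0, 0, 3, 0, 1, 0]
print(precision_at_k(ranked_labels, 5), round(ndcg_at_k(ranked_labels, 5), 4))
```

Because the discount grows with the position, a highly relevant stock ranked seventh contributes less than the same stock ranked first, which is the behavior the text attributes to nDCG.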
For the second set of experiments, we created several simulated investment portfolios, each with a capitalization of 100 K Saudi riyals (SAR), and we simulated investment in the stock symbols each of the learned models selected. We measure these models' usefulness by estimating the returns (profits/losses) each portfolio made for the four years included in our testing data. We report the results for both experiments in Sections 4.2 and 4.3.
Therefore, it would be more effective to consider nDCG the main indicator of a model's performance because it accounts for the various relevance levels and rewards ranking models that place highly relevant stock symbols (i.e., those generating returns of at least 50%) at the top of a ranked list. Similar to precision, nDCG shows high disparities in performance values among the LtR models, suggesting that these learners are not equivalent for our task. The ranking effectiveness, on average, ranges from 0.1634 (RankBoost) to 0.5125 (LambdaRank), considering nDCG@10, although the difference can be much larger in individual years, as in 2020 (RankBoost resulted in an nDCG@10 of 0.0571, whereas LambdaRank resulted in 0.9552). A multi-way ANOVA test shows no statistically significant difference among the learners as a group for our task (although the p-value of 0.06 is close to the significance threshold of 0.05). However, when we compare each pair of learners, a pairwise t-test indicates that a significant difference remains between some pairs of models (e.g., LambdaRank vs. RankBoost). Table 5 summarizes the results for all model pairs.

Table 5. A pairwise one-sided t-test is applied to each pair of LtR models considering the nDCG metric. "1" indicates that a statistically significant difference between a pair was observed, whereas "-" indicates no statistical significance.
Another observation from Table 4 is that the performance of LtR models degrades as one moves down in the ranked list. This is especially true for the nDCG measure (i.e., selecting the top 10 stocks would be more effective than selecting the top 20). This is often the case in various search and ranking tasks (e.g., as in [12,16,24]) because ranking models typically work by attempting to push more relevant items to the top of a ranked list as the user is expected to ignore the items further down in the list and only focus on the top (nDCG and other measures for evaluating ranking effectiveness are based on this assumption [47,48]).
To summarize our analysis for this part, our results in Tables 4 and 5 suggest that there is a noticeable difference among the various models used for this task. Four of our learners (LambdaRank, LambdaMART, Random forests, and ListNet) achieved high effectiveness compared to the other learners. The performance results of these learners are relatively high compared with those reported for other search and ranking tasks (e.g., as in [16,50,51]), and our statistical analysis of these learners over the considered testing period indicates that the four models are comparable. On the other hand, our analysis shows that two of the learners (RankBoost and Linear regression) performed poorly and resulted in the lowest effectiveness among all learners. This makes those learners less suitable for this task.
In addition to the experiments for this part, we conducted further experimentation to examine whether combining the ranked lists produced by the different LtR models can lead to higher effectiveness than having a single model select a set of stock symbols. Table 6 presents the results of these experiments using the two rank fusion methods described in Section 3.4. As Table 6 shows, neither rank fusion method outperformed the top LtR models adopted for this task. ISR seems to be comparable with the top four LtR models described previously (statistical analysis confirms this observation). In contrast, the RRF fusion method performed poorly compared to a single LtR model. Our further analysis in the following section will provide more insights into the usefulness of these fusion methods.
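As an illustration of how such fusion might be computed, the Python sketch below combines several ranked lists, assuming the standard definitions of the two methods: ISR scores an item as N(j) times the sum of 1/R_k(j)^2 over the lists containing it, and RRF scores it as the sum of 1/(R_k(j) + L), with L = 50 as stated in Section 3.4. The function name `fuse`, the tie-breaking rule, and the symbols "A", "B", and "C" are hypothetical.

```python
from collections import defaultdict


def fuse(ranked_lists, method="isr", L=50):
    """Combine ranked lists (best item first) into one list using ISR or RRF scores."""
    if method not in ("isr", "rrf"):
        raise ValueError("method must be 'isr' or 'rrf'")
    sums = defaultdict(float)   # per-item sum of rank-based contributions
    counts = defaultdict(int)   # N(j): number of lists containing the item
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            counts[item] += 1
            sums[item] += 1.0 / rank ** 2 if method == "isr" else 1.0 / (rank + L)
    # ISR multiplies the summed contributions by N(j); RRF uses the sum directly.
    scores = {item: (counts[item] * s if method == "isr" else s)
              for item, s in sums.items()}
    # Higher score first; ties are broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three made-up ranked lists from three hypothetical models.
lists = [["A", "B", "C"], ["B", "A"], ["B", "C", "A"]]
print(fuse(lists, "isr"))  # ['B', 'A', 'C']
print(fuse(lists, "rrf"))  # ['B', 'A', 'C']
```

Note that ISR rewards items that appear in many lists through the N(j) factor, whereas RRF depends only on the ranks, which is one way the two methods can disagree on longer lists.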

The Usefulness of LtR Models for Investment Portfolios
We evaluated the usefulness of adopting LtR models to select stock symbols for investment portfolios. We did this by constructing several simulated investment portfolios and emulating the process of investing in our target stock market. We considered diversifying these portfolios by examining two scenarios: investing in the top 10 stock symbols selected by each model and investing in the top 20 selected stocks. For each scenario, we divided our investment capital of 100 K SAR equally among the selected stock symbols. The simulation was applied for the four years (2018, 2019, 2020, and 2021) in our testing period such that the investment period is set to a single year (i.e., a set of stocks is selected by a learner, and shares are purchased at the start of a year and then sold by the end of that year). The performance of each learning model is determined by the total returns (profits/losses) on its corresponding portfolio. Tables 7 and 8 show our results considering the two scenarios: investing in the top 10 selected stocks and investing in the top 20. Both tables also compare the results to the returns of the Saudi Stock Market's main index, TASI. Moreover, Table 9 compares the results of our top portfolios to the returns produced by the best-performing hedge funds investing in the Saudi Stock Exchange [37] for the same testing period.
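The simulation described above can be sketched in Python as an equal-weight, one-year buy-and-hold calculation. Several simplifications are assumptions of this sketch and not statements from the text: fractional shares are allowed, and transaction costs and dividends are ignored. The function name, symbols, and prices are hypothetical.

```python
def simulate_portfolio(selected, open_prices, close_prices, capital=100_000.0):
    """Equal-weight buy-and-hold for one year.

    selected     -- stock symbols chosen by a model (e.g., its top 10)
    open_prices  -- price per symbol at the start of the year
    close_prices -- price per symbol at the end of the year
    Returns (final_value, return_fraction).
    """
    if not selected:
        raise ValueError("at least one stock must be selected")
    budget = capital / len(selected)  # equal split of the capital
    final_value = sum(budget * close_prices[s] / open_prices[s] for s in selected)
    return final_value, final_value / capital - 1.0


# Made-up prices in SAR for two hypothetical symbols.
value, ret = simulate_portfolio(
    ["X", "Y"],
    open_prices={"X": 10.0, "Y": 20.0},
    close_prices={"X": 15.0, "Y": 20.0},
)
print(f"final value: {value:,.0f} SAR, return: {ret:.1%}")
```

Repeating this calculation for each test year and each model's top 10 or top 20 selections would give per-year returns comparable in form to those reported in Tables 7 and 8.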
Table 7 shows that the LtR models considered in this study can be categorized into two groups based on their returns. On one hand, we see that five of our learners, namely Random forests, LambdaRank, ListNet, LambdaMART, and MART, resulted in high investment returns, having substantially outperformed the market index for almost all the years included in our testing (Table 10 shows a sample of the top stocks selected by the best two models reported for the last three years). The returns of these models are five times greater than those of the market index, TASI, or even more (e.g., in the case of Random forests). Additionally, comparing these models (in Table 9) to the best-performing hedge funds investing in Saudi stocks and managed by investment firms reveals the high potential of LtR models for the investment task, as our top model, Random forests, resulted in returns that are three times higher than the best of these hedge funds.

[...] can be attributed to several reasons. One is the tendency of ranking models to move potentially more relevant items to the top of a ranked list while keeping what might be less relevant further down on the list (i.e., the stocks selected by a model when considering a deeper ranking cutoff will include more partially relevant and irrelevant stocks than when considering a shallow cutoff). Another reason is that we explicitly set the optimization parameter during the training stage of all learners to optimize for ranking cutoff 10, making these learners focus on enhancing the quality of the results that are at the top of the list.

It is also worth noting that our performance results, as a function of investment returns per model, show high correlations with the effectiveness results reported using nDCG in Tables 4 and 6 of Section 4.2. The top models with the highest returns for our investment task are also among the top models evaluated by nDCG, as reported in the previous section. Likewise, we see that the lowest-performing models by one measure are also among the lowest by the other (e.g., RankBoost and Linear regression are the lowest by both nDCG and returns). We verified this observation by estimating Pearson's correlation coefficients between model returns and model effectiveness using both nDCG and precision. Table 11 shows the results. It is clear from Table 11 that the nDCG metric, considering both ranking cutoffs, has achieved an almost perfect correlation with the investment returns made by the models' portfolios, reaching 0.93 for both nDCG@10 and nDCG@20. This contrasts with the precision measure, which shows some degree of correlation with the models' returns but remains low compared to nDCG. This result suggests that nDCG is, in fact, more suitable for indirectly inferring the models' performance values and estimating which model will produce higher returns than the other models. Therefore, using nDCG during the training stage of an LtR model (to optimize the learning process and evaluate the loss function, as in our case) would be very effective in capturing the model's true performance (i.e., it is assumed to be as effective as training with the model's actual returns, except that the latter may not be straightforward for our ranking task).
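The correlation check described above can be illustrated with a small Python sketch that computes Pearson's coefficient between per-model effectiveness scores and per-model returns. The function name `pearson` and the numbers in the usage example are hypothetical placeholders, not values from Tables 4, 7, or 11.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Placeholder values: one nDCG@10 score and one average return per model.
ndcg_scores = [0.20, 0.35, 0.50, 0.45]
avg_returns = [0.05, 0.30, 0.60, 0.40]
print(round(pearson(ndcg_scores, avg_returns), 3))
```

A coefficient close to 1 would indicate that models ranked higher by the effectiveness measure also tend to produce higher returns.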
Finally, to recap our analysis for this part, we show a successful adaptation of several LtR models for selecting stock symbols for investment portfolios. Our results show that more than half of the considered models (including a rank fusion method, ISR) can produce relatively high investment returns and outperform the market for the specified period. From that, we can conclude that learning to rank, as a framework, can indeed be very useful (when using suitable learners) for providing investors and financial analysts with recommendations on which of the listed companies in the market to consider for an investment plan. Perhaps combining recommendations from several tools and sources would be even more useful than relying on a single tool or model.

Further Analysis and Discussions
Having presented a study for implementing and adapting LtR models for stock selection in financial markets, we now provide further discussions and include some remarks about our work. One might note that, although our study indicates the potential usefulness of LtR models for the investment task, it is not clear what the impact of the considered features for training these models is (i.e., whether the included features are suitable for distinguishing between instances and learning to discriminate among different labels).
We addressed this question by measuring the importance of each feature when it is used by itself to train an LtR model. Note that the measured importance of features is expected to vary from one model to another, as different models make different assumptions about these features. However, for simplicity and to reduce the dimensionality of our problem, we consider a single model, LambdaMART, one of the top LtR learners as shown in Section 4.2. We also use nDCG@10 as the measure of a feature's importance. Table 12 summarizes our results, showing each feature's importance as an average nDCG@10 value (the feature "symbol code" is added to every single feature to distinguish between data instances). Table 12 indicates that the chosen set of features can indeed be used for our task, as they provide good discrimination among the different labels and predict the most relevant stock symbols for a given year. Those features, however, vary in their impact and importance, as we see that a statistic such as "the growth in the market value of a company within the last three years" is more significant than "a company's capital or its market value" (as indicated above, these observations hold for LambdaMART and cannot be generalized to all LtR learners). Combining those features in a single model is expected to result in high prediction effectiveness, as shown in Table 12.
Finally, we conclude our discussion by drawing the reader's attention to a few issues related to our work. First, in this work, we showed a case study for applying a set of machine learning and computational intelligence methods to the task of passive investment portfolio management. Nevertheless, our work does not aim to advocate for adopting passive management over active management (or vice versa). This is beyond the scope of our analysis, as we believe that both have benefits and drawbacks (e.g., passive management comes at lower computational and managerial costs; however, it may lead to a major drawdown of an investment portfolio). Second, it should be noted that the framework proposed in this paper is aimed at providing financial analysts and investors with a set of tools to assist them in decision-making when considering the task of stock selection. It is not aimed at advocating the full automation of the investment task or replacing domain experts with machines. We believe that, due to the high risks associated with investing in financial markets, such tasks require human experts' supervision and intervention (if needed).
Lastly, although our framework has been shown to lead to high effectiveness and to have the potential to produce high investment returns, one might argue that our empirical study is limited in that it considered a period in which the markets were growing and trending upward. This is a valid concern and a limiting factor of this study, as the period considered indeed does not exemplify a recession period for financial markets. Moreover, there might be a potential bias in the data used, as historical data for financial markets are generally known to be biased by their nature (which could be attributed to a variety of macroeconomic factors). However, we argue that our analysis shows that, for some years in our testing period, the overall growth of the market is relatively low and is not comparable with the returns produced by our top-performing learners. This may suggest that these models remain useful even during recession periods, as they are expected to detect stock symbols with high potential returns from a large pool of underperforming symbols.

Figure 1. Samples of profile information that Tadawul has released for each stock symbol, including (a) company identification information, (b) equity profile, (c) company overview information, and (d) company capital-changes history.

Figure 2. A sample of annual financial results companies release and Tadawul publishes.

Table 1. Sample of daily Saudi Exchange trading data released by Tadawul's authority. Stock prices are reported in Saudi riyals (SAR).

Table 2. A set of 15 features is generated for each year i (query) and stock symbol j (item).

Table 4. The ranking performance results for applying nine LtR models to predict the top stocks in Saudi Exchange for four years, 2018, 2019, 2020, and 2021. Underlined values represent the models with the highest effectiveness for a metric. Superscript numerals in parentheses represent the rank of a model among all models using nDCG.

Table 6. The ranking performance results for applying two rank fusion methods, ISR and RRF, to combine the ranked lists of the nine LtR models for four years, 2018, 2019, 2020, and 2021.

Table 7. The returns produced by each model's simulated portfolio (top 10 selected stocks) for the four years, 2018, 2019, 2020, and 2021. Underlined values represent the models with the highest earnings. Superscript numerals in parentheses represent the rank of a model among all models based on total and average returns.

Table 8. The returns produced by each model's simulated portfolio (top 20 selected stocks) for the four years, 2018, 2019, 2020, and 2021. Underlined values represent the models with the highest earnings. Superscript numerals in parentheses represent the rank of a model among all models based on total and average returns.

Table 9. The performance of the best-performing hedge funds managed by investment firms in Saudi Arabia. Returns are compared with the top two performing simulated LtR portfolios for the period from January 2018 to December 2021. Underlined values represent the portfolio with the highest earnings.

Table 11. Pearson's correlation coefficients (p) between models' returns and models' performances (using both nDCG and P) are estimated for the years 2018, 2019, 2020, and 2021.

Table 12. Feature importance is estimated by training a model for each feature and reporting average nDCG@10 values.