Know Your Clients’ Behaviours: A Cluster Analysis of Financial Transactions

Abstract: In Canada, financial advisors and dealers are required by provincial securities commissions and self-regulatory organizations (charged with direct regulation over investment dealers and mutual fund dealers) to respectively collect and maintain know your client (KYC) information, such as their age or risk tolerance, for investor accounts. With this information, investors, under their advisor's guidance, make decisions on their investments that are presumed to be beneficial to their investment goals. Our unique dataset is provided by a financial investment dealer with over 50,000 accounts for over 23,000 clients covering the period from January 1st to August 12th 2019. We use a modified behavioral finance recency, frequency, monetary model for engineering features that quantify investor behaviours, and unsupervised machine learning clustering algorithms to find groups of investors that behave similarly. We show that the KYC information, such as gender, residence region, and marital status, does not explain client behaviours, whereas eight variables for trade and transaction frequency and volume are most informative. Hence, our results should encourage financial regulators and advisors to use more advanced metrics to better understand and predict investor behaviours.


Introduction
Investors hire financial advisors to help them select, facilitate, and manage their investment choices. In Canada, the client-advisor relationship varies by institution and regulatory regime. Some investors ask advisors to provide advice but ultimately make their own investment choices, other investors ask for a recommendation and then approve the advisor's investment choices, while still others delegate full discretionary investment choices to the advisor. However, regardless of the relationship, advisors are expected to provide recommendations that are suitable for the client.
Suitability is described by regulators in Canada as a meaningful dialogue with the client to obtain a solid understanding of the client's investment needs and objectives, and to explain how a proposed investment strategy is suitable for the client in light of the client's investment needs and objectives (Ontario Securities Commission, 2014). One of the suitability determinants for advisors is to determine the general investment needs and objectives of their client and any other factors necessary for them to determine whether a proposed purchase or sale is suitable (Know Your Client or KYC). The assumption is that any subsequent purchases or sales (trading behaviour) will conform to the KYC attributes and therefore be suitable [1].
In this paper, we consider unique interconnected datasets of financial transactions and KYC attributes to examine the relationship between KYC and trading behaviour. The KYC data is comprised of objective demographic and identifying information and subjective financial situation information, where both are used to generate a client's risk tolerance. We quantify trading behaviour through metrics designed using an extended Recency, Frequency, and Monetary (RFM) model from behavioural finance. Our hypothesis is that groups of investors with similar KYC attributes will have the same risk tolerance and trading behaviours.
KYC information should inform a risk tolerance score which the financial advisor, informed by suitability regulations, uses to delineate client investment transactions.
We conduct our analysis using a machine learning k-prototypes clustering algorithm and visualize the clusters using t-distributed stochastic neighbour embeddings. Using advanced data analytics, our analysis shows that:
• Objective and subjective KYC data have little influence on trading behaviours (cf. Table 1).
• The distribution of risk tolerance across each cluster's trading behaviour is found to be similar, showing that trading behaviours may on occasion be inconsistent with the KYC generated risk tolerance (cf. Table 1 and Figure 12).
• KYC criteria appear to concentrate investors within narrow and rigid swim lanes and appear to do a poor job of accommodating trading behaviours at the extremes, whether highly risk-averse investors or those seeking higher risks (cf. Table 1 and Figure 12).
At the outset, the hypothesis for this paper was that a thorough and complete assessment of investor KYC data should lead to an accurate determination of risk tolerance and suitability requirements. In turn, those determinations should manifest downstream in trading behaviour and, eventually, in portfolio construction [2] and investment outcomes. Our conclusion that KYC data does not demonstrate a strong relationship to the trading behaviours exhibited by investors is important because "Know Your Client" is a foundational principle behind the concept of "suitability" and the corresponding investment regulatory framework deployed in many jurisdictions [3]. The principle has also become more important as employers and governments de-risk retirement and savings programs post-2009 and move more of the burden of investment decision making from professional portfolio managers to individual investors [4]. Furthermore, the topic has become more urgent given the events of early 2020.
[1] An important aspect of suitability is the product recommendation or KYP, which we will address in subsequent papers.
At this point, it is important to acknowledge that investor behaviour is a complex and dynamic topic. Investor behaviour is not only driven by the investor's personal motives, such as their goals and financial needs, but is also influenced by the advisor relationship, dealer processes, regulatory obligations, and market influences. As well, while the client onboarding and discovery process is foundational, it is also contextual and time-dependent since the corresponding product recommendations are constantly changing in real time. While the dataset and analysis used in this paper are unique, we are not privy to some of the subjective or undocumented influences and we cannot include them in our algorithms. We have also examined only one set period of time. It is therefore impossible for us to determine why the KYC process is not leading to the outcomes we would expect. Our analysis has inspired the question "Could protocols be improved?" but we cannot answer the question without further research [6].
The paper is organized as follows. The rest of Section 1 is a literature review on KYC regulations and trading behaviour. Section 2 introduces the client and advisor financial data collected by a dealer, and develops the features that were used to measure client behaviours. Section 3 describes the machine learning methods used to identify investor groups based on their KYC information and behaviour metrics. Section 4 shows the results from that clustering and Section 5 discusses the implications of the results and future work.

Investment suitability
Investors hire financial advisors who, in turn, recommend or distribute suitable financial products from investment dealers. The regulations for investment suitability for clients in Canada have been in place for decades and were formed through a collaboration of dealers, advisors, and regulators, with significant updates in 2009. This paper studies the KYC obligation that requires financial advisors and dealers to conduct due diligence on clients and take reasonable steps to establish such things as their identity, creditworthiness, investment needs, financial objectives, and risk tolerance. The KYC obligation is designed to protect clients and advisors from unnecessary financial risk that does not align with the needs of the client, and ensure advisors and dealers are acting in good faith.

Know your client
To fulfill the KYC suitability requirement, advisors meet with clients to determine the client's identity, investment needs, financial objectives and circumstances, and risk tolerance. Many, but not all, will use a formal questionnaire to help gather this information and score the risk tolerance [7]. An effective KYC protocol collects two types of information: (1) objective demographic information (legal identity), and (2) subjective information, from the perception of the client and their financial advisor, on the client's investment needs, financial objectives, investment knowledge, appetite for risk and circumstances. For example, the questionnaire typically establishes the client's identity by their full name, social insurance number, date of birth, address, and phone number. For investment needs, financial objectives and circumstances, they are asked about their income, net assets, living expenses, time horizon for the investment account, potential withdrawal of funds from the account over a year, how they would change their portfolio based on market changes, how they set aside savings, plan for retirement, and make retirement savings plan contributions.
To help determine risk tolerance, they are asked about investment knowledge, dependants, debt, willingness to take on risk based on situational questions, and what they want to accomplish with their wealth.
Research in the area of effective KYC protocols is at the emergent stage and has focused on the collection and evaluation of KYC information. The main focuses of research by the financial community have been on the objective information for improving compliance to prevent illegal or terrorist activities and decreasing the cost associated with increased compliance. Where KYC research exists, it tends to focus on cost efficiency via distributed ledger systems (Moyano and Ross, 2017), how the financial crisis in the USA from 2007 to 2009 may have been affected by non-compliance with US KYC regulations (Bilali, 2011), on using KYC to protect client accounts (Mondal et al., 2016), and on improving auditor effectiveness in evaluating KYC compliance (Smet and Mention, 2011).
In contrast, few studies have examined the subjective information of the KYC obligation and its relationship to advisor and client investment behaviours, client investment objectives and outcomes, and dealer strategies to assist their advisors (Ontario Securities Commission, 2015). Picard and de Palma (2010) reviewed a number of existing risk tolerance assessment tools and concluded that while the neoclassical economic concept of risk tolerance is clear, its measurement through surveys is unclear. Since the economic definition of risk tolerance is a variation in future spending, many economists use questions that measure income volatility over time in order to assess risk tolerance. These questions are theoretically correct, but their performance as predictors of actual investment behaviour during volatile stock markets is mediocre (Guillemette et al., 2012).

Trading behaviour
At the outset, the hypothesis for our research was that a thorough and complete assessment of an investor's KYC data should lead to an accurate determination of their risk tolerance and suitability requirements. In turn, those determinations should manifest downstream in trading behaviour and, eventually, in product recommendations, portfolio construction and investment outcomes.
In this paper, we look to better understand the relationship between collected KYC information and trading behaviours through applications of behavioural finance and statistical analysis. Behavioural finance is the intersection of psychology and finance to explain the trends and actions of financial markets, institutions, advisors, and individual investors. Behavioural finance has three main areas of application: analysis of patterns in stock returns, studying trading activity, and corporate finance (Subrahmanyam, 2008). Our analysis focuses on trading activity. Our dataset encompasses over 23,000 clients who work with financial advisors at an anonymous investment dealer under the auspices of the Investment Industry Regulatory Organization of Canada (IIROC) regulatory regime. We use an extended RFM behavioural finance model (Lumsden et al., 2008). RFM models are used primarily in direct marketing to analyze customer behaviours through the recency of their last purchase, the frequency of their purchases, and how much is spent on each purchase. RFM models have been embedded in data mining algorithms (Birant, 2011).
It is important to acknowledge that investor behaviour is a complex and dynamic topic. Investor behaviour is not only driven by the investor's personal motives, such as their goals and financial needs, but is also influenced by the advisor relationship, dealer processes, regulatory obligations, and market influences. While the dataset and analysis used in this paper are unique, we are not privy to some of the subjective or undocumented influences and we cannot include them in our algorithms. It is therefore impossible for us to determine why the KYC process is not leading to the outcomes we would expect. Our analysis has inspired the question "Could protocols be improved?" but we cannot answer the question without further research, which we discuss in Section 5.
In this section, we describe the KYC information and trades and transactions recorded in the data. We use descriptive analysis to demonstrate the demographics of our data and that the data is of good quality. We describe the features engineered from the data to be used in clustering, including unique metrics that measure client behaviours.

Data description and processing
The data is comprised of 52,025 accounts for 23,970 clients with associated KYC information, trade and transaction details from August 13th 2018 to August 12th 2019. The datasets were edited by the data donor prior to our receipt to ensure all client identifiers were anonymized consistent with Canada's Personal Information Protection and Electronic Documents Act (PIPEDA) and standard research ethics protocols.
Even using anonymization practices, there is still the possibility that clients could be identified using machine learning algorithms (Rocher et al., 2019). Therefore, no individuals will be identified or referenced in this paper and no subset of the data can be shared with readers.
The data is organized into linked datasets where entries were uniquely determined by an anonymized account ID or other relational database information.The specific datasets we used are a KYC information dataset and a trades and transactions dataset.We created new features derived from both datasets that effectively supplement the KYC information with metrics that measure trading behaviours.
The data was processed by cleaning the data for improper entries (e.
The distribution of account residency is shown in Table 3, with the majority of accounts owned by clients in the province of Ontario. Figure 3 shows the distribution of annual income. The income distribution has an average of $70,658 and is right-skewed, with 50% of clients making less than $60k. There are also income spikes at $50k, $100k, $150k, and $200k. Table 4 shows the number of accounts per client. Most clients have two accounts and few have five or more.
Our dataset contains a combination of trades and transactions for each client. We reserve the word "trades" for any interaction with mutual funds, stocks, securities, and bonds, and "transactions" for any other interaction, such as collecting dividends and interest. Trades are logged as orders, which are either active, inactive, filled, rejected, cancelled, or expired. In this paper, only filled orders are studied; the study of investor behaviours through their full order history is deferred to future work.
Each trade and transaction is recorded with the type of product or transaction, size, value, currency type, security identification code, order date, process date, value date, and more. Using the trades and transaction dataset, we determined the variables that we believe contain information on client behaviours and developed new metrics using feature engineering to measure client behaviour.

Feature engineering
Feature engineering in data science is the process of using industry knowledge about data to construct metrics or "features" that can act as a measure for a quantity to be used in a machine learning model (Zheng and Casari, 2018). Features generated from an RFM model can be used in conjunction with a machine learning algorithm (Anitha and Patil, 2019). We construct features using objective and subjective KYC information, and trade and transaction information, that we believe to be related to client investment behaviour. Our features are an extension of an RFM model and fall into four categories: recency, frequency, monetary, and profile (RFMP).
The RFMP features are aggregated into a cross-sectional dataset that is static in time, where the cross-section is calculated on the last day recorded (August 12th 2019) in the dataset. Table 5 lists the features used for the clustering algorithm described in Section 3 and to generate the results shown in Section 4. We now describe each type.
Profile features describe who the client is and what their financial goals are. Commonly, they are considered influential factors in the behaviour of the client (Foerster et al., 2017). Profile features are generated from KYC and account information for each of the clients. Some of the profile features were immediately ready for use (for example, the time horizon of the account) whereas other variables needed to be derived; age in years is calculated from birth dates and the number of accounts is determined by searching the database for client accounts.
The recency feature is calculated as the number of days since a client's most recent trade or transaction. The frequency features are calculated from a client's overall amount of trading throughout the history of the dataset. These two feature types provide some information on their own, but are more informative when used together. For example, a client with a large total number of trades (frequency) and many months since their last trade (recency) exhibits a "burst" investing behaviour.
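As an illustration of the recency and frequency definitions, the following minimal sketch computes both from a list of dated trades. The record layout (client identifier, trade date), the sample values, and the function name are hypothetical and are not taken from the dataset; the cross-section date is the last day recorded, as described above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (client_id, trade_date). Values are illustrative only.
trades = [
    ("A", date(2019, 1, 15)),
    ("A", date(2019, 1, 20)),
    ("A", date(2019, 2, 3)),
    ("B", date(2019, 8, 1)),
]

def recency_frequency(trades, as_of):
    """Return {client_id: (days_since_last_trade, number_of_trades)}."""
    last_trade = {}
    counts = defaultdict(int)
    for client_id, trade_date in trades:
        counts[client_id] += 1
        # Keep the most recent trade date seen so far for this client.
        if client_id not in last_trade or trade_date > last_trade[client_id]:
            last_trade[client_id] = trade_date
    return {c: ((as_of - last_trade[c]).days, counts[c]) for c in counts}

# Cross-section taken on the last day recorded in the dataset.
features = recency_frequency(trades, as_of=date(2019, 8, 12))
```

In this toy input, client A combines several trades with a long gap since the last one, which is the "burst" pattern described above, whereas client B has a single recent trade.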
The monetary features are engineered from trade and transaction amount details, rather than their temporal attributes. Specifically, the trade size multiplied by the value of each unit gives the total monetary value in CAD, which we refer to as the trade amount. If we treated every trade as equivalent, as the recency and frequency features do, we would incorrectly consider purchasing a stock to be the same as re-investing a dividend. The stock purchase is an active trade that a client or advisor initiates, whereas a re-invested dividend is not. We classify trade sizes into the three metrics given by
\[ \text{Third-party initiated trade size} = \dots \tag{1} \]
\[ \text{Systematic trade size} = \text{Auto withdrawal} + \text{Pre-authorized contribution} + \dots \tag{2} \]
\[ \text{Periodic trade size} = \text{Buy (securities)} + \text{Sell (securities)} + \text{Contribution} + \text{Exchange} + \text{Payment} + \text{Electronic funds transfer (EFT)} + \text{Withdrawal} \tag{3} \]
where the descriptions of the trade types can be found in Appendix A. Third-party initiated trades are comprised of trade types that are initiated by a third party, such as a coupon collected as cash from a bond.
Systematic trades are comprised of self-imposed automatic investment strategies, such as an automatic monthly withdrawal from savings to purchase a mutual fund.Periodic trades are client or advisor initiated trades and transactions, such as an unscheduled purchase of a mutual fund for a TFSA.
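The classification into three behavioural metrics can be sketched as a lookup from trade type to category, followed by a sum of trade amounts (size multiplied by unit value). The mapping below is partial and illustrative: the periodic types follow the list in the text, while the entries for the other two categories, the record layout, and all names are assumptions.

```python
from collections import defaultdict

# Partial, illustrative mapping from trade type to behavioural category.
# Only the periodic list is fully given in the text; the third-party and
# systematic entries below are assumed examples, not the complete definitions.
CATEGORY = {
    "Dividend": "third_party",
    "Income distribution": "third_party",
    "Auto withdrawal": "systematic",
    "Pre-authorized contribution": "systematic",
    "Buy": "periodic",
    "Sell": "periodic",
    "Contribution": "periodic",
    "Exchange": "periodic",
    "Payment": "periodic",
    "EFT": "periodic",
    "Withdrawal": "periodic",
}

def trade_amounts_by_category(trades):
    """Sum size * unit value (the trade amount) per client and category."""
    totals = defaultdict(lambda: defaultdict(float))
    for client_id, trade_type, size, unit_value in trades:
        category = CATEGORY.get(trade_type)
        if category is None:
            continue  # unknown trade type: left unclassified in this sketch
        totals[client_id][category] += size * unit_value
    return {client: dict(by_cat) for client, by_cat in totals.items()}

# Hypothetical records: (client_id, trade_type, size, unit_value_in_CAD).
trades = [
    ("A", "Buy", 10, 25.0),
    ("A", "Dividend", 1, 12.5),
    ("A", "Auto withdrawal", 1, 100.0),
    ("B", "Sell", 4, 50.0),
]
totals = trade_amounts_by_category(trades)
```

Keeping the mapping in a single dictionary makes the category definitions easy to audit against the trade type descriptions.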
Figure 4 shows the relative percentages of transaction sizes comprising the three behavioural metrics in Equations (1) to (3) versus time. For third-party initiated trade size, dividend and income distribution dominate most of the transactions, and there appears to be a cyclical trend for dividends paid at the beginning of every month. For systematic trades, automatic withdrawal represents the majority of the feature size and has an obvious cyclical trend. There are spikes for asset allocation at the beginning of the year and six months in, which indicates a bi-annual cycle for asset allocations in systematic trades. For the periodic trades, the buy and sell types dominate without any cyclical trends. The features we engineer in this section are used directly as variables in our clustering model in Section 4.
The next step is to take our engineered features and use them in a clustering algorithm. The theoretical underpinnings for our algorithm are described in the next section, which is followed by empirical results from clustering in the subsequent section.

Clustering theory and methods
Clustering is an unsupervised machine learning technique used to draw inferences about group commonalities among like individuals in high-dimensional data. It is a popular method for exploratory data analysis that finds previously unknown structures in data without specifying the underlying data generating process. Clustering is a powerful technique used in many fields, such as identifying fake news (Hosseinimotlagh and Papalexakis, 2018), bioinformatics (Krishna and Murty, 1999; Lan et al., 2018), text mining (Berry and Castellanos, 2004), and wireless sensor networks (Abbasi and Younis, 2007).
Clustering bears the task of grouping our set of clients by considering the similarity of their attributes and trading behaviour (Xu and Wunsch, 2008). We are interested in applications of clustering for financial data analytics (Le-Khac et al., 2012), particularly the area of Behaviour Clustering Analysis (BCA). Popular clustering algorithms used in this field are k-means (Steinley, 2006) and k-modes (Huang, 1998; Chaturvedi et al., 2001; Huang and Ng, 2003). In this section, we introduce the k-prototypes algorithm, which allows both continuous and categorical data to be used to cluster clients based on their similarity.
Next, we introduce t-distributed stochastic neighbour embeddings, which reduce the dimensions of the data based on the similarity of each data point. The embeddings display the data in low dimensions by similarity, while the clustering algorithm identifies the clusters among the data points.

k-prototypes clustering
The k-prototypes algorithm used here is similar to the k-means algorithm, but incorporates methods for including categorical data (Huang, 1997). Suppose we have a set of $N$ clients, each with a unique identifier or index in the set $\mathcal{N} = \{1, 2, \dots, N\}$. The goal of any clustering algorithm is to put clients into $k$ groups or clusters such that
• each client is put into exactly one cluster;
• clients within a cluster have similar attributes; and
• clients in different clusters have dissimilar attributes.
Mathematically, the $k$ clusters form a partition of the client index set into $k$ subsets. Let $\mathcal{N}_\ell$ denote the set of client indices for all clients in cluster $\ell$, $\ell = 1, 2, \dots, k$, and $P_{\mathcal{N}} = \{\mathcal{N}_1, \mathcal{N}_2, \dots, \mathcal{N}_k\}$ denote the partition of the client index set. Furthermore, let $n_\ell$ denote the number of clients in cluster $\ell$, such that
\[ \sum_{\ell=1}^{k} n_\ell = N. \]
Each client has attributes that describe the individual, given by their attribute vector $x_i$, $i = 1, \dots, N$. These attributes are a combination of $p$ numeric variables (e.g., age) and $q$ categorical variables (e.g., marital status). Without loss of generality, we put the numeric attributes in the first $p$ positions of the attribute vector and the categorical attributes in the last $q$ positions, giving
\[ x_i = (x_{i1}, \dots, x_{ip}, x_{i,p+1}, \dots, x_{i,p+q}). \]
The clustering algorithm works in an iterative fashion according to the following steps.
1. Initialize the centroid (location) of the clusters by selecting k clients as "prototype" centroids.
2. Allocate the clients to the clusters with the closest centroid.
3. Compute an overall cost of the allocation by computing total distance of all clients from their assigned centroids.
4. Update the centroid of each cluster using the clients currently allocated to it.
5. Re-allocate the clients to the clusters with the closest (updated) centroid.
6. Compute the overall cost by computing total distance.
7. Iterate steps 4-6 until there is no change in the overall cost and output the clusters.
We begin by randomly selecting $k$ clients to serve as the initial centroids (locations) of the clusters. Specifically, the initial centroids are given by the attribute vectors of the $k$ randomly chosen clients and are denoted by
\[ c_\ell = (c_{\ell 1}, \dots, c_{\ell p}, c_{\ell,p+1}, \dots, c_{\ell,p+q}), \quad \ell = 1, \dots, k, \]
where $c_{\ell j}$ is the cluster-$\ell$, attribute-$j$ centroid. Attributes in the centroid vectors are positioned in exactly the same order as in the client attribute vectors. As we shall see, as clusters are formed the centroids are updated according to the individuals within each cluster.
After initializing the cluster centroids, we need some way of deciding how to put the clients into the clusters so that individuals within clusters are similar (close) and individuals across clusters are dissimilar (far apart).
To measure the similarity between client $i$ and cluster $\ell$ we use the distance metric $d(x_i, c_\ell)$ of the k-prototypes algorithm (Huang, 1997), which combines a Euclidean-type distance over the $p$ numeric attributes with a measure of mismatch over the $q$ categorical attributes. Note that the distance metric is zero if and only if the attribute vector is exactly the same as the centroid, and if there are no categorical variables ($q = 0$) then $d(\cdot, \cdot)$ is the usual Euclidean distance.
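For reference, the mixed-type dissimilarity proposed for k-prototypes by Huang (1997) is commonly written as below. This is a sketch of the standard form rather than a restatement of the exact expression used in this paper: the weight $\gamma > 0$, which balances categorical mismatches against the numeric part, and the indicator $\delta$ are notation assumed from that literature.

```latex
d(x_i, c_\ell) = \sum_{j=1}^{p} (x_{ij} - c_{\ell j})^2
  + \gamma \sum_{j=p+1}^{p+q} \delta(x_{ij}, c_{\ell j}),
\qquad
\delta(a, b) =
\begin{cases}
  0, & a = b, \\
  1, & a \neq b.
\end{cases}
```

In this form, with $q = 0$ the second sum vanishes and the measure reduces to the squared Euclidean distance, which selects the same closest centroid as the Euclidean distance itself.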
For client $i$, the distance between its attribute vector and each of the cluster centroids is computed, $d(x_i, c_\ell)$, $\ell = 1, \dots, k$, and the client is placed in the closest cluster (i.e., the one at minimum distance). This is done for all $N$ clients (the clients initially chosen as centroids will clearly be placed in the correct cluster), with each client assigned to exactly one of the clusters.
After all clients are assigned to a cluster, the overall distance between individuals and their cluster centroid is computed by the cost function
\[ E = \sum_{\ell=1}^{k} \sum_{i \in \mathcal{N}_\ell} d(x_i, c_\ell). \tag{8} \]
The cluster centroids are updated by independently finding the middle for each cluster's attributes. For the numeric attributes, the centroids are updated to be the within-cluster average value. Specifically, the updated $j$-th attribute for cluster $\ell$ is
\[ c_{\ell j} = \frac{1}{n_\ell} \sum_{i \in \mathcal{N}_\ell} x_{ij}, \quad j = 1, \dots, p. \]
The categorical attributes of each cluster are updated using the mode, given by
\[ c_{\ell j} = M(\{x_{ij} : i \in \mathcal{N}_\ell\}), \quad j = p+1, \dots, p+q, \]
where $M$ is the mode function. Next, we re-allocate each client to clusters using the minimum distance between the client attribute vector and the updated cluster centroids. After re-allocation, the overall cost is computed using Equation 8. If the total cost is unchanged from the previous iteration, we stop; otherwise, the cluster centroids are updated and clients are re-allocated. This is repeated until the total cost function is unchanged.
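The iteration described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the results: the function names, the squared Euclidean numeric term, the mismatch weight `gamma`, the rule for empty clusters, and the toy data are all assumptions.

```python
import random
from collections import Counter

def distance(x, c, p, gamma=1.0):
    """Squared Euclidean distance on the first p (numeric) attributes plus
    gamma times the number of mismatches on the categorical attributes.
    gamma is an assumed weight, not a value taken from the text."""
    numeric = sum((x[j] - c[j]) ** 2 for j in range(p))
    mismatches = sum(x[j] != c[j] for j in range(p, len(x)))
    return numeric + gamma * mismatches

def allocate(data, centroids, p, gamma):
    """Assign each client to its closest centroid; return labels and total cost."""
    labels, cost = [], 0.0
    for x in data:
        dists = [distance(x, c, p, gamma) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(best)
        cost += dists[best]
    return labels, cost

def update(data, labels, centroids, p):
    """Numeric attributes: within-cluster mean. Categorical: within-cluster mode."""
    new_centroids = []
    for cluster, old in enumerate(centroids):
        members = [x for x, lab in zip(data, labels) if lab == cluster]
        if not members:  # empty cluster: keep the previous centroid (assumption)
            new_centroids.append(old)
            continue
        means = [sum(x[j] for x in members) / len(members) for j in range(p)]
        modes = [Counter(x[j] for x in members).most_common(1)[0][0]
                 for j in range(p, len(old))]
        new_centroids.append(tuple(means + modes))
    return new_centroids

def k_prototypes(data, k, p, gamma=1.0, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)                        # random initial prototypes
    labels, cost = allocate(data, centroids, p, gamma)     # allocate and cost
    for _ in range(max_iter):
        centroids = update(data, labels, centroids, p)     # update centroids
        labels, new_cost = allocate(data, centroids, p, gamma)  # re-allocate
        if new_cost == cost:                               # stop when cost is unchanged
            break
        cost = new_cost
    return labels, centroids, cost

# Toy clients: one numeric attribute (p = 1) followed by one categorical attribute.
clients = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (10.0, "b"), (10.5, "b"), (9.5, "b")]
labels, centroids, cost = k_prototypes(clients, k=2, p=1)
```

Numeric attributes would normally be put on a comparable scale before clustering, since the numeric term otherwise dominates the mismatch count; that preprocessing is left out of the sketch.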
Since the initial set of $k$ cluster centroids (i.e., the $k$ clients serving as initial centroids) is chosen randomly, the clustering process is repeated for a large number of randomly chosen initial cluster centroids to better search for the global minimum of the cost function. Each set of initial cluster centroids produces clusters and their total cost. The best (and final) clustering is the one that minimizes the cost function over all randomly chosen initial cluster centroids. Typically it is infeasible to look at all possible sets of $k$ initial cluster centroids, which is the reason for the random sampling of the initial cluster centroids. For example, with $N = 25000$ clients and $k = 5$ clusters, the number of possible ways of choosing the initial cluster centroids is
\[ \binom{25000}{5} = \frac{25000 \times 24999 \times 24998 \times 24997 \times 24996}{5!} \approx 8.1 \times 10^{19}, \]
which is an infeasible number of possibilities to examine.
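The size of this count can be checked directly with Python's standard library (`math.comb` requires Python 3.8 or later); the variable names are illustrative.

```python
import math

N, k = 25000, 5

# Number of unordered ways to choose k initial centroids from N clients.
n_initializations = math.comb(N, k)

# The same count written as the product in the text divided by 5!.
product_form = (25000 * 24999 * 24998 * 24997 * 24996) // math.factorial(5)
```

Both expressions give the same integer, on the order of 10^19, which is why the initial centroids are sampled at random rather than enumerated.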

Visualizing clusters: t-distributed stochastic neighbour embeddings
Visualizing high-dimensional data by projecting it onto a lower-dimensional space is common practice (Yang, 1999). The computationally efficient dimensionality reduction tool used herein is the t-distributed stochastic neighbour embeddings (t-SNE) (Maaten and Hinton, 2008). The t-SNE method provides a significant dimensionality reduction from high-dimensional data to two or three dimensions while preserving the significant structure. This method is a nonlinear mapping which, as opposed to linear mappings, performs better for preserving the local structure of data; that is, this method keeps similar clients close together in a low-dimensional visualization. This is important for visualizing clusters since we are using a clustering method that evaluates clients by their similarity. Therefore, the t-SNE method creates a map of clients based on their similarity, and then we independently apply the clustering algorithm to the data, all without specifying the data generating process.
Figure 5 displays the visualization of some sample client data; t-SNE is applied to project the high-dimensional data into the 2-D space. For the t-SNE method, "perplexity" is an important parameter that affects the visual behaviour of data projection. Different datasets require different perplexities to display the clustering features, or lack thereof, present in the data. According to Maaten and Hinton (2008), the perplexity can be viewed as the algorithm's method to measure the number of effective nearest neighbours, with typical values between 5 and 50. Choosing the perplexity value requires the user to tune it during the modelling process.
There is no standard method for specifying the perplexity value. Furthermore, larger datasets require a larger perplexity (van der Maaten, 2009). For our dataset, the perplexity value is set to 200 to get a stable embedded data plot.
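To make the "effective number of nearest neighbours" reading concrete, the sketch below computes the perplexity of one client's neighbour distribution as defined by Maaten and Hinton (2008): 2 raised to the Shannon entropy (in bits) of Gaussian-weighted neighbour probabilities. The function names and the toy distances are assumptions made for illustration.

```python
import math

def conditional_probabilities(sq_distances, sigma):
    """Gaussian similarities from one point to its neighbours, given the
    squared distances to them, normalised to sum to one."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probabilities):
    """Perplexity 2**H, where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy

# Eight equidistant neighbours: all are equally "effective".
uniform = perplexity(conditional_probabilities([1.0] * 8, sigma=1.0))

# Unequal distances: a narrow kernel concentrates on the closest neighbours,
# a wide kernel spreads weight over more of them.
sq_distances = [0.1, 0.5, 2.0, 9.0]
narrow = perplexity(conditional_probabilities(sq_distances, sigma=0.5))
wide = perplexity(conditional_probabilities(sq_distances, sigma=5.0))
```

With equidistant neighbours the perplexity equals the number of neighbours, and widening the kernel raises it. t-SNE works in the opposite direction: the user fixes the perplexity and the algorithm searches for the kernel width of each point that attains it.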

Results
In this section, we discuss the results of applying the method described in Section 3 to the client data discussed in Section 2. The data cleaning, feature engineering, clustering algorithm, t-SNE embedding visualization, and analysis are implemented using Python version 3.6 and R version 3.5.3 (R Core Team, 2020). The implementation of the k-prototypes clustering algorithm originated from a GitHub repository (de Vos, 2020) and the t-SNE algorithm used for data visualization is in the sklearn Python package (Pedregosa et al., 2011).
Figure 6 shows a two-dimensional similarity representation of the data using the t-SNE algorithm with a perplexity of 200 [9]. Each point represents one client's attributes projected down to two dimensions, where the Euclidean distance between clients in the embedding quantifies their similarity.
The next step is to use the k-prototypes clustering algorithm to identify the optimal number of clusters $k$ for this client dataset.

Choosing the optimal number of clusters
The Silhouette coefficient for client $i$ is defined as
\[ S_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}, \]
where $a_i$ is a similarity measure of client $i$ to clients within their cluster $\ell$ given by
\[ a_i = \frac{1}{n_\ell - 1} \sum_{j \in \mathcal{N}_\ell,\, j \neq i} d(x_i, x_j), \]
and $b_i$ is a similarity measure of client $i$ to the clients in the most similar or closest neighbouring cluster given by
\[ b_i = \min_{m \neq \ell} \frac{1}{n_m} \sum_{j \in \mathcal{N}_m} d(x_i, x_j). \]
The best assignment value for the coefficient is 1 and the worst value is -1, and values near 0 indicate overlapping clusters. Negative values generally indicate that a client may be poorly assigned, as a different cluster is more similar. Figure 7 shows the average Silhouette coefficient $S = \frac{1}{N} \sum_{i=1}^{N} S_i$ for $k = 2$ to 8 clusters. The average Silhouette coefficient is maximized for this clustering method when we choose $k = 5$ clusters.
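A minimal sketch of the average Silhouette coefficient, using the standard definition with within-cluster and nearest-cluster mean distances. For simplicity it assumes numeric attributes and Euclidean distance (`math.dist`, Python 3.8 or later); a mixed-type distance could be passed through the hypothetical `dist` argument instead. It also assumes at least two clusters and assigns zero to singleton clusters by convention.

```python
import math

def silhouette_coefficients(points, labels, dist=math.dist):
    """Silhouette coefficient S_i = (b_i - a_i) / max(a_i, b_i) for every point."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: S_i = 0 by convention
            scores.append(0.0)
            continue
        # a_i: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores

def average_silhouette(points, labels):
    scores = silhouette_coefficients(points, labels)
    return sum(scores) / len(scores)

# Toy data: two tight pairs, labelled well and labelled badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = average_silhouette(points, [0, 0, 1, 1])
bad = average_silhouette(points, [0, 1, 0, 1])
```

The well-separated labelling scores close to 1 and the mixed labelling scores below 0. Choosing the number of clusters then amounts to computing this average for each candidate value and taking the largest.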
The DB score (Davies and Bouldin, 1979) is another cluster partition evaluation metric that compares the similarity between clusters with the size of the clusters themselves. The DB score is calculated as
\[ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}, \]
where $k$ is the number of clusters, $s_i$ is the average distance of all clients in cluster $i$ from the centroid $c_i$, and $d_{ij}$ is the distance between cluster centroids $c_i$ and $c_j$. The DB index favours dense clusters that are far apart: it decreases as the clusters become more compact and as the separation between them increases, so lower values indicate a better partition. Similarly to the averaged Silhouette coefficient, the second plot in Figure 7 indicates that a $k = 5$ clustering partition yields the optimal clustering results.
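The DB score can be sketched in the same style. As an assumption for illustration, the example uses numeric attributes only, with mean centroids and Euclidean distance (`math.dist`, Python 3.8 or later); the function name and toy data are hypothetical.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin score: mean over clusters of the worst-case ratio
    (s_i + s_j) / d_ij against any other cluster. Lower is better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        label: tuple(sum(column) / len(members) for column in zip(*members))
        for label, members in clusters.items()
    }
    # s_i: average distance of the members of cluster i from its centroid.
    scatter = {
        label: sum(math.dist(x, centroids[label]) for x in members) / len(members)
        for label, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)

# Toy data: two tight pairs, labelled well and labelled badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
db_good = davies_bouldin(points, [0, 0, 1, 1])
db_bad = davies_bouldin(points, [0, 1, 0, 1])
```

The compact, well-separated labelling gives a much smaller score than the mixed labelling, matching the reading that lower DB values indicate a better partition.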
Figure 8 shows the overlaid cluster membership on the t-SNE visualization.Among the 5 clusters, cluster 1 has 19% of the clients and its data points are green on the embedding map, cluster 2 has the largest portion of clients with (36%) and is labelled blue, cluster 3 has 27% of clients and is labelled purple, cluster 4 the least portion (7%) of clients and labelled black, and cluster 5 has 12% of clients and is labelled orange.
From the two-dimensional embedding map in Figure 8, there are distinct boundaries between clusters 2 and 3 on one side and clusters 1, 4 and 5 on the other. There are overlaps between clusters 1 and 5, clusters 2 and 3, and clusters 1 and 4. It is noteworthy that a higher-dimensional embedding can reveal other higher-order boundaries that distinguish these overlapping clusters. The projection from three dimensions to two creates the visual appearance of overlap.

Figure 9: A dendrogram of the clustering result with a heat map. Each attribute value is scaled to lie in the interval [0, 1], where the minimum attribute value is scaled to zero and the maximum value is scaled to one. Larger values (whiter cells) indicate a larger value relative to other clients for the same attribute.

Within cluster analysis
Figure 11 shows the monthly average trade amount over time, where the shaded areas are 95% bootstrapped pointwise confidence intervals. We note first the scale of each type of trade in the figure, where there are three different orders of magnitude. This may be caused by the nature of the trade types or by the number of elementary trade types within each of the trade type classes defined in Equations (1) to (3).
• For third-party initiated trades, cluster 4 has a relatively high trade amount and the largest volatility. Cluster 1 has similarly high trade amounts but less volatility. Clusters 3 and 5 have very similar trade amounts and volatilities that are smaller on average than those of clusters 1 and 4. Cluster 2 has the lowest average trade size and volatility.
• For systematic trades, the pattern is similar to that of third-party initiated trades. Clusters 1 and 4 are again similar in trade amount and volatility, with cluster 4 having slightly larger amounts on average except in June. Clusters 3 and 5 have almost identical average trade amounts except in August, and cluster 2 has the smallest average trade amount. An interesting feature shared by all clusters is the peak in the average trade amount in January and June.
• Cluster 1 dominates the periodic trade amounts, while cluster 2 has almost zero periodic trade amounts on average with very little volatility.Clusters 3 to 5 have similar trade amounts and volatilities, except in February and March when there is a slight peak before trending down for clusters 3 and 5. Clusters 3 to 5 all have an uptick in the average trade amount in July.There is a clear scale difference compared to the previous two trade types.
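The shaded bands in Figure 11 are described as 95% bootstrapped pointwise confidence intervals. A minimal sketch of one common way to obtain such bands is the percentile bootstrap applied separately to each month, shown below. The number of resamples, the seed and the trade amounts are assumptions for illustration; the exact bootstrap variant used for the figure is not stated in the text.

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = round((1.0 - level) / 2.0 * n_boot)
    hi_idx = round((1.0 + level) / 2.0 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical trade amounts for one cluster, keyed by month.
monthly_amounts = {
    "Jan": [100.0, 120.0, 90.0, 300.0, 110.0, 95.0, 105.0, 130.0],
    "Feb": [80.0, 85.0, 95.0, 70.0, 110.0, 90.0, 100.0, 75.0],
}

# "Pointwise" means each month gets its own interval, computed independently.
bands = {month: bootstrap_mean_ci(vals) for month, vals in monthly_amounts.items()}
```

Because the intervals are pointwise, they describe uncertainty month by month and do not provide simultaneous coverage over the whole curve.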
Figure 12 shows the inferred risk tolerance (RT) score distributions for clients of each cluster. The majority of clients in each cluster's distribution (top four and bottom left panels) have an RT score close to three. Furthermore, the distributions appear quite similar, with smaller upticks at RT scores of two and four. The panel in the bottom right shows the overlaid translucent densities of each cluster, where the reddish-brown area is the shape that all clusters share.
We investigated the similarity of these distributions using a parametric ANOVA comparison of client RT score means and a nonparametric Kruskal-Wallis test (Kruskal and Wallis, 1952; McKight and Najab, 2010). The null hypotheses of both tests were rejected, with $P \leq 2 \times 10^{-16}$ and $P = 3.23 \times 10^{-79}$, respectively. A post hoc comparison of individual groups, with P-values adjusted for multiple comparisons, was conducted using Tukey's test (Tukey, 1949) for the ANOVA and the nonparametric Dunn's test (Dunn, 1964) for the Kruskal-Wallis test. The results of these tests are shown in Appendix C. These results suggest that clusters 3 and 4 have significantly different distributions from the rest. We investigated the difference in the distributions using the histogram density estimators (Figure 12) in a pairwise symmetric Kullback-Leibler (KL) plug-in estimator (Kullback and Leibler, 1951; Ramírez et al., 2004; Wang et al., 2005). The KL estimates show that the divergences involving the unlike clusters (3 and 4) are not much larger than the divergences among the like clusters (1, 2 and 5). The results of the symmetric KL estimators are shown in Appendix C.
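A plug-in estimate of the symmetric KL divergence can be sketched as follows: histogram both samples on a common grid, then sum the two directed divergences. The bin count, the RT score range of 1 to 5, the small smoothing constant that avoids empty bins, and the toy scores are all assumptions for illustration and may differ from the estimator used for Appendix C.

```python
import math

def histogram_probs(scores, lo=1.0, hi=5.0, bins=8, eps=1e-6):
    """Histogram estimate of bin probabilities on a fixed grid over [lo, hi].

    eps is an assumed smoothing constant that keeps every bin positive.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        idx = min(int((s - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

def kl(p, q):
    """Directed KL divergence sum_b p_b * log(p_b / q_b)."""
    return sum(pb * math.log(pb / qb) for pb, qb in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized divergence KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

# Hypothetical RT scores for two clusters.
scores_a = [3.0, 3.0, 3.5, 2.5, 3.0, 4.0, 2.0, 3.0]
scores_b = [3.0, 2.5, 2.5, 2.0, 3.0, 3.5, 2.0, 3.0]
p_a = histogram_probs(scores_a)
p_b = histogram_probs(scores_b)
divergence_ab = symmetric_kl(p_a, p_b)
```

Unlike a hypothesis test, the divergence gives a magnitude for how different two distributions are, which is why it is useful alongside very small P-values obtained from large samples.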
From these analyses of the distributions of inferred RT scores across the clusters, we can conclude that the distributions are similar, although there exists a statistically significant difference between the clusters.

Behaviourally, cluster 1 appears to pursue a riskier trading strategy and we would, therefore, have expected to see a strong weighting towards observations in the 4.0 to 5.0 RT score range. In fact, only 14.8% of cluster 1 clients fall into the 4.0 to 5.0 RT score range.

From data to Personas
The cluster memberships are determined by the similarity of individuals, and we are interested in studying how the groups differ from each other.Using the plots and information presented heretofore, we summarize how the clusters differ using the most important variables to their cluster classification.We note that individuals from two different groups may appear similar, but they are classified based on subtle differences determined by the clustering algorithm.
Using our understanding of investors and finance, we have created 'personas' for clients to ease discussions and help understand the groups as real people and not just data. The five personas are as follows:
• Cluster 1: Active Traders (19% of investors) trade frequently (weekly and monthly) and in large amounts. The pattern of trades is seemingly random and initiated manually. These investors had investments across a spectrum of accounts (mainly registered savings plans (RSPs) and TFSAs), and were of an "average" age distribution and demographic. They had a derived risk tolerance rating that averaged 3.19 with standard deviation 0.63, where 1 is a low or preservative risk tolerance and 5 is high or aggressive.
• Cluster 2: Early Savers (36%) never actively trade and instead rely on systematic transactions (autowithdrawal, pre-authorized contribution, asset allocations).This group tended to have investments in cash accounts and to be younger.They had a derived risk tolerance rating that averaged 3.18 with standard deviation 0.75.
• Cluster 3: Just-In-Time (27%) initiate trades manually but far less frequently than Cluster 1 and in smaller amounts. These investors had investments across a spectrum of accounts (RSPs, TFSAs etc.), and were of an "average" age and demographic. They had a derived risk tolerance rating that averaged 3.12 with standard deviation 0.73.
• Cluster 4: Older Investors (7%) trade infrequently, and their trades were initiated either systematically or by a third party (pre-authorized withdrawals, dividends and other disbursements). This cluster had an above-average concentration of RIFs, and tended to be older. They had a derived risk tolerance rating that averaged 2.95 with standard deviation 0.71.
• Cluster 5: Systematic Savers (12%) trade recurrently (every 60, 90, or 120 days), in small amounts driven by systematic processes (dollar cost averaging) and periodic trading. These investors had investments across a spectrum of accounts (RSPs, TFSAs etc.), and were of an "average" age and demographic. They had a derived risk tolerance rating that averaged 3.19 with standard deviation 0.76.

Discussion and Future Plans
We have applied a variety of approaches to analyze the client dataset and extract financial behaviours. We have constructed data summaries and extracted features that we believe capture financial behaviours, and included those summaries and features in a descriptive analysis. The features engineered from our data will directly affect the performance of the future predictive models we are developing. We applied a k-prototypes clustering algorithm to the extracted features, where the cluster memberships were determined by minimizing a similarity cost function. We evaluated our clustering method using a Silhouette coefficient and a DB score, and analyzed the clustering results using the centroids generated by the algorithm and t-SNE visualizations.
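The cost that k-prototypes minimizes can be illustrated with a short sketch of its assignment step. It uses the mixed dissimilarity from Huang's formulation, squared Euclidean distance on numeric features plus a weight gamma times the number of categorical mismatches. The value of gamma, the feature layout and the toy clients below are assumptions for illustration and are not the settings used in this study.

```python
def mixed_dissimilarity(num_x, cat_x, num_q, cat_q, gamma):
    """Squared numeric distance plus gamma times the categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_x, num_q))
    mismatches = sum(1 for a, b in zip(cat_x, cat_q) if a != b)
    return numeric + gamma * mismatches

def assign_to_prototypes(clients, prototypes, gamma=1.0):
    """Assign each client to its closest prototype and return the total cost."""
    labels, cost = [], 0.0
    for num_x, cat_x in clients:
        dists = [
            mixed_dissimilarity(num_x, cat_x, num_q, cat_q, gamma)
            for num_q, cat_q in prototypes
        ]
        best = min(range(len(dists)), key=dists.__getitem__)
        labels.append(best)
        cost += dists[best]
    return labels, cost

# Hypothetical clients: (scaled numeric features, categorical features).
toy_prototypes = [((0.0, 0.0), ("RSP",)), ((1.0, 1.0), ("Cash",))]
toy_clients = [
    ((0.1, 0.0), ("RSP",)),
    ((0.9, 1.0), ("Cash",)),
    ((0.4, 0.4), ("Cash",)),
]
toy_labels, toy_cost = assign_to_prototypes(toy_clients, toy_prototypes)
```

A full k-prototypes run alternates this assignment step with an update step that recomputes numeric means and categorical modes for each prototype until the cost stops decreasing. The third toy client shows the role of gamma: it is numerically closer to the first prototype, but the categorical mismatch penalty moves it to the second.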
The ultimate goal of our research is to provide enhanced advice to clients and their advisors using both traditional and digital approaches.The projects described herein are a path to attain that goal, providing the necessary algorithms to give information and advice in good faith.The projects not only support digital advice, but the results can be used to report to regulatory committees on how data-driven results can aid regulators in promoting financial wellness policies.
Moving forward, we will examine the behaviours of the clusters against the suitability and KYC protocols noted in this paper and then attempt to determine if those behaviours have a constructive or destructive impact on client outcomes.We also plan to examine the impact that advisor behaviours have on the analysis noted above while looking for evidence for whether we can change or nudge any or all of the noted behaviours.Previous research has determined that traditional characteristics explain only 12 percent of an investor's portfolio allocations (Foerster et al., 2014;Grace, 2014;Foerster et al., 2017;Linnainmaa et al., 2018).Our goal is to use new, sophisticated technologies to help examine the remaining 88 percent of unexplained investor behaviour (Grace, 2019).

Trade and Asset Mix
At the root of modern portfolio theory is the assumption that a portfolio's asset mix drives the portfolio's inherent risk. The determination of suitability, based on the KYC, extends through portfolio construction to ensure that the portfolio's asset mix is consistent with the investor's risk tolerance. In the next phase of the project, we will use the same statistical techniques and dataset above to examine whether the trading behaviour identified in each cluster is "suitable" as defined by the prescribed regulations. We will complete this analysis by looking at the asset mix exhibited by each cluster. We will evaluate the security risk in the context of the client risk derived from the attributes of the cluster analysis. We will use security risk ratings (SRR) that are defined by industry for each of the securities bought, sold, and held by the client. These risk ratings are required by regulators under the Know Your Product protocols (Ontario Securities Commission, 2019). We will examine the trading behaviour and trade mix at specific points in time and then along a longitudinal continuum to see if the relationship changes over time. From this analysis, we will be able to determine if investor behaviour is suitable. We will examine how the trading behaviour exhibited by each cluster impacts their portfolios and the probability of achieving their desired outcomes. We will also look for evidence of whether the investors' trading behaviour leads to unintended changes in the portfolio's asset mix and risk characteristics over time.

Portfolio Returns
Where the analysis noted in the previous projects examines risk and the probability of success, we also plan to examine returns. We will analyze the assumption that higher risk should lead to higher returns (in the long run) and presumably faster portfolio growth. Likewise, lower risk will presumably lead to more modest returns and preservation of capital. During this examination, we will use multiple methods to calculate returns, including industry best practices and regulatory guidance.

Advice
This project recognizes that investor behaviour is a complex event with a number of variables influencing behaviour. Spouses, family, friends, media and events, for example, can all influence the timing, characteristics and trajectory of behaviour. However, it is widely acknowledged that the investment advisor acts as the gatekeeper for most investment trades and therefore, presumably, the trading behaviour (Marsden et al., 2011; Montmarquette et al., 2012; Investment Funds Institute of Canada, 2012; Kinniry et al., 2014). In this project, we will look for evidence of whether the advisor's behaviour is influencing trading behaviour consistent with the KYC and suitability requirements.

Investor Outcome Improvements
In this project, we will take advantage of a second unique data set to examine whether it is possible to change or influence investor behaviours through new, systematic technologies. Using the same methodologies above, and the same set of investors, we will examine investor behaviour before and after a significant system enhancement implemented in November 2019, leading into the market events of March 2020. We will make use of control charts to help determine the key variables that drive risky behaviour over time. This analysis will help assess the viability of potential new algorithms in the digital advice space.

Figure 1: The downstream footprints of KYC regulations.

Figure 2: Distribution of client ages, where each bin contains one year.

Figure 4: The relative percentage of transaction sizes from the three behavioural metrics versus time (January to August 2019). Top, middle, and bottom panels correspond to third-party initiated, systematic, and periodic trades, respectively.

Figure 5: A t-SNE 2-D projection of a small sample of client data.

Two clustering performance evaluation methods are used to determine the optimal number of clusters: the Silhouette coefficient and the Davies-Bouldin (DB) score. The Silhouette coefficient (Rousseeuw, 1987) evaluates the cluster assignment of each client by comparing their similarity to clients within their own cluster with their similarity to clients in other clusters, and so indicates how well clients are assigned. The Silhouette coefficient of client i is defined in terms of a within-cluster measure a_i and a nearest-neighbouring-cluster measure b_i.

Footnote 9: See Section 3.2 for a discussion of perplexity for the t-SNE method.

Figure 6: t-SNE visualization for the full data set projected onto two embedding dimensions.

Figure 9 shows a tree-structured dendrogram with a heat map to visualize the pattern within and between clusters' attributes.A sample of 53 clients from the dataset is selected by stratified random sampling, where each cluster represents a stratum and the relative number of selected individuals is proportional to the cluster size.Each row of the dendrogram shows an individual client's attributes, and the columns show the features used in clustering.The first column is the clustering labels from Figure 8.For each remaining column, a

Figure 10 shows the clustering results for categorical features. For the residency and gender features, there are no obvious differences between clusters. For the age feature, cluster 4 has a high average age, and its distribution is left-skewed and appears almost bimodal. Clusters 1, 3 and 5 have similar age distributions. The cluster 2 age distribution appears shifted left, with younger clients than the other clusters. The bottom right panel shows the percentages of the six account types in the different clusters. Clients in clusters 1, 3 and 5 have similar account proportions. Cluster 2 has more cash accounts and cluster 4 has more RIF accounts.

Figure 10: Categorical and numerical distributions of clusters.Top left panel shows the residency distributions, top right shows the gender distributions, bottom left shows the age distributions, and bottom right shows the account type distributions for each cluster.

Figure 12: RT score distributions by cluster.The top four and bottom left panels are each cluster's distribution of the number of clients by inferred RT score.The bottom right panel is each of the clusters' risk score density overlaid.
Zhexue Huang. Clustering large sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21-34, 1997.
Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283-304, 1998.
Zhexue Huang and Michael K. Ng. A note on k-modes clustering. Journal of Classification, 20(2):257, 2003.
Investor Economics Investment Funds Institute of Canada.

Table 1: KYC demographics and trading behaviours compared to expected risk tolerance and anticipated risk tolerance for each cluster.
On the remaining data, imputation is conducted for each numeric and categorical feature based on existing values. For example, missing values in categorical variables such as 'residency' are filled with the mode value 'Ontario', since more than 67% of clients are from Ontario; missing values in numerical variables such as 'annual income' are filled with the mean income for the client's job category from the KYC. See Table 8 in Appendix B for more details on missing data. Table 2 shows the details of the pertinent objective KYC information. The distribution of client age is shown in Figure 2. The client age distribution is unimodal, centred at 58.1 years, has a standard deviation of 14.1 years, and is slightly left-skewed. The minimum age is 18 years, the legal age to open an account in Canada, and the maximum is 98.
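The two imputation rules described above can be sketched as follows: a categorical column is filled with its mode, and a numeric column is filled with the mean of the client's group. The column names ('residency', 'occupation', 'annual_income'), the fallback to the overall mean for groups with no observed values, and the toy records are assumptions for illustration.

```python
from collections import Counter, defaultdict

def impute_mode(rows, column):
    """Fill missing (None) values of a categorical column with its mode."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{column: mode}) if r[column] is None else dict(r) for r in rows]

def impute_group_mean(rows, column, group_column):
    """Fill missing numeric values with the mean of the row's group.

    Falls back to the overall mean when a group has no observed values
    (an assumption; the text does not describe this case).
    """
    groups = defaultdict(list)
    for r in rows:
        if r[column] is not None:
            groups[r[group_column]].append(r[column])
    observed = [v for vals in groups.values() for v in vals]
    overall = sum(observed) / len(observed)
    filled = []
    for r in rows:
        new = dict(r)
        if new[column] is None:
            vals = groups.get(new[group_column])
            new[column] = sum(vals) / len(vals) if vals else overall
        filled.append(new)
    return filled

# Hypothetical KYC records with missing entries.
toy_rows = [
    {"residency": "Ontario", "occupation": "teacher", "annual_income": 60000.0},
    {"residency": "Ontario", "occupation": "teacher", "annual_income": None},
    {"residency": None, "occupation": "engineer", "annual_income": 90000.0},
    {"residency": "Alberta", "occupation": "teacher", "annual_income": 70000.0},
]
toy_filled = impute_group_mean(impute_mode(toy_rows, "residency"),
                               "annual_income", "occupation")
```

Both helpers return new records and leave the input untouched, which makes it easy to compare distributions before and after imputation.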
… (e.g., recording typos), transforming values into categories (e.g., grouping occupations into classifications), and removing irrelevant, anonymized (e.g., contact information), or repeated (e.g., postal code in place of residence region) data. Any variable containing over 10 percent (see footnote 6) missing values or errors (e.g., '*' or 'unknown') is removed to avoid excessive bias from imputation in our analysis.
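The removal rule for sparse variables can be sketched as a simple filter on the fraction of missing or erroneous entries. The set of markers treated as missing follows the examples in the text ('*' and 'unknown') plus empty values, and the column names and records are hypothetical.

```python
MISSING_MARKERS = {"", "*", "unknown"}

def is_missing(value):
    """Treat None and the listed string markers as missing or erroneous."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_MARKERS

def missing_fraction(rows, column):
    """Share of rows in which the given column is missing."""
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)

def drop_sparse_columns(rows, threshold=0.10):
    """Keep only columns whose missing fraction does not exceed threshold."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if missing_fraction(rows, c) <= threshold]
    return [{c: r.get(c) for c in keep} for r in rows], keep

# Ten hypothetical records: 'age' is missing once (10%), 'fax' twice (20%).
toy_records = [{"age": 40 + i, "fax": "555-01%02d" % i} for i in range(10)]
toy_records[0]["age"] = None
toy_records[3]["fax"] = "*"
toy_records[7]["fax"] = "Unknown"
cleaned_records, kept_columns = drop_sparse_columns(toy_records)
```

With the 10 percent threshold, the 'fax' column is dropped while 'age' is retained and left for imputation.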

Table 3: Distribution of residency for client accounts.
Location (see footnote 8)    Percentage
ON                           65.19
BC                           14.63
AB                           12.00
MB                            3.94
NS                            2.59
Other (CA)                    0.92
Unknown                       0.41
USA                           0.26
UK                            0.06

Table 4: The number of clients by number of accounts.

Table 5: The RFMP features engineered from the dataset.

Table 6: Mean values of the features of the optimal cluster centroids for each cluster.