Guided Semi-Supervised Non-negative Matrix Factorization on Legal Documents

Classification and topic modeling are popular techniques in machine learning that extract information from large-scale datasets. By incorporating a priori information such as labels or important features, methods have been developed to perform classification and topic modeling tasks; however, most methods that can perform both do not allow for guidance of the topics or features. In this paper, we propose a method, namely Guided Semi-Supervised Non-negative Matrix Factorization (GSSNMF), that performs both classification and topic modeling by incorporating supervision from both pre-assigned document class labels and user-designed seed words. We test the performance of this method through its application to legal documents provided by the California Innocence Project, a nonprofit that works to free wrongfully convicted people and reform the justice system. The results show that our proposed method improves both classification accuracy and topic coherence in comparison to existing methods such as Semi-Supervised Non-negative Matrix Factorization (SSNMF) and Guided Non-negative Matrix Factorization (Guided NMF).


Introduction
Understanding latent trends within large-scale, complex datasets is a key component of modern data science pipelines, leading to downstream tasks such as classification and clustering. In the setting of textual data contained within a collection of documents, non-negative matrix factorization (NMF) has proven itself an effective, unsupervised tool for this exact task [LS99, LS01, AGM12, KCP15, XLG03], with trends represented by topics. Whilst such fully unsupervised techniques bring great flexibility in application, it has been demonstrated that learned topics may not have the desired effectiveness in downstream tasks [CGW + 09]. In particular, certain related features may be strongly weighted in the data, yielding highly related topics that do not capture the variety of trends effectively [JDIU12]. To counteract this effect, some level of supervision may be introduced to steer learnt topics towards being more meaningful and representative, and thus improve the quality of downstream analyses. Semi-Supervised NMF (SSNMF) [LYC10, CRDH08, HKL + 20] utilizes class label information to simultaneously learn a dimensionality-reduction model and a model for a downstream task such as classification. Guided NMF [VHRN21] instead incorporates user-designed seed words to "guide" topics towards capturing a more diverse range of features, leveraging (potentially little) supervision information to drive the learning of more balanced, distinct topics. Despite the distinction between the supervision information and goals of Guided NMF and SSNMF, there are certainly mutual relationships: knowledge of class labels can be leveraged to improve the quality of learnt topics (in terms of scope, mutual exclusiveness, and self-coherence), whilst a priori seed words for each class can improve classification.
With these heuristics in mind, we introduce Guided Semi-Supervised NMF (GSSNMF), a model that incorporates both seed word information and class labels in order to simultaneously learn topics and perform classification. The goal of this work is to show that utilizing both forms of supervision information concurrently offers improvements to both the topic modeling and classification tasks, while producing highly interpretable results.
2 Related Work

Classical Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) is a powerful framework for performing unsupervised tasks such as topic modeling and clustering [LS99]. Given a target dimension $k < \min\{d, n\}$, the classical NMF method approximates a non-negative data matrix $X \in \mathbb{R}^{d \times n}_{\ge 0}$ by the product of two non-negative low-rank matrices: the dictionary matrix $W \in \mathbb{R}^{d \times k}_{\ge 0}$ and the coding matrix $H \in \mathbb{R}^{k \times n}_{\ge 0}$, so that $X \approx WH$. Both matrices $W$ and $H$ can be found by solving the optimization problem
$$\operatorname*{argmin}_{W \ge 0,\, H \ge 0} \; \frac{1}{2}\|X - WH\|_F^2,$$
where $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$ is the squared Frobenius norm of $A$ and the constant $\frac{1}{2}$ eases the calculation when taking the gradient. In the context of topic modeling, the dimension $k$ is the number of desired topics. Each column of $W$ encodes the strength of every dictionary word's association with a learned topic, and each column of $H$ encodes the relevance of every topic for a given document in the corpus. By enforcing non-negativity constraints, NMF methods can learn topics and document classes with high interpretability [LS99, XLG03].
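As a concrete illustration, the objective above can be minimized with the well-known Lee-Seung multiplicative updates. The sketch below is a minimal NumPy implementation; the update rules are standard, but the iteration count, initialization, and function name are illustrative choices rather than the paper's code.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Classical NMF via Lee-Seung multiplicative updates,
    minimizing (1/2) * ||X - W H||_F^2 subject to W, H >= 0."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Random non-negative initialization.
    W = rng.random((d, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Multiplicative updates keep entries non-negative by construction;
        # eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Each column of the returned `W` can then be read off as a topic by ranking its largest entries against the vocabulary.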

Semi-Supervised NMF
Besides topic modeling, one variant of classical NMF, Semi-Supervised NMF (SSNMF) [LYC10, CRDH08, HKL + 20], is designed to further perform classification. SSNMF introduces a masking matrix $L \in \{0, 1\}^{p \times n}$, where $p$ is the number of classes and $n$ is the number of documents; the $i$th column of $L$ is all ones if document $x_i$ belongs to the labeled training set and all zeros otherwise. Note that the masking matrix $L$ thus splits the label information into train and test sets. The label matrix $Z = [z_1, \dots, z_n] \in \{0, 1\}^{p \times n}$ collects binary encoding vectors: if document $x_i$ belongs to class $j$, the $j$th entry of $z_i$ is 1, and otherwise it is set to 0. The dictionary matrix $W$, coding matrix $H$, and label dictionary matrix $C$ can be found by solving the optimization problem
$$\operatorname*{argmin}_{W,\, H,\, C \ge 0} \; \frac{1}{2}\|X - WH\|_F^2 + \frac{\mu}{2}\|L \odot (Z - CH)\|_F^2,$$
where $\mu > 0$ is a regularization parameter and $A \odot B$ denotes entry-wise multiplication of matrices $A$ and $B$. Matrices $W$ and $H$ can be interpreted in the same way as in classical NMF. Matrix $C$ can be viewed as the dictionary matrix for the label matrix $Z$.
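For concreteness, the label matrix $Z$ and masking matrix $L$ described above can be constructed as follows. This is a minimal sketch with a hypothetical helper name (`build_label_and_mask`), not code from the paper; it assumes each document's labels are given as a list of class indices.

```python
import numpy as np

def build_label_and_mask(labels, n_classes, train_idx):
    """Build the binary label matrix Z (p x n) and masking matrix L:
    columns of L are all ones for training documents, all zeros otherwise."""
    n = len(labels)
    Z = np.zeros((n_classes, n))
    for i, doc_labels in enumerate(labels):
        for j in doc_labels:
            Z[j, i] = 1.0  # document i belongs to class j
    L = np.zeros((n_classes, n))
    L[:, list(train_idx)] = 1.0  # mask reveals labels only for the train set
    return Z, L
```

The product $L \odot Z$ then zeroes out the labels of test documents, so the label-fitting term of the objective only sees training supervision.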

Guided NMF
Due to the fully unsupervised nature of classical NMF, the generated topics may suffer from redundancy or lack of cohesion when the given data set is biased towards a set of featured words [CGW + 09, JDIU12, VHRN21].
Guided NMF [VHRN21] addresses this by guiding the topic outputs through flexible, user-specified seed word supervision. Each word in a given list of $s$ seed words can be represented as a sparse seed vector $v \in \mathbb{R}^d_{\ge 0}$, whose entries are zero except for positive weights at the entries corresponding to that seed word. The corresponding seed matrix $Y = [v_1, \dots, v_s] \in \mathbb{R}^{d \times s}_{\ge 0}$ collects these seed vectors as columns. For a given data matrix $X$ and seed matrix $Y$, Guided NMF seeks a dictionary matrix $W$, coding matrix $H$, and topic supervision matrix $B$ by considering the optimization problem
$$\operatorname*{argmin}_{W,\, H,\, B \ge 0} \; \frac{1}{2}\|X - WH\|_F^2 + \frac{\lambda}{2}\|Y - WB\|_F^2,$$
where $\lambda > 0$ is a regularization parameter. Matrices $W$ and $H$ can be interpreted in the same way as in classical NMF. Matrix $B$ helps identify the topics that form under the influence of the seed words.
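The seed matrix construction is straightforward. Below is an illustrative sketch with a hypothetical helper name (`build_seed_matrix`) that places a unit weight at each seed word's vocabulary index; the unit weight is one simple choice among the positive weights the text allows.

```python
import numpy as np

def build_seed_matrix(seed_words, vocab):
    """Seed matrix Y (d x s): column j is a sparse vector with a single
    positive entry at the vocabulary index of the j-th seed word."""
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    Y = np.zeros((len(vocab), len(seed_words)))
    for j, w in enumerate(seed_words):
        Y[word_to_idx[w], j] = 1.0
    return Y
```

In the experiments later in the paper, the class labels themselves (e.g. "murder", "robbery") serve as the seed words.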

Proposed Methods
Recall that SSNMF is able to classify documents using given label information, while Guided NMF can guide the content of generated topics via a priori seed words. We propose a more general model, Guided Semi-Supervised NMF (GSSNMF), that can leverage both label information and important seed words to improve performance in both multi-label classification and topic modeling. Heuristically, we expect that for classification, the user-specified seed words aid SSNMF in distinguishing between class labels and thus improve classification accuracy; for topic modeling, the known label information enables Guided NMF to better cluster similar documents, improving topic coherence and interpretability. In particular, GSSNMF optimizes
$$\operatorname*{argmin}_{W,\, H,\, B,\, C \ge 0} \; \frac{1}{2}\|X - WH\|_F^2 + \frac{\lambda}{2}\|Y - WB\|_F^2 + \frac{\mu}{2}\|L \odot (Z - CH)\|_F^2, \qquad (5)$$
where the matrices $W$, $H$, $Y$, $B$, $L$, $Z$, and $C$ and the parameters $\lambda$ and $\mu$ are interpreted in the same way as in SSNMF and Guided NMF. Note that if we set $\lambda$ or $\mu$ to 0, GSSNMF reduces to SSNMF or Guided NMF, respectively.
To solve (5), we propose a multiplicative update scheme akin to those in [Lin07, LYC10]. The derivation of these updates is provided in Appendix A, and the updating process is presented in Algorithm 1.
Algorithm 1: GSSNMF with multiplicative updates

We will demonstrate the strength of GSSNMF by considering different combinations of the parameters λ and µ, and comparing this method against SSNMF and Guided NMF through real-life applications in the following section.
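For illustration, multiplicative updates for objective (5) can be derived in the standard Lee-Seung fashion from the gradient of each term. The NumPy sketch below follows that pattern; it is an illustrative reading of the update scheme, not a verbatim reproduction of Algorithm 1 (whose exact rules are derived in Appendix A).

```python
import numpy as np

def gssnmf(X, Y, Z, L, k, lam, mu, n_iter=200, eps=1e-10, seed=0):
    """Sketch of multiplicative updates for the GSSNMF objective
    (1/2)||X - WH||_F^2 + (lam/2)||Y - WB||_F^2
      + (mu/2)||L * (Z - CH)||_F^2, with * the entry-wise product."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    s, p = Y.shape[1], Z.shape[0]
    W = rng.random((d, k)); H = rng.random((k, n))
    B = rng.random((k, s)); C = rng.random((p, k))
    for _ in range(n_iter):
        # W appears in the data-fit and seed-word terms.
        W *= (X @ H.T + lam * (Y @ B.T)) / (W @ (H @ H.T) + lam * W @ (B @ B.T) + eps)
        # H appears in the data-fit and masked-label terms.
        H *= (W.T @ X + mu * C.T @ (L * Z)) / (W.T @ W @ H + mu * C.T @ (L * (C @ H)) + eps)
        # B and C each appear in a single term.
        B *= (W.T @ Y) / (W.T @ W @ B + eps)
        C *= ((L * Z) @ H.T) / ((L * (C @ H)) @ H.T + eps)
    return W, H, B, C
```

As with classical NMF, the updates preserve non-negativity by construction, and setting `lam` or `mu` to zero recovers the SSNMF-style or Guided-NMF-style iterations.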

Experiments
In this section, we evaluate the performance of GSSNMF on the California Innocence Project dataset [BCH + 21]. Specifically, we compare GSSNMF with SSNMF on classification performance (measured by the Macro F1-score, i.e., the average of per-class F1-scores, which is sensitive to the distribution of classes [OB21]) and with Guided NMF on topic modeling performance (measured by the coherence score C [MWT + 11]).

Data and Pre-Processing
A nonprofit, clinical law school program hosted by the California Western School of Law, the California Innocence Project (CIP) focuses on freeing wrongfully convicted prisoners, reforming the criminal justice system, and training upcoming law students [BCH + 21]. Every year, the CIP receives over 2,000 requests for help, each containing a case file of legal documents. Within each case file, the Appellant's Opening Brief (AOB) is a legal document written by an appellant to argue for innocence by explaining the mistakes made by the court. This document contains crucial information about the crime types relevant to the case, as well as potential evidence within the case [BCH + 21]. For our final dataset, we include all AOBs from case files that have assigned crime labels, totaling 203 AOBs. Each AOB is thus associated with one or more of thirteen crime labels: assault, drug, gang, gun, kidnapping, murder, robbery, sexual, vandalism, manslaughter, theft, burglary, and stalking.
To pre-process the data, we remove numbers, symbols, and stopwords (according to the NLTK English stopwords list [BKL09]) from all AOBs; we also perform stemming to reduce inflected words to their word stems. Following the work of [R + 03, LZ + 07, BCH + 21], we apply term frequency-inverse document frequency (tf-idf) [SB88] to our dataset of AOBs and generate the corpus matrix X with parameters max_df = 0.8, min_df = 0.04, and max_features = 700 in the function TfidfVectorizer.
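As an illustration of this vectorization step, the sketch below applies scikit-learn's TfidfVectorizer with the stated parameters to a toy three-document corpus. The documents are invented stand-ins for the AOBs, and scikit-learn's built-in English stop list stands in for the NLTK list used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 203 pre-processed AOB documents.
docs = [
    "the defendant appealed the murder conviction",
    "robbery charges and gun possession were alleged",
    "the appellant argued the drug evidence was inadmissible",
]

# The paper's reported TfidfVectorizer settings.
vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.04, max_features=700,
                             stop_words="english")
# Transpose so rows index terms and columns index documents,
# matching the corpus matrix X (d x n) used throughout the paper.
X = vectorizer.fit_transform(docs).T
```

The resulting sparse matrix `X` is non-negative by construction, as NMF requires.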
For our topic modeling methods, we also need to identify the number of topics that potentially exist in our data, which corresponds to the rank of the corpus matrix X. To determine a proper range for the number of topics to generate, we analyze the singular values of X. The magnitude of each singular value is positively related to the increase in the proportion of variance explained by adding another topic, i.e., by increasing the rank used to factor the corpus matrix X [GHH + 20]. In this way, we use the singular values to approximate the rank of X. Figure 1 plots the magnitudes of the singular values of the corpus matrix X against the number of singular values, which is also the approximated rank. Examining this plot, we see that a range of potential ranks is 6 to 9, since the magnitudes of the singular values start to level off over this range.
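The rank-estimation step amounts to computing singular values and looking for the elbow where they level off. A minimal sketch (using a synthetic stand-in of the same shape as the real tf-idf matrix, which would not show the paper's elbow) might be:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the tf-idf corpus matrix (700 terms x 203 documents).
X = rng.random((700, 203))

# Singular values only (no singular vectors needed for the elbow plot).
singular_values = np.linalg.svd(X, compute_uv=False)
top20 = singular_values[:20]  # Figure 1 plots the first 20
```

In practice one would plot `top20` against its index and read off the rank range where the curve flattens, as done in Figure 1.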

GSSNMF for Classification
Taking advantage of cross-validation [Sto74], we randomly split the AOBs into a 70% training set with label information and a 30% testing set without labels. In practice, 70% of the columns of the masking matrix L are set to the all-ones vector 1_p for the training set, and the rest are set to the all-zeros vector 0_p for the testing set. As a result, the label matrix Z is masked by L into a corresponding training label matrix Z_train and testing label matrix Z_test. We then perform SSNMF and GSSNMF to reconstruct Z_test, setting the number of topics, or rank, equal to 8. Given the multi-label characteristics of the AOBs, we compare the performance of SSNMF and GSSNMF with a measure of classification accuracy, the Macro F1-score, which is designed to assess the quality of multi-label classification on Z_test [OB21]. The Macro F1-score is defined as
$$\text{Macro F1} = \frac{1}{p} \sum_{i=1}^{p} \text{F1}_i,$$
where p is the number of labels and F1_i is the F1-score for class i. Notice that the Macro F1-score treats each class with equal importance; thus, it penalizes models that perform well only on major labels but not on minor labels. To handle the multi-label characteristics of the AOB dataset, we first extract the number of labels assigned to each AOB in the testing set. Then, for each corresponding column i of the reconstructed Z_test, we set the largest j_i elements to 1 and the rest to 0, where j_i is the true number of labels assigned to the ith document in the testing set. We first tune the parameter µ in SSNMF to identify a range of µ for which SSNMF performs best on the AOB dataset under the Macro F1-score. Then, for each selected µ in this range, we run GSSNMF over a range of λ. While there are various choices of seed words, we naturally pick the class labels themselves as seed words for our implementation of GSSNMF. As a result, for each combination of µ, λ ∈ {0.0005, 0.0006, . . . , 0.0012}, we conduct 10 independent cross-validation trials and average the Macro F1-scores. The results are displayed in Figure 2. We can see that, in general, when incorporating the extra information from seed words, GSSNMF achieves a better Macro F1-score than SSNMF. As an example, we extract the reconstructed testing label matrices from SSNMF and GSSNMF along with the actual testing label matrix from a single trial. The matrices are visualized in Figure 3. As we can see from the actual testing label matrix, murder is a major label. Without the extra information from seed words, SSNMF tends to focus on this major label, leading to the trivial solution of classifying all cases as murder; however, through user-specified seed words, GSSNMF can better evaluate the assignment of other labels, achieving improved classification accuracy as measured by the Macro F1-score.
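The evaluation procedure described above can be sketched as follows. `macro_f1` and `threshold_by_count` are hypothetical helper names implementing, respectively, the unweighted per-class F1 average and the per-column binarization that keeps the j_i largest entries.

```python
import numpy as np

def macro_f1(Z_true, Z_pred):
    """Macro F1: unweighted mean of per-class F1 over the p classes."""
    p = Z_true.shape[0]
    scores = []
    for i in range(p):
        tp = np.sum((Z_true[i] == 1) & (Z_pred[i] == 1))
        fp = np.sum((Z_true[i] == 0) & (Z_pred[i] == 1))
        fn = np.sum((Z_true[i] == 1) & (Z_pred[i] == 0))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def threshold_by_count(Z_scores, counts):
    """Binarize each column i by keeping its counts[i] largest scores as 1s."""
    Z_bin = np.zeros_like(Z_scores)
    for i, c in enumerate(counts):
        top = np.argsort(Z_scores[:, i])[-c:]
        Z_bin[top, i] = 1.0
    return Z_bin
```

Applying `threshold_by_count` to the reconstructed Z_test (i.e., to C H restricted to the test columns) and scoring against the actual test labels with `macro_f1` reproduces the comparison summarized in Figure 2.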

GSSNMF for Topic Modeling
In this section, we test the performance of GSSNMF for topic modeling by comparing it with Guided NMF on the CIP AOB dataset. Specifically, we conduct experiments over the range of ranks identified in Section 4.1, running tests for various values of λ and µ at each rank. To measure the effectiveness of the topics discovered by Guided NMF and GSSNMF, we calculate the topic coherence score defined in [MWT + 11] for each topic. The coherence score $C_i$ for topic $i$ with the $N$ most probable keywords $w_1, \dots, w_N$ is given by
$$C_i = \sum_{j=2}^{N} \sum_{l=1}^{j-1} \log \frac{P(w_j, w_l) + 1}{P(w_l)}. \qquad (7)$$
In equation (7), $P(w)$ denotes the document frequency of keyword $w$, which is calculated by counting the number of documents in which keyword $w$ appears at least once. $P(w, w')$ denotes the co-document frequency of keywords $w$ and $w'$, which is obtained by counting the number of documents that contain both $w$ and $w'$.
In general, the topic coherence score seeks to measure how well the keywords that define a topic make sense as a whole to a human expert, providing a means for consistent interpretation of accuracy in topic generation.
A large positive coherence score C indicates that the keywords of a topic are highly cohesive, as judged by a human expert. Since we are judging the performance of methods that generate multiple topics, we calculate the coherence score $C_i$ for each topic generated by Guided NMF or GSSNMF and then take the average. Thus, our final measure of performance for each method is the averaged coherence score
$$C_{\text{avg}} = \frac{1}{k} \sum_{i=1}^{k} C_i,$$
where $k$ is the number of topics (or rank) we have specified.
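The coherence computation can be sketched directly from the definitions above. This is an illustrative implementation that assumes the add-one smoothing of [MWT + 11] and whitespace-tokenized documents; the helper names are hypothetical.

```python
import numpy as np

def coherence(topic_words, documents):
    """Coherence of one topic, following Mimno et al. (2011):
    C = sum_{j=2}^{N} sum_{l<j} log((P(w_j, w_l) + 1) / P(w_l)),
    with P(w) the document frequency and P(w, w') the co-document frequency."""
    docs = [set(d.split()) for d in documents]
    score = 0.0
    for j in range(1, len(topic_words)):
        for l in range(j):
            wj, wl = topic_words[j], topic_words[l]
            d_wl = sum(wl in d for d in docs)          # document frequency
            d_co = sum(wj in d and wl in d for d in docs)  # co-document frequency
            if d_wl > 0:
                score += np.log((d_co + 1) / d_wl)
    return score

def averaged_coherence(topics, documents):
    """C_avg: mean of the per-topic coherence scores."""
    return float(np.mean([coherence(t, documents) for t in topics]))
```

In the paper's experiments, `topic_words` would be the top 30 keywords of each learned topic, and `documents` the pre-processed AOB corpus.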
As suggested by Figure 1, a proper rank falls between 6 and 9. Starting with the generation of 6 topics from the AOB dataset, we first find a range of λ over which Guided NMF achieves the highest mean C_avg over 10 independent trials. In our computations, we use the top 30 keywords of each topic to compute each coherence score; each trial yields an individual C_avg, and we average the 10 values of C_avg from the 10 trials into the mean C_avg. Based on the proper range of λ, we then choose a range of µ for GSSNMF to incorporate the label information into topic generation. Again, for each pair (λ, µ), we run 10 independent trials of GSSNMF and calculate C_avg for each trial to obtain a mean C_avg. With these ranges in mind, we work towards the following goal: for a given "best" λ of Guided NMF, we improve topic generation performance by implementing GSSNMF with a "best" µ that balances how much weight GSSNMF should place on the new information of predetermined labels for each document. We then repeat the same process for ranks 7, 8, and 9, and plot the mean C_avg against each λ in Figure 4. The corresponding choices of µ can be found in Appendix B. We see that, in most cases, for a given λ we can find a µ such that GSSNMF generates a higher mean C_avg than Guided NMF across the various ranks. Ultimately, the best GSSNMF result also outperforms even the highest-performing Guided NMF result.
In Table 1, we provide an example of the topic modeling outputs of Guided NMF using λ = 0.4 and of GSSNMF using λ = 0.3 and µ = 0.006 for a rank of 7. Note that we output only the top 10 keywords under each identified topic group for ease of viewing, but our coherence scores are measured using the top 30 keywords. Thus, while the top 10 most probable keywords of the generated topics may look similar across the two methods, the coherence scores calculated from the top 30 most probable keywords reveal that GSSNMF produces more coherent topics as a whole in comparison to Guided NMF. Specifically, GSSNMF demonstrates an ability to produce topics with similar levels of coherence (as seen from the small variance in the individual coherence scores C of each topic), while Guided NMF produces topics that vary in level of coherence (as seen from the large variance in the individual coherence scores C for each topic). This further illustrates that GSSNMF is able to use the additional label information to execute topic modeling with better coherence.

Conclusion and Future Work
In this paper, we analyze the characteristics of SSNMF (Section 2.2) and Guided NMF (Section 2.3) concerning the tasks of classification and topic modeling. From these methods, we propose a novel NMF model, namely GSSNMF, which combines characteristics of SSNMF and Guided NMF. SSNMF utilizes label information for classification; Guided NMF leverages user-specified seed words to guide topic content. Thus, to carry out classification and topic modeling simultaneously, GSSNMF uses additional label information to improve the coherence of topic modeling results while incorporating seed words for more accurate classification. Taking advantage of multiplicative updates, we provide a solver for GSSNMF and then evaluate its performance on real-life data.
In general, GSSNMF is able to outperform SSNMF on the task of classification. The extra information from the seed words contributes to a more accurate classification result. Specifically, SSNMF tends to focus on the most prevalent class label and classify all documents into it. Unlike SSNMF, the additional information from choosing the class labels themselves as seed words helps GSSNMF treat each class label equally and avoid the trivial solution of classifying every document into the most prevalent class. Additionally, GSSNMF is able to generate more coherent topics than Guided NMF on the task of topic modeling. The extra information from the known label matrix helps GSSNMF better identify which documents belong to the same class. As a result, GSSNMF generates topics with higher and less variable coherence scores.
While there are other variants of SSNMF according to [HKL + 20], we developed GSSNMF based only on the standard Frobenius norm. In the future, we plan to make use of other comparable measures, such as the information divergence, and derive a corresponding multiplicative update solver. In addition, across all experiments, we selected the parameters λ and µ, which weight the seed word matrix and label matrix respectively, based on experimental results. In continued work, we plan to conduct error analysis to determine how each parameter affects the other and the overall approximation results. In particular, for a given parameter λ or µ, we hope to identify an underlying relationship that allows us to quickly pick a matching µ or λ, respectively, that maximizes GSSNMF performance.

A GSSNMF Algorithm: Multiplicative Updates Proof
We begin with a corpus matrix $X \in \mathbb{R}^{d \times n}_{\ge 0}$, a seed matrix $Y \in \mathbb{R}^{d \times s}_{\ge 0}$, a label matrix $Z \in \mathbb{R}^{p \times n}_{\ge 0}$, and a masking matrix $L \in \mathbb{R}^{p \times n}_{\ge 0}$. From these, we hope to find a dictionary matrix $W \in \mathbb{R}^{d \times k}_{\ge 0}$, a coding matrix $H \in \mathbb{R}^{k \times n}_{\ge 0}$, and supervision matrices $B \in \mathbb{R}^{k \times s}_{\ge 0}$ and $C \in \mathbb{R}^{p \times k}_{\ge 0}$ that minimize the loss function
$$\mathcal{L}(W, H, B, C) = \frac{1}{2}\|X - WH\|_F^2 + \frac{\lambda}{2}\|Y - WB\|_F^2 + \frac{\mu}{2}\|L \odot (Z - CH)\|_F^2.$$

Figure 1: First 20 singular values of the corpus matrix X.

Figure 2: Heatmap representation of the Macro F1-score averaged over 10 independent trials for a proper range of µ. (a) shows the highest SSNMF Macro F1-scores. (b) shows how GSSNMF improves on the SSNMF results in (a) using a proper λ. (c) shows the GSSNMF Macro F1-scores over a proper range of λ tested with each µ. Note that (b) consists of the row maxima of (c).

Figure 3: Actual and reconstructed crime testing label matrix using SSNMF and GSSNMF (µ = 0.0011, λ = 0.0007), where a light pixel indicates that the case is assigned to the corresponding crime label on the y-axis, while a dark pixel indicates no assignment.

Figure 4: Comparison of Guided NMF mean C avg score (over 10 independent trials) and highest GSSNMF mean C avg score (over 10 independent trials) for each λ tested.

Table 1: Topic modeling results of Guided NMF and GSSNMF for rank 7.

Table 2: Mean of averaged coherence scores over 10 independent trials (mean C_avg) for Guided NMF and GSSNMF, given λ and the best-performing µ for each λ, by rank.