Counterfeit drugs are a major challenge for the pharmaceutical industry worldwide. What does the term counterfeit drug mean? According to the WHO, “The drug which is manufactured fraudulently, mislabeled, with low quality, hiding the source detail or identity and does not follow the defined standard is considered to be fake or Counterfeits” [1]. A WHO survey stated that in underdeveloped countries every tenth drug used by consumers is counterfeit and of low quality [2]. This WHO figure may not be exact, and the true share may be higher, because there are no accurate statistics on counterfeit drugs. The use of such low-quality, non-standard medicine has a negative impact on health and may increase the death rate. Counterfeit drugs may contain some active or genuine ingredients, but in improper amounts that are too low or too high, and they may contain toxic impurities introduced during production that can cause serious health problems. Manufacturers of fake medicines sometimes use the logo of a reputable, popular pharma company to bring their drugs to market without hurdles. These drugs therefore hurt the sales of popular and expensive medicines such as antibiotics, cancer drugs, painkillers and cardiac medicines, and they also cause side effects and more serious health problems. Worldwide, almost 10–15% of drugs are fake, but in developing countries the share is higher, at around 30%. Malaria causes around 0.7 million deaths annually, and almost 0.2 million of them are due to counterfeit drugs [3]. The increasing use of technology also provides more ways to supply fake drugs to the market, so their distribution grows day by day. The FBI and the International Anti-Counterfeiting Coalition (IACC) reported that counterfeiting is one of the largest criminal businesses of the 21st century and is growing rapidly as new fake pharma manufacturers enter the market [4
Everyone has a basic right to good health care. New drugs are introduced to the market under new names and labels every day because of the increasing number of new diseases. These medications give patients quick relief from unbearable pain. However, they also carry many risks when the manufacturers are unregistered or unknown and do not follow the specified standards. Many deaths from the use of fake drugs are reported in developing countries, and according to the WHO most of the victims are children [3]. According to the statistics, US pharma companies report a yearly business loss of around $200 billion due to counterfeit drugs [1]. These drugs may not help patients recover from disease and can have many dangerous side effects, which makes them a serious threat to human health. Fake drugs are delivered through complex networks, which is why they are difficult to detect. The US allocates a budget of around $3.2 trillion to the health care sector, and a quarter of this budget is spent solely on managing a secure supply chain process [7]. Expert suggestions for preventing the counterfeit drug problem include a secure and transparent drug delivery and supply chain process, enhanced control and management of the drug market at the pharmacy, distributor and hospital level, and the use of the latest technologies to continuously track drugs at every level of the supply chain [9]. It follows that secure drug supply chain management is very important for preventing this issue. What is needed is a system that can track and trace drug delivery at every phase, starting from the supplier’s raw material and continuing through manufacturing, distribution, pharmacies, clinics and consumers. Blockchain is a recent innovation in computing that can handle the supply chain and track products. It can secure the supply chain process and keep track of deliveries very efficiently [11
Blockchain is a revolutionary technology based on the consensus protocol introduced by Satoshi Nakamoto in 2008; it was originally designed to store the transaction log of the well-known cryptocurrency Bitcoin [13]. A blockchain provides a digital ledger that stores data records and executed transaction logs as an ordered sequence of blocks. Put more directly, it is a secure, distributed database. Each block stores the digital information related to its transactions: the time, date, price and participants involved. The stored information is distributed across the blockchain network, where many independent nodes validate transactions without knowing or trusting each other. Each block in the network contains two hash codes, the prior hash and the current hash. The prior hash code identifies the preceding block and the current hash code identifies the block itself. Consequently, if the information in one block is changed, its hash changes and the hashes of all subsequent blocks must be changed accordingly. All the blocks in the network are strongly linked together and protected by transaction codes and cryptographic codes. Another important feature is the strong mathematical algorithms that allow miner nodes to validate blocks without affecting the miner nodes’ own data; after validation, the blocks can be added to the blockchain network. This is why a blockchain system ensures security and transparency [14]. The blockchain network is a growing chain of blocks that stores information according to defined rules. Many miner nodes in the network add new blocks to the chain; they work independently but operate under a single protocol. The blockchain network is a distributed system that contains all the information about transactions and participants and keeps the full record history. By functionality, blockchain networks are divided into three types: public, consortium and private. In permissionless or public blockchain networks, there is no admin node to check and control transactions; instead, all miner nodes or participants can check and validate them [15]. The miner nodes also participate in the consensus process, through which the nodes agree on validity [16]. Examples include the Ethereum and Bitcoin networks. In a consortium blockchain network, the data and transactions are handled by one central entity known as the admin node. The admin controls the data through public and private access: some of the data can be public, while other data is shown only to specific private participants based on business terms. Networks of this kind are not fully decentralized and support both public and private data; an example is the Hyperledger Fabric platform. In a private blockchain network, all the data and transactions stored in the network are strictly private and are shared only with authorized members of the network. Only the admin node can add authorized members to the network, which in this respect is quite similar to the consortium network. Examples include Hyperledger and MultiChain networks [18
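The prior and current hash codes described above can be illustrated with a minimal sketch. It assumes SHA-256 and a simple dictionary block layout; the field names, the helper functions `block_hash`, `make_block` and `is_chain_valid`, and the sample transaction are hypothetical and are not taken from Bitcoin, Ethereum or Hyperledger. The sketch only shows hash linking, not consensus or mining.

```python
import hashlib
import json


def block_hash(index, timestamp, transactions, prior_hash):
    """Return the SHA-256 hash of a block's contents (hypothetical layout)."""
    payload = json.dumps(
        {"index": index, "timestamp": timestamp,
         "transactions": transactions, "prior_hash": prior_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(index, timestamp, transactions, prior_hash):
    """Build a block that stores both the prior hash and its own current hash."""
    return {
        "index": index,
        "timestamp": timestamp,
        "transactions": transactions,
        "prior_hash": prior_hash,
        "current_hash": block_hash(index, timestamp, transactions, prior_hash),
    }


def is_chain_valid(chain):
    """Check each block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        expected = block_hash(block["index"], block["timestamp"],
                              block["transactions"], block["prior_hash"])
        if block["current_hash"] != expected:
            return False  # block contents were altered after hashing
        if i > 0 and block["prior_hash"] != chain[i - 1]["current_hash"]:
            return False  # link to the preceding block is broken
    return True


# Illustrative two-block chain; the transaction fields are made up.
genesis = make_block(0, "2020-01-01T00:00:00", [], "0" * 64)
second = make_block(
    1, "2020-01-01T00:05:00",
    [{"drug": "D-001", "from": "manufacturer", "to": "distributor"}],
    genesis["current_hash"],
)
chain = [genesis, second]
```

Calling `is_chain_valid(chain)` on the untouched chain returns `True`. If the contents of any block are edited without recomputing its hash and the hashes of all later blocks, the check fails, which is the tamper-evidence property the text relies on.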
Blockchain technology is well suited to handling and securing the pharmaceutical drug supply chain. Some pharma companies already use blockchain in their supply chains, and others are planning to shift toward it because of its diverse features [11]. Most industries want to adopt blockchain technology because it provides a distributed, decentralized electronic ledger in which all peer nodes in the network can see and validate transaction information. In addition, its consensus algorithm allows the network to store only validated information in the repository and eliminates the problem of duplicate transactions. Furthermore, the probability of network failure is very low, because the network keeps functioning as long as more nodes than the failure threshold are running, and fault tolerance is very strong.
This manuscript introduces a blockchain-based system built on Hyperledger Fabric for secure drug supply chain management. It continuously monitors and tracks the drug delivery process to address counterfeiting. The goal is not only to secure drug delivery, but also to recommend the most suitable drugs to users. To this end, we introduce a machine learning-based recommendation system trained on a previously collected dataset of drug user reviews; the dataset was collected through web crawling and contains comments and ratings organized by the consumers’ disease conditions. The machine learning models N-gram and LightGBM are trained on that dataset to recommend the best and most effective drugs to the consumers of the system. In brief, the proposed system works as follows. First, a novel blockchain-based drug supply chain system is developed in which the participants of the smart pharmaceutical network (suppliers, manufacturers, distributors, pharmacies, hospitals and clinics) store and share information through the blockchain network efficiently and securely. The system provides a user-friendly web interface for handling and tracking drugs. We use the Hyperledger Composer REST API to connect the web applications to the blockchain network. Transaction requests are validated and authenticated by rules defined in the access control policy and history. CouchDB is used to store the large number of transaction records; it helps eliminate data redundancy and provides separate storage for each node in the blockchain network. The final product enables consumers, e.g., patients, to trace the origin of a medicine and verify its authenticity. A further advantage of our system is that it recommends the best and top-rated drugs to customers with the help of machine learning algorithms.
The machine learning module is integrated with the blockchain network, so participants can use both features of the proposed system together. The models also learn from the comments and ratings provided through the client web application and update the recommendation results accordingly.
The rest of the paper is organized as follows. Section 2 reviews the background literature on previously developed systems. Section 3 gives an overview of the blockchain-based drug supply chain management system and explains the detailed design and architecture of the proposed system; the smart contract and transaction procedure of the blockchain network are also discussed. The machine learning-based drug recommendation module is explained in Section 4. Section 5 describes the implementation of the proposed system and its operation. The experimental results of both modules, the blockchain-based drug supply chain management system and the machine learning-based drug prediction and recommendation, are discussed in Section 6. Finally, the paper is concluded.
2. Related Work
Blockchain has emerged as a revolutionary technology that stores and transmits data in a secure, fault-tolerant and transparent manner. This is possible due to distributed ledger-based technology. The blockchain has the full potential to make any organization secure, efficient, transparent and decentralized. Since the blockchain entered the public eye through Bitcoin [13
], researchers have never stopped striving to extend blockchain applications to non-financial fields. Among these fields, the healthcare industry is one in which blockchain has shown a large impact. Research on blockchain-assisted applications is still quite new and is developing rapidly, so researchers in the healthcare industry have been working hard to keep up with the research front. We summarize the blockchain applications currently used in healthcare to give a glimpse into this still unclear application area, which represents a future development direction for the healthcare industry [17
Nowadays blockchain applications are not restricted to cryptocurrency but are also used in many other areas such as agriculture [21], healthcare [22], finance [23], education [24], transportation [25], supply chain [26], etc. In 2018, Ref. [27] developed a blockchain-based agricultural supply chain management and traceability system named AgriBlockIoT. The system manages the supply chain of food products and traces where they come from. AgriBlockIoT was developed on two separate blockchain platforms, Hyperledger Fabric and Ethereum, and the two networks were compared in terms of latency and transactions per second. The system ensures that the data generated by different sensors are stored in a secure, immutable and transparent manner. In the same year, Ref. [28] combined radio frequency identification (RFID) sensor technology with blockchain technology to build a blockchain-based traceability system for the wood supply chain. The system integrates traceability information and wood quality information into an online system and uses the decentralization and distributed storage features of blockchain technology to securely store data and transaction records. In 2019, the authors of [29] proposed a food safety traceability system based on blockchain and the electronic product code (EPC) information service. The system uses the traceability, time stamping and tamper resistance of blockchain technology to accurately record, share and track valid data throughout the food supply chain; it prevents data tampering, loss of trust and information leakage during information exchange, and so supports the effective detection and prevention of food safety issues. In contrast to the above work on tracing agricultural products, wood and food, medicine requires an anti-counterfeiting traceability system. In 2016, Ref. [30] presented a traditional Chinese medicine (TCM) quality traceability system that can track external information, such as manufacturing and distribution processes, as well as changes in the inherent quality of the medicine. In 2018, Ref. [31] proposed a data management system for data collection, storage, analysis and visualization that ensures interoperability among the systems of different supply chain participants.
Existing medicine traceability systems have their advantages, but they cannot simultaneously meet three requirements: decentralized data storage, comprehensive and tamper-proof information, and information privacy. Blockchain technology has unique advantages for traceability systems. Its decentralization and distributed storage ensure data reliability, information transparency and a completely reliable data flow; its traceability and tamper resistance can effectively address counterfeit and shoddy products in the supply chain; and its tamper-proof, time-stamped data can be used to provide evidence, establish accountability and resolve disputes between parties. Considering the technical advantages of blockchain and the shortcomings of current medical anti-counterfeiting traceability systems, the study in [26] proposes a blockchain-assisted medicine traceability system to prevent counterfeit drugs in the market.
Electronic medical record (EMR) systems have been developed on blockchain networks to secure healthcare data. Blockchain offers decentralization, tamper resistance, data traceability, reliability, smart contracts, security and privacy, which match the requirements of electronic medical records [32]. Guardtime [34], a Dutch data security company, developed a blockchain-assisted system for verifying patients’ identities. Every individual in the country receives a card that uses the blockchain to link their EMR data. A hash code is assigned to each update whenever a transaction is triggered in the blockchain network. Thanks to the blockchain, the audit records in the EMR are tamper-proof. Tamper-resistant logs can be used to record the state of information in existing healthcare databases. Any update to the database (such as a registered record) is assigned a timestamp and password when the transaction is written to a block. The blockchain system thereby ensures the integrity and safety of medical data [35
The second EMR-related application is MedRec [36], a project co-sponsored by the MIT Media Lab and Beth Israel Deaconess Medical Center. MedRec provides health care providers with a secure and transparent way to share and transmit data among each other. The application is designed to give patients agency over their healthcare data and the right to know who has accessed it. Patients can grant these permissions on the blockchain to share their data for future research. Although all data locations and transaction logs are maintained in the blockchain network, supporting software is still necessary to achieve true interaction. Another example is the Gem Health Network, developed by Gem [37], an American startup, on the Ethereum blockchain platform. It allows different medical practitioners to use the same data stored in the network. There are many similar projects: Healthbank works to enable patients to fully control their data through a blockchain platform; Medicalchain has built a blockchain-based platform that promotes the sharing of patient medical records; and Healthcoin [38] plans to build a global EMR system. Other participants in blockchain-centered plans and projects include Factom, HealthCombix, Patientory, SimplyVital, IBM Watson, BurstIQ, Bowhead, QBRICS, Nuco and more. Such solutions are necessary, but at the implementation level they face several problems: interoperability between different blockchain-based EMR solutions, security, scalability and data privacy are the major issues in EMR systems. The pharmaceutical industry also faces a problem that can have direct consequences for patients: the circulation of fake or inferior drugs. Blockchain technology has been reported to be capable of addressing this problem [39]. Remote patient monitoring [42] has become possible with the emergence of Internet of Things (IoT) sensors and smartphones. These devices can collect a large amount of biomedical data, so medical professionals can easily check patients’ condition and give instructions on medication, and blockchain provides sufficient features to support this [43
The authors of [44] proposed a blockchain-based framework for securing the delivery of biopharmaceutical products. The system uses stochastic simulation to guide the blockchain network and aims to protect biopharmaceutical products from counterfeiting, fraud, theft and temperature diversion. It also improves the efficiency, transparency, security and reliability of the supply chain process. The smart contracts for the system are defined based on the proof of authority (PoA) mechanism. Finally, the system provides real-time monitoring of distributed data streams within the blockchain network, and the reported results indicate satisfactory performance. The authors of [45] proposed an information management and tracking system for drug delivery in the pharma industry. They used technologies such as RFID and the electronic product code information service (EPCIS) to secure the drug supply chain. The research environment was developed using two theoretical models, collective action theory and transaction cost theory, and rules were defined for all stakeholders participating in the drug delivery process. Drug counterfeiting is a major problem for all pharmaceutical manufacturers and health organizations, and although international health organizations want to overcome it, sales of counterfeit drugs increase day by day. The author of [46] proposes a blockchain-based system for drug traceability and supply chain control, in which each participant can track and check the drug delivery at each stage. However, the proposed system is based on a case study only.
The authors of [47] proposed an Ethereum-based blockchain network for tracking product shipments. It manages and tracks the supply chain of different items shipped in smart containers, using smart contracts between the product sender and receiver without the involvement of a third party. The paper [48] presents a survey of existing approaches related to the medicine supply chain. It also explains the ways in which counterfeit drugs enter the system and how they harm the original manufacturers financially; one important cause of drug counterfeiting is an improper supply chain system. The authors then develop a blockchain-based system for secure end-to-end (E2E) drug delivery, but give no details about real-time results or implementation. The work in [49] addresses how to eliminate drug counterfeiting and provide consumers with original medicines. It argues that blockchain is a solution because it provides traceability, security and transparency. The authors designed a blockchain-based system that also uses QR codes to secure medicines, so that original medicines manufactured by the authentic brand are delivered to consumers safely. Another work [50] concerns the applicability of blockchain to the supply chain. A supply chain moves goods from the manufacturer to consumers, and product quality should be maintained along the way; however, fraud and corruption allow fake products to enter the supply chain. The system addresses the fake product issue and provides a secure and transparent supply chain using blockchain technology.
The authors of [51] proposed and developed software that connects an Ethereum-based blockchain network with the information systems of different enterprises, allowing partners to share information securely. The blockchain enables the system to provide partner companies with different levels of data access, data integrity, data visibility and authenticity support when sharing information. A simulation environment was designed for a supply chain scenario, and statistical tests were performed to check the validity of the system. The simulation results indicate that the blockchain system is an effective instrument for securing the supply chain process and establishing trust between companies. It can also increase system performance and reduce fraud and misconduct in companies.
A blockchain-based system for drug traceability named Drugledger is proposed in [52]. The system ensures data authenticity, security, privacy and traceability. Its architecture is service-oriented, which the authors consider better than the traditional peer-to-peer (P2P) architecture. Drugledger provides stable and efficient storage for the drug supply chain process. Overall, the paper provides a systematic and practical account of how a blockchain system can track and manage the drug supply chain. Another blockchain-enabled drug supply chain management system for overcoming the counterfeit drug problem, CounterChain, is presented in [53]. It extends the visibility of drugs at each level of the delivery process and provides consumers with authentic, original products. It can also increase trust between vendors, manufacturers, retailers and distributors, who are the major stakeholders of CounterChain. As a result, the system can prevent counterfeit drugs from entering the chain.
Hence, blockchain technology is used not only in healthcare, finance, agriculture and cryptocurrency, but in many other fields as well. Researchers are developing blockchain applications in areas such as smart grids, vehicle-to-grid, connected vehicles, edge computing, cloud computing and cyber–physical systems [54
4. Machine Learning-Based Drug Prediction and Recommendation Module
The growing number of software applications in the health care industry continuously generates a huge amount of patient health record data. Many healthcare applications have been developed on the basis of these datasets to monitor and control patients’ health conditions, and healthcare analysts and data scientists are interested in using the datasets to develop automated systems for the industry. Selecting and recommending medicine automatically is a difficult task. It requires understanding the effects of medicines in relation to patients’ conditions and symptoms; only then can one select and recommend the drugs that are better and more popular. Moreover, because each available medicine has many positive and negative reviews, researchers are still struggling to develop a system that recommends the single best medicine on the basis of these reviews.
This section explains the second part of our system, an automated drug recommendation system developed using natural language processing and machine learning techniques. We use the publicly available UCI drug review dataset [56] to train the models and to predict the most effective and top-rated drugs for the customers of the pharmaceutical industry. The dataset consists of public reviews and ratings of medicines, positive or negative, posted on pharmaceutical websites; it is described in detail later. The steps of our recommendation system, namely data analysis and preprocessing, machine learning models and recommendation, are shown in Figure 5. The dataset contains users’ reviews and ratings of medicines organized by the patients’ medical conditions, e.g., acne, pain, birth control, anesthesia, etc. Our data analysis and preprocessing module consists of two parts. In the data analysis phase, we examine the data using visualization plots. In the preprocessing phase, we apply data processing techniques to remove all missing values from the dataset and clean it for further operations. The next module is the machine learning and natural language processing module, in which two models, N-gram and LightGBM, are used together with sentiment analysis techniques. These models are trained on the UCI drug review dataset, and an open-source emotion sentiment library is also used to improve training. After training, our models recommend the best and top-rated drugs for specific patient conditions, e.g., acne, anxiety, pain, etc. Moreover, our recommendation system can retrain itself on new reviews and comments from the client application of our system and update the recommendation results accordingly. When customers query our DSCMR system for top-rated and relevant drugs for a condition, the recommendation module is activated and recommends the best medicine among the candidates. In this way, our system recommends the most effective drugs to the customers of the pharmaceutical industry, i.e., pharmacists, doctors, hospitals and patients.
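The final recommendation step, returning the top-rated drugs for a queried condition, can be sketched as follows. This is a deliberately simplified stand-in that ranks drugs by their mean star rating; it is not the N-gram and LightGBM pipeline described in this paper. The function name `top_rated_drugs`, the parameter `k` and the sample records are hypothetical.

```python
from collections import defaultdict


def top_rated_drugs(reviews, condition, k=3):
    """Return up to k (drug, mean rating) pairs for a condition, best first."""
    ratings = defaultdict(list)
    for r in reviews:
        if r["condition"] == condition:
            ratings[r["drugName"]].append(r["rating"])
    means = {drug: sum(vals) / len(vals) for drug, vals in ratings.items()}
    # Sort by mean rating (descending), then by name for a stable order.
    return sorted(means.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical records; drug names and ratings are made up for illustration.
reviews = [
    {"drugName": "DrugA", "condition": "acne", "rating": 9},
    {"drugName": "DrugA", "condition": "acne", "rating": 7},
    {"drugName": "DrugB", "condition": "acne", "rating": 10},
    {"drugName": "DrugC", "condition": "pain", "rating": 4},
]

acne_top = top_rated_drugs(reviews, "acne")
```

A mean over raw ratings ignores review text and the number of reviews per drug, which is one reason a learned model with sentiment features, as used in this work, may be preferable in practice.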
In the drug recommendation module of our proposed system, we use a publicly available drug dataset from the UCI Machine Learning Repository [56], which contains reviews and ratings provided by patients about their experiences with medicines. The dataset comes from two well-known pharmaceutical websites, Drugs.com and Druglib.com; we used the Drugs.com dataset in our experiment. Drugs.com is regarded as one of the largest and most visited websites providing pharmaceutical information to customers and healthcare experts. It provides reviews of specific drugs by condition, e.g., pain, acne, anxiety, blood pressure, etc., and users can also give each drug a star rating from 1 to 10 according to their level of satisfaction. The review dataset is not large, but it is well structured and easy to use for training deep learning models. The available reviews are categorized into three groups: side effects, benefits and overall comments. The user ratings and reviews were gathered from these sources using an automatic web crawler and the Python Beautiful Soup library. After crawling, the dataset has a total of 215,063 drug reviews, and the total number of targeted drugs is 541 [57]. The total number of target websites used to collect this dataset is 63,376. We used this dataset for training and testing our machine learning models. Once trained, the models are connected to the blockchain framework through the REST API and recommend medicines according to condition; afterwards, the models use the customers’ reviews and ratings from the client application of our proposed framework, learn over time, and recommend the top-rated drugs to customers. Table 1 shows some records of the dataset with its seven attributes (unique ID, drug name, condition, review, rating, date and useful count). The drug name attribute gives the name of the drug; the review attribute contains the text of the review; the condition attribute gives the condition the drug was taken for, for example, blood pressure, acne or pain; the rating attribute is a star rating up to 10; the date attribute is the date of the review entry; and the useful count attribute is the number of people who found the review useful.
4.2. Data Analysis and Preprocessing
To better understand the dataset we have used some methods of data analysis and visualization in which important features of the dataset are well explained using graphs. As can be seen in Table 1
, our dataset has a total of seven attributes, including uniqueID, drugName, condition, review, rating, date and usefulcount. As the drugName and condition are plotted to check how many unique values there are in the dataset of these two attributes. Figure 6
illustrates the number of drugs in the dataset for each condition, that is, how heavily each condition is represented. Medicines for birth control, pain, depression, acne and high blood pressure are the most common. Drugs with no listed condition also account for many records; because these records carry no condition value, they reduce the training accuracy of the model. We eliminate them in the data preprocessing phase.
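The per-condition count and the removal of records without a condition can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample records and the helper names `drop_unlisted` and `unique_drugs_per_condition` are hypothetical, and an empty or missing condition is assumed to stand for a "not listed" entry.

```python
from collections import defaultdict

# Made-up (drug name, condition) pairs; "" or None stands for "not listed".
records = [
    ("DrugA", "Birth Control"),
    ("DrugB", "Birth Control"),
    ("DrugA", "Birth Control"),  # second review of the same drug
    ("DrugC", "Pain"),
    ("DrugD", ""),
    ("DrugE", "Depression"),
    ("DrugF", None),
]


def drop_unlisted(rows):
    """Keep only rows whose condition is present and non-empty."""
    return [(drug, cond) for drug, cond in rows if cond and cond.strip()]


def unique_drugs_per_condition(rows):
    """Map each condition to the number of distinct drugs reviewed for it."""
    drugs = defaultdict(set)
    for drug, cond in rows:
        drugs[cond].add(drug)
    return {cond: len(names) for cond, names in drugs.items()}


clean = drop_unlisted(records)
counts = unique_drugs_per_condition(clean)

# Print conditions from most to least represented.
for cond, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{cond}: {n}")
```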
To assess the effect of positive and negative reviews, we tried different emotion extraction techniques; compared with other N-gram sizes, the 4-gram classifier was the better option for classifying the emotion of a comment and gave better results in deciding whether a review is positive or negative. Next, the ratings in the dataset are analyzed by counting the weight of each rating option, that is, how many times each value from 1 to 10 occurs. The rating weight according to the mean is shown in Figure 7
, where we can see that most customers give one of four rating values: 10, 9, 1 or 8. Furthermore, a rating of 10 is given almost twice as often as any other value; almost 69,000 drug reviews carry this top rating. Hence, positive comments outnumber negative ones. After analyzing the dataset, we applied the preprocessing operations of data cleaning, removal of missing values and sentiment analysis to eliminate typos and errors. After validation, all missing values, typos and errors have been removed, and the dataset is completely clean and ready for use by the machine learning module.
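The rating analysis described above amounts to a frequency count over the rating values. A minimal sketch, with made-up ratings and a hypothetical helper `rating_distribution`, could look like this:

```python
from collections import Counter

# Made-up ratings for illustration; the real dataset has 215,063 reviews.
ratings = [10, 10, 9, 1, 8, 10, 5, 9, 1, 10]


def rating_distribution(values):
    """Count how many times each rating from 1 to 10 occurs."""
    counts = Counter(values)
    return {rating: counts.get(rating, 0) for rating in range(1, 11)}


dist = rating_distribution(ratings)
most_common = max(dist, key=dist.get)  # rating value with the highest count

for rating, count in dist.items():
    print(rating, count)
```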
4.3. Deep Learning Models
In this section, we explain the models used for drug recommendation in our proposed system: the N-gram natural language processing (NLP) model for predicting the probability of a word in a sequence [58
], LightGBM for reducing low-gradient features, and an added sentiment analysis dictionary for emotional analysis [59].
Models that derive the probabilities of words from a word sequence are called language models. The N-gram model is a type of language model that calculates the probability distribution of a word given the words that precede it. It is a probabilistic model trained on a collection of text called a corpus, and it estimates probabilities by counting how many times word sequences occur in that corpus. Given a sequence of N-1 words, an N-gram model predicts the words that have a high probability of occurring next. An N-gram is a sequence of N words: a bigram is a sequence of two words, such as “don’t disturb”, “my car” and “your notebook”; similarly, a trigram is a sequence of three words, such as “please don’t disturb” and “close the door”.
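To make the definition concrete, the sketch below extracts the N-grams of a sentence. The function name `ngrams` and the whitespace tokenization are simplifying assumptions for illustration, not the tokenizer used in our framework.

```python
def ngrams(text, n):
    """Return all sequences of n consecutive words in the text, in order."""
    tokens = text.lower().split()
    # A text with fewer than n words yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("please don't disturb", 2))
# [('please', "don't"), ("don't", 'disturb')]
print(ngrams("close the door", 3))
# [('close', 'the', 'door')]
```

The same function with `n=4` produces the 4-gram features referred to in the data analysis step.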
How does the N-gram model predict the probability of a word in a corpus? Consider the two sequences “heavy flood” and “heavy rain”. If “heavy rain” occurs more often than “heavy flood” in the training text, the N-gram model assigns a higher probability to rain than to flood as the word following heavy. The second word rain therefore has the higher probability and is selected by the model [60
]. N-gram models are useful in many natural language processing applications, such as speech recognition, word similarity comparison, part-of-speech tagging, predictive word input, natural language generation, grammar applications, machine translation and sentiment extraction. The training accuracy and loss of the N-gram model in our DSCMR framework are presented in Figure 8.
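The “heavy rain” versus “heavy flood” example can be reproduced with a count-based bigram model. The sketch below is a minimal maximum-likelihood estimate on a made-up three-sentence corpus; the function names and the corpus are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

# Made-up training corpus for illustration only.
corpus = [
    "heavy rain fell all night",
    "heavy rain caused a heavy flood",
    "the heavy rain stopped",
]


def train_bigram(sentences):
    """Count contexts (words that have a successor) and adjacent word pairs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def bigram_prob(word, prev, context_counts, bigram_counts):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def predict_next(prev, context_counts, bigram_counts):
    """Return the most probable word following prev, or None if prev is unseen."""
    candidates = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    return max(candidates, key=candidates.get) if candidates else None


contexts, bigrams = train_bigram(corpus)
p_rain = bigram_prob("rain", "heavy", contexts, bigrams)
p_flood = bigram_prob("flood", "heavy", contexts, bigrams)
print(p_rain, p_flood)  # 0.75 0.25
print(predict_next("heavy", contexts, bigrams))  # rain
```

In this toy corpus “heavy” is followed by “rain” three times and by “flood” once, so the model selects “rain”.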
4.3.2. Light Gradient Boosting Machine (LightGBM)
The gradient boosting decision tree (GBDT) is a very popular machine learning algorithm with several effective implementations, including extreme gradient boosting (XGBoost) and parallel gradient boosting regression tree (pGBRT). These implementations apply many engineering optimizations, but when the feature dimension is high and the data is large, their efficiency and scalability are not satisfactory. A major reason is that, for every feature, they scan all the data instances to estimate the information gain, which is time-consuming. To address this issue, Microsoft proposed a solution built on two new techniques: gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB).
In GOSS, data instances with small gradients are largely excluded, because instances with large gradients contribute more to the estimation of information gain. A large portion of the data is thus removed before training, while the low-gradient instances that are dropped have little effect on the accuracy of the estimate. In our case, only data instances with high gradients are used for information gain. GOSS can therefore provide an accurate estimate with a much smaller data size. EFB, on the other hand, is used for feature reduction by bundling mutually exclusive features together; a greedy algorithm is used to bundle the features effectively without hurting the accuracy of the information gain. Microsoft named the GBDT implementation based on GOSS and EFB LightGBM. LightGBM is faster than other GBDT implementations while remaining accurate; its authors claim that it is up to 20 times faster than other GBDT models [59
]. Some important features extracted by the LightGBM model are shown in Figure 9.
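The sampling step of GOSS can be sketched in a few lines. The version below follows the published description of GOSS, in which the instances with the largest gradients are all kept, and a small random subset of the remaining low-gradient instances is retained and up-weighted so that the information gain estimate stays approximately unbiased. The function name `goss_sample`, the parameter names `top_rate` and `other_rate`, and the example gradients are assumptions for illustration; this is not the LightGBM implementation itself.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]   # all high-gradient instances are kept
    rest = order[n_top:]  # low-gradient instances are only sampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled low-gradient instances to compensate for
    # the ones that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Made-up gradients: instance i has gradient i / 100.
grads = [i / 100 for i in range(100)]
indices, weights = goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=0)
print(len(indices), "of", len(grads), "instances kept")
```

With these rates only 30 of the 100 instances are used to estimate the information gain, which is where the speed-up comes from.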
4.4. Sentimental Analysis
For the third part, sentiment analysis, we used a third-party open-source library provided by Harvard University to add emotional analysis to our proposed recommendation system. This library contains word definitions that mark each word as either positive or negative. In our solution, we calculate the ratio of positive and negative words in a review; if the estimated ratio is above 0.5, the review is counted as positive, otherwise as negative.
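A minimal sketch of this dictionary-based rule is given below. The two word lists are tiny hypothetical stand-ins for the third-party dictionary, and the ratio is assumed here to be the share of positive words among all matched sentiment words; a review with no matched words is treated as negative, in line with the "otherwise negative" rule.

```python
import re

# Tiny hypothetical lexicons; the real dictionary is far larger.
POSITIVE = {"good", "great", "effective", "relief", "helped"}
NEGATIVE = {"bad", "nausea", "pain", "worse", "headache"}


def sentiment(review, threshold=0.5):
    """Label a review by its share of positive words among sentiment words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    if total == 0:
        return "negative", 0.0  # no sentiment words matched
    ratio = pos / total
    label = "positive" if ratio > threshold else "negative"
    return label, ratio


print(sentiment("Great relief, it helped but caused nausea"))
# ('positive', 0.75)
print(sentiment("Bad headache and nausea, pain got worse"))
# ('negative', 0.0)
```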