Mapping cigarette similarities using cluster analysis methods.

The aim of the research was to investigate the relationships within and between chemical composition information (tar, nicotine, carbon monoxide), market information (brand, manufacturer, price), and public health information (class, health warning), and to cluster a sample of cigarette data. Thirty cigarette brands were analyzed. Six categorical (cigarette brand, manufacturer, health warnings, class) and four continuous (tar, nicotine, and carbon monoxide concentrations, and package price) variables were collected for the investigation of chemical composition, market information and public health information. Multiple linear regression and two clusterization techniques were applied. The carbon monoxide concentration proved to be linked with the tar and nicotine concentrations. The applied clusterization methods identified groups of cigarette brands that showed similar characteristics. The tar and carbon monoxide concentrations were the main criteria used in clusterization. An analysis of a larger sample could reveal more relevant and useful information regarding the similarities between cigarette brands.


Introduction
Global cigarette consumption has been rising since the beginning of the 20th century. China, the United States of America, Japan, Russia, and Indonesia are the top five countries by cigarette consumption [1]. Smoking prevalence among Romanian adults was 21.4% (33.2% for males and 10.3% for females) between 2002 and 2005 [2]. In 2004, Romania was among the top three countries worldwide for cigarette consumption [3].
The compounds of cigarette smoke and the effects of smoking are well known today. Cigarette smoke contains some four thousand compounds with different actions on the human body [4,5]. Tobacco smoking is now recognized as the major etiologic factor associated with cancer (lung cancer [6,7], pancreatic cancer [8], gastric cancer [9], oral cancer [10], renal cancer [11], and breast cancer [12]). It is also known that the main addictive component, nicotine, is not carcinogenic by itself [13].
Some countries have introduced regulations on cigarette tar, nicotine and carbon monoxide in order to reduce the effect of these substances on the human body. In the United States of America, for example, the Federal Trade Commission has published standardized tar and nicotine ratings since 1967. The carbon monoxide rating was introduced in 1980 for all cigarettes sold on the American market [14,15].
In Romania, the cigarette market has undergone some changes since 1 January 2007, when Romania became a European Union member state and some regulations entered into force. The tar, nicotine and carbon monoxide concentrations are printed on cigarette packs, as recommended by the World Health Organization Framework Convention on Tobacco Control [16] and imposed by Romanian laws and regulations [17][18][19]. The maximum concentrations of tar (10 mg/cig), nicotine (1 mg/cig) and carbon monoxide (10 mg/cig) in cigarettes sold or manufactured in Romania are also regulated by a number of laws [17][18][19]. Health warnings and explanatory health messages appear on the front and back sides of cigarette packs, in accordance with Romanian regulations [17,20]. The advertising of tobacco products in cinema halls and the sale of tobacco products to minors [21][22][23] are prohibited in Romania. Smoking in enclosed public places and in educational and medical establishments [17,18,22,23] is also prohibited.
Regarding public information on the effects of smoking, two types of warning messages are imposed: general ("Smoking kills", with two variants: "Smoking can kill", abbreviated as HW 1, and "Smoking harms yourself and people around you", abbreviated as HW 2) and additional or explanatory health messages. The explanatory health message can be one of the following:
• "Smokers die younger" - abbreviated as EHM 01
• "Smoking blocks blood circulation in arteries and induces heart attacks and strokes" - EHM 02
• "Smoking causes fatal lung cancer" - EHM 03
• "Smoking when pregnant harms your baby" - EHM 04
• "Protect children: don't let them breathe your smoke" - EHM 05
• "Your physician or pharmacist can help you to quit smoking" - EHM 06
• "Cigarettes are addictive. Do not start to smoke!" - EHM 07
• "Quitting smoking decreases the risk of fatal cardiac and lung diseases" - EHM 08
• "Smoking can cause a painful and fatal death" - EHM 09
• "For smoking cessation consult your physician or pharmacist" - EHM 10
• "Smoking slows down blood circulation and induces impotence" - EHM 11
• "Smoking induces skin aging" - EHM 12
• "Smoking can decrease sperm quality and fertility" - EHM 13
• "Cigarette smoke contains benzene, nitrosamines, formaldehyde and cyanides" - EHM 14
The aim of the study was to investigate the relationship between chemical composition information (tar, nicotine, carbon monoxide), market information (brand, manufacturer, price), and public health information (class, health warning) by using multivariate and clustering techniques.

Material and Method
The material of the present research was a sample of cigarettes. Cigarettes were included in the sample based on the following inclusion criteria: filtered cigarettes; tar concentration printed on the packet; nicotine concentration printed on the packet; health warnings and explanatory health messages printed on the front and the back of the cigarette packet. A supermarket in Cluj-Napoca was chosen at random from the total number of supermarkets, and all cigarettes that met the inclusion criteria were bought. The following quantitative and qualitative variables were collected: cigarette brand, manufacturer, and price (as market information); tar, nicotine, and carbon monoxide concentrations (as cigarette chemical composition); class, health warning, and explanatory health messages (as public health information).
The cigarettes were classified by applying the specification of order no. 919/1997. Based on the sugar, polyphenol, and tar (mg/cigarette) concentrations, free burning speed (mm/minute), cigarette length (mm), and filter length (mm), three classes of cigarettes were defined: superior, medium, and inferior [24].
Thirty cigarette packs produced by four manufacturers were included in the analysis: British American Tobacco (BAT), on the Romanian market since 1997 [25]; Philip Morris (PM), on the Romanian market since 2001 [26]; Japan Tobacco (JT), on the Romanian market since 1995 [27]; and Continental Tobacco [28]. The characteristics of the studied cigarettes are presented in Table 1.
The relationship between the chemical composition variables (tar, nicotine, carbon monoxide) was investigated by using the multiple linear regression technique. Two-step clustering and hierarchical cluster analysis techniques were used to characterize the market and public health information, as well as to cluster the entire collected data set. The data were analyzed and summarized using SPSS 12.0 software. The confidence intervals at a significance level of 5% associated with the binomially distributed frequencies were calculated using dedicated software [29].

Cigarettes Chemical Composition
The tar, nicotine and carbon monoxide concentrations were investigated, and their descriptive characteristics are presented in Table 2.
A multiple linear regression analysis was performed in order to identify and quantify the link between carbon monoxide, as the dependent variable, and tar and nicotine, as independent variables. The regression analysis was applied to 29 sample data (the carbon monoxide information was not specified on pack no. 16 - Table 1). The following equation with associated statistics was obtained:

$$[\mathrm{CO}]_{mg} = 1.78 + 1.08 \cdot [\mathrm{Tar}]_{mg} - 3.06 \cdot [\mathrm{Nicotine}]_{mg} \qquad \text{(Eq.1)}$$

$$r = 0.9594;\quad r^2 = 0.9205;\quad r^2_{adj} = 0.9144;\quad s = 0.68;\quad n = 29;\quad F = 150\ (p = 5.06 \cdot 10^{-15})$$

where:
• $[\mathrm{CO}]_{mg}$ = carbon monoxide concentration (mg); $[\mathrm{Tar}]_{mg}$ = tar concentration (mg); $[\mathrm{Nicotine}]_{mg}$ = nicotine concentration (mg);
• $r$ = correlation coefficient; $r^2$ = determination coefficient; $r^2_{adj}$ = adjusted determination coefficient; $s$ = standard error of estimate; $n$ = sample size; $F$ = Fisher parameter ($p$ = its significance).
The statistical parameters of the model given in Eq.1 are presented in Table 3. The given values (as printed on the packs) and the expected values (as resulted from the regression model in Eq.1), as well as the estimated value for the omitted data (pack no. 16), are presented in Table 4. Averaging the last column of Table 4 gives an estimation error of 0.38 in terms of truncated values.
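As an illustration of how Eq.1 can be used, the following minimal sketch applies the published coefficients to a pack and re-derives the adjusted determination coefficient and the Fisher parameter from the reported r 2, n and number of predictors. The function names and the example pack (8 mg tar, 0.7 mg nicotine) are hypothetical and are not taken from Table 1; the original analysis was run in SPSS, not in Python.

```python
# Sketch only: coefficients come from Eq.1; everything else is illustrative.

def predict_co(tar_mg, nicotine_mg):
    """Estimated carbon monoxide (mg) from tar and nicotine (mg), per Eq.1."""
    return 1.78 + 1.08 * tar_mg - 3.06 * nicotine_mg

def adjusted_r2(r2, n, k):
    """Adjusted determination coefficient for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F statistic of a linear model with k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical pack, NOT a row of Table 1.
co_estimate = predict_co(tar_mg=8.0, nicotine_mg=0.7)

# Consistency check of the statistics reported with Eq.1 (n = 29, two predictors).
r2_adj = adjusted_r2(0.9205, n=29, k=2)
f_value = f_statistic(0.9205, n=29, k=2)

print(f"estimated CO: {co_estimate:.2f} mg")
print(f"adjusted r2: {r2_adj:.4f}, F: {f_value:.1f}")
```

Under these formulas the reported r 2 = 0.9205 leads to values close to the published r 2 adj = 0.9144 and F = 150, which suggests that the statistics listed with Eq.1 are internally consistent.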
Based on the estimation errors observed in Table 4, the carbon monoxide concentration was expressed in terms of a confidence interval (see Table 5). The probability was calculated by treating the concentration as a continuous uniform variable between the lowest and highest values.

Market Information
The studied cigarettes were produced by four manufacturers. The absolute and relative frequencies of the cigarette manufacturers and the 95% confidence interval associated with the relative frequency are presented in Table 6.
The cigarette price varied from 2.25 RON (Pannonia) to 5.77 RON (Virginia), with an average of 5.02 RON (standard deviation of 0.74) for the entire sample. The mean price of the cigarettes produced by BAT was statistically significantly higher than the mean price of the cigarettes produced by JT (m BAT = 5.23, n BAT = 17, StdDev BAT = 0.50; m JT = 4.64, n JT = 4, StdDev JT = 0.47; t = 2.14, p = 0.04; m = mean, StdDev = standard deviation).
The two-step cluster technique was applied to the market information. Two clusters were identified. The manufacturer variable was the criterion used in clusterization: the first cluster comprised all cigarettes produced by British American Tobacco (seventeen), and the second cluster comprised all cigarettes produced by Philip Morris (eight), Japan Tobacco (four), and Continental Tobacco (one). The price centroids of the clusters were as follows:
• mean = 5.23 and standard deviation = 0.50 (first cluster)
• mean = 4.74 and standard deviation = 0.92 (second cluster).
The manufacturer variable had a statistically significant cluster-wise importance.
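The comparison between the BAT and JT mean prices reported in this section can be retraced from the summary statistics alone. The sketch below assumes a two-sample Student t-test with pooled variance (equal variances assumed); the text does not state which variant was run in SPSS, so this choice and the function name are assumptions.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df

# Summary statistics as reported in the text (prices in RON).
t_value, df = pooled_t(5.23, 0.50, 17, 4.64, 0.47, 4)

print(f"t = {t_value:.2f} with {df} degrees of freedom")
```

With the reported means, standard deviations and sample sizes, this pooled-variance form gives a t value that rounds to the published t = 2.14, on 19 degrees of freedom.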

Public Health Information
Almost seventy-seven percent of the studied cigarettes (twenty-three out of thirty, 95%CI [17 - 27]) were classified as superior in quality, while twenty-three percent were classified as medium (seven brands, 95%CI [3 - 13]). The distribution of the health warnings and explanatory health messages found on the studied cigarettes is presented in Table 7.
In order to identify the links between the public health information variables, the Spearman correlation coefficient was applied. A good correlation (r = 0.6042, p < 0.05) was obtained between class and the explanatory health message EHM 13 (three cigarettes classified as medium class had this message printed on the pack, all of them produced by BAT). No other significant correlations between public health information variables were obtained (r < 0.5, p > 0.05).
f a = absolute frequency; f r = relative frequency; 95% CI no = 95% confidence interval expressed as number of packs
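To show how a 95% confidence interval for a binomial frequency can be expressed as a number of packs, the following sketch uses the Wilson score interval. This is a swapped-in, commonly used method chosen for illustration; the intervals in the text were computed with the dedicated software cited as [29], whose exact method is not described here, so the function below should not be read as the authors' procedure.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

n_packs = 30
low, high = wilson_interval(23, n_packs)   # 23 superior-class packs out of 30

# Express the bounds as a number of packs, as done in the text.
low_packs, high_packs = low * n_packs, high * n_packs
print(f"{low_packs:.2f} to {high_packs:.2f} packs")
```

Under this assumption the bounds fall between 17 and 18 packs and between 26 and 27 packs, so rounding outward gives a range of 17 to 27 packs. That it coincides with the printed interval does not establish which method the cited software applies.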

Two-Step Cluster Technique: Overall Analysis
All cigarette brands were valid for cluster analysis (the carbon monoxide concentration for Viceroy Ultra Light - BAT, see Table 1, was estimated based on Eq.1).
One cluster was identified. The centroid characteristics of the cluster are presented in Table 8. Note that no categorical or continuous variable had a statistically significant importance in clusterization.
Almost ninety-seven percent of the cigarette brands included in the study were valid and were thus included in the hierarchical cluster analysis (twenty-nine out of thirty). One brand was excluded from this analysis because the carbon monoxide concentration was not printed on the pack (pack no. 16, see Table 1). The resulting average agglomeration schedule is presented in Table 9. The horizontal icicle plot obtained for the studied sample is presented in Figure 1 and the associated dendrogram in Figure 2.

Discussion
The aim of the research was reached. The chemical composition, market information, and public health information were analyzed. The results indicated that hierarchical cluster analysis is a useful multivariate method for grouping the investigated cigarettes that behave similarly in terms of the studied variables.
Regarding the chemical composition of the studied cigarettes, the following observations can be made: (a) the tar concentrations varied from 1 mg to 12 mg (Pannonia, see Table 1), with a single value greater than 10 mg (10 mg being the highest concentration accepted); the average tar concentration was 7.23 mg with a standard deviation of 2.69 mg; (b) the nicotine concentration varied from 0.1 mg to 0.9 mg, with an average of 0.63 mg and a standard deviation of 0.19 mg; (c) the carbon monoxide concentration varied from 0.5 mg to 12 mg, with an average of 7.57 mg and a standard deviation of 2.84 mg. Thus, it can be said that the tar and carbon monoxide concentrations were similar in terms of minimum value, maximum value, average and standard deviation. At least two explanations are plausible for the tar and carbon monoxide concentrations of 12 mg on pack no. 18 (see Table 1, the single brand produced by Continental Tobacco): either the packet came from an older stock, produced before the current laws, ministerial orders and ordinances were in place, or Continental Tobacco produces and sells cigarettes with higher tar and carbon monoxide concentrations than those imposed.
Regarding the carbon monoxide concentration, as can be observed from Table 1, the Viceroy - Ultra Light brand did not have it printed on the packet. Viceroy Ultra Light could be a brand less preferred by Romanian smokers, and the packet could have come from a previous stock. As can be observed from Eq.1, the carbon monoxide concentration is in a linear relationship with the tar and nicotine concentrations. Almost ninety-two percent of the carbon monoxide variation can be explained by its linear relationship with the tar and nicotine variation. In other words, it can be said that cigarettes with a high carbon monoxide concentration also have high tar and nicotine concentrations. Based on Eq.1, the carbon monoxide concentration was estimated for pack no. 16 (Table 1). The analysis of Eq.1 showed a very good agreement between given and expected values, taking into account that all values are discretized (being given with one-digit precision, see Table 4). The estimated value made it possible to include all cigarettes in the clusterization analysis.
Regarding the market information, it can be observed that the majority of the cigarettes included in the study were produced by British American Tobacco (more than half, seventeen out of thirty, see Table 6). These cigarettes proved to have a significantly higher price than those of the Japan Tobacco manufacturer, a result also supported by the two-step cluster analysis. In summary, it can be said that British American Tobacco dominates the cigarette market of the supermarket included in the study.
Analyzing the public health information regarding the health warnings and explanatory health messages printed on the studied cigarette packs, twenty-nine packs were in conformity with the Romanian laws. The Viceroy - Ultra Light brand did not have any of the imposed general texts on the packet but carried a hybrid of "Smoking harms yourself and people around you", worded as "Smoking severely harms your health". Another observation that can be made concerns the carbon monoxide concentration. Looking at the explanatory health messages (see Table 7), it can be observed that the most frequent messages refer to child protection and to the composition of cigarette smoke. However, questions that can be investigated in further research are "What is the impact of the explanatory health messages on smokers' behavior?" and "What is the significance of the explanatory health messages from the smokers' point of view?".
An interesting result was obtained in the correlation analysis applied to explanatory health messages and cigarette class: the message "Smoking can decrease the sperm quality and fertility" was significantly correlated with cigarette class. All cigarette packs manufactured by BAT and classified as medium in quality class carried this explanatory health message.
The two-step cluster analysis applied to the entire data set proved not to be a useful technique for characterizing the variables included in the analysis. The hierarchical cluster analysis applied to the entire data set revealed some interesting information about the studied variables. This analysis was applied to the valid sample of cigarettes (twenty-nine out of thirty; pack no. 16 was excluded due to the missing carbon monoxide concentration) and took into consideration four continuous variables (tar, nicotine, and carbon monoxide concentrations, and price).
Analysis of the agglomeration schedule revealed that there is no difference between Marlboro - Gold and Marlboro - Menthol (Philip Morris) or between Dunhill - Silver and Dunhill - Light Blue (British American Tobacco). Note that these two brands (Marlboro - Gold and Marlboro - Menthol, see Table 1) are identical in terms of tar, nicotine and carbon monoxide concentrations and have the same price. The same observation is true for Dunhill - Silver and Dunhill - Light Blue (see Table 1). Thus, Marlboro - Gold and Marlboro - Menthol can be expected to be found in one cluster, and Dunhill - Light Blue and Dunhill - Silver in another cluster (which is what happened, see Figure 2).
In eleven cases, the differences between cigarette brands resulting from the agglomeration schedule were due to differences in price. The ascending order according to the value of the coefficients is presented in the agglomeration schedule (see Table 9).
Looking at the icicle plot (Figure 1), one can follow what happens at each clusterization step. At the start step (the one that is not represented on the icicle plot, Table 9), each cigarette brand was a cluster unto itself (the number of clusters at the starting point being equal to the sample size). After the first step, the cigarette brands were ordered in the icicle plot according to their combination into clusters. Marlboro - Gold and Marlboro - Menthol form the first cluster, then Dunhill - Silver and Dunhill - Light Blue form another cluster, and so on, until all the clusters were formed.
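The stepwise merging described above can be sketched in a few lines. The example below is a minimal illustration and not the SPSS procedure: it assumes average linkage between groups (suggested by the phrase "average agglomeration schedule") and squared Euclidean distance on unstandardized variables, both of which are assumptions. The five data rows are hypothetical (tar, nicotine, carbon monoxide, price) values, not rows of Table 1.

```python
from itertools import combinations

def sq_euclidean(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(points):
    """Return the agglomeration schedule as (cluster_a, cluster_b, coefficient)."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per item
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            # Average of all pairwise distances between the two clusters.
            d = sum(sq_euclidean(points[i], points[j]) for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((tuple(clusters[a]), tuple(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule

# Hypothetical packs: (tar mg, nicotine mg, CO mg, price RON).
points = [
    (6.0, 0.5, 7.0, 5.5),    # item 0
    (6.0, 0.5, 7.0, 5.5),    # item 1: identical to item 0
    (10.0, 0.8, 10.0, 5.0),  # item 2
    (10.0, 0.8, 10.0, 4.5),  # item 3: differs from item 2 by price only
    (1.0, 0.1, 2.0, 5.0),    # item 4
]

schedule = average_linkage(points)
for step, (left, right, coefficient) in enumerate(schedule, start=1):
    print(step, left, right, round(coefficient, 4))
```

In this toy data set the two identical items are merged first with a coefficient of zero, and the pair that differs only in price is merged next with a small coefficient, which mirrors the pattern described for the identical Marlboro and Dunhill brands and for the price-driven differences.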
The dendrogram presented in Figure 2 shows the overall picture of what happened during clusterization. Six clusters were formed at the first step:
• 1.1. One cluster that brings together ten cigarette brands by including five British American Tobacco
For all clusters obtained in the first step, the prices of the cigarette brands were different, showing that this variable was not taken into consideration in the construction of the clusters.
In the second step, at a short distance, two clusters were formed: one by joining cluster 1.4 with cluster 1.5.a (2.1), and another by joining cluster 1.5.b with the Virginia Superslims brand (2.2). Going further, at a distance of less than five, three clusters were formed based on the similarities between the brands: one by joining cluster 2.1 with cluster 2.2 (3.1), one by joining cluster 1.1 with cluster 1.3 (3.2), and one by joining cluster 1.2 with the LM - Neo Slims brand (3.3). At this level, just one cigarette brand (Pannonia - Blue, produced by Continental Tobacco) had not been joined with others. At a rescaled distance of ten on the dendrogram, cluster 3.1 was joined with cluster 3.3 (4). At a distance of less than fifteen, the Pannonia - Blue brand was joined with cluster 3.2 (5). At the highest possible distance, cluster 4 was joined with cluster 5.
The fourth cluster included the cigarette brands with a tar concentration less than or equal to 7 mg, a nicotine concentration less than or equal to 0.8 mg, and a carbon monoxide concentration less than or equal to 8 mg. The fifth cluster included the cigarette brands with a tar concentration greater than or equal to 8 mg, a nicotine concentration between 0.6 and 0.8 mg, and a carbon monoxide concentration greater than or equal to 9 mg.
The study has some limitations. The main limitation concerns the number of cigarette brands and manufacturers included in the study. The financial resources were limited, and this led to a limitation of the number of cigarette brands and/or manufacturers included in the analysis. The small number of cigarette brands of medium class and the absence of cigarette brands of inferior class represent another limitation of the study (23 brands were classified in the superior class and 7 brands in the medium class, as resulted from the application of the research protocol).
Based on the obtained results, the following concluding remarks can be made:
• The carbon monoxide concentration is linearly dependent on the tar and nicotine concentrations.
• A dominant position of British American Tobacco was identified in the chosen supermarket.
• The prices of the cigarettes produced by British American Tobacco were statistically significantly higher than the prices of the cigarettes produced by Japan Tobacco.
• The most frequent warning message printed on the investigated cigarettes was "Smoking can kill".
• The most frequent explanatory health messages printed on the investigated cigarettes were "Protect children: don't let them breathe your smoke" and "Cigarette smoke contains benzene, nitrosamines, formaldehyde and cyanides".
• Hierarchical cluster analysis proved to be a useful technique for investigating similarities among the studied variables. The main criteria used in clusterization were the tar and carbon monoxide concentrations.
Even if hierarchical cluster analysis can offer information about the likeness between cigarette brands produced by different manufacturers, a wider analysis that includes a larger number of cigarette brands and other variables collected from smokers could reveal more relevant information. This research can be considered a first step in grouping similar cigarette brands in such a way that their interrelationships are relevant. Depending on the future availability of funds and human resources, we intend to carry out more in-depth research by including a larger number of cigarette brands and data obtained from a population survey that analyzes the impact of health warnings and explanatory health messages on smoking behavior.