How to Use Biomechanical Job Exposure Matrices with Job History to Access Work Exposure for Musculoskeletal Disorders? Application of Mathematical Modeling in Severe Knee Pain in the Constances Cohort

Introduction: Musculoskeletal disorders related to work might be caused by the cumulative effect of occupational exposures during working life. We aimed to develop a new model which allows to compare the accuracy of duration of work and intensity/frequency associations in application to severe knee pain. Methods: From the CONSTANCES cohort, 62,620 subjects who were working at inclusion and coded were included in the study. The biomechanical job exposure matrix “JEM Constances” was used to assess the intensity/frequency of heavy lifting and kneeling/squatting at work together with work history to characterize the association between occupational exposure and severe knee pain. An innovative model G was developed and evaluated, allowing to compare the accuracy of duration of work and intensity/frequency associations. Results: The mean age was 49 years at inception with 46 percent of women. The G model developed was slightly better than regular models. Among the men subgroup, odds ratios of the highest quartile for the duration and low intensity were not significant for both exposures, whereas intensity/duration were for every duration. Results in women were less interpretable. Conclusions: Though higher duration increased strength of association with severe knee pain, intensity/frequency were important predictors among men. Exposure estimation along working history should have emphasis on such parameters, though other outcomes should be studied and have a focus on women.


Introduction
Musculoskeletal disorders (MSDs) related to repetitive and physically demanding working conditions continue to represent one of the largest occupational disease in industrialized countries [1,2]. MSD related to work are caused by non-traumatic injuries, with a possible cumulative effect of occupational exposures during working life, mainly for degenerative disorders, such as osteoarthritis, and severe knee pain [3]. Actually, knee pain was considered because it is a very common pain in occupational setting with clear work exposure [4].
The evaluation of biomechanical exposures can be done in different ways and expressed around two main dimensions, intensity/frequency per day and duration over the year [5]. The exposure assessment can be obtained by estimations based on subjective judgments (self-reports, expert judgments), systematic observations (observations at the workplace, video recording), and direct measurements (at the workplace or in laboratory). However, these techniques are problematic when past exposure evaluations are needed.
Job exposure matrices (JEMs) are commonly used in occupational epidemiology research for the evaluation of past exposures [6]. Indeed, JEMs allow estimating participants exposures to occupational factors based on job titles, industry sector, and population exposure data. Several biomechanical JEMs have become available recently [7][8][9][10][11]. A cumulative exposure index is commonly used to assess cumulative work exposure by multiplying duration and intensity/frequency. However, it is not clear how to consider the combination of intensity/frequency when assessing exposure over the years.
Thus, before optimizing models using relevant statistical methods, we first aimed to determine if low level exposure with high duration is equivalent to high level exposure with low duration in the example of severe knee pain and two occupational exposures: carrying heavy loads and kneeling/squatting.
We then aimed develop a new model and compare the accuracy of duration of work and intensity/frequency associations in application to the knee disorders using severe knee pain as an outcome in a large cohort study, by developing an innovative model.

Population
The CONSTANCES study is a French general population-based cohort [12]. More than 200,000 participants, aged 18-69 years, were recruited between 2012 and 2020 in 23 health screening centers across France. The recruitment was limited to people affiliated to the French National Health Insurance Fund who correspond to active or former salaried workers and their families and excludes agricultural and self-employed workers [12]. At enrolment, self-administered questionnaires were sent to participants to collect data, including lifestyle, life events, health, and occupations. Variables of interest were collected from the baseline self-administered questionnaire and the medical interview.
For this work, we used French CONSTANCES clean data extracted in 2020. Subjects from this cohort were active at their inclusion with work trajectory coded.

Variables of Interest
Participants' sex, age at inception, known inflammatory disease of the joints, regular leisure activity (sports, gardening yes/no), Center for Epidemiologic Studies-Depression Scale (CES-D) into two categories (yes/no), were retrieved from the baseline questionnaire, and body mass index from the medical examination.
JEM Constances, which is based on self-reported exposure was used to evaluate occupational exposure [13]. In the JEM, occupational exposure is rated from 0 to 4 for intensity/frequency of heavy lifting ("lifting") and 1 to 4 of kneeling of squatting ("kneeling") based on reported job titles. The JEM Constances was combined with participants reported work trajectory that were coded at baseline retrospectively.
The main outcome was reporting severe knee pain, collected from the self-reported questionnaire at inception: yes if knee pain intensity >5/10 or having knee pain for more than a month per year.

Mathematical Modeling
In order to study the influence of duration and intensity/frequency on the onset of the disease, several logistic regression models were built for the two exposures separately: heavy lifting and kneeling.
The method we propose in this article is based on a generalization of the logistic regression approach. Formally, we define (Y, X i1 , . . . , X ip ) [1,n] an i.i.d. sample in {0, 1} × R p of size n ∈ N * where Y is the response variable and corresponds to the illness status of each subject (if the subject is sick Y = 1, else Y = 0) and X i = (X i1 , . . . , X ip ) are the variables X i1 , . . . , X ip−6 corresponding to a score based on occupation times and levels exposures and X ip−5 , . . . , X ip corresponding to others adjustment variables.

Statistical Models
For the occupational health data from the cohort CONSTANCES, several logistic models were possible, depending on the total duration value of the careers and the average exposures of the individuals. We describe them in our cohort via the variables T i = (T i1 , . . . , T ik i ), N i = (N i1 , . . . , N ik i ), and a 1i , . . . , a 6i , respectively, for occupation times, levels of exposures for each k i jobs held by the subject i, as well as six adjustment variables (a 1i = sex, a 2i = age, a 3i = imc, a 4i = leisure, a 5i = arthrite, a 6i = depression). We must take into account that the number of jobs k i can be very different. The models A, B, and C are defined through the three following transformations: Thus, Y i follows a Bernoulli distribution, such that P( i or X C i depending on the variable considered. We specify that the model B differs, from the model A, because in the computation of its transformation the exposure levels between 0 and 1 are confounded. The model C differs from the model A, because in the computation of its transformation the lowest exposure levels are nullified. Actually, models B and C are byproducts of the classical model A, standard in the literature dealing with cumulative exposure, such as smoking habits or cumulative exposure index at work [14]. More precisely, in Model B, in the sum of products "duration by exposure", we neglect small exposures by thresholding to zero exposures smaller or equal to 1 [15]. In Model C, we just consider the total duration restricted to jobs physically very demanding (exposure larger than 3) [16]. In order to define the model G we need to introduce two transformations φ 1 and φ 2 : T ij corresponding to the total duration of the career range.
average of the exposure level. This weighting was considered to emphasise intensity, especially for short duration exposure.
The construction of the design matrix (X ij ) (1≤i≤n,1≤j≤p) of the model G is based on a semi-coding of the variables {φ 1 (T i )} 1≤i≤n and {φ 2 (T i , N i )} 1≤i≤n . More precisely, the membership in a group is determined by the belonging of the values φ 1 (T i ) and φ 2 (T i , N i ) to different given intervals. For this study, we consider the classes of intervals where q j ∈ N * defines fixed empirical quantiles of the sample {φ 1 (T i )} 1≤i≤n . When c = 1, the exposure considered is carrying a heavy load, and when c = 2 the exposure considered is kneeling. The set of groups is defined as the class G :

Selection of the Design Matrix for Model G
We propose to choose the design matrix X for the model G by a recent model selection procedure, introduced and described in the article "Model selection in logistic regression" by Kwemou et al. [17]. The mathematical guarantees for this model selection method are based on oracle inequalities from Birgé and Massart [18]. This model selection is performed using penalized maximum likelihood estimators which will allow us to choose the best design matrix.
Let F be a family of design matrices X (m) , m ∈ {1, . . . , M}. For each m ∈ {1, . . . , M}, we defineβ m as the estimator obtained by minimizing γ n over R p m where p m is the number of columns of X (m) , namely:β m = arg min β∈R pm γ n (β). We use a data driven strategy that selects the best matrix among the family F . For this purpose we use a penalized maximum likelihood criterion for choosing the index m associated to the appropriate design. Here, we consider the Akaike information criterion where p m corresponds to the number of parameters in the model (1), i.e., p m is the number of columns of the matrix X (m) and Hence, minimizing this criterion allows us to find the best design matrix X m .
The "lifting" and "kneeling" quartiles of high exposure are considered in terms of duration and intensity/frequency:

α-Divergence and Statistical Tests
The logistic regression formula given in (1) depends on the model A, B, C, and G. We wish to compare these statistical models, via a differentiation criterion, in order to identify the most appropriate model for this formula. Let r be the unknown probability that a randomly selected subject from the population is sick. It is natural to estimate the unknown parameter r by the proportion of patients observed in our dataset: One can also estimate r from the logistic regression formula via the estimatorr 2 = 1 n ∑ n i=1 π β * (X i ). Note that unliker 1 , the estimatorr 2 depends on the model. For an appropriate model it is natural to choose the model for which the estimatorsr 1 andr 2 are close.
We here chose to rely on the work of A. Basu et al. [19] who developed a robust estimator for the density function. The criterion used in this article is based on the αdivergence between two densities f and g (relative to a measure µ) defined for an α > 0 as follows: Here, we need to use the α-divergence between two Bernoulli laws with parameters r 1 ∈ [0, 1] and r 2 ∈ [0, 1] which we define by The parameter α here defines a trade-off between the estimation efficiency and the variability robustness. Note that when α is close to 0 we retrieve the Kullback-Leibler divergence and when α = 1 we have d 1 (r 1 , r 2 ) = 2(r 1 − r 2 ) 2 . Next, we compared the p-value of the Wald's test (with Bonferroni correction) and odds ratios (OR) of the highest quartile for the duration with low intensity/frequency and the highest quartile for the intensity/frequency with low duration. Stratification on sex was performed as sensitivity analysis. For both exposures, logistic models were built adjusted on relevant variables. We compared the p-value of the Wald's test (with Bonferroni adjustments) and odds ratios of different quartile of duration and intensity/frequency and with low duration and lowest intensity as reference.

Analysis Plan
After a brief description of the sample and the available adjustment variables, we assessed the performance score based on the α-divergence allowing the comparison of different models with α varying with each relevant occupational exposure. For a small value of parameter α, the decision of the model choice can be questioned but when α increases the choice of model becomes easier and the model G is selected.
Then, we were able to compare Odds ratios with the Wald's test. Sex stratification has been performed as primary analysis (whereas 45 years of age stratification has been considered in secondary analysis).

Results
The sample included 62,620 subjects (Table 1), with 29,064 women (46.4%) and with a median of 48.1 years old (20 years of employment). The subjects reported regular primary leisure activities in 43.3% of cases, with half of sample who were overweight/obese, and an important part (22.5%). who had a positive CES-D, suggesting a depression. In order to study intensity and duration of exposure, we first selected the best model between A to G. We compared this family of models by increasing α. For any value of parameter alpha, the model G is selected (Tables 2 and 3).
For heavy lifting as well as kneeling, intensity/frequency was the most important predictor. Duration increased risks only for men with a dose-response relationship. For women, only intensity and frequency in low duration for both exposures seemed associated with the knee pain (Tables 4 and 5). Although the statistical power was lower, the pattern of association was similar for participants aged less and more than 45 years (Tables 6 and 7).

Discussion
This is the first study that used a developed mathematical model to compare the effect of duration and intensity during working life, on the association with severe knee as a proxy of degenerative musculoskeletal disorders. The new model G was found to be better than the usual ones, though the difference was minor. For men, we found that the OR of the highest quartile for the duration and low intensity is not significant for both exposures, whereas intensity/duration is significant for every duration, with a dose-response relationship. Results for women were limited. As expected, there was as an important effect of the intensity of heavy lifting and kneeling on severe knee pain. Both exposures are known to be associated with knee disorders [20][21][22]. The dose-response relationship has been described previously. Jensen [23,24] calculated an equivalent of our model A using an individual exposure from the number of knee-straining activities and the number of years in the trade within a collective of floor layers, carpenters, and compositors. The ORs for knee complaints and radiographically determined knee osteoarthritis were 3.0 (95% CI, 0.5 to 17.2) in the low-exposure group, 4.2 (95% CI, 0.6 to 27.6) in the mediumexposure group, and 4.9 (95% CI, 1.1 to 21.9) in the high-exposure group compared with the zero-exposure group [23]. There is an important difference in the strength of the associations compared to our work but it should be explained by the large population design with JEM exposure methods. Indeed, high exposure is considered using the proxy of job title with a large variation inside job categories.
Similarly expected, in a previous review on occupational exposure and knee osteoarthritis, ref. [24] lifting and carrying of loads was significantly associated with severe knee pain. Knee osteoarthritis was also found associated with lifting and carrying of loads with a dose-response relationship: OR of 2.0 (95% CI, 1.1 to 3.6) in the exposure group 630 to <5120 kg-hours over life, up to an OR of 2.6 (95% CI, 1.1 to 6.1) in the highest exposure group (>37,000 kg-hours over life) in men [23]. We included in our study the category "Inflammatory osteoarthritis" even if its frequency was low (1.4%). Indeed, such disorders are known to be associated with knee pain [25]. In our study, the proportion of subjects with such diseases suffering from knee pain is high: 31.0% whereas, in the entire population, the proportion of subjects having knee pain is only 13.6%; when testing whether the regression coefficient of the covariate "Inflammatory osteoarthritis" is null we obtain a p-value, with the Wald's test, smaller than 10 −9 with 95% confident for the odds ratio is [2.36, 3.19].
The lack of clear association for women was also found by D'Souza et al. who reported on an analysis of the US national survey, where they describe relationships between work activities and symptomatic knee osteoarthritis [26]. A significant exposure-response relationship was only found between symptomatic knee osteoarthritis and kneeling in men but not women. Different explanations might be suggested: since our model included adjusting factors, such as BMI and depression, there might be a more complex causal pathways than in men, such as considered in back pain [27]. JEM Constances is not gender stratified and applying a specific JEM for sex could be a focus for another study. Furthermore, a selection effect similar to healthy workers effect is also possible. The main strength of our study was the possibility to use JEM with working life course on a large cohort study. Limitations might also be raised by using a large but not representative population of French workers, and specific jobs in agriculture (not included by design) and mining (almost disappeared in France) should be considered since they are known factors related to lower limb MSD [22]. Second, the outcome was focused on severe knee pain. However, it is self-declared and might correspond to heterogenous disorders, work-related or not. It was used as a good example of a proxy of degenerative musculoskeletal disorders, and the use of pain intensity and severity is recommended [28]. We have previously shown that working in a kneeling or squatting position was significantly associated with severe knee pain [21]. More recently, similar trends of associations between severe knee pain and knee arthroplasty groups were showed in the same cohort [29]. This result is also found elsewhere, with non-managerial jobs associated with higher prevalence of knee osteoarthritis and knee symptoms [30]. Third, as we already mentioned, the use of JEMs might also be questioned since it is a global average evaluation that does not consider the differences inside job [31]. However, assessing exposure during long periods of time and for a big number of subjects is challenging and JEMs are appropriate tool to consider. Furthermore, assessment of carrying heavy loads exposure using JEMs was found to be valid compared to a self-administrated questionnaire [32]. Furthermore, biomechanical JEM used through the working life was already studied and even when the work environment have changed, application of a 4-scale at the individual level did not change regardless of period of time considered [33]. Fourth, activities at work can be varied and diverse, even if the gradations used can allow us to obtain a general idea of exposures. This could be more specific for each job and introduce some complementary variable to adjust the variability between mechanical exposure into different jobs. Fifth, since leisure activities can be numerous, the mechanical exposure coming from these activities and their intensity can train subjects and make them more resilient to disorders.

Conclusions
This innovative approach using mathematical modeling of working history and a JEM, shows that duration in years has a smaller impact than frequency/intensity and should be considered at least among men. Our new model G seems to be an interesting approach though the improvement is slight. Implications for potential policymakers and human resource management might been considered to achieve prevention of such pain during with the help of occupational practitioners. However, further studies should be completed on other outcomes, and have a focus on women.

Conflicts of Interest:
The authors declare no conflict of interest (paid by their affiliations. Authors are paid by their institution, and A.D. is also paid as editor of the Archives des Maladies professionnelles et de l'Environnement (Elsevier). The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.