Why Randomized HR Evaluations May Mislead Managers: The Role of Treatment–Trait Interactions in Outcome Construction

Hamori, Shigeyuki

doi:10.3390/businesses6020023

by Shigeyuki Hamori ¹,²

¹ Faculty of Political Science and Economics, Yamato University, Suita 5640082, Japan

² Graduate School of Economics, Kobe University, Kobe 6578501, Japan

Businesses 2026, 6(2), 23; https://doi.org/10.3390/businesses6020023

Submission received: 22 March 2026 / Revised: 17 April 2026 / Accepted: 20 April 2026 / Published: 7 May 2026

Abstract

This paper examines whether randomized evaluations can fail to identify causal effects when outcomes include interactions between treatment and unobserved characteristics. We show that even under random assignment, standard regression estimators do not necessarily recover the structural causal effect if outcomes contain non-separable interaction terms between treatment and latent characteristics. In that case, the estimated treatment effect also reflects interaction components embedded in the outcome construction and may fail to recover the structural policy parameter. We derive conditions under which unbiased identification is restored, highlighting the critical role of additive separability. The results provide a theoretical foundation for understanding when randomized evaluations may yield misleading conclusions in managerial and policy contexts.

Keywords: randomized evaluation; HR interventions; evaluation bias; workforce composition; performance measurement; HR analytics; managerial decision-making; causal inference; randomized experiments

1. Introduction

Managers increasingly rely on experimental evidence to guide decisions about which human resource (HR) programs to adopt, expand, or discontinue. In many organizations, randomized evaluations are used to compare mentoring schemes, training programs, or job assignment policies, and the resulting estimates often inform resource allocation and performance benchmarking. Such evaluations typically involve randomly assigning employees to different HR interventions (e.g., training, mentoring, or job assignments) and comparing their subsequent performance outcomes. Throughout the paper, we assume that treatment assignment is statistically independent of workers’ unobserved characteristics, so that treatment is exogenous in the standard sense. Despite their growing importance, these evaluations do not always provide clear guidance. Randomized HR evaluations can mislead managers when performance measures reflect interactions between treatment and worker characteristics. In particular, these interaction terms may arise from non-separable functional relationships in the outcome equation, complicating the interpretation of estimated effects. In such cases, estimated treatment effects may not isolate the intrinsic effectiveness of the program, because they also reflect interaction-driven variation that depends on workforce composition, potentially biasing managerial decisions. Our analysis is also related to broader concerns regarding the external validity and context dependence of experimental evidence, as treatment effects may vary with underlying workforce characteristics.

While treatment assignment may be fully randomized and independent of worker characteristics, the estimated treatment effect may nevertheless fail to recover the structural policy parameter when outcome construction embeds interaction terms between treatment and latent characteristics.

The central point of this paper is not that randomization fails, but that randomization alone is insufficient for identifying the structural policy effect when outcomes include non-separable interaction components, a feature that is common in practical organizational settings.

In practice, results from randomized HR evaluations are often used not only to assess whether a program works but also to compare alternative interventions, rank organizational units, and guide resource allocation. For example, a firm might expand mentoring programs that appear highly effective or discontinue training schemes with weak estimated impacts. Implicit in these decisions is the assumption that randomized evaluation delivers unbiased estimates of program effectiveness.

This paper examines a precise identification problem: whether standard regression estimators in randomized evaluations recover the structural policy parameter (i.e., the direct effect of the intervention) when outcome construction embeds interactions between treatment and latent worker characteristics.

Our analysis is also related to the literature on the interpretation of treatment effects and structural policy parameters (Heckman & Vytlacil, 2005, 2007), as well as to recent work on external validity and policy extrapolation (Deaton, 2010; Vivalt, 2020). While these studies emphasize the distinction between reduced-form treatment effects and economically meaningful parameters, we highlight a distinct mechanism: even under random assignment, regression coefficients may fail to recover the structural policy effect when outcome construction embeds interaction terms between treatment and latent characteristics.

Even when treatment is randomly assigned, the regression coefficient on treatment need not isolate the structural policy effect, because outcome measures may mechanically incorporate treatment–type interactions. Our objective is not to challenge the validity of randomized evaluations, but to clarify the conditions under which the estimated coefficient coincides with the underlying structural parameter.

Building on this identification perspective, we develop a theoretical framework to analyze a limitation that arises even within a single experimental setting. In particular, when performance metrics embed interactions between treatment and worker characteristics, conventional evaluation methods may not isolate the direct effect of an intervention. In reality, employee performance depends not only on policy exposure but also on how well the intervention matches an employee’s latent type—for instance, whether a mentoring style suits their learning preferences or whether a training program complements their existing skills.

When outcome measures incorporate such interaction-based match components, estimated treatment effects may reflect not only the intrinsic effectiveness of the program but also workforce composition. Consequently, differences in estimated impacts across firms, units, or time periods may stem more from variation in employee characteristics than from program quality. This raises the risk that managers could overestimate the superiority of a program simply because it aligns with a higher proportion of employees who benefit most.

To clarify this, we develop a simple analytical framework in which performance depends on both a direct policy effect and an interaction-based match component interacting with worker heterogeneity. Our analysis shows that while random assignment ensures independence between treatment and worker type, it does not guarantee independence between treatment and constructed outcome components. As a result, standard regression estimates may be biased when match quality is non-separable in treatment and worker characteristics. Importantly, unbiased recovery of the direct policy effect requires a stronger condition: additive separability between treatment and worker type in the outcome-generating process.

This paper contributes to the literature in economics and management science, while also providing insights for HR practitioners. To illustrate the practical magnitude of this problem, we present a simulation based on a targeted mentoring setting, in which employee performance serves as the outcome measure and incorporates both direct policy effects and interaction-based match components. The results indicate that estimated treatment effects can vary systematically with workforce composition, even when the underlying policy effect remains constant. From a managerial perspective, this suggests that program rankings based on estimated impacts can be unstable or misleading if differences in employee composition are not carefully considered.

This mechanism implies that managers may systematically favor programs that perform well under the current workforce composition rather than those that are intrinsically more effective, leading to a distortion in resource allocation.

The contribution of this paper is threefold. First, we show that even under perfect random assignment, the regression coefficient on treatment does not necessarily identify the structural policy parameter when outcome construction embeds non-separable interactions between treatment and latent characteristics. Second, we establish a transparent analytical condition under which the structural policy parameter is identified, thereby clarifying when conventional evaluation methods remain valid. Third, we demonstrate how evaluation bias translates into distorted managerial decisions, particularly in program comparison and resource allocation. Our objective is not to critique the use of randomized evaluations, but to clarify the conditions under which their results can be correctly interpreted in organizational settings.

2. Related Literature

2.1. Personnel Economics and HR Interventions

A large literature in personnel economics examines how compensation schemes, organizational practices, and assignment policies affect worker performance (Lazear, 1995; Lazear & Oyer, 2012). Empirical studies have documented that HR practices such as performance pay, team-based incentives, and structured management systems can substantially influence productivity (Huselid, 1995; Ichniowski et al., 1997).

More recently, randomized evaluations have been used to assess specific workplace interventions. Bloom et al. (2015) provide experimental evidence on the productivity effects of working from home. Hoffman et al. (2018) analyze how evaluation and managerial practices interact with worker characteristics in shaping performance outcomes. These studies demonstrate the growing role of experimental methods in personnel research.

However, although these contributions emphasize causal identification through random assignment, they typically abstract from the possibility that outcome measures themselves embed assignment-based match components, i.e., interaction-based components between treatment and latent worker characteristics.

While the literature establishes the importance of HR practices for productivity, it provides limited direction on how such evidence should be used in managerial decision-making, particularly when performance measures embed interaction effects.

Recent work in HR analytics and data-driven personnel management further highlights the increasing reliance of firms on quantitative evaluation tools to guide managerial decisions (Boudreau & Cascio, 2017; Levenson, 2018; Minbaeva, 2018). Marler and Boudreau (2017) provide a comprehensive review of HR analytics, emphasizing its role in linking workforce data to strategic decision-making. Angrave et al. (2016) critically examine the limitations of HR analytics practices, noting that data-driven evaluations may yield misleading conclusions when underlying assumptions are not carefully considered. Similarly, Rasmussen and Ulrich (2015) discuss how HR analytics transforms managerial decision processes by integrating data analysis into organizational strategy.

These contributions underscore the growing importance of evaluation frameworks in HR management but also point to potential pitfalls when performance metrics are not properly aligned with underlying causal mechanisms. In particular, they highlight that managerial reliance on quantitative metrics can lead to misinterpretation when evaluation systems embed complex interactions between interventions and workforce characteristics.

In particular, this paper contributes to the HR analytics literature by clarifying a previously underexplored identification issue: even when firms rely on randomized evaluation, the structure of performance metrics themselves may distort managerial inference.

While the literature emphasizes the importance of data-driven decision-making, it has paid limited attention to how the structure of performance metrics themselves may affect causal interpretation. The present paper complements this literature by providing a formal analytical framework that clarifies how such interaction structures affect identification and interpretation in HR evaluation.

Importantly, the present paper contributes to the literature by highlighting a fundamental identification issue that arises when performance metrics—widely used in HR analytics and managerial decision-making—embed interaction structures between interventions and employee characteristics.

This perspective situates our contribution within the growing literature on evidence-based HR management, where the interpretation of evaluation results plays a central role in organizational decision-making.

2.2. Matching and Assignment Models

The importance of worker–firm and worker–task matching has long been recognized in labor economics. Jovanovic (1979) models job matching as a process of learning about match quality, while Sattinger (1993) surveys assignment models that explain wage and productivity dispersion. More recent work highlights how complementarities between worker ability and task characteristics shape equilibrium outcomes (Eeckhout & Kircher, 2011; Shimer & Smith, 2000).

These models emphasize that productivity depends on interactions between worker type and assignment. The present paper draws on this insight but focuses on a distinct question: how such complementarities affect empirical identification in randomized HR evaluations when interaction-based match components are embedded in outcome construction.

From a managerial perspective, these complementarities are often reflected in performance evaluation systems and HR analytics frameworks, where measured outcomes depend on the alignment between interventions and worker characteristics.

2.3. Interpretation of Treatment Effects and Structural Parameters

This paper is closely related to the literature on the interpretation of treatment effects and structural policy parameters. Heckman and Vytlacil (2005, 2007) emphasize that reduced-form treatment effect estimates may not coincide with economically meaningful structural parameters. Related contributions further develop the structural interpretation of treatment effects (Carneiro et al., 2011). In a related spirit, we show that even under random assignment, regression coefficients do not necessarily recover the structural policy effect when outcome construction embeds interaction terms between treatment and latent characteristics.

Our analysis is also connected to the literature on external validity and policy extrapolation (Deaton, 2010; Vivalt, 2020), which highlights that estimated effects may depend on the context in which they are measured. Related concerns have also been raised regarding site selection and contextual dependence in program evaluation (Allcott, 2015). While the literature focuses on cross-context variation in estimated effects, we instead emphasize a within-context identification problem arising from the structure of outcome variables.

2.4. Heterogeneous Treatment Effects and Identification

The econometric literature has long recognized that treatment effects may vary across individuals. Imbens and Angrist (1994) formalize local average treatment effects under heterogeneous responses, while Heckman et al. (1997) analyze selection bias and heterogeneous impacts in program evaluation. More recently, Athey and Imbens (2017) and Chernozhukov et al. (2018) develop methods for estimating heterogeneous treatment effects using modern econometric tools. Recent advances in machine learning methods further extend the literature by enabling flexible estimation of heterogeneous effects (Athey & Wager, 2019; Knaus et al., 2021). Recent developments in difference-in-differences designs also highlight challenges in identifying causal effects under complex treatment structures (Callaway & Sant’Anna, 2021), particularly when treatment effects are heterogeneous or evolve over time.

The present paper differs from the literature in focus. Rather than proposing new estimators for heterogeneous effects, we show that when outcome construction embeds interactions between treatment and latent types, conventional regressions may yield biased estimates of the direct policy effect even under random assignment. Accordingly, the paper highlights a specification issue that arises from the structure of the outcome variable itself.

However, the literature does not address how such interaction structures affect the interpretation of evaluation results in practical HR settings, where those results are used to inform managerial decisions.

Distinction from Heterogeneous Treatment Effects

At first glance, the mechanism analyzed here may appear closely related to the heterogeneous treatment effects (HTE) literature. However, the present argument is conceptually distinct. The HTE literature studies variation in causal effects across individuals and focuses on identifying or estimating average or conditional treatment effects under heterogeneity (Athey & Imbens, 2017; Imbens & Angrist, 1994). In contrast, this paper highlights a specification issue arising from the structure of the outcome variable itself. Even when the average treatment effect is well-defined and consistently estimated under random assignment, the regression coefficient on treatment need not recover the direct structural policy parameter when outcome construction embeds interaction-based match components. The bias identified here therefore does not stem from selection or non-random assignment, but from mechanical covariance between treatment and interaction terms embedded in performance measurement.

In this sense, the paper addresses a decomposition problem rather than an estimation problem under heterogeneity. Importantly, the bias identified here arises even when the average treatment effect is well-defined and consistently estimable, and therefore cannot be resolved by standard heterogeneity-robust methods, including approaches designed to estimate heterogeneous treatment effects.

Formally, the average treatment effect can be written as

E [Y_{i} (1) - Y_{i} (0)] = γ_{0} + δ E [m (1, θ_{i}) - m (0, θ_{i})],

where $Y_{i} (1)$ and $Y_{i} (0)$ denote the potential performance outcomes of worker i with and without the HR intervention, respectively. Here, $γ_{0}$ represents the direct policy effect, $m (T_{i}, θ_{i})$ captures interaction-based match quality as a function of treatment $T_{i}$ and worker type $θ_{i}$, and $δ$ scales the contribution of match quality.

This expression implies that the regression coefficient on $T_{i}$ does not isolate $γ_{0}$ when the interaction term is embedded in the outcome measure.

Taken together, the existing literature provides important insights into HR practices, matching, and treatment heterogeneity, but offers limited guidance on how firms should interpret evaluation results when performance measures embed interactions between policy exposure and worker characteristics. In practice, organizations increasingly rely on HR analytics and composite performance metrics to evaluate interventions, compare programs, and guide managerial decisions. In such settings, the distinction between intrinsic policy effectiveness and composition-driven effects becomes critical. This paper contributes to bridging this gap by linking econometric identification to practical evaluation design and managerial decision-making.

3. HR Assignment Framework

3.1. Targeted HR Interventions and Worker Heterogeneity

This section develops a simple framework to illustrate how evaluation bias can arise in practical HR settings.

Many HR policies are explicitly targeted. Firms frequently design mentoring programs, training modules, or job assignments intended to benefit workers with particular latent characteristics, such as career orientation, skill profile, or motivational type. Let $T_{i} \in \{0, 1\}$ denote assignment to such an HR intervention for worker i. We assume that treatment is randomly assigned, as in a standard randomized evaluation design.

Workers differ in unobserved characteristics $θ_{i}$, representing latent type. These characteristics may capture dimensions such as learning aptitude, task preference, or compatibility with particular forms of supervision. We assume

T_{i} ⊥ θ_{i},

implying that treatment assignment is statistically independent of worker type, consistent with random assignment.

In many HR environments, performance outcomes depend not only on the direct effect of policy exposure but also on the quality of the match between the intervention and the worker’s latent type. For example, the effectiveness of a mentoring program may depend on whether the mentor’s style aligns with the employee’s learning orientation, or whether a training module complements the worker’s skill base.

In practice, such heterogeneity is often unobserved by the analysts but implicitly reflected in performance evaluation systems used by firms.

3.2. Outcome Construction with Interaction-Based Match

To formalize this structure, suppose that worker performance is given by

Y_{i} = α + γ_{0} T_{i} + δ m (T_{i}, θ_{i}) + ε_{i},

(1)

where

$γ_{0}$ denotes the direct policy effect;
$m (T_{i}, θ_{i})$ represents interaction-based match quality;
$δ$ scales the contribution of match quality;
$E [ε_{i} ∣ T_{i}, θ_{i}] = 0$.

The key feature of Equation (1) is that match quality may depend jointly on policy exposure and worker type. In targeted HR interventions, this is natural: the intervention is often designed precisely to interact with particular worker characteristics. This formulation is consistent with common HR evaluation settings, where measured performance (e.g., composite evaluation scores) combines direct productivity effects with assessment components that depend on worker–intervention fit.

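Suppose that the researcher estimates the conventional evaluation regression, which omits the interaction-based match component of Equation (1):

Y_{i} = α + γ T_{i} + u_{i},

(2)

where the error term $u_{i}$ absorbs the omitted match component $δ m (T_{i}, θ_{i})$ together with $ε_{i}$.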

3.3. Evaluation Bias Under Targeted Assignment

The probability limit of the ordinary least squares estimator in Equation (2) is

plim \hat{γ} = γ_{0} + δ \frac{Cov (T_{i}, m (T_{i}, θ_{i}))}{Var (T_{i})} .

(3)
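For readers who want the intermediate step, Equation (3) can be obtained by substituting Equation (1) into the usual slope formula for a regression on a single regressor. The derivation below is a sketch; it uses only the stated condition $E [ε_{i} ∣ T_{i}, θ_{i}] = 0$, which implies that treatment and the disturbance are uncorrelated.

```latex
\begin{aligned}
\operatorname{plim} \hat{\gamma}
  &= \frac{\operatorname{Cov}(T_{i}, Y_{i})}{\operatorname{Var}(T_{i})} \\
  &= \frac{\gamma_{0} \operatorname{Var}(T_{i})
           + \delta \operatorname{Cov}(T_{i}, m(T_{i}, \theta_{i}))
           + \operatorname{Cov}(T_{i}, \varepsilon_{i})}{\operatorname{Var}(T_{i})} \\
  &= \gamma_{0} + \delta \,
     \frac{\operatorname{Cov}(T_{i}, m(T_{i}, \theta_{i}))}{\operatorname{Var}(T_{i})},
  \qquad \text{since } \operatorname{Cov}(T_{i}, \varepsilon_{i}) = 0 .
\end{aligned}
```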

Random assignment ensures statistical independence between $T_{i}$ and $θ_{i}$. However, as shown in Equation (3), it does not imply independence between $T_{i}$ and the composite term $m (T_{i}, θ_{i})$, because match quality may be partially defined by treatment itself.

Evaluation bias arises whenever

Cov (T_{i}, m (T_{i}, θ_{i})) \neq 0.

This covariance can be positive in targeted environments where the intervention is particularly effective for a subset of worker types, leading to systematic overestimation of program effectiveness. In such cases, the estimated treatment coefficient reflects not only the intrinsic effectiveness of the policy ($γ_{0}$), but also the prevalence of worker types for whom the intervention generates strong match gains.

From a managerial perspective, this implies that estimated program effectiveness may be systematically overstated or understated depending on workforce composition.
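To make Equation (3) concrete, the sketch below computes the probability limit of the naive coefficient exactly for a binary treatment that is independent of a discrete worker type. The function name `naive_plim`, the treatment probability `p`, and the four equally likely types are illustrative assumptions rather than part of the paper's specification; the match function is the indicator form used in Section 4.1.

```python
from fractions import Fraction


def naive_plim(gamma0, delta, p, type_probs, m):
    """Exact plim of the OLS slope when Y is regressed on T alone (Equation (3)).

    Assumes T is binary with Pr(T = 1) = p and independent of the discrete type theta.
    """
    # Joint probabilities of (T, theta) under independence.
    joint = [(t, theta, (p if t == 1 else 1 - p) * q)
             for t in (0, 1)
             for theta, q in type_probs.items()]
    e_t = sum(w * t for t, _, w in joint)
    e_m = sum(w * m(t, theta) for t, theta, w in joint)
    e_tm = sum(w * t * m(t, theta) for t, theta, w in joint)
    cov_tm = e_tm - e_t * e_m   # Cov(T, m(T, theta))
    var_t = e_t * (1 - e_t)     # Var(T) for a binary variable
    return gamma0 + delta * cov_tm / var_t


def indicator_match(t, theta):
    """Match quality equals 1 only for treated workers of type 1."""
    return t * int(theta == 1)


# Hypothetical workforce: four equally likely types, half of the workers treated.
shares = {k: Fraction(1, 4) for k in (1, 2, 3, 4)}
plim = naive_plim(Fraction(1, 2), Fraction(1), Fraction(1, 2), shares, indicator_match)
print(plim)  # 3/4
```

With a direct effect of 1/2 and a type-1 share of 1/4, the naive coefficient converges to 3/4 rather than 1/2. Under these assumptions the result does not depend on the treatment probability, because the covariance and the variance of a binary treatment scale by the same factor.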

3.4. Identification Condition

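A transparent condition ensures consistent estimation of the direct policy effect. If match quality satisfies the additive separability condition in Equation (4),

m (T_{i}, θ_{i}) = a (T_{i}) + b (θ_{i}),

(4)

then $Cov (T_{i}, b (θ_{i})) = 0$ under random assignment, so the type-dependent part of match quality drops out of Equation (3). The remaining term, $Cov (T_{i}, a (T_{i}))$, does not vary with worker type. When $a (\cdot)$ is constant (equivalently, when any type-independent treatment component is counted as part of the direct effect $γ_{0}$),

Cov (T_{i}, m (T_{i}, θ_{i})) = 0,

and the conventional regression consistently estimates $γ_{0}$. Equation (4), with this normalization, therefore provides a transparent sufficient condition for unbiased identification of the direct policy effect.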

Additive separability implies that the incremental contribution of match quality does not depend on the interaction between policy exposure and worker type. In contrast, non-separability corresponds to complementarity between intervention and latent characteristics—a feature that is common in targeted HR policies.

Thus, independence between treatment and worker heterogeneity is not, by itself, sufficient to guarantee unbiased policy evaluation in practical HR settings. What matters is whether the outcome construction embeds interactions between assignment and type.

This condition is unlikely to hold in many practical HR settings where interventions are explicitly designed to target specific worker types.
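The contrast between separable and non-separable match quality can be checked with a short calculation. The sketch below relies on the fact that, for a binary treatment independent of type, the ratio $Cov (T_{i}, m) / Var (T_{i})$ equals $E [m (1, θ_{i}) - m (0, θ_{i})]$. The helper names, the particular $b (θ_{i})$, and the choice to hold $a (T_{i})$ constant at zero are assumptions made for illustration.

```python
from fractions import Fraction


def expected_match_gain(type_probs, m):
    """E[m(1, theta) - m(0, theta)], i.e. Cov(T, m) / Var(T) for binary T independent of theta."""
    return sum(q * (m(1, theta) - m(0, theta)) for theta, q in type_probs.items())


def workforce(pi1):
    """Four types: type 1 has share pi1, the other three split the remainder equally."""
    rest = (1 - pi1) / 3
    return {1: pi1, 2: rest, 3: rest, 4: rest}


def separable(t, theta):
    # a(T) is held constant at 0 (an assumption); only b(theta) varies.
    return Fraction(theta, 10)


def non_separable(t, theta):
    # Match gains accrue only to treated workers of type 1.
    return t * int(theta == 1)


gamma0, delta = Fraction(1, 2), Fraction(1)
results = {}
for pi1 in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    w = workforce(pi1)
    results[pi1] = (gamma0 + delta * expected_match_gain(w, separable),
                    gamma0 + delta * expected_match_gain(w, non_separable))
    print(pi1, *results[pi1])
```

In the separable case the limit stays at the direct effect for every composition, whereas in the non-separable case it shifts one-for-one with the share of type-1 workers.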

4. Simulation of Managerial Evaluation Outcomes

This simulation represents a stylized HR evaluation environment in which employee performance is used as the primary outcome measure and treatment effects may depend on latent worker characteristics. This structure captures typical HR evaluation systems in which performance metrics combine objective output measures and subjective assessments, both of which may depend on the match between interventions and worker characteristics.

To illustrate how randomized HR evaluations can mislead managerial decision-making, we simulate a mentoring program targeted at high-potential employees. The objective is not only to quantify evaluation bias, but also to examine how standard evaluation practices may lead managers to draw incorrect conclusions about program effectiveness when workforce composition varies.

4.1. Data-Generating Process

Match quality is specified as

m (T_{i}, θ_{i}) = T_{i} 1 {θ_{i} = 1},

capturing the idea that mentoring improves performance primarily for high-potential employees. The performance equation follows

Y_{i} = α + γ_{0} T_{i} + δ m (T_{i}, θ_{i}) + ε_{i},

with $γ_{0} = 0.5$, $δ = 1.0$, and $ε_{i} \sim N (0, 1)$.

Let $π_{1} = Pr (θ_{i} = 1)$ denote the share of high-potential employees in the workforce. The remaining three types each have probability $π_{2} = π_{3} = π_{4} = (1 - π_{1}) / 3$. We vary $π_{1}$ from 0.25 to 1.0 to examine how workforce composition affects evaluation bias.

The theoretical bias equals

Bias = δ π_{1},

implying that the estimated treatment effect increases proportionally with the prevalence of high-potential employees.
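The bias expression follows from Equation (3) once the indicator form of match quality is substituted. In the sketch below, $p = Pr (T_{i} = 1)$ is notation introduced here for the treatment probability, which the text does not specify; the steps use $T_{i}^{2} = T_{i}$ and the independence of $T_{i}$ and $θ_{i}$.

```latex
\begin{aligned}
\operatorname{Cov}\bigl(T_{i},\, T_{i} \mathbf{1}\{\theta_{i} = 1\}\bigr)
  &= E\bigl[T_{i}^{2} \mathbf{1}\{\theta_{i} = 1\}\bigr]
     - E[T_{i}] \, E\bigl[T_{i} \mathbf{1}\{\theta_{i} = 1\}\bigr] \\
  &= p \pi_{1} - p \cdot p \pi_{1}
   = \pi_{1} \, p (1 - p), \\
\operatorname{Var}(T_{i}) &= p (1 - p), \\
\text{Bias} = \delta \,
  \frac{\operatorname{Cov}(T_{i}, m(T_{i}, \theta_{i}))}{\operatorname{Var}(T_{i})}
  &= \delta \pi_{1} .
\end{aligned}
```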

The simulation uses $N = 2000$ observations and $R = 3000$ replications.
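The design described above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction, not the code behind the reported results: the intercept (`alpha = 0`), the treatment probability (`p_treat = 0.5`), the seed, and the function names are assumptions, and the three non-targeted types are pooled because they enter the outcome identically.

```python
import random
from math import sqrt
from statistics import fmean


def one_replication(rng, n, pi1, gamma0, delta, alpha=0.0, p_treat=0.5):
    """Draw one sample and return (naive, correctly specified) estimates of gamma0."""
    control, treated_other, treated_high = [], [], []
    for _ in range(n):
        treated = rng.random() < p_treat   # random assignment
        high = rng.random() < pi1          # theta_i = 1, drawn independently of T
        match = 1.0 if (treated and high) else 0.0
        y = alpha + gamma0 * treated + delta * match + rng.gauss(0.0, 1.0)
        if not treated:
            control.append(y)
        elif high:
            treated_high.append(y)
        else:
            treated_other.append(y)
    # OLS of Y on T alone: with a binary regressor the slope is a difference in means.
    naive = fmean(treated_other + treated_high) - fmean(control)
    # OLS of Y on T and m is saturated in the three cells, so the coefficient on T
    # is the mean of treated non-targeted workers minus the control mean.
    correct = fmean(treated_other) - fmean(control)
    return naive, correct


def monte_carlo(pi1, n=2000, reps=3000, seed=0, gamma0=0.5, delta=1.0):
    """Bias and RMSE of both estimators across replications."""
    rng = random.Random(seed)
    draws = [one_replication(rng, n, pi1, gamma0, delta) for _ in range(reps)]
    summary = {}
    for name, idx in (("naive", 0), ("correct", 1)):
        errors = [d[idx] - gamma0 for d in draws]
        summary[name] = {"bias": fmean(errors),
                         "rmse": sqrt(fmean(e * e for e in errors))}
    return summary
```

A call such as `monte_carlo(0.5, reps=200)` runs quickly and should return a naive bias near $δ π_{1} = 0.5$. The sketch applies to $π_{1} < 1$ only: at $π_{1} = 1$ every treated worker is targeted, so treatment and match quality coincide and the correctly specified regression cannot separate them.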

4.2. Results

Table 1 illustrates how managerial assessments of program effectiveness vary with workforce composition. It reports the theoretical bias ($δ π_{1}$), the simulated bias of the naive estimator that omits match quality, and the root mean squared error (RMSE) of both the naive and correctly specified estimators. The results confirm the analytical characterization: for each level of workforce composition, the simulated bias of the naive estimator closely aligns with the theoretical prediction $δ π_{1}$. The close alignment between analytical and simulated bias indicates that the distortion arises directly from the interaction structure rather than from small-sample artifacts.

As the share of high-potential workers ($π_{1}$) increases, the magnitude of bias rises proportionally, even though the intrinsic policy effect $γ_{0}$ remains constant across scenarios. As a result, identical mentoring programs would appear systematically more effective in organizations with a higher share of high-potential employees. Consequently, managers relying on standard evaluation estimates may incorrectly rank programs or organizational units, leading to misleading performance comparisons.

The RMSE of the naive estimator also increases with $π_{1}$, reflecting the growing covariance between treatment and interaction-based match quality. In contrast, the correctly specified estimator remains centered around zero bias and exhibits substantially smaller RMSE across all composition levels. This further implies that evaluation error grows with the share of high-potential employees, making decision-making more difficult precisely in settings where the intervention appears most effective.
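Because RMSE combines squared bias and sampling variance, a rough calculation helps show which component drives the pattern. The sketch below uses a large-sample approximation for the naive difference in means; the treatment probability of 0.5, the function name, and the approximation of group sizes by their expectations are assumptions.

```python
from math import sqrt


def approx_naive_rmse(pi1, n=2000, delta=1.0, sigma=1.0, p_treat=0.5):
    """Approximate (rmse, bias, standard deviation) of the naive estimator."""
    bias = delta * pi1
    # Treated outcomes carry extra variance from the Bernoulli(pi1) match term.
    var_treated = sigma ** 2 + delta ** 2 * pi1 * (1 - pi1)
    var_control = sigma ** 2
    variance = var_treated / (n * p_treat) + var_control / (n * (1 - p_treat))
    return sqrt(bias ** 2 + variance), bias, sqrt(variance)


for pi1 in (0.25, 0.5, 0.75, 1.0):
    rmse, bias, sd = approx_naive_rmse(pi1)
    print(f"pi1={pi1:.2f}  rmse={rmse:.3f}  bias={bias:.3f}  sd={sd:.3f}")
```

Under these assumptions the standard deviation stays below 0.05 at every composition level, so the increase in the naive estimator's RMSE is almost entirely a bias effect.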

Figure 1 shows that the estimated effect systematically overstates program effectiveness as the share of targeted workers increases, leading to a misleading ranking of program performance across organizations. The simulated bias of the naive estimator closely follows the analytical prediction $δ π_{1}$, increasing linearly with the share of high-potential workers. The correctly specified estimator remains centered around zero across all scenarios. These results confirm that the bias arises directly from the interaction structure embedded in outcome construction.

Figure 2 displays the corresponding RMSE results. As the share of high-potential workers increases, the RMSE of the naive estimator rises substantially, whereas that of the correctly specified estimator remains comparatively stable.

The results are robust to alternative treatment assignment probabilities and to different variance levels of the disturbance term (not reported for brevity).

More broadly, concerns about outcome measurement and interpretation have been emphasized in recent work (Baker et al., 2022; Bond & Lang, 2019).

4.3. Illustrative Example: Composite Performance Evaluation

To illustrate the practical relevance of the mechanism, consider a firm that evaluates employees using a composite performance score. Let $S_{i}$ denote the annual evaluation score for worker i, constructed as a weighted index of output quantity, project completion, and mentor assessment:

S_{i} = w_{1} Q_{i} + w_{2} P_{i} + w_{3} M_{i},

where $Q_{i}$ denotes output quantity, $P_{i}$ denotes project completion or task performance, $M_{i}$ denotes mentor or supervisor assessment, and $w_{1}, w_{2}, w_{3}$ are the corresponding weights.

Composite evaluation systems of this kind are common in practice, where firms frequently aggregate multiple performance dimensions into a single score used for promotion or compensation decisions.

Consider a setting in which participation in a mentoring program directly improves productivity by $γ_{0}$, while also affecting the mentor assessment component $M_{i}$ through an interaction with worker type. For instance, mentors may provide higher ratings to workers whose learning style aligns with the mentoring format.

Under this structure, the evaluation score can be written as

S_{i} = α + γ_{0} T_{i} + δ T_{i} \cdot g (θ_{i}) + ε_{i},

where $T_{i}$ denotes participation in the mentoring program, $γ_{0}$ denotes the direct productivity effect, $g (θ_{i})$ denotes type-dependent responsiveness in evaluation (e.g., mentor assessment), and $δ$ scales the contribution of this interaction.

Even under random assignment of the mentoring program, the composite score embeds an interaction between treatment and latent type. A regression of $S_{i}$ on $T_{i}$ alone therefore combines the direct productivity effect with the match-dependent evaluation component. The resulting estimate may reflect the aggregation of heterogeneous responses induced by the evaluation formula rather than the intrinsic effectiveness of the policy.

As a result, managers may attribute higher evaluation scores to superior program design, when they actually reflect the interaction between the evaluation system and workforce composition.
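A small numerical illustration may help fix ideas. Under the score equation above and random assignment, a regression of $S_{i}$ on $T_{i}$ alone converges to $γ_{0} + δ E [g (θ_{i})]$. The sketch below applies this to two hypothetical business units running the same program; the type labels, the values of $g$, and the unit compositions are invented for illustration.

```python
def naive_score_effect(gamma0, delta, type_shares, g):
    """Limit of the slope from regressing S on T alone: gamma0 + delta * E[g(theta)]."""
    expected_g = sum(share * g[theta] for theta, share in type_shares.items())
    return gamma0 + delta * expected_g


# Hypothetical responsiveness of mentor ratings by worker type.
g = {"aligned": 1.0, "neutral": 0.0}

# Same program (gamma0 = 0.5, delta = 1.0), different workforce composition.
unit_a = {"aligned": 0.7, "neutral": 0.3}
unit_b = {"aligned": 0.2, "neutral": 0.8}

effect_a = naive_score_effect(0.5, 1.0, unit_a, g)
effect_b = naive_score_effect(0.5, 1.0, unit_b, g)
print(round(effect_a, 3), round(effect_b, 3))
```

The two units share the same direct productivity effect, yet the estimated effects differ by the gap in their shares of aligned workers, which is the composition-driven ranking problem described in the text.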

5. Practical Implications for HR Evaluation

The analysis yields several implications for managerial decision-making in HR evaluation. Although the framework developed in this paper is theoretical, it maps directly onto practical HR evaluation settings.

Beyond interpretation, the bias identified in this paper has direct implications for organizational decision-making. When estimated treatment effects reflect workforce composition rather than intrinsic program effectiveness, managers may systematically favor interventions that align with the current workforce rather than those that are inherently more productive.

In economic terms, evaluation bias can thus translate into systematic misallocation of organizational resources, as programs targeted toward more prevalent worker types receive disproportionate investment. In dynamic settings, such misallocation may also distort organizational learning, as firms update beliefs based on biased performance metrics.

In light of these considerations, managers should carefully assess whether performance measures incorporate interaction-based match components. In many practical settings, evaluation metrics such as composite performance scores implicitly embed interactions between interventions and worker characteristics. When such interactions are present, estimated treatment effects may not reflect intrinsic program effectiveness.

Managers should avoid relying solely on average treatment effects when comparing HR interventions. Differences in estimated effects across units or firms may be driven by workforce composition rather than differences in program quality.
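To illustrate the point, the sketch below simulates two units that run the same program but differ in workforce composition. All numerical values (`gamma0`, `delta`, the type shares, the sample size) and the function name `naive_effect` are hypothetical choices made for illustration, and the type-dependent component is assumed to be a simple indicator for the responsive worker type.

```python
import numpy as np

def naive_effect(pi1, gamma0=0.5, delta=1.0, n=200_000, seed=0):
    """Difference in mean scores between treated and control workers.

    pi1    : share of the responsive worker type in the unit (hypothetical)
    gamma0 : direct program effect, identical across units (hypothetical)
    delta  : weight of the treatment-type interaction in the score (hypothetical)
    """
    rng = np.random.default_rng(seed)
    t = rng.integers(0, 2, size=n)        # random assignment with P(T = 1) = 0.5
    high = rng.random(n) < pi1            # latent type, drawn independently of T
    s = gamma0 * t + delta * t * high + rng.normal(size=n)
    return s[t == 1].mean() - s[t == 0].mean()

# Same program (same gamma0 and delta), different workforce composition.
unit_a = naive_effect(pi1=0.25, seed=1)
unit_b = naive_effect(pi1=0.75, seed=2)
print(f"unit A estimate: {unit_a:.2f}, unit B estimate: {unit_b:.2f}")
```

Under these assumptions the two estimates come out near 0.75 and 1.25, a gap that reflects composition alone because the program is identical in both units.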

Evaluation designs should explicitly account for treatment–type interaction. In practice, this may involve incorporating observable proxies for worker heterogeneity, conducting subgroup analysis, or testing the robustness of results across different workforce compositions.
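One way such a design could look is sketched below using NumPy least squares. The sketch assumes that an observable binary proxy `x` measures worker type without error, which is a strong simplification; the parameter values and the helper `ols` are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n, pi1, gamma0, delta = 50_000, 0.6, 0.5, 1.0   # hypothetical values

t = rng.integers(0, 2, size=n).astype(float)    # randomized treatment
x = (rng.random(n) < pi1).astype(float)         # observed proxy for worker type (assumed exact)
s = 2.0 + gamma0 * t + delta * t * x + rng.normal(size=n)

def ols(columns, y):
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Specification 1: score on treatment only.
naive = ols([t], s)[1]

# Specification 2: treatment, proxy, and their interaction.
b = ols([t, x, t * x], s)
direct_effect, interaction = b[1], b[3]

# Subgroup analysis: difference in means within each proxy group.
by_group = {
    int(v): s[(t == 1) & (x == v)].mean() - s[(t == 0) & (x == v)].mean()
    for v in (0.0, 1.0)
}

print(f"naive: {naive:.2f}, direct: {direct_effect:.2f}, interaction: {interaction:.2f}")
print(by_group)
```

In this setup the treatment-only coefficient mixes the direct effect with the interaction weighted by the type share, while the interacted specification and the subgroup contrasts separate the two. With a noisy proxy the separation would be only partial.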

Managers should exercise caution when using evaluation results for resource allocation decisions. Programs targeted at more prevalent worker types may appear more effective and thus receive disproportionate investment, even if their intrinsic impact is not superior.

Finally, firms should recognize that randomized assignment, while essential for causal identification, does not by itself guarantee reliable managerial inference when outcome construction embeds interaction between assignment and worker heterogeneity. Careful evaluation design and interpretation are therefore critical in HR analytics.

6. Conclusions

This paper addresses a subtle yet consequential identification issue in evaluating targeted HR policies under random assignment. In practice, performance outcomes depend not only on direct policy exposure but also on interaction-based match components that vary with latent worker characteristics. When such interactions are embedded in outcome construction, conventional evaluation regressions may fail to isolate the direct policy effect. Importantly, this paper does not challenge the use of randomized HR evaluations but rather clarifies how their results should be interpreted when outcome measures incorporate such interaction effects.

Our analysis confirms that random assignment ensures independence between treatment and underlying worker heterogeneity. However, random assignment does not by itself ensure that a regression of a composite outcome on treatment isolates the direct policy effect when the outcome embeds assignment-dependent components. We identify a transparent condition, additive separability between policy exposure and worker type within the match function, that allows consistent estimation of the intrinsic policy effect. When this condition does not hold, estimated treatment effects can partially reflect workforce composition rather than the true effectiveness of the intervention.
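A short sketch of why separability matters is given below. The match function $m(T_{i}, θ_{i})$ and the component functions $a(\cdot)$ and $b(\cdot)$ are hypothetical notation introduced here for illustration and may differ from the notation used in Section 3.

```latex
\begin{aligned}
S_i &= \alpha + \gamma_0 T_i + \delta\, m(T_i, \theta_i) + \varepsilon_i,
\qquad m(T_i, \theta_i) = a(T_i) + b(\theta_i), \\[4pt]
E[S_i \mid T_i = 1] - E[S_i \mid T_i = 0]
  &= \gamma_0 + \delta\,[a(1) - a(0)]
   + \delta\,\bigl(E[b(\theta_i) \mid T_i = 1] - E[b(\theta_i) \mid T_i = 0]\bigr) \\
  &= \gamma_0 + \delta\,[a(1) - a(0)] .
\end{aligned}
```

The last step uses the independence of $θ_{i}$ and $T_{i}$ under random assignment, so the type distribution drops out and the contrast does not vary with workforce composition. With a non-separable form such as $m(T_{i}, θ_{i}) = T_{i} \cdot g (θ_{i})$, the same contrast equals $γ_{0} + δ E[g (θ_{i})]$, which does depend on composition.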

Monte Carlo simulations illustrate that the bias of the naive estimator grows with the share of targeted worker types, tracking the theoretical value $δ π_{1}$ (Table 1), even when the intrinsic policy effect remains unchanged. These simulations are designed to mimic a stylized HR evaluation setting in which employee performance is used as the primary outcome measure and treatment effectiveness depends on latent worker characteristics. This underscores the importance of accounting for match structures when comparing HR interventions across firms, units, or workforce compositions. Misinterpretation in these contexts can have real managerial consequences, such as misallocation of resources or suboptimal program selection.
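A simulation of this kind can be sketched in a few lines. The code below is not the author's implementation: it assumes a binary latent type, equal-probability assignment, standard normal noise, an interaction weight of 1, and a hypothetical direct effect of 0.5, and it uses fewer replications than the paper to keep the run short. The function name `simulate` is likewise an illustrative choice.

```python
import numpy as np

def simulate(pi1, n=2000, reps=300, gamma0=0.5, delta=1.0, seed=0):
    """Return (bias of naive, RMSE of naive, RMSE of correct) for type share pi1."""
    rng = np.random.default_rng(seed)
    naive = np.empty(reps)
    correct = np.empty(reps)
    for r in range(reps):
        t = rng.integers(0, 2, size=n).astype(float)   # random assignment
        g = (rng.random(n) < pi1).astype(float)        # latent type indicator
        s = gamma0 * t + delta * t * g + rng.normal(size=n)
        # Naive estimator: difference in means, interaction omitted.
        naive[r] = s[t == 1].mean() - s[t == 0].mean()
        # Correct estimator: regression that includes the interaction term.
        design = np.column_stack([np.ones(n), t, t * g])
        correct[r] = np.linalg.lstsq(design, s, rcond=None)[0][1]
    bias = float(naive.mean() - gamma0)
    rmse_naive = float(np.sqrt(np.mean((naive - gamma0) ** 2)))
    rmse_correct = float(np.sqrt(np.mean((correct - gamma0) ** 2)))
    return bias, rmse_naive, rmse_correct

for pi1 in (0.25, 0.60, 0.95):
    bias, rmse_naive, rmse_correct = simulate(pi1)
    print(f"pi1={pi1:.2f}  bias={bias:.3f}  "
          f"rmse_naive={rmse_naive:.3f}  rmse_correct={rmse_correct:.3f}")
```

Under these assumptions the naive bias should sit close to the type share multiplied by the interaction weight, while the estimator that includes the interaction stays centered on the direct effect, with precision that falls as fewer non-responsive treated workers remain.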

The framework presented here is deliberately parsimonious, abstracting from dynamic adjustments, equilibrium responses, and learning effects. It provides a theoretical foundation that complements practical evaluation settings commonly encountered in HR management. Its aim is not to question the value of randomized evaluations but to clarify conditions under which standard specifications may fail in targeted assignment settings. By highlighting the role of treatment–type interactions in outcome construction, this work seeks to improve both the empirical design and interpretation of randomized HR policy evaluations.

From an economic perspective, the identification problem highlighted in this paper implies that evaluation bias can lead to systematic misallocation of resources when managerial decisions are based on composition-driven performance metrics. In environments where firms rely heavily on data-driven evaluation, such distortions may accumulate over time, affecting productivity, organizational learning, and long-run resource allocation.

From a managerial perspective, the key takeaway is that evaluation results should be interpreted in the context of workforce composition and performance measurement design. In this sense, the paper contributes not only to econometric identification but also to the HR analytics literature by clarifying how performance measurement design affects managerial interpretation of evaluation results. Managers comparing HR programs across units or over time should recognize that observed differences in estimated treatment effects may reflect composition-driven interactions rather than genuine differences in program efficacy. Careful interpretation, coupled with attention to outcome construction, is essential to avoid systematic misallocation of organizational resources.

Funding

This work was supported by JSPS KAKENHI, grant number 25K05043.

Data Availability Statement

No external data were used. All data were generated through simulation.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HR	Human resources
RMSE	Root mean squared error

References

Allcott, H. (2015). Site selection bias in program evaluation. The Quarterly Journal of Economics, 130, 1117–1165.
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and analytics: Why HR is set to fail the big data challenge. Human Resource Management Journal, 26(1), 1–11.
Athey, S., & Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31, 3–32.
Athey, S., & Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational Studies, 5, 37–51.
Baker, A. C., Larcker, D. F., & Wang, C. C. Y. (2022). How much should we trust staggered difference-in-differences estimates? Journal of Financial Economics, 144, 370–395.
Bloom, N., Liang, J., Roberts, J., & Ying, Z. J. (2015). Does working from home work? Evidence from a Chinese experiment. The Quarterly Journal of Economics, 130(1), 165–218.
Bond, T. N., & Lang, K. (2019). The sad truth about happiness scales. Journal of Political Economy, 127, 1629–1640.
Boudreau, J., & Cascio, W. (2017). Human capital analytics: Why are we not there? Journal of Organizational Effectiveness: People and Performance, 4(2), 119–126.
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225, 200–230.
Carneiro, P., Heckman, J. J., & Vytlacil, E. (2011). Estimating marginal returns to education. American Economic Review, 101, 2754–2781.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21, C1–C68.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424–455.
Eeckhout, J., & Kircher, P. (2011). Identifying sorting: In theory. The Review of Economic Studies, 78, 872–906.
Heckman, J. J., Smith, J. A., & Clements, N. (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies, 64, 487–535.
Heckman, J. J., & Vytlacil, E. (2005). Structural equations, treatment effects, and econometric policy evaluation. Econometrica, 73(3), 669–738.
Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs. In Handbook of econometrics (Vol. 6, pp. 4779–4874). Elsevier.
Hoffman, M., Kahn, L. B., & Li, D. (2018). Discretion in hiring. The Quarterly Journal of Economics, 133(2), 765–800.
Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, 38, 635–672.
Ichniowski, C., Shaw, K., & Prennushi, G. (1997). The effects of human resource management practices on productivity: A study of steel finishing lines. The American Economic Review, 87, 291–313.
Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–475.
Jovanovic, B. (1979). Job matching and the theory of turnover. Journal of Political Economy, 87, 972–990.
Knaus, M. C., Lechner, M., & Strittmatter, A. (2021). Machine learning estimation of heterogeneous causal effects. The Econometrics Journal, 24, C134–C161.
Lazear, E. P. (1995). Personnel economics. MIT Press.
Lazear, E. P., & Oyer, P. (2012). Personnel economics. In R. Gibbons, & J. Roberts (Eds.), Handbook of organizational economics (pp. 479–519). Princeton University Press.
Levenson, A. (2018). Using workforce analytics to improve strategy execution. Human Resource Management, 57, 685–700.
Marler, J. H., & Boudreau, J. W. (2017). An evidence-based review of HR Analytics. The International Journal of Human Resource Management, 28(1), 3–26.
Minbaeva, D. (2018). Building credible human capital analytics for organizational competitive advantage. Human Resource Management, 28, 453–467.
Rasmussen, T., & Ulrich, D. (2015). Learning from practice: How HR analytics avoids being a management fad. Organizational Dynamics, 44(3), 236–242.
Sattinger, M. (1993). Assignment models of the distribution of earnings. Journal of Economic Literature, 31(2), 831–880.
Shimer, R., & Smith, L. (2000). Assortative matching and search. Econometrica, 68, 343–369.
Vivalt, E. (2020). How much can we generalize from impact evaluations? Journal of the European Economic Association, 18(6), 3045–3089.

Figure 1. Bias of the naive estimator as a function of the share of high-potential workers ($π_{1}$). The dashed line represents the theoretical bias ($δ π_{1}$), and the markers denote Monte Carlo averages. The naive estimator exhibits increasing bias as $π_{1}$ rises, consistent with the analytical prediction $δ π_{1}$, while the correctly specified estimator remains centered around zero across all configurations. These results illustrate that, even under random assignment, estimated treatment effects can be systematically distorted by interaction structures embedded in the outcome measure.

Figure 2. Root mean squared error (RMSE) of the naive and correctly specified estimators as a function of the share of high-potential workers ($π_{1}$). The RMSE of the naive estimator increases as $π_{1}$ rises, whereas the correctly specified estimator remains comparatively stable. This indicates that estimation error increases as the share of workers who benefit from the intervention rises.

Table 1. Monte Carlo performance under targeted mentoring.

Share of High-Potential Workers ($π_{1}$)	Theoretical Bias ($δ π_{1}$)	Simulated Bias (Naive)	RMSE (Naive)	RMSE (Correct)
0.25	0.250	0.250	0.254	0.049
0.32	0.320	0.320	0.323	0.049
0.39	0.390	0.391	0.393	0.051
0.46	0.460	0.460	0.462	0.053
0.53	0.530	0.529	0.531	0.056
0.60	0.600	0.599	0.601	0.060
0.67	0.670	0.670	0.671	0.063
0.74	0.740	0.741	0.743	0.069
0.81	0.810	0.810	0.811	0.079
0.88	0.880	0.880	0.881	0.099
0.95	0.950	0.951	0.953	0.144

Notes: The table reports Monte Carlo simulation results based on the data-generating process described in Section 4. The sample size is $N = 2000$ with $R = 3000$ replications. The naive estimator omits the interaction-based match component, while the correct estimator includes it. Theoretical bias is given by $δ π_{1}$.
