Optimized Probes of the CP Nature of the Top Quark Yukawa Coupling at Hadron Colliders

We summarize our recent proposals for probing the CP-odd iκ˜t¯γ5th interaction at the LHC and its projected upgrades directly using associated on-shell Higgs boson and top quark or top quark pair production. We first recount how to construct a CP-odd observable based on top quark polarization in Wb→th scattering with optimal linear sensitivity to κ˜. For the corresponding hadronic process pp→thj we then present a method of extracting the phase-space dependent weight function that allows to retain close to optimal sensitivity to κ˜. For the case of top quark pair production in association with the Higgs boson, pp→tt¯h, with semileptonically decaying tops, we instead show how one can construct manifestly CP-odd observables that rely solely on measuring the momenta of the Higgs boson and the leptons and b-jets from the decaying tops without having to distinguish the charge of the b-jets. Finally, we introduce machine learning (ML) and non-ML techniques to study the phase-space optimization of such CP-odd observables. We emphasize a simple optimized linear combination α·ω that gives similar sensitivity as the studied fully fledged ML models. Using α·ω we review sensitivity projections to κ˜ at HL-LHC, HE-LHC, and FCC-hh.


Introduction
The interaction between the heaviest particles of the Standard Model (SM), the top quark t and the Higgs boson h, is an important target for the LHC experiments. It is precisely predicted within the SM. The measured top quark mass m t and the electroweak condensate value v precisely determine the on-shell scalar (P-and CP-even) coupling y t = √ 2m t /v, while P-and CP-odd interactions are absent. Beyond the SM, effective operators of dimension-6 can break this correlation and result in more general (pseudo)scalar tth couplings κ (κ) [1] L ht = − y t which reduce to the SM case at κ = 1,κ = 0. CP-violating h couplings, likeκ, are particularly interesting as any sign of CP violation in Higgs processes would constitute an indisputable New Physics (NP) signal. Existing data on Higgs production and decays is already precise enough to constrain any isolated modification of the top Yukawa to O(1) [2][3][4]. However, all existing measurements are based on CP-even observables with very limited sensitivity to CP-odd modifications of the top quark Yukawa. In principle, indirect collider bounds from Higgs decay and production (gg → h, h → γγ), and especially the low-energy bounds on electric dipole moments (EDMs) of atoms and nuclei that target specifically CP-odd effects [2,5,6], are currently more constraining than direct collider probes. However, these constraints are subject to assumptions about other Higgs interactions, and in particular in the case of EDMs also other contributions unrelated to the Higgs. At the LHC it is possible to probe these couplings directly with two of the particles in Equation (1) on-shell (Since m h < 2m t one cannot probe these couplings with all the three particles on-shell) in top-Higgs associated production processes pp → thj and pp → tth [2][3][4][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23] (The loop induced partonic process gg → h → tt depends on κ 2 , κ, andκ 2 already on the production side as it is dominated by the top quark loop [5]). The corresponding total cross sections scale as κ 2 ,κ 2 (thj also as κ), and are thus poorly sensitive to small nonzeroκ. Linear sensitivity toκ on the other hand can be achieved by measuring P-and CP-odd observables.
We have recently addressed this challenge by identifying observables with optimal sensitivity to a single CP-odd parameter in both th and tth associated production at the LHC, which can be realistically measured and exhibit close to optimal sensitivity to CP-odd interactions between the Higgs boson and the top quark [24]. The proposed observables in th are based on optimization of top-spin correlations previously studied in tt production [25]. Unfortunately, the overwhelming irreducible backgrounds make the thj channel impractical. On the other hand, in the case oftth this procedure becomes intractable in practice and our construction relies instead on CP-and P-symmetry arguments. In total, we can identify 21 different CP-odd observables that can be constructed out of 5 measurable final state momenta and an additional triple-product asymmetry [26]. Namely, assuming pp → tth production with semileptonically decaying tops, we combine the final state lepton momenta p ℓ + , p ℓ − , two b-jet momenta p b , pb from decaying top quarks (although without discriminating their charges (Efficient b-jet charge discrimination could allow to construct further CP-odd observables, see e.g., [6,20])) and the Higgs momentum p h in different ways to construct C-even, P-odd laboratory frame observables ω i [24].
Due to the high dimensionality and complexity of the phase-space in this process with top quarks decaying semileptonically, using the complete kinematical information accessible experimentally to construct an optimal CP-odd observable is challenging. To this end we have employed neural networks (NN) trained on Monte-Carlo generated samples to efficiently parametrize the weight function of events across the multi-dimensional phasespace in order to maximize the statistical sensitivity toκ [27]. We show how the required Pand CP-symmetry properties of the NN-based observables can be imposed a priori. Finally, we compare in terms of optimality, a general CP-odd NN function of the phase-space to a linear combination of manifestly CP-odd variables.
The present paper serves as a pedagogical review of the work first reported in Refs. [24,27].

Parton Level Wb → th Analysis
We begin by studying the effects ofκ on top spin observables in the idealized case of single top quark production in partonic process W(p)b(p ′ ) → t(k)h(k ′ ). Here the complete polarized scattering amplitudes can be found in a compact analytic form. This process can actually be connected to a more realistic pp → thj production in the high energy limit, where the W and b quark mass effects are negligible and the collinear emission of both initial state 'partons' can be described by the corresponding parton distribution functions. (See e.g., Section 3 of Ref. [28] for an extended discussion on the validity of this approximation) Three diagrams contribute to such parton level Higgs-top production in the SM, shown in Figure 1. Neglecting furthermore the mass (and thus the corresponding Yukawa coupling) of the bottom quark, we consider only the first two of the diagrams in Figure 1. The formalism presented here is based on Refs. [25,29,30]. First, we introduce the spin projection operator [31] where s µ is a top spin four-vector, defined in a general frame as Vector k is the top quark momentum andŝ is an arbitrary unit vector that represents polarization of the top quark. The physical significance ofŝ is revealed if we make a rotationfree boost to the top rest frame where we find s * µ = (0,ŝ). Note that spatial component of four-vector x µ in this case transforms as m t k upon boost to the top rest frame. Therefore s 2 = −1, s · k = 0, andŝ corresponds to the polarization of the top quark in its rest frame. Projection onto a well defined polarization of the top quark is achieved by placing the projector (2) in front of the bispinors at the amplitude level: Sum over top quark polarizations r then becomes and the cross-section is linear in s µ Here the polarization vectorŝ is arbitrary while b µ contains all the information about the polarization of the top quark in the given process. The parton level cross section can be written as where Φ in is the initial state flux normalization and dΓ th is the th phase space volume.
On the other hand, in the top rest frame it is convenient to introduce the spin density matrix as such that the unpolarized cross section is proportional to |M| 2 = Tr[ρ] = 2A. Here σ are the Pauli matrices. In the density matrix formalism, the expectation value of a generic operator is obtained as In particular, the polarized cross section alongŝ is obtained as the expectation value of the projector: One can determine the rest-frame coefficients A, B i from a, b µ by comparing the expressions for polarized |M| 2 , expressed via Equations (7) and (11). The result of this matching are explicit expressions: The rest-frame polarization of the top quark along a vectorŝ is given by the expectation value of Oŝ =ŝ · σ 2 , Oŝ = B ·ŝ. (13) This observable can be determined for example by measuring the angular distribution of the charged lepton in the semi-leptonic top decay t → b(W → ℓν) since the charged lepton in top decay is considered to be an almost perfect top spin analyzer, i.e., the angular decay distribution vanishes when the lepton momentum is opposite to the spin of t [32]. The angular distribution in the top rest frame then allows for experimental extraction of the B i coefficients [32]. Here θ ℓ is an angle between the lepton and the polarization axisŝ in the top rest frame. The above construction shows that the vectorŝ is an arbitrary unit vector defined in the laboratory frame. A particular choiceŝ = k/|k| implies that Oŝ measures the top quark helicity. Another natural choice forŝ is the W momentump, also known as the beam basis, which has to be carefully defined in actual pp collisions where the (symmetric) initial state does not allow to define forward and backward directions. Experimentally one has to reconstruct the top quark rest frame in order to be able to trace the angular distribution of the lepton with respect to the chosenŝ and gain access to the coefficients B i . In the following we will show how to optimize the choice ofŝ such that the sensitivity to the CP-violating parameterκ is maximized.
In the Wb center-of-mass frame we can define the W and t momenta aŝ p = (0, 0, 1), where θ is the angle between the direction of the top quark and the W boson. We have set the azimuthal angle φ = 0 without loss of generality. The polarization vector components B i in this case depend on x = cos θ. Evaluation of the two diagrams in Figure 1 leads to the coefficients B 1,2,3 (x) of the polarized cross-section. It turns out that in the coordinate system in Equation (15) the analytical expression for B 2 (x) is linear inκ, whereas B 1,3 do not contain linearκ terms. Effectively this means that we should choose the vectorŝ to be orthogonal to the plane spanned by the W and t momenta in order to probeκ with linear sensitivity. Similar results have been reported in Ref. [18]. In pursuit of maximal sensitivity toκ we choose the polarization vector in each event asŝ =p ×k/|p ×k|. In this case an interesting experimental quantity a two-fold differential cross-section where we have approximated the intermediate top quark as a narrow resonance and Σ(x,ŝ) = dσ/dx(Wb → t (ŝ) h) is the differential production cross section for the top quarks polarized in theŝ direction. Using Equation (11) and insertingŝ we have Σ(x, ±ŝ) = Φ in (A(x) ±κβ(x)), where Φ in is the initial flux normalization. Thus we can write Equation (17) as Treatingκ as a small perturbation we can integrate the distribution in Equation (18) with a phase-space dependent function f that would maximize statistical sensitivity of the integral toκ. It has been shown in Refs. [33,34] that such an optimal function should be the ratio of theκ-perturbation to the unperturbed distribution, in our case f (x, cos θ ℓ ) =

β(x)
A(x) cos θ ℓ . The optimal observable is thus where θ ℓ is the angle betweenŝ and the lepton momentum in the top center-of mass-frame, as defined in the preceding paragraph. The index i = 1, . . . , N labels individual events. The prediction scales as β 2 , where we have integrated over cos θ ℓ and left the bounds for x = cos θ unspecified. The function β(x) is plotted in Figure 2.
To carry over the presented formalism to the realistic case of pp collisions, we have to adapt the beam axis by referring only to experimentally accessible momenta. Using the reconstructed top momentum k as a reference, we define the positive z-direction as the parallel top quark momentum projectionk . The top quark is then by definition in the positive hemisphere,x = cosθ ≥ 0, whereθ is the angle between k andk . The optimal polarization direction with linearκ sensitivity now becomesŝ =k ×k ⊥ upon which an experiment should measure the lepton angleθ ℓ . The cross-section distributions inx and x are related via whereÃ The cos θ ℓ is flipped in the second term since forx = −x the polarization vector s =k ×k ⊥ flips the direction compared to the previous definition,ŝ ∼ p × k. The optimal observable in this case is finallỹ where the optimal weight function is again taken to be the ratio of theκ-perturbation to the unperturbed distribution (see Equation (21)), in line with Refs. [33,34]. In terms of experimental N data points with reconstructedθ ℓ andx the observable is obtained as the following average:Õ Wb→th opt.
. However in general theÕ Wb→th opt.
is expected to result in a weaker statistical significance due to our inability to determine the direction of the top quark with respect to the initial W. In Figure 2 we show that β(x) is large at negative x and we haveβ(x) ≈ −β(−x), for representative values of the center-of-mass energy √ s.

Hadronic Process pp → thj
Here we demonstrate how to carry over the optimal CP-violating observable from parton level to the realistic case of pp collisions, but still neglecting reconstruction efficiencies and backgrounds. The parton level observable defined in Equation (23) can be adapted to this case with an additional integration over the parton distribution functions (PDFs). Since the hadronic cross section is a convolution of partonic cross sections it can be split into aκ-independent piece and the small perturbation proportional toκ, similar to the partonic cross section in Equation (21). Assuming that the Higgs decays into visible states, the missing p T is only due to the neutrino originating from the top decay. Thus we can reconstruct the top quark momentum and kinematic quantities of Equation (21). For hadronic collisions one can express the cross section as and weigh the events with the optimal f opt. ∝ cosθ ℓ B/A. To demonstrate the procedure we have used the Monte Carlo event generator MadGraph5 [35,36] together with the Higgs Characterisation UFO model [37,38] (for an analysis of next-to-leading order QCD and next-to-next-to leading EW effects see Refs. [8,17,39], respectively) to incorporate the κ and κ couplings in the simulation of the pp → t(→ bℓν)hj signal. The procedure of extracting the optimal weight function B/A from MC simulations and using it to produce the optimal observable goes as follows: 1.

2.
Fixκ and extract from the MC simulation the mean cosθ ℓ in each of thex bins. The obtained value corresponds to the weight 1 3 B/A in this bin, see Equation (25).

3.
Use this information to weigh experimental events bin-by-bin with f opt. ∝ cosθ ℓ B/A.
The normalization of f opt. is fixed by the requirement dxB/A = 1.
This optimization procedure is independent of theκ value as long asκ is sufficiently small. The resulting optimal weight B/A is shown in Figure 3, where we compare it for different final states (thj orthj) and collision energies (14 or 27 TeV). To assess the stability of the proposed method against higher order corrections we have extracted the weight function from simulations at NLO in QCD to estimate the systematic uncertainty associated with QCD effects and found that the difference is within 10% of the LO extraction. Finally, we can compare our optimized approach to the naïveκ extraction through the measurement of O naïve = cos θ ℓ , which in turn corresponds to the case where the weight is independent ofx, i.e., B/A = 1. We define the signal significance of an observable O as In Figure 4 we show the improvement of the significance when the optimal weight function is applied on simulated signal events without showering or reconstruction effects at 14 TeV. TeV proton collision energies for pp → thj . All plots are obtained usingκ = 1 and with 10 6 MC events [24].

Limits in the (κ,κ) Plane from pp → thj at Event Reconstruction Level
In order to make closer contact with experiments, we can also include the effects of parton showering, detector response and background processes. In our analysis we have used MadGraph5 to generate events at leading order (LO) in QCD for the signal process pp → t(→ bℓν)h(→ bb)j plus the conjugate process witht at 14 TeV High-Luminosity LHC (HL-LHC) and 27 TeV High-Energy LHC (HE-LHC) center-of-mass energies. (Note that our procedure of obtaining an optimal observable does not depend on the h decay products, therefore this analysis should be taken as a proof of concept with potential for future improvements using e.g., multiple h decay channels) Event generation was performed for multiple values of (κ,κ). The parton level events were subsequently showered and hadronized with Pythia8 [40], and jets are clustered with the anti-k T algorithm using FastJet [41]. For detector simulation and final state object reconstruction (e.g., lepton isolation and b-tagging) we used Delphes v3.3.3 [42] with the default ATLAS parameters in delphes_card_ATLAS.tcl. The dominant background process in this analysis is tt production with additional associated jets. We included this background by generating pp → tt samples, with one of the tops decayed into the semi-leptonic channel and the other one decayed into the hadronic channel, produced in association with 0, 1, and 2 hard jets. In order to correctly model the hard jets' distributions, we merged the matrix element computations with the MC shower using the MLM [43] prescription. For the event selection we demand the following basic requirements: One additional (non-tagged) light jet exclusively in the forward direction with 2 < |η(j)| < 5 and p T (j) > 20 GeV, • One isolated light lepton ℓ ± = e ± , µ ± with |η(ℓ)| < 2.5 and p T (ℓ) > 10 GeV.
In addition, we further select events with one reconstructed Higgs and one reconstructed top quark as follows: first, we calculate the three possible invariant masses from the three reconstructed b-jets (m bb ) and only keep the event if at least one bb pair satisfies |m H − m bb | < 15 GeV. For such events, we select as the Higgs decay candidate h → bb for the pair of b-jets with the invariant mass closest to the Higgs mass. The remaining non-Higgs b-jet is then assumed to come from the top-quark decay. Next, we reconstruct the top-quark by requiring that the combined invariant mass m blν of the remaining b-jet, the lepton, and the neutrino (also reconstructed by assuming it to be the unique source of missing energy in the event) to fall inside the mass window of the top-quark defined by m t ± 35 GeV. In order to further reject the tt backgrounds, events with a reconstructed Higgs and top are selected if the combined invariant mass of the b-jets originating from the Higgs and the light jet satisfies the cut m bbj > 280 GeV [44]. The final selection efficiency for the thj signal in the SM is 0.32% (0.23%), while for the background it is 0.008% (0.006%) at 14 TeV (27 TeV).
As we fully reconstruct the th system and have access to the lepton momentum from the top decay we have all the necessary information for measuring the optimized spin observable. We use the optimal weight function B/A (Figure 3) extracted from the MC simulations to construct a χ 2 with an appropriately weighted signal process. Our results for pp → thj generated in the SM are given by the 2σ exclusion limits (shaded blue) shown in Figure 5 for the HE-LHC at a luminosity of 15 ab −1 . As can be seen in Equation (23) the observableÕ opt. is normalized to the cross section, which contains terms κ 2 ,κ 2 , as well as a linear term in κ and a constant term due to second diagram in Figure 1, whereas the numerator is proportional toκ(κ + c). The behaviour of O opt close to the SM point is thus linear inκ, whereas the cross section has a minimum in κ close to κ = 1. In the large coupling regimeÕ opt. converges to a small value which depends on the direction in which we make the limit κ 2 +κ 2 → ∞. The 2σ exclusion has an elliptic shape as shown by the blue contour in Figure 5. We also present the limit (given by the black elliptic contour) assuming a 2σ positive excess above the SM expectation corresponding to a measurement of the optimized spin observable ofÕ opt. = 0.06 ± 0.03 whose size and error are statistics-driven. Because of the nature of our observable, the signed fluctuation gives rise to asymmetric limits in theκ direction. In the κ direction the bounds are also not symmetric as pp → thj production contains linear κ terms. Finally, in order to include background effects, the same statistical analysis would have to be repeated including the tt background in the χ 2 fit. However, even with a large background rejection as implemented above, the irreducible background is simply too large and the signal is completely diluted leading to a signal significance of only S/ √ B ∼ 0.8 (3.2) at 14 TeV (27 TeV) at a luminosity of 3 ab −1 (15 ab −1 ). This effectively precludes any meaningful extraction of bounds onκ from a fit toÕ opt. . We leave the possibility of further optimizing the cuts in order to reduce the backgrounds or including other Higgs decay channels as a future challenge. In the remainder of this review we instead focus on the related but more abundant process of associated top quark pair and Higgs boson production. Figure 5. Bounds in the (κ,κ) plane using the optimized observableÕ opt for the single-top associated production with a Higgs boson. The blue shaded region corresponds to the 2σ (χ 2 > 6.18) exclusion zone assuming the measurement of the SM at the HE-LHC (15 ab −1 ). The black line and stripes shows the 2σ excluded region for a 2σ positive fluctuation at the HE-LHC (see text for details) [24].

CP-Odd Observables in the pp → tth Process
In this Section we introduce laboratory-frame accessible and phase-space optimized CP-odd observables in the process pp → tth in which both top quarks decays semileptonically: t → b l + ν and t → b l − ν. This process has a higher S/B ratio compared to pp → thj and has indeed been measured by the LHC collaborations [45,46] (For the state of the art predictions of the differential distributions see e.g., Ref. [39]) The top quarks in pp → tth are known to be unpolarized, independent of the value ofκ [2]. The information on the underlying κ andκ parameters is in principle contained in the correlations among the top quark spins, however due to the experimental difficulties of extracting the top quark polarizations in this process, this approach is unfeasible. This is the reason we focus in this Section on manifestly CP-odd observables, directly sensitive toκ and easily accessible by measurements of final-state momenta in the lab frame.

Laboratory Frame CP-Odd Observables in tth Production
The accessible final-state momenta [6] of this decay in the lab frame are 3-momenta of b-jet p b , b-jet pb, 3-momenta of leptons p ℓ + and p ℓ − , and 3-momenta of the Higgs p h . In practice differentiating between b-jet and b-jet is difficult (For recent attempts in extracting the charge of the b-jet see Refs. [47][48][49][50]), therefore we should consider only observables which are invariant under p b ↔ pb transformation. We construct these observables from 3-vector quantities with well defined C and P eigenvalues, that are given in Table 1. Table 1. Vector quantities with well-defined C and P eigenvalues in 3 dimensional Euclidean space. More complicated objects with well-defined C and P eigenvalues can be constructed using variables in this table.
We use quantities in Table 1 to construct CP-odd variables ω of mass-dimension 5, that are C-even and invariant under p b ↔ pb transformation. In order to systematically obtain all distinct ω's we proceed as follows. First, we construct variables of the form .., 5}. Doing so, we find 150 potential quintuple products. We symmetrize them with respect to C-conjugation and p b ↔ pb transformation. The non-zero quintuple products are ω variables, however they may be linearly dependent. Indeed some of the obtained ω's are connected via the following Euclidean identity which can be written as: where a, b, c and d are four arbitrary vectors in 3 dimensional Euclidean space. The sign of individual terms in the last expression corresponds to the sign of the cyclic permutation of the four vectors. The first class of ω's involves p ℓ + and p ℓ − in the mixed product, p ℓ − − p ℓ + in the scalar product. Both products are invariant under p b ↔ pb: The second class involves p b × pb and/or p b − pb in both mixed and scalar products: The third class involves mixed product of p h , p ℓ − ± p ℓ + , and p b ± pb: There are further possibilities with a mixed product p h × (p ℓ − + p ℓ + ) · (p b + pb) multiplied by one of 8 C-even, b ↔b even scalar products Since the mixed product itself already has the desired symmetry properties, those 8 quintuple products do not bring additional new information, with respect to a mixed (triple) product that is our final observable: Note that there are nonlinear relations between ω's, such as ω 1 ω 6 = ω 3 ω 4 , that we do not exploit to further reduce the set. Namely, ratios of ω's can contain singularities in the available phase-space and as such would be difficult to reconstruct by a neural network optimizer.
All the ω's are normalized by the lengths of the vectors that enter as factors in the scalar products, and the upper bound |ω i | ≤ 1 is generally valid. For cases when ω i has a vector a present both in the mixed and scalar products, e.g., (V 1 × V 2 · a) (V 3 · a), and furthermore with V 1 × V 2 · V 3 = 0, a stricter upper bound |ω i | ≤ 1/2 applies (for ω 1,6,7,11,13,14,15,16,20,21 ). Having constructed manifestly CP-odd variables ω, we now show how they can be used to extract information onκ in an optimal way. The tth production cross section can be written as where x are CP-even phase space variables, A is a manifestly CP-even and B is a manifestly CP-odd function of ω i : B(x, ω i ) = −B(x, −ω i ). The κκ dependence is due to the interference of scalar and pseudoscalar amplitudes. A simple CP-odd observable is an average of a single variable ω i where N is the number of experimental events. The standard deviation σ i of such an observable is given by For large enough N the distribution of O i is approximately Gaussian and the corresponding significance of this observable is: Equation (55) holds for ω 1 , ..., ω 22 and their linear combinations. By studying the behavior of all 22 such observables O i on MC simulations of pp → tth with semi-leptonically decaying top quarks we have found that O 6 and O 14 are the most promising in terms of their significance. Our next aim is to show how one can achieve better sensitivity toκ by introducing optimized observables in two directions: the phase-space optimization of a single ω i and a construction of an optimal CP-odd observable as a combination of all available ωs.

NN-Based Optimized CP-Odd Observables
Due to the high dimensionality and complexity of the phase-space in the tth process with top quarks decaying semileptonically, we rely on neural networks (NN) trained on Monte-Carlo generated samples to efficiently parametrize the weight function of events across the multi-dimensional phase-space in order to maximize the statistical sensitivity tõ κ. We note that existing general purpose ML inference tools are already able to optimize sensitivity to a given parameter (see e.g., Ref. [51]). The purpose of our work was to do so in an economic way that manifestly respects the symmetries of the problem.
We implemented the training and evaluation of neural networks using the TensorFlow framework [52]. In all cases, we used a sample of 10 7 pp → ht(→ bℓ + ν)t(→bℓ −ν ) events generated at LO using Madgraph5 [35] together with the Higgs Characterisation UFO model [37,38] with κ,κ = 1. We split the sample into separate training (7.5 M) and test (2.5 M) samples. After training at fixedκ = 1 we also tested the observables at other values ofκ and κ, both ranging from −1 to 1. In these tests 1 M events have been used.Unless stated otherwise the results are shown for events in pp collisions at 14 TeV. We randomly initialized the neural network weights using the default Glorot uniform initializer and used the Adam optimizer with a custom varying learning rate l(e) = l(e − 1)/(1 + 0.8 e ) where e is the current epoch and the initial learning rate is set to 0.1. We use relu for the activation function. We trained all networks using the novel loss function loss(α) = mean(F (X; α)) std(F (X; α) where the mean() and the standard deviation std() are to be calculated over all events in the sample. The loss corresponds to the inverse of the significance-squared of the observable mean(F (X; α)) that should be minimized in order to achieve optimal statistical sensitivity.
Here N is the size of the event sample, α are the free neural network weights and biases and X stands for the values of CP-even and/or CP-odd phase-space variables in the given event. We emphasize that such a choice of the loss function is unique to the problem at hand -we are using the optimization procedure to directly maximize the significance of each considered observable. We can avoid over-fitting of the training sample by stopping the training when at least 30 epochs have passed and one of the following two criteria is satisfied: either the running average of 20 training losses saturates to 0.5% or the running average of 20 test losses increases for 5 epochs in a row. We keep a model history and in the end choose the best model in terms of test loss. In practice we found that mostly the first condition terminates the training loop, and the best model is usually the model from the final epoch of training. In order to determine the optimal NN architecture we performed a scan over a set of possible NN configurations with up to 2 hidden layers and up to 9 nodes per NN layer. (We have also considered an automated algorithm to determine the optimal NN architecture (i.e., Hyperopt [53], see also Ref. [54] for one of its recent uses.). Here instead we present results of manual scans over a set of possible NN configurations in order to have better control over the NN parameters. We found the results of both approaches comparable. We choose this cutoff for representational purposes, however we have checked that our results do not change significantly when using larger networks, namely up to three hidden layers of 30 nodes each.)

Phase-Space Optimization of a Single ω
First we present the optimization of the ω 6 and ω 14 variables based on phase-space averaging. We do not follow the optimization procedure based on separating A and B in Equation (52) since this would require cumbersome multidimensional binning [33]. We use a vector of easily accessible CP-even Mandelstam variables x: Our goal is to find the optimal CP-even weight function f (x; α), which should be used to calculate the weighted average of ω i . The function f takes CP-even quantities x as inputs, therefore we expect its dependence onκ to be of the form Using (52) we can now express the observable with the usual definition of the average (The phase space average of a function is defined as # ≡ 1 σ dσ dxdω # dxdω). The integration region in x − ω i space is symmetric with respect to the transformation ω i → −ω i , therefore integrals of the arguments which are antisymmetric in ω i vanish. Due to this reason all contributions to the expected value of the observable f (x; α)ω i are proportional to odd powers ofκ. The large dimensionality of the phase-space suggests the parameterisation of the function f (x; α) by means of an appropriate NN. In terms of the loss function (56) To understand the impact of using different possible neural network architectures, we have performed a manual scan over a set of neural network configurations. The input layer has 5 nodes (one per each x component) and the output layer has one node resulting in a scalar f (x; α). We studied networks with a single hidden layer of 1-9 nodes and double hidden layer networks with 1-9 nodes each, constraining the number of nodes on the second hidden layer to be smaller than or at most equal to the number of nodes on the first hidden layer. The results of the converged test losses of 50 different random weight initializations per configuration are shown on Figure 6 in the purple box plot for the case of ω 6 and the orange box plot for the case of ω 14 . The plain ω 6 -based observable is shown in gray, with the dashed lines denoting its 1σ statistical uncertainty, while the same holds true for plain ω 14 in black. We found that the phase-space optimization of ω 6 gives a noticeable improvement over plain ω 6 when using a large enough network. Interestingly, ω 14 seems to be close to optimal on its own, as the phase space optimization does not introduce noticeable improvement. Moreover, the optimized ω 6 gives a similar performance to ω 14 , hinting that we have reached maximal performance achievable with a single ω.
To test how well the resulting networks generalize to other values ofκ we used the 50 converged {9, 9} models to calculate the dependence of the resulting observable significances with respect toκ on the aforementioned fresh sets of 1 M events perκ. This is shown on Figure 7 where a consistent improvement over simple ω 6 can be seen at all consideredκ, while again a marginal improvement is confirmed for ω 14 , with the optimized ω 6 hovering around ω 14 .

Nodes in hidden layer
Test loss  (60), is shown in red as described in Section 3.2.3 [27].
As the results of optimizing single ω 6 and ω 14 point to a maximal performance possible using a single ω and the chosen set of phase space variables, we now turn to the rest of the ω's. In the next subsection we consider a more general case where the CP-odd observable itself is parameterized with a neural network.

Neural Network as a CP-Odd Observable
We consider the case where the output of the neural network is a CP-odd quantity that defines our observable. To this end, we build a network with 22 inputs, one per each ω i , and one output F(ω; α), which is correctly anti-symmetrized so that F(ω; α) = −F(−ω; α). In terms of the loss function (56) we have F (X; α) = F(ω; α). Note that since we include the complete irreducible set ω in this non-linear construction, it effectively also covers the case of a simple phase-space optimization of any (linear combination of) ω i , since all relevant CP-even phase-space variables can be recovered by taking suitable products of ω i ω i ′ .
We again carried out the study of the dependence of the network size with respect to the test sample loss, including non-negligible uncertainties associated with random weight initializations. We scanned the neural network architecture parameter space in the same way as in the previous case, starting with a single hidden layer of 1-9 nodes, then adding an additional hidden layer with the number of nodes smaller than or equal to the number of nodes on the first hidden layer. For each configuration we ran 50 trainings with different random weight initializations. The results are shown in Figure 6 in the blue box plot. We find a considerable improvement over the phase-space optimizations of ω 6 or ω 14 . The improvement is consistent in the entire range ofκ ∈ [0.1, 1.0] and is most striking at largeκ.
Again we can check the generalizing power of the resulting observables to otherκ by fixing the model configuration to {9, 9} and calculating the significance of the resulting observables with respect toκ. The results are shown on Figure 7. We find a consistent improvement over the previous case across all consideredκ. A noticeable improvement in the significance can be seen. In order to better understand the physics underlying the optimization, we next consider this model in the leading order approximation in ω.

First Order Approximation of F(ω;α)
To address the arbitrariness of the neural network architecture choice and to better understand the underlying physics, we finally consider the first order approximation of the function F(ω; α), which can be expanded in a Taylor series F(ω; α) = ∑ j α j ω j + O(ω 3 ) for j ∈ {1, . . . , 22}, since for most events the values of the CP-odd variables are small: |ω i | ≪ 1, i ∈ {1, ..., 22}. In the first order approximation the optimal CP-odd observable can be written as where α j are parameters that satisfy a subsidiary condition |α| ≡ ∑ 22 j=1 α 2 j = 1. Their values can be found by maximizing the significance: The left side of the expression above can be expanded in the following way: where in the last step we take α outside the − , for example α · ω = ∑ j α j ω j . By assumingκ = 0 we have O α·ω = 0. From this, together with the condition in Equation (61), it follows that the expression in the curly brackets of Equation (62) must be equal to 0. Hence, we obtain a system of 22 quadratic equations (The problem is equivalent to a single neuron NN with 22 inputs and one output without the activation function and the bias term) The system of equations in Equation (63) can be solved numerically. The solution vector α is undetermined up to a real non-zero constant. We normalize it such that α · α = 1 and choose the solution with mostly positive values. We used this approach to extract the optimal weights α j from 10 7 events generated withκ = κ = 1 at 14, 27, and 100 TeV. We also estimated the uncertainty associated with the optimal weights using the following procedure: First we estimate the statistical spread of the significance obtained with optimal α. Next we allow a single α j to float in the intervals [α j − σ j , α j + σ j ], where σ j is chosen such that the decrease of the significance due to the change in α j corresponds to the statistical spread of the significance. We perform an efficient scan around the optimal vector α in its 22-dimensional neighborhood using spherical coordinates to trivially fulfill the normalization constraint ∑ j α 2 j = 1. We approximate the significance with a quadratic function around the extremum to find independent, uncorrelated directions in the α-space. With this procedure we determine how sharply the optimal α j are defined. In practice, we estimated the statistical error of the significance using 10 7 events. Clearly the uncertainties σ j are larger for smaller chosen sample size. The results of this approach are shown in Figure 8, where the upper (lower) panel shows the estimated error (significance) for each α j at 14, 27, and 100 TeV. A comparison of the observable α · ω to other approaches discussed previously is shown in Figure 7. We reach a similar level of improvement compared to the full F(ω; α) network with significantly fewer parameters.

Bounds in the (κ,κ) Plane
We could now produce the bounds in the (κ,κ) plane by varying both κ andκ at generator level and including showering and hadronization effects using Pythia8 and detector effects using Delphes with the default ATLAS simulation card. As the tth is followed by semileptonic top decays and h → bb decay, our signal is defined as 4 b-jets and two oppositely charged leptons ℓ. We included the main irreducible background pp → ttbb with both tops decaying semileptonically and used the event selection requirements: • ≥ 4 jets of any flavor with |η(j)| < 5 and p T (j) > 20 GeV. • ≥ 3 of the above jets are b-tagged. • 2 oppositely charged light leptons with |η(ℓ)| < 2.5 and p T (ℓ) > 10 GeV. Furthermore, in order to identify the b-jets from t,t decays we counted the number N b of tagged b-jets and performed the following selections: if N b ≥ 4, we compute the invariant masses m bb of all possible b-jet pairs and select the pair with invariant mass closest to the Higgs mass m h = 125 GeV. If the selected pair falls inside the Higgs mass window (|m bb − m h | < 15 GeV) we removed the pair from the list of b-jets and selected from this list the highest p T b-jets as our candidate top quark decay b-jets. However, if N b = 3 we computed all possible invariant masses m bj where j are non-b jets in the event. We selected as the h → bb candidate the bj pair that minimizes |m h − m bj | and falls inside the Higgs mass window m h ± 15 GeV. The remaining two b-jets are taken as the candidate top quark decay b-jets.
We present the bounds for HL-and HE-LHC, FCC-hh by using the optimal observable (60) with the weights shown in the upper plot of Figure 8. The bounds coming from a null result up to the expected statistical uncertainty for different luminosities at different energies are shown on Figure 9, where we also show the expected sensitivity to (κ,κ) from the tth production cross-section measurements using the projected uncertainties of δκ/κ ∼ (0.04, 0.02, 0.009) at HL-LHC, HE-LHC [55] and FCC-hh [56], respectively. For direct comparison with [24] we also show the expected bounds from using a single ω 6 and have checked that using ω 14 does not change the single ω bounds significantly. A consistent improvement of sensitivity compared to using single ω i can be achieved by using the optimized combination of ω's, constraining the parameter space significantly in orthogonal directions compared to the cross-section measurements. Interestingly the significance improvement is consistent between partonic events and after including shower, detector effects, as well as the dominant background and realistic object reconstruction, even though the optimization was performed at parton level only. This robustness is a welcome benefit of the method, since the computationally costly optimization procedure does not appear to be sensitive to modeling of the hadronic final states and detector effects. It is also reassuring that the optimization does not significantly rely on specific phase-space regions particularly affected by the background. We expect the results to be also robust against higher order QCD corrections as in the case of pp → thj, where we have checked this explicitly.

Conclusions
In order to establish, directly and with minimal additional assumptions, the presence of a CP-odd component of the top quark Yukawa (κ), we have studied manifestly CP-odd observables in th and tth production at the LHC and its prospective upgrades.
For the thj final states we have relied on the possibility of reconstructing the t quark momentum and accessing the t polarization. We have identified a particular polarization direction which is perpendicular to the th plane, where the top polarization along this direction would undoubtedly point to the presence of the CP-odd couplingκ. We have presented a method for optimizing the phase space dependent weight and shown its sensitivity at the HL-and HE-LHC for the semileptonic top and h → bb mode. The handful of signal events offer discriminating power, sensitive to the sign ofκ, however the irreducible background due to tt+jets severely dilutes the sensitivity of the proposed observable.
On the other hand, tth production has a considerably larger cross section at LHC energies compared to thj, while suffering more moderately from irreducible backgrounds. Due to the complexity of the final state kinematics with multiple undetected particles we have in this case proposed variables ω i that only depend on the lab-frame accessible momenta and are manifestly P-and CP-odd. Furthermore, we have studied the prospect of their phase-space optimization, parameterizing the optimal weight functions with neural networks. In particular, we have studied a general CP-odd observable, parameterized directly by a CP anti-symmetric neural network, which results in better performance compared to any individual ω i . Finally, we have studied the first order approximation of this network as a linear combination of the CP-odd observables, producing a simpler and more robust observable, α · ω. One benefit of using α · ω optimized at parton level is that it retains close to optimal sensitivity to the CP-odd coupling even after detector simulation, event selection and reconstruction, allowing to probeκ directly at HL-LHC, HE-LHC and FCC-hh. We have found that, at the end of Run 3, the LHC will exclude κκ ∼ 1 with 2σ confidence, while FCC-hh will be sensitive to κκ ∼ 0.01 . Note that these observables represent highly complementary probes of the top Yukawa sector compared to pp → tth cross-section measurements. In particular in their optimized form they would allow to break the degeneracy in the (κ,κ) plane and significantly reduce the allowed parameter space even at modest sensitivities accessible at the (HL)LHC.

Funding:
The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P1-0035). This article is based upon work from COST Action CA16201 PARTICLEFACE supported by COST (European Cooperation in Science and Technology).
Institutional Review Board Statement: Not applicable.