Low-Order Spherical Harmonic HRTF Restoration Using a Neural Network Approach

Spherical harmonic (SH) interpolation is a commonly used method to spatially up-sample sparse head related transfer function (HRTF) datasets to denser HRTF datasets. However, depending on the number of sparse HRTF measurements and SH order, this process can introduce distortions into high frequency representations of the HRTFs. This paper investigates whether it is possible to restore some of the distorted high frequency HRTF components using machine learning algorithms. A combination of convolutional auto-encoder (CAE) and denoising auto-encoder (DAE) models is proposed to restore the high frequency distortion in SH-interpolated HRTFs. Results were evaluated using both perceptual spectral difference (PSD) and localisation prediction models, both of which demonstrated significant improvement after the restoration process.


Introduction
Virtual Reality (VR) and Augmented Reality (AR) technologies are on the rise, through the advent of commercially available and affordable VR headsets, with applications in gaming, education, therapy, social media and digital culture amongst others. Providing a high fidelity immersive experience in virtual environments requires good quality spatial audio. To achieve this, the VR/AR technology must be able to deliver to the ears the same binaural cues as would be experienced in real life. These are Interaural Time Difference (ITD), Interaural Level Difference (ILD) and spectral cues introduced by the ear pinnae and torso. Head Related Transfer Functions (HRTFs) are sets of binaural filters that encapsulate these cues from different angles relative to the head in three dimensions. Sound sources can be spatialised by direct convolution with a given HRTF pair representing the intended sound source direction. Alternatively a virtual loudspeaker framework can be employed, where methods such as Vector Base Amplitude Panning (VBAP) [1] or Ambisonics [2] are used to render sources between virtual loudspeaker points formed from the HRTFs [3]. Both methods typically require a high number of HRTF measurements to ensure good spatial resolution in the rendered audio [4].
However, HRTFs are highly personalised due to different head and ear shapes. Using mismatched HRTFs can affect timbre quality, localisation performance and externalisation. Currently, the main way to obtain personalised HRTFs is through physical measurements, where microphones are placed at the ear canal of a subject and loudspeakers are positioned at different angles relative to the head to measure the transfer functions [5]. This measurement process is often tedious and requires substantial setup and calibration. Recent developments have been made in HRTF selection based on anthropometric data extracted from photographs of the ear [6] and in HRTF simulation using 3D head models [7]. However, simulation is computationally expensive and usually requires a lot of processing time [8,9].
To simplify the measurement process, different HRTF interpolation methods have been proposed to acquire dense HRTF sets from sparse HRTF measurements. The current state of the art method is Spherical Harmonic (SH) interpolation, which leverages the spatial continuity of the SH representation as a bridge to spatially up-sample a sparse HRTF measurement set to a denser one. Depending on the number of sparse HRTF measurement points and the SH order, high frequency information can be lost in this process. Consequently, listeners may perceive timbre differences and degraded localisation performance in practical use.
Recently, developments in machine learning have shown great improvement in neural style transfer and data restoration especially in the image domain [10,11]. This paper investigates whether similar models can be used to restore the distorted high frequency data in SH interpolated HRTFs.
This paper is organised as follows: Section 2 will cover relevant background information on HRTF interpolation methods, particularly in SH interpolation. It will also discuss the motivation for using a machine learning approach along with identification of some candidate models. Section 3 will discuss the method used in this study, including the data pre-processing workflow, a baseline model and different techniques investigated on top of the baseline model. Section 4 evaluates the performance of the model based on perceptual spectral difference and localisation performance. Section 5 discusses the results and potential directions for the work. Section 6 concludes the paper.

Spherical Harmonic HRTF interpolation
HRTF interpolation can be done in many different ways, including inverse-distance weighting, spherical splines in the time or frequency domains, and manifold learning [12-16]. However, SH interpolation is one of the more elegant methods, with a more standardised procedure and promising results [16]. HRTF sets are commonly described in the SH domain due to its simplicity and ease of use in Ambisonics for spatial audio reproduction [17]. Since the SH domain describes a continuous spatial representation of the HRTFs, interpolation can be readily achieved. A given HRTF set H(\theta, \phi) can be converted to the spherical harmonic domain at a given order M using a re-encoding matrix C with K rows and L columns, where K is the number of SH channels, calculated as K = (M + 1)^2, and L is the number of HRTF measurements from different angles, with L ≥ K. The coefficients Y_{mn}^{\sigma} in the re-encoding matrix C, with SH order m and degree n, are calculated by

Y_{mn}^{\sigma}(\theta, \phi) = \sqrt{(2 - \delta_{n,0})\,\frac{(m-n)!}{(m+n)!}}\; P_{mn}(\sin\phi) \begin{cases} \cos(n\theta), & \sigma = +1 \\ \sin(n\theta), & \sigma = -1 \end{cases}   (1)

where \sigma = ±1, \theta and \phi are the azimuth and elevation of the measurement angle, P_{mn}(\sin\phi) are the associated Legendre functions of order m and degree n, and \delta_{n,0} is the Kronecker delta function:

\delta_{n,0} = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}   (2)

This paper uses Schmidt semi-normalisation (SN3D) in the computation of Y_{mn}^{\sigma}. For normal SH HRTF use, a mode matching decoding matrix D can be calculated from C with the following pseudo-inverse equation:

D = C^{T}(CC^{T})^{-1}   (3)

However, for HRTF interpolation, another re-encoding matrix \hat{C} and decoding matrix \hat{D} at the desired target angles are required, where L is replaced by \hat{L}, the number of HRTF measurements to be interpolated from the SH HRTFs, whilst K remains the same. The interpolated HRTFs \hat{H}(\theta, \phi) can be calculated with the following equation:

\hat{H} = \hat{D}(C(H))   (4)

The main issue caused by SH HRTF interpolation is that the interpolated HRTFs are only accurate up to the spatial aliasing frequency f_{\lim}, approximated by

f_{\lim} \approx \frac{cM}{4r(M+1)\sin\left(\frac{\pi}{2M+2}\right)} \approx \frac{cM}{2\pi r}   (5)

where c is the speed of sound, approximated as 343 m/s at 20 °C in air, and r is the radius of the human head [18].
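The encode-decode pipeline above can be sketched in NumPy for the 1st-order case. The helper `sh_matrix_order1` and the direction values are illustrative assumptions, and a least-squares encoder is used in place of the explicit pseudo-inverse formulation, to which it is equivalent when L ≥ K:

```python
import numpy as np

def sh_matrix_order1(az, el):
    """Real SN3D spherical harmonics up to order 1 (ACN order W, Y, Z, X).
    az, el in radians; returns a K x L matrix (K = 4 SH channels)."""
    az, el = np.atleast_1d(az), np.atleast_1d(el)
    return np.stack([np.ones_like(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el),
                     np.cos(el) * np.cos(az)])

def sh_interpolate(H, az_meas, el_meas, az_tgt, el_tgt):
    """Up-sample HRTFs H (L x F, L sparse directions, F frequency bins)
    to the target directions via the order-1 SH domain (Eq. 4)."""
    C = sh_matrix_order1(az_meas, el_meas)        # K x L     (re-encoding)
    C_hat = sh_matrix_order1(az_tgt, el_tgt)      # K x L_hat (target grid)
    # Least-squares encoding: solve C^T h = H for the SH coefficients h.
    h, *_ = np.linalg.lstsq(C.T, H, rcond=None)   # K x F
    return C_hat.T @ h                            # L_hat x F (decoding)
```

With the 6-point octahedral layout used later in the paper, any order-1 band-limited directional function is interpolated exactly; real HRTFs are not band-limited in the SH domain, which is precisely the source of the distortion above the aliasing frequency.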
For 1st order, the spatial aliasing frequency is around 700 Hz. The spectral distortions will not only affect the timbre, but will also degrade localisation performance, since the important cues for identification of source elevation are changed, as shown in Figure 1.
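As a quick check of Equation 5, a short script reproduces the roughly 700 Hz figure for 1st order; the head radius of 0.0875 m is an assumed average value:

```python
import numpy as np

def f_lim(M, r=0.0875, c=343.0):
    """Spatial aliasing frequency (Hz) for SH order M (Eq. 5);
    r is an assumed average head radius in metres."""
    return c * M / (4 * r * (M + 1) * np.sin(np.pi / (2 * M + 2)))

first_order = f_lim(1)  # roughly 700 Hz, matching the figure quoted above
```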
To challenge the full potential of machine learning, this paper uses 1st order SH interpolation, as it requires the fewest HRTF measurements. An octahedron configuration with 6 measurements is selected, which has a more stable energy distribution than other arrays for 1st order SH [19]. When using SH interpolation, the number of output points can be flexible.
We also employ dual-band Time Alignment (TA) in the encoding of the HRTFs [20,21]. Given that ITD is only effective at low frequencies, the high frequency ITD can be removed when undertaking SH encoding. This is achieved by time aligning the HRTFs at high frequencies. By doing this, lower order SH HRTFs preserve high frequency information more effectively, which improves interpolation results. In this study, the crossover frequency was set at 2.5 kHz, as suggested in [21].
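The alignment step itself can be sketched as removing each HRIR's onset delay. This is only the core of dual-band TA (the paper applies it above the 2.5 kHz crossover only), and the 10% onset threshold is an assumption, not the paper's exact method:

```python
import numpy as np

def time_align(hrirs, threshold=0.1):
    """Remove per-measurement onset delays so all HRIRs start together.
    Onset = first sample whose magnitude exceeds `threshold` times the
    absolute peak of that HRIR (an assumed detection criterion)."""
    aligned = np.empty_like(hrirs)
    delays = []
    for i, h in enumerate(hrirs):
        onset = int(np.argmax(np.abs(h) >= threshold * np.abs(h).max()))
        aligned[i] = np.roll(h, -onset)  # shift the onset to sample 0
        delays.append(onset)
    return aligned, np.array(delays)
```

In a dual-band implementation the HRIRs would first be split at the crossover, with only the high band aligned before SH encoding.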

Machine learning HRTF Restoration
Research domains like speech recognition, natural language processing and computer vision have demonstrated that a more general data-driven method often beats traditional knowledge-based signal processing methods in the long run, as the available data can keep growing [22]. In image processing, recent developments in machine learning show great potential in noise reduction, image inpainting, colourising old photos or videos and neural style transfer [10,11,23-29]. These tasks can be considered quite similar to restoring distorted SH interpolated HRTFs. Candidate models include variants of fully connected Neural Networks (NN), Auto-encoders (AE), Convolutional Auto-encoders (CAE) [28], Residual Networks (ResNet) [30] and Generative Adversarial Networks (GANs) [31].
Most machine learning models require large amounts of labelled data to produce excellent results. However, the number of available HRTF databases is currently very limited. There were only 233 HRTF datasets freely available in the Spatially Oriented Format for Acoustics (SOFA) format at the time of this project [32,33]. Compared to the data sizes used to train image processing models, which can be in the region of hundreds of thousands of images, this is far too little data to generalise well or to train sophisticated models. However, even with such limited data, SH interpolated HRTF restoration is potentially achievable using machine learning algorithms and is now investigated.
An overview of the proposed method is shown in Figure 2. A subset of HRTFs is selected from a database to represent a sparse HRTF measurement set. These HRTFs are then interpolated in the traditional SH HRTF interpolation manner. After the interpolation, each HRTF measurement is fed into the machine learning model for restoration. This paper chooses the output size for the interpolation process based on the number of HRTF measurements in the original dataset, which is typically over 2000 measurements depending on the dataset. The restored HRTFs output from the machine learning model can then be easily compared with the true HRTF measurements. In this section, we will first discuss the data preparation and format, then introduce a baseline model used in this study before improving it with different enhancement techniques including weight decay, dropout and early stopping.
The SH HRTF interpolation process takes place in the time domain on Head Related Impulse Responses (HRIRs), which are converted to frequency domain Head Related Transfer Functions (HRTFs) after the interpolation for input to the restoration model. Note that due to the randomness of machine learning models, there is a chance that a model could incorrectly give negative amplitude spectra in the output. To avoid this, the input and output data are scaled to decibels. The output data is then rescaled before converting back to the time domain. This paper uses 6 measurements in an octahedron configuration for the sparse dataset, which is one of the more challenging cases for SH HRTF interpolation. Two sets of configurations are used: one in both training and testing, and the other only in training for data augmentation purposes. The angles are shown in Tables 1 and 2.
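The dB scaling described above guarantees positive magnitudes after rescaling, whatever the model outputs. A minimal sketch (the epsilon floor is an assumption to avoid log of zero):

```python
import numpy as np

EPS = 1e-12  # floor to avoid log10(0); the exact value is an assumption

def mag_to_db(H):
    """Magnitude spectrum -> dB, the model's input/output domain."""
    return 20.0 * np.log10(np.maximum(np.abs(H), EPS))

def db_to_mag(H_db):
    """dB -> magnitude; strictly positive for any real-valued model output."""
    return 10.0 ** (H_db / 20.0)
```

Even if the restoration model predicts an arbitrarily negative dB value, `db_to_mag` still returns a small positive magnitude, so the inverse transform back to the time domain stays well defined.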
Amongst the popular HRTF datasets, only the SADIE I [37], SADIE II [38], IRCAM Listen [39] and Bernschutz KU100 [40] databases provide measurements at these angles. In this paper, Subjects 19 and 20 from the SADIE II database are held out for testing and evaluation of the model and are not used in training. SADIE II Subject 20 was tracked during the training process. This design aims to show how well the model copes with unforeseen HRTF measurements. Furthermore, the Bernschutz KU100 HRTFs were also excluded from the training and validation sets. This allows us to study the effect of an alternative measurement method on a dataset expected to nearly match the existing KU100 measurements in the training data.
Since only the SADIE, IRCAM and Bernschutz datasets have the required measurements for training and validation, it is challenging to produce an accurate model with such limited variation of HRTF data. Consequently, data augmentation using other HRTF datasets was undertaken to provide extra data for training and validation. The ARI [34], ITA [35] and RIEC [36] datasets were included with modified angles: positions with an elevation angle of -45° were changed to -30°, and for the RIEC dataset positions with an elevation angle of 45° were additionally changed to 50°. The effect of this data augmentation is demonstrated in Section 3.2.
Once all the training and validation HRTFs were concatenated, 50,000 measurements were randomly selected and split into training and validation sets with an 80:20 ratio, considering practical training time and the limitations of available computer memory (see Section 3.2).
To improve the speed and stability of the training process, it is considered good practice to standardise the input data before feeding it into the machine learning model. The standardisation equation is as follows:

z = \frac{x - \mu}{\sigma}   (6)

where x is the input data and \mu and \sigma are the mean and standard deviation of the training and validation data. Note that the same \mu and \sigma should be used for the test data. To summarise, in total there are 230 subjects used in the training, from SADIE I (18 subjects), SADIE II (18 subjects, not counting the 2 hold-outs), IRCAM Listen (51 subjects), ARI (60 subjects), ITA (45 subjects) and RIEC (38 subjects). Subjects 19 and 20 from SADIE II and the KU100 measurement from Bernschutz are held out as test sets. Data was standardised before training.
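The standardisation step, with the statistics fitted only on training/validation data and reused for the test set, can be sketched as:

```python
import numpy as np

def fit_standardiser(train):
    """mu and sigma are computed on the training/validation data only."""
    return train.mean(), train.std()

def standardise(x, mu, sigma):
    """Apply z = (x - mu) / sigma with the *training* statistics,
    including on test data."""
    return (x - mu) / sigma
```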

Baseline Model
There are numerous ML models throughout the literature that have potential for SH interpolated HRTF restoration. Here we aim to find a model that has a simple architecture whilst still producing viable results. The reason for using a simple model is the limited number of HRTF datasets: a simpler model is less likely to over-fit the training data. The proposed model can be seen as a simplified version of an Inception Network [41]. Separate models for the left and right channels are trained individually, whilst the input of each model takes both channels to provide additional information, which improves the results as shown in Table 3. The model input size is 2 × 256 (left and right channels of the interpolated HRTFs, with a length of 256 samples per channel) and the output is 256 (either the left or right channel).
The model proposed here uses a combination of a Convolutional Auto-Encoder (CAE) and a Denoising Auto-Encoder (DAE) [42]. Preliminary test results showed that the DAE is better at capturing the main contour of the frequency response (Figure 3) and the CAE is better at capturing the finer details (Figure 4). The combination of the two yields positive results. Similar results have also been observed in image research [43].
The results from the CAE and DAE are concatenated and passed through a fully connected layer for the voting process.
The complete model is shown in Figure 5. Note that batch normalisation is performed after each convolution layer and transposed convolution layer, with the exception of the very last transposed convolution layer, so that the output of the CAE has a similar magnitude to that of the DAE. The models are built and trained with PyTorch [44,45] using smooth L1 loss with the Adam optimiser (learning rate: 0.000001, beta 1: 0.9, beta 2: 0.999).
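A minimal PyTorch sketch of the two-branch architecture described above. The layer counts, channel sizes and the fully connected form of the DAE branch are assumptions for illustration; the paper's exact configuration is given in its Figure 5:

```python
import torch
import torch.nn as nn

class CAEBranch(nn.Module):
    """Convolutional auto-encoder branch; no batch norm on the last
    transposed convolution, as noted in the text."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(2, 16, 5, stride=2, padding=2), nn.BatchNorm1d(16), nn.ReLU(),
            nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.BatchNorm1d(32), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(32, 16, 4, stride=2, padding=1), nn.BatchNorm1d(16), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 4, stride=2, padding=1),  # no BN here
        )
    def forward(self, x):
        return self.dec(self.enc(x)).squeeze(1)          # (B, 256)

class DAEBranch(nn.Module):
    """Denoising auto-encoder branch, assumed fully connected."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * 256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 256),
        )
    def forward(self, x):
        return self.net(x.flatten(1))                    # (B, 256)

class Restorer(nn.Module):
    """Concatenate both branches, then one FC 'voting' layer."""
    def __init__(self):
        super().__init__()
        self.cae, self.dae = CAEBranch(), DAEBranch()
        self.vote = nn.Linear(512, 256)
    def forward(self, x):                                # x: (B, 2, 256)
        return self.vote(torch.cat([self.cae(x), self.dae(x)], dim=1))

model = Restorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-6, betas=(0.9, 0.999))
loss_fn = nn.SmoothL1Loss()
```

One such model is trained per ear: both channels go in (2 × 256), a single channel (256) comes out.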
Different loss functions were compared and the results are shown in Table 4. The mean square error (MSE) loss, also known as L2 loss, performs worse on the test data but slightly better on the training and validation data. L1 loss, also known as mean absolute error (MAE), showed key improvements with the SADIE Subject 20 dataset and slight improvements with the Bernschutz KU100 test data. The reason L1 loss out-performs MSE loss might be that L1 loss is usually less sensitive to outliers, which matters when there are HRTFs from different databases. Smooth L1 loss is a combination of L1 loss and MSE loss: for an error below 1 it behaves as the MSE loss, and otherwise it behaves as L1 loss. Compared to L1 loss, it has a continuous derivative at zero, so it provides a smoother gradient when the error falls below one. Smooth L1 loss further improves the SADIE Subject 20 result, but there is a trade-off on the Bernschutz dataset. Considering real world applications, it is more practical to optimise for unforeseen HRTF measurements of different human subjects than for different measurement methods with the same artificial head. The same principle holds for the further optimisation techniques in the upcoming tests. Therefore, the models in this paper use smooth L1 loss as the loss function.

The baseline model was trained for 500 epochs with a batch size of 8. A small batch size was used because prior research shows that a larger batch size may produce worse performance [46,47]. Figures 6 and 7 show the mean squared error (MSE) change during training. The blue and orange lines are the training and validation results respectively; the green and red lines are the test sets of Subject 20 from the SADIE II database and the Bernschutz KU100 measurement respectively. The results show that the training and validation errors trend downwards while the two test sets flatten out after the first 20 epochs.
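The smooth L1 loss described above can be written down directly; the transition point beta = 1 matches PyTorch's default:

```python
import numpy as np

def smooth_l1(err, beta=1.0):
    """Elementwise smooth L1 (Huber-like) loss: quadratic below beta,
    linear above it, with a continuous derivative at the join."""
    a = np.abs(err)
    return np.where(a < beta, 0.5 * a**2 / beta, a - 0.5 * beta)
```

Near zero the loss is quadratic like MSE, while large errors grow only linearly like L1, so outliers (e.g. HRTFs from a mismatched database) pull on the gradient less strongly than under MSE.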
This indicates that the models are over-fitting the training data. Over-fitting is normal in this case considering the limited number of HRTF datasets in the training data (230 in total).
The effects of data augmentation are shown in Table 5, where it can be seen that there are some drawbacks with the training and validation data, and a slight drawback with the SADIE Subject 20 test data. However, the extra data provides significant improvement with the Bernschutz interpolation. This demonstrates that the extra variety of measurements helps the model generalise better across different measurement methods.
The most effective way to reduce over-fitting is to train with more data. However, given the limited HRTF measurements available, different regularisation techniques can be utilised to improve the baseline result, which are discussed in this section.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 10 July 2020 doi:10.20944/preprints202007.0209.v1

On the other hand, it is interesting to find that the results with the Bernschutz KU100 data also suffer from over-fitting. As there are different KU100 HRTFs in the training data, and the KU100 measurements in the SADIE II database are very similar to the Bernschutz KU100 measurements, it is unexpected to see the model perform quite poorly compared to the training and validation results. More oddly, later sections show that the Bernschutz KU100 measurements do not seem to benefit from any regularisation methods. One plausible hypothesis is that the current model was only trained with 6 HRTF databases, which represent 6 different measurement setups. The model requires a larger variety of inputs to be able to generalise across different measurement setups and methods.

Model with weight decay
Weight decay is also known as L2 parameter regularisation or ridge regression. It is a common regularisation method for reducing over-fitting in training. The idea of weight decay is to penalise large weights in order to simplify the model and reduce over-fitting. This paper uses a weight decay rate of 0.001 as an experiment to see the effect of this regularisation method. The result does not show any positive impact on the SADIE II Subject 20 dataset, and there is a major drawback with the Bernschutz KU100 data, as the error increased from 67.166 to 71.888 (Table 7).
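The mechanism is easiest to see in a single gradient step: the decay term pulls every weight toward zero in proportion to its size. A minimal sketch (plain SGD rather than the Adam optimiser used in the paper):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=0.001):
    """One SGD step with L2 weight decay: the effective gradient is the
    loss gradient plus weight_decay * w, shrinking large weights."""
    return w - lr * (grad + weight_decay * w)
```

With a zero loss gradient the weights simply decay geometrically, which is why the method tends to keep the model simple. In PyTorch the same behaviour is exposed via the optimiser's `weight_decay` argument.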

Model with dropout
Dropout randomly "drops out" a percentage of nodes in the neural network during training. The idea is to avoid co-adaptation between nodes: since no pair of nodes is ever guaranteed to be active together, the model cannot over-rely on a few nodes within a layer. It can also be seen as randomly sampling from the exponential number of possible narrow sub-networks during training, then averaging the performance of all these combinations at test time. Note that in this model dropout is applied only to the fully connected layers, not the convolutional layers. The model uses a 20 percent dropout ratio on the second to fourth fully connected layers. This method produces worse results on the test data compared to the baseline model, especially with the held-out SADIE II data, but performs slightly better on the validation data. However, such slight differences may be introduced by the randomness of machine learning training. According to these results, dropout seems to have a more negative impact on regularisation than weight decay. In theory it is possible to increase the dropout ratio or use a different configuration to strengthen the regularisation effect. However, finding the optimal architecture and hyper-parameters is beyond the scope of this paper and could be part of future work.

Table 6. MSE between different models during the training process
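The standard "inverted" formulation of dropout can be sketched as follows: survivors are scaled by 1/(1-p) during training so the expected activation is unchanged, and the layer becomes the identity at test time:

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Inverted dropout: zero a fraction p of activations during training
    and scale survivors by 1/(1-p); identity at test time."""
    if not training:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each activation with prob 1-p
    return x * mask / (1.0 - p)
```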

Combining weight decay and dropout
Combining weight decay and dropout shows the best result on the held-out SADIE II data, despite a noticeable trade-off with the Bernschutz dataset. This result indicates that by combining weight decay and dropout, the model can generalise better across unforeseen HRTF subjects (the SADIE hold-out) but not across measurement methods (Bernschutz). As this model performs best with the SADIE II test data, it is the proposed model to be further analysed in Section 4.
It is interesting that the combination of the two methods shows a large difference in results where neither method does individually. It is not clear whether the improvement comes from the interaction of the techniques or simply from the cumulative regularisation strength. This could be an individual research topic to be investigated in the future.

Early stopping
Early stopping is a regularisation method that is sometimes not considered good practice in machine learning training because it breaks the principle of orthogonalisation and makes hyper-parameter tuning difficult [48]. Another reason this method is controversial is that the results can be hard to reproduce and compare across different models. However, according to the learning curves of the model combining weight decay and dropout in Figures 8 and 9, early stopping should perform slightly better with the test data, especially with the test set from SADIE II. To demonstrate the effect, the proposed model was retrained and stopped at 111 epochs.

To shorten the training time when comparing across different methods, and considering the limited size of RAM, the models discussed above were trained with 50,000 HRTFs randomly sampled from different angles of the training and validation HRTF sets. However, the best way to reduce over-fitting is to increase the size of the training set. To investigate what the model is capable of with more data, a baseline model was trained with 633,000 HRTF measurements. Training this amount of data takes a long time per epoch. To speed up the process, the training batch size was increased to 32, whilst the validation and test sets remained at 8 for better comparison. Two models, the baseline model and the model with weight decay and dropout, were trained with the extra data. The baseline model trained with extra data showed major improvement with the Bernschutz dataset, alongside the training and validation sets. However, there is also a noticeable trade-off with the SADIE II Subject 20 test data.
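The early stopping rule discussed at the start of this section, halting once validation loss stops improving, can be sketched as a small tracker; the patience value is an illustrative assumption, not the paper's setting:

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for
    `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```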
It is believed that the improvement in the Bernschutz dataset is the result of the model having more examples of the KU100 HRTFs with different measurements at different angles, so it can generalise the measurement method better. The trade off in the SADIE II Subject 20 test data may be caused by the increased batch size, which can induce worse performance [46,47]. Nevertheless, as the extra data is within the same distribution, it is quite unlikely it could provide any noticeable performance improvement with unforeseen HRTF measurement subjects.

Bigger model
Considering the current results and the limited amount of labelled data, training a bigger model is against normal machine learning practice.
To demonstrate the potential capability of the proposed method and provide insight for future research, a slightly deeper model was also trained with the extra data. The goal here is to minimise the training and validation error as much as possible, neglecting any trade-off in the test results. To balance model size and training time, only the convolutional network is changed: an extra convolution layer and transposed convolution layer pair was added to the convolutional model.
As this model focuses only on the training and validation results, the dropout and weight decay regularisation methods were removed, as they would defeat the purpose of using a bigger neural network. The model was trained with 633,000 HRTFs with a batch size of 32, similar to Section 3.3.5, as bigger models usually work better with more data.
Compared to the baseline model, the training time per epoch for the bigger model increased from 6 minutes to 26 minutes. The model was trained for 500 epochs and the results are shown in Table 6. As expected, the model provides a huge improvement in training and validation, but not much improvement on the SADIE II test data. On the other hand, it is interesting to note that it provides the best performance for the Bernschutz data, where, with the extra data, the results seem to align with the training and validation data. The hypothesis is that more data with a bigger model is the key to generalising across different measurement methods. However, more experiments with a wider variety of HRTF sets are needed to draw firm conclusions.

Evaluation
In this section, the results of the proposed model (with weight decay and dropout) are further analysed for perceptual difference and localisation performance. We utilise perceptual models based on these two criteria in order to provide more robust benchmarking.

Perceptual Spectral Difference
To formally estimate the perceptual performance, the results were further analysed with a Perceptual Spectral Difference (PSD) model [5]. This model calculates the difference between two binaural signals or HRTFs and presents a perceptually motivated comparison of spectral differences. This paper compares the HRTFs before and after the restoration process against the actual HRTF measurements.
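For intuition only, the kind of quantity involved can be sketched as a mean absolute magnitude difference in dB. This is a crude, unweighted stand-in: the actual PSD model of [5] additionally applies perceptual weighting, which this sketch does not attempt to reproduce:

```python
import numpy as np

def spectral_difference_db(H_a, H_b, eps=1e-12):
    """Mean absolute dB magnitude difference across frequency bins.
    A simplified proxy, NOT the PSD model of [5]."""
    a = 20.0 * np.log10(np.maximum(np.abs(H_a), eps))
    b = 20.0 * np.log10(np.maximum(np.abs(H_b), eps))
    return np.mean(np.abs(a - b))
```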
The comparison between mean PSD before and after reconstruction for the different HRTF datasets is shown in Table 7 and Figure 11, with the minimum and maximum PSD plotted. The results show that the model provides significant improvement in PSD across all datasets. However, the minimum and maximum in Figure 11 show that the model seems to introduce higher PSD error in some cases. For most applications that use HRTFs, smoothness across all angles is more crucial than the average performance. Further analysis with the box plot in Figure 12 shows that although the model may introduce more extreme outliers with unforeseen HRTF measurement subjects (SADIE Subjects 19 and 20), it still improves the majority of the HRTFs and reduces the interquartile range (IQR) of the results. According to Figures 11 and 12, Subject 19 from the SADIE II database has a worse maximum PSD and many outliers. To investigate the cause, Figure 13 shows the PSD comparison at different angles for Subject 19 from the SADIE II database: the left panel shows the SH interpolated HRTFs before restoration and the right panel shows the result after processing with the ML model. The figure shows that most of the high PSD results were introduced in the lower frontal region. It is unclear what caused the increased error, but one hypothesis is that the abnormality was caused by the shadowing effect of the knees, as the SADIE II database, along with most of the other databases, was measured with subjects sitting on a chair.
However, Figure 14 shows the PSD comparison for the held-out Subject 20 from the SADIE II database. Besides a small area of minor PSD increase in the very low frontal region, the restored result shows no significant abnormality in any region. To better understand the cause of the abnormality in the Subject 19 hold-out data, extra tests with more HRTF sets are required. Figure 15 shows the PSD comparison for the Bernschutz KU100 data. The model seems to perform better with an unforeseen measurement method, as it shows improvement in the PSD at all angles. It is worth noting that the lower frontal region in the figures does not show any oddly high PSD results, perhaps because the KU100 is a head-only dummy without a torso.

Localisation performance
The localisation performance was analysed with May's model and Baumgartner's model in the Auditory Modelling Toolbox (AMT) [49-51]. May's model covers frontal azimuth localisation on the horizontal plane and Baumgartner's model covers the frontal sagittal plane. Table 7 shows the localisation results of the proposed model. The frontal azimuth localisation mean errors from May's model show improvement for all HRTF datasets. However, for the frontal sagittal plane with Baumgartner's model, the RMS errors and quadrant errors indicate that all the reconstructed HRTFs seem to perform worse in frontal sagittal plane localisation.
The current model may suffer in localisation tasks because it used smooth L1 loss as the loss function. The smooth L1 loss only focuses on the magnitude difference at each frequency point, and there may not be any meaningful connection between those magnitude differences and localisation performance. Therefore, the model fails to optimise the HRTFs for localisation. Future models could add some form of localisation error to the loss function to improve the localisation results.

Discussion and future work
The goal of this paper is to test the hypothesis that machine learning can be used to restore distorted interpolated HRTFs. To make a convincing argument, this paper picked one of the more challenging situations, based on 1st order SH and 6 measurements. Models with higher order SH and more measurements should perform better than the current one, as less data needs to be restored.

The results show that a simple ML model can be used to restore distorted SH interpolated HRTFs, although the current state of this model is far from optimised for application. It is believed that there will be significant improvement if more HRTF measurements become available for training in the future. Under the current circumstances, one way to improve the model is through hyper-parameter tuning, including the parameters for regularisation.
An alternative way to reduce over-fitting may be data augmentation. HRTF measurements are expensive and tedious, so a large increase in HRTF measurement data is unlikely in the near future. One possible way to augment the current dataset is to use more sparse HRTF configurations or different SH orders to train the model. Table 5 shows that even if the extra data is not perfect, it can still improve model performance in some cases.
Similar to data augmentation, noise injection is a different regularisation method that has been shown to work better than weight decay in some cases [52-54]. By picking the right parameters, it is believed that it could generalise better across measurement methods, as the model could focus on the information common to various HRTF measurements rather than the artefacts introduced by different measurement methods.
Another problem observed with the current model is the localisation performance. Although it shows some improvement in the horizontal plane, the sagittal error needs improvement. As discussed in Section 4, it is believed that the smooth L1 loss function only compares the difference at each frequency and fails to capture other useful properties of HRTFs. Potentially, custom loss functions could be implemented to improve the model. Gatys et al. [55] trained a separate machine learning model for content loss and implemented a special function for style loss. A similar method, such as training a localisation model to act as a localisation loss function, may be able to solve the problem. Furthermore, with the recent success of generative adversarial networks (GANs), it should be possible to build a GAN based on localisation performance [29,31,56-58]. However, as a GAN can be unstable to train and usually requires a lot of tuning, it may not be the most effective approach for SH interpolated HRTF restoration. This paper aims to give a general insight into using machine learning for HRTF reconstruction. According to Tables 6 and 7 and the discussion in Section 4, to apply the idea in real-life applications, optimising the model for narrower tasks should yield better performance. With a more specific application in mind, not only can the parameters of the model be changed; the model can also be trained with cleaner or more specialised data for the task. Alternatively, transfer learning based on the current model can provide a head start for these applications.

Conclusion
HRTF interpolation in the SH domain often suffers from distortion in the high frequencies. With recent developments in machine learning algorithms, this paper has shown that it is possible to restore distorted SH interpolated HRTFs with a ML model. Although the proposed method suffers from over-fitting, it still shows improvements in perceptual difference and localisation performance. It is believed that with more training data in the future, the model performance can be vastly improved. However, HRTF measurements can be difficult and time consuming to obtain. With the current size of available data, optimising the model for a narrow use case with some hyper-parameter tuning and data augmentation should have the best chance of putting the model into real world applications.