Article
Peer-Review Record

Effects of Spatial Frequency Filtering Choices on the Perception of Filtered Images

by Sabrina Perfetto 1, John Wilder 2 and Dirk B. Walther 2,3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 29 February 2020 / Revised: 13 May 2020 / Accepted: 22 May 2020 / Published: 26 May 2020

Round 1

Reviewer 1 Report

General Synopsis:  The current study explores the impact that various image filtering procedures (including contrast normalization and pixel sign) have on scene categorization performance (accuracy).  Specifically, the authors explore contrast normalization of filtered images, the type of filter used, and how the images are displayed (i.e., varying the sign of the pixels).  Their results show, as expected, that recognition accuracy can vary with different procedural choices in all of the above.

 

General Comments:  The issues addressed in this study are indeed crucial to any study that seeks to manipulate visual stimuli in a way that targets the various tuning profiles of neural populations within striate cortex and beyond.  Overall, the manuscript is well written and does a good job communicating digital signal processing concepts with as little jargon as possible.  That said, I wonder if the target audience of this journal is appropriate for the points that the authors aim to support.  The issues raised here have been known (and dealt with) in visual psychophysics dating as far back as the early 1960’s.  Nothing here will come as a surprise to anyone in the psychophysics community.  It seems that this work would be better suited for a more cognitive journal (research related or methods related) where the audience is less well versed in signal processing techniques to render stimuli that target different visual neurons.  Regardless, there are a number of crucial psychophysical and signal processing details (some potentially damaging to the validity of this work) that need to be clarified.  I have listed my specific concerns below in no particular order.

 

Specific Comments:

 

1)  I recognize that the authors are testing aspects of filter design that seem to be used frequently in the scene perception literature, but if we’re really going to talk about appropriate filter design, then none of the filters tested here are appropriate.  For example, those who insist on using lowpass and highpass filtering typically do not take into account that visual neurons exhibit bandpass tuning profiles that pass ~2.5 octaves worth of spatial frequency content at the lower frequencies to ~1 octave or less at the higher frequencies.  Further, the differential distribution of tuning profile peaks is disproportionately lumped over the higher SFs (e.g., more higher SF tuned neurons are needed to tile the visual field).  That said, if one is truly interested in testing the relative high-level contribution of different populations of neurons tuned to different SFs, then the filters used to process the images must be bandpass and have a tuning bandwidth that matches the underlying populations of interest.  The lowpass filters presented here actually cut the low SF tuned populations short (their bandpass functions are not completely filled with SF content), and the highpass filters actually pass several octaves of spatial frequency content.  Therefore, from the start, there is an imbalance in terms of how much ‘information’ is passed to each neural population, thereby confounding any comparative analysis of performance related to lowpass or highpass filtered images.  So, it would seem that a more fundamental question of appropriate filter design is missed in this study, which significantly limits the overall impact of this study.

 

2)  Related to the comment above, the highpass filters used in this study are not only heavily biased to high SFs (compared to the tuning of their lowpass filters), but they are also biased in the amount of spatial frequencies at the obliques that are passed (consider that the frequencies along the oblique radial axes in the Fourier domain increase by a factor of sqrt(2)).  The authors really should have used a circular window in the Fourier domain to balance that, which is standard practice.

 

3)  Another important filtering issue that the authors do not address is reporting the range of RMS contrast of the images prior to (or after) filtering.  This is an important issue as images that are already high in RMS contrast (and may already suffer from pixel value clipping) will result in images that have a high degree of pixel clipping after filtering.  Clipping is a problem because it introduces a nonlinearity into the images that may impact performance (depending on the severity of the clipping).  The authors do mention that their filtered images did suffer from pixel value clipping, but again, we are not presented with any information about the proportion of clipping that was observed in any of the stimuli in any of their experiments.  Given the amount of contrast that is dumped into either the low or high SF images (when normalized), I suspect that pixel clipping was substantial.  And, their method for dealing with clipping possibly exacerbates the nonlinearity. 

 

4)  Another crucial detail that is left out of all experiments is whether or not the experimental display was gamma corrected.  This is not a concern when presenting actual images as stimuli because the camera that sampled the images sets a nonlinear gamma so that the output will be linear on any given display (any electronic display of visual information is actually nonlinear – typically using a gamma of 2.2 or 2.5).  However, when an image is filtered (or otherwise generated through any number of image synthesis algorithms), the linear luminance values will be presented nonlinearly on the display.  This nonlinearity heavily favors the darker pixel values and therefore biases how that information is encoded in the visual system and alters how different spatial frequencies contribute to the activation of higher visual areas.

 

5)  While the choice of filter is important, equally important are the steps leading up to the filtering (either in the spatial domain or Fourier domain).  No information is provided about how the images were prepared prior to the discrete Fourier transform.  Natural images are non-periodic, meaning that there will be considerable ‘edge effects’ in the Fourier domain if not windowed (or at least padded).  However, because the authors intend to use filtered images as stimuli, then the best approach to minimize such edge effects is to symmetrize the images prior to the transform (and adjust the filtering parameters to account for the larger image size).  Such an approach allows one to recover the full image and not incur filter responses to the typical smooth windowing that is applied to images before submitting them to Fourier analysis.  These concepts are covered in many introductory signal processing textbooks, so I encourage the authors to spend some time with those if they want to report a study that sets important guidelines (and toolboxes) for the scene perception community.
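The symmetrization the reviewer describes can be implemented, for example, by mirroring the image about its right and bottom edges before the transform. The following is a generic sketch of that idea, not code from the study:

```python
import numpy as np

def symmetrize(img):
    """Mirror an image about its right and bottom edges so that the
    periodic extension assumed by the DFT is continuous (no wrap-around
    discontinuities at the image borders). The original image occupies
    the top-left quadrant and can be recovered exactly after filtering
    the enlarged array."""
    top = np.hstack([img, np.fliplr(img)])   # mirror left-right
    return np.vstack([top, np.flipud(top)])  # mirror top-bottom
```

Because the symmetrized array is twice as large in each dimension, filter cutoffs specified in cycles per image must be doubled accordingly, and the filtered stimulus is recovered by cropping the top-left quadrant.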

 

6)  While I understand the desire to mask the stimuli, using broadband masks likely introduces confounds with respect to how the mask interferes with the processing of the available SFs in the filtered images.  Without knowing the ratio of Fourier power (normalized or not) between the filtered images and the corresponding SF band in the masks leaves the reader wondering about how best to interpret the observed difference in the different manipulations that were explored in this study.  Not only that, but there exist a number of psychophysical studies that show how far reaching contrast gain control (i.e., the extent of the normalization pool) can be when the masks (either backward or simultaneous) are broadband.  The masks used in this study really should have consisted of filtered noise with the same amount of Fourier power as the filtered stimuli.

 

Minor Concerns

 

1)  The stimuli should have been zero-meaned prior to filtering – that would have alleviated the convoluted approach to dealing with the presence of the DC in the low SF filtered images.

 

2)  The authors mention that all participants had normal or corrected to normal vision.  How was that measured?

 

3)  What were the pixel dimensions of the stimuli?  Such information is important to report given the coarse resolution of the experimental display (800x600).

 

4)  Given the differential effects of filter type reported in Experiment 2, what was the rationale for choosing the Butterworth filter in Experiment 1?

 

5)  I don’t understand why the authors chose to include an ideal filter (i.e., heaviside filter) in Experiment 2.  Such filters are well known to produce heavy ringing, and nobody uses them.  Including it here doesn’t make much sense for that reason.

Author Response

Dear Reviewer 1

Thank you for your thoughtful review. Please find our point-by-point response below in blue.

Sincerely,

Dirk B. Walther, on behalf of all authors

 

General Comments:  The issues addressed in this study are indeed crucial to any study that seeks to manipulate visual stimuli in a way that targets the various tuning profiles of neural populations within striate cortex and beyond.  Overall, the manuscript is well written and does a good job communicating digital signal processing concepts with as little jargon as possible.  That said, I wonder if the target audience of this journal is appropriate for the points that the authors aim to support.  The issues raised here have been known (and dealt with) in visual psychophysics dating as far back as the early 1960’s.  Nothing here will come as a surprise to anyone in the psychophysics community.  It seems that this work would be better suited for a more cognitive journal (research related or methods related) where the audience is less well versed in signal processing techniques to render stimuli that target different visual neurons. 

We appreciate that the reviewer finds our results relevant and well presented. We agree that these issues are not new to the low-level psychophysics community. The choices presented here are, however, not always carefully considered in the realm of natural scene perception. Since Vision draws on a broad community spanning all areas of vision science, we believe that this journal is the right venue for this work. Ultimately, we leave this decision to the editor.

Regardless, there are a number of crucial psychophysical and signal processing details (some potentially damaging to the validity of this work) that need to be clarified.  I have listed my specific concerns below in no particular order.

 

Specific Comments:

1)  I recognize that the authors are testing aspects of filter design that seem to be used frequently in the scene perception literature, but if we’re really going to talk about appropriate filter design, then none of the filters tested here are appropriate.  For example, those who insist on using lowpass and highpass filtering typically do not take into account that visual neurons exhibit bandpass tuning profiles that pass ~2.5 octaves worth of spatial frequency content at the lower frequencies to ~1 octave or less at the higher frequencies.  Further, the differential distribution of tuning profile peaks is disproportionately lumped over the higher SFs (e.g., more higher SF tuned neurons are needed to tile the visual field).  That said, if one is truly interested in testing the relative high-level contribution of different populations of neurons tuned to different SFs, then the filters used to process the images must be bandpass and have a tuning bandwidth that matches the underlying populations of interest.  The lowpass filters presented here actually cut the low SF tuned populations short (their bandpass functions are not completely filled with SF content), and the highpass filters actually pass several octaves of spatial frequency content.  Therefore, from the start, there is an imbalance in terms of how much ‘information’ is passed to each neural population, thereby confounding any comparative analysis of performance related to lowpass or highpass filtered images.  So, it would seem that a more fundamental question of appropriate filter design is missed in this study, which significantly limits the overall impact of this study.

We appreciate the reviewer's views, and we agree that a carefully controlled band-pass filter would be the correct choice when considering the SF tuning of specific neural populations, for instance in striate cortex, as in Blakemore and Campbell (1969) and many subsequent studies. However, this is not the objective here. In cognitive psychology and cognitive neuroscience, distinctions in visual processing are frequently made according to the broad discrimination into low and high spatial frequencies, and the filters used are invariably high- or lowpass filters, not bandpass. It would be an interesting discussion to argue the merits of these approaches versus bandpass filtering for specific experimental goals. However, that discussion is outside the scope of our current manuscript. Our goal here is to illuminate the consequences of particular popular filtering choices on cognitive performance.

2)  Related to the comment above, the highpass filters used in this study are not only heavily biased to high SFs (compared to the tuning of their lowpass filters), but they are also biased in the amount of spatial frequencies at the obliques that are passed (consider that the frequencies along the oblique radial axes in the Fourier domain increase by a factor of sqrt(2)).  The authors really should have used a circular window in the Fourier domain to balance that, which is standard practice.

We are in fact using circular windows. Please see equations 2 and 7, where we use the term (f_x^2 + f_y^2) in place of f^2 so that all filtering is completely radially symmetric. Therefore, oblique orientations are treated exactly the same as cardinal orientations. We have added language making this point clearer (lines 124-125):

Note that the radial symmetry of the filter implies that cardinal and oblique orientations are filtered in the same way.
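A radially symmetric second-order Butterworth filter of this kind can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' code; the function name and the cutoff value used in the example are assumptions:

```python
import numpy as np

def butterworth_lowpass(height, width, cutoff, order=2):
    """Radially symmetric Butterworth low-pass filter in the Fourier
    domain. `cutoff` is in cycles per image; `order` controls the
    steepness of the roll-off. Because the response depends only on
    fx**2 + fy**2, oblique and cardinal orientations are attenuated
    identically."""
    fy = np.fft.fftfreq(height) * height   # vertical frequencies (cycles/image)
    fx = np.fft.fftfreq(width) * width     # horizontal frequencies
    fxx, fyy = np.meshgrid(fx, fy)
    radius_sq = fxx**2 + fyy**2
    # Standard Butterworth form: 1 / (1 + (f / f0)^(2 * order))
    return 1.0 / (1.0 + (radius_sq / cutoff**2) ** order)

# The complementary high-pass filter is simply 1 minus the low-pass response.
lp = butterworth_lowpass(600, 800, cutoff=8)
hp = 1.0 - lp
```

The filter is applied by multiplying it with the image's Fourier amplitude spectrum and inverse-transforming the result.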

 3)  Another important filtering issue that the authors do not address is reporting the range of RMS contrast of the images prior to (or after) filtering.  This is an important issue as images that are already high in RMS contrast (and may already suffer from pixel value clipping) will result in images that have a high degree of pixel clipping after filtering.  Clipping is a problem because it introduces a nonlinearity into the images that may impact performance (depending on the severity of the clipping).  The authors do mention that their filtered images did suffer from pixel value clipping, but again, we are not presented with any information about the proportion of clipping that was observed in any of the stimuli in any of their experiments.  Given the amount of contrast that is dumped into either the low or high SF images (when normalized), I suspect that pixel clipping was substantial.  And, their method for dealing with clipping possibly exacerbates the nonlinearity.

Thank you for pointing out these issues. RMS contrast was determined separately for each set of images generated from the same original. As the reviewer indicated, clipping is a non-trivial operation that is necessary due to the limited range of contrast values that can be displayed. If clipping were avoided entirely, the images would end up with very low contrast. We clipped pixels with values more than two standard deviations away from the mean. As expected, and verified empirically, this leads to clipping of about 4% of pixels on average. We have added more details on this operation (lines 137-140):

To limit values to a reasonable dynamic range for displaying, we clipped extreme values more than two standard deviations away from the mean to -2 and 2, respectively (on average 4% of pixels). Since clipping slightly reduces contrast, we divided by the standard deviation once more to guarantee equal RMS contrast across all versions of the image.
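The clipping and renormalization step might look like this in code. This is a sketch under the assumptions stated above, not the authors' implementation:

```python
import numpy as np

def clip_and_renormalize(img, n_sd=2.0):
    """Z-score the image, clip values beyond +/- n_sd standard
    deviations, then rescale so that every version of an image ends
    up with the same RMS contrast (unit standard deviation here)."""
    z = (img - img.mean()) / img.std()   # zero mean, unit RMS contrast
    clipped = np.clip(z, -n_sd, n_sd)    # truncate extreme values
    # Clipping slightly shrinks the standard deviation, so divide by
    # it once more to restore equal RMS contrast across image versions.
    return clipped / clipped.std()
```

Note that the final division pushes the clipped extremes slightly beyond +/- n_sd again; the guarantee is on RMS contrast, not on the absolute range.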

4)  Another crucial detail that is left out of all experiments is whether or not the experimental display was gamma corrected.  This is not a concern when presenting actual images as stimuli because the camera that sampled the images sets a nonlinear gamma so that the output will be linear on any given display (any electronic display of visual information is actually nonlinear – typically using a gamma of 2.2 or 2.5).  However, when an image is filtered (or otherwise generated through any number of image synthesis algorithms), the linear luminance values will be presented nonlinearly on the display.  This nonlinearity heavily favors the darker pixel values and therefore biases how that information is encoded in the visual system and alters how different spatial frequencies contribute to the activation of higher visual areas.

Thank you for pointing out this issue. It is important to distinguish between perceptual gamma (the relationship between physical and perceived luminance) and hardware gamma (the relationship between the voltage of the cathode ray tube and the physical luminance of the CRT screen). Perceptual gamma is determined by the processing pipeline of the cameras that were used to record the photographs. Since we downloaded the images from a variety of sources on the internet, we have no control over or knowledge of the gamma corrections performed by the cameras. The hardware gamma of our CRT screens was set by the manufacturer, and we did not change the defaults. The reviewer correctly points out that perceptual gamma needs to be taken into account for synthesized stimuli. However, since the Fourier transform, multiplicative filtering in Fourier space, and linear contrast adjustments are all linear operations, the gamma correction encoded in the photographs is preserved throughout our processing pipeline. We therefore believe that additional gamma correction of the filtered images is not only unnecessary but would be wrong, because the nonlinear relationship between physical luminance and perceived brightness would then be corrected for twice, once in the camera and once more in our filtering operations. As a consequence, we abstained from any additional corrections. Since this set of issues is fairly complex and not the main point of our study, we prefer not to discuss it at this level of detail in the manuscript. We have added clarifying language to the methods section (lines 105-107):

Since perceptual gamma correction is typically part of the camera processing pipeline, we did not perform any additional gamma adjustments.

5)  While the choice of filter is important, equally important are the steps leading up to the filtering (either in the spatial domain or Fourier domain).  No information is provided about how the images were prepared prior to the discrete Fourier transform.  Natural images are non-periodic, meaning that there will be considerable ‘edge effects’ in the Fourier domain if not windowed (or at least padded).  However, because the authors intend to use filtered images as stimuli, then the best approach to minimize such edge effects is to symmetrize the images prior to the transform (and adjust the filtering parameters to account for the larger image size).  Such an approach allows one to recover the full image and not incur filter responses to the typical smooth windowing that is applied to images before submitting them to Fourier analysis.  These concepts are covered in many introductory signal processing textbooks, so I encourage the authors to spend some time with those if they want to report a study that sets important guidelines (and toolboxes) for the scene perception community.

Thank you for pointing out this issue with the Discrete Fourier Transform. We are aware of the artifacts  that arise from the periodicity assumption in DFT. However, each of the potential remedies (mirroring, zero-padding, windowing) introduces other types of artifacts (overestimation of image symmetry, bleeding away contrast energy over the zero-padded edge, spurious Fourier representations of the windowing function). We believe that such pre-treatment of the images would be important if we were interested in a quantitative analysis of the Fourier spectra of the images. But since we are interested in filtering SFs and then inverse transforming back into image space, we believe that such treatments would be detrimental. We have added language to discuss this particular choice to the methods section in lines 114-116:

As we used the Fourier transform for frequency filtering rather than analysis of the spatial frequency spectra themselves, we did not include any measures to address artifacts arising from the periodicity assumption inherent in the Fast Fourier Transform implementation.

6)  While I understand the desire to mask the stimuli, using broadband masks likely introduces confounds with respect to how the mask interferes with the processing of the available SFs in the filtered images.  Without knowing the ratio of Fourier power (normalized or not) between the filtered images and the corresponding SF band in the masks leaves the reader wondering about how best to interpret the observed difference in the different manipulations that were explored in this study.  Not only that, but there exist a number of psychophysical studies that show how far reaching contrast gain control (i.e., the extent of the normalization pool) can be when the masks (either backward or simultaneous) are broadband.  The masks used in this study really should have consisted of filtered noise with the same amount of Fourier power as the filtered stimuli.

We agree with the reviewer's concern from a signal processing point of view. However, from an experimental psychology perspective it is imperative that the masks for all conditions must be the same. Otherwise, differences in performance cannot be uniquely attributed to the properties of the stimuli. Please note that we make no particular claims about the effects of the masks on the spectral properties of the images. The role of the masks here was merely to make the task difficult enough and to erase the retinal image following stimulus presentation. Hence, we opted for a broadband mask. We have added language discussing this aspect (lines 170-173):

Perceptual masks were constructed from noise textures with a broad spatial frequency spectrum. Masks were used to control the duration of the retinal image by flooding the visual system with a high-contrast, broadband signal. Masks were the same for all experimental conditions.

Minor Concerns

1)  The stimuli should have been zero-meaned prior to filtering – that would have alleviated the convoluted approach to dealing with the presence of the DC in the low SF filtered images.

As the reviewer correctly points out, it would have been cleaner to remove the mean from the images prior to filtering and add it back in afterward. In practice this did not make a difference for the generation of our stimuli. All stimuli, except for the unnormalized condition in Experiment 1, were jointly normalized for equal luminance (mean) and RMS contrast (standard deviation), thereby removing the mean from all versions of the images. For the unnormalized condition in Experiment 1, the mean of the original image was passed to the LSF image by the low-pass filter. We explicitly added the mean back to the HSF image. This was incorrectly stated as adding medium gray in the previous version of the manuscript and is now corrected (lines 128-130):

For the “unnormalized” conditions we applied no further normalization to the LSF images. The mean (DC component) of the original images was passed to the LSF image by the low-pass filter. We explicitly added the mean of the original, full-spectrum image to the HSF image.
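The handling of the DC component in the unnormalized condition might be sketched as follows. This is illustrative only, not the authors' code; the low-pass filter is assumed to be a precomputed Fourier-domain array with unit response at DC:

```python
import numpy as np

def split_bands_unnormalized(img, lp_filter):
    """The DC component (image mean) passes through the low-pass
    filter unchanged, so only the high-pass (HSF) image needs the
    original mean luminance added back."""
    F = np.fft.fft2(img)
    lsf = np.real(np.fft.ifft2(F * lp_filter))          # retains the mean
    hsf = np.real(np.fft.ifft2(F * (1.0 - lp_filter)))  # zero-mean output
    hsf = hsf + img.mean()  # restore the original, full-spectrum mean
    return lsf, hsf
```

By construction, the low- and high-pass images sum to the original image plus one extra copy of its mean, which makes the DC bookkeeping easy to verify.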

2)  The authors mention that all participants had normal or corrected to normal vision.  How was that measured?

Normal vision was assessed by self-report. This is now clarified in the methods sections, e.g. on lines 100-101:

All participants reported normal or corrected-to-normal vision and provided written informed consent.

3)  What were the pixel dimensions of the stimuli?  Such information is important to report given the coarse resolution of the experimental display (800x600).

The resolution of the images was also 800x600 pixels. They were presented on the full screen. We chose such a relatively low spatial resolution because it allowed us to drive the monitors at a relatively high temporal frequency of 150 Hz, therefore enabling us to control the duration of stimuli and masks more finely. This is now clarified on lines 104-105:

We used a set of 432 color photographs (800 x 600 pixels) of scenes of six categories: beaches, forests, mountains, highways, city streets, and offices, downloaded from the internet.

4)  Given the differential effects of filter type reported in Experiment 2, what was the rationale for choosing the Butterworth filter in Experiment 1?

Based on prior literature from our and other labs, a second-order Butterworth filter is a sensible and popular choice. We added the alternative filter choices in Experiment 2 later, when we discovered that other filter shapes are also used frequently. We have added a sentence to clarify this choice (lines 117-119):

We multiplied the amplitude spectrum with a radially symmetric second-order Butterworth filter, which has been a reasonable and popular choice in other studies [2, 13]. For a comparison of other filter choices see Experiment 2.

5)  I don’t understand why the authors chose to include an ideal filter (i.e., heaviside filter) in Experiment 2.  Such filters are well known to produce heavy ringing, and nobody uses them.  Including it here doesn’t make much sense for that reason.

We agree! Shockingly, it is used in psychophysics and neuroscience experiments (e.g., Vuilleumier  et al., Nature Neuroscience, 2003). Hence we decided to include it here.

 

 

Reviewer 2 Report

Perfetto et al examine how the specific choices of spatial frequency filter and stimulus scaling impact behavioral performance in a categorization task. A large number (21 or 22 across studies) of naive observers completed a 6AFC categorization task for spatial frequency filtered natural scenes. The results confirmed much publicised performance differences between high-pass filtered and low-pass filtered images. The results of Experiment 1 showed that a deficit for high-pass filtered images was corrected by post-filtering contrast normalization. Experiment 2 showed that the performance deficit is for high-pass filtered images when the filter cut-off is sharp, but for low-pass filtered images when the cut-off is gradual. Experiment 3 showed that the effect can be further moderated by converting high-pass filtered images into edge maps. Collectively, these results show that when using spatial frequency filtered images to study visual processing, stimulus specification can drastically affect behavioral data.

This paper provides an important caution to researchers who use spatial frequency filtered natural images and may account for differences among published research findings. The experiments appear to have been conducted carefully and the manuscript is clearly written, and will therefore be of interest to a broad range of researchers across the visual cognitive neurosciences. However, I think the authors should present the results as evidence that some of these parameters affect performance, and avoid making general, preachy recommendations. There may be good reasons not to normalize contrast after filtering or to select a particular filter bandwidth, and there are fundamental problems with the dark line images.

My main concern is with the “HSF dark lines” stimuli used in Experiment 3. These features are equivalent to the zero-bounded regions of Watt and Morgan’s MIRAGE model of edge detection, although they define the edge between these regions and use the distance between them to estimate blur. Firstly, the authors should specify the threshold for dark line classification - this is vaguely specified as large HSF values on Line 404 “we displayed pixels with large HSF values as pixels, irrespective of their sign”. More importantly, the conversion of high-pass filtered images to dark line images re-introduces unspecified low spatial frequency content in the image. Since a major goal of the authors is to urge caution and care with the spatial frequency content of filtered images, this seems inconsistent with that objective. I strongly disagree with their advice on Line 429 “Lastly, we recommend displaying large HSF value of either sign as dark pixels”; the spectra and distribution of luminance and contrast in such images are significantly different from those of the natural images from which they were created.

I also resist some of their preaching concerning best practice. For example, on Line 421 they argue that “contrast normalization is essential”. This, of course, depends on whether the goal is to deliver the same energy to the visual system - for example, the energy at high spatial frequencies in the un-normalized image is the same as it was in the unfiltered image.

Similarly, on Line 425 they arbitrarily conclude that a second order Butterworth filter is a good compromise between cutoff specification and ringing artefacts. Obviously these choices depend on the objective of the study. 

What was the mean luminance of the Dell 800*600 150Hz display?

Author Response

Dear Reviewer 2

Thank you for your thoughtful review. Please find our point-by-point response below in blue.

Sincerely,

Dirk B. Walther, on behalf of all authors

 

Comments and Suggestions for Authors

Perfetto et al examine how the specific choices of spatial frequency filter and stimulus scaling impact behavioral performance in a categorization task. A large number (21 or 22 across studies) of naive observers completed a 6AFC categorization task for spatial frequency filtered natural scenes. The results confirmed much publicised performance differences between high-pass filtered and low-pass filtered images. The results of Experiment 1 showed that a deficit for high-pass filtered images was corrected by post-filtering contrast normalization. Experiment 2 showed that the performance deficit is for high-pass filtered images when the filter cut-off is sharp, but for low-pass filtered images when the cut-off is gradual. Experiment 3 showed that the effect can be further moderated by converting high-pass filtered images into edge maps. Collectively, these results show that when using spatial frequency filtered images to study visual processing, stimulus specification can drastically affect behavioral data.

This paper provides an important caution to researchers who use spatial frequency filtered natural images and may account for differences among published research findings. The experiments appear to have been conducted carefully and the manuscript is clearly written, and will therefore be of interest to a broad range of researchers across the visual cognitive neurosciences. However, I think the authors should present the results as evidence that some of these parameters affect performance, and avoid making general, preachy recommendations. There may be good reasons not to normalize contrast after filtering or to select a particular filter bandwidth, and there are fundamental problems with the dark line images.

Thank you for pointing out that our recommendations may not be universally appreciated. We have toned down language to this effect throughout the manuscript.

My main concern is with the “HSF dark lines” stimuli used in Experiment 3. These features are equivalent to the zero-bounded regions of Watt and Morgan’s MIRAGE model of edge detection, although Watt and Morgan define the edge between these regions and use the distance between them to estimate blur.

Thank you for pointing out this similarity. The MIRAGE model was only defined for 1D luminance patterns. But the reviewer is correct, there are conceptual similarities. We have added a reference to the MIRAGE to the manuscript (lines 333-335):

This operation is similar in spirit to the MIRAGE model of edge representations [28], but for fully two-dimensional complex real-world images instead of one-dimensional luminance patterns.

Firstly, the authors should specify the threshold for dark line classification – this is vaguely specified as large HSF values on Line 404: “we displayed pixels with large HSF values as dark pixels, irrespective of their sign”.

There is no threshold or classification in our procedure. We apologize for the confusing formulation. We simply map values of the HSF image prior to contrast normalization to gray values equivalent to their negative absolute value: small numbers (close to zero) to light shades and large numbers (irrespective of sign) to dark shades. The only non-linearity here is taking the negative absolute value of the filtered results prior to linear mapping to gray scale values. We have clarified the description of the method (lines 346-350):

In addition, we generated a version of the HSF image, in which both negative and positive values were mapped onto dark shades according to their absolute numerical values, whereas values close to zero were mapped onto light shades. This was achieved by computing the negative absolute value of the filtered image immediately following the inverse Fourier transform (Equation 3):
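To make the mapping concrete, the operation can be sketched in Python. This is a minimal illustrative sketch, not the code used for the experiments: the ideal high-pass filter, the cutoff value, and the function name are all assumptions made for demonstration purposes.

```python
import numpy as np

def hsf_dark_lines(image, cutoff=0.2):
    """Sketch of the 'HSF dark lines' display mapping.

    `image` is a 2-D grayscale array in [0, 1]; `cutoff` is an assumed
    radial cutoff expressed as a fraction of the Nyquist frequency (the
    paper's actual cutoff in cycles/degree is not reproduced here).
    """
    h, w = image.shape
    # Radial frequency grid in cycles/pixel (Nyquist = 0.5). An ideal
    # high-pass filter is used here for brevity; the paper also uses
    # Butterworth filters.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    hp = (radius >= cutoff * 0.5).astype(float)

    # High-pass filter in the Fourier domain, then inverse transform.
    filtered = np.real(np.fft.ifft2(np.fft.fft2(image) * hp))

    # The only non-linearity: the negative absolute value, so large
    # filter outputs of either sign become dark, values near zero light.
    dark = -np.abs(filtered)

    # Linear mapping to [0, 1] gray levels for display.
    lo, hi = dark.min(), dark.max()
    return (dark - lo) / (hi - lo) if hi > lo else np.ones_like(dark)
```

Note that the negative absolute value is the only non-linear step; everything before and after it is a linear mapping to display gray levels.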

 

More importantly, the conversion of high-pass filtered images to dark line images re-introduces unspecified low spatial frequency content in the image. Since a major goal of the authors is to urge caution and care with the spatial frequency content of filtered images, this seems inconsistent with that objective.

We primarily interpret the "dark lines" operation as a choice in how to map values of the filtered image to pixel values on the screen. If the absolute value is interpreted as part of the signal processing pipeline, then the reviewer is correct: additional LSF energy is introduced. However, the image still strictly contains only information derived from the HSF spectrum of the original image. We now discuss this issue on lines 448-453:

This last recommendation may be somewhat controversial. Computing the negative absolute value of HSF-filtered pixels introduces a non-linear operation and thereby potentially new low spatial frequency components. We view those operations as an alternate way of displaying images whose original LSF content has already been removed by the high-pass filter. Therefore, the remaining image content is related only to the HSF content of the original image.

 

I strongly disagree with their advice on Line 429 “Lastly, we recommend displaying large HSF values of either sign as dark pixels”; the spectra and distribution of luminance and contrast in such images are significantly different from the natural images from which they were created.

We have softened the tone of this recommendation and added the aforementioned caveat.

 

I also resist some of their preaching concerning best practice. For example, on Line 421 they argue that “contrast normalization is essential”. This, of course, depends on whether the goal is to deliver the same energy to the visual system – for example, the energy at high spatial frequencies in the un-normalized image is the same as it was in the unfiltered image.

Similarly, on Line 425 they arbitrarily conclude that a second order Butterworth filter is a good compromise between cutoff specification and ringing artefacts. Obviously these choices depend on the objective of the study.

That is correct. We have qualified and softened these recommendations. We still argue that contrast normalization is important when separating the selective processing of spatial frequencies by the visual system from the overwhelming effect of contrast. Otherwise results are strongly determined by the distribution of contrast over spatial frequencies in natural images and by the strong effect of contrast on neural activity. See lines 437-442:

Therefore, contrast normalization is essential for isolating the role of spatial frequencies from that of image contrast. Controlling the contrast of stimuli is a matter of course in virtually all of visual psychophysics. Spatial frequency filtering of images should not be an exception to this rule. When experiments do not normalize contrasts, reasons should be provided, and the potentially confounding effect of contrast should be acknowledged.
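As an illustration of this recommendation, post-filtering contrast normalization might be sketched as follows. This is a hedged example, not the paper's pipeline: it assumes RMS contrast is taken as the standard deviation of a [0, 1] image, and the target contrast and mean luminance are placeholder values.

```python
import numpy as np

def normalize_rms_contrast(image, target_rms=0.1, mean_lum=0.5):
    """Rescale an image so its RMS contrast (here: the standard
    deviation of the [0, 1] image, one common convention) matches
    `target_rms` around `mean_lum`. Returns the normalized image and
    the fraction of pixels clipped to the displayable range."""
    centered = image - image.mean()
    rms = centered.std()
    if rms == 0:  # flat image: nothing to normalize
        return np.full_like(image, mean_lum), 0.0
    scaled = mean_lum + centered * (target_rms / rms)
    # Values pushed outside [0, 1] must be clipped for display; the
    # review discusses why this clipping percentage should be reported.
    clipped = np.mean((scaled < 0.0) | (scaled > 1.0))
    return np.clip(scaled, 0.0, 1.0), clipped
```

Tracking the returned clipping fraction per image is one way to quantify how much signal is lost when high-contrast images are normalized.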

 

What was the mean luminance of the Dell 800*600 150Hz display?

We did not measure the mean luminance, unfortunately. Because of Covid-19 restrictions we do not have access to the lab at this time to measure it. While it would be good to report display luminance as part of a complete methods description, we do not believe that this value has any bearing on the results of our paper or their interpretation.

Round 2

Reviewer 1 Report

I’ll start by reiterating my support for the general goals of this paper – I really do want to see it published once corrected for errors and omissions.  However, while the authors have addressed some of my concerns/requests for clarification (mostly the minor concerns), they did not address 5/6 of my primary concerns.  And to be honest, I found several of their responses to be quite dismissive, and others incorrect.  The concerns that remain to be addressed are given below (enumerated as originally listed in my first review).

 

Unaddressed Primary concerns

1) The authors have missed my point here, so I will reiterate and clarify.  I understand that scene perception researchers in cognitive psychology/neuroscience seek to address the role of different spatial frequencies in scene perception and use highpass or lowpass filters to do that.  The problem with doing so is that the bandwidth of the highpass filters is almost always much broader than the lowpass filters.  I do not mean that in terms of filter area, but in terms of the spatial frequency tuning bandwidths of the underlying neural population in striate cortex and beyond (i.e., the major source of inputs to the “higher” visual areas) – see my original comment for details, and the original work of Hugh Wilson and Russell De Valois (among others) on the topic of tuning bandwidths in striate cortex.  This is particularly problematic given that the density of neurons in the “lower” cortical areas is very much skewed towards high spatial frequency tuned neurons (i.e., more smaller receptive fields are needed to tile the visual field than larger receptive fields).  As noted in my original comment, what that means is each filter will differentiate that population with more activation coming from the high spatial frequency tuned population.  Thus, any claim based on unbalanced highpass or lowpass filtering is confounded.  That’s a major problem for researchers who have claimed that some higher scene processing areas respond more to high spatial frequencies than low spatial frequencies.  To be clear, if one uses filters that do not equally activate the differently tuned neural populations (like the filters used in this study), the pathways carrying high spatial frequency information will be activated more than the low spatial frequency population.  Any differences in response magnitude in any scene area receiving that input will have everything to do with how the filters were constructed and almost nothing to do with scene area “preference”.  
All that said, I recognize that the aim of this study is more about normalizing contrast (which is good to control for when relevant), but if how the filters recruit differently tuned neural populations is not accounted for during the filtering process, it really doesn’t matter. That is, contrast balanced or not, lowpass and highpass filtering (when neural tuning and density are not taken into account) is an inappropriate way to address the question of spatial frequency selectivity in scene areas.  Given the lack of knowledge about such things amongst cognitive psychologists/neuroscientists (as the authors point out), I fear that the current paper will perpetuate further studies that are confounded in this way.  In a perfect world, the authors would have included balanced bandpass filters in their test set (as opposed to the pointless ideal rectangle filter), but they didn’t.  Given the COVID-19 situation, running those conditions is not immediately possible (though the authors would be wise to wait and collect meaningful data).  If the authors choose not to wait, then the only recourse that I can see here is for the authors to acknowledge (thoughtfully) how such filtering is the most accurate approach, but here they have focused on the typical lowpass/highpass approach for comparative purposes because that’s what cognitive psychologists/neuroscientists seem to do (albeit a confounded approach).

3)  Thank you for reporting the average percentage of pixel clipping.  However, I mainly asked to report the range of pre-filtered image RMS (e.g., a histogram). Or, a histogram showing the frequency of clipping percentages.  This is important to do for the very reason that I mentioned in my first round of comments.  Briefly, the large amount of Fourier power (i.e., contrast) that is loaded into the high spatial frequency stimuli will surely lead to significant clipping for images that were already high in RMS contrast.

4)  The response to this comment is incorrect.  First, it isn’t at all clear what the authors mean by “perceived gamma”.  Camera sampled images are raised to a power that is linearized when displayed on the monitor (which has its own exponent).  The output is physically linear, there is nothing perceptual about it.  Specifically: optical image^(1/n) = stored image; stored image^g = displayed image, where ^(1/n) is the camera process, and ^g is the display.  A good account of this is given in the book “Vision Research” by Roger Carpenter and John Robson.  Crucially, the authors’ assertion that linear filtering will preserve the camera’s process when displayed is incorrect.  If we were just analyzing the power within different frequency bands, then there would be no problem here (while recognizing that the image has been raised to a certain power).  However, we are talking about using filtered images as stimuli presented on a nonlinear monitor.  Yes, an image by itself would be displayed as intended (i.e., linear output), but what is frequently overlooked here is that the filter itself in the spatial domain actually constitutes an image.  If one were to present the spatial version of those filters on a monitor as stimuli, they would not be linear in terms of how they are displayed.  By convolving images with those filters, one is effectively ‘weighting’ (visually on the monitor) the image with a visually uncorrected filter (i.e., multiplication of an image and a filter in the Fourier domain is identical to convolution of that image with the spatial version of that filter in the spatial domain).  The result will be an image that requires a linearized display, otherwise it will not be presented to the participant as intended.  The authors can verify this by measuring on an uncorrected monitor multiple points of filtered images with a photometer – the result will be a luminance distribution that is skewed toward lower luminance values.  
Or, just simply generate a 2D sinewave (computer generated, and thus in need of monitor linearization if intended to be displayed linearly) and display that next to a sinewave that has been filtered out of one of the lowpass or highpass filtered images.  They will look perceptually identical, which would not be the case if the filters preserved the image’s camera correction (which they do not).  All that said, monitor linearization with such stimuli is more important for experiments concerned with contrast, color, or brightness discrimination – probably not a major problem for filtered scene recognition (I don’t think anyone has tested that to be sure).  One way to proceed with a revision here is to 1) remove the original revised statement as it sounds a bit silly (see the beginning of this comment), and 2) simply state that no attempt was made to linearize the monitor because it isn’t something that cognitive psychologists/neuroscientists do (again, to keep the authors’ approach consistent with what has been reported), and acknowledge that the accurate approach would have been to linearize the display.  Again, if the authors seek to educate their target audience, then they should point these things out (or run the risk of feeding a body of literature that has run such studies inaccurately).

5). The authors have missed my point here.  I was not talking about the appropriate way to prepare images for a quantitative analysis in the Fourier domain.  Otherwise, the use of symmetrization (or mirroring) prior to filtering would lead to inaccurate measures off the cardinal axes.  My point is that because the authors are filtering images to be used as stimuli, symmetrizing the images prior to filtering in the Fourier domain and then cropping the relevant region when inverse transformed is the preferred way of doing this as it dramatically reduces the accumulation of contrast at the edges of the stimulus.  So, I recommend that the authors revise their revised statement – the concern was never about making measurements in the Fourier domain, but using the best approach to dramatically reduce artifacts that will arise in the filtering process.  Simply stating that no attempts were made to account for the artifacts will be fine because, again, cognitive psychologists/neuroscientists don’t make those corrections (which is really unfortunate), and you are trying to compare your results to what has been published.

6) Again, the authors have missed my point.  My comment had nothing to do with signal processing – it was indeed targeted at appropriate experimental design.  As I stated in my original comment, broadband masks differentially interfere with the transmission of spatial frequency information beyond ‘low level’ cortical areas – that is why I referred the authors to studies focused on contrast gain control and masking.  Local neural populations differentially pool activity from neurons tuned to a range of spatial frequencies (i.e., the normalization term or denominator in gain control equations), which leads to a differential adjustment of signal magnitude (the strength of signal transmission in the brain that is).  The result is one where any observed differences in the activity in later scene processing areas (or behaviorally for that matter) are more to do with how the mask interfered with the stimulus than the processing of the target stimulus per se.  That is a huge experimental confound.  Choosing masks that only interfere with the peak spatial frequency of the filtered stimulus is the experimentally appropriate way to use masks for the purposes of what the authors are exploring.  The authors’ response to my comment suggests that they think a broadband mask will interfere with all spatial frequency channels equally, but that is simply not the case.  Using masks that have the same spatial frequencies as the stimuli does indeed allow any differences in performance to be attributed to the stimuli.  Using broadband masks does not allow for this for the reasons stated above (and in my original comment).  I realize that the masking is not the focus of this study, but it is an important factor in isolating stimulus-enabled performance, which is very relevant to how the experiments were conducted.  
As with my responses above, the only compromise here (if the authors cannot wait to rerun follow up experiments until after the COVID-19 restrictions have been relaxed), is to acknowledge that broadband masks were used to increase task difficulty in accordance with other similar scene perception experiments (with filtered images), and that filtered masks are preferred in order to increase task difficulty while at the same time isolating stimulus-enabled performance (i.e., to remove the confounds that broadband masks create when filtered stimuli are used as targets).

 

Minor concerns

2)  Thank you for clarifying.  The revised statement is fine.  However, I should caution the authors about staying with the self-report approach.  I’ve run many participants who have claimed to see just fine (normal or corrected to normal), but when tested were actually 20/50 or worse.  That’s a problem when high spatial frequency filtered stimuli are involved.

Author Response

Reviewer 1:

 

Comments and Suggestions for Authors

 

I’ll start by reiterating my support for the general goals of this paper – I really do want to see it published once corrected for errors and omissions.  However, while the authors have addressed some of my concerns/requests for clarification (mostly the minor concerns), they did not address 5/6 of my primary concerns.  And to be honest, I found several of their responses to be quite dismissive, and others incorrect.  The concerns that remain to be addressed are given below (enumerated as originally listed in my first review).

We appreciate the reviewer's generally positive opinion of our manuscript. We would like to assure the reviewer that we did not mean to be dismissive in our responses in the previous round of revisions and regret if this impression arose contrary to our intentions. We thank the reviewer and the editor for the opportunity to address any remaining issues with the manuscript.

Unaddressed Primary concerns

1) The authors have missed my point here, so I will reiterate and clarify.  I understand that scene perception researchers in cognitive psychology/neuroscience seek to address the role of different spatial frequencies in scene perception and use highpass or lowpass filters to do that.  The problem with doing so is that the bandwidth of the highpass filters is almost always much broader than the lowpass filters.  I do not mean that in terms of filter area, but in terms of the spatial frequency tuning bandwidths of the underlying neural population in striate cortex and beyond (i.e., the major source of inputs to the “higher” visual areas) – see my original comment for details, and the original work of Hugh Wilson and Russell De Valois (among others) on the topic of tuning bandwidths in striate cortex.  This is particularly problematic given that the density of neurons in the “lower” cortical areas is very much skewed towards high spatial frequency tuned neurons (i.e., more smaller receptive fields are needed to tile the visual field than larger receptive fields).  As noted in my original comment, what that means is each filter will differentiate that population with more activation coming from the high spatial frequency tuned population.  Thus, any claim based on unbalanced highpass or lowpass filtering is confounded.  That’s a major problem for researchers who have claimed that some higher scene processing areas respond more to high spatial frequencies than low spatial frequencies.  To be clear, if one uses filters that do not equally activate the differently tuned neural populations (like the filters used in this study), the pathways carrying high spatial frequency information will be activated more than the low spatial frequency population.  Any differences in response magnitude in any scene area receiving that input will have everything to do with how the filters were constructed and almost nothing to do with scene area “preference”.  
All that said, I recognize that the aim of this study is more about normalizing contrast (which is good to control for when relevant), but if how the filters recruit differently tuned neural populations is not accounted for during the filtering process, it really doesn’t matter. That is, contrast balanced or not, lowpass and highpass filtering (when neural tuning and density are not taken into account) is an inappropriate way to address the question of spatial frequency selectivity in scene areas.  Given the lack of knowledge about such things amongst cognitive psychologists/neuroscientists (as the authors point out), I fear that the current paper will perpetuate further studies that are confounded in this way.  In a perfect world, the authors would have included balanced bandpass filters in their test set (as opposed to the pointless ideal rectangle filter), but they didn’t.  Given the COVID-19 situation, running those conditions is not immediately possible (though the authors would be wise to wait and collect meaningful data).  If the authors choose not to wait, then the only recourse that I can see here is for the authors to acknowledge (thoughtfully) how such filtering is the most accurate approach, but here they have focused on the typical lowpass/highpass approach for comparative purposes because that’s what cognitive psychologists/neuroscientists seem to do (albeit a confounded approach).

We thank the reviewer for this clarification. We fully acknowledge the need to carefully consider bandwidth and filter coverage when targeting specific neural populations, e.g., in V1. As the reviewer points out, these adjustments need to be tailored to the neural population, potentially even at the individual-subject level, as shown by individual differences in contrast sensitivity functions (CITE). Note that we are not targeting any particular brain area or neural population here. Furthermore, the spatial frequency tuning of some brain areas involved in natural scene perception, such as the parahippocampal place area, has not been characterized yet. It would therefore be challenging to adjust the settings of the experiment accordingly. We would be very interested in exploring this question in more depth. We invite the reviewer to contact us outside of this review process to discuss and potentially collaborate on new experiments that appropriately address these issues. We do not believe that this manuscript is the appropriate place to settle this debate. The purpose of this manuscript is to highlight perceptual consequences of commonly used filtering techniques in scene perception. Except for the conditions that we explicitly manipulate, we therefore adhere to filter settings commonly used in research on scene perception, including the use of high- and lowpass filters, cutoff frequencies, etc. We believe that a discussion of the appropriateness of these settings should be deferred to future work (perhaps with the reviewer?), where such statements could receive the theoretical and empirical support that they deserve, rather than being inserted as a side note in the methods section of this current manuscript. However, to acknowledge that bandpass filters might be appropriate in some circumstances, we have added this statement to the discussion section (lines 435-436):

When targeting specific frequency channels, e.g., in striate cortex, it may be advisable to use bandpass filters instead of highpass or lowpass filters.
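For readers unfamiliar with these filter types, the transfer functions under discussion can be sketched as follows. This is a hypothetical construction for illustration only: cutoffs are expressed in cycles/pixel (Nyquist = 0.5), and the bandpass variant shown is just one simple composition of a highpass and a lowpass stage.

```python
import numpy as np

def butterworth_2d(shape, cutoff, order=2, kind="lowpass", cutoff2=None):
    """Build a 2-D Butterworth transfer function over an FFT grid.

    `kind` is 'lowpass', 'highpass', or 'bandpass' (the bandpass here is
    simply a highpass at `cutoff` multiplied by a lowpass at `cutoff2`).
    A higher `order` gives a sharper cutoff at the cost of more ringing.
    """
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    r = np.sqrt(fx**2 + fy**2)  # radial frequency in cycles/pixel
    lp = lambda c: 1.0 / (1.0 + (r / c) ** (2 * order))
    if kind == "lowpass":
        return lp(cutoff)
    if kind == "highpass":
        return 1.0 - lp(cutoff)
    if kind == "bandpass":
        return (1.0 - lp(cutoff)) * lp(cutoff2)
    raise ValueError(kind)
```

Multiplying an image's FFT by such a transfer function and inverse-transforming yields the filtered image; the `order` parameter is where the trade-off between cutoff sharpness and ringing artefacts discussed in the manuscript enters.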

3)  Thank you for reporting the average percentage of pixel clipping.  However, I mainly asked to report the range of pre-filtered image RMS (e.g., a histogram). Or, a histogram showing the frequency of clipping percentages.  This is important to do for the very reason that I mentioned in my first round of comments.  Briefly, the large amount of Fourier power (i.e., contrast) that is loaded into the high spatial frequency stimuli will surely lead to significant clipping for images that were already high in RMS contrast.

At the reviewer's request, we now include a histogram of pre- and post-filtering RMS contrast as the new Figure 2.
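The clipping statistic underlying such a histogram can be computed straightforwardly. The sketch below is illustrative, not the analysis code used for the new figure: it simply bins per-image clipping percentages with `np.histogram`.

```python
import numpy as np

def clipping_histogram(images, bins=10):
    """For each image (whose values may fall outside [0, 1] after
    filtering and rescaling), compute the percentage of clipped pixels,
    then bin those percentages into a histogram."""
    pcts = [100.0 * np.mean((im < 0) | (im > 1)) for im in images]
    counts, edges = np.histogram(pcts, bins=bins, range=(0, 100))
    return counts, edges
```

Reporting these counts (rather than only the mean clipping percentage) reveals whether a few high-RMS images absorb most of the clipping, which is the reviewer's concern.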

4)  The response to this comment is incorrect.  First, it isn’t at all clear what the authors mean by “perceived gamma”.  Camera sampled images are raised to a power that is linearized when displayed on the monitor (which has its own exponent).  The output is physically linear, there is nothing perceptual about it.  Specifically: optical image^(1/n) = stored image; stored image^g = displayed image, where ^(1/n) is the camera process, and ^g is the display.  A good account of this is given in the book “Vision Research” by Roger Carpenter and John Robson.  Crucially, the authors assertion that linear filtering will preserve the camera’s process when displayed is incorrect.  If we were just analyzing the power within different frequency bands, then there would be no problem here (while recognizing that the image has been raised to a certain power).  However, we are talking about using filtered images as stimuli presented on a nonlinear monitor.  Yes, an image by itself would be displayed as intended (i.e., linear output), but what is frequently overlooked here is that the filter itself in the spatial domain actually constitutes an image.  If one were to present the spatial version of those filters on a monitor as stimuli, they would not be linear in terms of how they are displayed.  By convolving images with those filters, one is effectively ‘weighting’ (visually on the monitor) the image with an visually uncorrected filter (i.e., multiplication of an image and a filter the Fourier domain is identical to convolution of that image with the spatial version of that filter in the spatial domain).  The result will be an image that requires a linearized display, otherwise it will not be presented to the participant as intended.  The authors can verify this by measuring on an uncorrected monitor multiple points of filtered images with a photometer – the result will be a luminance distribution that is skewed toward lower luminance values.  
Or, just simply generate a 2D sinewave (computer generated, and thus in need of monitor linearization if intended to be displayed linearly) and display that next to a sinewave that has been filtered out of one of the lowpass or highpass filtered images.  They will look perceptually identical, which would not be the case if the filters preserved the image’s camera correction (which they do not).  All that said, monitor linearization with such stimuli is more important for experiments concerned with contrast, color, or brightness discrimination – probably not a major problem for filtered scene recognition (I don’t think anyone has tested that to be sure).  One way to proceed with a revision here is to 1) remove the original revised statement as it sounds a bit silly (see the beginning of this comment), and 2) simply state that no attempt was made to linearize the monitor because it isn’t something that cognitive psychologists/neuroscientists do (again, to keep the authors’ approach consistent with what has been reported), and acknowledge that the accurate approach would have been to linearize the display.  Again, if the authors seek to educate their target audience, then they should point these things out (or run the risk of feeding a body of literature that has run such studies inaccurately).

We thank the reviewer for the clarification. The linearization of the monitor so that the filter, and the filtered images, would be displayed linearly is a good point. As displaying photographic images usually does not require monitor linearization, this is not usually done in the scene perception literature, even when filtered images are being displayed. As the current paper specifically focuses on contrast normalization, we do not feel our data support making claims about monitor linearization, but we would like to follow up in a future study where we specifically test for different results when the monitor is linearized versus when it is not. This will let us see how sceptical we should be about previous work in the field, and would allow us to make stronger methodological recommendations.  

For now, we have changed the statement according to the reviewer's recommendation (lines 156-157): 

Note that we did not attempt to linearize the luminance response of the monitor, since this is not commonly done when displaying photographic images.
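For completeness, the linearization the reviewer describes amounts to pre-compensating stimulus values with the display's inverse power law. The sketch below assumes a nominal gamma of 2.2; as the reviewer notes, a real display's response should be measured with a photometer rather than assumed.

```python
import numpy as np

def gamma_linearize(frame, display_gamma=2.2):
    """Pre-compensate an 8-bit frame so that a display with power-law
    response `display_gamma` outputs luminance proportional to the
    intended values. Implemented as a simple lookup table."""
    levels = np.arange(256) / 255.0
    lut = np.round(255.0 * levels ** (1.0 / display_gamma))
    return lut.astype(np.uint8)[frame]
```

Applying the display's exponent to the pre-compensated output (`(out/255)**2.2`) recovers a luminance profile proportional to the original frame, which is the sense in which the display is "linearized".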

5). The authors have missed my point here.  I was not talking about the appropriate way to prepare images for a quantitative analysis in the Fourier domain.  Otherwise, the use of symmetrization (or mirroring) prior to filtering would lead to inaccurate measures off the cardinal axes.  My point is that because the authors are filtering images to be used as stimuli, symmetrizing the images prior to filtering in the Fourier domain and then cropping the relevant region when inverse transformed is the preferred way of doing this as it dramatically reduces the accumulation of contrast at the edges of the stimulus.  So, I recommend that the authors revise their revised statement – the concern was never about making measurements in the Fourier domain, but using the best approach to dramatically reduce artifacts that will arise in the filtering process.  Simply stating that no attempts were made to account for the artifacts will be fine because, again, cognitive psychologists/neuroscientists don’t make those corrections (which is really unfortunate), and you are trying to compare your results to what has been published.

Thank you for the clarification. The revised statement now reads (lines 112-114):

We here did not include any measures to address artifacts arising from the periodicity assumption inherent in the Fast Fourier Transform implementation, although zero-padding or mirroring the image at the edge are possible strategies to minimize such artifacts.
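The mirroring strategy mentioned in this statement can be sketched as follows. This is an illustrative implementation under stated assumptions: the `transfer_fn` helper, which builds a transfer function for the padded size, is a hypothetical signature introduced only for this example.

```python
import numpy as np

def filter_with_mirroring(image, transfer_fn):
    """Filter an image in the Fourier domain after mirror-padding
    (symmetrization), then crop back to the original size. The padding
    reduces edge artifacts caused by the FFT's implicit periodicity
    assumption. `transfer_fn(shape) -> H` builds the transfer function
    for the padded shape."""
    h, w = image.shape
    padded = np.pad(image, ((h // 2, h // 2), (w // 2, w // 2)),
                    mode="reflect")
    H = transfer_fn(padded.shape)
    filtered = np.real(np.fft.ifft2(np.fft.fft2(padded) * H))
    # Crop the central region corresponding to the original image.
    return filtered[h // 2: h // 2 + h, w // 2: w // 2 + w]
```

With an all-pass transfer function the round trip returns the original image, which is a convenient sanity check for the padding and cropping indices.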

6) Again, the authors have missed my point.  My comment had nothing to do with signal processing – it was indeed targeted at appropriate experimental design.  As I stated in my original comment, broadband masks differentially interfere with the transmission of spatial frequency information beyond ‘low level’ cortical areas – that is why I referred the authors to studies focused on contrast gain control and masking.  Local neural populations differentially pool activity from neurons tuned to a range of spatial frequencies (i.e., the normalization term or denominator in gain control equations), which leads to a differential adjustment of signal magnitude (the strength of signal transmission in the brain that is).  The result is one where any observed differences in the activity in later scene processing areas (or behaviorally for that matter) are more to do with how the mask interfered with the stimulus than the processing of the target stimulus per se.  That is a huge experimental confound.  Choosing masks that only interfere with the peak spatial frequency of the filtered stimulus is the experimentally appropriate way to use masks for the purposes of what the authors are exploring.  The authors’ response to my comment suggests that they think a broadband mask will interfere with all spatial frequency channels equally, but that is simply not the case.  Using masks that have the same spatial frequencies as the stimuli does indeed allow any differences in performance to be attributed to the stimuli.  Using broadband masks does not allow for this for the reasons stated above (and in my original comment).  I realize that the masking is not the focus of this study, but it is an important factor in isolating stimulus-enabled performance, which is very relevant to how the experiments were conducted.  
As with my responses above, the only compromise here (if the authors cannot wait to rerun follow up experiments until after the COVID-19 restrictions have been relaxed), is to acknowledge that broadband masks were used to increase task difficulty in accordance with other similar scene perception experiments (with filtered images), and that filtered masks are preferred in order to increase task difficulty while at the same time isolating stimulus-enabled performance (i.e., to remove the confounds that broadband masks create when filtered stimuli are used as targets).

We thank the reviewer for clarifying this point. As with the previous points, we are interested in future studies where we make specific recommendations with regard to the masking procedures in the scene perception literature (see point 1 as well). However, we do believe that making strong recommendations about the masking procedures using our experiments is not justified. As such, we have added the requested statement (lines 177-180):

Perceptual masks were constructed from noise textures with a broad spatial frequency spectrum. Note that frequency-matched masks would increase task difficulty while at the same time isolating stimulus-enabled performance. We here opted for broadband masks to keep them the same for all experimental conditions.
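The distinction between broadband and frequency-matched masks can be illustrated by shaping white noise in the Fourier domain. This is a sketch, not the mask-generation code used in the experiments: passing the stimulus filter's transfer function yields a frequency-matched mask, while `None` gives the broadband case.

```python
import numpy as np

def noise_mask(shape, transfer=None, rng=None):
    """Generate a noise mask. With `transfer=None` the mask is
    broadband white noise; passing a Fourier-domain transfer function
    of the same shape restricts the mask's energy to the stimulus's
    frequency band (the reviewer's recommended alternative)."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(shape)
    if transfer is not None:
        noise = np.real(np.fft.ifft2(np.fft.fft2(noise) * transfer))
    # Rescale to [0, 1] for display.
    noise -= noise.min()
    return noise / noise.max()
```

Using the same transfer function for mask and stimulus is what ties the mask's interference to the stimulus's own frequency content.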

 

Minor concerns

2)  Thank you for clarifying.  The revised statement is fine.  However, I should caution the authors about staying with the self-report approach.  I’ve run many participants who have claimed to see just fine (normal or corrected to normal), but when tested were actually 20/50 or worse.  That’s a problem when high spatial frequency filtered stimuli are involved.

Thank you for pointing out this issue. We will include testing of visual acuity in future studies.

Reviewer 2 Report

The authors have addressed my concerns.

Author Response

Thank you!

Round 3

Reviewer 1 Report

The authors have addressed my concerns.
