Article
Peer-Review Record

Predicting Gentrification in England: A Data Primitive Approach

Urban Sci. 2023, 7(2), 64; https://doi.org/10.3390/urbansci7020064
by Jennie Gray 1,2,3,*, Lisa Buckner 4 and Alexis Comber 2,3
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 4 December 2022 / Revised: 6 May 2023 / Accepted: 10 May 2023 / Published: 13 June 2023

Round 1

Reviewer 1 Report

Overview

I’ve never spent this much time on a review before: not, I hasten to add, because the work wasn’t good, but because there was something that hadn’t quite made sense on the first or second read-through, and that only began to become clear after I had also read Gray et al. (2021) and Comber (2008) and that I think I can finally articulate clearly after struggling to write this report yesterday and then, ultimately, sleeping on it.

What it boils down to is this: a paper that is part of a series (as this obviously is) seems to me to have two ways to tackle its topic. Option #1 is that you write each article so that it can stand largely on its own in order to effectively communicate its content and contribution while pointing at the others for those who want a ‘deep dive’ into specific aspects; option #2 is that you basically say “As we showed in Gray et al (forthcoming), you can detect gentrification in South Yorkshire, so in this paper our focus is extending this project to England as a whole.” The ‘problem’ for the reader is that right now the article doesn’t take either of these paths and tries to do both simultaneously.

If you take a step back from the text then I think this (now) becomes clear; for instance:

  • Data primitives aren’t quite explained in a detailed enough way that the reader who is fresh to the concept can grasp it in full, but neither does the explanation remain at a high enough level that I get the main ideas underpinning the data primitives approach and how it is translated into this problem domain from Remote Sensing. I personally would prefer to see some sort of high level statement about what the data primitives concept helps us to achieve (more on this below) and then a deeper dive into how it works to improve our prediction(s) of gentrification.

  • Gray et al. (forthcoming) claims to deal with gentrification in South Yorkshire which to me signals that that topic is kind of ‘done’ (I realise that often isn't the case, so this could just be a presentational issue), so why does it keep popping up here, seemingly as new work? Wasn’t the validation work for South Yorks done in that article so that we can ‘take it as read’? Clearly the answer is ‘no’, but in that case what are the key points of difference? Again, this is about what you present and how you present it: did the first article just try to detect, so here we’re extending this into prediction?

  • In an article about gentrification in England, the first three figures/tables from the analysis are about one region that you’ve apparently already published on… (this is not meant to imply malpractice, just to highlight the mismatch between the title of the article and the focus of your discussion). Are you confident that South Yorkshire is a good training ground (literally) for a model that can apply in London, Cornwall, and Kent? Why?

I’m sure that this comment will come as a surprise to the authors, since they are clearly pioneering thinking on this topic in the fields of physical (land cover/land use) and human (area change) geography, but I have never before had to track down other articles (Gray et al 2021, Comber 2008) in order to make sense of a core concept/theory in the one I’m reviewing. My instinct is that data primitives are introduced elsewhere as a solution to a particular conceptual problem that had been substantively laid out first (e.g. the Comber article), whereas here it feels like we start from them as an obvious solution to a problem domain that hasn’t yet been fully laid out.

Summary of Article

To try to clarify my thinking (and hopefully help the authors make sense of my natterings) I have attempted to summarise the article, and I hope this will serve to clarify both the areas where the reader is running aground and the ways in which the concept(s) might be introduced.

The article seems to engage with gentrification as follows:

  1. The limitations of geodemographic classifications that use clustering techniques to make static assignments of areas to groups (area A is ‘multicultural metropolitans’, area B is ‘Ageing City Dwellers’) when what we’re interested in is change in those areas.

  2. The temporal limitations of the data used to make these assignments, which seems to focus primarily on the latency/infrequency in the data and the ways that that limits analyses of change since we only see the outcomes at times t, t+10, etc.

  3. So rather than trying to infer gentrification from movements between static classes, we should instead be focussing on area-based ‘change vectors’ represented by the magnitude and direction of change in a multidimensional space using data with much higher temporal resolution whose selection we can justify with respect to underlying signals/dimensions in the problem domain.

  4. In the case of gentrification we boil this down to the following input observations… Is that right? The argument is then about how we need a good mapping between data of the ‘right sort’ that is available and the types of gentrification that we might encounter. Data primitives allow us to formalise our conceptualisation of this mapping/problem.
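To check that I have understood point 3, here is a minimal sketch of what I take a ‘change vector’ to mean. The function name, the choice of four dimensions and the values are my own assumptions for illustration; none of this is taken from the manuscript, and the authors’ actual calculation may well differ.

```python
import math


def change_vector(before, after):
    """Return (magnitude, unit_direction) of the change between two observations.

    `before` and `after` are equal-length lists of standardised values for one
    small area at two time points.
    """
    delta = [a - b for b, a in zip(before, after)]
    magnitude = math.sqrt(sum(d * d for d in delta))
    if magnitude == 0:
        # No change at all: direction is undefined, so return a zero vector.
        return 0.0, [0.0] * len(delta)
    return magnitude, [d / magnitude for d in delta]


# Hypothetical standardised values for one area in two consecutive years.
# Assumed dimension order: income, occupation, employment, ethnicity.
area_t0 = [0.0, 0.0, 0.0, 0.0]
area_t1 = [3.0, 4.0, 0.0, 0.0]

magnitude, direction = change_vector(area_t0, area_t1)
print(magnitude, direction)
```

If this reading is right, the magnitude says how much an area has moved and the direction says which dimensions drove the move, which is the part that a static class assignment cannot show.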

This points to the absence of a table that combines aspects of Table 1 (p.7 of Gray et al 2021) and the bulleted list on p.8 of the same article together with the key topic of this paper. Something like the following would be really helpful:

Type of Gentrification | Sees change in…
                       | Income    | Occupation | Employment | Ethnicity
Studentification       | No Change | No Change? | Decrease?  | ?
Transit-Induced        | Increase  | Increase   | Increase   | ?
New-Build              | Increase  | Increase   | Increase   | White British Increase
Rural                  | Increase  | Increase   | Decrease   | BAME Decrease
First-Wave             |           |            |            |

And there would then be a short section mapping the columns to the available data that streamlines and reorganises Tables 1 and 2 from this article. I’d note that the lead-in to this section seems to be making the argument that data primitives are sufficient for examining the process of interest (gentrification), but here we have a Table 2 that contains “additional data primitives” alongside “data used for descriptive analysis”.

So are the variables in Table 2 essential or not? Primitive or not? If they are part of process of analysis then I’d put them in Table 1, if they are used for descriptive purposes perhaps you can just say “A range of spatial and demographic variables — covering access to transport, greenspace, and ethnicity — that could be used to contextualise and validate the results were also collected; these are listed in full in Table X in the Supplementary Online Materials so as to keep the focus here on the data primitives used in the modelling process.”

For example, White British Ethnicity is listed in Table 2 but overlaps conceptually as a model input with Black and Asian Ethnicities from Table 1 — listing “Ethnicity Change” in Table 1 instead and removing the ‘Trend’ column helps to reduce this confusion if you also add the table mapping data primitives on to gentrification(s) as suggested above.

There’s also no mention of the temporal frequency of updates to any of these sources even though this seems to be much more important analytically than their spatial resolution. It would be quite easy to write: “Unless otherwise stated, all variables are reported at LSOA scale” so as to free up space for the “Temporal Resolution”. “Measurement” also lists ‘number’ but number of what (presumably residents)?

Finally, Table 3 is quite confusing since it simultaneously recapitulates information from Tables 1 and 2 and introduces new data sources. I would delete this table entirely and point to a Supplementary file for those who want the full variable list used for validation and exploration.

Additional Important Suggestions

The validation step(s) is also… hard to understand and seems rather impressionistic without some additional detail in a Supplementary. Have you considered reproducing the maps of others, or at least doing this in a narrative way; e.g. “Comparing our results to Yee and Dennet’s and Reades et al’s predictions for London we see…”?

On p.5 a gentrification ‘score’ is mentioned that is nowhere (that I could spot) detailed or defined. And I think this opens up the other area where I was left a little befuddled: you cite our article (#11) as an example of research limited to a single city (absolutely fair) but I’m unclear as to whether you have more fundamental points of disagreement (or agreement!) with that work and the ones that have come after. The reason I point to this is that it suggests a different ‘branch’ of the literature that isn’t ‘debated’ in this piece: we use a scoring process (but were certainly not the first to do so) not a clustering one, but we also use Census data. I think you need to add a little more about where you feel the ML-based approaches fall relative to yours, but that requires a little more detail on scoring. Or, again, you need to be able to say “We demonstrated the value of a X-based scoring mechanism in Gray et al (forthcoming) that [is consistent with approaches taken by Y and Z OR delivers significantly better results than the approaches taken by Y and Z] and we here extend it to the national context.”

A short paragraph discussing the CDRC data would be good to add as well since for many readers it will still be quite new; there are only, to my knowledge, 3 published journal articles and one report by the Runnymede Trust [which should probably get a mention] that use this data source. I think this is one source of my own perplexity: because the CDRC churn and ethnicity data is, of course, a model built on top of an underlying data set (LCRs, Onomaps) and is not a ‘primitive’ in the way that my brain keeps trying to interpret that word.

I think you also need to be a little more clear about the possible limitations of your analysis: researchers (and certainly qualitative ones) have come up with a lot more types of gentrification than you’ve listed here (Residential, Rural, and Transport-Oriented). How should we interpret this? Are you saying that these are the only three types that exist, or that they are the only three types that you can detect using your approach, or that they represent three classes of change that are detectable and that you’ll need to see whether there are mixes or magnitudes or directions of change that help you to pick out other types of gentrification theorised by researchers or explored in qualitative/other quantitative work? I believe you do kind of get to this later, but by that point I was confused as to which of these arguments you were making.

I’d suggest that a ‘Key findings’ (p.13) that runs to 4 pages of an article is perhaps less of a distillation than the reader would expect. I also had a hard time working out how the different neighbourhoods differ, only that they must differ in certain ways. I think the issue is that a lot of material in this section feels like it actually belongs in the lit review! I made this notation specifically against lines 164–477 and 487–491 (p.15) but the same is undoubtedly true of the section on Rural Gentrification on p.14. The introduction of the NYC Taxi Commission feels… tangential at this point. Equally, do lines 540–554 (roughly) on p.16 belong in the Methodology?

The absence of comparisons to outputs from other works (which could add robustness to your validation) on p.17 (and throughout the analysis and discussions) really works against you here. It kind of feels like, after all of this, you’re reduced to saying “It’s too much work to validate all of England using our preferred technique so you’ll have to trust us that it must have worked outside South Yorkshire.” I don’t think that’s quite what you’re saying or quite what you’ve done, but you should borrow analytical power from the work that others have done wherever possible.

Minor Suggestions

Figures/Maps:

  • The black backgrounds look nice on-screen but are hell on printers; please, please, please change them to a plain old white background.

  • Aside from the regional boundaries there are no points of orientation: no cities are marked (not even the ones mentioned in the text), no roads, … even the outline of Wales (which would give us a way to locate Liverpool/Manchester and Bristol) is missing, so you really have to know England well for the patterns to be meaningful. On p.11 you say “Transport gentrification is scattered in towns along major motorways…” but show neither the towns nor the motorways!

  • Given the small size of the maps and the varying sizes of the LSOAs, I do wonder if this is one case where aggregating into hexes or something might really help with the interpretation. I realise that’s going against the whole point of fine-grained prediction, but Figure 4 seems so important and yet is so hard to pull insight from…

  • Is there a need for a missing distribution plot (responding to statement on p.11, lines 303–305)? The plot would show the distribution of probabilities across all three types for the entire UK. This also raised a question I had: what if an area has scores on all three types of gentrification? Do you pick the most likely? What if they are all high but one just happens to be fractionally higher?
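To make the last question in the list above concrete, here is one way an assignment rule could treat near-ties explicitly instead of always taking the highest score. This is only a sketch of the kind of rule I would like to see documented: the function name, the two thresholds and the ‘mixed’ and ‘none’ labels are all hypothetical and are not the authors’ method.

```python
def assign_type(probabilities, min_probability=0.5, min_margin=0.1):
    """Assign a gentrification type from per-type probabilities.

    Returns "none" when even the best type is unlikely, "mixed" when the top
    two types are too close to call, and the top type otherwise.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (top_type, top_p), (_, second_p) = ranked[0], ranked[1]
    if top_p < min_probability:
        return "none"
    if top_p - second_p < min_margin:
        return "mixed"
    return top_type


# Hypothetical scores for three areas.
clear_case = {"residential": 0.90, "rural": 0.20, "transport": 0.10}
near_tie = {"residential": 0.81, "rural": 0.80, "transport": 0.79}
unlikely = {"residential": 0.10, "rural": 0.05, "transport": 0.02}

print(assign_type(clear_case), assign_type(near_tie), assign_type(unlikely))
```

Whatever rule was actually used, stating it (and how many areas fall into the near-tie case) would help the reader interpret the maps.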

Tables:

  • For the sake of all that is holy please stop centre-aligning content in tables.

  • Please remove unnecessary vertical rules (Table 2).

  • Please resolve the mixed use of “UK Government” and “https://data.gov.uk/”. Ideally, I’d prefer to see URLs for all of these – obviously, for the CDRC data this would be a link to the ‘asset listing’, not the data itself, and the same approach could be used elsewhere.

  • Table 4 is probably unnecessary. You could cover this in a single sentence because the results don’t mean anything to me (as someone who has travelled through South Yorkshire and even spent time there!). I also don’t understand where the other LSOAs have gone. Why aren’t they included in the None count? Or is ‘None’ meant to mean: ‘no particular type of gentrification dominates’ (see also Table 8)?

  • Table 1 is nicely designed (aside from my personal dislike of centre alignment) but the rest seem to have taken the default Word formatting style that breaks up the data in ways that make it hard to pull out the key findings.

Word choice/typos:

  • p.3, lines 107–112: it seems a little strange (to me at least) to talk about ‘churn’ when you’re really talking about population increases. I would ordinarily read ‘churn’ as implying replacement/departure.

  • p.3, line 103: delete comma after towns

  • p.3, line 138: delete ‘a’ between ‘contract’ and ‘and’

  • p.4, line 162: insert ‘which’ after ‘determine’

  • p.4, line 162: insert ‘sensing’ after ‘remote’

  • p.4, line 163: I found ‘over each tie period between 2010 and 2019’ quite vague — are we talking years? months? days? Given the limitations of LCRs I assume ‘years’, but this is not clearly stated.

  • p.5, lines 18–184: suggest rewriting as “to seek evidence of visual changes within Google Earth or, where aerial imagery was not available, Google Street View. Visual validation was supported by…”

  • p.5, lines 197–205: Simple cut+paste along the lines of “We compared the performance of three ensemble methods: GBM, …. GBM does X. Bootstrap Aggregation does Y. XGBoost does Z. We evaluated these models using [MSE | RMSE | ????] because this minimised [large errors | overall rate of errors | … ].” Note the addition of a sentence or two about how you evaluated the models.

  • p.5, lines 208–211: I don’t immediately see why these couldn’t have been stated alongside the initial RO1/RO2/RO3 bullet points. They are much more clear with respect to what you’re actually planning to achieve.

  • p.6, lines 221–226: Is ‘tuning grids’ a grid search CV? You have 45 ‘change periods’ but I don’t know what period you’re actually using! They obviously aren’t years. Are they quarters? “Different combinations of variables were evaluated…” is at risk of sounding like a fishing expedition and introducing the multiple comparisons problem. What thought-process drove this exploration? How did you specify the search and evaluation? I’d really like more detail here.

  • p.6, lines 239–246: precision is overdone here: I’m not sure that 99.18% means any more to me than 99%, nor does .9666 mean much more than .97.

  • p.8, lines 264–274: more overdone precision in my opinion. You are welcome to disagree with me on this.

  • p.9, lines 284–285: because I don’t fully understand the ‘visual evaluation technique’, the ‘excellent accuracy’ claim seems like mostly an assertion.

  • p.14: there’s an implication (inadvertent) that people displaced from urban areas (e.g. London) are buying in the Cotswolds.

  • p.15, line 455: is ‘rural neighbourhood’ a bit of a self-contradiction? Rural ‘areas’? I guess I have a very urban view of what a neighbourhood should be, though there’s also the technical definition of neighbourhood from spatial stats… is that the one you’re using?

  • p.17, line 594: assumed it’s ‘explored’, not ‘implored’

  • p.17, line 595: data availability will not necessarily ‘continually improve’; the CDRC’s LCR data are going a bit in reverse because people are opting out of the unedited Electoral Register.
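Returning to my note on p.5, lines 197–205: the reason I ask which evaluation metric was used is that the choice changes what ‘best model’ means. The toy example below is my own illustration with made-up numbers (nothing here comes from the manuscript) and shows that two sets of predictions with the same mean absolute error can have very different RMSE, because RMSE penalises a single large error more heavily.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: weights large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values: same total absolute error, distributed differently.
actual = [0.0, 0.0, 0.0, 0.0]
many_small_errors = [1.0, 1.0, 1.0, 1.0]
one_large_error = [0.0, 0.0, 0.0, 4.0]

for label, predicted in [("many small", many_small_errors), ("one large", one_large_error)]:
    print(label, "MAE:", mae(actual, predicted), "RMSE:", rmse(actual, predicted))
```

A sentence in the text saying which metric was used, and why it suits the problem, would be enough.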

Wrap-Up

Although I’ve obviously given a lot of feedback and noted a lot of potential points of ‘clarification’, I want to stress three things:

  1. The work is clearly novel and important research; indeed, part of the problem is that the novelty requires extra work to be done in positioning the article relative to other schools/domains and that seems to be missing (or at least insufficient) at the moment. Perhaps some of this is in the missing ‘in review’ article (access to which would probably have really helped) but it still needs signposting from here.

  2. I think that there are some blocks of content that need to be moved around and, from there, we should see some gaps that need filling; however, it’s hard for me to gauge the amount of work that this entails. I’ve opted for Major Revisions not because the work needs a major rethink but because I can’t see how what I’ve suggested represents, say, a week’s light effort over the holiday period. Even responding to my comments in enough detail to tell me to ‘piss off’ would take some time, so Major Revisions hopefully gives you that.

  3. The rest is about stripping away and refining so that the focus is more clear. The article is a bit sprawling which, when combined with #1 and #2, means that the reader gets a bit lost in terms of what matters and what the key findings/advances really are.

I hope this comes across as robust but helpful feedback—I do look forward to seeing this in press!

Good luck!

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

1. There are many methods for studying gentrification. Data primitives for neighbourhood change research have recently been introduced as a novel approach for identifying and analyzing neighbourhood change over time. This research is a new attempt and represents progress in the field of geography.

2. On innovation: others studied London, whereas the author team studied the whole country; the model was previously used in South Yorkshire, and the author team has now applied it to England. Both are progress. However, readers and reviewers will want to know what the author team's theoretical contribution to the study of gentrification is. What is its distinctive academic value?

3. In the conclusion, the author team uses data primitives to study gentrification in England. This research has a certain social value in helping local government make evidence-based decisions, but readers and reviewers will want to know how generalisable and widely applicable the research is. Is it necessary to add a comparison with other regions in the conclusion? This may provide more comprehensive evidence for the scientific soundness of the research conclusions.

4. On details: there are some problems with the maps in the article, such as the absence of a basic compass and scale bar, which should not occur in the work of urban geography researchers. It is suggested that professionals be invited to standardise the maps.

Comments for author File: Comments.pdf

Author Response

Please see the attachment

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

General

I really appreciate the effort that has gone into rewriting this work based on my feedback: I think (and hope the authors will agree) that the result is a more clearly articulated contribution to the discipline that engages effectively with the extant literature and makes its own contribution more clear. I'm happy to be able to recommend it for publication, though I think it would benefit from one more pass (which is why I've ticked 'Minor revisions') since, as might be expected from a major rewrite, there are a few inconsistencies that have crept in. I do not need to see this again and have indicated this to the Editors — so I'd like to give you the opportunity to tweak the submission if you find the comments below helpful, but when you think it's done then it's done and it doesn't require another gander from me.

However, to be clear: looking at my main comments in the review these have all been addressed by the new submission. The relationship between the study area and the predictions is much more clear and the removal of detailed information about South Yorkshire has kept the focus where it belongs. The framing of data primitives is also much more straightforward and clear, enabling the reader to 'get on with' understanding the work.

Main Elements

I've tried to focus on what seem to me to be the main areas where a bit of reorganisation would improve the clarity of the article and have attached a scan that also shows some possible copy edits that the authors might wish to consider. These hand-written notes are also part of my thinking/framing my understanding, so in some cases the comment should be completely ignored (I've captured the ones that matter to me below; the only other ones worth reviewing are typos). If you feel the edit is useful then do it, otherwise ignore it. There are minor typos/accidentally included references (instead of "[XX]") that are mostly circled.

  1. The Abstract is a bit confusing to me since it moves between tenses: "Data primitives were recently introduced... Gentrification was conceptualised and four key data primitives were used..." I read all of that as referring to another publication (gentrification was... data primitives were), but I'm fairly sure it refers to this submission. In which case a little tweak would help, to something like 'In this article, three types of gentrification are conceptualised and four key data primitives are used to...'

  2. The last paragraph of Section 1 and the first paragraph of Section 2.1 (p.2) need another look as the sequencing is confusing. My suggested approach would be to group the data primitives bits together: "The concept of Data Primitives [8] originates in ... They were developed to overcome issues when translating... We extend this concept into the temporal domain in an attempt to..." And to group the gentrification bits together separately rather than trying to interweave them: e.g. "The absence of a dynamic element in geodemographic classification is a particular problem when dealing with change, such as occurs when an area undergoes gentrification. Conceptualising the data primitives, and associated derived variables, as a kind of gentrification 'space', we draw on the Data Primitives approach to conceptualise gentrification as a change in the position of a small area within that data space... This approach is, of course, dependent on the variables that are selected..."

  3. Section 2.2 (p.3): I would suggest moving the first, short paragraph down and folding it into the text further down. So you would now lead with "In U.K.-based studies... " I would then start a new paragraph after citation 25 ("Annual data..."). This would pull together the earlier part and could also serve as a useful caveat: e.g. "While not necessarily exhaustive of the forms that gentrification might take (others [19, 20] have noted super- and green- gentrification, for instance), these four data primitive domains should be sufficient to capture... To apply our approach we therefore collected annual data covering these four key neighbourhood characteristics and trained Machine Learning models on manually-validated observations of gentrification."

  4. Section 3 (pp.4–5): I would be inclined to move the specifics of the South Yorkshire training data set below the discussion of ensemble methods because that is really the application of the method to a particular context and so, conceptually, should follow from it. I think this is just a case of gathering together the bit after Table 1 with a couple of the summary stats and moving this down a bit. You could also give these subsection numbers as for Section 2. (ie. Section 3.1: Data; Section 3.2: Ensemble Modelling; Section 3.3: Case Study and Training... ) to help the reader navigate and you to decide where to put the re-sequenced elements.

  5. There might be some funky formatting on p.7 but I can't find where Section 4 starts. My assumption is that the Results (which Section 4 presumably is) should begin with Table 1. I think it would be really helpful to insert one short paragraph before that to the effect of "To recap, we derived a data set of 77 attributes — X of which were derived from the four data primitives, and Y of which were taken from contextual features — on an annualised basis. These attributes were used to train three ensemble models for Yorkshire and the results validated manually. We then re-trained the best-performing model for England as a whole... Finally, multivariate models were used to predict the type of gentrification..." It's just a good point in the text to confirm for the reader that they've apprehended the framework.

  6. Figure 3A seems to be for Rural, not residential gentrification. I had a query (see scan) about whether a bivariate plot of residential and transit gentrification would be helpful, but looking back at the text I don't think it is. What it did lead me to think, however, (see note on p.10 of scan) is that it would be really useful to have a slippy map or similar for each of these so that a reader can zoom and pan in much greater detail: you'd really have to know England in great detail to be able to extract a lot of insight from Figure 4. This is obviously not a requirement for acceptance, but if it were to be something fairly trivial for the authors it would represent a significant 'value add' to the text.

  7. Last paragraph of Section 4 (p.11) seems to have an artefact from accepting changes since there's a dangling part of a sentence ("Contrastingly...").

  8. Section 5 (p.11): I'd suggest a change in sentence order just to streamline things. "This research demonstrates that the data primitive approach is a viable alternative to, and advancement upon, traditional approaches to analysing neighbourhood change. Gentrifying and non-gentrifying neighbourhoods, as well as different types of gentrifying neighbourhoods, can be distinguished through the use of data primitives at a resolution of years, not decades. And predictive models... "

  9. Top of p.12: I might be overly-sensitive, but I was wondering if you might prefer something along the lines of: "However, we also observe contrasting predictions for some areas (e.g. [43] predicts decline where we predict gentrification), suggesting opportunities for further investigation: it could be that our selected training region of South Yorkshire is unsuitable for predicting change across all of England, but it's also just as likely that the additional temporal resolution of our data yields more timely predictions over ones derived from the Census."

  10. Bottom of p.12: This seems to be the start of a Limitations section that lasts to line 490 on p.13. I'd give it a subtitle. My feeling is that these limitations are too self-flagellating: it's normal to 'big up' academic results in an article (which I dislike), but this seems to go too far the other way by listing limitations in a way that makes me feel "Oh, it's all doomed!" Moving this ahead of the Discussion would mitigate this feeling (though it would require more editing work), but focussing on (most likely by removing) the paragraph beginning "The methodological decision..." would also work. I have a note to the effect that it's functionally impossible to have 'controls' (areas that didn't gentrify) since you'd need to visually assess whether an LSOA hasn't meaningfully changed. Do with that what you will.

  11. Bottom of p.13: insert subsection title Future Work immediately before paragraph introducing future work. Or fold it into Conclusion and Future Work? I had crossed out 'temporal boundaries' in that paragraph because so much of this text is focussed on temporal resolution, not spatial. Re-reading while writing up my notes I realise that, of course, with high-res data primitives you could free yourself from Census geographies, but the way this is written doesn't immediately bring that to mind. I'm not sure what to suggest here, but the way this article works doesn't quite 'prove' that fact, even though it's implicit.

Supplementary Information

I'd love a little more detail about the 'multivariate models' and some kind of summary that shows what coefficients/changes in your data allow you to distinguish between the different types of gentrification since this is very much of interest to me; however, it would detract from the main direction of the article so it doesn't belong there. What you've submitted as a Supplementary helpfully pulls a large and complex table out of the body text, but perhaps use could be made of the space created by the presence of a SOM to include additional resources/technical information for interested readers. Probably not worth a lot of your time, but...

Comments for author File: Comments.pdf

Author Response

Thank you for your kind feedback; we do agree that the revised paper was much improved. There were errors missed in the resubmission due to the extensive revisions that were made, and the tracked changes format, but these have been corrected with the final revisions. We thank you again for your very informative feedback, which has enabled this progression.

We have considered many of your points and have again restructured some of the paper accordingly, including the background, methods, and the removal of some limitations in the discussion. However, due to time limitations, the additional technical information regarding the models has not been included. We have however included a link to an interactive map to explore the predicted gentrification. https://uni-of-leeds.maps.arcgis.com/apps/instant/basic/index.html?appid=99671dafa1ba4650814c31ee1159a050&locale=en 
