A Protocol for Evaluating Contextual Design Principles

This paper explains how scientific data can be incorporated into urban design decisions, such as evaluating contextual design principles. The recommended protocols are based on the Cochrane Reviews that have been widely used in medical research. The major concepts of a Cochrane Review are explained, as well as the underlying mathematics, which is meta-analysis. Data are reported for three applications and seven contextual design policies. It is suggested that use of the Cochrane protocols will be of great assistance to planners by providing scientific data that can be used to evaluate the efficacies of contextual design policies prior to implementing those policies.


Introduction
Contextual design, the concept that new projects should fit into their contexts, has been and remains a major planning principle. More than 90% of the cities in the United States, as well as cities in many other countries, regulate architectural aesthetics [1][2][3][4][5][6][7]. The importance of contextual fit stems from the fact that it is a major evaluation criterion in design review. Lightner [2] found that 77% of U.S. planning departments evaluated new buildings using the general criterion of how well the building would fit into its context. In addition, contextual design is relevant to other planning goals, such as facilitating tourism [8], attracting a highly-skilled workforce [9], coping with the increased density, For a two-alternative decision, such as adding street trees or prohibiting cars, the experimental validation would involve measuring how strongly each option changed a criterion, such as visual appeal, and then using those data as the benefit in the cost-benefit analysis required for quality control. That is the promise, which leads, inevitably, to the question of how such experimental designs can actually be done.

An Experimental Protocol
A very simple, yet highly efficient, protocol for obtaining data on effects, such as visual impacts, has three steps. First, images are created showing the alternatives (before and after, or proposed and a random sample of an existing one). Second, semantic differential ratings (such as (1) not appealing to (8) appealing) are obtained. Third, appropriate statistical tests are done to ascertain which, if any, alternatives are better or worse than other alternatives. The most common statistical test is a contrast between alternatives. Details on how to calculate contrasts are given in Rosenthal and Rubin [17]. Working examples are given in Winer, Brown and Michels [18]. Computer implementations are given in Tabachnick and Fidell [19].
For instance, Figure 1 shows two images that could be used to test whether the visual appeal of a block would be diminished by allowing a new infill building that would be taller than the existing buildings. The existing buildings would be the control group, and the new infill house would be the urban design alternative. When the experiment was done [20], the mean rating on the criterion of visual appeal was M = 5.51 for the existing block and M = 4.94 for the blocks with one taller infill building. Since 4.94 is less than 5.51, the presence of a larger infill building decreased the visual appeal of the block. Numerically, the difference in means (alternative minus control) was 4.94 − 5.51 = −0.57. Statistically, this difference achieved something called a "p-level" (For readers with statistical backgrounds, the "p-level" is actually the probability of an alpha error or the probability of reporting a false positive. The term "p-level" is used here, because behavioral scientists are more likely to be familiar with it than with the concept of alpha errors). The "p-level" indicates the probability that a finding could be due to chance. The conventional interpretation of a "p-level" is to accept a claim if p < 0.05. Finally, in order to make the findings compatible with findings from other experiments, the difference between means is converted into another measure: a correlation. The symbol for a correlation is r. No content is lost in this conversion. It is like measuring temperature in Celsius rather than Fahrenheit. Either measure will do equally well. The advantages of using r are explained in Rosenthal and Rosnow [21]. For the contrast between the existing block and the alternative, the contrast, reported as a correlation, was r = −0.27.
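The conversion of a contrast into a correlation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard formula r = t / sqrt(t² + df) for a one-degree-of-freedom contrast; the function names and the numerical inputs are hypothetical and are not the values from the experiment reported in [20].

```python
import math


def contrast_t_to_r(t: float, df: int) -> float:
    """Convert a contrast t statistic into a correlation effect size r.

    Assumes the usual one-degree-of-freedom contrast, for which
    r = t / sqrt(t^2 + df). The sign of t carries over to r.
    """
    if df <= 0:
        raise ValueError("df must be positive")
    return t / math.sqrt(t * t + df)


def contrast_f_to_r(f: float, df_error: int, sign: int = 1) -> float:
    """Same conversion starting from F(1, df_error), since F = t^2.

    F loses the direction of the effect, so the caller supplies the sign.
    """
    if f < 0 or df_error <= 0:
        raise ValueError("f must be non-negative and df_error positive")
    return sign * math.sqrt(f / (f + df_error))


# Hypothetical inputs, chosen only to illustrate the arithmetic.
r_from_t = contrast_t_to_r(-3.0, 40)
r_from_f = contrast_f_to_r(9.0, 40, sign=-1)
print(round(r_from_t, 3), round(r_from_f, 3))
```

Both calls describe the same hypothetical contrast, once as t and once as F, so they return the same r.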

Validation of the Proposed Protocol
Of course, the utility of this simple protocol for urban designers depends on the validity of the protocol. There are numerous possible difficulties. One possible objection is that environments cannot be represented by static color images [22]. Because the issue of simulation validity is so important, a considerable amount of research has been devoted to the Beaux Arts hypothesis. The basic paradigm for testing simulation validity is to obtain responses in the field, where people can look, turn or move around at will, and then to obtain the same responses using simulated environments. Over the past 30 years, this paradigm has been applied to many media. The media include dynamic virtual reality models.
Responses include ratings of pleasure, naturalness, familiarity, order, inertia, arousal, threat, disliking, liking of the environment, being a nice area to walk through, a good area to live in, appreciation of the area, visual appeal, evaluation, ambience, arousal, privacy, security, pleasant, interest, comfortable, excited, playful, water, built, water flow, sun, sound, strolling, resting, talking, observing, preference and spaciousness.

Figure 1. These two images show a difference due to the presence or absence of one environmental feature: an infill house that would be higher than the existing houses.
The current data cover 967 environments and 6,323 participants. Overall, responses obtained from static color images agree very strongly (r = 0.86) with responses obtained in the field [23]. From an evidence-based point of view, the topic of simulation validity has moved beyond the general question of whether simulations are valid at all to providing guidance regarding which refinements of simulations are most effective in different contexts. For example, measuring visual impacts along streets can be done with two images, while visual impacts in areas, such as open landscapes, will require four images [24].
Other possible difficulties might be the use of semantic differential scaling, how to represent affective responses, possible demographic effects, whether results from single experiments can, in fact, be replicated and whether the proposed protocol has worked in real-world applications. Again, there are considerable data on these issues. On the issue of semantic differential scaling, a review of studies from many researchers, covering 1,150 stimuli, indicated that many common methods of scaling (ratings, rank orders, Q sorts, physically placing stimuli on a table, raw score, comparative judgment, true score and signal detection theory) generated findings that were virtually identical (r = 0.99) [14] (pp. 98-101). For a model on how to represent affective responses, a review of work by many researchers generated data covering 7,168 participants, 1,768 environments and 23 possible measures. The overall finding was that affect can be measured in terms of three basic dimensions: pleasure, arousal and dominance. For the current article, the only relevant dimension is pleasure, which can be expressed as dislike/like, not appealing/appealing, ugly/beautiful or any synonyms thereof. For demographic effects, the current data, also obtained from the work of many researchers, cover 19,000 participants from 23 countries, 3,821 environments and 12 demographic groups. There was a very high consensus on the criterion of visual appeal for all demographic groups, except between adults and children aged 12 or younger, between the general population and designers for avant-garde projects, and for special interest groups, such as industry lobbyists or neighborhood groups. Furthermore, data are available regarding the reproducibility of the simple protocols recommended in this article. The result is that these protocols have been highly reproducible (r = 0.90). References to the data that support these validation claims are given in Stamps [25].
Real-world case histories of how the proposed protocols have been used include an avant-garde building, a determinant building envelope, an architectural competition, a study in scale and character, citizen participation, restoration after an urban forest fire and the design of a unique house [14] (pp. 185-271). Overall, judgments based on pre-construction experiments were highly predictive of judgments obtained from as-built projects (r = 0.86) [14] (p. 282).

Part 2 of a Cochrane Method
Another possible problem with using science to guide policy is that each single experiment will necessarily be limited in its scope and, inevitably, have flaws. Scientists know full well that no study is either complete or perfect. However, if findings from multiple experiments can be combined, then the applicable scope of the findings will be increased, and those pesky flaws will tend to cancel each other out (this is reader-friendly language for the theory of least squares and Fisher's insight that randomization is the physical implementation of that theory). Cochrane solved the problem of combining findings from multiple sources with the technique of meta-analysis. Meta-analysis is the second part of the Cochrane solution.

Meta-Analysis
Meta-analysis is used very widely. A search on the key words "meta-analysis" in the Science Citation Index during December, 2012, generated over 54,000 references. A general introduction to meta-analysis is given in Borenstein [26]. The use of meta-analyses to test theories is described in Cook et al. [27]. Mathematical implementations are given in Hartung, Knapp and Sinha [28]. The mathematics used in this article are given in Hedges and Olkin [29], because they provide a very simple model that requires only two inputs from each study: a measure of effectiveness, such as a correlation (r), and the n over which that correlation was calculated.
The connection between experimental design, meta-analysis and quality control in medicine is this: if there were a database of meta-analyses of various medical treatments for various diseases, then that database could assist medical decision-makers in selecting the best treatment for a given disease. For example, doctors now have a comprehensive database available on which drugs are most efficacious for various diseases. They can rely on the database, because all inputs must pass through the rigorous testing of randomized controlled trials. This database allows doctors to make judgments based on far more experience than any one doctor could acquire individually.
Such a database has, in fact, been implemented. The implementation has been done by the Cochrane Collaboration. Thus, the overall approach of the Cochrane Collaboration is as follows: "How do you know if one treatment will work better than another, or if it will do more harm than good? Cochrane Reviews are systematic reviews of primary research in human health care and health policy, and are internationally recognized as the highest standard in evidence-based health care. They investigate the effects of interventions for prevention, treatment and rehabilitation. They also assess the accuracy of a diagnostic test for a given condition in a specific patient group and setting. They are published online in The Cochrane Library" [30].
In other words, the Cochrane approach puts the evidence into "evidence-based decision making" [31]. To date, this approach has been applied to over 5,000 medical interventions [32]. Details on the Cochrane method are given by the Cochrane Collaboration [33]. The connection between the Cochrane Collaboration and quality control in urban planning is this: if there were a database on how effective various urban design guidelines are, then planners could use that information as a quality control technique to assist in selecting the best guidelines for implementation in their own jurisdictions.
Table 1 lists meta-analytic findings that would be obtained by reviewing scientific data on two hypothetical alternatives: Plan A and Plan B. In this simple example, there are data from three experiments for each of Plan A and B. Altogether there are 145 environments (the number of stimuli or "nstim") for each plan. The efficacies, or estimates of the measure of performance, for the individual experiments in Plan A were r = 0.85, r = 0.70, and r = 0.77. Without meta-analysis, the best we could conclude is that three experiments found strong support for Plan A. With meta-analysis, we can go further and conclude that, when combined, the collective estimate of the efficacy of Plan A is r = 0.76, and the probability of this result being due to random chance is very much less than 0.05. The policy recommendation would be that implementing Plan A would have a positive effect. The same conclusion can be reached by inspecting the 0.05 confidence interval (0.05 CI). If this interval includes 0.0, then p > 0.05. If this interval does not include 0.0, then p < 0.05. The reason for reporting both the p-level and the 0.05 CI is that most behavioral research uses the "p < 0.05" criterion, but the 0.05 CI is much more informative. The same finding can have a p-level < 0.05, but with very narrow (well-defined) precision or with very wide (poorly-defined) precision. Reporting both measures enables readers using either criterion to interpret the findings.
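The combination of several correlations into one estimate, with its p-level and 0.05 CI, can be sketched as follows. This is a minimal illustration, assuming the common fixed-effect approach in which each r is transformed to Fisher's z and weighted by n − 3; whether this matches the exact computation behind Table 1 is an assumption. The text gives only the total of 145 stimuli for Plan A, so the split of 50, 50 and 45 used below is hypothetical, and the output is not expected to reproduce r = 0.76 exactly.

```python
import math
from typing import NamedTuple, Sequence


class PooledCorrelation(NamedTuple):
    r: float        # pooled estimate, back-transformed from Fisher's z
    ci_low: float   # lower bound of the 0.05 CI
    ci_high: float  # upper bound of the 0.05 CI
    p: float        # two-sided p-level for the hypothesis r = 0


def pool_correlations(rs: Sequence[float], ns: Sequence[int]) -> PooledCorrelation:
    """Fixed-effect pooling of correlations via Fisher's z, weights n - 3."""
    if not rs or len(rs) != len(ns):
        raise ValueError("rs and ns must be non-empty and of equal length")
    if any(n <= 3 for n in ns):
        raise ValueError("each n must exceed 3")
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    total_w = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total_w
    se = 1.0 / math.sqrt(total_w)
    # Two-sided p-level from the standard normal distribution.
    p = math.erfc(abs(z_bar / se) / math.sqrt(2.0))
    half_width = 1.959963984540054 * se
    return PooledCorrelation(
        r=math.tanh(z_bar),
        ci_low=math.tanh(z_bar - half_width),
        ci_high=math.tanh(z_bar + half_width),
        p=p,
    )


# Plan A correlations from the text; the per-experiment n values are assumed.
plan_a = pool_correlations([0.85, 0.70, 0.77], [50, 50, 45])
print(plan_a)
```

In this sketch the interval and the p-level always agree: the interval excludes 0.0 exactly when p < 0.05, which is the equivalence described above.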

Converting Meta-Analytic Findings into Implementation Recommendations
The support for Plan B is different. Here, the overall efficacy is r = 0.13 and the p-level is greater than 0.05. For Plan B, the simplest scientific recommendation would be that, given the current scientific data, Plan B has no discernible benefit, so implementation would not be recommended.
Many variations on this simple meta-analysis are possible. The efficacies of two or more alternatives (e.g., Plan A vs. Plan B) can be compared to find out which is more efficacious. In this example, Plan A at r = 0.76 just plain works better than Plan B at r = 0.13. Experiments need not have a single factor: combinations and interrelations of multiple factors can also be analyzed; or, if there are different venues in the experiments (such as Experiments 1 and 4 showing residential streets, Experiments 2 and 5 showing shopping malls and Experiments 3 and 6 showing urban parks), then the findings reported above would indicate that, whatever it is, the efficacy of Plan A obtains over a variety of venues. Details on how to do these more complicated variations are given in Hedges and Olkin [29].
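The comparison of Plan A with Plan B can be sketched as a test of the difference between two pooled correlations. This is a minimal illustration, assuming that the two plans were tested on independent sets of stimuli and that each pooled estimate was formed with Fisher z weights of n − 3; under that assumption, three experiments and 145 stimuli per plan give a total weight of 145 − 3 × 3 = 136 for each plan. The function name is hypothetical.

```python
import math
from typing import Tuple


def compare_pooled_correlations(
    r_a: float, w_a: float, r_b: float, w_b: float
) -> Tuple[float, float]:
    """Return (Z, two-sided p) for the difference between two pooled r values.

    w_a and w_b are the total Fisher z weights, i.e. the sum of (n_i - 3)
    over the experiments that contributed to each pooled estimate.
    """
    if w_a <= 0 or w_b <= 0:
        raise ValueError("weights must be positive")
    z_diff = math.atanh(r_a) - math.atanh(r_b)
    se = math.sqrt(1.0 / w_a + 1.0 / w_b)
    z_stat = z_diff / se
    p = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return z_stat, p


# Pooled efficacies from the text; the weights follow from the assumption above.
z_stat, p_value = compare_pooled_correlations(0.76, 136, 0.13, 136)
print(f"Z = {z_stat:.2f}, p = {p_value:.3g}")
```

A p-level below 0.05 here would support the statement that the two efficacies differ, rather than only that one is detectable and the other is not.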

Applications
The preceding sections of this article have attempted to suggest that, in theory, the Cochrane method can be applied to issues of contextual fit. This section continues the presentation by reporting three experiments that investigated the effects on visual appeal of the three factors chosen for inquiry in this article: trees, cars and diversity.

Little Boxes
The venue for this study was Daly City, the homogeneity of which was made famous by the folk song "Little Boxes" [34]. The scenes were blocks of eight houses. In order to eliminate possible selection bias, eight different houses were selected randomly from a book on the architecture of Daly City [35]. Each house was then modeled in a CAD program. The CAD models were used, because the afore-referenced data on simulation efficacy indicated that these models can be used to represent real environments. Colors were taken from Watch and Hope [36] (pp. 2-5) and assigned at random to the houses in each scene. Figure 2 shows the site plan.

Experimental Design and Stimuli
The experimental design was a Greco-Latin square [37] (p. 146, Plan 4.2). There was a total of 16 scenes. There were four factors: trees, cars, diversity of building shape and diversity of building color. Ranges were 0, 2, 4 and 8 for the numbers of trees and cars. The blocks in this experiment were created with 0, 1, 2 or 3 bits of shape and color entropy. Entropy has been shown to be a very strong predictor of perceived diversity [38][39][40]. For total entropy (both shape and color), the range was 0 to 6 bits. Figure 3 shows examples of the design features and the minimum and maximum option for each feature.
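The entropy levels can be illustrated with a short calculation. This is a minimal sketch, assuming that entropy here means Shannon entropy of the relative frequencies of the shapes (or colors) on a block of eight houses; the letter codes for house types are hypothetical. Under that assumption, one, two, four or eight equally frequent types give 0, 1, 2 or 3 bits.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy_bits(labels: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the category frequencies in `labels`."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    counts = Counter(labels)
    # sum of p * log2(1 / p) over the observed categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical blocks of eight houses; each letter stands for one house shape.
for block in ("AAAAAAAA", "AAAABBBB", "AABBCCDD", "ABCDEFGH"):
    print(block, entropy_bits(block))
```

Because shape and color are coded separately, the total entropy of a block in this sketch would be the sum of the two values, which is consistent with the stated range of 0 to 6 bits.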

Data Acquisition
The participants were 29 undergraduate engineering students, with a mean age of 20.1 years (SD = 4.2). Twenty-six were male and three were female. Political affiliations were 11 liberal, 14 moderate, two conservative and two who did not state political affiliation. Scenes were shown in a PowerPoint presentation during a class. Instructions were "Please look at some street scenes and rate each on a scale of not appealing (1) to appealing (8). The first two scenes show the range of variation. Please do not rate the first two scenes". The two warm-up scenes were shown, followed by the 16 scenes in the experimental design.

The Venue
There were three constraints for selecting the venue in this experiment. The first constraint was that the style of the existing houses had to be homogeneous. The reason for this was that architectural style has such a large effect on visual appeal that it has to be controlled to investigate the effects of other design features. Second, a real venue was desirable to compensate for any effects due to simulations. Third, that real venue had to have blocks of houses with and without cars. The Sunset neighborhood of San Francisco met these constraints. Most of the neighborhood was built by a single developer at one time [41], resulting in perhaps the stylistically most homogeneous neighborhood in San Francisco. In fact, it is much more uniform than the Little Boxes of Daly City, even though both venues were built by the same developer. The neighborhood also had street-cleaning days, during which, in theory, there would be no cars on the street. Figure 5 shows the site plan.

Experimental Design and Stimuli
The original experimental design called for eight blocks, photographed during regular days and then again during street-cleaning days. However, after three months of attempts, it was impossible to find the required number of blocks in which there were eight contiguous sufficiently law-abiding households to satisfy the requirements of that experimental design. Accordingly, a random procedure was used to select the blocks. The random sampling began with a list of assessor's block numbers. A random permutation of those block numbers generated the (random) order in which blocks would be visited. The blocks were then visited in that random order and photographed. The sample was complete when there were eight blocks with and eight blocks without cars. Thus, this experimental design controlled for block effects through randomization. The result was a sample of 16 scenes, eight of which had street parking and eight of which did not. Examples are shown in Figure 6.
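The random sampling procedure can be sketched as follows. This is a minimal illustration, not the procedure as actually coded for the study: the block numbers are made up, the `has_cars` callable stands in for the field observation made when a block was visited, and the rule of skipping a block once its group is full is an assumption.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def sample_blocks(
    block_numbers: Sequence[int],
    has_cars: Callable[[int], bool],
    per_group: int = 8,
    seed: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Visit blocks in random order until each group holds `per_group` blocks.

    If the list is exhausted first, the returned groups are simply shorter.
    """
    rng = random.Random(seed)
    order = list(block_numbers)
    rng.shuffle(order)  # random permutation of the assessor's block numbers
    groups: Dict[bool, List[int]] = {True: [], False: []}
    for block in order:
        group = groups[bool(has_cars(block))]
        if len(group) < per_group:
            group.append(block)  # assumed rule: extra blocks of a full group are skipped
        if all(len(g) == per_group for g in groups.values()):
            break
    return {"with_cars": groups[True], "without_cars": groups[False]}


# Hypothetical block numbers and a stand-in for the field observation.
blocks = list(range(1000, 1100))
sample = sample_blocks(blocks, has_cars=lambda b: b % 3 != 0, seed=1)
print(sample)
```

Fixing the seed makes the visiting order reproducible, which may be useful when documenting how a sample was drawn.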

Data Acquisition
In the Little Boxes experiment, cars accounted for 8.2% of the variance, so that was the target for this experiment. Power analysis [42] indicated that 14 participants would be needed. Twenty-four participants were recruited by a professional survey research firm from the adult population of a major city in the United States. The participant sample was balanced for gender and political affiliation. The mean age was 49 years (SD = 17). Occupations ranged from student to attorney.

Results
Overall, the presence of cars had a very small effect on the visual appeal of the street (0.2% of variance, F(1,345) = 1.40, α = 0.17). The contrast of cars vs. no cars (shown in Figure 7) repeated the overall findings (Mcar = 4.30, Mno_car = 4.14, F(1,345) = 1.40, α = 0.17), to the effect that cars had a very small effect on the visual appeal of these streets. However, the results also hinted at the more detailed information that the visual appeal of streets was greater if there were cars. Given the ambiguity of this finding and previous findings for cars, it seemed appropriate to attempt replication in yet another venue.

Figure 7. Results for car vs. no car. Solid lines indicate contrasts that achieved the individual "p < 0.05" level of significance. Dotted lines indicate contrasts that did not achieve the individual "p < 0.05" level of significance.

"New" Amsterdam
The last experiment in this article revisited the factors of trees, cars and diversity. The reason for this selection of factors was that the previous work produced ambiguous or discrepant findings, suggesting that additional work in a different venue would be useful in deciding how well the previous data generalized. For trees, the Little Boxes finding was that trees increased visual appeal. For cars, the finding from the Little Boxes was that cars increased the visual appeal of a street, while the finding of cars vs. no cars was that adding cars had an undetectable effect on the visual appeal of a street. For diversity, the finding was that the entropy of color had more influence on appeal than the entropy of shape, but that finding was based on a venue chosen to be highly homogeneous, so that result might or might not hold up in more complex venues.

The Venue
In order to enhance the data on these three factors, another experiment was done in a more complex venue that was altered to express different amounts of trees, cars and diversity. A suitable venue was Amsterdam. Accordingly, this experiment used the same design features as were used in the Little Boxes study, but the options were changed. The venue was changed from suburban America to a dense European city (Amsterdam). Houses were selected from a visit to Amsterdam, and blocks were created in a CAD program to find out how strongly trees, cars and diversity in façade color would change the visual appeal of "New" Amsterdam. Figure 8 shows the site plan.

Experimental Design and Stimuli
Options for trees were 0, 2 or 4. There were either no cars or eight cars. Diversity was expressed as differences in brick color, with a range of 0, 2 or 3 bits. The CAD protocols were the same as were used in the Little Boxes study. The experimental design was a factorial of trees (3) by cars (2) by brick entropy (3), for a total of 18 scenes. Examples of the design features and options are shown in Figure 9.
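The full factorial design can be enumerated directly from the options listed above. This is a minimal sketch; the variable names and the dictionary layout are illustrative choices, not part of the original study materials.

```python
from itertools import product

# Options as stated in the text.
TREE_OPTIONS = (0, 2, 4)          # number of trees
CAR_OPTIONS = (0, 8)              # number of cars
BRICK_ENTROPY_BITS = (0, 2, 3)    # diversity of brick color, in bits

# Every combination of one option per factor is one scene.
scenes = [
    {"trees": trees, "cars": cars, "brick_entropy_bits": bits}
    for trees, cars, bits in product(TREE_OPTIONS, CAR_OPTIONS, BRICK_ENTROPY_BITS)
]

print(len(scenes))  # 3 * 2 * 3 scenes
for scene in scenes[:3]:
    print(scene)
```

Because every option of each factor appears with every combination of the other two, the effects of trees, cars and brick entropy can be estimated separately from the same 18 scenes.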

Data Acquisition
Power analysis indicated that 19 participants would be required. Twenty-four participants were recruited by a professional survey research firm. There were equal numbers of men and women and also equal numbers of political liberals, moderates and conservatives. The mean and standard deviation of age were M = 45.1 and SD = 13.5 years. Occupations ranged from social worker to business owner.
In terms of correlations with visual appeal of the street, the findings were r = 0.92 for trees, r = −0.11 for cars and r = 0.075 for the entropy of bricks, all on n = 18 stimuli.

Figure 10. Results for "New" Amsterdam. Solid lines indicate contrasts that achieved the individual "p < 0.05" level of significance. Dotted lines indicate contrasts that did not achieve the individual "p < 0.05" level of significance.

Synthesis
As mentioned above, the data described in this article are not the only scientific work done on principles of contextual design. Table A1 in the Appendix describes 55 such findings, obtained from 638 scenes and 1,130 participants, on how well seven contextual fit policies would, if implemented, work. The caption in Table A1 also describes how to interpret the symbols "Experiments", "Σnstim", "r" and "0.05 CI". The data listed in Table A1 are useful for detailed applications and research; however, the sheer quantity of information can be overwhelming, and valid generalizations can be obscured by all of that detail. Fortunately, this is precisely the problem that can be solved with meta-analysis. Application of meta-analysis to the data in Table A1 produces the list of the efficacies of seven contextual design principles listed in Table 2.
The relevance of Table 2 to planning policy is as follows. When evaluating how features of new projects will change the visual appeal of the physical context of those projects, the most important feature so far is architectural style. Based on 191 scenes, the overall effect is r = 0.64. Based on the detail provided in Table A1, it seems that whether a new project does or does not match the styles of its context makes a difference. The recommendation for planning policy is, consequently, that it will be efficacious to include style as a design feature worth regulating. For diversity, the current database consists of 136 scenes and a finding of r = 0.51 for the effect of diversity on visual appeal of a streetscape. Based on the detail listed in Table A1, it seems that streetscapes with more visual diversity are more appealing than streetscapes with less visual diversity. Accordingly, visual diversity is a design feature that is worth the effort needed for regulation.
The third feature is the height of new projects. This time, the current database consists of 136 scenes, and again, the relationship between infill height and the visual appeal of a streetscape is solid. Based on the detail in Table A1, it seems that when the height of an infill project is more than twice the height of the buildings in the project's visual context, the appeal of the streetscape decreases. However, the size of the effect of infill height on the visual appeal of the streetscape (r = −0.38) is less than the effects of style or diversity. This indicates that, when selecting design features for regulation, a cost/benefit analysis would be helpful.
The next three design factors (distance, third story setback and cars) did not have detectable effects (p > 0.05). That means that there is not yet sufficient data to provide scientific justification for regulation of these factors. Thus, the current scientific recommendation for the practice of these three factors is to withhold implementation.

Discussion
This article attempted to bring together practice and research by applying the Cochrane method to issues of contextual fit. The Cochrane method, like all methods of empirical science, has its limits. However, where there are limits, there are also opportunities for further dialog. For example, in this article, choices had to be made regarding scope, technical details and whether the scientific findings should be advisory or mandatory. Of course, more than one perspective is possible. This section describes some of those possibilities and suggests keywords that may reward additional discourse.

Scope
As was noted above, this article focuses on the evaluation of the efficacies of specific contextual design principles. This focus allowed the application of the most basic type of experimental evidence to a planning decision: ascertaining how well one end could be achieved by alternate means. Actual planning decisions, of course, will typically involve more than one end. For example, other possible ends for planning might include transport, energy, environment, accessibility, social justice, participation, participatory planning, sustainability, urban design or affordable housing. The scientific literature contains many meta-analyses that may be useful for planners who need to incorporate these ends into their plans. For instance, a search in the Science Citation Index during October, 2014, on the keyword "meta-analysis" and each of the planning ends listed above located twelve possibly relevant meta-analyses [43][44][45][46][47][48][49][50][51][52][53][54]. Whether an existing meta-analysis, or, for that matter, any other type of scientific study, meets the criteria for generating valid scientific policy guidance will have to be determined on a case-by-case basis. Guidelines for making such determinations are given in [55]. Likewise, planning decisions typically also require making trade-offs among different ends. Techniques for making those trade-offs are described under the concept "quality control" [56][57][58][59][60][61][62][63][64][65].

Technical Details
Besides being based on empirical data, the work in this article is based on the mathematics of contemporary statistical analysis. In particular, the Cochrane method is based on the mathematics of meta-analysis. The presentation of meta-analysis provided in this article is very much a bare-bones model. The bare-bones model is adequate for addressing simple, focused questions, but more elaborate questions will require more elaborate types of meta-analyses. Guidance for performing more elaborate meta-analyses can be found in [16][26][27][28][29][66][67][68][69][70][71]. For example, there may be concern about whether an individual finding generalizes over different groups of subjects or different physical environments. In meta-analysis, this type of question is handled by considering the heterogeneity of a meta-analytic result. The details on how to analyze heterogeneity are beyond the scope of this article, but the information needed to analyze heterogeneity can be found in, among other sources, [29] (pp. 108-191).
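For readers who want a sense of what a heterogeneity analysis involves, the most basic statistic can be sketched briefly. This is a minimal illustration, assuming the commonly used Q statistic computed on Fisher z values with weights of n − 3; it is not a substitute for the fuller treatment in the sources cited above. The inputs reuse the three Plan A correlations with hypothetical per-experiment n values.

```python
import math
from typing import Sequence, Tuple


def heterogeneity_q(rs: Sequence[float], ns: Sequence[int]) -> Tuple[float, int]:
    """Return (Q, df) for a set of correlations.

    Q is the weighted sum of squared deviations of each Fisher z from the
    pooled z. Under homogeneity it is compared with a chi-square
    distribution on df = k - 1 degrees of freedom.
    """
    if len(rs) < 2 or len(rs) != len(ns):
        raise ValueError("need at least two studies, with one n per r")
    if any(n <= 3 for n in ns):
        raise ValueError("each n must exceed 3")
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return q, len(rs) - 1


# Plan A correlations from the text; the per-experiment n values are assumed.
q, df = heterogeneity_q([0.85, 0.70, 0.77], [50, 50, 45])
print(f"Q = {q:.2f} on {df} degrees of freedom")
```

A Q that exceeds the chi-square critical value for its degrees of freedom (5.991 for df = 2 at the 0.05 level) would suggest that the experiments do not share one common effect.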

Advisory or Mandatory
The synthesis section of this article proposes that contextual fit guidelines should be implemented only if those guidelines are scientifically valid. This proposal raises another issue that may be of interest to planners: whether scientific findings should be used as advice or as mandatory standards. In medicine, the Cochrane Library is used as an advisory service for doctors by providing the best available data on how well various medical interventions work. This information enables doctors to access much, much more information than is available from one doctor's experience. The Cochrane Library can also be used in a regulatory capacity, such as approving or disapproving drugs. A good source for information on this point is the Food and Drug Administration of the United States' presentation on how drugs are approved.

Conclusions
Much more specific guidance can be generated from the available data, but space limitations preclude inclusion of that material in this article. Accordingly, this article concludes by only suggesting that: (1) there is a need for urban design to have a better system of validation and a method for testing urban design theories and generating alternatives; (2) the same need has been addressed and solved in the discipline of medicine using randomized experimental designs for individual studies and meta-analysis for discovering results from multiple studies; (3) the methods of randomized experimental design and meta-analysis can be applied to issues in urban design, such as contextual design; and (4) the application of randomized experimental designs and meta-analysis has produced very useful information regarding seven specific contextual design principles.

Urban streets Trees 0-4 24 * 18 * 0.92 "New" Amsterdam
* The same participants were already counted in previous experiments; "nsubj" means number of subjects. "nstim" means number of stimuli.