Google Medical Update: Why Is the Search Engine Decreasing Visibility of Health and Medical Information Websites?

The Google search engine answers many health and medical information queries every day. People have become used to searching for this type of information. This paper presents a study which examined the visibility of health and medical information websites. The purpose of this study was to find out why Google is decreasing the visibility of such websites and how to measure this decrease. Since August 2018, Google has been more rigorously rating these websites, since they can potentially impact people’s health. The method of the study was to collect data about the visibility of health and medical information websites in sequential time snapshots. Visibility consists of combined data of unique keywords, positions, and URL results. The sample under study was made up of 21 websites selected from 10 European countries. The findings reveal that in sequential time snapshots, search visibility decreased. The decrease was not dependent on the country or the language. The main reason why Google is decreasing the visibility of such websites is that they do not meet high ranking criteria.


Introduction
Some types of websites could potentially impact people's future happiness, health, financial stability, or safety. Google calls such pages "your money or your life" (YMYL) pages [1]. Google recognizes five types of YMYL pages. First are shopping or financial transaction pages. These webpages allow people to make purchases, transfer money, pay bills, and so on, online (such as online stores and banking) [1]. Second are financial information pages. These webpages provide advice or information about investments, taxes, retirement planning, home purchases, paying for college, buying insurance, and so on [1]. Third are medical information pages. These webpages provide advice or information about health, drugs, specific diseases or conditions, mental health, nutrition, and so on [1]. Fourth are legal information pages. These webpages provide legal advice or information on topics such as divorce, child custody, creating a will, becoming a citizen, and so on [1]. Fifth are news articles or public/official information pages that are important in order to have an informed citizenry. These webpages include information about local/state/national government processes, policies, people, and laws, disaster response services, and government programs and social services; as well as news about important topics such as international events, business, politics, science, and technology, and so on [1]. Of course, not all news articles are necessarily considered YMYL.
In past studies, authors observed three main areas in analyzing search data on health and medical information: medical and health information websites, search engine result pages with medical and health information results, and data collected from Google Trends (GT) and/or Google Flu Trends and/or Google Cloud Healthcare application programming interface (API) (formerly Google Health). queries connected to diseases [44]. GT is a source of reverse-engineered data. It shows what was searched in Google, and the data are normalized in terms of search frequency and presented as relative search volumes. Data are segmented into years and months, and into geographical regions. Researchers can compare a maximum of five keywords across these segments in a single query. Studies on GT can be divided into four areas (infectious diseases, mental health, other diseases, and general population behavior) [45] and are mainly conducted to examine seasonality [46].
Basic search engine visibility is combined data of unique keywords, positions, and URL results. According to the concept of search engine visibility described in [87], the visibility of websites in search engines comes from algorithms that rank and order them according to calculated ranking positions. The original concept [88] of ranking for the Google search engine is named PageRank, after one of Google's founders. PageRank was invented and published in 1998 [89]. This concept takes into account incoming links, and based on volume and quality, ranking positions for websites and corresponding keywords are estimated [90]. Currently, web search engines use different ranking factors for websites to determine their position on a results page.
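The link-based idea behind PageRank can be illustrated with a minimal sketch. It is a simplified textbook version computed by power iteration on a toy link graph, not Google's current ranking algorithm; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: page "c" receives the most incoming links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest rank, and the page with no incoming links ends up with the lowest, which mirrors the volume-and-quality intuition described above.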
Today this topic is attracting more attention [91], and ranking factors can be divided into onsite and offsite factors [92]. Onsite factors are domain-, website-, and page-related [93]. Search engines take into account different elements found in the source code of a webpage, such as the title, headings, descriptions, time of last update, mobile design, and structured data for rich snippets [94]. Offsite factors are link-related [95], user action-related [96], special rules-related [97], brand-related [98], and spam-related [99].
The motivation behind the present study is to analyze what types of websites experience decreasing visibility on search engine results pages due to low-quality medical and health information content. There is no doubt that much research has been done on health and medical information websites and Google as a source of medical knowledge. However, there is little knowledge of the ways medical and health information websites are lowered or removed from search engine results pages due to low-quality content.
Thus, the current gap represents a lack of research on the decreasing visibility of medical and health information websites. There have been several studies on medical and health information presented on Google, conducted mainly by researching GT. Recognizing which websites are not considered to be proper sources of medical and health information by Google and why is a gap the author is trying to fill.
The objective of this study was to analyze data from an external service that monitors Google's search engine results pages and collects data on websites' visibility. By measuring increased or decreased visibility of health and medical information websites, it is possible to recognize the websites that are considered to have low-quality content. Based on the above discussion, the following research questions related to decreasing visibility of health and medical information websites in Google are proposed:
1. Why is the Google search engine decreasing the visibility of health and medical information websites?
2. How can we measure the decreased visibility of health and medical information websites?
The paper is organized as follows. Section 2 includes the method, search engine visibility concept, and material for data retrieval and processing. Section 3 contains the results, while Section 4 presents the discussion. In Section 5, the author highlights the contribution of the research, discusses its limitations, and, finally, draws conclusions about the results and proposes possible future research avenues.

Materials and Methods
This study used search engine visibility data on websites with health and medical information. The author selected European countries based on criteria using this list: https://en.wikipedia.org/wiki/List_of_European_countries_by_population, including countries located in Europe, not in Asia (excluding Russia, Turkey, and Kazakhstan), and countries where Google operates. For the first to ninth positions on the list there was no doubt about population. For the tenth position, five countries had populations of more than 10 million, which was sufficient for choosing one of them. Greece was chosen, since this part of Europe was not yet represented in this study; Belgium was not chosen because French- and Dutch-speaking countries were already on the list. Countries with higher populations have more Internet users, thus it is more likely that there are many health and medical websites. Ten countries were selected to check whether visibility in a search engine depends on the country and/or language or is not influenced by either.
The author analyzed the top 20 results on a sequence of keywords: "google medical update site:.cc," where "site" is a search operator that narrows the results, in this case to country domain name, and ".cc" means country-coded domain name. The author selected 10 countries with the highest populations in Europe: Germany, France, United Kingdom, Italy, Spain, Ukraine, Poland, Romania, the Netherlands, and Greece.
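The query pattern can be sketched as follows. The two-letter codes in the mapping are assumptions made for illustration (in particular, "uk" for the United Kingdom); the codes actually used in the study are those listed in Table 1.

```python
# Assumed country-coded domain suffixes for the ten selected countries.
# These codes are illustrative; the study's own codes appear in its Table 1.
COUNTRY_CODES = {
    "Germany": "de",
    "France": "fr",
    "United Kingdom": "uk",
    "Italy": "it",
    "Spain": "es",
    "Ukraine": "ua",
    "Poland": "pl",
    "Romania": "ro",
    "Netherlands": "nl",
    "Greece": "gr",
}


def build_query(code, phrase="google medical update"):
    """Combine the search phrase with the site: operator for one country code."""
    return f"{phrase} site:.{code}"


# One query string per country, e.g. for Poland:
# "google medical update site:.pl"
queries = {country: build_query(code) for country, code in COUNTRY_CODES.items()}
```

The sketch only builds the query strings; it does not send them to a search engine.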
The term "Google medical update" refers to changes in Google's algorithm starting on 1 August 2018 [100]. Results for these sequences of keywords allowed for collection of websites potentially affected by decreased visibility by the Google search engine. The new algorithm rewards websites with well-researched, accurate health and medical content and decreases the visibility of those whose content is lacking in terms of credibility [100].
The top 20 results returned by Google for the query "google medical update site:.cc" were examined and the author collected the list of websites as a convenience sample that could be the subject of further study. The steps for searching and examining results were repeated 10 times, once for each country, using the country-coded domain suffix. Table 1 shows a comprehensive list of found and selected websites for further examination.
In Table 1, Code refers to the country code used in the search query. Almost all of the websites collected use country-coded domain names; however, some websites use generic domain names such as ".com" or ".org" or others such as ".to," which belongs to Tonga but in Polish means "it." Most websites use the official language of their country; an exception is the Ukrainian websites, which are in Russian. Index size is the number of results reported by the Google search engine for the search operator "site:website." Although Google displays a maximum of 1000 results, the estimated total is shown below the query and above the first results. It is an indicator of the size of the website, that is, the estimated number of pages that belong to one website. Index size was retrieved on 19 December 2019. Visibility in search engines is measured as the number of keywords, positions, and visible pages and can be used to compare competing organizations in one common area or industry [101,102]. The comparison discloses the search market share of each compared organization [103]. Based on this comparison, further analysis of Internet strategies in marketing, sales, promotion, and publishing can be done. Visibility in search engines is always subject to algorithms that sort and set rankings of results based on type of content, metadata, and models of content creation [104][105][106].
In this study, the data did not originate from Google, but from external services. Data regarding visibility were retrieved through the commercial online tool Ahrefs [107]. This tool specializes in retrieving and saving data about website visibility in search engines. In addition to recording basic visibility, Ahrefs imports additional data and develops its own visibility metrics. These data were used to compare search engine visibility of websites on health and medical information before and after they were affected by Google's new update.
The method of collecting data from a search engine is called scraping. Search engines usually prohibit data scraping in their terms of service, and Google is no exception. However, search engines cannot distinguish scraping, when done very gently, from normal user search behavior. Users use search engines dozens of times a day, and only if the search engine recognizes unusual traffic from the user's network can it ask the user to solve a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) to prove that the entered queries are not automated. Google does not share any download or export methods for results or provide an API for exporting search results, so the only way to obtain data is to scrape them directly from the results. Online tools such as Ahrefs scrape Google on a large scale and allow subscribed users to use the scraped data. The next section presents results on the visibility of the websites examined.

Results
The first step of the study was to collect data on the visibility of health and medical information websites. The list contains 21 websites that are popular in the 10 most populated European countries. Using Ahrefs, four data snapshots were retrieved at intervals of 5 or 6 months. The collected data were run through a search visibility metric developed by Ahrefs, built on the number of keywords, their positions, and the estimated click-through rate. Data snapshots were taken with the following timestamps: The visibility metric estimates the total monthly search traffic to the target website from the organic search results. It is calculated as the sum of traffic from all keywords for which the target website ranks in the search engine results page. The data retrieved are presented in Table 2.
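The structure of such a metric, a sum over ranking keywords of estimated traffic, can be sketched as follows. The exact Ahrefs formula is not given in this paper; the click-through rates per position, the default rate, and the example keywords below are hypothetical values used only to show the calculation.

```python
# Hypothetical click-through rates by ranking position (illustrative only;
# these are not the values used by Ahrefs).
ASSUMED_CTR = {1: 0.30, 2: 0.15, 3: 0.10}
DEFAULT_CTR = 0.01  # assumed rate for every other position


def estimated_traffic(rankings):
    """Sum estimated monthly organic traffic over all ranking keywords.

    rankings is an iterable of (keyword, monthly_search_volume, position).
    """
    total = 0.0
    for _keyword, volume, position in rankings:
        # Traffic for one keyword = search volume x assumed click-through rate.
        total += volume * ASSUMED_CTR.get(position, DEFAULT_CTR)
    return total


# Hypothetical rankings for one website.
example_rankings = [
    ("keyword a", 1000, 1),
    ("keyword b", 500, 3),
    ("keyword c", 200, 12),
]
traffic = estimated_traffic(example_rankings)
```

Under these assumed rates, a website that loses positions for its keywords receives a lower estimate even if the number of ranking keywords stays the same.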
The data retrieved span a very wide range, which strongly depends on the index size in the Google search engine. Websites with more indexed pages have more chances to be visible in search results, because more individual webpages can be displayed. Visibility strongly depends on the keywords for which webpages are shown in search results. The more pages indexed from one website, the more keywords will lead to that website on a search engine results page. In the second step of the study, data were normalized in a similar way to data presented in Google Trends. In GT, the most frequent keyword is set to a score of 100 and is used as an indicator. Other keywords are relative to this indicator and have scores between 1 and 100. In this dataset, all results for snapshot S1 were normalized to 100 and were used as the starting visibility indicator. Then, results from the next three snapshots were expressed relative to the starting indicator. Results from the four snapshots are presented in Figure 1. Figure 1 is a boxplot illustrating that visibility decreased in the subsequent time snapshots. Snapshot S1 was taken two days before Google announced changes in its algorithm for health and medical information websites. The value for each website was normalized to 100, which is why all descriptive statistics in Table 3 for snapshot S1 equal 100. Data from the second snapshot reveal that changes in Google's algorithm were observed. In snapshot S2, five websites had increased visibility, one had the same, and 15 websites had decreased visibility in Google search engine results. Descriptive statistics for snapshot S2 show a median of 70 and mean of 77.57, whereas previously both were 100.
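The normalization step can be sketched as follows: each website's value in snapshot S1 is set to 100 and later snapshots are expressed relative to it. The website names and raw values are hypothetical and do not come from Table 2.

```python
from collections import Counter


def normalize_to_baseline(snapshots):
    """Scale a website's snapshot values so that the first snapshot equals 100."""
    baseline = snapshots[0]
    if baseline <= 0:
        raise ValueError("baseline visibility must be positive")
    return [100.0 * value / baseline for value in snapshots]


def classify_change(normalized, index):
    """Compare one snapshot with the baseline of 100."""
    value = normalized[index]
    if value > 100:
        return "increased"
    if value < 100:
        return "decreased"
    return "unchanged"


# Hypothetical raw visibility values for snapshots S1 to S4.
raw_visibility = {
    "site-a.example": [2000, 1400, 1000, 1100],
    "site-b.example": [500, 600, 450, 400],
    "site-c.example": [800, 800, 400, 400],
}

normalized = {
    site: normalize_to_baseline(values) for site, values in raw_visibility.items()
}

# Count how many websites increased, decreased, or stayed the same in S2.
s2_counts = Counter(classify_change(values, 1) for values in normalized.values())
```

Normalizing per website removes the effect of index size, so a large and a small website that both lose 30% of their visibility are treated alike.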
In snapshot S3, decreased visibility is still observable. Only one website had higher visibility compared with the starting date. The other websites measured had further decreased visibility in Google search engine results. Descriptive statistics for snapshot S3 show a median of 52 and mean of 51.43. In snapshot S4, visibility stayed at the same level as in the previous timestamp. Some websites had better visibility than in S3, but the dataset as a whole still had lower visibility compared with the starting date. Table 3 presents descriptive boxplot statistics for all snapshots (snapshot S1 is normalized to 100). It shows that visibility in the observed periods changed, and in this dataset, visibility decreased in snapshots S2 and S3. The last snapshot, S4, is very similar to the previous one.
It was stressed in Section 2 that visibility over a long period of time depends on many factors. Search engines take into account different factors found inside and outside websites and treat them as ranking signals. All of these factors over a long period of time influence websites' visibility. In a shorter period of time, large changes in visibility are the effects of changes in Google's ranking algorithm. This indicates that the studied websites had decreased visibility after Google rolled out its medical update.

Discussion
The main finding of this study is that websites that did not meet high ranking criteria in terms of health and medical information have been lowered in ranking since 1 August 2018. According to Google's general guidelines, the search engine considers three areas of a website when rating the quality of a page [1]. The first is webpage content, by identifying main content, supplementary content, and advertisements. The second is understanding the website by finding the homepage, who is responsible for the website, and who created the page content, and finding sections on the page such as "about us," contact information, or customer service information. The third is evaluating the reputation of the website or the creator of the main content by identifying sources of information on reputation and customer reviews of businesses. Page quality rating is based on how well the page achieves its purpose.
According to the results of this paper, the websites studied have lower visibility in the Google search engine. Since the exact criteria used by Google are not generally known (e.g., the current ranking algorithm is considered confidential), it is assumed that the main reason for the lower visibility is low-quality content. Low-quality websites may have been intended to serve a beneficial purpose. However, they do not achieve their purpose well because they lack an important dimension, such as having an unsatisfactory amount of main content, or because the creator of the main content lacks expertise for the purpose of the website.
The observed change in Google's algorithm concerns health and medical information websites. Until this change, this topic was unregulated. As the answer to research question 1, the author found that anyone can create health and medical content and publish it online. It does not need to be checked and corrected by a medical professional. Many people search for health and medical information using Google, and researchers use data on this from GT. However, content created without any professional supervision can be misleading and ultimately dangerous. Inaccurate information can cause unforeseen consequences such as not visiting a doctor or having a false sense of security. That is why the most accurate, respected, and thoroughly researched health and medical content is displayed at the top of the search engine results.
To measure decreased visibility, it is necessary to have a sample of websites. All websites need to be measured at the same time with the same visibility metric. In this study, the author used Ahrefs data as the source of visibility. When sequential data snapshots reveal that the calculated metrics are decreasing at each timestamp, this indicates that visibility is decreasing. This was observed for the 21 websites collected as a convenience sample for this study. The results also show that the decrease in visibility does not depend on the country or language of the website, thus answering research question 2 as well.
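The measurement procedure described here can be sketched as a per-snapshot summary followed by a check for sequential decline. The normalized values are hypothetical and are not the study's data; treating a snapshot equal to the previous one as "not increasing" is an assumption that matches the observation that S4 was very similar to S3.

```python
from statistics import mean, median


def summarize_snapshots(normalized_by_site):
    """Return the median and mean across websites for each snapshot."""
    n_snapshots = len(next(iter(normalized_by_site.values())))
    summary = []
    for i in range(n_snapshots):
        # Collect the i-th snapshot value of every website.
        column = [values[i] for values in normalized_by_site.values()]
        summary.append({"median": median(column), "mean": mean(column)})
    return summary


def is_declining(summary, key="median"):
    """True if the chosen statistic never rises from one snapshot to the next."""
    series = [row[key] for row in summary]
    return all(later <= earlier for earlier, later in zip(series, series[1:]))


# Hypothetical normalized visibility (S1 = 100) for snapshots S1 to S4.
normalized_visibility = {
    "site-a.example": [100, 70, 50, 50],
    "site-b.example": [100, 120, 90, 80],
    "site-c.example": [100, 90, 40, 45],
}

summary = summarize_snapshots(normalized_visibility)
medians = [row["median"] for row in summary]
declining = is_declining(summary)
```

The same summary could be computed separately per country or language to check whether the decline differs between groups.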
To the best of the author's knowledge, this is the first study to use data on search visibility on Google to assess the fluctuation of health and medical information websites in search engine results. Moreover, this is one of the first studies to compare Google visibility data between multiple countries and languages.

Conclusions
Google has very high page quality rating standards for YMYL pages, because low-quality YMYL pages could potentially have a negative impact on users' happiness, health, financial stability, or safety [1]. In terms of medical and health information websites, medical advice should be written or produced by people or organizations with appropriate medical expertise or accreditation. Medical advice or information should be written or produced in a professional style and should be edited, reviewed, and updated on a regular basis.
It is possible to have everyday expertise in YMYL topics. For example, there are forums and support pages for people with specific diseases. Sharing personal experience is a form of everyday expertise. If forum participants tell others how long their loved ones lived with liver cancer, this is an example of sharing personal experience (in which they are experts), not medical advice. Specific medical information and advice (rather than descriptions of life experiences) should come from doctors or other health professionals. Formal expertise is important for topics such as health and medicine.
The strength of this work is in pointing out that the dominant search engine has started to rate health and medical information websites more rigorously than before. This approach can be observed using the method proposed in this work. The weakness of this work is that low quality was only assumed from a manual examination of these websites. Most of them offer pseudo-therapies or health tips not supported by scientific evidence, and some even cooperate to provide a platform for distributing fake health tips.
This study has several limitations. First, the observation was conducted in only one area. It does not reflect other types of information-centred websites subject to higher page quality ratings, such as financial, legal, or government sites. Two health and medical information websites from each country were the subject of the study; however, this sample size cannot adequately represent the whole area. To make the conclusions more convincing, data from more websites will be collected in the future. Second, the observation was conducted only for 10 European countries. This observation does not reflect online health and medical information websites in other countries globally. Data reflecting more countries will be collected in order to further investigate the role of Google's medical update in online health and medical information websites. Third, although each health and medical information website was observed in terms of the same factors, there are still unobservable factors, such as brand recognition across online webpages, which might influence their search visibility. Further studies will retrieve more data to address this issue.
One avenue of future research is to study how health and medical information websites are reacting to decreased visibility and the measures they take to counteract this decrease. Another direction for future research is to study health and medical information websites for which visibility has increased and analyze which factors influenced the increase.