Big Data is a term that typically describes a large amount of data that can be harnessed in novel ways to provide insights of significant value (Mayer-Schönberger and Cukier 2013). One source of ‘Big Data’ is Google Books. Initiated by Google Inc. in 2004, Google Books contains the digitized full texts of over 25 million books (Heyman 2015) and other publications dating back centuries, and can be accessed for free by anyone with online access. All the written works that are currently in the digital database have been scanned and converted to text using optical character recognition. Over a decade ago, Duguid attempted to assess the quality of the Google Books Library Project (GBLP), claiming that quality assurance online is mostly provided via innovation or inheritance (i.e., quality assurance techniques that pre-date the internet and which are assumed to carry across to the online world). Duguid suggested that quality assurance in relation to the GBLP mainly came through inheritance (e.g., library reputation). Based on his search for one particular book (i.e., Laurence Sterne’s Tristram Shandy) and a comparison with other online repositories, Duguid concluded that the GBLP was important and invaluable but that, by relying purely on the power of its search tools, Google had ignored elemental metadata, such as volume numbers, and was therefore (at times) inadequate.
Other scholars have noted errors occurring within Google Books metadata, including inaccurate dates, misspellings, duplicate copies, and inaccurate subject classifications (Harper 2016; Jacsó 2008; Weiss 2016), as well as the lack of recent content and a skew towards documents in the English language (Abrizah and Thelwall 2014; Harper 2016) and a lack of books in some areas such as health sciences (Harper 2016). However, Google Books has also been credited as being a ‘huge laboratory’ for interpretation, indexing, and working with document image repositories (Jones 2010), with a number of academic papers providing practical tips on how academics can get the most out of Google Books (Moore 2018; Ortega 2014; Whitmer 2015).
More recently, Fagan carried out an evidence-based review of academic web search engines. The paper sought articles, books, and conference presentations from the previous three years (2014–2016) that had examined the use and/or utility of Google Scholar, Google Books, and Microsoft Academic. It was concluded that Google Books as a search tool needs dedicated research from information scientists and librarians concerning its coverage, utility, and/or adoption, but that there were many conveniences, including the type of book availability, date, document type, print-on-demand services for out-of-print books, and textbook rental (e.g., Harper 2016; Mays 2015; Ortega 2014).
Despite the wealth of research on the utility and practicality of using Google Books as an academic tool, the present paper introduces a new research method that can be used within Google Books to verify or disconfirm the most basic knowledge about the earliest known publication of a term, word, or name from within any academic discipline. The method is a very precise six-stage, Boolean, date-specific research method on Google, referred to as Internet Date Detection (IDD) for short.
The present authors argue that, for finding earlier texts, the IDD method is superior to simply searching on Google with one try or to using its Ngram Viewer. This is because IDD goes beyond using the search engine in the way its developers apparently intended: it involves altering Google’s default search parameters so that it searches on exact words, terms, and names multiple times in a very specific way. Google’s programmed default is to stop users doing exactly that and to make them search more widely, and therefore less precisely, because, if the user lets it, Google removes the Boolean element from the user’s search after the first try. The way it does that, and how to override it, is explained later in the ‘Method’ section of this paper. As will be demonstrated, the IDD method overrides this default on Google’s standard search engine page, a default that is possibly there to save valuable server processing time in favor of non-academic users clicking revenue-generating advertisements and leaving, rather than exploiting the search engine to the full for scholarly purposes. That said, it should be noted that the first four steps and step six of the six-stage search method in the present paper (outlined below) can, at the time of writing, be implemented directly on the Google Books Advanced Book Search webpage.
It is important here to make a distinction between the simple coinage myth-busting findings made with the use of the IDD method and current scientific knowledge from other published research on the Google Books corpus. For example, within the field of socio-cultural and linguistic evolution, research into the usefulness of Ngrams has concluded that we should “take a very cautious approach to any effort to extract scientifically meaningful results” (Pechenick et al. 2015, p. 27).
Making the Invisible More Visible
) explained how pervasive, mundane technology, such as electric lights and automobiles, becomes ‘invisible’ and so taken for granted. He did so to make the point that some technology might appear mundane and therefore comparatively simple by association, but this does not make its use any less valuable. Today, Internet search engines can easily be taken for granted, but they have significantly helped when it comes to finding data. Search engines are used on the Internet, in libraries, and newsprint archives to find otherwise out-of-sight information. Bourdieu
noted that the function of sociology, as of every science, is to reveal that which is hidden. Internet search engines certainly do that, and it would be unwise not to mark their great sociological value in that regard as an example of technology and science combining to enable the discovery, analysis, and explanation of things of value to the social sciences that are otherwise undetectable. An amateur with a metal detector discovered the largest known burial of Anglo-Saxon gold (Alexander 2011). Analogously, Internet search engines enable non-experts to detect long-buried literary treasures that experts would wish to find.
Discoveries of prior origination of terms and related concepts of all kinds are useful in different ways. To provide just one example, an unknown route for influential knowledge contamination of someone believed to be an independent originator of previously published prose, or of a prior scientific breakthrough, results in misleading data about the full historic details of that creative or discovery process. In sum, erroneous and significantly incomplete historical data corrupt key knowledge informing history, policies, and practices for the future Big Data information enablement of creative and discovery processes.
2. Method: The Internet Date Detection Method
The Internet Date Detection (IDD) method is a six-stage Boolean research method developed by the first author that can be used within Google Books to independently verify facts and to disconfirm prior knowledge claims. Each step of the search technique is explained below. Six examples, disconfirming knowledge (including five from the Oxford English Dictionary), are provided to demonstrate the IDD method in action. Example 1 is described in detail. The remaining five examples utilize exactly the same IDD method but are only briefly summarized. The terms chosen come from different disciplines (i.e., psychology, sociology, biology, history, and English literature) to demonstrate the utility of IDD across different academic subjects.
The first example to be examined is the term ‘self-fulfilling prophecy’, a popular term used within many social science disciplines including sociology and psychology. At the most basic level of understanding, self-fulfilling prophecy refers to an individual, or group, who unknowingly cause a prediction to come true simply because, essentially, they expect it to come true, or because their behavior, based upon a specific prior belief, in some way brings it about. Textbooks (e.g., Gold 2009; Hoffer 2010) and arguably less scholarly online sources such as Wikipedia simply claim that the sociologist Merton coined the basic term. To avoid opening up a new debate as to whether Merton’s use ‘coined’ a specific sociological meaning, and simply to verify whether Merton was the originator of the term ‘self-fulfilling prophecy’, the following six very specific steps were taken:
1. On the Google toolbar of a computer connected to the Internet, the term “self-fulfilling prophecy” should be entered in double speech-quotation marks (most importantly, not single quotation marks) to create a Boolean search. This ensures Google searches within the literature for this exact term only.
2. Next, press the ‘Enter’ button on the keyboard to begin the search. Ignoring everything Google turns up for the search, click the ‘Book’ tab on the screen.
3. Ignoring all the books that come up, click the ‘Search Tools’ tab on the screen and select the ‘Any Time’ option. From this, select the ‘Custom Range’ option.
4. Within the ‘Custom Range’ option, enter the date ‘1500’ in the ‘From’ box. Next, in the ‘To’ box, enter the year prior to the expert knowledge claim (in this case, 1947, the year before Merton first used the term ‘self-fulfilling prophecy’). This enables Google to search millions of publications between 1 January 1500 and 31 December 1947.
5. Examine the results of this search. When carried out by the first author, the results demonstrated that Google detected numerous pre-1948 books containing the precise phrase, many of them from the nineteenth century. This proves that Merton never coined the term.
6. To locate the earliest IDD-discoverable date of use, again select the ‘Custom Range’ option to see if the term was used between 1500 and 1800. When carried out by the first author, the Google search said: No results found for “self-fulfilling prophecy”. Instead, the search said: Results for self-fulfilling prophecy. The difference here is that the results found by Google are for publications detected outside the double speech quotation marks that were entered. This is because the exact complete term could not be detected in the literature for the date range provided. Consequently, only something other than the exact term was detected before the year 1800. However, publications containing the precise term “self-fulfilling prophecy” were required. To find this term on Google, the double speech quotation marks are essential. Searching on a later date range (between 1800 and 1860) and re-entering the term in double speech quotation marks demonstrated that the simple term was used by Hoffman in 1841, long before Merton’s own first published usage, and arguably with the same essential meaning.
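The date-narrowing logic of the steps above can be sketched in code. The following is only an illustration under stated assumptions: `exact_phrase`, `toy_search`, `earliest_year`, and `TOY_CORPUS` are hypothetical names, the corpus contains invented placeholder text rather than real quotations, and the search function is a stand-in for a manual Google Books query, not a Google API. It also assumes that every date reported by the search is correct, which the paper itself warns is not always the case.

```python
from typing import Callable, Optional

# Toy stand-in for a dated full-text corpus (placeholder text, not real quotations).
TOY_CORPUS = {
    1790: ["a prophecy that proved self-fulfilling in the end"],
    1841: ["it was a self-fulfilling prophecy of ruin"],
    1948: ["the self-fulfilling prophecy in social theory"],
}


def exact_phrase(term: str) -> str:
    """Wrap a term in double quotation marks to request an exact match (step 1)."""
    return '"' + term.strip().strip('"') + '"'


def toy_search(phrase: str, from_year: int, to_year: int) -> bool:
    """Return True if the exact phrase occurs in any toy text dated within the range."""
    needle = phrase.strip('"').lower()
    return any(
        needle in text.lower()
        for year, texts in TOY_CORPUS.items()
        if from_year <= year <= to_year
        for text in texts
    )


def earliest_year(
    search: Callable[[str, int, int], bool],
    phrase: str,
    from_year: int,
    to_year: int,
) -> Optional[int]:
    """Narrow the 'To' year by bisection until the earliest matching year is found."""
    if not search(phrase, from_year, to_year):
        return None  # no exact match anywhere in the range
    lo, hi = from_year, to_year
    while lo < hi:
        mid = (lo + hi) // 2
        if search(phrase, from_year, mid):
            hi = mid  # an exact match exists at or before mid
        else:
            lo = mid + 1  # the earliest match must be later than mid
    return lo


phrase = exact_phrase("self-fulfilling prophecy")
print(earliest_year(toy_search, phrase, 1500, 1947))
print(earliest_year(toy_search, phrase, 1500, 1800))
```

With the toy corpus, the 1790 text does not count because its words are not in the exact order of the quoted phrase, mirroring the distinction drawn in step six between exact and non-exact results. Any year found this way would still need independent verification.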
This simple use of the IDD method demonstrates that many recent textbooks (e.g., Tauber 1997; Hedström and Bearman 2009; Kaldis 2013; Ugwudike 2015) and peer-reviewed papers (e.g., Fritzberg 2001; Collier 2011; Van Lente 2012; Samaha 2012) are wrong about Merton’s originality in coining the term ‘self-fulfilling prophecy’. The date of any publication found using the IDD method should be verified independently, because it is not an infrequent occurrence for Google to attribute the wrong date to some of its scanned publications. Since the Oxford English Dictionary (OED) is considered by many to be a particularly expert, valid, and reliable source of information, wherever it has published claims about the first use or coinage of the words, terms, or names examined, it has been cited in the present paper as a ‘usage case’.
Claim: The OED claims Grosse is the earliest known source of the name Humpty Dumpty being used for a character.
The character name Humpty Dumpty appears to be derived from the classical comedy character villain Punchinello (predecessor of Mr. Punch); see Anonymous (1701, p. 28):
‘Beau Humpty-dumpty next appears/A merry Lump well grown in Years/With Back and Breast like Punchanello’
Moreover, disconfirming the apocryphal story of the siege of Colchester (e.g., Willock et al. 2014), the present authors found no evidence whatsoever in the historic literature record of a Royalist cannon in the English Civil War named Humpty Dumpty. However, IDD revealed that there was one used by the Parliamentary forces named Punchinello (see Pepys 1665, p. 1065).
Claim: The OED claims Charles Darwin (1859) coined the term ‘living fossil’.
The term ‘living fossil’ appears in the literature at least 147 years earlier in the work of the Welsh botanist Lhwyd (1712, p. 506). He used it in Philosophical Transactions, the journal of the Royal Society of London (the first ever peer-reviewed scientific journal).
Claim: The OED claims the earliest discovered publication of the term ‘Moral Panic’ is in a periodical named the Galaxy in 1877. There is also a pervasive and deeply entrenched myth among criminologists and sociologists that Stanley Cohen and Jock Young coined the term in the 1960s (see Horsley 2017).
‘Megandie a French physician of note on his visit to Sunderland where the Cholera was by the last accounts still raging praises the English government for not surrounding the town with a cordon of troops which as “a physical preventive would have been ineffectual and would have produced a moral panic far more fatal than the disease now is.”’
Claim: Charles Dickens coined the word ‘boredom’ in his 1853 novel Bleak House.
The word was used at least 24 years earlier in a novel written by Catherine Grace Frances Gore called Romances of Real Life (Gore 1829, p. 99).
Claim: Richard Dawkins coined the term ‘selfish gene’ in his 1976 book The Selfish Gene.
One of the more notable examples presented in the present paper concerns the origins of the term ‘selfish gene’. Over 100 science websites, scholarly books, and peer-reviewed journal papers (e.g., Hull 1980; Kourilsky 2012; Nixon 2012) all assert that the renowned Darwinist Richard Dawkins coined the term ‘selfish gene’ in his 1976 book of the same name. The IDD method demonstrated that William Hamilton used it in a prior published paper (Hamilton 1971). Five years later, Dawkins cited other publications by Hamilton as a major influence on his thinking, but used the term and concept without citing Hamilton’s prior publication as the origin of the term. The science myth that Dawkins coined the term ‘selfish gene’ is so pervasive that it could be argued that it has attained the status of a ‘fixed false belief’. Brewer’s Dictionary of Phrase and Fable (2012, p. 1209) provides a typical example of why the present authors think this is the case:
‘Selfish gene: In genetics, a gene that exploits the organism in which it occurs as a vehicle for its own self perpetuation. A gene of this type was posited by the evolutionary biologist William Hamilton (1936–2000) and given its memorable name by Richard Dawkins in his book The Selfish Gene (Dawkins 1976). The theory overturned the traditional concept of the gene as a vehicle of inheritance for the organism and did much to popularize the study of socio-biology.’
Elsewhere, Von Sydow (2012, p. 34) is not untypical as an expert academic in similarly getting the facts wrong about the origins of the term ‘selfish gene’:
‘In some respects, a selfish gene viewpoint indeed seems to have implicitly present in the texts of Hamilton and R.L Trivers in particular…But only Dawkins coined and popularised the metaphorical phrase in his book The Selfish Gene, while clarifying and radicalising this position.’
Jablonka and Lamb provide another example of confident expert dissemination of the claim that Dawkins coined the term:
‘Richard Dawkins took up Hamilton’s approach. Extended it and popularised it. He suggested that taking a gene’s eye view can help us to understand the evolution of all adaptive traits, not just the paradoxical ones like altruism. He coined the term selfish gene, which recognizes that the “interests” of a gene may not coincide with the interests of the individual carrying it.’
In answer to the most basic research question, namely whether the manual IDD method, or using Google Books Advanced Book Search as it currently functions, is better than referring to expert knowledge in highly esteemed publications, the research conducted here found publications containing earlier uses of words, terms, and names than had previously been published, and which had, until then, lain undetected in the historic publication record. Using the IDD method, published evidence was found that disconfirmed many oft-repeated prior knowledge claims. However, it should be noted that whilst IDD uncovered earlier than previously known examples of published usage, Google has not scanned every publication in the world, and earlier examples than the ones cited in this paper may well exist in the historic publication record.
At the time of writing, each of these newly found facts has been shown to disconfirm expert knowledge claims regarding the history of the published origins of specific names, terms, and phrases. Furthermore, as these examples demonstrate, the ability of everyday search engine technology to find previously hidden disconfirming data for established expert knowledge claims in distinguished sources is proven. These data also enable scholars to learn exactly who cited the published work, when, and where. Between 2013 and 2016, the first author’s research on Google’s search engine demonstrated that it was at that time the best searchable Big Data historic record available for detecting the earliest discoverable citations of publications and use of words, terms, and phrases in any publication scanned and uploaded to the Internet in Google Books. Using IDD, the first author managed to go back further in the historic publication record than Google Inc. itself had apparently managed, because the IDD method also located what is (at the time of writing) the earliest known published use of the word ‘google’ (by Hildebrand 1903), something not mentioned by Google anywhere, and not mentioned by Koller in an eyewitness account of their choosing of the word for their company name.
Google’s library of publications that are searchable using the IDD method demonstrates the value of Big Data in exposing and correcting historical, scientific, and other academic discipline discovery falsehoods and myths. From the perspective of our cultural, national, regional, and local heritage, Big Data analyses of any kind used to examine the possible intellectual and cultural influence of otherwise unacknowledged individuals has the potential to revolutionize knowledge about original conceptions, discoveries, and the creative process in all academic disciplines, ensuring the right individuals and their respective nations, cities, towns, and villages are veraciously honored.
As demonstrated above, the IDD method, although simple, involves working within specific and defined parameters. For example, after some experimentation by the first author, it was found that IDD works effectively only with a maximum of three words in each term or phrase. A mixture of words, terms, and phrases may be entered, but to work most effectively there should be no more than three individual Boolean commands in total, giving a maximum of nine words in a tri-Boolean search. Another caveat about IDD is also necessary. At the time of writing, the six steps outlined above do not work quite as well as they did pre-2017. Currently, some pre-20th century publications are no longer discoverable by searching on words, terms, or phrases in the way described above, even though these publications remain in the Google Books corpus and can currently be found using the Google Books Advanced Book Search page.
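The working limits described above (at most three words per quoted term and at most three quoted terms per search) can be expressed as a small pre-flight check. This is a minimal sketch, not part of the published method: `check_idd_query` and `build_idd_query` are hypothetical helper names, and counting words by splitting on whitespace (so that a hyphenated word such as ‘self-fulfilling’ counts as one word) is an assumption.

```python
from typing import List

MAX_TERMS = 3           # no more than three quoted (Boolean) terms per search
MAX_WORDS_PER_TERM = 3  # no more than three words inside each quoted term


def check_idd_query(terms: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the query fits the limits."""
    problems = []
    if len(terms) > MAX_TERMS:
        problems.append(f"{len(terms)} terms given; at most {MAX_TERMS} are recommended")
    for term in terms:
        word_count = len(term.split())  # whitespace-separated words (assumption)
        if word_count == 0:
            problems.append("empty term")
        elif word_count > MAX_WORDS_PER_TERM:
            problems.append(
                f"'{term}' has {word_count} words; at most {MAX_WORDS_PER_TERM} are recommended"
            )
    return problems


def build_idd_query(terms: List[str]) -> str:
    """Quote each term with double quotation marks and join them into one search string."""
    problems = check_idd_query(terms)
    if problems:
        raise ValueError("; ".join(problems))
    return " ".join(f'"{term.strip()}"' for term in terms)


print(build_idd_query(["selfish gene", "living fossil"]))
```

A query such as `["a b c d"]` or a list of four terms would be rejected with a `ValueError` listing the limit that was exceeded.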
For example, at the time of writing, the IDD method still detects Hamilton’s earlier published use of the term ‘selfish gene’, which IDD originally unearthed on 5 March 2013, thereby enabling the first author to debunk the myth of Richard Dawkins’ original coinage. The Google Books Advanced Book Search page also currently enables scholars to find the earliest discoverable published use of that term if an individual searches up to the publication date 1973 on the exact term. In contrast, and most importantly, at the time of writing the IDD method no longer detects Hoffman’s use of the term ‘self-fulfilling prophecy’, as it did when the first author detected it on 30 October 2013 and published it on his ‘Best Thinking’ criminology blog and later on his ‘Dysology’ blog site, as he did for all the other IDD-facilitated discoveries published in this paper. Moreover, unless a researcher has advance knowledge of its existence, and so searches on it by the author’s name, the Google Books Advanced Book Search page cannot detect Hoffman’s earlier usage either.
The present authors do not know why the IDD method was once superior in this regard to the Google Books Advanced Book Search page of today, but one likely reason for the loss of functionality of the IDD method is that Google Inc. introduced a new autonomous artificial intelligence deep learning program (called RankBrain) in October 2015. By removing full functionality of the IDD method, Google has demonstrated the potential for loss of function in Google Books on both its standard search engine and on its advanced search page. It is recommended that scholars note this loss of research potential demonstrated by the examples of original findings examined in this paper and consider solutions.
Step five of the IDD process involves checking the veracity of Google’s search engine results. This should be done whether one employs the manual IDD method or uses the Google Books Advanced Book Search page. Some dates shown by Google Books are misleading. For example, searching at the time of writing for “self-fulfilling prophecy” in Shakespeare’s Macbeth, Google Books ascribes the date 1784, but not to an old book at all. Instead, that date is ascribed to a 2002 adaptation of the play, with the term used not by Shakespeare but by the modern playwright Nielsen in the text of his own book, to which Google Books has attached the wrong date.
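The verification step can be illustrated with a short sketch that keeps the date reported by the search engine separate from the date confirmed independently (for example, from the scanned title page or a library catalogue). The `Hit` record and both helper functions are hypothetical; the titles are placeholders, and the years are taken from the examples in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    title: str
    reported_year: int            # year shown by the search engine
    verified_year: Optional[int]  # independently confirmed year; None if not yet checked


def misdated(hits: List[Hit]) -> List[Hit]:
    """Hits whose independently verified year differs from the reported year."""
    return [
        h for h in hits
        if h.verified_year is not None and h.verified_year != h.reported_year
    ]


def earliest_verified(hits: List[Hit]) -> Optional[Hit]:
    """Earliest hit by verified year, ignoring anything not yet checked."""
    checked = [h for h in hits if h.verified_year is not None]
    return min(checked, key=lambda h: h.verified_year, default=None)


hits = [
    Hit("Macbeth adaptation (placeholder title)", reported_year=1784, verified_year=2002),
    Hit("Hoffman (placeholder title)", reported_year=1841, verified_year=1841),
    Hit("Unchecked result (placeholder title)", reported_year=1700, verified_year=None),
]

print([h.title for h in misdated(hits)])
print(earliest_verified(hits).title)
```

Ranking only on verified years means that a misdated modern adaptation or an unchecked result cannot be mistaken for the earliest use.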
There are also some other potential limitations to take into account. Firstly, the process of tracing the first use of terms is of course complicated in cases where more than one language is involved. A classic example here would be ‘standing on the shoulders of giants’, often attributed to Newton but with earlier uses in Latin by Bernard of Chartres and John of Salisbury (Dorizzi and Sette 2012). This raises the complication that the idea underpinning a familiar expression may be older than its first usage, and the process may be further complicated by the availability of different translations (in this case from Latin). Secondly, popular terms may not be appearing for the first time when they appear in print, because they may be preceded by usage in conference papers, working papers circulated informally, letters, etc. Therefore, tracing the origins of phrases may be difficult, especially if the unpublished work of scholars has not been digitized. Thirdly, in the history of science there are cases where two (or more) people claim to have come up with an idea independently of each other (Sutton 2014).
Despite a slight loss in functionality, IDD remains an innovative and practical tool that can be used in many types of academic research, including the humanities and social sciences, to confirm or disconfirm prior origination, conception, discovery, and coinage knowledge claims made in specific disciplines. Google has demonstrated the value of open source Big Data sets of scanned publications that are searchable by date and by Boolean query. The loss of functionality also raises questions about the problematic uncertainties of relying upon commercial organizations (such as Google) to supply and maintain data, and to continue to allow the data they hold to be adequately and efficiently interrogated by academic researchers. Consequently, it is proposed here that the academic community seek funding to improve upon Google’s Big Data innovation by ensuring the provision of a new, independent virtual library project of equivalent or greater size and quality, which will remain accurate, maximally functional, and consistently universally available.
Please note: Some of the examples presented here first appeared on the first author’s blog.