Big Data Cogn. Comput. 2018, 2(2), 14; doi:10.3390/bdcc2020014
The Development of Data Science: Implications for Education, Employment, Research, and the Data Revolution for Sustainable Development
Centre of Mathematics and Data Science, School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK
H-STAR Institute, Stanford University, Ventura Hall, 220 Panama Street, Stanford, CA 94305-4101, USA
Author to whom correspondence should be addressed.
Received: 28 May 2018 / Accepted: 16 June 2018 / Published: 19 June 2018
In Data Science, we are concerned with the integration of relevant sciences in observed and empirical contexts. This results in the unification of analytical methodologies, and of observed and empirical data contexts. Given the dynamic nature of convergence, the origins and many evolutions of the Data Science theme are described. The following are covered in this article: the rapidly growing post-graduate university course provisioning for Data Science; a preliminary study of employability requirements; and how past eminent work in the social sciences and other areas, certainly mathematics, can be of immediate and direct relevance and benefit for innovative methodology, and for facing and addressing the ethical aspect of Big Data analytics, relating to data aggregation and scale effects. The direct and indirect outcomes and consequences of Data Science include decision support and policy making, and both qualitative and quantitative outcomes. For such reasons, the importance is noted of how Data Science builds collaboratively on other domains, potentially with innovative methodologies and practice. Further sections point towards some of the major current research issues.
Keywords: big data training and learning; company and business requirements; ethics; impact; decision support; data engineering; open data; smart homes; smart cities; IoT
1. Data Science as the Convergence and Bridging of Disciplines
The context of our problem-solving and analytics will always be fundamental, specific and particularly oriented. Section 4 draws some relevant implications of this. This article is oriented towards the commonality and mutual influence of methodologies, and of analytical processes and procedures. A nice example of the parallel nature of such things is how Big Data analytics is often considered a synonym of Data Science. In Section 2.2, it is mentioned how public transport may well use smartphone and mobile phone wireless connection data to observe locations of individuals. This close association or, perhaps even, identity of Big Data analytics and Data Science will have growing importance with the Internet of Things, and smart cities and smart homes, and so on, as noted in Section 8. Here is an outstanding company perspective on this: Ref  “Five years ago, the McKinsey Global Institute (MGI) released Big data: The next frontier for innovation, competition, and productivity. In the years since, data science has continued to make rapid advances ⋯”.
In Section 8 and Section 9, very important developments are at issue, encompassing newly oriented and pursued methodologies, and the integration of research domains. In Section 7, it is noted how important all of the content here is to sustainable development. The phrase “data revolution” is based here on ongoing work by the United Nations, by many of us in this domain, and by national authorities in Africa and the Middle East who discussed these issues at the most recent (2017) World Statistics Congress.
In , at issue are: parallels between astronomy and Earth science data, methodology transfer, and metadata and ontologies characterized as being crucial. The convergence or bridging of disciplines must address “non-homogeneous observables, and varied spatial, temporal coverage at different resolutions”. This quotation is very familiar to us in regard to how NoSQL databases are now very widely used, alongside traditional relational databases. Another example is how text mining, social media and many other domains have become so important in many contexts. Then, given computational support, “it is the complexity more than the data volume that proves to be a bigger challenge”. Further benefits of this Data Science convergence are termed here tractability and reproducibility. There is discussion in Section 2 of  of the complexity relating to resolution and distributions. In , this is also characterized in terms of data encoding. Plenty of work now emphasizes the importance of p-adic data encoding (binary or ternary when p = 2 or 3), compared with real-valued encoding (m-adic, especially when m = 10).
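To make the contrast between encodings concrete, the following is a minimal sketch, not taken from the cited work, of how the same integer is written in base p, and of the standard p-adic distance between two integers. The function names digits and padic_distance are illustrative choices made here.

```python
from fractions import Fraction


def digits(n, p):
    """Base-p digits of a non-negative integer, least significant digit first."""
    if n == 0:
        return [0]
    out = []
    while n:
        n, r = divmod(n, p)
        out.append(r)
    return out


def padic_distance(x, y, p):
    """Standard p-adic distance |x - y|_p = p**(-v), where p**v is the
    largest power of p that divides x - y (and the distance is 0 if x == y)."""
    diff = abs(x - y)
    if diff == 0:
        return Fraction(0)
    v = 0
    while diff % p == 0:
        diff //= p
        v += 1
    return Fraction(1, p ** v)


print(digits(6, 2))               # binary encoding of 6
print(digits(6, 3))               # ternary encoding of 6
print(padic_distance(6, 14, 2))   # 6 and 14 share their three lowest binary digits
print(padic_distance(6, 7, 2))    # 6 and 7 differ already in the lowest binary digit
```

In this sketch, two integers are 2-adically close when their binary encodings agree over many low-order digits, which is a different notion of closeness from the usual absolute difference on the real line.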
The convergence and bridging of disciplines are emphasized in , as follows: “Methodology transfer can almost never be unidirectional. Diverse fields grow by learning tricks employed by other disciplines. The important thing is to abstract data—described by meaningful metadata—and the metadata in turn connected by a good ontology.” Further description is at issue in regard to Data Science: “We have described here a few techniques from astroinformatics that are finding use in geoinformatics. There would be many from earth science that space science would do well to emulate. Even other disciplines like bioinformatics provide ample opportunities for methodology transfer and collaboration. With growing data volumes, and more importantly the increasing complexity, data science is our only refuge. Collaboration in data science will be beneficial to all sciences.”
2. Historical Development of Data Science and Some Contemporary Examples of Cross-Disciplinarity
A short historical perspective that follows is with reference to such disciplines as computer and information sciences, mathematics and statistics, physics, and, implicitly, social sciences. In concluding this description, a key point will be how Data Science encompasses and embraces all of the following: cross-disciplinarity, interdisciplinarity and multidisciplinarity.
2.1. Historical Prominence of Data Science in Recent Times
So many of the origins are due to Chikio Hayashi and others. Consider Hayashi , “I will present “Data Science” as a new concept.” followed by a very relevant introduction to the science of data, with this: “Data Science consists of three phases: design for data, collection of data and analysis on data”. In Ohsumi , the abstract has this: “In 1992, the author argued the urgency of the need to grasp the concept “data science”. Despite the emergence of concepts such as data mining, this issue has not been addressed.”
In the Preface of , it is noted how Data Science arises from the convergence of computer science and statistics, that “gives birth to a new science at its core”. That Preface concludes with this: “To take data as a starting point provides a complementary vision of theory and practice, and avoids creating an unfortunate gap between two steps, both of which are essential in any scientific process.”
In this very comprehensive overview of data science , it is stated (Section 3.6) how the “first conference to adopt “data science” as a topic” was the International Federation of Classification Societies (IFCS) 1996 conference, in Kobe, Japan. This was fully consistent with our work as participants, then and now (IFCS 2017, in Tokyo, Japan, also had Data Science in its title). In , this point is also made about IFCS 1996 as the first conference with Data Science in its title. It is also stated there  that the journal Behaviormetrika is “the oldest journal addressing the topic of Data Science”, when it started in 1974. Data Science is specified as “an interdisciplinary field that includes the use of statistical methods to extract meaningful knowledge from data in various forms: either structured or unstructured”.
In , there are additional historical perspectives, with the section heading, “The Data Science journey”, and this relates largely to work in the 1960s and 1970s. This includes “information discovery” as a continuing key objective in Data Science. This is a key Data Science orientation also in . The latter emphasizes the “semantic dimension of Data Science”, through the information discovery lifecycle, and the “discovery lifecycle in text mining”. While also emphasizing cooperation, and cross-disciplinarity, there is this:
We see the data scientist’s responsibility
- in the design of an overarching semantic layer addressing data and analysis tools,
- in identifying suitable data sources and data patterns that correspond to the appearance of structured and unstructured data, and
- in the management of the information discovery lifecycle and discovery teams.
An ever more important issue is the second of these (cf. Section 3 below), arising from the data sources that are employed. As a summary expression, Data Science is, firstly, the integration of data sources and analytical and related data processing methodologies, and, secondly and quite fundamentally, arising from the convergence of disciplines. Convergence of disciplines can be very beneficial in practice: beneficial in regard to addressing and solving problems, and also in regard to the cooperation yielded by cross-disciplinarity. See Section 5, below, for some current discussion on how the problems and challenges to be addressed can and should arise, quite naturally, out of all aspects of Data Science.
The current era of Data Science can be considered as following previous epochs that gave rise to major digital technology advances, with implications in all social domains. Broadly, the first epoch (in the 1980s) brought about laptop and desktop computers, and the second epoch (in the 1990s) gave rise to the Internet and the World Wide Web.
2.2. Practical Association of Disciplines and Sub-Disciplines
In Section 2.3, “What is Data Science”, in , mention is made of Data Science being centred on the following disciplines: statistics, informatics, sociology and management science. Clearly, as in Section 5 , there is emphasis on “synergy of several research disciplines” and how “interdisciplinary initiatives are necessary to bridge the gaps between the respective disciplines”. This is exciting and not least because of how there is convergence of disciplines or subdisciplines. We may consider, for example, how Digital Humanities can incorporate relevant areas of a few disciplines, how computational psychoanalysis can come to the fore (see chapter 8, “Geometry and Topology of Matte Blanco’s Bi-Logic in Psychoanalytics” in ). With a major focus on psychometrics, Coombs  has chapters that proceed from “Basic Concepts” to “On Methods of Collecting Data”, and “Preferential Choice Data”.
Now, data is central to all of our sciences, and to all aspects of our engineering and technology. Just what data is, is a key theme in . This includes data coding, or perhaps this should be termed data encoding. After all, data is measurement. It follows from this how important mathematical underpinning is in Data Science. Implications that follow include the relevance and importance for new, innovative directions to be followed, and for effective problem-solving. The mathematical view of what measurement means is all important. Even in the discipline of physics, in , there is the citation of eminent physicist, Paul Dirac, as to how mathematics underpins all of physics, and how the work of eminent psychoanalyst, Ignacio Matte Blanco, has mathematics being integral to psychoanalysis.
From a major study of Big Data and surveying by the American Association for Public Opinion Research, there is the following : “The classic statistical paradigm was one in which researchers formulated a hypothesis, identified a population frame, designed a survey and a sampling technique and then analyzed the results… The new paradigm means it is now possible to digitally capture, semantically reconcile, aggregate, and correlate data.”
A note is made in  about wireless connection data forming a basis for public transport management. Such Big Data sources can be associated with, or even integrated with, personal and social behavioural patterns and activities. There is this small heading in , “Better living through data?”, followed by a very critical statement: “The other thing I need to declare is that I’m no fan of our contemporary belief that life can only get better the more data we have at our disposal.” A response to this would be: Data Science, as the science of data, is everything relating to the path and trajectory connecting data, information, knowledge and wisdom.
In , it is stated that “The UK’s next census will be its last”, with administrative, governmental authorities’ data replacing the national census. This is acknowledged: “Collecting the data itself is only half the work. A great deal of effort must go into combining it with other sources, in order to answer real questions.” That can be understood as undertaking scientific investigation of such data, and other potentially relevant data. The cross-disciplinarity inherent in that also can, and perhaps must, lead to new interdisciplinary linkages. Arising out of the ending of the national census, as such, is : “The way government counts its people is changing, and it could transform policy”.
One issue here has been how mathematics underpins so much, across disciplines, and also in the commercial and in most social domains. A good comment to make is this: many universities in the recent past shut down their mathematics departments and no longer provided teaching in mathematics; and now, this is being reversed, with again university courses being provided in mathematics.
3. Open Data, Reproducibility and the Data Curation Challenge
Generally recognized as important for innovation, both in application outcomes and in analytics and methodologies, Open Data plays a key role for us as data scientists. (Information and news about Open Data are well provided by the Open Data Institute, https://theodi.org.)
One major aspect of how Big Data analytics are quite central to Data Science is the increasing availability of open data. In , this is associated with methodology too, through “the open model rather than a closed one”. The following was central to a presentation (in May 2017 in London, UK) by Dr. Robert Hanisch, Director, Office of Data and Informatics, NIST—National Institute of Standards and Technology, USA. Dr. Hanisch worked for 30 years on the Hubble Space Telescope (HST) project. (The author, Murtagh, of this paper was awarded a medal in 2016: “Outstanding Contributions to Astrostatistics Award, International Astrostatistics Association. Commendation: For his long time contributions to astroinformatics and related areas in the computational sciences; advancing scientific knowledge in classification theory and image analysis; for his contributions to the success of the Hubble Space Telescope; ⋯and for his long time efforts in dealing with the statistical analysis of ‘Big Data’.”) Due to open access to observed data, from our cosmos, Dr. Hanisch noted that three times the number of people directly engaged in HST work were working on HST data. It results that there were three times the benefits drawn from HST data.
It was noted by Dr. Hanisch how important a role is played by national metrology institutes. Arising from this was, and is, the importance of reproducibility and interoperability of all of analytics comprising Data Science. Underpinning these very important themes in Data Science work is data curation. Data curation is still a major challenge to be addressed. Noted in Dr. Hanisch’s presentation is the contemporary “crisis” of reproducibility. At issue is to support data management from acquisition to publication, or business or medical or governmental or other deployment. The computing expert will recognize this crucial theme of data curation as associated with metadata and evolving ontologies.
For the latter, i.e., the very important and central role of evolving ontology, research publishing, and research funding, are discussed in this broad and general context in . While there remain challenges to be pursued and addressed, it is a good point to make that astronomy and astrophysics offer interesting paradigms for open data, and, in many ways, for data curation. (Here, a humorous remark can be added: in astronomy and cosmology, for the Cosmic Microwave Background, our data sources go back in time to the origin of the Universe, 13.7 billion years ago.) There will certainly be further research carried out on data curation, and evolving and interacting ontologies, and it has been shown here how such are core issues for metrology, hence the very basis of all manufacturing technology, and, as described in the latter citation, for research publications and research funding.
Noted in this section is the discussion in  of “the open model and open data”. It results from this that multidisciplinarity, that we are also expressing as the convergence of disciplines, is to be aided and facilitated by openness of analytics, data management, and all of Data Science methodologies. By openness of methodology, it is intended here to allow domain experts to both link up with, and perhaps even if feasible, to integrate with all that is at issue in other relevant domains. Thus, this is a plea for openness of Data Science as a discipline due to it being a convergence of disciplines.
4. Integration of Data and Analytics: Context of Applications
At issue in this section is an important aspect or byproduct of the integration of data and analytics in Data Science. This theme is to acknowledge, and to seek to address challenges and other issues, in regard to data and the underpinning or contextual reality of the data. Informally expressed, our data represents reality or the context from which the measurements arose, i.e., the data numeric values or qualitative representations.
An outcome of this is to be the quality and standards of our work as data scientists.
Much, and perhaps all, that is at issue in  is very important for all that is under discussion here. This reference, , describes the problems of data quality, in the Big Data context, relating to administrative data. Hence, data curation is very relevant for reproducibility of analytics. There are implications for analytics: “the fact that data are often not of the highest quality has led to the development of relevant statistical methods and tools, such as detection methods based on integrity checks and on statistical properties[⋯]. However, this emphasis has often not been matched within the realm of machine learning, which places more emphasis on the final modelling stage of data analysis. This can be unfortunate: feed data into an algorithm and a number will emerge, whether or not it makes sense. However, even within the statistical community, most teaching implicitly assumes perfect data. Challenge 1. Statistics teaching should cover data quality issues.”
Our analytics should not be a “black box”, a term that was informally used in regard to neural networks in earlier times. Rather, transparency should always be a key property of analytical methodologies.
The view offered by Anderson , and discussed in , quoting Peter Norvig, Google’s research director, is that “Petabytes allow us to say: ‘Correlation is enough’. We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot”. However, this interesting view, inspired by contemporary search engine technology, is provocative. The author there maintains in a provocative way that: “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all”.
It cannot be accepted that correlation supersedes causation, i.e., that analytics can be automated fully, and thereby obfuscate, or make redundant, data science, health and well-being analytics. As described above, in , the case is made for comprehensive information governance, encompassing fully the contextualization of all the analytics that are being carried out. In , we discuss quite a good deal of the contextualization of analytics of health and well-being data. In the discussion accompanying this seminal work in statistical perspectives on health and well-being (), our contribution to the discussion has the following response from the paper’s authors: “We agree with Murtagh that ‘big data’ may offer insights, provided that there are appropriate analytics.”
It is so very relevant to note here that data science has inherent and integral involvement in the sourcing and in the origins of data, i.e., selection and measurement. This point is emphasized for applications in . This implies full integrating of the analytics with what data is selected and sourced, and that may well imply what and how measurement is carried out.
An aspect resulting from this section may be the priority to be accorded to induction-based, i.e., inductive, reasoning (cf., ). This could be a minor argument for the importance of approaches that follow from data mining, unsupervised classification, latent semantic analysis, and various other themes. Very clearly, however, all of one’s studying and teaching, one’s work for companies, for Government agencies, health and other authorities, all one’s work should and really must be properly focused on the aims and objectives. The latter, of course, may need, partially in any case, to be determined by the expert data scientist.
5. Short Review of Contemporary Data Science in Education and in Employment
It is quite clear to all of us involved with many companies and involved in education, that Data Science is becoming one of the most important employment prospects, and this comparison of employment salaries in the US has Data Science as having the highest median salary in 2017: . It follows from this section that there will certainly be study made of Data Science and Big Data university courses, and these will of course be related to the prospects and the potential for the students.
In this section, the two themes are relating to the contemporary context. While comprising a short review here, first there is the theme of higher education in Data Science, and second there is the theme of company employment advertisements. Used here is accessible and available data. Of course, an expert Data Scientist is very likely to be involved in many discussions and debates with current and potential students, and with company executives and with many others. It can even be seen that most disciplines have to be integrated into data science. In this regard, for education, see .
5.1. Teaching and Learning for Data Science
Briefly considered here are current higher education post-graduate programmes (usually termed MSc courses) in Data Science. This is also possibly relevant for undergraduate programmes, and certainly relevant for undergraduate projects and company placements of students.
In universities in all countries worldwide, in recent years, there has been a very great increase in graduate level courses in Data Science, and increasingly also in undergraduate level courses. In Press , there is a listing of graduate courses, in some cases but not in all, with the title Data Science. This listing, with links to the host institute, contains: 102 MSc courses listed, 19 online courses, followed then by 11 free online courses, and eight for a fee. Therefore, this has 140, for the most part, graduate level courses.
What follows now is, again, the theme of having data and analytics well integrated.
Considering the most essential requirements of a data scientist, in Englmeier and Murtagh , we note the very close linkage between data science and Big Data. We emphasize the great need to avoid false positives coming from the data science analytics that is carried out. This arises from treating the data without fully linking and even integrating the analytics with the context, the relevance, and all that is to do with application and problem conceptualization. Noted in that article are the well-known errors arising out of the Google Flu Trends, arising from Google searching, and service usage patterns obtained from taxi company, Uber. These were outcomes that produced false positives. There must be fully comprehensive information governance, encompassing all levels of information discovery, through conceptualization that can benefit very much if collectively undertaken.
5.2. Employment Requirements in Data Science
So many employment possibilities are now on offer for a Data Science role. In the very comprehensive review of Data Science, Ref.  has Section 6 entitled “Data economy: data industrialization and services”. An increasingly popular web service entitled “DataScientistJobs” is available at https://datascientistjobs.co.uk.
In this section, there is a small data analysis carried out of stated requirements for Data Science posts. While one must give the fullest perspective to the companies that one works with, and the university Data Science courses that one teaches, what follows is both a consideration and a selection of data, and preliminary results. This preliminary study of requirements for Data Science roles, some of them in senior management roles, represents what we will further pursue over time, both for the benefit of our Data Science students, and to be fully prepared for our work, and association, with companies, nationally and globally.
Online discussions of Data Science and of Big Data have become commonplace, and there are often surveys carried out. Examples include  on Big Data, where senior corporate executives were surveyed, and the dominant sector was financial services. This survey concerned internal investment and organisational matters, and business practices and plans. In , more than 620 data professionals were surveyed, in regard to skills required and at issue in Data Science. There is an interesting summary and presentation of results obtained in a factor space.
We considered descriptions of new posts as a Data Scientist, from 2015 to 2017, all from the distribution list, StatsJobs (sometimes indirectly, through links), in England. In most cases, languages or software environments that are at issue were indicated. In a few cases, the job advertisements do not explicitly list these details. Retained for use here were 73 such job descriptions. The most frequent (at least four advertisements) software languages and software environments were as follows, required by the 72 employers here: R (50), Python (44), SQL (30), SAS (28), Hadoop (25), Matlab (17), SPSS (17), Java (16), Hive (14), Excel (9), MapReduce (9), NoSQL (8), Spark (7), C++ (6), Pig (6), Tableau (6), HBase (5), C# (4), Mahout (4), QlikView (4), and Scala (4).
To have sufficient comparability of software languages or environments, the 21 listed above were selected, each required by at least four employers in the set of advertisements here. Since some employers had none in this set of software languages or environments, and indeed about three of the set of 72 had no detailing at all about what was required, the set of employers was reduced to 60. Thus, we have a cross-tabulation of 60 employers, each wanting to employ a Data Scientist, with a few requirements or desires for expertise in the set of software languages and software environments that are listed above. The manner of expression was most often: one or another or another again.
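The construction of such a cross-tabulation can be sketched as follows. This is a minimal illustration only: the advertisement data, the employer labels and the function name cross_tabulate are hypothetical, and the frequency threshold is lowered from four to two so that the toy data remain small.

```python
from collections import Counter

# Hypothetical advertisements: employer label -> software named in the advert.
ads = {
    "employer_a": ["R", "Python", "SQL"],
    "employer_b": ["SAS", "SQL"],
    "employer_c": ["Python", "Hadoop", "Hive"],
    "employer_d": [],                      # advert with no software detailed
    "employer_e": ["R", "Python"],
}

MIN_ADS = 2  # the study uses four; lowered here for the toy data (assumption)


def cross_tabulate(ads, min_ads):
    """Return (software, rows): the retained software names, and a 0/1
    presence row for each employer that names at least one of them."""
    counts = Counter(s for named in ads.values() for s in set(named))
    software = sorted(s for s, c in counts.items() if c >= min_ads)
    rows = {}
    for employer, named in ads.items():
        row = [1 if s in named else 0 for s in software]
        if any(row):                       # drop employers with an empty row
            rows[employer] = row
    return software, rows


software, table = cross_tabulate(ads, MIN_ADS)
print(software)
for employer, row in table.items():
    print(employer, row)
```

Employers whose adverts name none of the retained software are dropped, mirroring the reduction of the employer set described above.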
Correspondence analysis takes the employer set, and the software set, in the dual multidimensional spaces, both endowed with the chi squared metric, and maps both clouds into a factor space endowed with the Euclidean metric. Hierarchical clustering was carried out from the full dimensionality (therefore with no loss of, or decrease in, information content) factor space. We are seeking just to see what associations of software are most likely to be the case, from these Data Scientist job advertisements.
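As a minimal sketch of this pipeline, the example below relies on a standard property of correspondence analysis: Euclidean distances between points in the full-dimensional factor space equal the chi-squared distances between the corresponding profiles. The chi-squared distances between software (column) profiles are therefore computed directly, and an agglomerative clustering is run on them. The toy table, the function names, and the use of average linkage are assumptions made here for illustration; the agglomerative criterion used in the study is not stated in this section.

```python
from itertools import combinations
from math import sqrt


def chi2_distances(table):
    """Chi-squared distances between column profiles of a contingency table
    (rows: employers, columns: software). Assumes no all-zero row or column."""
    n_rows, n_cols = len(table), len(table[0])
    total = sum(map(sum, table))
    row_mass = [sum(row) / total for row in table]
    col_sum = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Column profile j: the column divided by its total.
    profiles = [[table[i][j] / col_sum[j] for i in range(n_rows)]
                for j in range(n_cols)]
    dist = {}
    for a, b in combinations(range(n_cols), 2):
        d2 = sum((profiles[a][i] - profiles[b][i]) ** 2 / row_mass[i]
                 for i in range(n_rows))
        dist[(a, b)] = sqrt(d2)
    return dist


def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage (an assumed criterion).
    Returns the sequence of merges as (set of labels, merge height)."""
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def link(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [tuple(sorted((a, b))) for a in c1 for b in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: link(*pair))
        merges.append(({labels[k] for k in c1 + c2}, link(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical 4 employers x 4 software presence table.
labels = ["R", "Python", "Hadoop", "Hive"]
table = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
merges = average_linkage(chi2_distances(table), labels)
for members, height in merges:
    print(sorted(members), round(height, 4))
```

In this toy table, Hadoop and Hive have identical profiles and so merge first at height zero, followed by R and Python. With real data, the merge sequence would be read as a dendrogram of software associations.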
Figure 1 and Figure 2 display the clustering of the software languages or environments. It is our intention to take such a mapping much further, with supplementary elements, also termed contextual elements, to locate them in the factor space, and these would include: country or location of the job, industrial sector or government agency, or global corporate firm, for the job. The objective is to determine sectoral or regional preferences in the skills and abilities of the Data Scientists employed.
6. Data Science Methodology to Address: Selection Bias, Scale and Aggregation Effects, and Qualitative Evaluation of Decision-Making Impact
Having pointed to certain challenges, in particular in Big Data analytics, involving selection bias and the replacement of individual attributes with aggregated attributes (hence, the collective attributes of groups to which the individual can belong), the main aim here is as follows. It is to point to innovative new methodological perspectives that can both address such issues and challenges, and benefit from the context, for example of using Big Data sources. In , a case study involving work for a major company is illustrated with an example of how aggregated data can be used, if required, for individual-related analysis.
Ethical as well as methodological issues arise in scale effects, representation and expression, and particular context effect. Here, we both summarize the ethical implications, and the potential for qualitatively and quantitatively evaluating impact of decision-making and of policy-making.
The frequent lack of coordination, alignment and integration of methodology, including modelling, with data sourcing is pointed to in Hand : “ignorance of selection mechanisms has led to mistakes”, “This applies in human interactions—where it has been suggested that the notion that ‘data=all’ can replace the need for careful theorising and statistical modelling—but also in the hard sciences and medicine.”
In Keiding and Louis , it is well pointed out how one case study discussed “shows the value of using ‘big data’ to conduct research on surveys (as distinct from survey research)”. Limitations though are clear: “Although randomization in some form is very beneficial, it is by no means a panacea. Trial participants are commonly very different from the external … pool, in part because of self-selection, ⋯”.
Important points towards addressing these contemporary issues include the following. “When informing policy, inference to identified reference populations is key”. This is part of the bridge which is needed, between data analytics technology and deployment of outcomes.
“In all situations, modelling is needed to accommodate non-response, dropouts and other forms of missing data.” While “Representativity should be avoided”, here is an essential and fundamental way to address what needs to be addressed: “Assessment of external validity, i.e. generalization to the population from which the study subjects originated or to other populations, will in principle proceed via formulation of abstract laws of nature similar to physical laws”. In our discussion of the important issues here, in , it is noted how the work of the eminent social scientist Pierre Bourdieu, with homology between fields of study, offers clear perspectives on how beneficial innovative practice can be pursued.
This incorporates our need to “rehabilitate the individual” in our analytics, and not simply replace the individual by the mean of some group. Many case studies of the latter are provided in this book by an eminent mathematical data scientist: O’Neill . From : “Rehabilitation of individuals. The context model is always formulated at the individual level, being opposed therefore to modelling at an aggregate level for which the individuals are only an ‘error term’ of the model.”
Calibrating surveys and other data sources, through use of Big Data, has been at issue in addressing challenges and obstacles described in Keiding and Louis . In regard to decision-making and policy-making, the analysis of discourse in a data-driven way can provide relevant or necessary contextualization. Without having such an approach, there is the following limited capability on the part of those in authority : “top-down communication campaigns both predominate and are advised by those involved in social marketing⋯ However, this rarely manifests itself through measurable behaviour change⋯”
Instead, mediated by the latent semantic mapping of the discourse, we develop  semantic distance measures between deliberative actions and the aggregate social effect. We let the data speak in regard to influence, impact and reach. Impact is defined in terms of semantic distance between the initiating action, and the net aggregate outcome. This can be statistically tested. It can be visualized. It can be further visualized and evaluated.
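A rough sketch of impact as a distance between an initiating action and the aggregate outcome is given below. It is a simplification under stated assumptions: the latent semantic mapping step is skipped, and the chi-squared distance between term-frequency profiles is used as a stand-in semantic distance. The function names and the example texts are hypothetical.

```python
import math
from collections import Counter


def profile(counts, vocabulary):
    """Relative frequencies of the vocabulary terms (a row profile)."""
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        raise ValueError("empty text: no profile can be formed")
    return [counts[term] / total for term in vocabulary]


def chi2_distance(p, q, mass):
    """Chi-squared distance between two profiles, weighted by term mass."""
    return math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, mass)))


def impact(initiating_text, outcome_texts):
    """Distance between the initiating action and the aggregate outcome.

    Assumption: whitespace tokenisation and raw term counts stand in for
    the latent semantic mapping described in the text.
    """
    initiating = Counter(initiating_text.lower().split())
    outcome = Counter()
    for text in outcome_texts:
        outcome.update(text.lower().split())
    overall = initiating + outcome
    vocabulary = sorted(overall)
    mass = profile(overall, vocabulary)  # every term has positive mass
    return chi2_distance(profile(initiating, vocabulary),
                         profile(outcome, vocabulary), mass)


# Hypothetical discourse: one campaign message, two possible responses.
near = impact("recycle now", ["recycle now"])
far = impact("recycle now", ["football tonight"])
```

A smaller value means that the aggregate discourse stays semantically closer to the initiating action. Statistical testing, for example by permutation of the texts, would be a separate step that this sketch does not include.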
For research and for all engagement in Data Science, it is so very motivational to both address, and have significant achievements, in regard to innovative methodology.
7. Benefits of Very High Profiling of Data Science
There are currently many blog postings with the theme of “Big Data is dead”. (A Google query of the phrase, dated 2017-12-29, lists 153,000 results.) At issue is just this: complete priority is to be given to the problems to be solved and the challenges to be addressed. In the extensive and outstanding detailing of many aspects of Data Science in , there is the acknowledgement that much is still “tremendous hype and buzz”, and “engendering enormous hype and even bewilderment”. A further perspective applies if the sole aim were for Data Science to automate data analytics in all domains of application: counterposed to advanced analytics, “dummy analytics is becoming the default setting of management and operational systems” .
Fully in line with the context of those perspectives, a major theme of this article is that the convergence of disciplines, in the Data Science framework, builds on cooperative and collaborative expertise, and thus does not seek to replace or supplant such expertise. Thus, a major conclusion is not to replace current disciplines (mathematics, statistics, computing, engineering, physics and chemistry, arts and humanities, social and psychological sciences, and so on), but, where relevant and where appropriate, and also where motivated and where justified, to re-orientate and to bridge primary as well as foundational levels of disciplines.
In a somewhat humorous fashion, in the sense of revolution versus evolution, let the following be noted. At the 61st World Statistics Congress, in July 2017, in Marrakech, Morocco, there was a session organized jointly by the High Commission for Planning (HCP), Morocco, and the Ministry of Development Planning and Statistics (MDPS) of Qatar. This session was entitled “The Data Revolution for the Sustainable Development Goals”. One comment raised in the Question and Answer session was a request for evolution to be at issue rather than revolution. It is also interesting to note that there is an important Advisory Group in the United Nations, called the Data Revolution Group. See: http://www.undatarevolution.org. This has the theme of: “Mobilising the data revolution for sustainable development”. For Data Science, it is clear that there is great inspiration here. Some other organisational initiatives will now be mentioned. This is to complement the great deal that is already being done by major organisations in statistics, in classification and data research, in engineering, and explicitly in Data Science.
In European research funding, i.e., Horizon 2020, an important supported project is entitled the European Data Science Academy (http://edsa-project.eu). EDSA dates from 2015. There could well be an important role for such an organisation in the future, in regard to sponsoring fellowship levels of organisational memberships, and it would be interesting to promote chartered membership. In the European Commission context, dating from July 2014, there is this: “Best practice guidelines for public authorities and open data” under the theme of “Commission urges governments to embrace potential of Big Data” (http://europa.eu/rapid/press-release_IP-14-769_en.htm).
At the UK national level, an important initiative, directly or indirectly related to much that was under discussion in this article (in Section 3, in particular) is open data. The Open Data Institute (cf., https://theodi.org) in the UK was founded in 2012 by Sir Tim Berners-Lee and Sir Nigel Shadbolt. In welcoming membership applications, there is this: “Membership: Join the data revolution”. There is this prominent statement too: “Data is changing our world”.
In a practical sense, focused on data to begin with and undoubtedly relevant for data curation now and in the future (cf., Section 3), there is the Research Data Alliance, RDA (https://www.rd-alliance.org). RDA is supported by the EU, by the NSF (National Science Foundation) and NIST (National Institute of Standards and Technology) in the US, by the JISC (Joint Information Systems Committee) and other agencies in the UK, and by Australia and Japan.
8. Important New Research Challenges from Data
This section and the next are both concerned with major new developments for problem solving, and for Data Science and Big Data analytics, involving, as noted in the Abstract of this article, the partial or complete integration of relevant sciences, technologies and methodologies in observed and empirical contexts.
Data Science, integrating potentially all application domains, with mathematical foundations for methodology as befits observational science, and integrated observational and experimental science, fully relates data to all that is accomplished and achieved from the data sources. This results in the great importance of contemporary increasing orientation towards, and requirement for, open data. The following is a good understanding of this development in Data Science, and of the potential here for application transfers, in parallel with methodology transfers .
The Open Universe initiative (http://www.openuniverse.asi.it) was established by the United Nations . This work involves: “Today acknowledging that open data access drives innovation and productivity is a well-established principle in every scientific discipline. However, there is still a considerable degree of unevenness in the services currently offered by providers of data⋯”. Among six objectives, there is: “Advancing calibration quality and statistical integrity”, with outcomes for education, globally, and private sector involvement. Here, and through transference to all domains of Data Science, the open data in this motivational and inspirational work, and all associated open information, must be Findable, Accessible, Interoperable, and Reusable, termed the FAIR principles, and also Reproducible .
Supporting the FAIR principles is ESASKY (European Space Agency, Sky), accessible from http://sci.esa.int/home, and described thus: “ESASky, a discovery portal that provides full access to the entire sky. This open-science application allows computer, tablet and mobile users to visualise cosmic objects near and far across the electromagnetic spectrum.” The interesting new research challenges in Data Science can be stated to relate foremost to the transfer, to many domains, of FAIR-based open science discovery portals.
Quite an important application domain here will be emerging smart technologies, which encompass smart homes, smart cities, smart environments in general, and Internet of Things technologies. An important Situation Theory methodology is in , in an information space that is mathematically based, furnishing a comprehensive representational system. Associated with this are the social, legal and economic aspects of emerging smart technologies in real-life applications.
9. Information Space Theory for Big Data Analytics in Internet of Things and Smart Environments
Context is so very important in Big Data analytics and in many domains .
The purpose of Situation Theory is to provide humans (generally, trained domain experts) with powerful, flexible representations that enable them to perform better, both as analysts and decision makers. Systems such as the one outlined in Figure 3 for the US Army, cf. , have a software back-end, possibly including AI, but they are in no way “calculators” or expert systems for making decisions. What was done was to harness the power of mathematics primarily as a representational system, rather than for its computational capacity. The back-end software can manipulate the network (each completion diagram is a structurally identical piece of code), perhaps permitting the eventual application of familiar-looking network-optimization algorithms. However, many of those completion diagrams represent inherently human thoughts, intentions, and actions, and, for the foreseeable future, the human mind remains the best tool to handle them. This work for the United States Army was to use Situation Theory to develop a first-iteration specification for a workstation to be used by a field commander, in both mission planning and real-time control. This work takes account of the many different ontologies in a modern battlefield. The role of ontologies is very central in qualitative analysis of research, cf., .
Context, Situation Theory, Completion Diagrams
In the early 1980s, a group of researchers at or connected to Stanford University started to develop an analogous mathematically-based representation of communicating humans, looking deeper than the mere fact of communication (captured by the network model used by the telecommunication engineers) to take account of what was being communicated. (Part of the challenge was to decide how far it is possible to go into categorizing that “what” in order to achieve a representation that is useful in analyzing communication and designing communication-based activities such as work.) That approach is generally referred to as Situation Theory. Devlin was one of those early pioneers, and wrote a theoretical book on the subject, Logic and Information . Subsequently, the techniques developed by the Stanford group were applied in [37,38] to solve an actual problem involving communication in the workplace.
The representation used was (of necessity) similar to that used by telecommunication engineers, Google, the postal system, UPS, FedEx, in that the domain is represented by a network. However, whereas those earlier examples had networks of point nodes, the nodes in this network were more complicated objects, which were termed “completion diagrams.” See the right-hand side of Figure 3, where “situation s1” results in “type T1”, and “situation s2” results in “type T2”, so that the transition from “situation s1” to “situation s2” has the related association between “type T1” and “type T2”. As to the exact nature of the entities in such a completion diagram, they can be considered as capturing the key elements of a basic human act, here military and managerial, including a communicative act. Much of Logic and Information is devoted to the development and explication of such a completion diagram. It has its origins in .
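The relation just described, between transitions of situations and associations of types, can be sketched as a small data structure. This is only an illustration under simplifying assumptions: a real completion diagram holds more entities than a situation and a type, and the class and method names below are hypothetical, not taken from Logic and Information or from the US Army study.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CompletionDiagram:
    """Reduced node: a situation together with the type it results in."""
    situation: str  # e.g. "s1"
    type_: str      # e.g. "T1"


@dataclass
class CompletionNetwork:
    """Network whose nodes are completion diagrams, not point nodes."""
    diagrams: dict = field(default_factory=dict)   # situation -> diagram
    transitions: set = field(default_factory=set)  # (from, to) situations

    def add_diagram(self, situation, type_):
        self.diagrams[situation] = CompletionDiagram(situation, type_)

    def add_transition(self, source, target):
        if source not in self.diagrams or target not in self.diagrams:
            raise KeyError("both situations need a completion diagram first")
        self.transitions.add((source, target))

    def type_associations(self):
        """Associations between types induced by situation transitions."""
        return {(self.diagrams[a].type_, self.diagrams[b].type_)
                for a, b in self.transitions}


# The example read from the right-hand side of Figure 3.
network = CompletionNetwork()
network.add_diagram("s1", "T1")
network.add_diagram("s2", "T2")
network.add_transition("s1", "s2")
associations = network.type_associations()
```

Every node has the same structure, which is what would allow back-end software to manipulate the network uniformly, while the interpretation of each situation and type remains with the human analyst.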
Information is a vehicle for the use of a Big Data approach to underpin the study of interaction and communication in smart environments (cities, workplaces, homes). Information Space Theory is to provide the focus for building an inter-disciplinary community concerned with social and technological issues associated with recent technological advances. Relevant emerging research and innovation disciplines include Internet of Things, Internet of Everything, Big Data Analytics among others, that contribute to the design, development and effective implementation of Smart Environments in real life.
Research projects related to both “Information Space Theory” and “Interaction Space Theory” include the following: SANE, “Sustainable Accommodation for the New Economy”, a European Framework 5 research project, with very innovative aims and outcomes for research and for industrial companies : “A multi-disciplinary and multi-cultural R&D project has created a unified framework that ensures compatibility between fixed and mobile as well as between local and remote work areas. Specific ICT tools were prototyped and developed emphasising the innovative application of emerging technologies and services.” Another project involving universities in the UK and in Germany was IS-VIT, “Interaction Space of the Virtual IT Workplace”. Related outcomes of these projects are in [41,42].
The Information Space Theory takes into account the following: (i) People who inhabit smart environments and spontaneously generate data and information in the course of their day-to-day activities; (ii) Place which can be public (smart cities), privileged (workplaces) or private (homes) with varying degrees of privacy and security constraints that shape information sharing; and (iii) Patterns of interaction between people and technology that is an integral part of smart environments and influences human–human, human–device and device–device interaction.
A summary follows of Inter-disciplinary Information Space Theory and its application in Smart Environments: (i) Introduction to studies of Information, Data and Interaction; (ii) Big Data Analytics as a tool for the development of Information Space Theory; (iii) Information Space Theory and its impact on the design of Smart Environments; (iv) Information space and human communication research, involving an account of the evolution of smart interaction systems; (v) Further refinement of Information Space Theory informed by cross-disciplinary perspectives and requirements of application in smart systems and emerging technologies, including contribution to the application of Big Data Analytics in real-life smart environments; and (vi) Emphasis can be on introducing the concept of Information Space as a distinct feature of human context that makes it possible for people to achieve coordination and reciprocity of perspectives through smart interaction systems that safeguard their privacy and security.
Such work is to build on the work of an inter-disciplinary group of researchers within mathematics, computer and social sciences who will together address the key research questions—How do emerging smart technologies influence information sharing in interaction between people and technology in smart environments? What are the social, legal and economic impacts of emerging smart technologies in real-life application?
To this end, the concept of Information Space will guide the investigation into interactions that occur within smart environments, taking account of human–human, human–device and device–device interaction in a uniform framework. Special attention is given to information sharing (pathways, enablers and gatekeepers), to incorporate security and privacy concerns that urgently need to be addressed in order to optimise the technology potential in real-life applications of smart environments. The working assumption behind this approach is that inter-disciplinary, formal, and theoretical understanding of the nature of these interactions is essential for these concerns to be addressed and resolved.
In this context, mathematics plays a crucial role in developing and using a mathematically-based representation framework for the analysis and design of work in the era of the Internet of Things. Both in life and in scientific studies, what we can achieve depends on, and is constrained by, the representational system we use. The greater the complexity of the domain, the more significant is the representation at our disposal: representations are what make it possible for us to understand and reason about the world. For instance, trade, commerce, and financial activity in Europe were revolutionized by the introduction of the Hindu-Arabic, decimal arithmetic system (“modern arithmetic”) in the 13th Century, which made it possible for anyone to become proficient in arithmetic after just a few weeks’ practice. A similar revolution occurred in the 1980s, when the introduction of the modern, windows–icons–mouse interface for personal computers made it possible for ordinary people to use what had until then been a tool for trained experts. Long before those two examples, the introduction of numbers themselves, in the form of a monetary system, transformed human life by providing a simple, quantitative representation system for property ownership and social indebtedness.
The rise of natural science involved a new representation system that assigned numerical values to various features of the environment (features given names such as length, area, volume, mass, temperature, momentum, etc.) and shifted the focus from trying to understand why things occurred to simply measuring how one quantified feature varied with another—an approach that proved to be extremely fruitful for society. The representation systems of the natural sciences have all been based on mathematics to a considerable extent. In the social realm, mathematically-based representation systems are less common, but when they have been developed, they have proved to be extremely powerful. (Money is a particularly dramatic example.) Indeed, one of the most widespread applications of mathematics in today’s world is the optimization of various human activities. Computer search depends on optimization in a mathematical space that treats every living human as a node in a simple mathematical structure called a graph. “Modelling” a person as a point node in a mathematical network omits all information about a person save for one factor: the connections of that human to all other humans. However, for questions that hinge on that one factor, the representation enables mathematical algorithms to be applied that provide society with one of its most important tools.
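To make the point-node representation concrete, the following sketch stores nothing about a person except their connections, and then answers one question that hinges only on that factor: the number of links separating two people. Breadth-first search is used purely for illustration and is not claimed to be what any search engine does. All names are hypothetical.

```python
from collections import deque


def add_connection(graph, a, b):
    """Record an undirected connection; nothing else about a person is kept."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def separation(graph, start, goal):
    """Number of links on a shortest chain from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbour in graph[person]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of connections exists


# Hypothetical people: a chain Ana - Ben - Chen, and Dee with no connections.
graph = {}
add_connection(graph, "Ana", "Ben")
add_connection(graph, "Ben", "Chen")
graph.setdefault("Dee", set())

links_ana_chen = separation(graph, "Ana", "Chen")
links_ana_dee = separation(graph, "Ana", "Dee")
```

The representation is deliberately impoverished: two people with identical connections are indistinguishable in it, which is exactly what makes a general-purpose algorithm applicable.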
Another example is provided by the algorithms that route our telephone calls, our Internet communication, our mail and package delivery systems, and our transportation systems. In those cases, whereas a search engine like Google represents the human domain as a two-dimensional network of nodes and edges, the domains of communicating devices such as phones or computers, of letters and packages in shipment, and of travellers are represented as high dimensional “polytopes,” generalizations of the familiar polygons of high school geometry to higher dimensions, to which mathematical methods such as the Simplex Method or Karmarkar’s algorithm can be applied to determine optimal routings. These representations work by ignoring almost everything about the entities in the domain apart from the one or two features that are germane to the task. The result is that the power of mathematics can be brought to bear on a problem that, on the face of it, is part of the complex web of human activity that defies the methods of science in terms of its complexity and (local) unpredictability.
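The role of the polytope can be shown on a toy problem in two dimensions, where the polytope is simply a polygon. The sketch below is neither the Simplex Method nor Karmarkar's algorithm: it enumerates every vertex by brute force and relies on the fact that a linear objective over a bounded, non-empty polytope attains its maximum at a vertex. The capacities and payoffs are made up for illustration.

```python
from itertools import combinations


def best_vertex(constraints, objective, tol=1e-9):
    """Maximise a linear objective over a polygon by checking its vertices.

    constraints: list of (a1, a2, b), each meaning a1*x + a2*y <= b.
    objective: (c1, c2), the coefficients of the quantity to maximise.
    Returns (value, x, y) for the best feasible vertex, or None.
    Assumption: the feasible region is bounded.
    """
    best = None
    for (a1, a2, b), (d1, d2, e) in combinations(constraints, 2):
        det = a1 * d2 - a2 * d1
        if abs(det) < tol:
            continue  # parallel boundary lines never meet
        # Intersection of the two boundary lines, by Cramer's rule.
        x = (b * d2 - a2 * e) / det
        y = (a1 * e - b * d1) / det
        if all(p * x + q * y <= r + tol for p, q, r in constraints):
            value = objective[0] * x + objective[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best


# Hypothetical routing: x and y are the loads sent over two routes.
constraints = [
    (1, 0, 4),   # route 1 capacity: x <= 4
    (0, 1, 3),   # route 2 capacity: y <= 3
    (1, 1, 5),   # shared link capacity: x + y <= 5
    (-1, 0, 0),  # x >= 0
    (0, -1, 0),  # y >= 0
]
value, x, y = best_vertex(constraints, objective=(3, 2))
```

Brute-force enumeration becomes infeasible as the dimension grows, because the number of vertices grows very quickly. The Simplex Method instead moves from vertex to neighbouring vertex, and Karmarkar's algorithm moves through the interior of the polytope.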
Having indicated a few highly important, and relatively recent, organisational initiatives (cf., Section 7), let us again emphasize that Data Science, viewed as the convergence of disciplines, or, in practice, sub-disciplines, should very much incorporate open methodology, open data, and transparency, reproducibility, and interoperability (cf., Section 3).
This article has sought to form a foundation for further study of the specific content of Data Science education and training (cf., Section 5.1), and of business sectoral importance (cf., Section 5.2). After all, progress and impact ensure development and evolution over time. As noted above, too, we may, if we wish, refer to the contemporary data revolution.
Both challenges (cf., Section 4 and Section 6) and impactful potential (cf., Section 2.2) are prominent, and it is good to see them as predominant, in our rapidly growing (cf., Section 2.1) discipline of Data Science.
Section 9 is by K.D. and the other sections are by F.M. All represent our extensive research work, together with teaching and some consultancy.
Conflicts of Interest
The authors declare no conflict of interest.
1. McKinsey Global Institute. The Age of Analytics: Competing in a Data-Driven World. Research Report. (under “Our Research”, “Technology and Innovation”). 2016, p. 136. Available online: www.mckinsey.com/mgi (accessed on 18 June 2018).
2. Mahabal, A.A.; Crichton, D.; Djorgovski, S.G.; Law, E.; Hughes, J.S. From sky to earth: Data Science methodology transfer. In Proceedings of the International Astronomical Union, Sydney, Australia, 17 July 2017; Brescia, M., Djorgovski, S.G., Feigelson, E., Long, G., Cavuoti, S., Eds.; Cambridge University Press: Cambridge, UK, 2017; pp. 17–26. Available online: https://arxiv.org/pdf/1701.01775.pdf (accessed on 18 June 2018).
3. Murtagh, F. Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2017.
4. Hayashi, C. What is Data Science? Fundamental concepts and a heuristic example. In Data Science, Classification, and Related Methods; Hayashi, C., Yajima, K., Bock, H.H., Ohsumi, N., Tanaka, Y., Baba, Y., Eds.; Springer: Heidelberg, Germany, 1998; pp. 40–51.
5. Ohsumi, N. From data analysis to data science. In Data Analysis, Classification, and Related Methods; Kiers, H.A.L., Rasson, J.-P., Groenen, P.J.F., Schader, M., Eds.; Springer: Heidelberg, Germany, 2000; pp. 329–334.
6. Escoufier, Y.; Fichet, B.; Lebart, L.; Hayashi, C.; Ohsumi, N.; Baba, Y. (Eds.) Data Science and Its Applications; Academic Press: Tokyo, Japan, 1995.
7. Cao, L. Data science: A comprehensive overview. ACM Comput. Surv. 2017, 50, 43:1–43:42.
8. Ueno, M. As the oldest journal of Data Science. Behaviormetrika 2017, 44, 1–2.
9. Englmeier, K.; Murtagh, F. Data Scientist—Manager of the discovery lifecycle. In Proceedings of the 6th International Conference on Data Science, Technology and Applications—Volume 1: DATA, Madrid, Spain, 26–28 July 2017; pp. 133–140.
10. Coombs, C.H. A Theory of Data; Wiley: Hoboken, NJ, USA, 1964.
11. Japec, L.; Kreuter, F.; Berg, M.; Biemer, P.; Decker, P.; Lampe, C.; Lane, J.; O’Neil, C.; Usher, A. AAPOR Report on Big Data; Technical Report; American Association for Public Opinion Research (AAPOR): Oakbrook Terrace, IL, USA, 2015; 50p. Available online: http://www.aapor.org/Education-Resources/Reports/Big-Data.aspx (accessed on 18 June 2018).
12. Abbany, Z. A Public Transport Model Built on Open Data, News Article. Available online: http://www.dw.com/en/a-public-transport-model-built-on-open-data/a-41546053 (accessed on 27 November 2017).
13. Darabi, A. The UK’s Next Census Will Be Its Last—Here’s Why, News Report. Available online: https://apolitical.co/solution_article/uks-next-census-will-last-heres (accessed on 5 December 2017).
14. Murtagh, F.; Orlov, M.; Mirkin, B. Qualitative judgement of research impact: Domain taxonomy as a fundamental framework for judgement of the quality of research. J. Classif. 2018, 35, 5–28.
15. Hand, D. Statistical challenges of administrative and transaction data. J. R. Stat. Soc. Ser. A 2018, 181, 1–24.
16. Anderson, C. The End of Theory: The Data Deluge Makes The Scientific Method Obsolete, Wired Magazine. Available online: http://www.wired.com/science/discoveries/magazine/16-07/pb-theory (accessed on 16 July 2008).
17. Murtagh, F. Origins of modern data analysis linked to the beginnings and early development of computer science and information engineering. Electron. J. Hist. Probab. Stat. 2008, 4, 26.
18. Englmeier, K.; Murtagh, F. What can we expect from data scientists? J. Theor. Appl. Electron. Commer. Res. 2017, 12, i–iv.
19. Murtagh, F.; Farid, M. Contextualizing Geometric Data Analysis and Related Data Analytics: A Virtual Microscope for Big Data Analytics. J. Interdiscip. Methodol. Issues Sci. Spec. Issue Digit. Contex. 2017, 3, 1–19.
20. Allin, P.; Hand, D.J. New statistics for old?—Measuring the wellbeing of the UK. J. R. Stat. Soc. Ser. A 2017, 180, 3–43.
21. Wessel, M. You Don’t Need Big Data—You Need the Right Data. Harvard Business Review. 3 November 2016. Available online: https://hbr.org/2016/11/you-dont-need-big-data-you-need-the-right-data (accessed on 18 June 2018).
22. Jobs Rated Report 2017: Ranking 200 Jobs. Available online: https://www.careercast.com/jobs-rated/2017-jobs-rated-report (accessed on 18 June 2018).
23. Kei Daniel, B. Reimaging research methodology as Data Science. Big Data Cogn. Comput. 2018, 2.
24. Press, G. Graduate Programs in Big Data Analytics and Data Science. (Last updated: October 26, 2017). Available online: https://whatsthebigdata.com/2012/08/09/graduate-programs-in-big-data-and-data-science (accessed on 18 June 2018).
25. NewVantage Partners (NVP). Big Data Business Impact: Achieving Business Results through Innovation and Disruption. Big Data Executive Survey 2017, Executive Summary of Findings. 2017, p. 16. Available online: http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf (accessed on 18 June 2018).
26. Hayes, B. Empirically-Based Approach to Understanding the Structure of Data Science. Business over Broadway; Seattle, WA, USA, 18 January 2016. Available online: http://businessoverbroadway.com/empirically-based-approach-to-understanding-the-structure-of-data-science (accessed on 18 June 2018).
27. Murtagh, F. Security and ethics in Big Data: Analytical foundations for surveys. Arch. Data Sci. 2018. submitted.
28. Hand, D. The Dangers of Not Seeing What Isn’t There: Selection Bias in Statistical Modelling. In Proceedings of the ISA Gosset Lecture, Dublin, Ireland, 6 April 2017; Royal Irish Academy: Dublin, Ireland, 2017.
29. Keiding, N.; Louis, T.A. Perils and Potentials of Self-Selected Entry to Epidemiological Studies and Surveys. J. R. Stat. Soc. Ser. A 2016, 179, 319–376.
30. O’Neil, C. Weapons of Math Destruction; Crown/Archetype: Danvers, MA, USA, 2016.
31. Le Roux, B.; Lebaron, F. Idées-clefs de l’analyse géométrique des données. In La Méthodologie de Pierre Bourdieu en Action: Espace Culturel, Espace Social et Analyse des Données; Lebaron, F., Le Roux, B., Eds.; Dunod: Paris, France, 2015; pp. 3–20.
32. Murtagh, F.; Pianosi, M.; Bull, R. Tracking and mapping Habermas’s communicative action: A case study using Twitter social media. Qual. Quant. 2016, 50, 1675–1694.
33. United Nations. “Open Universe” Proposal, an Initiative Under the Auspices of the Committee on the Peaceful Uses of Outer Space For Expanding Availability of and Accessibility to Open Source Space Science Data. In Proceedings of the Committee on the Peaceful Uses of Outer Space, 59th Session, Vienna, Austria, 8–17 June 2016; Available online: http://www.unoosa.org/res/oosadoc/data/documents/2016/aac_1052016crp/aac_1052016crp_6_0_html/AC105_2016_CRP06E.pdf (accessed on 18 June 2018).
34. Wilkinson, M.D.; Dumontier, M.; jan Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018.
35. Devlin, K. Logic and Information; Cambridge University Press: Cambridge, UK, 1991.
36. Devlin, K. A Uniform Framework for Describing and Analyzing the Modern Battlefield, US Army Feasibility Study Report. 2011. Available online: http://web.stanford.edu/~kdevlin/Papers/Army_report_0711.pdf (accessed on 18 June 2018).
37. Devlin, K.; Rosenberg, D. Language at Work: Analyzing Communication Breakdown in the Workplace to Inform Systems Design; CSLI Publications: Stanford, CA, USA, 1996.
38. Devlin, K.; Rosenberg, D. Information in the study of human interaction. In Handbook of the Philosophy of Information; Adriaans, P., van Benthem, J., Gabbay, D., Thagard, P., Woods, J., Eds.; Elsevier: Amsterdam, The Netherlands, 2008; pp. 685–710. Available online: http://web.stanford.edu/~kdevlin/Papers/HPI_SocialSciences.pdf (accessed on 18 June 2018).
39. Barwise, J.; Perry, J. Situations and Attitudes; CSLI Publications: Stanford, CA, USA, 1999.
40. Sustainable Accommodation in the New Economy (SANE). European Union Framework 5 Project. Result in Brief. 2005. Available online: https://cordis.europa.eu/project/rcn/58059_en.html (accessed on 18 June 2018).
41. Rosenberg, D.; Foley, S.; Lievonen, M.; Kammas, S.; Crisp, M.J. Interaction spaces in computer-mediated communication. AI Soc. 2005, 19, 22–33.
42. Walkowski, S.; Doerner, R.; Lievonen, M.; Rosenberg, D. Using Game controller for Relaying Deictic Gestures in Computer Mediated Communication. Int. J. Hum.-Comput. Stud. 2011, 69, 362–374.
Figure 1. From 60 Data Scientist job advertisements, non-empty from the initial set of 72 employers, with use of 21 software languages or environments. The latter were required by at least four employers. Displayed is the principal factor plane.
Figure 2. Hierarchical clustering, and derived 3-class partition, of the 21 software languages or environments at issue here, based on the full dimensionality, Euclidean-metric endowed, factor space. See Section 3. The agglomerative criterion is the Ward minimum variance method.
Figure 3. The completion-diagram network is complex. In the screenshot, the aerial view (taken from a previous mission used in training) is of an urban battlefield. The back-end system links the elements in each completion diagram to a corresponding feature in the aerial view, permitting the user to work fluidly with the two representations, having the benefit of two very different views, one spatial, the other human-structural, so the user can explore the domain from (literally) different perspectives.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).