According to Tufféry (2012), data mining is the art of extracting information, in other words knowledge, from data. It is an ‘intelligent’ extraction and not simply a presentation of numerical results, statistics, surveys, sales statements, quoted market values, macroeconomic indicators or meteorological records. Nor does it consist of simple descriptive statistics. In essence, data mining occurs when raw data is used to go from the known to the unknown in order to formulate predictions or pursue more in-depth trend analyses. Data mining goes beyond observation to include inference and modeling.

One aspect of data mining deals with modeling the past to predict the future: we attempt to find hidden rules in a large volume of data relating to former customers so that we can apply them to new customers and make the best possible decisions. Data mining is therefore both descriptive and predictive. The descriptive (exploratory) techniques bring to light information that exists but is buried in a mass of data (as with automatic classification of individuals and discovering associations between products or drugs). The predictive (explanatory) techniques aim to extrapolate new information from existing information; this new information can be either qualitative (classification or scoring) or quantitative (prediction).
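
The distinction between descriptive and predictive techniques can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from Tufféry: the toy baskets, the spend figures, and the function names `pair_counts`, `learn_threshold` and `predict` are all hypothetical. The descriptive step counts which products were bought together; the predictive step learns a simple rule from former customers and applies it to a new one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: baskets bought by former customers.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"butter", "milk"},
]


def pair_counts(baskets):
    """Descriptive step: count how often each pair of products is bought together."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


# Hypothetical former customers: (monthly spend, stayed loyal?).
former = [(10, False), (20, False), (35, True), (50, True)]


def learn_threshold(records):
    """Predictive step: choose the spend threshold that misclassifies
    the fewest former customers (a deliberately simple one-rule model)."""
    candidates = sorted(spend for spend, _ in records)

    def errors(threshold):
        return sum((spend >= threshold) != loyal for spend, loyal in records)

    return min(candidates, key=errors)


def predict(threshold, spend):
    """Apply the learned rule to a new customer."""
    return spend >= threshold


if __name__ == "__main__":
    print(pair_counts(baskets).most_common(2))
    threshold = learn_threshold(former)
    print(threshold, predict(threshold, 40))
```

Real data mining tools use far richer models than a single threshold, but the division of labor is the same: the first function only summarizes what is already in the data, while the second produces a rule intended for cases that have not yet been observed.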

Several major technological advances have contributed to data mining’s growth.

  • The first advance relates to the storage and calculation capacity offered by modern computing equipment and techniques: data warehouses capable of storing several tens of terabytes, parallel architectures and increasingly powerful computers.
  • The second advance pertains to the growing number of optimized, integrated software suites that offer ‘packages’ of statistical and data mining algorithms that can be linked to each other automatically. These user-friendly packages provide a quality of output and possibilities for interactivity that were previously inconceivable.
  • The third advance is a major evolution in the field of decision making: the implementation of data mining methods in production processes, ranging from periodic output of information to end users to automatic event triggering.
  • A fourth advance has joined the previous three. It concerns the ability to process all kinds of data, including incomplete, aberrant or even text data (thanks to text mining).
  • A fifth element has played a role in the development of data mining: the creation of gigantic databases to accommodate the management requirements of businesses. This has led to the realization that these databases contain a wealth of unexploited data.

Tufféry, Stéphane (2012). Data mining et statistique décisionnelle : l'intelligence des données, Paris, Éditions Technip.