Learning Data Mining and Warehousing

Contents

 

Modern Data Warehousing, Mining, and Visualization

Check the availability and buy your books from our Bookshop.

Contact us here

Online Business School  for the delivery and management
of your own existing or the customised versions of our programmes for in-class or global distance learning.

Teaching and Research Skills

 

Teaching Online

 

 

For further information see also

 

The Bookshop


Rationale

Teaching and Learning Resources

 

Case Studies

Related Workshops

 

Learner Support

 

Recommended Texts

Resources

 

Assignments, Assessments

 

Learning Centres

 

 

 

 

Data Mining and Warehousing

 

Rationale

 

 

 

Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using techniques such as classification, association rule mining, and clustering. Data mining is a complex topic that links to several core fields, notably computer science, and builds on established computational techniques from statistics, information retrieval, machine learning and pattern recognition.
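To make one of these techniques concrete, the sketch below computes the two basic measures used in association rule mining, support and confidence, over a handful of shopping baskets. The transactions and all function names are invented for illustration; this is a minimal sketch, not a full rule-mining algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Return the item pairs whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    return [
        pair
        for pair in combinations(items, 2)
        if support(pair, transactions) >= min_support
    ]
```

Here the rule "bread implies milk" has support 0.5 (two of four baskets contain both items) and confidence 2/3 (two of the three baskets with bread also contain milk).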

 

 

Data Mining

 

See also

 

External links

 

 


 

 

Teaching and Learning Resources

 


Learning Contents


Introduction to Data Mining, Warehousing, and Visualization. Data Mining Techniques

Tutorials

 

Readings

 

 

 

Transforming Data into Useful Information

 


 

OPEN is comprised of a tightly integrated set of core system components and application modules that work together to deliver an enterprise power information management solution. These core system functions include:

 

The Business Intelligence Blog

 

 

 

 

Real Time Marketing

 

 

The Data Warehouse

Tutorials

 

Readings

 

 

A data warehouse is a repository (collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis.[1] More simply, a data warehouse is a collection of a large amount of data.

 

Data Warehouse Process

 

This definition of the data warehouse focuses on data storage. Data from the main sources is cleaned, transformed and cataloged, and is made available to managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
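The extract, transform and load steps mentioned above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the raw rows, the `sales_fact` table and every function name are hypothetical, and an in-memory SQLite database stands in for the warehouse repository.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from source systems.
RAW_SALES = [
    {"customer": " Acme ", "amount": "120.50", "region": "north"},
    {"customer": "acme", "amount": "79.50", "region": "North"},
    {"customer": "Globex", "amount": "n/a", "region": "south"},
]


def extract():
    """Extract: a real system would read from the source systems here."""
    return list(RAW_SALES)


def transform(rows):
    """Transform: normalise text fields and drop rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        customer = row["customer"].strip().title()
        region = row["region"].strip().title()
        clean.append((customer, region, amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(customer TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    return conn
```

After `run_etl()`, the two differently spelled "Acme" rows end up under one customer name, and the row with an unparseable amount is left out, so a simple `GROUP BY customer` query returns a consolidated total.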

Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated analysis and reporting of its data at different levels of aggregation.

The practical reality of most organizations is that their data infrastructure is made up of a collection of heterogeneous systems. For example, an organization might have one system that handles customer relationships, a system that handles employees, systems that handle sales data or production data, yet another system for finance and budgeting data, etc. In practice, these systems are often poorly integrated or not integrated at all, and simple questions like "How much time did salesperson A spend on customer C, how much did we sell to customer C, was customer C happy with the provided service, and did customer C pay his bills?" can be very hard to answer, even though the information is available "somewhere" in the different data systems.

Another problem is that enterprise resource planning (ERP) systems are designed to support operations. For example, a finance system might keep track of every single stamp bought: when it was ordered, when it was delivered and when it was paid for. The system might also apply accounting principles (such as double-entry bookkeeping) that further complicate the data model. Such information is valuable to the person in charge of buying stamps or to the accountant trying to sort out an irregularity, but the CEO is not interested in that level of detail. The CEO wants answers to questions like "What's the cost?", "What's the revenue?" and "Did our latest initiative reduce costs?", and wants this information at an aggregated level.
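The difference between operational detail and the aggregated view can be illustrated with a small roll-up. The ledger rows and the `roll_up` function below are hypothetical; the sketch assumes negative amounts are costs and positive amounts are revenue.

```python
from collections import defaultdict

# Hypothetical transaction-level ledger rows: (month, category, amount).
# Assumption: negative amounts are costs, positive amounts are revenue.
LEDGER = [
    ("2010-01", "stamps", -12.0),
    ("2010-01", "sales", 500.0),
    ("2010-02", "stamps", -8.0),
    ("2010-02", "sales", 650.0),
]


def roll_up(ledger):
    """Aggregate detailed rows into cost and revenue totals per month."""
    summary = defaultdict(lambda: {"cost": 0.0, "revenue": 0.0})
    for month, _category, amount in ledger:
        if amount < 0:
            summary[month]["cost"] += -amount
        else:
            summary[month]["revenue"] += amount
    return dict(summary)
```

The individual stamp purchases are still in the ledger for the accountant, while `roll_up(LEDGER)` gives the per-month cost and revenue figures an executive would ask for.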

Yet another problem might be that the organization is, internally, in disagreement about which data are correct. For example, the sales department might have one view of its costs, while the finance department has another view of that cost. In such cases, the organization can spend unlimited time discussing who has the correct view of the data.

It is partly the purpose of data warehousing to bridge such problems. In data warehousing the source data systems are considered as given: even though a source system may have been built in a way that makes it difficult to extract integrated information, the "data warehousing answer" is not to redesign the source systems but to make the data appear consistent, integrated and consolidated despite the problems in the underlying systems. Data warehousing achieves this by applying various data warehousing techniques and by creating one or more new data repositories (i.e. the data warehouse) whose data models support the needed reporting and analysis.
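One common way to make data from unchanged source systems appear integrated is to map each system's identifiers onto a shared key. The sketch below assumes three hypothetical extracts that spell the same customer id differently; the field names, the `conform_key` rule and the `integrate` function are all invented for illustration.

```python
# Hypothetical extracts from three separate source systems. Each system
# uses its own field name and its own spelling of the customer id.
CRM_VISITS = [{"cust": "C-001", "salesperson": "A", "hours": 3.5}]
SALES_ORDERS = [{"customer_id": "c001", "total": 1200.0}]
BILLING = [{"client": "C001", "paid": True}]


def conform_key(raw_id):
    """Map each system's customer id onto one shared key format."""
    return raw_id.replace("-", "").upper()


def integrate(crm_visits, sales_orders, billing):
    """Build one consolidated record per customer from the three extracts."""
    view = {}
    for row in crm_visits:
        record = view.setdefault(conform_key(row["cust"]), {})
        record["hours"] = record.get("hours", 0.0) + row["hours"]
    for row in sales_orders:
        record = view.setdefault(conform_key(row["customer_id"]), {})
        record["sold"] = record.get("sold", 0.0) + row["total"]
    for row in billing:
        record = view.setdefault(conform_key(row["client"]), {})
        record["paid"] = row["paid"]
    return view
```

The source systems are left as they are; only the consolidated view uses the shared key, so time spent, sales and payment status for one customer can be read from a single record.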

 

See also

 

External links

Davenport, Thomas H. and Harris, Jeanne G. Competing on Analytics: The New Science of Winning (2007) Harvard Business School Press. ISBN 978-1422103326

Ganczarski, Joe. Data Warehouse Implementations: Critical Implementation Factors Study (2009) VDM Verlag ISBN 3-639-18589-7 ISBN 978-3-639-18589-8

Kimball, Ralph and Ross, Margy. The Data Warehouse Toolkit Second Edition (2002) John Wiley and Sons, Inc. ISBN 0-471-20024-7

 

 

 

Data Mining and Data Visualization

Tutorials

 

Readings

Data visualization is the study of the visual representation of data, meaning "information which has been abstracted in some schematic form, including attributes or variables for the units of information".[1]

Data Mining Add-ins for Visio 2007 released

 

According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between design and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information".[2]

Data visualization is closely related to information graphics, information visualization, scientific visualization and statistical graphics. In the new millennium data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united the fields of scientific and information visualization.[3]

KPI Library has developed the “Periodic Table of Visualization Methods”, an interactive chart displaying a range of data visualization methods.[4] This is probably the most comprehensive collection of data visualization types. It covers six types of visualization methods: data, information, concept, strategy, metaphor and compound.

Data visualization strategies were recently used by the Barack Obama administration as a way to enhance and broaden the support of his campaign. The "Road to Recovery" visualization became famous by showing US job loss figures between December 2007 and January 2010. This enabled people to compare the number of jobs lost during President Obama’s first year in office with the number of jobs lost during President Bush’s last year in office.[5]

 

See also

 

Software

 

External links

 

Advanced Visual Systems

 

Data visualization

 

 

Machines That Can Learn

Tutorials

 

Readings

Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too large to be covered by the set of observed examples (training data). Hence the learner must generalize from the given examples, so as to be able to produce a useful output in new cases. Artificial intelligence is a closely related field, as are probability theory and statistics, data mining, pattern recognition, adaptive control, computational neuroscience and theoretical computer science.
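The idea of generalizing from observed examples to new cases can be shown with one of the simplest learners, a nearest-neighbour classifier. The training data and the `predict` function below are hypothetical, and the sketch assumes two numeric features per example.

```python
import math

# Hypothetical labelled training examples: (feature vector, label).
TRAINING = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]


def predict(point, training=TRAINING):
    """Label `point` with the label of its nearest training example."""
    _features, label = min(
        training, key=lambda example: math.dist(example[0], point)
    )
    return label
```

Neither `(2.0, 1.0)` nor `(7.0, 9.0)` appears in the training data, yet `predict` returns "small" for the first and "large" for the second because each lies closest to examples with that label.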

 

Machine Learning (ML)

 

 

See also

 

External links

 

Design and simulate fuzzy logic systems

 

Executive Information Systems

Tutorials

 

Readings

An Executive Information System (EIS) is a type of management information system intended to facilitate and support the information and decision-making needs of senior executives by providing easy access to both internal and external information relevant to meeting the strategic goals of the organization. It is commonly considered a specialized form of Decision Support System (DSS).[1]

 

Business Intelligence

 

The emphasis of EIS is on graphical displays and easy-to-use user interfaces. They offer strong reporting and drill-down capabilities. In general, EIS are enterprise-wide DSS that help top-level executives analyze, compare, and highlight trends in important variables so that they can monitor performance and identify opportunities and problems. EIS and data warehousing technologies are converging in the marketplace.

In recent years, the term EIS has lost popularity in favour of Business Intelligence (with the sub areas of reporting, analytics, and digital dashboards).

 

See also

 

External links

 

Information Aggregation Usage

 

Designing and Building the Data Warehouse

Tutorials

 

Readings

 

Seven Principles for Enterprise Data Warehouse Design

 

Steps Involved in Building a Data Warehouse

Building the Data Warehouse: Getting Started

Building a Data Warehouse Part I: When to build your data warehouse

Building a Data Warehouse Part II: Building a new schema

Building a Data Warehouse Part III: Location of your data warehouse

Building a Data Warehouse Part IV: Extraction, Transformation, and Load

Building a Data Warehouse Part V: Application Development Options

Building Business Intelligence Data Warehouses

 

DB2 Data Warehouse OLAP Services, Part 1: Starting out with OLAP services



Basic ODM Concepts

Tutorials

 

Readings

Oracle Data Mining (ODM) is an option of Oracle Corporation's Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, clustering, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

 

 

External links

Oracle Data Mining
Developer: Oracle Corporation
Latest release: 10gR2 (October 2006)
Use: data mining and analytics
License: proprietary

 

Oracle9i Data Mining

 

 

k-Means

Tutorials

 

Readings

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data as well as in the iterative refinement approach employed by both algorithms.
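The iterative refinement described above can be sketched as two alternating steps: assign every observation to the nearest mean, then recompute each mean. In the minimal sketch below the initial means are passed in by the caller, which is an assumption made to keep the example deterministic (practical implementations usually choose them randomly); the sample points are invented.

```python
import math

# Hypothetical 2-D observations forming two visually separate groups.
POINTS = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (9.0, 8.0), (8.0, 9.0),
]


def k_means(points, centroids, max_iter=100):
    """Partition `points` into len(centroids) clusters.

    `centroids` holds the initial cluster means (supplied by the caller
    in this sketch). Returns the final means and the clusters.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: each mean moves to the average of its cluster.
        # An empty cluster keeps its previous mean.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # means stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters
```

Calling `k_means(POINTS, [(1.0, 1.0), (8.0, 8.0)])` separates the three points near the origin from the three points near (8, 8). The result of k-means generally depends on the initial means, so different starting values can give different partitions.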

 

 

See also

k-means clustering result for the Iris flower data set and actual species visualized using ELKI. Cluster means are marked using larger, semi-transparent symbols.

 

External links

 

 

ArrayMiner Gaussian clustering to k-Means

 

Experimental Condition Clusters

 

 

Neural Networks

Tutorials

 

Readings

Traditionally, the term neural network had been used to refer to a network or circuit of biological neurons[1]; the modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes. Thus the term has two distinct usages:

Biological neural networks are made up of real biological neurons that are connected or functionally related in the peripheral nervous system or the central nervous system. In the field of neuroscience, they are often identified as groups of neurons that perform a specific physiological function in laboratory analysis.

Artificial neural networks are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons). Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system. The real, biological nervous system is highly complex and includes some features that may seem superfluous based on an understanding of artificial networks.

This article focuses on the relationship between the two concepts; for detailed coverage of the two different concepts refer to the separate articles: Biological neural network and Artificial neural network.
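As an illustration of an artificial neuron as a programming construct, the sketch below implements a single threshold unit and trains it with the classic perceptron rule to reproduce a logical AND. The function names, the integer learning rate and the training data are choices made for this example only.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when x is non-negative."""
    return 1 if x >= 0 else 0


def neuron(weights, bias, inputs):
    """One artificial neuron: weighted sum of the inputs, then a threshold."""
    return step(sum(w * i for w, i in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, lr=1):
    """Perceptron rule: nudge the weights whenever the output is wrong."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - neuron(weights, bias, inputs)
            weights = [w + lr * error * i for w, i in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Truth table of logical AND as (inputs, target) pairs.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

After `weights, bias = train_perceptron(AND_GATE)`, calling `neuron(weights, bias, inputs)` returns the correct AND output for all four input pairs. A single neuron of this kind can only learn linearly separable functions; networks of many such units are needed for more complex patterns.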

 

Neural Networks and Machine Learning

 

 

See also

 

External links

 

Training Neural Networks

 

Principal Component Analysis

Tutorials

 

Readings

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT), the Hotelling transform or proper orthogonal decomposition (POD).

 

PCA - Principal Component Analysis

 

PCA was invented in 1901 by Karl Pearson.[1] Now it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores and loadings (Shaw, 2003).
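The calculation described above (mean centering, covariance matrix, eigenvalue decomposition) can be followed by hand for two variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. The sketch below is restricted to two-dimensional data and assumes the data has non-zero variance; the function name and return values are choices made for this example.

```python
import math


def pca_2d(points):
    """PCA for 2-D data via eigendecomposition of the covariance matrix.

    Returns the first principal component (unit vector), the component
    scores of each point, and the fraction of variance it explains.
    Assumes at least two points and non-zero total variance.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    eigenvalues = (half_trace + spread, half_trace - spread)

    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    component = (vx / norm, vy / norm)

    # Component scores: projection of each centered point on the component.
    scores = [x * component[0] + y * component[1] for x, y in centered]
    explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
    return component, scores, explained
```

For points that lie exactly on the line y = 2x, such as `[(1, 2), (2, 4), (3, 6), (4, 8)]`, the first component points along that line and accounts for essentially all of the variance, so one coordinate (the score) describes the data as well as the original two.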

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint.

PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

 

See also

 

 

Data analysis is the act of transforming data with the aim of extracting useful information and facilitating conclusions. Depending on the type of data and the question, this might include applying statistical methods, curve fitting, selecting or discarding certain subsets based on specific criteria, or other techniques. In contrast to data mining, data analysis is usually understood more narrowly: it aims not at discovering unforeseen patterns hidden in the data but at verifying or disproving an existing model, or at extracting the parameters needed to adapt a theoretical model to (experimental) reality.
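Curve fitting in its simplest form means extracting the parameters of an assumed model from measurements. The sketch below assumes a straight-line model y = slope * x + intercept and estimates both parameters by ordinary least squares; the function name is invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For the measurements `xs = [0, 1, 2, 3]` and `ys = [1, 3, 5, 7]`, `fit_line` returns a slope of 2 and an intercept of 1, which are the parameters of the line the points lie on.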

 

 

See also

Data Analysis

 

Principal Component Analysis

Principal Component Analysis (PCA) in 3D Visual Environment

 

 

The Future of Data Mining, Warehousing, and Visualization

Tutorials

 

Readings

What is data mining good for?

Data mining software allows users to analyze large databases to solve business decision problems. Data mining is, in some ways, an extension of statistics, with a few artificial intelligence and machine learning twists thrown in. Like statistics, data mining is not a business solution; it is just a technology. For example, consider a catalog retailer who needs to decide who should receive information about a new product.

The information operated on by the data mining process is contained in a historical database of previous interactions with customers and the features associated with the customers, such as age, zip code, and their responses. The data mining software would use this historical information to build a model of customer behavior that could be used to predict which customers would be likely to respond to the new product. By using this information a marketing manager can select only the customers who are most likely to respond. The operational business software can then feed the results of the decision to the appropriate touch point systems (call centers, direct mail, web servers, email systems, etc.) so that the right customers receive the right offers.
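The catalog example can be sketched with a deliberately simple model: the historical response rate of each customer segment. The history rows, segment definition, customer names and threshold below are all hypothetical, and real data mining software would typically fit a richer model than a per-segment rate.

```python
from collections import defaultdict

# Hypothetical history of past mailings: (age_band, zip_code, responded).
HISTORY = [
    ("18-34", "10001", True),
    ("18-34", "10001", True),
    ("18-34", "10001", False),
    ("35-54", "94105", False),
    ("35-54", "94105", False),
    ("35-54", "94105", True),
    ("55+", "60601", False),
]


def build_model(history):
    """Estimate the response rate of each (age band, zip code) segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, mailings]
    for age_band, zip_code, responded in history:
        counts[(age_band, zip_code)][0] += int(responded)
        counts[(age_band, zip_code)][1] += 1
    return {seg: resp / total for seg, (resp, total) in counts.items()}


def select_customers(customers, model, threshold):
    """Keep customers whose segment's estimated rate meets `threshold`.

    Customers in segments never seen in the history are treated as
    having a response rate of 0.0 (an assumption of this sketch).
    """
    return [
        name
        for name, age_band, zip_code in customers
        if model.get((age_band, zip_code), 0.0) >= threshold
    ]
```

With a threshold of 0.5, a prospect in the ("18-34", "10001") segment, whose historical response rate is 2/3, would be selected for the mailing, while prospects in the other two segments would not.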

 

Competing on Analytics

My Data Mining Book of the Month is Competing on Analytics by Tom Davenport and Jeanne Harris. This book is a great addition to the literature of analytics. Not that it solves any complicated statistical modeling problem; I don't think that there is a single equation in the entire book. Instead, this book focuses on the (more important) problem of getting an organization to change its approach to problem solving, by increasing the use of analytics across a business.

This is a trend that I have been involved with over the past fifteen years, and I think that it is a key differentiator between modern companies. If you are interested in learning how companies like Amazon, Capital One, Harrah's, and Netflix (to name just a few) put analytics into practice, this book is a fantastic resource. For more books on data mining, take a look at my list of recommended books.


 

 

Recommended Texts

 

Data Warehousing, Data Mining, and OLAP

Data Warehousing, Data Mining, and OLAP (Data Warehousing/Data Management) (Hardcover)
by Alex Berson, Stephen J. Smith


 

Data Mining : Practical Machine Learning Tools and Techniques with Java Implementations

Data Mining : Practical Machine Learning Tools and Techniques with Java Implementations (The Morgan Kaufmann Series in Data Management Systems)

 


 

Relational Data Mining

Saso Dzeroski and Nada Lavrac, editors

Springer, Berlin, 2001

Relational data mining studies methods for knowledge discovery in databases when the database has information about several types of objects. This, of course, is usually the case when the database has more than one table. Hence there is little doubt as to the relevance of the area; indeed, one can wonder why most of data mining research has concentrated on the single table case.

Relational data mining has its roots in inductive logic programming, an area in the intersection of machine learning and programming languages. ... The present book Relational Data Mining provides a thorough overview of different techniques and strategies used in knowledge discovery from multi-relational data. The chapters describe a broad selection of practical inductive logic programming approaches to relational data mining and give a good overview of several interesting applications.

(From the foreword by Heikki Mannila)


 

 

Resources

 

Now is the Right Time for Real-Time BI

 

 

 

 

 

Data Mining Course Notes

 

Datasets

 

Publications

 

 

Source:

KDnuggets