Contents
Data Mining and Warehousing
Rationale
Data mining (DMM), also called knowledge discovery in databases (KDD) or knowledge discovery and data mining, is the process of automatically searching large volumes of data for patterns using techniques such as classification, association rule mining and clustering. It is a complex topic with links to several core fields, notably computer science, and it builds on established computational techniques from statistics, information retrieval, machine learning and pattern recognition.
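As an illustration of one of the techniques named above, the following is a minimal sketch of association rule mining restricted to single-item rules. The transactions, thresholds and function names (`support`, `confidence`, `pair_rules`) are hypothetical and chosen only for this example; production tools use more scalable algorithms such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """List single-item rules lhs -> rhs that clear both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules

rules = pair_rules(transactions)
```

Each returned tuple holds the left-hand item, the right-hand item, the rule's support and its confidence, so a rule such as bread -> milk can be read as "customers who buy bread also tend to buy milk".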
- Example
- Use of the term
- Misuse of the term
- Related terms
- Data dredging
- Privacy concerns
- Combinatorial game data mining
- Notable uses of data mining
- Artificial intelligence
- Bayesian network
- CRISP-DM
- Data analysis
- Data farming
- Data warehouse
- Descriptive statistics
- Fuzzy logic
- Hypothesis testing
- k-nearest neighbor algorithm
- Machine learning
- Pattern recognition
- Predictive analytics
- Preprocessing
- Statistics
- Structured Data Mining
- Unstructured Data Mining
- Induction algorithms
- Dimensionality reduction
- Application areas
- Software
- References
- General References
- Books
Teaching and Learning Resources

Introduction to Data Mining, Warehousing, and Visualization. Data Mining Techniques
- Introduction to Data Mining, Warehousing, and Visualization
- Introduction to Data Mining and Knowledge Discovery
- Introduction to Data Warehousing
Transforming Data into Useful Information
OPEN comprises a tightly integrated set of core system components and application modules that work together to deliver an enterprise power information management solution. The core system functions include:
- Data Collection
- Data Exchange
- Data Warehousing
- Analysis Engine
- Alarming and Reporting
- Visualization
- Administration
- Data Mining Techniques
The Data Warehouse
Tutorials
Readings
A data warehouse is a repository (collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis.[1] More simply, a data warehouse is a collection of a large amount of data.
This definition of the data warehouse focuses on data storage. Data from the source systems is cleaned, transformed and cataloged, and is then made available to managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated analysis and reporting of its data at different levels of aggregation.
The practical reality of most organizations is that their data infrastructure is made up of a collection of heterogeneous systems. For example, an organization might have one system that handles customer relationships, a system that handles employees, systems that handle sales data or production data, yet another system for finance and budgeting data, etc. In practice, these systems are often poorly integrated or not integrated at all, and simple questions like "How much time did salesperson A spend on customer C, how much did we sell to customer C, was customer C happy with the provided service, did customer C pay his bills?" can be very hard to answer, even though the information is available "somewhere" in the different data systems.
Another problem is that enterprise resource planning (ERP) systems are designed to support the operations they serve, not aggregated reporting. For example, a finance system might keep track of every single stamp bought: when it was ordered, when it was delivered and when it was paid for. The system might also apply accounting principles (like double-entry bookkeeping) that further complicate the data model. Such information is great for the person in charge of buying stamps or the accountant trying to sort out an irregularity, but the CEO is not interested in that level of detail. The CEO wants answers to questions like "What's the cost?", "What's the revenue?" and "Did our latest initiative reduce costs?", and wants this information at an aggregated level.
Yet another problem might be that the organization is, internally, in disagreement about which data are correct. For example, the sales department might have one view of its costs, while the finance department has another view of the same costs. In such cases, the organization can spend a great deal of time discussing who has the correct view of the data.
It is partly the purpose of data warehousing to bridge such problems. In data warehousing the source data systems are taken as given: even though a source system might have been built in a manner that makes it difficult to extract integrated information, the "data warehousing answer" is not to redesign the source systems but rather to make the data appear consistent, integrated and consolidated despite the problems in the underlying source systems. Data warehousing achieves this by employing different data warehousing techniques, creating one or more new data repositories (i.e. the data warehouse) whose data model(s) support the needed reporting and analysis.
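The idea of leaving the source systems untouched and building a new, consolidated repository can be sketched with Python's standard `sqlite3` module. Everything here is an assumption made for illustration: the three "source systems" are modeled as tables in one in-memory database, and the table and column names (`crm_visits`, `sales_orders`, `finance_invoices`, `dw_customer_summary`) are hypothetical. A real warehouse would load from separate systems through an ETL process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Three hypothetical source systems, modeled here as separate tables.
CREATE TABLE crm_visits (salesperson TEXT, customer TEXT, hours REAL);
CREATE TABLE sales_orders (customer TEXT, amount REAL);
CREATE TABLE finance_invoices (customer TEXT, amount REAL, paid INTEGER);

INSERT INTO crm_visits VALUES ('A', 'C', 2.5), ('A', 'C', 1.5), ('B', 'D', 3.0);
INSERT INTO sales_orders VALUES ('C', 1000.0), ('C', 250.0), ('D', 400.0);
INSERT INTO finance_invoices VALUES ('C', 1000.0, 1), ('C', 250.0, 0), ('D', 400.0, 1);

-- The "warehouse" table: one consolidated, aggregated row per customer.
CREATE TABLE dw_customer_summary AS
SELECT c.customer AS customer,
       (SELECT COALESCE(SUM(v.hours), 0) FROM crm_visits v
         WHERE v.customer = c.customer) AS sales_hours,
       (SELECT COALESCE(SUM(s.amount), 0) FROM sales_orders s
         WHERE s.customer = c.customer) AS total_sold,
       (SELECT COALESCE(SUM(f.amount), 0) FROM finance_invoices f
         WHERE f.customer = c.customer AND f.paid = 0) AS unpaid
FROM (SELECT customer FROM crm_visits
      UNION SELECT customer FROM sales_orders
      UNION SELECT customer FROM finance_invoices) AS c;
""")

# Aggregated questions are now answered from a single table.
summary = cur.execute(
    "SELECT customer, sales_hours, total_sold, unpaid "
    "FROM dw_customer_summary ORDER BY customer"
).fetchall()
conn.close()
```

With the summary table in place, the questions about time spent on customer C, sales to customer C and unpaid bills are answered by reading one row instead of querying three systems.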
- History
- Architecture
- Conforming information
- Normalized versus dimensional approach for storage of data
- Top-down versus bottom-up design methodologies
- Data warehouses versus operational systems
- Evolution in organization use
- Benefits
- Disadvantages
- Sample applications
- Future
- Accounting intelligence
- Business Intelligence
- Business intelligence tools
- Data integration
- Data mart
- Data mining
- Data mining agent
- Data Presentation Architecture
- Data warehouse appliance
- Database Management System (DBMS)
- Decision support system
- Data Vault Modeling
- Executive Information System (EIS)
- Extract, transform, and load (ETL)
- Master Data Management (MDM)
- Online Analytical Processing (OLAP)
- Online transaction processing (OLTP)
- Operational Data Store (ODS)
- Data scraping
- Snowflake schema
- Software as a service (Saas)
- Star schema
- Slowly changing dimension
- References
- Further reading
Davenport, Thomas H. and Harris, Jeanne G. Competing on Analytics: The New Science of Winning (2007) Harvard Business School Press. ISBN 978-1422103326
Ganczarski, Joe. Data Warehouse Implementations: Critical Implementation Factors Study (2009) VDM Verlag ISBN 3-639-18589-7 ISBN 978-3-639-18589-8
Kimball, Ralph and Ross, Margy. The Data Warehouse Toolkit Second Edition (2002) John Wiley and Sons, Inc. ISBN 0-471-20024-7
Data Mining and Data Visualization
Tutorials
Readings
Data visualization is the study of the visual representation of data, meaning "information which has been abstracted in some schematic form, including attributes or variables for the units of information".[1]
According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between design and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information".[2]
Data visualization is closely related to information graphics, information visualization, scientific visualization and statistical graphics. In the new millennium data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united the fields of scientific and information visualization.[3]
KPI Library has developed the "Periodic Table of Visualization Methods", an interactive chart displaying a variety of data visualization methods.[4] This is probably the most comprehensive collection of different types of data visualization. It details six types of visualization methods: data, information, concept, strategy, metaphor and compound.
Data visualization strategies were recently used by the Barack Obama administration as a way to enhance and broaden the support of his campaign. The "Road to Recovery" visualization became famous by showing US job loss figures between December 2007 and January 2010. This enabled people to compare the number of jobs lost during President Obama’s first year in office with the number of jobs lost during President Bush’s last year in office.[5]
Software
- Epic Systems Trend Compass animated charts (sample: MasterCard vs. Visa performance in the UK)
- Avizo
- Bime
- Data Desk
- DAVIX
- Eye-Sys
- Ferret Data Visualization and Analysis
- GGobi
- IBM OpenDX
- IDL (programming language)
- InetSoft
- Instantatlas
- OpenLink AJAX Toolkit
- ParaView
- Processing (programming language)
- ScienceGL (www.sciencegl.com)
- Smile (software)
- StatSoft
- Trade Space Visualizer
- Visifire
- VisIt
- VTK
- Yoix
- References
- Further reading
- How to Win Political Debates with Data Visualization
- Periodic Table of Visualization Methods
- Data Visualization blog posts. Archive of many blog posts about data visualization.
- Milestones in the History of Thematic Cartography, Statistical Graphics, and Data Visualization
Machines That Can Learn
Tutorials
Readings
Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too large to be covered by the set of observed examples (training data). Hence the learner must generalize from the given examples, so as to be able to produce a useful output in new cases. Artificial intelligence is a closely related field, as are probability theory and statistics, data mining, pattern recognition, adaptive control, computational neuroscience and theoretical computer science.
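The need to generalize from observed examples to new cases can be illustrated with a small sketch. It uses a 1-nearest-neighbour classifier, chosen here only because it is short, and measures accuracy on held-out examples that were not used for training. The data points, labels and function names are hypothetical.

```python
import math

def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to point."""
    _, label = min(train, key=lambda example: math.dist(example[0], point))
    return label

def accuracy(train, test):
    """Share of held-out examples whose label is predicted correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

# Observed examples (training data): feature vector and label.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"), ((7.0, 5.5), "high"), ((6.5, 7.0), "high"),
]

# New cases the learner has never seen; none of them appears in train.
test = [((1.2, 1.4), "low"), ((6.8, 6.1), "high"), ((2.2, 1.8), "low")]

held_out_accuracy = accuracy(train, test)
```

Evaluating on examples outside the training set is what distinguishes generalization from simply memorizing the observed data.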
- Definition
- Generalization
- Human interaction
- Algorithm types
- Theory
- Approaches
- Applications
- Software
- Journals and conferences
- Computational intelligence
- Data mining
- Explanation-based learning
- Important publications in machine learning
- Multi-label classification
- Pattern recognition
- Predictive analytics
- References
- Further reading
- Ruby implementations of several machine learning algorithms
- Andrew Ng's Stanford lectures and course materials
- The Encyclopedia of Computational Intelligence
- International Machine Learning Society
- Kmining List of machine learning, data mining and KDD scientific conferences
- Machine Learning Open Source Software
- Machine Learning Video Lectures
- Open Source Artificial Learning Software
- The Computational Intelligence and Machine Learning Virtual Community
- R Machine Learning Task View
- Machine Learning Links and Resources
- Fuzzy Logic
- Fuzzy Logic Expert System Design
- Neural Networks
Executive Information Systems
Tutorials
Readings
An Executive Information System (EIS) is a type of management information system intended to facilitate and support the information and decision-making needs of senior executives by providing easy access to both internal and external information relevant to meeting the strategic goals of the organization. It is commonly considered a specialized form of Decision Support System (DSS).[1]
The emphasis of EIS is on graphical displays and easy-to-use user interfaces. They offer strong reporting and drill-down capabilities. In general, EIS are enterprise-wide DSS that help top-level executives analyze, compare, and highlight trends in important variables so that they can monitor performance and identify opportunities and problems. EIS and data warehousing technologies are converging in the marketplace.
In recent years, the term EIS has lost popularity in favour of Business Intelligence (with the sub areas of reporting, analytics, and digital dashboards).
Designing and Building the Data Warehouse
Tutorials
Readings
Basic ODM Concepts
Tutorials
- Basic ODM Concepts
- Section 1.1, "New Features and Functionality"
- Section 1.2, "Oracle9i Data Mining Components"
- Section 1.3, "Data Mining Functions"
- Section 1.4, "ODM Algorithms"
- Section 1.5, "Data Mining Tasks"
- Section 1.6, "ODM Objects and Functionality"
- Section 1.7, "Missing Values"
- Section 1.8, "Discretization (Binning)"
- Section 1.9, "PMML Support"
Readings
Oracle Data Mining (ODM) is an option of Oracle Corporation's Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, clustering, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.
k-Means
Tutorials
- K-Mean Clustering Tutorials
- What is cluster analysis?
- Cluster Analysis - Introduction
- Representative-Based Clustering
- k-Means and Hierarchical Clustering
Readings
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data as well as in the iterative refinement approach employed by both algorithms.
Figure: k-means clustering result for the Iris flower data set and actual species, visualized using ELKI. Cluster means are marked using larger, semi-transparent symbols.
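The iterative refinement described above can be sketched in plain Python. This is a minimal version of the standard assignment/update iteration (often called Lloyd's algorithm), not the implementation used by any of the tools listed on this page. The sample points are hypothetical, and the initial centers are simply the first k points so that the run is deterministic; practical implementations usually use random or k-means++ initialization.

```python
import math

def kmeans(points, k, initial_centers=None, max_iter=100):
    """Partition points into k clusters by alternating two steps."""
    centers = [tuple(c) for c in (initial_centers or points[:k])]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the means are stable.
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # Keep the old center if a cluster is empty.
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
```

The result depends on the starting centers, which is why k-means is commonly run several times with different initializations.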
- Code implementations
- K-means application in PHP
- A fast implementation of the K-means algorithm which uses the triangle inequality to speed up computation
- K-means clustering using Perl. Online clustering.
- K-means clustering using C++ by Antonio Gulli
- K-means clustering implementation in Ruby (AI4R)
- Another K-means clustering implementation in Ruby
- k-means clustering implementation in Python with scipy
- k-means in X10
- A parallel out-of-core implementation in C
- An open-source collection of clustering algorithms, including k-means, implemented in Javascript
- Visualization, animation and examples
- An example of multithreaded application which uses K-means in Java
- Visualization of the K-means-algorithm (Applet)
- Interactive demo of the K-means-algorithm (Applet)
- Another animation of the K-means-algorithm
- Numerical Example of K-means clustering
- Application example which uses K-means clustering to reduce the number of colors in images
- Clustergram - cluster diagnostic plot - for visual diagnostics of choosing the number of (k) clusters (R code)
Neural Networks
Tutorials
Readings
Traditionally, the term neural network was used to refer to a network or circuit of biological neurons[1]; the modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes. Thus the term has two distinct usages:
Biological neural networks are made up of real biological neurons that are connected or functionally related in the peripheral nervous system or the central nervous system. In the field of neuroscience, they are often identified as groups of neurons that perform a specific physiological function in laboratory analysis.
Artificial neural networks are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons). Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system. The real, biological nervous system is highly complex and includes some features that may seem superfluous based on an understanding of artificial networks.
This article focuses on the relationship between the two concepts; for detailed coverage of the two different concepts refer to the separate articles: Biological neural network and Artificial neural network.
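To make the notion of an artificial neuron as a programming construct concrete, here is a minimal sketch of a single neuron: a weighted sum of inputs passed through a sigmoid activation. The weights and bias are hand-picked, hypothetical values that make the neuron approximate a logical AND; in a real network they would be learned from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical parameters chosen by hand so the neuron acts like AND.
and_weights, and_bias = (10.0, 10.0), -15.0

# Round the activation to 0 or 1 for each combination of binary inputs.
outputs = {
    (a, b): round(neuron((a, b), and_weights, and_bias))
    for a in (0, 1)
    for b in (0, 1)
}
```

An artificial neural network connects many such units, feeding the outputs of some neurons into the inputs of others.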
- Overview
- History of the neural network analogy
- The brain, neural networks and computers
- Neural networks and neuroscience
- Criticism
- LearnArtificialNeuralNetworks - Robot control and neural networks
- Review of Neural Networks in Materials Science
- Artificial Neural Networks Tutorial in three languages (Univ. Politécnica de Madrid)
- Introduction to Neural Networks and Knowledge Modeling
- Another introduction to ANN
- Next Generation of Neural Networks - Google Tech Talks
- Performance of Neural Networks
- Neural Networks and Information
- PMML Representation - Standard way to represent neural networks
Principal Component Analysis
Tutorials
Readings
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT), the Hotelling transform or proper orthogonal decomposition (POD).
PCA was invented in 1901 by Karl Pearson.[1] Now it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores and loadings (Shaw, 2003).
PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint.
PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.
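The covariance route to PCA described above can be sketched without external libraries. This example mean-centers a small, hypothetical data set, builds the covariance matrix and finds only the first principal component. It uses power iteration in place of a full eigenvalue decomposition, which is a simplification chosen to keep the sketch short; it also skips any z-score scaling.

```python
import math

def mean_center(data):
    """Subtract each attribute's mean from its column."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]

def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def leading_component(cov, iterations=200):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

# Hypothetical data with two strongly correlated variables.
data = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8]]
centered = mean_center(data)
cov = covariance_matrix(centered)
eigenvalue, component = leading_component(cov)

# Share of the total variance captured by the first principal component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Component scores: projection of each centered row onto the component.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]
```

Because the two variables are almost perfectly correlated, the first component captures nearly all of the variance, so the one-dimensional scores are a faithful "shadow" of the two-dimensional data.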
- Details
- Discussion
- Table of symbols and abbreviations
- Properties and limitations of PCA
- Computing PCA using the covariance method
- Organize the data set
- Calculate the empirical mean
- Calculate the deviations from the mean
- Find the covariance matrix
- Find the eigenvectors and eigenvalues of the covariance matrix
- Rearrange the eigenvectors and eigenvalues
- Compute the cumulative energy content for each eigenvector
- Select a subset of the eigenvectors as basis vectors
- Convert the source data to z-scores
- Project the z-scores of the data onto the new basis
- Derivation of PCA using the covariance method
- Computing principal components with expectation maximization
- Relation between PCA and K-means clustering
- Correspondence analysis
- Generalizations
- Software/source code
- Notes
- References
- External links
- The Most Representative Composite Rank Ordering of Multi-Attribute Objects by the Particle Swarm Optimization
- Sub-Optimality of Rank Ordering of Objects on the Basis of the Leading Principal Component Factor Scores
- Spectroscopy and PCA
- An introduction and review of recent developments of PCA
- An introductory explanation of PCA from StatSoft
- A Tutorial on Principal Component Analysis by Jonathon Shlens (PDF)
- Principal Component Analysis using Hebbian learning tutorial
- Presentation of Principal Component Analysis used in Biomedical Engineering
- Application to microarray and other biomedical data
- PCA in functional neuroimaging, free software
- Uncertainty estimation for PCA
- FactoMineR, an R package dedicated to exploratory multivariate analysis
- A web-site with presentations and open source software on exploratory multivariate data analysis
- EasyPCA, a very simple and small PCA program under the GPL license
- R tutorial on cluster and principal component analysis including example data
- Simple COV-based PCA using Eigen Template Library in C++ by Antonio Gulli
- www.powercam.cc/chli by Cheng-Hsuan Li (A Chinese Tutorial on Kernel Method, PCA, KPCA, LDA, GDA, and SVMs)
Data analysis is the act of transforming data with the aim of extracting useful information and facilitating conclusions. Depending on the type of data and the question, this might include the application of statistical methods, curve fitting, selecting or discarding certain subsets based on specific criteria, or other techniques. In contrast to data mining, data analysis is usually understood more narrowly: it does not aim to discover unforeseen patterns hidden in the data, but to verify or disprove an existing model, or to extract the parameters needed to fit a theoretical model to (experimental) reality.
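Extracting the parameters of a theoretical model from data can be illustrated with an ordinary least-squares line fit, one common form of curve fitting. The model (a straight line), the measurements and the function name `fit_line` are assumptions made for this sketch.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements that follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

Here the analyst starts from an existing model and recovers its parameters, rather than searching the data for patterns that were not anticipated.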
The Future of Data Mining, Warehousing, and Visualization
Tutorials
Readings
Data mining software allows users to analyze large databases to solve business decision problems. Data mining is, in some ways, an extension of statistics, with a few artificial intelligence and machine learning twists thrown in. Like statistics, data mining is not a business solution; it is just a technology. For example, consider a catalog retailer who needs to decide who should receive information about a new product.
The information operated on by the data mining process is contained in a historical database of previous interactions with customers and the features associated with those customers, such as age, zip code and past responses. The data mining software would use this historical information to build a model of customer behavior that could be used to predict which customers would be likely to respond to the new product. By using this information a marketing manager can select only the customers who are most likely to respond. The operational business software can then feed the results of the decision to the appropriate touch point systems (call centers, direct mail, web servers, email systems, etc.) so that the right customers receive the right offers.
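The catalog retailer scenario can be sketched as follows. The "model" here is deliberately simple, just the historical response rate per customer segment, and stands in for whatever model a data mining tool would actually build. The records, field names (`age_band`, `zip3`, `responded`) and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical historical database of past mailings and responses.
history = [
    {"age_band": "18-34", "zip3": "100", "responded": True},
    {"age_band": "18-34", "zip3": "100", "responded": True},
    {"age_band": "18-34", "zip3": "946", "responded": False},
    {"age_band": "35-54", "zip3": "100", "responded": False},
    {"age_band": "35-54", "zip3": "946", "responded": False},
    {"age_band": "35-54", "zip3": "946", "responded": True},
    {"age_band": "55+", "zip3": "100", "responded": True},
    {"age_band": "55+", "zip3": "946", "responded": True},
]

def fit_response_rates(history, keys=("age_band",)):
    """Build the model: observed response rate for each customer segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, mailings]
    for row in history:
        segment = tuple(row[k] for k in keys)
        counts[segment][0] += int(row["responded"])
        counts[segment][1] += 1
    return {segment: r / n for segment, (r, n) in counts.items()}

def select_targets(prospects, rates, keys=("age_band",), threshold=0.5):
    """Pick prospects whose segment responded at or above the threshold."""
    return [
        p["id"]
        for p in prospects
        # Segments never seen in the history are scored 0.0 (an assumption).
        if rates.get(tuple(p[k] for k in keys), 0.0) >= threshold
    ]

rates = fit_response_rates(history)
prospects = [
    {"id": 1, "age_band": "18-34", "zip3": "100"},
    {"id": 2, "age_band": "35-54", "zip3": "946"},
    {"id": 3, "age_band": "55+", "zip3": "100"},
    {"id": 4, "age_band": "unknown", "zip3": "100"},
]
targets = select_targets(prospects, rates)
```

The list of selected customer IDs is what the operational software would pass on to the touch point systems.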
My Data Mining Book of the Month is Competing on Analytics by Tom Davenport and Jeanne Harris. This book is a great addition to the literature of analytics. Not that it solves any complicated statistical modeling problem; I don't think that there is a single equation in the entire book. Instead, this book focuses on the (more important) problem of getting an organization to change its approach to problem solving, by increasing the use of analytics across a business. This is a trend that I have been involved with over the past fifteen years, and I think that it is a key differentiator between modern companies. If you are interested in learning how companies like Amazon, Capital One, Harrah's, and Netflix (to name just a few) put analytics into practice, this book is a fantastic resource. For more books on data mining, take a look at my list of recommended books. Check the availability and buy your books from our Bookshop.
Recommended Texts
Data Warehousing, Data Mining, and OLAP (Data Warehousing/Data Management) (Hardcover)
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (The Morgan Kaufmann Series in Data Management Systems)
Relational Data Mining
Saso Dzeroski and Nada Lavrac, editors. Springer, Berlin, 2001. Relational data mining studies methods for knowledge discovery in databases when the database has information about several types of objects. This, of course, is usually the case when the database has more than one table. Hence there is little doubt as to the relevance of the area; indeed, one can wonder why most of data mining research has concentrated on the single table case. Relational data mining has its roots in inductive logic programming, an area in the intersection of machine learning and programming languages. ... The present book Relational Data Mining provides a thorough overview of different techniques and strategies used in knowledge discovery from multi-relational data. The chapters describe a broad selection of practical inductive logic programming approaches to relational data mining and give a good overview of several interesting applications. (From the foreword by Heikki Mannila)
Resources
- M1: Introduction: Machine Learning and Data Mining (PPT, 211KB)
- M2: Machine Learning and Classification (PPT, 471KB)
- M3: Input: Concepts, instances, attributes (PPT, 259KB)
- M4: Output: Knowledge Representation (PPT, 426KB)
- M5: Classification - Basic methods (PPT, 364KB)
- M6: Classification: Decision Trees (PPT, 738KB)
- M7: Classification: C4.5 (PPT, 373KB)
- M8: Classification: CART (PPT, 477KB)
- M9: Classification: Rules, Regression, K-Nearest Neighbour (PPT, 323KB)
- M10: Evaluation and Credibility (PPT, 650KB)
- M11: Evaluation - Lift and Costs (PPT, 318KB)
- M12: Data Preparation for Knowledge Discovery (PPT, 216KB)
- M13: Clustering (PPT, 474KB)
- M14: Associations (PPT, 216KB)
- M15: Visualization (PPT, 3.2MB)
- M16: Summarization and Deviation Detection (PPT, 1.2MB)
- M17: Applications: Targeted Marketing and Customer Modeling (PPT, 320KB)
- M18: Applications: Genomic Microarray Data Analysis (PPT, 1.3MB)
- M19: Data Mining and Society; Future Directions (PPT, 261KB)
Datasets
Source:































