The objective of an exploratory analysis is to examine the data's distribution and characteristics before applying any statistical technique.
The main goal is to understand the data and its variables before undertaking any more detailed analysis, and to detect flaws in the data's design or collection. It can be applied to both univariate and multivariate data.
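As a minimal sketch of this first look at the data, the snippet below computes a few summary statistics for an invented univariate sample using only the Python standard library; the values are purely illustrative:

```python
import statistics

# Hypothetical univariate sample; any numeric column would do.
values = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.3, 6.8, 7.4, 19.0]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}
print(summary)

# A large gap between mean and median (here 7.0 vs 5.7) hints at skew or
# outliers worth inspecting before any formal statistical technique.
```

Even this crude summary already exposes the kind of distributional quirks (here, a single extreme value pulling the mean upward) that exploratory analysis is meant to surface.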
Extracting information from data involves a series of interrelated steps.
“Knowledge Discovery in Databases”, or KDD, is a process oriented toward identifying new and understandable patterns in data.
KDD involves the evaluation and interpretation of patterns and models in order to decide what constitutes knowledge and what does not. It therefore requires broad and deep knowledge of the area of study, even more so than data mining alone.
Understanding of the study domain
As in any type of research, it is essential to be very clear about the limits and objectives of what we intend to do. It is very easy to lose our way in the infinite ocean of data at our disposal.
- Developing an understanding of the domain
- Discovery of relevant prior knowledge
- Defining the KDD objective
This step is where we identify the most important sources of information and who controls them. It is also important to include all related metadata and to gauge the volume and formats of the data.
It is recommended that any essential information that exists only on physical media be digitized before starting the KDD activities.
KDD refers to the general process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms to extract patterns from data. The distinction between the KDD process and the data mining step (within the process) is a central point of this article. Additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and appropriate interpretation of the mining results, are essential to ensure that useful knowledge is derived from the data. Blind application of data mining methods (correctly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.
KDD has evolved and continues to evolve from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is to extract high-level knowledge from low-level data in the context of large data sets.
The KDD process can be seen as a multidisciplinary activity that encompasses techniques outside the scope of any particular discipline, such as machine learning. In this context, there are clear opportunities for other AI fields (apart from machine learning) to contribute to KDD. KDD places special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes the scaling and robustness properties of modeling algorithms for large noisy data sets.
Data cleaning and pre-processing
Currently available datasets are usually incomplete (missing attribute values), have noise (errors and outliers), or have inconsistencies (discrepancies in the data collected).
- Elimination of noise and outliers.
- Use of prior knowledge to eliminate inconsistencies and duplicates.
- Selection and use of strategies to handle missing information in the datasets.
This “dirty data” can confuse the mining process and lead to invalid or unreliable results.
Pre-processing and cleaning are intended to improve both data quality and mining results. Remember that running complex analyses and mining large amounts of data can take a long time, so anything we can do to shorten that time will always be helpful.
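The three cleaning tasks listed above (noise and outlier elimination, duplicate removal, and handling of missing values) can be sketched with plain Python; the raw values, the median imputation, and the cutoff factor are all illustrative choices, not prescribed by KDD:

```python
import statistics

# Hypothetical raw measurements: duplicates, a missing value (None),
# and one suspicious extreme reading.
raw = [12.0, 12.0, 13.5, None, 14.0, 13.8, 98.0, 13.2]

# 1. Remove exact duplicates while preserving order.
seen, deduped = set(), []
for v in raw:
    if v not in seen:
        seen.add(v)
        deduped.append(v)

# 2. Handle missing information: impute with the median of observed values.
observed = [v for v in deduped if v is not None]
med = statistics.median(observed)
imputed = [med if v is None else v for v in deduped]

# 3. Eliminate outliers with a robust median-absolute-deviation rule;
#    the cutoff factor of 5 is an illustrative choice, not a standard.
center = statistics.median(imputed)
mad = statistics.median(abs(v - center) for v in imputed)
cleaned = [v for v in imputed if abs(v - center) <= 5 * mad]
```

A median-based rule is used here rather than a mean-and-standard-deviation one because a single extreme value inflates the standard deviation enough to mask itself in small samples.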
Mining is an exploration: we delve into the vastness of the data and gradually discover the patterns, models, and relationships present in it.
And in this exploration, one of our most useful tools is the algorithm.
What is an algorithm? Basically, an algorithm is a set of instructions or rules, implemented in a computer program, that allows us to arrive at a result or solution.
In the case of data mining, an algorithm allows us to process a dataset to obtain new information about that same dataset.
In general, data mining involves three steps: selecting the task, selecting the algorithm (or algorithms), and applying it.
Following its pre-established rules, the algorithm looks for the patterns and models that interest us; these may include classification trees, regression models, clusters, and mixed models, among others.
Most data mining methods are based on proven techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often bewilder the novice data analyst and the expert alike.
It should be emphasized that, of the many data mining methods advertised in the literature, there are really only a few fundamental techniques.
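To make the idea of "a set of pre-established rules that processes a dataset" concrete, here is a minimal sketch of one such fundamental technique, k-means clustering, in one dimension; the data, the starting centers, and the fixed iteration count are all made up for illustration:

```python
# Minimal k-means clustering on one-dimensional data, standard library only.
def kmeans_1d(data, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, clusters = kmeans_1d(data, centers=[0.0, 10.0])
# The two centers converge toward the means of the two obvious groups.
```

The algorithm knows nothing about what the numbers mean; it simply applies its rules and returns new information about the dataset (its group structure), which is exactly the sense of "mining" described above.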
Interpretation of mined patterns
It is important that we understand the difference between two key terms: patterns and models.
Patterns: these are local structures that make statements only about a restricted region of the variable space. This has important applications in anomaly detection, such as detecting faults in industrial processes or fraud in the banking system.
Models: these are global structures that make statements about any point in the measurement space. For example, a model can predict the value of some other variable.
In the interpretation stage, we evaluate the patterns and models found in the analyzed data.
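The contrast between the two can be sketched in a few lines; all the numbers below are invented, and the 2-standard-deviation cutoff is an illustrative choice:

```python
# A global MODEL: a least-squares line fitted to all (x, y) points,
# able to make a statement (a prediction) at ANY x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):  # valid at any point of the measurement space
    return intercept + slope * x

# A local PATTERN: a statement about a restricted region only, e.g.
# "transaction amounts far above the bulk are anomalous".
amounts = [10, 12, 11, 13, 9, 11, 10, 95]
mu = sum(amounts) / len(amounts)
sd = (sum((a - mu) ** 2 for a in amounts) / len(amounts)) ** 0.5
anomalies = [a for a in amounts if abs(a - mu) > 2 * sd]
```

The fitted line says something about every possible x, while the anomaly rule says something only about the extreme tail of the amounts, which is precisely the local/global distinction drawn above.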
The results must be presented in an understandable format. For this reason, visualization techniques are important for making the results useful, since mathematical models or descriptions in text format can be difficult for end users to interpret.
From this point in the process it is possible to return to any of the previous steps.