Is “low code” data mining feasible? That is, can data mining processes be built with little or no programming? The short answer is yes.
Several years ago I discovered KNIME (Konstanz Information Miner), a powerful and versatile data mining and analytics tool that requires no programming knowledge or computer science expertise to get the most out of it. The program originated at the University of Konstanz in Germany, and the company behind it is now commercially based in Zurich, Switzerland.
While no computer science background is required, getting the most out of the data mining and analytics features does call for an understanding of the underlying processes, so as to take full advantage of its capabilities in the form of components and functional nodes.
KNIME is free and open source, built in Java on the Eclipse platform. It integrates a wide variety of components for machine learning and data mining, as well as for data preprocessing and modeling. It started out as a research tool for the pharmaceutical industry and is now used in a wide range of fields, including CRM, business intelligence, text mining, financial data analysis, and the life sciences.
KNIME has the capabilities to automate the data capture and analysis processes of any type of organization.
It integrates with Python, R, and other programming languages to extend its functionality even further.
It has functionalities for machine learning, natural language processing and other artificial intelligence techniques.
Even if you are not dedicated to data mining or analytics, KNIME also works well as a process automation tool (almost an RPA). Here is a real and simple use case:
Case 1: Monthly process
Every month (more or less) I received a file from a supplier containing the accumulated usage levels of certain communication services. It arrived as an Excel file, duly organized in fixed columns, with the consumption detail of each client as of the report's issue date. Each delivery required a series of modifications/corrections to generate the respective usage reports:
- Modification of a column for the elimination of country codes.
- Elimination of non-relevant columns.
- Elimination of records with null consumption.
- Generation of monthly and accumulated consumption graphs.
- Creation of a new, corrected Excel file for later use.
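For readers who prefer to see the logic spelled out, the same cleanup steps can be sketched in a few lines of pandas. The column names and sample values below are invented for illustration; in the KNIME flow each step corresponds to a node rather than a line of code.

```python
import pandas as pd

# Hypothetical sample of the supplier report (column names are assumptions).
raw = pd.DataFrame({
    "client": ["Acme", "Beta", "Gamma"],
    "phone": ["+56 22123456", "+56 22987654", "+56 22555555"],
    "internal_code": ["X1", "X2", "X3"],   # a non-relevant column
    "consumption": [120.5, 0.0, 87.3],
})

# 1. Strip the country-code prefix from the phone column.
clean = raw.assign(phone=raw["phone"].str.replace(r"^\+\d+\s*", "", regex=True))

# 2. Drop columns that are not relevant to the report.
clean = clean.drop(columns=["internal_code"])

# 3. Remove records with null consumption.
clean = clean[clean["consumption"] > 0].reset_index(drop=True)

# 4. Write the corrected workbook for later use
#    (requires an Excel engine such as openpyxl):
# clean.to_excel("corrected_report.xlsx", index=False)
```

The chart-generation step is omitted here; in KNIME it is simply another node appended to the same flow.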
These routines had to be repeated with each delivery. The KNIME solution is summarized in the following flow.
Once the flow was built, each new delivery simply meant executing the process, and in less than 30 seconds the new Excel workbook was created with the updated data. In my opinion, this alone justifies the use of KNIME as a data processing and automation tool.
Case 2: Creating an Excel workbook with additional sheets
For the purposes of this example, the data source will be three separate files, although in practice the data could come from direct queries to one or more databases.
This flow produces an Excel file with three separate sheets, one per payment type. It generates the workbook and, through flow control (verifying that the file has been created), adds a sheet for cash payments. A wait of “n” seconds is then introduced before adding the third and final sheet, for check payments.
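As a point of comparison, the same result can be sketched in pandas. The records and column names below are invented; note that in plain Python no wait or file-existence check is needed, because a single `ExcelWriter` context writes all the sheets in one pass, whereas the KNIME flow sequences the sheet-appending nodes explicitly.

```python
import pandas as pd

# Hypothetical payment records; in the real flow these come from three
# separate files or from database queries.
payments = pd.DataFrame({
    "invoice": [1001, 1002, 1003, 1004],
    "payment_type": ["card", "cash", "check", "cash"],
    "amount": [150.0, 80.0, 200.0, 45.0],
})

# One sheet per payment type, all in a single workbook
# (requires the openpyxl engine for .xlsx output).
with pd.ExcelWriter("payments.xlsx") as writer:
    for ptype, subset in payments.groupby("payment_type"):
        subset.to_excel(writer, sheet_name=ptype, index=False)
```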
Extensions for Big Data management include access to Apache Spark and the Apache Hadoop ecosystem, in addition to extensions for Azure, AWS, and Google Cloud Platform.
Connectors to the major data platforms mean that KNIME can extract from almost any data source, including the most popular databases on the market: MySQL, PostgreSQL, and others are duly supported, as well as almost any file type, including web pages (HTML). On the Big Data side, it also connects to Cloudera and similar platforms.
KNIME has the necessary components to perform data cleansing and transformations of all kinds, enabling complex data preprocessing.
For machine learning tasks, KNIME provides tools for data cleaning, data partitioning, model training, running various ML algorithms, and visualizing the results.
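As a rough illustration of the pipeline such a workflow assembles from nodes (partitioning, learner, scorer), here is the equivalent sketch in scikit-learn, using the classic iris dataset; the specific algorithm and split ratio are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Partitioning node: split the data into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learner node: train a decision tree classifier.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Scorer node: evaluate accuracy on the held-out test set.
acc = accuracy_score(y_test, model.predict(X_test))
```

In KNIME, each of these three steps is a node you drag onto the canvas and connect, with no code required.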
In summary, KNIME is a tool worth trying that can solve a number of problems in the ETL domain.
This article is the first in a series that will explore the tool in greater depth, with concrete examples of its use in machine learning and data science.