The difference between machine learning and data mining

Data mining and machine learning are two areas of computer science that are rapidly becoming indispensable to businesses, from manufacturing to finance and from healthcare to media. Because they are often mentioned in the same breath, it’s easy to think that they are one and the same. They are not interchangeable, although each can contribute to the success of the other.

Both are steps within a process that helps businesses with decision making, planning, and the optimisation of systems. Data mining is more closely aligned with data analytics, particularly the analysis of big data, while machine learning is a subset of artificial intelligence. Machine learning uses training datasets to teach computers to make sense of data on their own and carry out certain tasks.

What is data mining?

Data mining refers to exploring large amounts of existing data (such as a data warehouse) and unearthing insights that we perhaps weren’t even looking for. Computers comb through vast datasets far faster than we could manually, bringing patterns and anomalies to the surface.

New data on our behaviour and preferences is constantly emerging and being collected via our activity online. From the search terms we use, to the items we browse and buy, our actions are tracked by cookies which build a digital portrait of who we are. That can then be used for marketing purposes such as trend forecasting and to personalise our experience on various sites.

The uses of data mining can sometimes feel intrusive, but they can also streamline processes, offering greater efficiency and smoother customer service. This applies in sectors as varied as retail and healthcare. In retail, if you shop regularly at a particular site, you want your transaction to be as fuss-free as possible, so you may allow the site to retain your credit card details for future purchases. This can feel risky to some; however, fraud detection systems learn the traits that fraudulent transactions tend to share and can flag uncharacteristic purchasing behaviour on your account in real time.

In healthcare, it’s important for your existing data to be accessible to any specialist who you may be referred to. They are able to view your medical history, previous test results, and any other vital information that can help them support you and diagnose any conditions.

Data mining and machine learning overlap in the methods they use. Data mining methods vary depending on the aim of the initial data exploration: what useful information are we looking for?

The knowledge discovery in databases (KDD) process includes data mining techniques such as clustering, classification, and regression. The same techniques underpin machine learning algorithms, both supervised (classification and regression) and unsupervised (clustering).

Cluster analysis

This kind of analysis groups data objects so that clusters become apparent, often in a visualisation. Objects in the same cluster are similar to one another but not necessarily identical, and the further apart two objects are, the more different they are. Commonly used in customer profiling, the clusters help with segmentation for targeted emails and marketing campaigns.
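As a rough illustration of how clusters can be found, here is a minimal sketch of k-means, one common clustering method (the article does not name a specific algorithm, so this choice is an assumption). The customer data, the function name kmeans and the choice of two clusters are all hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters with plain k-means."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend per order).
customers = [(1, 10), (2, 12), (1, 11), (20, 200), (22, 210), (21, 190)]
centroids, clusters = kmeans(customers, k=2)
```

With data this cleanly separated, the occasional shoppers and the frequent big spenders end up in separate clusters, which is the kind of segmentation a marketing team might act on. Real customer data is rarely this tidy, and choosing the number of clusters is itself a judgment call.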

Classification analysis

This separates data into different classes. It is similar to clustering, except that data scientists define the labels for each class in advance. The data mining algorithm then learns to decide which class a data object belongs to (for example, whether an email is spam). More complex classification analysis often uses decision trees to help sort the data.
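To make the spam example concrete, the sketch below uses a small naive Bayes classifier, which is one simple way to learn classes from labelled examples (an assumption for illustration, not a method the article prescribes). The training messages, labels and function names are hypothetical.

```python
import math
from collections import Counter

def train(labelled_messages):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labelled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: how common this class is in the training data.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled examples supplied by a person.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, class_counts = train(training)
label_a = classify("claim your free prize", word_counts, class_counts)
label_b = classify("agenda for the team meeting", word_counts, class_counts)
```

The key difference from clustering is visible in the training data: every example arrives with a label that a person has already assigned.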

Outlier or anomaly detection

What about when data analysis offers up an unexpected pattern or behaviour? These are known as anomalies or outliers. The term outlier was popularised by Malcolm Gladwell but originated in statistics. Outliers often offer critical information because they deviate from the norm, showing that something requires attention. This method is used in fraud detection, health monitoring, and monitoring stocks and shares.
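One simple way to flag outliers is to measure how many standard deviations each value sits from the mean (its z-score). The sketch below assumes that approach; the purchase amounts, the function name find_outliers and the threshold of 2.0 are hypothetical choices rather than a standard.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # All values are identical, so nothing stands out.
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card purchases, in pounds.
purchases = [20, 22, 19, 21, 23, 20, 22, 500]
flagged = find_outliers(purchases)
```

A production fraud system would use far richer signals than a single amount, and an extreme value also inflates the mean and standard deviation it is measured against, so more robust measures are often preferred.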

Association rule learning

This shows relationships between variables in a large database and is sometimes referred to as dependency modelling. Picking out variables that frequently occur together in a dataset can be useful for forecasting customer behaviour and for planning e-commerce site architecture based on, for example, what people put in their baskets and which pages they visit.
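Basket analysis of this kind is usually described with two measures: support (how often items appear together) and confidence (how often the second item appears given the first). The sketch below computes both for pairs of items; the baskets, the function name pair_rules and the minimum support of 0.5 are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Confidence of "a -> b" is the share of baskets with a that also hold b.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "milk"],
    ["butter", "jam"],
]
rules = pair_rules(baskets)
confidence = {(a, b): conf for a, b, _, conf in rules}
```

Here every basket containing jam also contains butter, so the rule "jam, therefore butter" has a confidence of 1.0, while the reverse rule is weaker. That asymmetry is what makes such rules useful for recommendations.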

Regression analysis

Often used in prediction and forecasting, this method identifies and analyses the relationship among variables. It shows how the dependent variable changes when one of the independent variables is altered. The reverse does not necessarily hold: altering the dependent variable does not imply a corresponding change in the independent variables.
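The simplest case is a straight line fitted through the data by ordinary least squares. The sketch below assumes one independent variable; the advertising and sales figures and the function name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (independent) and sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(ad_spend, sales)

# Forecast sales for a spend level not yet seen in the data.
forecast = slope * 6 + intercept
```

The fitted slope describes how sales move when spend changes. It says nothing about what would happen to spend if sales were changed by some other means.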

What is machine learning?

Machine learning allows computers to analyse data and identify patterns, improving their performance with little human intervention. Support vector machines (SVMs) are a set of supervised learning algorithms used for training computers in classification and regression, as well as outlier detection. An SVM finds the decision boundary that separates the classes while maximising the margin, which is the distance between the boundary and the nearest data points of each class. With two features this boundary is a line; with more features it is a hyperplane. There are numerous tutorials online about how to set up an SVM in Python for pattern recognition. Applications of SVMs include facial recognition, intrusion detection, email classification, news article categorisation, classification of genes, and handwriting recognition.
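For a feel of what an SVM does, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss. This is a simplified teaching version under stated assumptions, not how a production library implements it; the data points, the function names and the values for learning_rate, reg and epochs are hypothetical.

```python
def train_linear_svm(points, labels, epochs=2000, learning_rate=0.01, reg=0.001):
    """Learn weights w and bias b for a linear boundary w.x + b = 0."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # Regularisation gently shrinks w; a smaller w means a wider margin.
            w[0] -= learning_rate * reg * w[0]
            w[1] -= learning_rate * reg * w[1]
            if margin < 1:
                # The point is misclassified or inside the margin: push the
                # boundary away from it.
                w[0] += learning_rate * y * x1
                w[1] += learning_rate * y * x2
                b += learning_rate * y
    return w, b

def predict(point, w, b):
    """Return +1 or -1 depending on which side of the boundary the point lies."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1

# Hypothetical two-feature data with labels -1 and +1.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
predictions = [predict(p, w, b) for p in points]
```

With two features the learned boundary is a line. The same update rule works unchanged in higher dimensions, where the boundary becomes a hyperplane.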

Decision trees are often used in machine learning algorithms for classification and regression problems. These are like flow charts which offer routes for alternative choices at each level of the decision-making process, ultimately leading to a conclusion. Random forests, or random decision forests, correct the tendency of decision trees to overfit their training set by combining the predictions of many trees, each built from a random sample of the data. They are frequently used as “blackbox” models for business intelligence, as they generate reasonable predictions across a wide range of data but require little configuration. Blackbox is a term used when the workings of the computer’s “thinking” become opaque: we don’t know how the computer has come to the conclusion that it has. This level of autonomy can also be seen in computers trained through reinforcement learning to play games like checkers, chess, and Go.
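The flow-chart idea, and the way a forest outvotes a single tree, can be sketched as follows. The three trees here are written by hand purely for illustration; in a real random forest each tree is learned from a random sample of the training data. All feature names, thresholds and decisions are hypothetical.

```python
from collections import Counter

# Each internal node is (feature, threshold, branch_if_below, branch_if_at_or_above).
# A leaf is simply a string holding the decision.
tree_a = ("income", 30000, "decline", ("debt", 10000, "approve", "decline"))
tree_b = ("debt", 15000, ("income", 25000, "decline", "approve"), "decline")
tree_c = ("years_employed", 2, "decline", "approve")

def predict_tree(node, applicant):
    """Follow the flow chart from the root until a leaf decision is reached."""
    while not isinstance(node, str):
        feature, threshold, below, at_or_above = node
        node = below if applicant[feature] < threshold else at_or_above
    return node

def predict_forest(trees, applicant):
    """Let every tree vote and return the most common decision."""
    votes = Counter(predict_tree(tree, applicant) for tree in trees)
    return votes.most_common(1)[0][0]

applicant = {"income": 40000, "debt": 12000, "years_employed": 5}
single_tree_decision = predict_tree(tree_a, applicant)
forest_decision = predict_forest([tree_a, tree_b, tree_c], applicant)
```

The first tree alone declines this applicant because of one strict threshold, but the other two trees outvote it. Averaging over many trees in this way is what dampens the quirks any single tree picks up from its training data.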

Deep learning is a subset of machine learning that aims to imitate the workings of the human brain with the construction of artificial neural networks. More specifically, deep learning uses neural networks with three or more layers. When it was released in 2020, GPT-3, created by OpenAI with 175 billion parameters, was among the largest neural networks ever built. This is still only a fraction of the number of connections in a human brain, and its output often still needs human oversight and editing, but it is extremely sophisticated and successful in natural language processing. Neural networks need access to vast amounts of data for training. An open-source framework such as Hadoop allows large datasets to be processed across clusters of computers, often in the cloud. GPT-3 was trained largely on text gathered from the internet, including a filtered version of Common Crawl, and can perform tasks it was never explicitly trained on. It even wrote an article for The Guardian in 2020.

Deep learning with an MSc Computer Science

Take a deep dive into machine learning and how data mining assists the process with a 100% online master’s in computer science from the University of Sunderland.

Whether you’re interested in working with leaders in machine learning techniques such as Amazon or the tech start-ups of tomorrow, a master’s course will give you the depth of knowledge to progress in this innovative and exciting space.