The process of data augmentation - Part II

Discover how to perform data augmentation, the related challenges, and how to resolve them
Tue 10 Nov 2020

This article is the second part of our Data Augmentation series. The first part was a general introduction to data augmentation and its principles, with specific examples of how to enrich and improve your datasets. In this second part, we dig deeper into the technical intricacies of data augmentation.

Understanding the data

Before performing data augmentation, the very first step is to understand your data, its features, and its limitations. This step is essential to guarantee the success of a future machine learning model or project, yet it is sometimes overlooked.

Once you have gained a detailed understanding of your data, you can choose the amount and the range of transformations to apply in the augmentation.

If this initial analysis phase is not performed thoroughly, you risk transforming the data so heavily that the machine learning algorithm can no longer learn from it.

Integrating augmentation in the training process

How a machine learning model is trained

Training a machine learning model, in general, is done in the following way:

  • The dataset is passed through the machine learning model multiple times, so the model can learn as much as possible from seeing the data several times.
  • Each sample is loaded and converted to a format the model understands (for example, converting the image to black and white). Multiple transformations can be applied to help the model learn (for example, increasing the contrast).
  • After the transformations are applied, the data sample is passed through the model: the model sees the sample and predicts a result.
  • Learning happens by comparing the prediction to the target (the reality, also known as the “ground truth”) and adjusting the model based on their differences.
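The steps above can be sketched as a minimal training loop. `TinyModel`, `to_grayscale`, and `train` are hypothetical names used only for illustration, assuming images are NumPy arrays:

```python
import numpy as np

def to_grayscale(rgb_image):
    """Convert an H x W x 3 image to grayscale (the 'conversion' step)."""
    return rgb_image.mean(axis=-1)

class TinyModel:
    """A hypothetical one-weight model, just to make the loop runnable."""
    def __init__(self):
        self.w = 0.0

    def predict(self, x):
        return self.w * x.mean()

    def update(self, x, error, lr=0.1):
        # Learn from the difference between prediction and target.
        self.w += lr * error * x.mean()

def train(model, dataset, targets, n_epochs=3):
    for epoch in range(n_epochs):            # pass the dataset multiple times
        for sample, target in zip(dataset, targets):
            x = to_grayscale(sample)         # convert to a model-friendly format
            prediction = model.predict(x)    # the model sees the sample
            error = target - prediction      # compare to the ground truth
            model.update(x, error)           # adjust the model accordingly
```

Real frameworks add batching, optimizers, and loss functions on top, but the load → convert → predict → compare → update cycle is the same.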

Where data augmentation fits in

In general, data augmentation is done during the data conversion/transformation phase of training the machine learning algorithm. The augmentation is applied to the initial data sample, and sometimes also to the data labels. However, one limitation of this approach is the computation time, which can sometimes be prohibitive.
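A minimal sketch of this on-the-fly approach, assuming NumPy image arrays with values in [0, 1]; the `augment` and `load_batch` names are illustrative, not from any specific library:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """On-the-fly augmentation: a random horizontal flip and a small
    random brightness shift, drawn freshly every time a sample is loaded."""
    if rng.random() < 0.5:
        image = image[:, ::-1]               # horizontal flip
    return np.clip(image + rng.uniform(-0.1, 0.1), 0.0, 1.0)

def load_batch(dataset):
    """Each call (and hence each epoch) sees slightly different versions
    of the same underlying samples."""
    return np.stack([augment(img) for img in dataset])
```

Because the random transformations are applied at load time, no extra storage is needed, but every load pays the transformation cost.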

One alternative is to generate a fixed set of augmented samples and save them as a new dataset. This way, one can pass the augmented dataset directly to the machine learning pipeline without applying augmentation while loading the files. Yet, this alternative has its own limitations: it requires more storage, the model can learn the fixed augmented samples too well (overfitting), and you risk running back into the problem of a limited number of examples.
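A sketch of this pre-generation strategy, again assuming NumPy arrays in [0, 1]; `build_augmented_dataset` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(42)

def flip(image):
    return image[:, ::-1]

def build_augmented_dataset(images, labels, copies=2):
    """Generate a fixed augmented dataset ahead of training.
    Trades extra storage for zero augmentation cost at load time."""
    aug_images, aug_labels = list(images), list(labels)
    for img, lab in zip(images, labels):
        for _ in range(copies):
            noisy = np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)
            aug_images.append(flip(noisy))
            aug_labels.append(lab)           # these transforms leave the label unchanged
    return np.stack(aug_images), np.array(aug_labels)

# The result could then be written once (e.g. with np.savez) and loaded
# directly by the training pipeline, skipping per-load augmentation.
```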

Another possibility for data augmentation is generating data from scratch, i.e., a synthetic dataset. You can either create the entire dataset from scratch or generate new data samples on the fly.
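As a toy illustration of a fully synthetic dataset, one could generate labelled geometric shapes from scratch; every name and parameter here is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_synthetic_sample(size=16):
    """Generate one synthetic sample from scratch: a white square or a
    thin white bar on a black background, together with its class label."""
    image = np.zeros((size, size))
    label = int(rng.integers(2))             # 0 = square, 1 = bar
    top, left = rng.integers(0, size - 6, size=2)
    if label == 0:
        image[top:top + 5, left:left + 5] = 1.0   # 5x5 square
    else:
        image[top:top + 5, left:left + 2] = 1.0   # 5x2 bar
    return image, label

# An arbitrarily large dataset, generated entirely from scratch.
dataset = [make_synthetic_sample() for _ in range(100)]
```

The same generator can be called during training to produce fresh samples on the fly instead of a fixed set.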

The science of choosing which features to augment

What is a feature

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed.

Examples of features in image data are:

  • Shapes

  • Colors and variations

  • Objects in the background

  • Size of the main object

  • Position/side seen of the main object, etc.

Identifying the right features

When applying image augmentation, one should be able to distinguish the features that are predictive of the target from those that are not. Remember, as explained above, the target is what one is trying to predict.

  • Example of a feature that is predictive of the target: let's take the simple example of augmenting an image of a dog. If one mirrors the image, the output is still an image of a dog, so the target is not altered by the augmentation. But if one changes the color of the dog, the image might depict another animal to the ‘eyes’ of a machine learning model, so the correct target might not be a dog anymore.

  • Let's take a more complex example: a text angle detection problem. If one takes an image with angled text and moves the text up, down, left, or right, the angle stays the same, so this augmentation does not change the target. But if one starts rotating the image, the angle of the text changes, and the augmented image no longer matches its initial target. To avoid mismatched data, one must also change the target accordingly.
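The text-angle example can be sketched as two transforms: one that leaves the target untouched, and one that must update it. `shift_text` and `rotate_90` are illustrative helpers assuming the angle target is measured in degrees:

```python
import numpy as np

def shift_text(image, angle, dx, dy):
    """Translation: the text moves, but its angle (the target) is unchanged."""
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    return shifted, angle                    # target stays the same

def rotate_90(image, angle):
    """Rotation: the pixels change AND the target must change with them."""
    rotated = np.rot90(image)                # rotate 90 degrees counter-clockwise
    new_angle = (angle + 90) % 360           # update the target accordingly
    return rotated, new_angle
```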

When features are easy to spot

This is easy when one is dealing with readily understandable data, where the predictive features of each sample can be quickly spotted and altered individually.

For example, let's say you are interested in the size of the object in the image. If you zoom in on the image, you will not alter the object's other features, like colour or shape. If your objective is to detect or classify the object, its size does not matter, and the target of the task stays the same after zooming. But if your objective is to measure the size of the object, then the apparent size is a predictive feature, and zooming will change it - so your target must be updated accordingly. In both cases, the feature (here, the apparent size of the object) can be altered without affecting other features.
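A sketch of this zoom example, using a simple nearest-neighbour upscale on a NumPy array; the `zoom` helper and the integer `factor` are assumptions made for illustration:

```python
import numpy as np

def zoom(image, size_label, factor=2):
    """Nearest-neighbour zoom by an integer factor. Colours and shapes are
    preserved; the apparent-size label must be scaled along with the image."""
    zoomed = image.repeat(factor, axis=0).repeat(factor, axis=1)
    return zoomed, size_label * factor       # adjust the size target

# For a detection/classification task the class label would stay untouched:
# the object is simply bigger, but it is still the same object.
```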

When features are complex to spot

In slightly different circumstances, when one is dealing with complex image data like X-ray images, the predictive features can be hard to spot by a machine learning engineer alone. In these situations, an expert opinion is often needed to help understand the predictive features of the data. In the X-ray case, it can be a radiologist or surgeon, for example.

Technical complexity

In both cases, one may be limited by the technical difficulty of manipulating each feature separately: one can change the brightness of an image without altering other features, but one cannot easily remove a person from an image, or move them, in a realistic way!
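The easy side of this contrast, changing brightness in isolation, takes only a few lines (assuming a NumPy image normalized to [0, 1]); the hard side, realistically removing or moving a person, has no comparably simple counterpart:

```python
import numpy as np

def change_brightness(image, delta):
    """Brightness can be altered in isolation: every other feature
    (shapes, positions, relative colours) is left untouched."""
    return np.clip(image + delta, 0.0, 1.0)
```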

The careful choice of features

It is important to choose features carefully, and to check whether or not the target is affected by a transformation. If the objective of the augmentation is to create more data for the same target, one should, as far as possible, avoid transforming the features that have predictive value and focus on the other features of the data instead.

But one can also decide to augment the predictive features. In that case, one must apply a corresponding transformation to the target, to make sure the augmented features still match it. As a general rule, though, features with predictive value should only be transformed when the target is updated along with them.

Data augmentation: until where?

Now, how far can data augmentation be pushed? The range and amount of augmentation can vary, depending on:

  • The objective: what one wants to achieve, e.g. rebalancing the distribution of certain targets, or increasing the dataset size.
  • The complexity of the data type: some data is much harder to manipulate than others.
  • The complexity of separating the predictive features in the data sample: if they are easy to separate, the range of possible augmentations is broader.
  • The nature of the data and the target: for example, one cannot easily or realistically move a tumor on a medical image, so only very small transformations, such as slight changes in image quality, can be applied.


While data augmentation is a great way to solve issues related to data collection, it will never replace large, balanced datasets. If your organization has never thought about which data to collect, in which format, and for which purpose, it is not too late.

Kantify helps companies succeed in their AI journey. We are experts in transforming data, developing data pipelines and AI models.

If you are curious to discover more about AI, subscribe to our monthly newsletter where we regularly share insights about applied machine learning.