How We Can Help with Feature Engineering and Data Augmentation for AI/ML Data Needs

Data Augmentation involves increasing the volume of training data by generating synthetic data from the existing sample data. It helps to increase the diversity of data and reduce overfitting while training machine learning models. On the other hand, Feature Engineering transforms raw data into features that improve the accuracy and performance of machine learning models.

Machine learning models learn from training data, so the success of an algorithm depends heavily on the data it is given. Choosing a suitable input data set for each algorithm considerably influences the performance and accuracy of the resulting model, and this is where feature engineering comes into the picture.
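To make the idea of turning raw data into features concrete, here is a minimal sketch in Python. The record layout (`timestamp`, `amount`, `channel`) and the function name `engineer_features` are assumptions chosen for illustration; they do not come from iEDPS or any particular library.

```python
from datetime import datetime
import math

def engineer_features(raw):
    """Derive model-ready features from one raw transaction record (hypothetical schema)."""
    ts = datetime.fromisoformat(raw["timestamp"])
    return {
        "hour": ts.hour,                               # time-of-day signal
        "is_weekend": int(ts.weekday() >= 5),          # Saturday=5, Sunday=6
        "log_amount": math.log1p(raw["amount"]),       # compress a skewed numeric range
        "channel_web": int(raw["channel"] == "web"),   # one-hot encoding of a category
        "channel_store": int(raw["channel"] == "store"),
    }

raw = {"timestamp": "2023-07-15T21:30:00", "amount": 120.0, "channel": "web"}
features = engineer_features(raw)
```

Which features are worth deriving depends on the algorithm and the domain; the ones above are only common examples.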

Data scientists are in great demand; according to some statistics, demand has grown by more than 600% since 2012. So, what do these data scientists do? By a commonly cited estimate, they spend about 80% of their time on data preprocessing: collecting the training data, cleansing and transforming it (handling missing data and null values, detecting outliers to reduce overfitting, normalizing the data), examining patterns, and refining and analyzing the datasets.
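The cleansing and transformation steps mentioned above can be sketched for a single numeric column as follows. This is an illustrative example only: median imputation, clipping at 1.5 times the interquartile range, and min-max scaling are assumed choices, and other techniques are equally valid.

```python
import statistics

def preprocess(values):
    """Impute missing values, clip outliers, and min-max normalize one numeric column."""
    # 1. Handle missing data: replace None with the median of the observed values.
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]

    # 2. Handle outliers: clip to the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    clipped = [min(max(v, low), high) for v in filled]

    # 3. Normalize: rescale to the range [0, 1].
    lo, hi = min(clipped), max(clipped)
    if hi == lo:
        return [0.0 for _ in clipped]
    return [(v - lo) / (hi - lo) for v in clipped]

ages = [23, 25, None, 31, 29, 27, 240]  # one missing value and one obvious outlier
scaled = preprocess(ages)
```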

In the digital era, most companies are adapting to data-driven business. Privacy, quality, and quantity are crucial aspects of data, and many business scenarios require access to production data. If production data is exposed, it can lead to significant data breaches. iEDPS (Infosys Enterprise Data Privacy Suite) offers various ways of ensuring the privacy of your data through data obfuscation or sanitization. This is achieved either by masking sensitive data (NPI or PHI) in the data source or by using Dynamic Data Masking (DDM), which limits the exposure of sensitive data to non-privileged users without physically changing the actual data.
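The difference between masking data at the source and masking it dynamically can be illustrated with a small Python sketch. This is not the iEDPS API; the field names, `mask_value`, `static_mask`, and `dynamic_view` are hypothetical names used only to show the concept.

```python
SENSITIVE_FIELDS = {"ssn", "card_number"}  # hypothetical sensitive columns

def mask_value(value, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]

def static_mask(record):
    """Static masking: produce a masked copy intended to replace the stored data."""
    return {k: mask_value(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}

def dynamic_view(record, privileged):
    """Dynamic masking: mask at read time only; the stored record is not changed."""
    return dict(record) if privileged else static_mask(record)

stored = {"name": "A. Kumar", "ssn": "123-45-6789"}
analyst_view = dynamic_view(stored, privileged=False)
admin_view = dynamic_view(stored, privileged=True)
```

In the dynamic case `stored` remains untouched, and only the view returned to a non-privileged user is masked.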

Data privacy violations are now reported worldwide. One such report describes a massive data breach involving Netflix, in which billions of logins were uncovered, exposing users' credentials, card details, watch history, and other account-related information.

 

Data Augmentation – iEDPS

iEDPS proposes a solution that augments data for use in AI/ML models while minimizing the risk of data breaches. As part of Data Augmentation, a system connects to the datastore or data source. By extracting the metadata and analyzing the sample data, the system learns the characteristics of the data. From these characteristics it collects metrics that help generate large datasets from minimal input data.
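As a rough illustration of this profile-then-generate idea, the sketch below collects simple metrics from a small sample and draws a larger synthetic dataset from them. It is an assumption-laden toy, not the iEDPS implementation: the functions `profile` and `generate`, the chosen metrics, and the sample columns are all hypothetical.

```python
import random
import statistics
from collections import Counter

def profile(rows):
    """Collect per-column metrics from a small sample of records."""
    metrics = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            metrics[column] = {"kind": "numeric",
                               "mean": statistics.mean(values),
                               "stdev": statistics.stdev(values),
                               "min": min(values), "max": max(values)}
        else:
            counts = Counter(values)
            metrics[column] = {"kind": "categorical",
                               "values": list(counts),
                               "weights": [c / len(values) for c in counts.values()]}
    return metrics

def generate(metrics, n, seed=0):
    """Draw n synthetic records from the collected metrics (columns sampled independently)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, m in metrics.items():
            if m["kind"] == "numeric":
                draw = rng.gauss(m["mean"], m["stdev"])
                row[column] = min(max(draw, m["min"]), m["max"])  # stay in observed range
            else:
                row[column] = rng.choices(m["values"], weights=m["weights"])[0]
        rows.append(row)
    return rows

sample = [{"age": 23, "plan": "basic"}, {"age": 35, "plan": "premium"},
          {"age": 41, "plan": "basic"}, {"age": 29, "plan": "basic"}]
synthetic = generate(profile(sample), n=1000)
```

Sampling each column independently ignores correlations between columns; a realistic system would need richer metrics to preserve them.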

Image datasets are generally augmented using techniques such as rotating, cropping, translating, zooming, resizing, and rescaling the images, taking into account metrics such as edge values and pixel coordinates.
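A few of these image transformations can be sketched in plain Python by treating a grayscale image as a list of pixel rows. Real pipelines typically use an imaging library; the helper names below are illustrative.

```python
def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left pixel is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Each transformed copy is an additional training example derived from one original.
augmented = [rotate90(image), flip_horizontal(image), crop(image, 0, 1, 2, 2)]
```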

Likewise, our data augmentation system analyses the data and then augments it as required. Metrics play a crucial role in this process. They differ across data types, domains, industry needs, machine learning algorithms, and so on, and the data is supplied on that basis.

 

Features of Data Augmentation in iEDPS

  1. Compares augmented data metrics with original data metrics to ensure the quality and usability of data
  2. Evaluates the attributes based on generated data (percentage of values in each attribute)
  3. Controls the output data by configuring custom algorithms
  4. Augments the data by using preset values or data present in repositories
  5. Generates or augments data from configured metrics when input data is unavailable
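The first two features above, comparing augmented metrics with the original and evaluating the percentage of values in each attribute, can be sketched as follows. The function names and the 5% tolerance are assumptions for illustration, not iEDPS settings.

```python
from collections import Counter

def proportions(values):
    """Share of each distinct value in an attribute."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def max_proportion_drift(original, augmented):
    """Largest absolute difference in value share between the two datasets."""
    p, q = proportions(original), proportions(augmented)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

original = ["basic"] * 6 + ["premium"] * 4       # 60% / 40%
augmented = ["basic"] * 58 + ["premium"] * 42    # 58% / 42%

drift = max_proportion_drift(original, augmented)
acceptable = drift <= 0.05  # hypothetical tolerance
```

A small drift suggests the augmented data keeps the value distribution of the original attribute; a large one signals that the generation settings need adjusting.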

 

Author: Gowripadma Murugesan

 

Author Details

Vijayalaxmi Suvarna

Vijayalaxmi Suvarna is a Senior System Engineer at the Infosys Center for Emerging Technology Solutions, where she leads the marketing initiatives for the PrivacyNext iEDPS Platform. Her focus includes user experience and online branding of Infosys Data Privacy offerings.
