What is Large Data Volume, and How can Data Discovery in Large Data Volume help?
With the growing reach of social media and most businesses moving online, data generation has risen sharply. Consider, for example, searching for a product on an online shopping website and later receiving ads and suggestions for that product. This happens because the moment we search, data is generated, and that data is later used to serve relevant ads and suggestions. A large volume of data that grows exponentially is known as Large Data Volume. Industry estimates suggest that the total amount of data in the world roughly doubles every two years. The size and complexity of such data are so high that it is extremely difficult to process using traditional approaches.
Large volumes of data, if processed appropriately, can deliver breakthroughs for companies in many fields. For example, when a customer walks into a bank, a data analyst can use the Large Data Volume to review the customer's profile and understand their preferences. This helps the bank offer the customer relevant products and offers. Applied across all customers, this can raise the bank's revenue significantly.
Problems Related to the Growing Volumes of Data and How Privacy is Compromised
On the other hand, there is a tradeoff between data security and the increasing volume of data. If sensitive information such as a customer's personal details, bank account number, or credit card details can be accessed by others, or is shared without consent, it constitutes a data privacy breach. Such breaches can happen in several ways: an employee may deliberately sell the data to another company, the data may leak onto the internet unknowingly, or it may be shared with third-party companies for testing and application development. In each case the privacy of the data is compromised, since the customer is unaware of the breach.
What is Data Discovery in LDV?
Data Discovery helps prevent such breaches by identifying personal and sensitive information based on its environment. Once identified, the sensitive data can be protected by passwords or placed under restricted access, and if it must be shared with a third-party company for testing, it can be masked so that testing continues without any privacy violation. In a bank, for example, sensitive information such as credit card numbers and account numbers can be identified within files. Data discovery scans the data present inside the system to locate this sensitive information, and with artificial intelligence and machine learning it can go beyond scanning just the metadata.
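As a minimal illustration of what pattern-based discovery can look like, the sketch below scans text for sensitive values using regular expressions. The patterns and names here are simplified assumptions for illustration only; a real discovery tool such as iEDPS uses far richer deterministic and probabilistic techniques.

```python
import re

# Simplified, illustrative patterns -- real discovery tools use
# much more robust detection (checksums, context, ML models, etc.).
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def discover_sensitive(text):
    """Return (label, match) pairs for every sensitive value found."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings

sample = "Card 4111 1111 1111 1111 issued to jane.doe@example.com"
print(discover_sensitive(sample))
# [('credit_card', '4111 1111 1111 1111'), ('email', 'jane.doe@example.com')]
```

Each finding records what kind of sensitive value was seen and where, which is the raw material for the reporting and masking steps discussed later.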
Data discovery in Small Data Sets and Why it is Difficult for Large Data Sets
Most data scientists and data analysts employ Python for data pre-processing and model building, commonly with libraries such as Pandas, NumPy, and scikit-learn. These libraries run on a single CPU core and are not scalable. They can fail while processing large datasets that do not fit into the available RAM, heating up and slowing down the machine. Libraries such as VAEX, KOALAS, and DASK address this, and among them DASK is the most efficient at handling large data volumes in Python.
Data Discovery in Large Datasets using DASK
DASK can efficiently perform parallel computations on a single machine using multi-core CPUs. It keeps data on disk and processes it in chunks, so memory consumption during computation stays low. Furthermore, the intermediate values generated during processing are discarded when the computation completes.
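The chunked, out-of-core idea can be sketched in plain Python: the function below sums one column of a CSV file while holding only a bounded number of rows in memory at a time. This is an illustration of the principle, not DASK's actual implementation.

```python
import csv
import os
import tempfile
from itertools import islice

def chunked_sum(path, column, chunk_size=1000):
    """Sum a numeric column while holding at most `chunk_size` rows
    in memory, mimicking how DASK streams chunks from disk."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
    return total

# Demo with a tiny temporary CSV; real workloads would be gigabytes.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("amount\n10\n20\n30\n40\n50\n")
print(chunked_sum(tmp.name, "amount"))  # 150.0
os.remove(tmp.name)
```

Because only one chunk is resident at a time, the peak memory footprint depends on `chunk_size` rather than on the size of the file.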
For instance, suppose there are four cards of different colours on a table, and the task is to separate them by colour with only a single person working on them. If the number of cards rises to 100 and then to 1000, the same task becomes difficult for one person, but if it is split among multiple people it can be completed effortlessly. This resembles discovery in large data volumes: the individual working alone is Pandas or NumPy, while the people working together resemble DASK. DASK can also run on a cluster of machines to process data efficiently, and if the machines in the cluster have different numbers of cores, DASK handles these variations internally with deftness. Since it supports Pandas DataFrames and NumPy data structures, it is easy to process large data sets with negligible differences in the coding format.
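The card analogy maps naturally to code: split the deck into chunks, let several workers tally their shares, and merge the results. The sketch below uses Python's standard-library thread pool to illustrate the split-and-merge idea; DASK itself schedules such partitions far more cleverly across cores or cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_colours(cards):
    """Tally one worker's share of the cards."""
    return Counter(cards)

def parallel_count(cards, workers=4):
    """Split the deck among workers and merge their tallies,
    analogous to DASK splitting a dataset into partitions."""
    step = max(1, len(cards) // workers)
    chunks = [cards[i:i + step] for i in range(0, len(cards), step)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_colours, chunks):
            total += partial
    return total

deck = ["red", "blue", "green", "yellow"] * 250  # 1000 cards
print(parallel_count(deck))
```

The merge step is what makes the split safe: each worker produces an independent partial result, and combining Counters gives the same answer as one person sorting the whole deck alone.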
How is it Handled in iEDPS?
Infosys Enterprise Data Privacy Suite (iEDPS) offers a solution to avoid data breaches by discovering sensitive information in both structured and unstructured data. Discovery is performed with deterministic and probabilistic techniques using machine learning, along with a multi-threaded pipeline approach.
Python processes can get stuck, or sometimes fail to complete, while handling a large volume of files. To avoid this, iEDPS leverages the DASK library to process data from large files in less time and without failures. After discovery, the user can either view the sensitive information in a report or mask it in the original files using the different masking techniques iEDPS provides.
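For a flavour of what masking can look like, the sketch below blanks out all but the last four digits of detected card numbers. It is a simplified, hypothetical stand-in for illustration; the actual masking techniques iEDPS provides are not shown here.

```python
import re

# Loose illustrative pattern for 13-19 digit card-like numbers.
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_cards(text, keep=4):
    """Replace each digit of a detected card number with '*',
    keeping only the last `keep` digits visible."""
    def _mask(m):
        out = []
        digits_seen = 0
        # Walk backwards so the trailing `keep` digits stay visible.
        for ch in reversed(m.group()):
            if ch.isdigit():
                digits_seen += 1
                out.append(ch if digits_seen <= keep else "*")
            else:
                out.append(ch)
        return "".join(reversed(out))
    return CARD.sub(_mask, text)

print(mask_cards("Pay with 4111 1111 1111 1111 today"))
# Pay with **** **** **** 1111 today
```

Masked output like this lets testing and development proceed on realistic-looking data without ever exposing the real values.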
About the Author:
Visakh Padmanabhan is a Systems Engineer – Python Developer from the iEDPS Data Discovery team.