Introduction:
Data is often called the new oil, but collecting data has been essential for as long as humanity has existed: the knowledge we acquire and the lessons we learn are all based on past data. With big data technologies and the cloud, organizations can capture information in far larger volumes than ever before, and analyzing that data helps them make informed decisions. However, this is not without its challenges.
Challenges:
Organizations face critical data challenges, such as:
- Data Privacy Concerns: Does the data used for analytics contain Personally Identifiable Information (PII)? Has anyone's privacy been compromised? How accurate are analytics results when they are computed on anonymized data sets? Are there biases in analytical models built on personal data?
- Data Privacy Laws & Regulations: More than 130 countries have laws, or are in the process of creating laws, for protecting PII. Since organizations operate across geographically distributed regions, they must comply with a wide variety of laws and regulations.
- Data Sharing: Organizations share data with third parties for market research and analysis. Such data must be secured and protected before it is shared; failing to do so can lead to data breaches.
- Data Volumes: Due to advancements in storage technologies, organizations can store and keep a large volume of data. However, they struggle to analyze data sets of this size.
Of the challenges described above, data sharing is one of the most significant. You may have data in your environment, but to run analytics on it you need the help of external or third-party vendors. In such a scenario, how do you mitigate the risk of exposing sensitive data outside the organization? There are several options:
- Redact: Remove all sensitive data before sharing it with external vendors. Meaningful analysis might not be possible, since the redacted data can be incomplete. Another challenge is the need to define redaction processes for heterogeneous data sets, which must be reviewed and vetted for every new data source.
- Data Analytics Team: Build data analytics competency within the organization by creating a group of experts and procuring the necessary tools. Finding the right people is the main challenge with this option. Given the complex landscape, you might also need to procure multiple tools to handle the various data sources.
- Differential Privacy: Build a differential privacy solution. Differential privacy techniques safeguard information about individuals by adding noise to the shared data sets. This approach has several drawbacks: a privacy budget limits the number of queries that can be run, and the solution may not guarantee complete privacy.
- Clean Room: Create centralized storage where your data is ingested and shared with external vendors only according to permissions. Using a cloud provider is recommended, since data can be shared easily and controls can be defined as needed. The challenge here is selecting the right cloud provider and building the storage.
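To make the differential privacy option more concrete, here is a minimal Python sketch of one common technique, the Laplace mechanism, applied to count queries. It is an illustration under stated assumptions rather than a production design: the class name `PrivateCounter`, its parameters, and the sample records are all hypothetical, and Python's `random` module is not suitable for real privacy guarantees. Note that this sketch adds noise to query results instead of to the shared data set itself, and it shows how a privacy budget caps the number of queries.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-centred Laplace noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivateCounter:
    """Answers count queries with Laplace noise and tracks a privacy budget."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        # Assumption: a seeded PRNG is fine for a demo, not for real deployments.
        self._rng = random.Random(seed)

    @property
    def remaining_budget(self):
        return self._remaining

    def noisy_count(self, predicate, epsilon):
        """Return a noisy count of matching records, spending `epsilon` of the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        # A count changes by at most 1 when one person is added or removed,
        # so the sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1.0 / epsilon, self._rng)


if __name__ == "__main__":
    people = [{"age": 34}, {"age": 61}, {"age": 72}]
    counter = PrivateCounter(people, total_epsilon=1.0, seed=42)
    print(counter.noisy_count(lambda p: p["age"] > 60, epsilon=0.5))
    print("budget left:", counter.remaining_budget)
```

A smaller `epsilon` gives stronger privacy but noisier answers, which is the accuracy trade-off mentioned above; once the budget is spent, further queries are refused.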
Among these options, the clean room is becoming the preferred choice for many organizations.
What is a Clean Room?
A clean room is a solution in which data is held in centralized storage, typically a cloud data lake, and shared with others for further processing. The data in the lake is ingested from heterogeneous data sources. Clean rooms can also be used to identify sensitive data.
Key features of a clean room include:
Data Discovery: Using clean rooms, organizations can collect data from diverse sources and apply regulatory templates to identify sensitive data. Products like Infosys Enterprise Data Privacy Suite (iEDPS), Microsoft Purview (formerly Azure Purview) or Amazon Macie can be used for this.
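As a rough illustration of what sensitive data discovery involves, the Python sketch below scans tabular rows with two regular expressions and reports which columns appear to hold PII. It does not reflect how iEDPS, Purview or Macie work internally; the pattern set, the function name `find_pii_columns`, and the sample rows are assumptions made for this example, and real regulatory templates cover far more data types.

```python
import re

# Illustrative patterns only; real regulatory templates are far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii_columns(rows):
    """Return {column name: sorted list of PII types detected in that column}."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return {column: sorted(types) for column, types in findings.items()}


if __name__ == "__main__":
    sample = [
        {"id": 1, "contact": "ann@example.com", "note": "SSN 123-45-6789 on file"},
        {"id": 2, "contact": "n/a", "note": "none"},
    ]
    print(find_pii_columns(sample))
```

Columns flagged this way could then be routed to masking or restricted access before any data leaves the lake.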
Data Sharing: Once data is in the lake, it can then be shared with external vendors via multiple approaches:
- Masked data – Data within the clean room can be masked using products like iEDPS to produce realistic but fictitious data. Relevant information can then be pulled from the data lake and shared securely with data analysts for insights. The drawback of this approach is that the data is altered before analysis, so the results may not be fully accurate.
- Secure Enclave – In this approach, the original data stays encrypted outside a secure enclave and is decrypted only inside it, where the analysis runs on the original values. Products like Opaque can be used to achieve this.
- Original data as it is – Depending on location and the laws applicable in a region, original data can sometimes be used as is for analytics. The correct controls must be in place to limit data access to authorized users. Cloud service providers like AWS, Azure and GCP provide such controls.
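The masked data approach above can be sketched in a few lines of Python. This example uses keyed hashing (HMAC) to replace sensitive values with stable pseudonyms, which is a simpler technique than the realistic fictitious data a masking product generates. The function names, field names, and key are hypothetical; in practice the key would come from a secrets manager.

```python
import hashlib
import hmac


def mask_value(value, secret_key, prefix="user"):
    """Replace a value with a stable pseudonym derived from a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def mask_records(records, sensitive_fields, secret_key):
    """Return copies of the records with the sensitive fields pseudonymized."""
    masked = []
    for record in records:
        copy = dict(record)  # leave the original record untouched
        for field in sensitive_fields:
            if field in copy:
                copy[field] = mask_value(str(copy[field]), secret_key, prefix=field)
        masked.append(copy)
    return masked


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # assumption: hard-coded only for the demo
    customers = [
        {"email": "ann@example.com", "spend": 120},
        {"email": "bob@example.com", "spend": 75},
    ]
    for row in mask_records(customers, ["email"], key):
        print(row)
```

Because the same input always maps to the same pseudonym, analysts can still join and group on masked fields, while non-sensitive fields such as `spend` are left intact.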
Data Retention: Regulations such as GDPR and CCPA govern how long data may be retained. A clean room can enforce these retention periods, and data that is no longer needed can be disposed of. Products like Cohesity can help achieve this.
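A retention policy of this kind boils down to comparing each record's age against a per-category limit. The Python sketch below shows that check under stated assumptions: the categories, the retention periods in `RETENTION_DAYS`, and the record layout are invented for illustration and are not taken from any regulation or product.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category, in days.
RETENTION_DAYS = {"transactions": 365 * 7, "web_logs": 90}


def expired_records(records, today):
    """Return the records whose retention period has elapsed as of `today`."""
    expired = []
    for record in records:
        limit = timedelta(days=RETENTION_DAYS[record["category"]])
        if today - record["ingested_on"] > limit:
            expired.append(record)
    return expired


if __name__ == "__main__":
    lake = [
        {"id": 1, "category": "web_logs", "ingested_on": date(2024, 1, 1)},
        {"id": 2, "category": "web_logs", "ingested_on": date(2024, 5, 1)},
        {"id": 3, "category": "transactions", "ingested_on": date(2020, 1, 1)},
    ]
    for record in expired_records(lake, today=date(2024, 6, 1)):
        print("eligible for disposal:", record["id"])
```

A scheduled job could run such a check and hand the expired records to whatever disposal process the organization uses.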
Data Augmentation: Data ingested into the clean room can be augmented to create new data sets, which data scientists can then plug into an AI/ML pipeline. The iEDPS product can be used for this.
Key Solution Components:
The components of a clean room include, but are not limited to, the following:
Advantages of setting up a clean room
A clean room provides an end-to-end solution for your data identification, data protection, data retention and data sharing needs. It helps organizations rethink how data is taken in, used, and shared securely inside and outside the organization. The solution is not tied to a vendor and is cloud agnostic: depending on the organization’s requirements, the right products and providers can be plugged in.
It can help with the following:
- Compliance with data privacy regulations – Sensitive data is identified and protected within the clean room before it is used or shared.
- Secure and controlled data sharing – Role-based access control can be enabled on the clean room data lake.
- Increased reusability and on-time availability of data – Data, once protected and analyzed, can be reused for similar future requests.
- Increased automation enabling faster provisioning of desired data sets – The clean room can be plugged into an AI/ML pipeline to enable self-service data provisioning.
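The role-based access control mentioned in the list above can be pictured with a small Python sketch. Cloud providers implement this through their own identity and access management services; the roles, zone names, and functions here are hypothetical and only illustrate the idea of mapping roles to the parts of the data lake they may read.

```python
# Hypothetical roles and the data-lake zones each one may read.
ROLE_PERMISSIONS = {
    "data_steward": {"raw", "masked", "augmented"},
    "data_scientist": {"masked", "augmented"},
    "external_analyst": {"masked"},
}


def can_read(role, zone):
    """Return True if the role is allowed to read the given zone."""
    return zone in ROLE_PERMISSIONS.get(role, set())


def read_zone(role, zone, lake):
    """Return the zone's data, or raise PermissionError if access is denied."""
    if not can_read(role, zone):
        raise PermissionError(f"role {role!r} may not read zone {zone!r}")
    return lake[zone]


if __name__ == "__main__":
    lake = {"raw": ["original rows"], "masked": ["masked rows"], "augmented": []}
    print(read_zone("external_analyst", "masked", lake))
    print(can_read("external_analyst", "raw"))
```

Unknown roles get no access by default, so a vendor sees only the zones that have been explicitly granted.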