Synthetic Data Generation
Synthetic data generation is the process of generating data in bulk according to properties specified by the user, derived by algorithms, or inferred from an existing dataset. The approach is particularly valuable in test data management, where synthetic data often stands in for real data that is unavailable. One use case is generating data that can be shared freely because it contains no sensitive or personally identifiable information and so poses no privacy or security risk. Another significant use case is building a dataset for a machine learning model when only a limited amount of real data is available.
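As a minimal sketch of generation driven by user-specified properties, the following Python example builds records from a small field specification. The specification format, the field names and the helper functions are hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
import random
import string

# Hypothetical field specification: each field names a type and its constraints.
SPEC = {
    "customer_id": {"type": "int", "min": 1000, "max": 9999},
    "name": {"type": "string", "length": 8},
    "plan": {"type": "choice", "values": ["basic", "standard", "premium"]},
}


def generate_value(rule, rng):
    """Produce one value that satisfies a single field rule."""
    if rule["type"] == "int":
        return rng.randint(rule["min"], rule["max"])  # inclusive on both ends
    if rule["type"] == "string":
        return "".join(rng.choices(string.ascii_lowercase, k=rule["length"]))
    if rule["type"] == "choice":
        return rng.choice(rule["values"])
    raise ValueError("unsupported field type: " + str(rule["type"]))


def generate_records(spec, count, seed=None):
    """Generate `count` records; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [
        {field: generate_value(rule, rng) for field, rule in spec.items()}
        for _ in range(count)
    ]


if __name__ == "__main__":
    for record in generate_records(SPEC, 3, seed=42):
        print(record)
```

Because every value is drawn from the rules rather than copied from a real record, the output contains no real personal information. A seed is passed in so that the same test data can be regenerated on demand.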
API Data Generation
API data generation uses the same techniques as synthetic data generation to produce a payload, typically JSON or XML, and then sends that payload to a configured REST API. An ideal API data generation workflow should support configuring a flow that generates data for multiple APIs and remembers previous executions, so that data from earlier requests can be used when generating the next request payload. It should also support a range of authentication and authorization methods, along with custom headers that can be generated dynamically to meet the widely varying requirements of different APIs.
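A flow of this kind might be sketched as follows in Python. Everything named here is an assumption made for illustration: the step format, the "{{name}}" placeholder syntax, the example URLs, the bearer token header and the `fake_send` transport, which stands in for a real HTTP client so that the example runs without a network. The sketch also assumes that the values to remember come from each response.

```python
import json
import uuid


def render(template, context):
    """Recursively replace "{{name}}" strings with values remembered in context."""
    if isinstance(template, dict):
        return {key: render(value, context) for key, value in template.items()}
    if isinstance(template, list):
        return [render(item, context) for item in template]
    if (isinstance(template, str)
            and template.startswith("{{") and template.endswith("}}")):
        return context[template[2:-2].strip()]
    return template


def run_flow(steps, send, context=None):
    """Run the steps in order, carrying saved response values forward."""
    context = dict(context or {})
    for step in steps:
        payload = render(step["payload"], context)
        # Headers are built per request; the request ID is generated dynamically.
        headers = {
            "Content-Type": "application/json",
            "X-Request-Id": str(uuid.uuid4()),
        }
        if "token" in context:
            headers["Authorization"] = "Bearer " + context["token"]
        response = send(step["method"], step["url"], headers, json.dumps(payload))
        # Remember selected response fields for use in later payloads.
        for name, key in step.get("save", {}).items():
            context[name] = response[key]
    return context


def fake_send(method, url, headers, body):
    """Hypothetical transport that imitates two endpoints of an application."""
    data = json.loads(body)
    if url.endswith("/customers"):
        return {"id": 101, "name": data["name"]}
    if url.endswith("/orders"):
        return {"id": 5001, "customerId": data["customerId"]}
    raise ValueError("unknown endpoint: " + url)


STEPS = [
    {"method": "POST", "url": "https://api.example.test/customers",
     "payload": {"name": "Test User"},
     "save": {"customer_id": "id"}},
    {"method": "POST", "url": "https://api.example.test/orders",
     "payload": {"customerId": "{{customer_id}}",
                 "items": [{"sku": "A-1", "qty": 2}]},
     "save": {"order_id": "id", "order_customer": "customerId"}},
]

result = run_flow(STEPS, fake_send, {"token": "dummy-token"})
```

The second step cannot be built until the first has run, because its `customerId` comes from the customer that the first call created. Passing the transport in as a function keeps the flow logic separate from the HTTP client and from the authentication scheme, either of which could be swapped for whatever a given API requires.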
Why API Data Generation?
Synthetic data is mostly generated directly into databases and files, including RDBMSs, document-based databases and file formats such as JSON, XML and EDI: bulk data is generated according to the configured algorithms and written to the appropriate data store. This is a very efficient way to generate data, but in certain cases it is not applicable or is very cumbersome to configure.
The following situations are best suited to API data generation:
- Data generation for commercial off-the-shelf (COTS) applications such as Salesforce, SAP SuccessFactors and other CRM and enterprise applications: For these applications, either the underlying database is not exposed or the relationships between the tables are very complex, which makes direct generation difficult. However, these applications generally expose APIs for inserting and retrieving data, which API data generation can leverage.
- Applications where direct database access is not available: When direct database access is unavailable because of project restrictions or third-party systems, the REST APIs exposed by the application itself can be used to generate data in its databases.
- Databases where relationships are complex: Direct generation can be difficult to configure for databases with highly complex relationships between several tables. With API data generation, the data can instead be inserted into these databases through the available REST APIs.
Most applications expose REST APIs that can be leveraged to set up the required data in their backend databases. Generating data through the REST API helps ensure that the application data is functionally accurate, avoids privacy concerns and creates the data in the correct functional sequence.
iEDPS (Infosys Enterprise Data Privacy Suite) is a data privacy and test data management product that offers synthetic data generation as well as a robust API data generation capability for setting up the right test data according to test requirements.
AUTHORS: Ashique Syed Mohammed and Biju Babukuttan Nair