Introduction
When retiring a mainframe, the biggest risk isn’t losing applications; it’s losing history. Historical data stored on tapes often gets overlooked during cloud migrations, yet it’s critical for audits, compliance, and business continuity. This blog shares how we successfully migrated years of audit data to a secure, searchable environment while meeting governance, security, and operational requirements.
Background & Objective
Mainframe retirement programs are often large, multi‑year initiatives driven by the need to reduce licensing costs, eliminate technical debt, and modernize legacy workloads. As organizations transition critical applications to the cloud, the focus naturally falls on re-platforming or rewriting the operational systems that support current business processes. However, another equally important responsibility lies beneath the surface: preserving decades of historical and audit data generated by mainframe applications.
In many enterprises, this historical data resides in formats such as VSAM files, DB2 extracts, and tape archives accumulated over 10, 20, or even 30+ years. While these datasets are rarely used day-to-day, they are vital for regulatory compliance, audit preparedness, fraud investigations, reconciliation efforts, and long-term record retention mandates. Losing access to them could expose the organization to legal penalties, failed audits, or operational blind spots.
The primary objective of this initiative was not just to migrate the live applications to the cloud, but also to ensure that years of historical audit data remained intact, secure, governed, and easily retrievable. This required creating a modern archival solution capable of handling large data volumes, supporting analytics and search, enforcing strict access controls, and integrating with enterprise security policies, all within aggressive program timelines.
Business Challenges
- Incomplete visibility into source data, leading to delays, missed deadlines, and cost overruns from unexpected discoveries.
- Regulatory compliance risks, carrying the potential for legal penalties, reputational damage, and project stoppages.
- Network bandwidth constraints, extending migration timelines and delaying go‑live schedules.
- Insufficient temporary storage, creating workflow bottlenecks and increasing data‑integrity risks.
- Data security exposure during migration, raising the risk of breaches and erosion of customer trust.
- Gaps in data accuracy and completeness verification, triggering operational disruptions and poor decision making.
Approach for Tape Data Migration
The following steps outline a structured, security‑first approach for migrating tape‑based mainframe data to AWS.
- Mainframe Dataset Discovery and Transfer Workflow
We started by analyzing the existing mainframe job scripts to identify which files needed to be moved.
Next, we worked with the storage team to understand the size of these files. For files that were larger than 500 GB, we automated a process to split them into smaller, manageable chunks. This made the transfer more efficient and reduced the risk of errors.
Once the data was split, we transferred these chunks from tape storage to a network-attached storage (NAS) staging area using transfer scripts.
We created a mapping table that included details like file name, month, record count, and file size. This mapping was essential for validation and traceability, ensuring that nothing was lost or mismatched during the migration.
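The split-and-map workflow above can be sketched as follows. This is a minimal illustration, not our production transfer scripts: the 500 GB threshold comes from the text, while the chunk naming scheme, buffer size, and the reduced mapping columns (file name, size, checksum; the real table also carried month and record count) are assumptions for the example.

```python
import csv
import hashlib
import os

BUFFER = 64 * 1024 * 1024        # stream in 64 MB buffers, never the whole file
CHUNK_BYTES = 500 * 1024**3      # 500 GB split threshold from the workflow above

def split_dataset(path, out_dir="staging", chunk_bytes=CHUNK_BYTES, buffer=BUFFER):
    """Split one extracted dataset into chunk files no larger than chunk_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    chunks = []
    index = 0
    with open(path, "rb") as src:
        while True:
            chunk_path = os.path.join(
                out_dir, f"{os.path.basename(path)}.part{index:04d}")
            written = 0
            with open(chunk_path, "wb") as dst:
                while written < chunk_bytes:
                    buf = src.read(min(buffer, chunk_bytes - written))
                    if not buf:
                        break
                    dst.write(buf)
                    written += len(buf)
            if written == 0:          # nothing left to read: drop the empty file
                os.remove(chunk_path)
                break
            chunks.append(chunk_path)
            index += 1
    return chunks

def write_mapping(chunks, mapping_csv="mapping.csv"):
    """Record file name, size, and SHA-256 per chunk for later reconciliation."""
    with open(mapping_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name", "size_bytes", "sha256"])
        for chunk in chunks:
            digest = hashlib.sha256()
            with open(chunk, "rb") as src:
                for buf in iter(lambda: src.read(1024 * 1024), b""):
                    digest.update(buf)
            writer.writerow([os.path.basename(chunk),
                             os.path.getsize(chunk), digest.hexdigest()])
```

The mapping file produced here is what the later validation stage reconciles against, so it is written before anything leaves the staging area.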
- Comprehensive Approach to Data Classification and Cloud Security
We began by classifying all data fields based on sensitivity. This included grouping them into Sensitive PII (Personally Identifiable Information that needs strong protection), Non-Sensitive PII, and Non-PII (general data without personal identifiers).
Next, we created a catalog file that clearly specifies which fields require encryption and which should be tokenized. This catalog acts as a reference for applying the right security measures consistently.
For tokenization, we replaced raw identifiers, such as client codes, with tokens to protect client information. These token mappings are securely stored in a private cloud environment (GreenLake), ensuring that sensitive details never leave the secure zone. In the public cloud, only tokens are used, not the original identifiers.
Finally, we implemented multi-layer security controls in AWS to safeguard data at every level: AWS KMS for key management, file- and field-level encryption with PGP, IAM roles for access control, and VPCs with private subnets for network isolation.
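The tokenization idea can be sketched in a few lines. This is a simplified stand-in, not the GreenLake vault service itself: the in-memory dict represents the token store kept in the private-cloud zone, and the HMAC-based token derivation and 16-character token length are assumptions for the example.

```python
import hashlib
import hmac

class Tokenizer:
    """Deterministic tokenization sketch: raw client codes never leave the vault.

    The dict below stands in for the secure token store held in the
    private-cloud zone; only the tokens travel to the public cloud.
    """

    def __init__(self, secret: bytes):
        self._secret = secret          # key held only inside the secure zone
        self._vault = {}               # token -> raw value, for detokenization

    def tokenize(self, raw: str) -> str:
        # HMAC yields a stable, non-reversible token for the same input,
        # so joins on the tokenized column still work in the public cloud.
        token = hmac.new(self._secret, raw.encode(),
                         hashlib.sha256).hexdigest()[:16]
        self._vault[token] = raw
        return token

    def detokenize(self, token: str) -> str:
        # Only callable inside the secure zone, where the vault lives.
        return self._vault[token]
```

Because the token is deterministic, the same client code always maps to the same token, which preserves referential integrity across the partitioned Parquet files without ever exposing the raw identifier.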
- Optimized Data Extraction and Storage Management Strategy
We scheduled large data extraction jobs during low-traffic periods to minimize any impact on regular business operations.
To further reduce performance risks, we distributed the workload across multiple mainframe partitions (LPARs).
We also performed capacity planning to estimate storage needs over time. For example, moving 1.5 petabytes of data over seven years translates to roughly 200 GB of NAS storage per year for staging slices, plus about 1.2 PB in HDFS for overlapping processing requirements.
Finally, we implemented purge policies to automatically remove staged files after successful transfer and validation.
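A purge policy of this kind can be expressed as a small sweep over the staging area. This is an illustrative sketch, not our production job: the retention window, the `transferred` set as the signal for "successfully transferred and validated", and the function names are all assumptions for the example.

```python
import os
import time

RETENTION_SECONDS = 24 * 3600   # illustrative grace period before deletion

def purge_staged(staging_dir, transferred, retention=RETENTION_SECONDS):
    """Delete staged chunks already confirmed transferred and validated.

    `transferred` is the set of file names the transfer service has
    marked as successfully uploaded and reconciled against the catalog.
    """
    now = time.time()
    removed = []
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        # Keep anything not yet confirmed, and anything inside the grace
        # period, so a failed validation can still fall back to the chunk.
        if name in transferred and now - os.path.getmtime(path) > retention:
            os.remove(path)
            removed.append(name)
    return removed
```

Running this on a schedule keeps the NAS staging footprint close to the per-slice estimate rather than letting staged chunks accumulate toward the full archive size.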
- Secure Data Transformation and Cloud Integration Workflow
We ran parser jobs that process each file chunk. These jobs apply field-level encryption or tokenization, normalize the schema, and then write the data into partitioned Parquet files organized by table, year, and month. This structure makes the data easy to query and manage later.
For every Parquet file, we also generated metadata that includes details like the source dataset, record count, checksum, and job ID. This metadata is then registered in a central catalog (such as AWS Glue) for easy tracking and governance.
Finally, a transfer service continuously monitors the NAS staging area. It uploads files to Amazon S3 using multi-part uploads, handles automatic retries in case of failures, and ensures that every file is consistently registered in the catalog for traceability and validation.
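The per-file metadata step can be sketched as follows. This is a simplified stand-in for the real pipeline: a JSONL file plays the role of the central catalog (AWS Glue in our case), and the field names and `register_metadata` function are assumptions for the example, though the fields themselves (source dataset, record count, checksum, job ID) come from the text above.

```python
import hashlib
import json
import os
import uuid

def register_metadata(parquet_path, source_dataset, record_count,
                      catalog_path="catalog.jsonl"):
    """Compute a checksum and append a catalog entry for one Parquet file.

    A local JSONL file stands in for the central catalog here; in the
    real pipeline the entry would be registered in AWS Glue.
    """
    digest = hashlib.sha256()
    with open(parquet_path, "rb") as fh:
        for buf in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(buf)
    entry = {
        "file": os.path.basename(parquet_path),
        "source_dataset": source_dataset,
        "record_count": record_count,
        "sha256": digest.hexdigest(),
        "job_id": str(uuid.uuid4()),   # ties the entry back to the parser job
    }
    with open(catalog_path, "a") as cat:
        cat.write(json.dumps(entry) + "\n")
    return entry
```

Every downstream check (row-count reconciliation, checksum validation, audit queries) reads from this catalog rather than re-deriving facts from the files, which is what makes the lineage auditable.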
- End-to-End Data Validation and Audit Framework
We implemented a comprehensive validation process to ensure data integrity after migration. First, we performed row-count reconciliation, comparing the number of rows in each Parquet file with the original record counts from the mapping table.
Next, we ran checksum and hash validations to verify that the source and target files matched exactly.
We also carried out schema validation and sampling checks to confirm that all field-level transformations—such as encryption or tokenization—were applied correctly and that the data structure was consistent.
Finally, we maintained detailed audit logs, job execution logs, and catalog entries (e.g., in AWS Glue). These records provide full traceability and provenance, making it easy to track every step of the process for compliance and troubleshooting.
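The first two validation steps, row-count reconciliation and checksum matching, can be sketched as below. This is a minimal illustration under stated assumptions: the `validate_file` signature and the failure-list return style are invented for the example; the real framework reads its expected values from the mapping table and catalog.

```python
import hashlib

def sha256_of(path):
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for buf in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(buf)
    return digest.hexdigest()

def validate_file(target_path, expected_rows, actual_rows, expected_sha):
    """Row-count reconciliation plus checksum match; returns any failures.

    An empty list means the file passed both checks and can be marked
    validated in the catalog.
    """
    failures = []
    if actual_rows != expected_rows:
        failures.append(
            f"row count {actual_rows} != expected {expected_rows}")
    if sha256_of(target_path) != expected_sha:
        failures.append("checksum mismatch against mapping table")
    return failures
```

Any non-empty failure list blocks the purge of the corresponding staged chunk, so a bad file can always be re-processed from source.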
Outcomes
This wasn’t just a migration; it was a transformation built on clarity, security, and trust. By addressing challenges upfront and aligning every decision with business needs, we:
- Reduced costs by eliminating unnecessary transfers.
- Accelerated timelines through automation and smart scheduling.
- Ensured compliance with robust security and governance controls.
- Delivered confidence to leadership with full traceability and auditability.
- Optimized storage by generating Parquet files with ZSTD compression, significantly reducing space requirements.
- Improved query performance for auditing, returning results to the audit UI far faster than the legacy formats allowed.
- Enhanced scalability and efficiency through columnar storage and compression, supporting large-scale data operations seamlessly.
Conclusion
Migrating decades of tape‑based mainframe data to the cloud is more than a technical upgrade. It protects the organization’s history while making the data far easier to use. By following a clear, secure, and well‑validated process, we ensured that all audit information stayed accurate, accessible, and compliant throughout the modernization journey. With strong governance, automation, and multi‑layer security, the migration created a modern archive that is faster, lighter, and more reliable than legacy formats. This transformation not only preserved critical institutional knowledge but also unlocked new opportunities for analytics, dashboard creation, and AI‑driven chatbots. Because the data is now organized, searchable, and stored in optimized formats, teams can quickly build dashboards, perform trend analysis, and even use chatbots to answer audit and operational questions instantly, giving business, audit, and compliance teams a powerful, scalable platform for future insights.