Migration projects often begin with a seemingly straightforward inventory: here are the source systems, here are the target systems, here's how we'll move the data between them. But lurking beneath this clean picture is a challenge that derails timelines and inflates budgets—duplicate data sources that no one fully accounted for.
How Duplicate Sources Emerge
Organizations rarely intend to create duplicate data sources. They accumulate over time through various mechanisms:
- System proliferation: Over years, different departments implemented different solutions for similar needs. Each maintains its own customer list, product catalog, or employee directory.
- Integration workarounds: When systems couldn't share data directly, users created extract files, staging databases, or spreadsheets that became their own sources of truth.
- Historical acquisitions: Mergers and acquisitions brought additional systems that were never fully consolidated.
- Shadow IT: Business units implemented their own solutions outside IT governance, creating parallel data stores.
- Reporting layers: Data warehouses and reporting databases that were supposed to be read-only copies evolved into sources for downstream processes.
The Discovery Problem
The fundamental challenge is that duplicate sources often aren't visible until migration work is underway. Initial discovery focuses on the "official" systems—the ones IT knows about, the ones in architecture diagrams, the ones with support contracts.
But as mapping sessions proceed, stakeholders mention the spreadsheet they've been maintaining for years. Testing reveals that a downstream system was pulling data from a different source than expected. Data reconciliation shows discrepancies because two systems have diverged over time.
Each discovery triggers scope expansion: additional analysis, additional mapping, additional testing, additional reconciliation.
The Impact on Migration Projects
Cost Inflation
Every additional source requires analysis, mapping design, transformation development, and testing. What was scoped as a three-source migration becomes a seven-source migration, which more than doubles the work if each source takes comparable effort.
Schedule Delays
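Sources discovered mid-project can't be worked in parallel with the planned ones, because their analysis and mapping can start only once they surface. Each late discovery adds its work to the end of the schedule, so timelines extend unpredictably.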
Quality Risks
When the same entity (customer, product, employee) exists in multiple sources with different values, which source is correct? Resolving these conflicts requires business decisions that take time and often reveal deeper data quality issues.
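Before those business decisions can be made, someone has to list where the sources actually disagree. The following is a minimal sketch of that step, assuming each source can be exported as a list of records sharing a common key. The source names, the `customer_id` key, and the `find_conflicts` helper are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def find_conflicts(sources, key="customer_id"):
    """Return {entity_key: {field: {source_name: value}}} for every field
    whose value differs between at least two sources."""
    # seen[entity][field][source_name] = value
    seen = defaultdict(lambda: defaultdict(dict))
    for source_name, records in sources.items():
        for record in records:
            for field, value in record.items():
                if field != key:
                    seen[record[key]][field][source_name] = value

    conflicts = {}
    for entity, fields in seen.items():
        differing = {
            field: by_source
            for field, by_source in fields.items()
            if len(set(by_source.values())) > 1
        }
        if differing:
            conflicts[entity] = differing
    return conflicts

# Hypothetical extracts from two systems that both hold customer data.
sources = {
    "crm": [
        {"customer_id": 1, "email": "a@example.com", "city": "Leeds"},
        {"customer_id": 2, "email": "b@example.com", "city": "York"},
    ],
    "billing": [
        {"customer_id": 1, "email": "a@example.com", "city": "London"},
        {"customer_id": 3, "email": "c@example.com", "city": "Bath"},
    ],
}

conflicts = find_conflicts(sources)
print(conflicts)
```

In this sample, only customer 1 appears in both sources with a differing field (`city`), so it is the only entry reported. Entities that exist in just one source are not conflicts in this sense and need a separate key comparison.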
Integration Complications
Systems that consume data from the migrated source need to be identified and updated. Unknown consumers of duplicate sources create hidden dependencies that surface as production issues.
Proactive Discovery Strategies
Cast a Wide Net Early
Don't limit initial discovery to known systems. Actively seek out spreadsheets, Access databases, extract files, and unofficial data stores. Ask business users directly: "Where do you get this data?" and "Do you maintain any lists or files of your own?"
Trace Data Flows
For each data element, trace where it originates, where it flows, and where it's consumed. Data lineage analysis often reveals duplicate sources that inventory-based discovery misses.
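One way to make this tracing concrete is to record each known flow as an edge and then ask two questions: which systems originate an element without being fed by anything else, and which systems sit downstream of a given source. The sketch below assumes flows have been collected by hand into `(upstream, downstream, data_element)` tuples. The system names and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical flows gathered during discovery: (upstream, downstream, data_element)
flows = [
    ("crm", "warehouse", "customer"),
    ("warehouse", "marketing_tool", "customer"),
    ("sales_spreadsheet", "marketing_tool", "customer"),
    ("erp", "warehouse", "product"),
]

def origins_by_element(flows):
    """Map each data element to the systems that emit it but never receive it."""
    producers = defaultdict(set)
    receivers = defaultdict(set)
    for upstream, downstream, element in flows:
        producers[element].add(upstream)
        receivers[element].add(downstream)
    return {el: producers[el] - receivers[el] for el in producers}

def consumers_of(flows, system, element):
    """Return every system fed, directly or transitively, by `system` for `element`."""
    downstream = defaultdict(set)
    for up, down, el in flows:
        if el == element:
            downstream[up].add(down)
    found, stack = set(), [system]
    while stack:
        for nxt in downstream[stack.pop()]:
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found

origins = origins_by_element(flows)
# An element with more than one origin is a candidate duplicate source.
duplicates = {el: srcs for el, srcs in origins.items() if len(srcs) > 1}
print(duplicates)
print(consumers_of(flows, "crm", "customer"))
```

Here `customer` has two independent origins, `crm` and `sales_spreadsheet`, which is exactly the kind of finding an inventory of official systems would miss. The result is only as complete as the list of flows, so it complements interviews rather than replacing them.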
Reconcile Counts
Compare record counts across systems early. If your CRM has 50,000 customers but your billing system has 55,000, you have a 5,000-record discrepancy to explain before migration begins.
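Counts are a quick first signal, but two systems can hold the same number of records and still disagree about which records exist. A small sketch of a key-level comparison follows, assuming both systems can export the identifier they use for the entity and that those identifiers are comparable. The `reconcile_keys` helper and the sample IDs are hypothetical.

```python
def reconcile_keys(left_name, left_keys, right_name, right_keys):
    """Compare the distinct keys held by two systems.

    Counts are of distinct keys, so repeated keys within one system collapse.
    """
    left, right = set(left_keys), set(right_keys)
    return {
        "counts": {left_name: len(left), right_name: len(right)},
        f"only_in_{left_name}": sorted(left - right),
        f"only_in_{right_name}": sorted(right - left),
        "in_both": len(left & right),
    }

# Hypothetical customer IDs exported from each system.
crm_ids = [101, 102, 103, 104]
billing_ids = [102, 103, 104, 105, 106]

report = reconcile_keys("crm", crm_ids, "billing", billing_ids)
print(report)
```

The keys listed under `only_in_crm` and `only_in_billing` are the records to investigate first. If the two systems use different identifiers for the same entity, a matching step on other attributes is needed before a comparison like this is meaningful.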
Interview Broadly
Talk to people who actually use the data, not just system owners. End users often know about data sources that IT hasn't documented.
Check Historical Records
Look at previous integration projects, data requests, and report specifications. They often reference data sources that current documentation doesn't capture.
Dealing with Discovered Duplicates
When duplicate sources are found, you have several options:
- Consolidate before migration: Clean up duplicate sources in the legacy environment first, simplifying the migration.
- Establish a golden source: Determine which source is authoritative and migrate from it, documenting why the others are excluded.
- Merge during migration: Combine data from multiple sources with defined conflict resolution rules.
- Migrate all and reconcile: Bring all data to the target and resolve duplicates post-migration.
The right approach depends on timeline, budget, data quality, and business tolerance for uncertainty. But having the conversation explicitly is better than ignoring the problem.
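To illustrate what "defined conflict resolution rules" might look like for the merge-during-migration option, here is a minimal sketch using per-field source precedence. The rule table, the source names, and the `merge_entity` helper are assumptions made for this example. Real rules would come out of the business decisions described earlier, and might use recency or validation checks instead of a fixed order.

```python
# Hypothetical rules: for each field, sources listed earlier win.
PRECEDENCE = {
    "email": ["crm", "billing"],
    "billing_address": ["billing", "crm"],
}
DEFAULT_PRECEDENCE = ["crm", "billing"]

def merge_entity(records_by_source):
    """Merge one entity's records from several sources.

    Returns (merged_record, provenance), where provenance maps each field
    to the source that supplied the chosen value.
    """
    fields = {f for rec in records_by_source.values() for f in rec}
    merged, provenance = {}, {}
    for field in sorted(fields):
        for source in PRECEDENCE.get(field, DEFAULT_PRECEDENCE):
            value = records_by_source.get(source, {}).get(field)
            # Treat None and empty strings as "no value" and fall through.
            if value not in (None, ""):
                merged[field] = value
                provenance[field] = source
                break
    return merged, provenance

merged, provenance = merge_entity({
    "crm": {"email": "a@example.com", "billing_address": "", "phone": None},
    "billing": {
        "email": "old@example.com",
        "billing_address": "1 High St",
        "phone": "555-0100",
    },
})
print(merged)
print(provenance)
```

Recording provenance alongside the merged value keeps each decision auditable, which helps when a stakeholder later asks why the target system shows a particular value.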
The sources you don't know about will cost more than the sources you planned for. Invest in discovery before you commit to scope.