Source tables change over time. A data mart or data warehouse that is based on those tables needs to reflect these changes. This blog post will explain different solutions for solving this problem.
When moving data into a data warehouse, taking it from a source system is the first step in the ETL process. Once extracted from the source, the data can be cleaned and transformed so it can be loaded into a staging table or directly into the data warehouse.
Designing and creating the process to extract the data from the source system is usually the most time-consuming task in the ETL process if not the entire data warehousing process. Source systems are usually very complex, with tables and fields in the databases that are difficult to understand and poorly documented (many popular ERP systems use numbers for table names). This makes determining the data which needs to be extracted a challenge. And usually the data needs to be extracted on a daily basis to supply all changed data to the data warehouse in order to keep it up-to-date. Moreover, the source systems usually cannot be modified, or its performance or availability adjusted, to accommodate the needs of the data warehouse extraction process.
The focus of this blog will be the technical considerations of having different kinds of sources and extraction methods. We will discuss the most common techniques used for extracting data from source databases and the ways to determine the data that has changed since the last extraction.
Designing this process means answering the following two questions:
1) Which extraction method do I choose? This influences the source system, the transportation process, and the time needed for refreshing the data warehouse
2) How do I make the extracted data available for further processing? This influences the transportation method, and the need for cleaning and transforming the data
What are the different extraction methods?
The extraction method is highly dependent on the source system and well as the business needs in the target data warehouse. Usually there is no possibility to add additional logic to the source system to help with the incremental extraction of data.
Here are the types of extraction methods:
Full Extraction: All the data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the source data since the last successful extraction. The source data will be provided as-is and no additional information (i.e., timestamps) is necessary from the source data. An example for a full extraction may be an export file of a complete table or a SQL SELECT statement that retrieves all the rows from a table. Many times a full extraction will be used for tables that will be used as dimension tables in a cube.
Incremental Extraction: Only the data that has changed from a specific point in time in history will be extracted. This point in time may be the time of the last extraction, or a business event like the last day of a fiscal period. To identify this delta change, there must be the possibility to identify all the changed information since this specific point in time (see “How to determine if the data that has changed since the last extraction?” below). Many times an incremental extraction will be used for tables that will be used as fact tables in a cube. After the data is extracted, you can use a control table to store the max date of the extracted records, and then for the next run you will get all the rows from the source system since that max date. Or you can just query the destination table in the data warehouse and return the max date and get all the rows from the source system since that max date, but this method could take a long time if you have a lot of records in the destination table. Instead of the max date, another option is to instead use the max ID if the source system has a unique integer.
For each of these methods, there are two ways to physically extract the data:
Online Extraction: The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (i.e., transaction logs or change tables).
Offline Extraction: Many times direct access to the source system is not available, so instead the data is staged outside the original source system and created by an extraction routine. The data is usually in a flat file that is in a defined, generic format. Additional information about the source object is necessary for further processing.
How to determine the data that has changed since the last extraction?
If a data warehouse extracts data from an operational system on a nightly basis, then the data warehouse requires only the data that has changed since the last extraction (that is, the data that has been modified in the past 24 hours). When it is possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process) can be much more efficient, because it must extract a much smaller volume of data. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the system. Incremental extraction is typically the most challenging technical issue in data extraction. Below are several techniques for implementing incremental extraction from source systems. These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners or the source system prior to implementation. Each of these techniques can work in conjunction with the data extraction techniques discussed previously. For example, timestamps can be used whether the data is being pulled from a flat file or accessed through a query to the source system:
Change Data Capture (CDC): Change Data Capture records INSERTs, UPDATEs, and DELETEs applied to SQL Server tables, and makes a record available of what changed, where, and when, in simple relational ‘change tables’. The source of change data for change data capture is the SQL Server transaction log. As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log. The log then serves as input to the change data capture process. CDC reads the transaction log and adds information about changes to the tracked table’s associated change table. See Improving Incremental Loads with Change Data Capture. Note that CDC is the key-enabling technology for providing near real-time data warehousing. But, it is only available in Enterprise edition of SQL Server 2008 or later. So if you are using an older version of SQL Server or your source is in another technology like Oracle, you can’t use CDC.
Timestamps: The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified, making it easy to identify the latest data. This is usually the preferable option. In SQL Server, many times this column is given a timestamp data type, along with a column name of “Timestamp”. Or, the column is given a datetime data type, and a column name of “Last Modified”. You can also add database triggers to populate the “Last Modified” column.
Partitioning: Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week’s data.
Database Triggers: Adding a trigger for INSERT, UPDATE, and DELETE on a single table and having those triggers write the information about the record change to ‘change tables’. Note this process is similar to Change Data Capture, so use CDC if you are on SQL Server 2008 or later. Otherwise, triggers are an option.
MERGE Statement: The least preferable option is to extract an entire table from the source system to the data warehouse or staging area, and compare these tables with a previous extract from the source system to identify the changed data (using the MERGE statement in SQL Server). You will need to compare all the fields in the source with all the fields in the destination to see if a record has changed (or use a hash function such as MD5 and Sha-2 (SHA512)). This approach will likely not have a significant impact on the source system, but it can place a considerable burden on the data warehouse, particularly if the data volumes are large. This option is usually the last resort if none of the other options are possible.
Column DEFAULT value: If you have a source table that won’t have updates to any rows, only inserts, you can add a “Created Date” column that has a default value of the current date. Of course this is only an option if you have permissions to add columns to the source system.
One final idea: To prevent the ETL process that is reading the source system from bogging down a production server, use should think about using replication or database snapshots in SQL Server (or the equivalent technologies for other DBMS). Then have the ETL hit the replicated database instead of the production database. This is especially useful if your ETL process will be running multiple times a day while users are on the system.