Data Integration covers the architectural techniques and capabilities used to move data from one provisioned data source to another. This category of capabilities covers batch and real-time data movement as well as bi-directional replication and synchronization.
Bulk Data Integration
Bulk movement of structured data has traditionally been associated with a process known as ETL (Extract, Transform, and Load), but more recently a more effective and scalable approach known as ELT (Extract, Load, and Transform) has come into use.
- ETL (Extract, Transform, Load) – is a middle-tier, engine-centric architecture and the more traditional approach to data movement. ETL first extracts the data from various source data stores, then transforms the data in a middle-tier ETL engine within a staging area, and finally loads the transformed data into the target data store. The data transformation step is by far the most compute-intensive, and it is performed entirely by the ETL engine. Because the engine performs transformations (and sometimes data quality checks) on a row-by-row basis, it can easily become the bottleneck in the overall process. In addition, the data must be moved over the network twice – once between the sources and the ETL server, and again between the ETL server and the target data store.
- ELT (Extract, Load, and Transform) – is a database-centric architecture that provides a flexible and optimized approach to data movement. The ELT approach has several advantages, such as using the parallel processing features of the target database to perform the transformation and reusing common in-house skills such as SQL. In addition, if the staging area and target data store are located in the same database, less data needs to cross the network.
- Logical Data Movement – Apart from physical data movement capabilities such as ETL and ELT, some consideration should be given to logical data movement capabilities such as partition exchange. Partitioning allows a table, index, or index-organized table to be subdivided into smaller pieces, where each piece of such a database object is called a partition. By manipulating partitions in the target data store, Partition Exchange can move data instantly: it simply swaps the definitions of two partitions, so the data appears to have moved even though no data has actually been transmitted.
- Parallel Copying – Copying a large number of files sequentially can take a considerable amount of time. Parallel copying is the capability to parallelize the copying of large amounts of data, thereby speeding up the process. This capability applies to both native and distributed file systems.
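The ELT flow described above can be sketched in a few lines. The example below is a minimal illustration, not a production implementation: Python's built-in sqlite3 stands in for the real source and target stores, and the table names and schema are hypothetical. The key point is that the transformation runs as set-based SQL inside the target engine rather than row by row in a middle tier.

```python
import sqlite3

# Hypothetical in-memory databases standing in for real source/target stores.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 1050), (2, 2500), (3, 999)])

# Extract: pull the raw, untransformed rows from the source.
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# Load: bulk-insert the raw rows into a staging table on the target.
target.execute("CREATE TABLE stg_orders (id INTEGER, amount_cents INTEGER)")
target.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

# Transform: run set-based SQL inside the target database, letting the
# database engine (potentially in parallel) do the heavy lifting.
target.execute("CREATE TABLE orders AS "
               "SELECT id, amount_cents / 100.0 AS amount_dollars "
               "FROM stg_orders")

result = target.execute(
    "SELECT id, amount_dollars FROM orders ORDER BY id").fetchall()
print(result)  # → [(1, 10.5), (2, 25.0), (3, 9.99)]
```

In a real ELT tool the "transform" step would be generated SQL executed by the target database, which is where the parallelism and SQL-skills reuse mentioned above come from.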
Incremental Data Integration
Informed decision making requires access to more timely, near-real-time data. As the latency of data integration increases, the data loses relevance to ongoing operations, and thus its value decreases.
- Change Data Capture – plays a key role in keeping data consistently updated without unduly impacting source or target performance. Change Data Capture (CDC) captures just the changed records, rather than the full data set, and delivers them to the target data store, dramatically reducing the time and resources needed to move data. While there are different methods of capturing only changed data from source systems (e.g., log based, table-trigger based), reading transaction logs (log-based CDC) has the lowest impact on the source systems and is non-intrusive. For maximum performance (speed and scalability), it is important that CDC provides low-impact capture, routing, transformation, and delivery of transactional data across heterogeneous environments in real time without the need for an ETL engine. Any CDC approach must guarantee referential integrity, because changes frequently involve several data stores at one time. For example, when an order is created, updated, or deleted, both the orders table and the order lines table are involved; when processing a new order line, the order to which that line belongs must be taken into account.
- Data Stream Capture – addresses the need for lower latency when the latency of batch movement of unstructured/semi-structured data is unacceptable. Data Stream Capture provides the capability to move data as it is generated by describing how a single stream of data is delivered and processed from its source to its target data store (e.g., a distributed file system). Streaming data is captured via a number of mechanisms (e.g., tailing a text file) and then transmitted to its destination. Multiple Data Stream Capture engines can be wired together to support a wide variety of data movement requirements (e.g., scalability, routing, consolidation, and reliability). Data Stream Capture not only addresses low-latency requirements but also provides low-impact capture, routing, transformation, and delivery of unstructured/semi-structured data across heterogeneous environments in real time.
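Of the capture methods mentioned above, log-based CDC has the lowest impact but depends on engine-specific transaction logs, so a portable way to illustrate the idea is trigger-based capture. The sketch below is a simplified, hypothetical example (sqlite3 as the source store, invented table names): triggers record each change in a change table, so only the changed rows, not the full data set, need to be shipped to the target.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_changes (op TEXT, id INTEGER, status TEXT);

-- Trigger-based capture: each DML statement also records a change row.
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN INSERT INTO order_changes VALUES ('I', NEW.id, NEW.status); END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN INSERT INTO order_changes VALUES ('U', NEW.id, NEW.status); END;
""")

db.execute("INSERT INTO orders VALUES (1, 'created')")
db.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

# Only the changed records -- not the full table -- are read and delivered.
changes = db.execute("SELECT op, id, status FROM order_changes").fetchall()
print(changes)  # → [('I', 1, 'created'), ('U', 1, 'shipped')]
```

Note the trade-off the text describes: triggers are intrusive (they add work to every source transaction), which is why log-based capture is preferred for low-impact, real-time delivery.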
Data Replication
Data Replication enables the creation and maintenance of one or more synchronized standby data stores that protect data from failures, disasters, errors, and corruptions. It can address both High Availability (HA) and Disaster Recovery (DR) requirements. In addition, it can off-load reporting requirements from the production data stores.
When the source and target data stores are the same, and the requirements focus on high availability and disaster recovery, a data replication approach is best suited. If the source and target data stores are heterogeneous, then a combined bulk and incremental data movement approach is best suited. A standby data store is initially created from a backup copy of the primary data store, and the replication engine then keeps the data synchronized.
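The two-step replication flow just described, initializing the standby from a backup of the primary and then keeping it synchronized, can be sketched as follows. This is an illustrative toy, with sqlite3 standing in for the primary and standby stores and a simple list standing in for a replication engine's change stream.

```python
import sqlite3

primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
primary.execute("INSERT INTO accounts VALUES (1, 100)")

# Step 1: create the standby from a backup copy of the primary
# (sqlite3's built-in online backup API).
standby = sqlite3.connect(":memory:")
primary.backup(standby)

# Step 2: the replication engine ships subsequent changes so the
# standby stays synchronized with the primary.
change_stream = ["UPDATE accounts SET balance = balance + 50 WHERE id = 1"]
for stmt in change_stream:
    primary.execute(stmt)
    standby.execute(stmt)  # apply the same change on the standby

primary_bal = primary.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
standby_bal = standby.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(primary_bal, standby_bal)  # → 150 150
```

A real replication engine would capture the changes from the primary (e.g., via log-based CDC, as described above) rather than being handed them, but the backup-then-synchronize shape is the same.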
Additional information can be found in the following blog posts: