Full And Incremental Data Loading Concept From Source To Destination Using Azure Data Factory

Objective: Our objective is to load data incrementally or fully from a source table to a destination table using an Azure Data Factory pipeline. First, we will go through the concept and design, and then the implementation steps using Azure Data Factory.
Prerequisites: The following basic knowledge is required to understand the data flow properly.
1. Basics of the ETL process: ETL stands for Extract, Transform and Load. Extract is the process of reading data from a file system or database. Transform is the process of validating and modifying data based on various business rules. Load is the process of writing the data into the destination (also called the target or sink) file system/database.
2. An understanding of Azure Data Lake Storage Gen2, Azure SQL Database and Azure Data Factory components.
You may be wondering: what is a full load, what is an incremental load, and why are both required?
Suppose you are planning to move data from a source system to a target system. The first time, there is nothing in your target system: you build the tables (if it is a SQL DB) in the destination and build an ETL pipeline to move the data. Suppose the source is an online transaction system that can create around 1 million records regularly. The first time, you load all the data available in the source; this is a Full Load from source to destination. From the next day onward, you can plan to load only the previous day's data into the destination. If your destination already has all the data except yesterday's, you only need to load the new records available in the source system; this is an Incremental Load. Basically, an incremental load keeps your destination system up to date with your source system.
For example, suppose I have a customer table in the source that has 100k records as of today, and we have performed a full load, loading all 100k records into the customer table of the target. As the source is an online transaction system, I can expect a few new records in the source. Suppose there are 100 new rows, so the source now holds (100k + 100) records in total. If we go for a full load again, we have to load (100k + 100) records into the target, which is costly and takes more time. It is better to identify those 100 new records in the source system and load only them into the target system, which is far less expensive.
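As a minimal sketch of the idea (table and column names are illustrative, not taken from the repository), the difference between the two loads comes down to the query issued against the source:

-- Full load: first run, copy everything from the source.
select * from dbo.customer;

-- Incremental load: subsequent runs, copy only the rows added since the
-- last load, tracked through a watermark column such as added_date.
select *
from dbo.customer
where added_date > '2024-01-01';   -- watermark recorded after the previous load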
Scenario: Here, I have tried to explain a scenario in which we can perform a full or incremental load of multiple tables from a source system. For this purpose we create a configuration table and drive the whole process from it.
The attributes of the configuration table should be the following (a sample table definition is sketched after the list):
Table_Name: Name of the table that we want to load from source to target.
Incremental_Full_Load: If ‘0’, perform a full load; if ‘1’, an incremental load.
Max_Last_Updated_Date: The latest added_date/updated_date among the records of a particular table in the target system; used as the watermark for incremental loads.
Active_Indicator: Binary value (0/1) identifying which tables should be loaded. If ‘1’, the table name is selected for the ETL process.
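A minimal sketch of such a configuration table in T-SQL (the schema, table name, and type choices are assumptions, not taken from the repository):

create table dbo.load_configuration
(
    Table_Name            varchar(100) not null,  -- table to load from source to target
    Incremental_Full_Load bit          not null,  -- 0 = full load, 1 = incremental load
    Max_Last_Updated_Date datetime     null,      -- watermark: latest added_date loaded so far
    Active_Indicator      bit          not null   -- 1 = include this table in the ETL run
);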
During the first run, the load activity is a full load of the tables. Then, insert a few records and run the load once again; this time it should be an incremental load. The configuration table is updated automatically after the first run (full load), so the second run automatically goes for an incremental load.
Design of the Process: Please go through the following diagram.

Development Process using Azure Data Factory:
1. First, create two Azure SQL DB resources: a source DB and a target DB.
2. Create the Azure Data Factory resource.
3. Create the configuration table.
4. Create two tables in the source DB, e.g. a customer table and a transaction table (see the sketch below).
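The tables need a watermark column (added_date here) for the incremental logic to work; a minimal sketch with assumed columns:

create table dbo.customer
(
    customer_id   int primary key,
    customer_name varchar(100),
    added_date    datetime not null   -- watermark column used for incremental load
);

create table dbo.[transaction]       -- bracketed: TRANSACTION is a reserved word in T-SQL
(
    transaction_id int primary key,
    customer_id    int,
    amount         decimal(10, 2),
    added_date     datetime not null  -- watermark column used for incremental load
);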
5. Load data into the configuration table for the customer table and the transaction table. Also load an extra row for a promotion table. After loading, the data should look like the following:
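A sample insert script (values are illustrative; the promotion row is assumed to be inactive, so the Active_Indicator filter in step 10 skips it):

insert into dbo.load_configuration
    (Table_Name, Incremental_Full_Load, Max_Last_Updated_Date, Active_Indicator)
values
    ('customer',    0, null, 1),   -- full load on the first run
    ('transaction', 0, null, 1),   -- full load on the first run
    ('promotion',   0, null, 0);   -- extra row, filtered out by Active_Indicator = 1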

6. Load a few rows of data into the customer table and the transaction table.
7. Go to Azure Data Factory > Manage > Linked services > New and create two new linked services, one for the source and another for the target.

8. Then, create two parameterized datasets (Azure Data Factory > Author > Datasets > New dataset), one for the source and another for the target, so that we can pass the table names at runtime from the configuration table; this lets a single pipeline handle multiple tables (see the sketch below).
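As a rough, hypothetical illustration of how the parameterization plays out (the parameter name and expression below are assumptions, not taken from the repository):

-- The source query takes the table name as a parameter; inside the ForEach
-- (step 11) it is supplied from the current configuration row via a
-- dynamic-content expression such as:
--     @{concat('select * from dbo.', item().Table_Name)}
-- For the row where Table_Name = 'customer', the expression resolves to:
select * from dbo.customer;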

9. Go to Azure Data Factory > Author > Pipeline > New pipeline.
10. First, create a Lookup activity with the following query:
select * from [schema].[configuration table name] where Active_Indicator=1;
11. Create a ForEach activity to iterate over the tables and perform the appropriate activity (full or incremental load) for each one.

12. Then, create an If Condition activity: if ‘Incremental_Full_Load’ is 1, go to the ‘Incremental Load Activity’ (associated with the True branch of the condition); otherwise, go to the ‘Full Load Activity’ (associated with the False branch).

13. Within the Incremental Load Activity (the queries involved are sketched below),
a. First, create a Lookup to get the ‘Max_Last_Updated_Date’ from the configuration table for each desired table.
b. Then, using a Copy Data activity, move the data from source to target.
c. After that, using a Lookup activity, get the max value of ‘added_date’ from the target table and update that value in the configuration table, in the row with the matching ‘Table_Name’ value.
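A sketch of the queries behind steps ‘a’ to ‘c’ (schema, table, and column names are assumptions; the literal values stand in for runtime parameters):

-- (a) Lookup: fetch the stored watermark for the current table
--     (in ADF the literal 'customer' would come from @item().Table_Name).
select Max_Last_Updated_Date
from dbo.load_configuration
where Table_Name = 'customer';

-- (b) Copy Data source query: read only the rows newer than the watermark.
select *
from dbo.customer
where added_date > '2024-01-01';   -- illustrative value, injected from the lookup in (a)

-- (c) Lookup on the target: find the new maximum watermark ...
select max(added_date) as New_Watermark
from dbo.customer;                 -- run against the target DB

-- ... and write it back to the configuration table.
update dbo.load_configuration
set Max_Last_Updated_Date = '2024-01-02'   -- New_Watermark from the lookup above
where Table_Name = 'customer';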

14. Within the Full Load Activity, we perform steps ‘b’ and ‘c’, with one small addition: in step ‘c’ we also set the ‘Incremental_Full_Load’ value to 1, so that fully loaded tables automatically perform an incremental load from the second run onward (see the sketch below).
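The full-load branch is simpler; a sketch of its queries under the same assumed names:

-- (b) Copy Data source query: read the whole table.
select * from dbo.customer;

-- (c) Store the watermark and flip the flag so the next run
--     takes the incremental branch automatically.
update dbo.load_configuration
set Max_Last_Updated_Date = '2024-01-02',  -- max(added_date) from the target, via lookup
    Incremental_Full_Load = 1
where Table_Name = 'customer';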

Conclusion: Using the above steps, you can easily build a full & incremental loading process. These are the core ideas and design needed to implement full & incremental loads in any ETL system. You can add new processes, ideas, and optimizations to make it more robust.
Git repository (Azure Data Factory implementation): https://github.com/CoolexxAmitava/Full_Or_Incremental_Load
I hope the above ideas help you understand the concept of full and incremental data loading.
Thank you.