Working of an ETL Tool

ETL stands for Extract, Transform and Load. ETL is one of the main components in a BI solution backed by a data warehouse. ETL is also used in other projects such as data migration and data integration projects. ETL flows/jobs can be built by scripting or by using ETL tools. Most of the companies currently use one of the existing ETL tools to build ETL flows/jobs.

In short, ETL tools abstract the technical complexity and thereby enables developers to focus on ”what needs to be done” than on “how it needs to be done”, For example a developer doesn't have to bother about developing a connector to a database, developers design the ETL flow/job using drag and drop, click and configure GUI of the ETL tool and also run, test, debug, schedule the ETL jobs using GUI of the ETL tool.

ETL tools provide a visual framework for ETL developers to design the ETL jobs. It provides a level of abstraction of the code/script. There are many features that ETL tool provides but I won’t go in those details here.

ETL tool allows the ETL developers to focus on the business logic/rules instead of the technical implementation. So, ETL developers focus on what to implement than on how to implement.

ETL tool is expected to create optimized code for an ETL flow and is expected to either bring the data to the ETL server or perform operations in the databases based on whichever is better for performance reasons. Ideally ETL developers shouldn't have to optimize the code but only configure the jobs/transformations/tool correctly such that optimized code is created for every ETL flow.

ETL tool has pre-created connectors/drivers to almost all type of data sources, For example to connect to a DB, all that you provide is the DB server details so you as a ETL developer don’t have to know how to code to connect to a data source.

ETL tool has mappings for each of the commonly used functions. If you see the metadata repository of an ETL tool you will see that there are predefined mapping for every common data transformation for every possible data source. I don’t remember exactly the mappings so I will give a simple example just to give you an idea how an ETL tool works. Lets say, business transformation requires to find out date from timestamp. In one DB it could be date(timestamp column), in other it could be dt (timestamp column) and in another one some other convention. When you do this in an ETL tool you just select the date from timestamp function from the ETL tool and then the ETL tool checks which type of data source it is and takes the corresponding mapping so that the underlying data source understands it. Now extrapolate this logic to every single transformation to every type of data source/target.

Like other tools ETL tools are also available in both open source and proprietary flavors.
  • SSIS is a proprietary ETL tool from Microsoft
  • Datastage - IBM
  • Ab initio - Ab initio
  • Informatica - Informatica
  • BODS - SAP
  • Popular open source ETL tools are Pentaho Data Integration and Talend

Comments

Popular posts from this blog

ETL developer vs Data engineer

3 years of IBI