What is data profiling?

I wanted to profile some data: CSV files with 60+ columns and more than 1 million rows. I started searching for an easy-to-use tool for profiling these files. That's when I noticed that data profiling is not clearly explained anywhere. So here is my attempt to cover this important topic. I will also introduce the tool I found during my search, which helped me a lot in getting to know the data.

What is data profiling? 

In simple words, data profiling is a process in which we try to understand the characteristics of the data without associating it with a business process.

So basically anyone can carry out data profiling on any data. You don't have to know who generates the data, where and how it is generated, or what its context is.

What are the answers we are looking for? 

Some are listed below to give you an idea:
  • How many columns are actually there in the file? Does the count match the specification/documentation, if available?
  • How many rows? 
  • What are the data types? 
  • How many values are unique within the columns?
  • Are there missing or null values?
  • Is every value within the column unique, or is the column an enumeration?
  • Are fields across files consistent?
Based on the answers to the questions above, we can decide which fields could be useful, which fields require cleansing, etc.
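
To make the questions above concrete, here is a minimal sketch of how they could be answered for a single CSV file using only the Python standard library. This is my own illustration, not how any particular tool works. The function names (profile_csv, infer_type), the list of missing-value markers, and the enum_threshold cut-off are all assumptions made for the example, and the type detection is deliberately crude (int, float, or str).

```python
import csv
import io
from collections import Counter

# Assumed markers for "missing"; adjust to whatever your files actually use.
MISSING = {"", "null", "none", "na", "n/a", "nan"}


def infer_type(value):
    """Classify a single non-missing string as int, float, or str."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "str"


def profile_csv(handle, enum_threshold=10):
    """Read a CSV once and collect basic per-column statistics."""
    reader = csv.reader(handle)
    header = next(reader)
    stats = [{"missing": 0, "types": Counter(), "values": set()} for _ in header]
    row_count = 0
    for row in reader:
        row_count += 1
        # Pad short rows so every column still gets a (missing) cell.
        row = row + [""] * (len(header) - len(row))
        for col, cell in zip(stats, row):
            cell = cell.strip()
            if cell.lower() in MISSING:
                col["missing"] += 1
            else:
                col["types"][infer_type(cell)] += 1
                col["values"].add(cell)

    report = {"columns": len(header), "rows": row_count, "fields": {}}
    for name, col in zip(header, stats):
        unique = len(col["values"])
        filled = row_count - col["missing"]
        report["fields"][name] = {
            "types": dict(col["types"]),
            "unique": unique,
            "missing": col["missing"],
            # Every filled value is distinct: a candidate identifier.
            "all_unique": filled > 0 and unique == filled,
            # Few distinct values that repeat: probably an enumeration.
            "looks_like_enum": 0 < unique <= enum_threshold and unique < filled,
        }
    return report


# Tiny made-up sample; a real file would be passed as open(path, newline="").
sample = io.StringIO(
    "id,country,amount\n"
    "1,IN,10.5\n"
    "2,US,\n"
    "3,IN,7\n"
    "4,US,3.25\n"
)
report = profile_csv(sample)
print(report["columns"], report["rows"])
for field, info in report["fields"].items():
    print(field, info)
```

On the sample, id comes out as all unique, country looks like an enumeration with two values, and amount has one missing value and a mix of int and float. For the cross-file question, one simple option is to run the same function on two files and compare the field names and type counts in the two reports. Note that the sketch keeps every distinct value in memory, so for very wide files with millions of rows you may want to cap or approximate the unique counts.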

About the tool I used

I was searching for a web-based tool that offers data profiling features and came across a tool called Ataccama ONE. I am still exploring its various features, but I have already found it very useful for my purpose of data profiling. I understand that the tool offers much more than data profiling.

