What is data profiling?
I wanted to profile some data: CSV files with more than 60 columns and over a million rows. I started looking for an easy-to-use tool for profiling these files, and that is when I noticed that data profiling is not clearly explained anywhere. So here is my attempt to cover this important topic. I will also introduce the tool I found during my search, which helped me a lot in getting to know the data.

What is data profiling?

In simple words, data profiling is a process in which we try to understand the characteristics of the data without associating it with a business process. So basically anyone can carry out data profiling on any data. You don't have to know who generates the data, where and how it is generated, or what its context is.

What are the answers we are looking for?

Some are listed below to give you an idea:

How many columns are actually there in the file? Does that match the specification/documentation, if available?
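
To make the column-count question concrete, here is a minimal sketch of how one might check it using only Python's standard `csv` module. This is not the tool introduced in this post. The function name `profile_column_counts`, the sample data, and the expected column list are all hypothetical and made up for illustration.

```python
import csv
import io
from collections import Counter


def profile_column_counts(stream, expected_columns=None):
    """Count header columns and tally how many fields each data row has.

    `expected_columns` is an optional list of column names taken from a
    specification; it is a hypothetical input for this sketch.
    """
    reader = csv.reader(stream)
    header = next(reader, [])
    # Tally the field count of every non-empty data row. More than one
    # distinct width means some rows do not line up with the header.
    widths = Counter(len(row) for row in reader if row)
    report = {
        "header_columns": len(header),
        "row_widths": dict(widths),
    }
    if expected_columns is not None:
        report["missing"] = [c for c in expected_columns if c not in header]
        report["unexpected"] = [c for c in header if c not in expected_columns]
    return report


# Tiny in-memory sample standing in for a real file (made-up data).
sample = io.StringIO("id,name,amount\n1,Ann,10.5\n2,Bob\n3,Cy,7.0\n")
report = profile_column_counts(
    sample, expected_columns=["id", "name", "amount", "currency"]
)
print(report)
```

For a real file you would pass `open(path, newline="")` instead of the in-memory sample. Because the reader streams one row at a time, the same approach should stay workable on files with a million or more rows.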