Showing posts from December, 2018

What is data profiling?

I wanted to profile data: CSV files with 60+ columns and more than 1 million rows. I started searching for an easy-to-use tool that I could use for profiling these files. That's when I noticed that data profiling is not clearly explained anywhere. So here is my attempt to cover this important topic, and I will also introduce the tool that I found during my search, which helped me a lot in getting to know the data. What is data profiling? In simple words, data profiling is a process in which we try to understand the characteristics of the data without associating it with a business process. So basically anyone can carry out data profiling on any data. You don't have to know who generates the data, where and how the data is generated, or what the context of that data is. What are the answers we are looking for? Some are listed below to give you an idea. How many columns are actually there in the file? Does it match the specification/documentation, if available?
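The kind of questions listed above can be answered with a single streaming pass over the file. The following is only a minimal sketch using the Python standard library, not the tool mentioned in this post. The function name `profile_csv`, the `expected_columns` parameter, and the sample data are all made up for illustration.

```python
import csv
import io
from collections import Counter


def profile_csv(handle, expected_columns=None):
    """Stream a CSV once and collect basic structural statistics.

    Hypothetical helper for illustration. Distinct values are kept in
    memory, so very high-cardinality columns may need sampling instead.
    """
    reader = csv.reader(handle)
    header = next(reader)
    stats = [
        {"name": name, "empty": 0, "distinct": set(), "max_len": 0}
        for name in header
    ]
    row_count = 0
    width_counts = Counter()  # number of fields per row -> number of rows
    for row in reader:
        row_count += 1
        width_counts[len(row)] += 1
        for col, value in zip(stats, row):
            if value.strip() == "":
                col["empty"] += 1
            else:
                col["distinct"].add(value)
                col["max_len"] = max(col["max_len"], len(value))
    matches = None
    if expected_columns is not None:
        # Compare the header against the documented column list.
        matches = header == list(expected_columns)
    return {
        "column_count": len(header),
        "matches_spec": matches,
        "row_count": row_count,
        "row_widths": dict(width_counts),
        "columns": [
            {
                "name": c["name"],
                "empty": c["empty"],
                "distinct": len(c["distinct"]),
                "max_len": c["max_len"],
            }
            for c in stats
        ],
    }


# Tiny in-memory sample standing in for a real file (made-up values).
sample = io.StringIO("id,country,amount\n1,LU,10.5\n2,,7\n3,LU,\n")
report = profile_csv(sample, expected_columns=["id", "country", "amount"])
print(report["column_count"], report["row_count"], report["matches_spec"])
```

For a real file, the same function could be called with `open(path, newline="")` in place of the `StringIO` sample. More than one entry in `row_widths` would hint at rows that do not match the header.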

Don't use front-end where it is not required

Regularly downloading data by hand from a portal (frontend) and loading it into a backend system is like exiting an airport and then re-entering the same airport to catch a connecting flight from the same terminal when you have no business outside the airport. I have seen people doing this: downloading data from portals and then loading it into other systems manually. If no one is looking at the data in the UI, no decision is to be made there, and only regular data feeds are required, then don't do it via the frontend (GUI); just get a data extract job created that will automatically load the backend system. This is partly related to one of the previous posts - BI can take you to places

Should business users spend their time in creating reports?

Should a marketing manager, an HR manager, a sales manager, or an account manager be spending time creating reports, or using reports to make decisions? On one hand, total dependence on the BI team for all information needs can slow down business users. On the other hand, if business users have to create their own reports or work their way through dashboards or self-service BI to get to the numbers they are looking for, it could eat up their time and thereby reduce the time left for their real work, part of which is to take decisions based on information and insights. That's why there needs to be a balance: basic first-level information can be self-served, while for complex requirements BI teams spend time delivering the information.

Prediction vs Forecast

In the context of data analysis/BI, when is something called a prediction and when is it called a forecast? Quite a lot of people use these terms interchangeably. The dictionary can't help here either (source: https://en.oxforddictionaries.com/definition/prediction). So in the context of data analysis/BI, I would say a forecast is based on past trends. A time series is involved. Based on previous behavior, future behavior is forecast for a specific time period. A prediction, on the other hand, may or may not be based on past trends. So not all predictions are forecasts, but all forecasts are predictions. In this way, forecast is like a subset of prediction. Example of a prediction which is a forecast - the number of books that will be sold each month in the next 6 months. Example of a prediction which is not a forecast - Country X will win the world cup because they are a good team and in the best form compared to other teams. What do you think?
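To make the "forecast" side of this distinction concrete, here is a minimal sketch that extends a past monthly trend six months forward with a least-squares straight line. This is only one of many possible forecasting methods, chosen for brevity. The function name `linear_trend_forecast` and the sales figures are invented for illustration and are not real data.

```python
def linear_trend_forecast(history, periods):
    """Fit a least-squares line to past values and extend it forward.

    history: past observations at equally spaced time steps (e.g. months)
    periods: how many future steps to forecast
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two past observations")
    mean_x = (n - 1) / 2  # mean of the time indices 0..n-1
    mean_y = sum(history) / n
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(history)
    ) / sum((x - mean_x) ** 2 for x in range(n))
    intercept = mean_y - slope * mean_x
    # Future time indices continue where the history stopped.
    return [intercept + slope * (n + k) for k in range(periods)]


# Made-up monthly book sales for the last four months.
past_sales = [100, 110, 120, 130]
next_six_months = linear_trend_forecast(past_sales, 6)
print(next_six_months)
```

The world cup example has no counterpart here: it is a prediction made from a judgment about current form, with no time series to extend.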

New Year Resolution - Starting with IBI?

A few people who attended my presentations this year and became aware of the concept of IBI (Individual Business Intelligence), and a few people with whom I have been in contact this year, have told me they are interested in starting with IBI from 2019 or have already started. It feels nice to get feedback like the one below. "I attended your presentation at the publication office. I just read your article on IBI and I love your idea." - Message on LinkedIn from a senior professional who attended my presentation of PublicBI BI solution for EU Public Procurement at the Publication Office of the European Union, Luxembourg. It is really great that people have started or plan to start. I wish you all the best. For those who are still not sure what IBI is and how to use it, the links below should be useful. Basically, in this post I am placing all the important IBI links in one place, in sequence, so that they are easy for people to find and understand. Short (9 minutes) presen

BI can take you to places

Using the BI team only as a data extract team is similar to using a car headlamp to light a room. You are in a dark room on the ground floor; somebody starts a car outside, and the light from the car's headlamp enters your room through the glass windows. You can now see some of the things in your room. You now order (you have the authority, unfortunately) the driver to keep the car running with the headlamp on. The driver tries his best to convince you to get a bulb for your room soon so that he can take the car places, but you don't understand, because you have never seen a car and don't know that it can move. This is how some uninformed business users and uninformed non-BI technical people view BI teams. They think the BI team has expertise in moving data, so let us use them for moving data. No, BI teams move data to consolidate, to combine, to integrate, to harmonize, etc., so that users can get the full picture of the business based on the information an

Open data is the low hanging fruit within Public data

Public data is all the data that is publicly available for everyone to use for any purpose they wish. Open data is a subset of public data. Open data is well-defined, maintained, and generally more reliable, and there is some assurance of continuous availability, because the data and the related documentation, APIs, access points, portals, etc., are made available by the generator (source institution) or by an authorized data aggregator organization. In this sense, from my point of view, open data is the low hanging fruit within public data.

Tool to auto populate data based on a dimensional model

Are there any tools that can be used to auto populate data into tables based on a dimensional model? If none exists, maybe this is something some company can build and offer. Various tools, including ETL tools, provide a feature/component that can be used to generate dummy data based on a schema definition. What I am looking for is a tool to which we can provide a dimensional model (Facts and Dimensions) and a target DB connection, and which can then auto populate dummy data into the tables (Dimensions first and then Facts), keeping all the relationships intact. The tool should be able to create data for all SCD types and all types of Fact tables. A tool like this would help in speeding up prototyping, testing, visualizations, etc., so basically it would speed up development and delivery.
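The core idea (dimensions first, then facts that only reference existing dimension keys) can be sketched in a few lines. This is a toy illustration, not the tool described above: it hard-codes a hypothetical two-dimension star schema in an in-memory SQLite database and does not handle SCD types or different fact table types. All table, column, and function names are made up.

```python
import random
import sqlite3


def populate_star_schema(conn, n_customers=5, n_products=3, n_sales=20, seed=42):
    """Create a tiny hypothetical star schema and fill it with dummy data."""
    rng = random.Random(seed)  # fixed seed so the dummy data is repeatable
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
            product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
            quantity INTEGER NOT NULL
        );
        """
    )
    # Step 1: dimensions first, so that their keys exist.
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?)",
        [(k, f"Customer {k}") for k in range(1, n_customers + 1)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)",
        [(k, f"Product {k}") for k in range(1, n_products + 1)],
    )
    # Step 2: facts, drawing foreign keys only from the loaded dimensions.
    customer_keys = [r[0] for r in conn.execute("SELECT customer_key FROM dim_customer")]
    product_keys = [r[0] for r in conn.execute("SELECT product_key FROM dim_product")]
    conn.executemany(
        "INSERT INTO fact_sales (customer_key, product_key, quantity) VALUES (?, ?, ?)",
        [
            (rng.choice(customer_keys), rng.choice(product_keys), rng.randint(1, 10))
            for _ in range(n_sales)
        ],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # let SQLite enforce the relationships
populate_star_schema(conn)

# Sanity check: no fact row should point at a missing customer.
orphans = conn.execute(
    """
    SELECT COUNT(*) FROM fact_sales f
    LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
    WHERE c.customer_key IS NULL
    """
).fetchone()[0]
print("orphan fact rows:", orphans)
```

A real tool would read the model definition instead of hard-coding it, and would need extra logic for things like effective-date ranges in slowly changing dimensions.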

Information from data is like bread from wheat

When people are hungry, in a hurry, and lack bread-making skills, you can't give them wheat and ask them to make bread and then eat. You need bakers who know how to transform wheat into bread in a scalable way and keep the bread ready for hungry people. In the same way, BI professionals are needed to transform data into information, derive insights, and keep them ready for knowledge-hungry users to consume.
