Why big data is actually small?

Posted an article with full details on LinkedIn

Why big data is losing importance

"Compared to the Universe, anything and everything is very small." Wait, I am definitely not going to use that line as a supporting point for my view on why the so-called big data (#bigdata) is actually small. It's just a reminder that you, I, all of us, planet Earth and this whole solar system are tiny compared to the Universe, and that we shouldn't make things look bigger than they really are. In other words, we shouldn't give something more importance than it actually deserves.

For those who don't have the time (10 minutes) or the patience to read the whole article, the gist is this: big data is just a subset of data, and a subset cannot be bigger than the whole. No matter how big or complex big data is, it cannot be bigger or more complex than data as a whole. Data is an asset, one of the most important assets of our times. Use it to the max to create value, and don't get carried away by buzzwords and sheer marketing. You have to watch the full movie to comprehend it. Watching only a part of the movie and making conclusions and decisions based on that can lead to unexpected results. This is exactly what will happen when you focus only on a subset of data.

My main intention with this article is to present an unbiased view of big data, to explain what I mean by "big data is actually small", and to explain where big data fits in the context of Business Intelligence, so that everyone from freshers to CXOs gets a clearer understanding of big data. Since big data is not called big data because of its size, I am going to use the same premise to present why big data is actually small.

First, we have to examine the definition itself, so let's start with the definition of big data. Yes, but which one? Everyone has their own definition. Let's keep it simple and take the one from Wikipedia: "Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them". Do you see the big flaw? This definition cannot stand the test of time. Today's advanced systems that handle big data will become traditional, normal systems in the next few years, while the size and complexity of data will only keep growing. This is what has been happening so far, and we can safely say the trend will continue. What will data that is larger and more complex than the current big data be called? Very big data? Very very big data? Mega data? How far will we take it? What will today's big data be called in 2022 or 2025? And isn't the current big data small (in size and complexity) compared to the larger and more complex data of the next few years? Or are we saying that data will not get larger and more complex in the future? Please have a think, or a rethink.

Now, let's look at Gartner's 3Vs definition of big data, as quoted in Wikipedia: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation". I really wonder why it was not named "high data", given that the word "high" is used three times in the definition. At least there would have been some logic in the naming. And how high is high, actually? Today's high is tomorrow's normal, or even low. Why in the world would anyone coin a technical term with an adjective as a prefix at a time when everything changes so fast, especially on the technology side? This definition is also ambiguous and will not stand the test of time either. Again, isn't the current big data small (lower in volume, velocity and variety) compared to the higher volume, velocity and variety of data in the next few years? For now, even if we accept these definitions, this (image below) is how it looks.

[Image: Why big data is actually small?]

Some people have probably started to realize the flaws mentioned above and are now trying to add predictive analytics, user behaviour analytics, customer 360-degree views, fraud prevention analytics and many more useful concepts that already existed and were already in use into the definition of big data, to give it a larger context and a bigger meaning, to glorify it, and to make people believe that it is much more than a subset of data. I don't know what is driving them to do this, but as you can see, this is what is happening in the industry. Just check out some of the posts on LinkedIn and you will see what I mean. Some are even trying to equate big data with machine learning and artificial intelligence.

It's like claiming "here is a new movie with action, comedy, romance, songs, dance and drama" when all that exists is a small part of the story, enough for a couple of movie scenes, written 10,000 times on 10,000 sheets and stored in a big (read it as biiiiig) bag. Yet many people already call them big (hit) scenes, because they are in a big bag and marketed in a big way. They even believe, and try to make others believe, that movies can only be made because of those big scenes. Sorry to disappoint, but they are wrong. Movies were always made. Yes, adding those couple of scenes could enhance a movie, but their absence doesn't stop anyone from making good, hit movies. Unfortunately, the impact on the other side is very bad: the story for the rest of the movie is there on fewer pages, and it would definitely take less time to make most of the movie from it, but it is not given importance. Why? Because those sheets are not in a big bag. And everyone seems to be after the big bag because everyone believes everyone else is after the big bag. All credit to the marketing experts; they know very well how to make things look bigger than they really are. Remember, as mentioned above, you have to watch the full movie to comprehend it and to get the full message it has to offer. Watching only a part of the movie and making conclusions and decisions based on that can really lead to unexpected results.

Just in case someone didn't get the analogy above, the point is this: the concept of deriving value from data was always there, in different forms and under different names (Business Intelligence, Business Analytics, Data Analytics). It is not because of big data that we have predictive analytics, user behaviour analytics, customer 360-degree views, fraud prevention analytics or any other useful concepts, techniques or processes that have anything to do with creating value out of data. You can do all of these and much more without adding them to the definition of big data.

Unfortunately, the term big data has gained so much popularity that even people like me, who don't like to use it, are forced to use it to communicate with people who have not heard the word data and have heard only of big data, thanks to unimaginable marketing practices. What was, or is, wrong with calling data "data"? Why do we have to add "big" to it? Based on what generates the data, and where and how, we can always categorize it as we find convenient. For example, if a sensor is generating the data we can call it sensor data; if a mobile app is generating the data we can call it app data. But it's now too late to get rid of the term. So the best we can do is to bring clarity to the definition and not add other existing concepts to it. Keep it simple: big data is just a subset of data.

So, to all those young minds coming fresh out of universities and joining big data training institutes, and to all those new CXOs who are constantly targeted by consulting and software companies: don't get fooled when someone tries to expand the definition of big data and starts saying things like big data = data analytics, or big data = predictive analytics, or big data = new version of business intelligence. Don't get fooled, don't fool yourself, and please don't fool others. Based on my discussions with various people at work and at conferences, on Quora Q&As, and on content in both mainstream and social media, I get these images (sketches below) in my mind that reflect the current trend in big data. [Thanks to Rekha (my wife) for the sketches; note that she is not a professional sketcher.]

Company A

looking at data is an art

Company A has already derived value from the data that was low-hanging fruit (e.g. customer data in an Oracle DB, supplier data in SQL Server, Excel sheets from a couple of departments, HR data in SAP HANA, XML and CSV downloads from a couple of websites, etc.). Now they think it's time to go after user behaviour data such as web tracking, mobile app usage and social media data, for example from Twitter and Facebook. This is very logical and makes perfect sense.

Company B

Company B, on the other hand, hasn't even started looking at the low-hanging fruit on its tree. The fruit is almost hitting them in the eyes, but they choose to be blind to it, or push it away, and want to go directly for the fruit at the top of the tree. Because the tree is too tall and they don't know how to climb it, they can't see whether there really is fruit at the top of their tree. But they can very well see fruit at the top of their neighbour's tree, and based on that they are made to believe (by highly paid consultants, of course) that their tree also has a lot of fruit at the top and that they should not waste time, because their neighbour is already going for the fruit at the top. They don't realize that their tree may not have any fruit at the top, that even if it does, the fruit could be fully or partially rotten, and that the tree climbers' company is making all the money in the meantime. Low-hanging fruit is easy to examine: just turn it around and see if it is still good. You don't need expert tree climbers, so your investment is lower and your return is higher than if you get expert tree climbers to go up the tree, crushing all the low-hanging fruit on their way up, only to find that there was nothing but rotten fruit at the top. I hope the CXOs get the point, at least now. Focus on your own tree, which is different from your neighbour's tree. Focus on the low-hanging fruit first.

I hope that with all of the explanations above it is now clear that if you take out everything that is not big data, you are left with only a part of the whole data about any subject. The subject could be customers, sales, campaigns, or all of these together. So now you see that big data is just a small thing in the overall data landscape. Companies like Google, Facebook, Twitter and LinkedIn have no choice but to focus on big data, because they have to focus on user data. Note that not all users are customers. We are all users of Facebook, LinkedIn and Twitter, but we are not all customers of these companies. We don't pay them, so we are not customers. But our usage data becomes the raw material with which they are able to offer products and services to their paying customers. So, depending on the type of business you are in, you need to have your own data strategy in place and not copy and paste the data strategy of another company. There is no denying that data is an important asset. And from my point of view, data is an asset not only to commercial businesses but also to nations and individuals (see Individual Business Intelligence).

Data on its own, no matter how big or small, how simple or complex, doesn't provide information and insights. It is a very important raw material from which information and insights can be derived. For that to happen, i.e. for any value to be derived from that raw material, either a human or software has to work on it. As big data is a subset of data, big data is also a raw material. Big data analytics is analytics performed on big data. Data analytics is analytics performed on all data. Note that big data analytics doesn't necessarily mean big insights. This is one more reason to say that big data is actually small: small compared to data, and small compared to the value derived from analytics.

At this point I would like to turn to the small changes that have happened in the area of business intelligence since big data arrived. Business Intelligence, in short, is about deriving information and insights from data efficiently to enable better and quicker decision making in order to improve the business.

Business Intelligence before big data

Architecture of business intelligence before big data

The architecture above is a typical business intelligence solution architecture backed by a data warehouse, without a big data source. Before big data, mostly structured data such as sales, transactions, HR data, users' Excel files and downloads from websites was already covered by the data warehouse. These are the sources in which the most important data resides, for example the actual sales, customer and employee data from which information and insights can be derived.

Business Intelligence after big data

Architecture of business intelligence with big data

For BI, not much has changed, because BI is not limited by data size or type, or by technology or methodology. As long as there are businesses, there will be BI. People may try to give it different popular and fancy names, but the concept of deriving information and insights from data will remain. In this case, big data sources are just some more data sources that have to be integrated with the data warehouse. If anything of real value is to be delivered, the data from big data sources has to be integrated with data from non big data sources; only then does the data in the big data sources get its proper meaning. For example, let's say we get mobile app data from one of the big data sources. It is not of much use if we cannot combine it with customer or transaction data from non big data sources.
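
To make the mobile app example a bit more concrete, here is a minimal Python sketch of what "combining" means. Everything in it is an assumption made for illustration: the function name `combine_sources`, the field names (`customer_id`, `name`, `amount`) and the sample records are hypothetical and do not come from any real system. In a real solution this join would normally happen in the data warehouse.

```python
from collections import defaultdict


def combine_sources(app_events, customers, transactions):
    """Attach app usage and revenue to each known customer.

    app_events   -- records from the big data source (hypothetical shape)
    customers    -- customer master data keyed by customer_id
    transactions -- transaction records from a non big data source
    """
    # Count app events per customer.
    event_counts = defaultdict(int)
    for event in app_events:
        event_counts[event["customer_id"]] += 1

    # Sum transaction amounts per customer.
    revenue = defaultdict(float)
    for txn in transactions:
        revenue[txn["customer_id"]] += txn["amount"]

    # Only customers found in the master data end up in the report;
    # app events without a matching customer carry no business meaning here.
    report = []
    for customer_id, info in sorted(customers.items()):
        report.append({
            "customer_id": customer_id,
            "name": info["name"],
            "app_events": event_counts.get(customer_id, 0),
            "revenue": revenue.get(customer_id, 0.0),
        })
    return report


if __name__ == "__main__":
    # Illustrative sample data only.
    app_events = [
        {"customer_id": 1, "event": "open_app"},
        {"customer_id": 1, "event": "view_offer"},
        {"customer_id": 3, "event": "open_app"},  # unknown customer
    ]
    customers = {1: {"name": "Asha"}, 2: {"name": "Ben"}}
    transactions = [
        {"customer_id": 1, "amount": 120.0},
        {"customer_id": 2, "amount": 35.5},
    ]
    for row in combine_sources(app_events, customers, transactions):
        print(row)
```

In this sketch the app event for customer 3 never shows up in the report, because no customer record gives it a meaning. On its own, the app data only says that "someone opened the app".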

Now some people are trying to call this BI "big data analytics" just because one of the sources is big data. And some want to focus only on big data analytics. What do they mean? Won't they perform analytics on data that is not big data? If you have employee or customer information in an Oracle or SQL Server DB, will you intentionally be blind to it and focus only on web tracking, mobile apps or Twitter? So here is one more reason why big data is actually small: in the context of BI, big data is just another data source that can be used to derive information and insights. BI professionals generally do not bother with the complexity of source applications such as ERP or CRM. What BI professionals look at is how to get the data out of an application, not how that application is built. So for us it doesn't matter whether the data is in CSV, Excel, XML, a DB (Oracle, SQL Server, Greenplum, SAP HANA, etc.), proprietary data repositories, websites, HDFS or log files, and it doesn't matter whether the data is stored on-premises or in the cloud. For us it is just another source of data. If we have the right connectors (most tools already do) to connect to the sources and extract the data, we will get it, arrange it and present it in the way the business requires.
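
The "just another source" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the reader functions, the `READERS` table and the `extract` function are hypothetical names, the XML layout (one element per record, with values as attributes) is assumed, and real BI tools ship far richer connectors for databases, HDFS and so on. Only the Python standard library is used.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def read_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return list(json.loads(text))


def read_xml(text):
    """Parse XML where each child element holds one record as attributes."""
    root = ET.fromstring(text)
    return [dict(element.attrib) for element in root]


# One "connector" per source format; adding a source means adding an entry.
READERS = {"csv": read_csv, "json": read_json, "xml": read_xml}


def extract(source_format, text):
    """Return records in one common shape, whatever the source format."""
    return READERS[source_format](text)


if __name__ == "__main__":
    print(extract("csv", "id,name\n1,Asha\n"))
    print(extract("json", '[{"id": "2", "name": "Ben"}]'))
    print(extract("xml", '<rows><row id="3" name="Chen"/></rows>'))
```

Whatever the source, the code downstream of `extract` sees the same list of records. That is why, from the BI side, a big data source is handled like any other source.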

So, to conclude:
  1. Big data is just a subset of data. The current industry definitions are not good enough, and some people are trying to glorify big data by adding concepts that are not big data to its definition.
  2. Data is an important asset not just for businesses but also for nations and individuals.
  3. The value derived from big data is not necessarily big compared to the value derived from non big data sources.
  4. Focus on your own tree, which is different from your neighbour's tree. Focus on the low-hanging fruit first.
  5. Business intelligence will exist as long as businesses exist. People may use fancy or popular names, but the concept remains the same.
  6. For business intelligence, big data is just another source of data.
  7. And finally, as big data is a subset of data, it has to be smaller than data.
If anyone can't agree with this reasoning, I kindly request that they share, for all of us reading this article, just one real example of a so-called big data project that they have implemented successfully and that has been in use for at least 3-4 years. What was the real challenge? Why exactly was it called a big data project? Was it because of volume, velocity, variety or a combination of these? What were the volumes, what was the velocity, and which varieties of data did you deal with? How much was the total spend? What was the outcome? Of course, we don't need any confidential information. Please don't give examples of what someone else has done. Too much has already been written (without any real outcome) about what others have done.

Disclaimer: All points here are my own views and do not represent the views of my current or previous employers or any of my clients.

I do not have funding to market such unbiased views. If you think such unbiased information should reach more people then please share it with your circle.

