Big Data, Déjà Vu All Over Again
For the last few months I’ve been hunkered down in my bunker trying to figure out Big Data. Let me tell you, it is an elusive animal. Trying to explain Big Data to people is difficult. Ask two different people what it is and you will get two different answers. Ask the same person on two different days and you will likely get two different answers. Heck, I may give you two different answers in this blog post. About six months ago, I moved to a group within Microsoft called the Big Data Center of Expertise. Sometimes we call ourselves the Big Data / Business Intelligence Center of Expertise. Now that’s a mouthful. But I think it’s kind of instructive, because what we did was create a COE around two topics that have been notoriously hard to define. Let’s take ourselves back a decade…
Explaining Business Intelligence (BI)
Remember the early days of Microsoft Business Intelligence?
“Which product is it?”
Well, it’s not a product at all, it’s a concept. On the other hand, it’s a collection of products. Microsoft BI includes SQL Server Analysis Services, SQL Server Reporting Services, and SQL Server Integration Services. In reality it also included Excel as a key component. As a concept, Business Intelligence is a set of practices, architectures, and processes, aligned with specific technologies, that transform transactional data into reporting that supports better decision making. It grew from there into actionable dashboards, alerts, data mining, and many of the other things we now think of as BI.
In the beginning this was hard to describe, but over time people seemed to understand that BI was most everything you did to get data either into or out of a data warehouse and provide it back to the business in some really cool ways. Back in the day, sometimes I would get so exasperated trying to explain BI, I would give up and say “It’s basically a marketing term to describe a set of products that many customers don’t know yet that they need to have to stay competitive.” Let me tell you, that didn’t help them understand it at all.
Now we mostly know that BI often sits on top of a data warehouse, but it doesn’t have to. Additionally, not every data warehouse has BI on top of it. Sometimes BI is driven directly out of the transactional system, and Microsoft has gone to great pains to make sure that organizations can do this with many of our products (PowerPivot is a great example). Also, our stack of BI products has grown to include PowerPivot, PowerView, and SSAS Tabular Mode. Many of these tools can now be hosted natively in SharePoint, providing a central location for users to access the intelligence we are extracting. BI has grown up and people generally get it.
Explaining Big Data
Here we are in the early days of the Big Data Revolution.
“Which product is it?”
Well, it’s not a product at all, it’s a concept (we’ll come back to this). On the other hand, it’s a collection of products. The Microsoft Big Data stack includes HDInsight Service, HDInsight Server, StreamInsight, and some would include SQL Server Parallel Data Warehouse. In reality, Excel is also a key component, since we include a Hive ODBC driver that allows Excel to import data directly from Hive on HDInsight. One might argue that the way to describe Big Data is to say that “Big Data is a set of products that many customers don’t know yet that they need to have to stay competitive.” Many of these tools will be hosted natively in Microsoft Azure and SharePoint.
See what I did there?
Explaining Big Data, Part II
Big Data is any data that is difficult to process using traditional relational database technologies and methods due to the large size and complexity of the data. What that means is that we have to have new tools to handle it. In order to scale out appropriately, we will use technologies that are massively parallel in nature. Organizations may use a distributed data warehouse appliance such as SQL Server Parallel Data Warehouse in their first foray into big data. Some will argue this isn’t truly big data, as it can’t yet scale into the dozens of petabytes. I say let’s apply the previous definition to it: if a customer has gotten to the point that their traditional technologies and methodologies no longer allow them to handle the data sizes they are currently experiencing, and SQL Server PDW helps alleviate this pain, then this is Big Data for them. We can call it Medium Data if you want. As the complexity of the data increases or the sizes of the data continue to increase, we will likely need another scalable technology such as Hadoop (HDInsight). There are other tools available for specific scenarios, like handling fast-moving streams of data. For this kind of data you may need a complex event processing tool like StreamInsight that allows you to report on the data before you store it in the warehouse.
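To make the "report on the data before you store it" idea a little more concrete, here is a minimal Python sketch of one common stream-processing pattern: averaging readings over fixed (tumbling) time windows as they arrive. To be clear, this is not StreamInsight code and does not use its API. The event shape, the function name `tumbling_window_averages`, and the assumption that events arrive in timestamp order are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp_seconds, sensor_id, reading).
Event = Tuple[float, str, float]


def tumbling_window_averages(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, float]]]:
    """Yield (window_start, {sensor_id: average}) for each fixed-size window.

    Assumes events arrive in timestamp order. Only the running sums and
    counts for the current window are held in memory; raw events are never
    stored.
    """
    current_start = None
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)

    for timestamp, sensor_id, reading in events:
        # Align the event to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete: report it, then reset.
            yield current_start, {k: sums[k] / counts[k] for k in sums}
            sums.clear()
            counts.clear()
            current_start = window_start
        sums[sensor_id] += reading
        counts[sensor_id] += 1

    # Report the final, possibly partial, window.
    if current_start is not None:
        yield current_start, {k: sums[k] / counts[k] for k in sums}
```

Because the function is a generator, each window’s summary can be pushed to a dashboard or alert as soon as the window closes, and only the summaries (rather than every raw event) need to land in the warehouse. A real complex event processing engine also deals with out-of-order events, overlapping windows, and scale-out, all of which this sketch ignores.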
At the risk of being too cute with analogies, this might help those who frequent Starbucks understand the scope of data:
A more appropriate way to visualize it may be this:
Big Data is the natural evolution of new sets of tools and methodologies to handle new data sizes and complexities. We’ll slowly learn them and incorporate them into our organizations. A decade from now we’ll wonder what all the hubbub was about. Sometime between now and then I’ll probably be blogging about the new tool to handle Trenta-sized data.