Channel 9 has an amazing abundance of informative material on SQL Server, Microsoft BI, and Big Data. Saptak Sen (Microsoft) and Bill Ramos (Advaiya) have produced the newest series of videos, which covers Microsoft’s Azure and HDInsight offerings and, most importantly, how they integrate with Microsoft’s Business Intelligence stack. I’ve watched three of the videos so far and really enjoyed the #4 Mahout video.
Check out the entire Big Data Analytics course here:
If you want to go directly to any of the videos, here you go:
In this module, you will learn how to use Microsoft Excel Power Query with PowerPivot to mash up data from a variety of sources including Hive tables, Windows Azure Data Marketplace, and web sources. [01:45] – Power Query Excel Add-In [04:21] – Excel Power Pivot Add-In [06:01] – Demo Big Data
This module explains how to use Microsoft Excel Power View and Power Map add-ins to visualize data mash-ups from a PowerPivot model to create charts and map-based analysis. [01:36] – Excel Power View Add-In [07:31] – Demo Creating Power View Reports [12:38] – Power Map Excel Add-In [13:52] – Demo…
In this module, you will find out how to use SQOOP to perform high-speed data transfers from a Hive table on an HDInsight cluster to a Windows Azure SQL database. You will then see how to create and deploy reports on Windows Azure Reporting Services. [01:11] – Working with SQOOP in Microsoft HDInsight…
This module shows how to use the Microsoft Excel Data Mining add-in along with SQL Server Analysis Services to perform key influencers and categorization data mining techniques. You’ll learn how to install and use Apache Mahout on HDInsight. [01:02] – Data Mining [07:00] – Demo Excel Data Mining…
In this module, you will learn how to use Windows Azure tables and MongoDB as NoSQL technologies for your Big Data solutions. You’ll see how to create a .NET application for accessing Azure tables. You’ll also learn how to install and use MongoDB on a server. [00:54] – Windows Azure Table Storage…
Hortonworks gets credit for the release of the month with their Hortonworks Data Platform 2.0. What I like about the “data platform” name is that it truly is a platform: they don’t release anything piecemeal but work tirelessly to ensure that all the parts play nicely together for you. This release, based on Hadoop 2.0, includes all the YARN additions along with many improvements to technologies you may have been using already, like Hive and HBase.
Want to know about YARN?
Hortonworks refers to YARN as the new OS for Hadoop. It provides the flexibility to run data processing workloads beyond MapReduce. Additionally, YARN provides improved management and monitoring, multi-tenancy, improved security, high availability, and improved disaster recovery.
Take a look at this link to get more information on HDP 2.0 and YARN.
What about Hive?
HDP 2.0 includes Hive 0.12, which brings a host of improvements. Query speed and SQL compatibility were the major focuses. Query plan generation, GROUP BY, and optimizations to COUNT stick out for me. SQL support improvements include VARCHAR support, DATE support, and truncation support. There are literally dozens of other improvements. You can get a deeper read on the improvements here.
Apache Ambari and HBase saw significant improvements. Ambari allows you to provision, manage, and monitor a cluster running on Hadoop 2, including support for NameNode High Availability. More Ambari 1.4.1 information can be found here. HBase improvements include Snapshots, support for Windows(!), and reduced mean time to recovery. Check out this page for more HBase 0.96 information.
Want to learn more?
Join Hortonworks on November 12 for a webinar outlining the YARN-based architecture of HDP 2.0. They’ll discuss all the latest improvements in HDP 2.0 for technologies like Hive, Ambari, and HBase. Jump here to register for the webinar.
Obviously much has gone into Hortonworks HDP 2.0. This is a momentous release. It’s impressive that they have been able to package it all up together like this in one release. They should soon have a sandbox of the GA release here. I heard it should be available sometime the week of Oct. 28th, so if you don’t see it yet, keep checking back.
Looking for a solution that helps move operational data out of your SQL Server, Oracle, or DB2 environment? Need it to be agentless on both the source and the destination? Read on about Attunity’s new Click-2-Load solution.
Warning: I have not yet used this product, so I can’t vouch for it. I’m simply relating what I’ve read in their materials. Once I’ve personally tested the product, I will write a review.
Essentially, Attunity provides a point-and-click product that you can configure to do one of two things:
1. Bulk read changes directly out of the source database and bulk load them into PDW.
2. Through the use of Attunity’s CDC product, read the transaction log from the source and stream the changes into PDW.
This is all done through SQL Server PDW’s Bulk Loader API, which has been published for a while, but this is the first vendor product I’ve seen take advantage of it. If anyone has used it, please comment on your experiences.
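To make the difference between the two modes concrete, here is a minimal conceptual sketch in Python. It is not Attunity’s or PDW’s API (I haven’t used either from code); plain dictionaries stand in for the source and the PDW target, and the names `bulk_load`, `apply_changes`, and the sample rows are all made up for illustration.

```python
# Conceptual sketch only: dictionaries stand in for the source database and
# the PDW target. Nothing here is Attunity's or PDW's actual API.

def bulk_load(source_rows):
    """Mode 1: read all rows from the source and load them in one batch."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_changes(target, change_log):
    """Mode 2: replay transaction-log style entries against the target."""
    for op, row in change_log:
        if op in ("insert", "update"):
            target[row["id"]] = dict(row)
        elif op == "delete":
            target.pop(row["id"], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical sample data.
source = [{"id": 1, "city": "Charlotte"}, {"id": 2, "city": "Seattle"}]
target = bulk_load(source)

# Changes captured from the (imaginary) source transaction log.
change_log = [
    ("update", {"id": 2, "city": "Redmond"}),
    ("insert", {"id": 3, "city": "Atlanta"}),
    ("delete", {"id": 1}),
]
target = apply_changes(target, change_log)
print(target)
```

The bulk path moves a whole batch at once, while the CDC path only touches the rows that changed, which is why it suits streaming updates.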
Here is the link:
Barbara Kess and Dan Kogan have written an excellent document on the state of SQL Server 2012 Parallel Data Warehouse. This document steps through PDW’s design and describes why you should expect amazing results for your data warehouse experience. More importantly, what I liked about the document was the many customer references describing what their experience with SQL Server PDW has been like. These next few quotes just blew me away:
Our queries are completing 76 times faster on PDW. This was after PDW compressed 1.5 TB to 134 GB.
Our daily load now takes 5.5 minutes on PDW whereas it used to take 2.5 hours without PDW. The load is 27 times faster on PDW.
5.5 TB (uncompressed) compressed to 400 GB (14x compression).
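The ratios in these quotes are easy to sanity-check. Here is a quick Python sketch, assuming decimal units (1 TB = 1000 GB); the variable names are mine, not the whitepaper’s.

```python
# Sanity-check the speedup and compression figures quoted above.
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1000 GB


def ratio(before, after):
    """How many times larger 'before' is than 'after' (same units)."""
    return before / after


# 2.5 hours down to 5.5 minutes, both expressed in minutes.
load_speedup = ratio(2.5 * 60, 5.5)

# 1.5 TB compressed to 134 GB.
compression_a = ratio(1.5 * GB_PER_TB, 134)

# 5.5 TB compressed to 400 GB.
compression_b = ratio(5.5 * GB_PER_TB, 400)

print(f"Daily load speedup: {load_speedup:.1f}x")
print(f"1.5 TB -> 134 GB:   {compression_a:.1f}x")
print(f"5.5 TB -> 400 GB:   {compression_b:.1f}x")
```

The load speedup works out to roughly 27x and the second compression figure to roughly 14x, which matches the quotes. The first quote doesn’t state a ratio, but 1.5 TB to 134 GB is roughly 11x.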
You’ll have to read the document yourself to find out what other customers are saying, including some good links to full reference stories.
Download the whitepaper today
I have a couple of sessions in which I’ll be speaking about data warehousing at the PASS Summit in Charlotte, NC. One of them will be a deep dive into SQL Server 2014 Clustered Columnstore Indexes. If you are interested in how clustered columnstore indexes can dramatically improve the performance of your data warehouse on SQL Server 2014, please join us.
SQL Server 2014 Clustered Columnstore Indexes: Friday, Oct. 18th, 2013 1:00 pm Room Ballroom B
Have you experienced the blazingly fast query performance enabled by columnstore indexes and batch mode processing? Are you wondering what’s next for these revolutionary data warehouse features? In this session, we’ll examine new query processing enhancements to extend the benefits of batch mode processing, including updatable clustered columnstore indexes. More query types will benefit from batch mode, and larger proportions of your complex queries will be executed in batch mode. Come learn about these new capabilities for processing data from columnstore indexes and how to take advantage of the benefits.
Here’s a link to the session: