Skip to content

PDW: Dealing with Nullable Distribution Keys

2013 March 7
by Brian Mitchell

Choosing your distribution key is one of the most important things you do during the architecture of any database on SQL Server PDW. A distributed table in SQL Server PDW is a table with rows distributed across several physical storage destinations. Each of these physical locations is known as a distribution. For each compute node you have in the appliance, you have 8 distributions and any row that is inserted into the distributed table will live in one and only one of those distributions. The location of that row is based on a deterministic function that is assigned to that rows based on the column you choose for the distribution key.

Guidance for choosing a good distribution key includes:

  • Choose a column with high cardinality
    • Choose a column with an order of magnitude of several more than the number of distributions that you have in your appliance. If you have 80 distributions, choose a column that has only has a several hundred or more distributions. If you have 320 distributions in your appliance, you should be looking for a distribution key that has at least 500 unique values. Either way, you should validate your chose by using DBCC PDW_SHOWSPACEUSED.
  • Choose a column that is not skewed
    • Skew means there are many more rows on average in one or more distributions than in the others.
  • Choose a column that is not usually limited in the where clause
    • For example, if you choose a date key as the distribution key and then filter on OrderDate = 20130214, then you are querying a single distribution while the others lay idle.
  • Choose a column that does not allow NULLs.
  • If you do choose a column that does allow nulls, ideally, the number of nulls will not exceed the average number of rows for any other distribution value.

What happens when the distribution key is nullable and you know there will be many nulls? If there are many more nulled keys than other keys, you will be introducing skew into your database. For example, if 25% of the rows in your table allow a null for the distribution key, then all of those rows will fall into a single distribution. If it is significant enough, this will cause performance degradation for queries hitting this distributed table.

Here you have a couple options:

  • Choose a different distribution key
    • Depending on your situation, this may be an easy option. Usually the question comes down to: are you using that same distribution key in many other distributed tables hoping to get locality when doing Distributed table to Distributed table (often fact to fact) joins. If you are, it may be more difficult to change this one table to another key because you will lose the locality that you desire.
  • Keep the Nulls but break out your table.
    • Create two tables.
      • The first is with the original schema including the nullable distribution key.
      • The second is a new one with the same schema except choosing another column for the distribution key based on the rules above.
    • Change your ETL (ELT)
      • All rows with a value for the distribution key go into the original table. This table now is now evenly distributed.
      • All rows with a NULL value for this column go into the second table with the new distribution key. This table will be evenly distributed.
    • Create a View over both the tables with the original table name.
      • This will minimize data movement when you join this table to the other distributed tables based on the original distribution key.

Let’s use the venerable AdventureWorksDW database to illustrate this. But let’s assume that instead of selling just bikes, they sold all sorts of products: most physical and some virtual like games and music downloads. What would our FactInternetSales table look like? There are lots of keys

Imagine a scenario where for whatever reason we really wanted SalesTerritoryKey to be our distributed key. Most likely because we are using it in many other tables in the database because it satisifies all the criteria above and provides us a usefull distribution key except in this one table. In FactInternetSales, we don’t know the SalesTerritoryKey for digital downloads, so that value is null for 20% of our rows in this table. This makes SalesTerritoryKey a poor choice for a distribution key in this table. But we really don’t want to change it because it works so well in all the other tables in our data warehouse. This is where you can use the solution above to create two new tables with two different distribution keys and place the view above it.

Original Table

New View

 

 

 

 

 

Do I suggest using this solution on a regular basis? No. I would always try to refactor the table to use an appropriate distribution key that handles the situation without breaking it up into two tables. There is usually a good solution that doesn’t require this kind of major architectural change to your database. But every once in a while you will find yourself stuck and this can be an effective solution.

No comments yet

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS