Using Sqoop to load data into HDInsight
I’m going to write a series of posts that answer common use-case questions: “How do I…?” kind of posts. Running through my head, I want to know how to load data into the HDInsight Service from SQL Server and how to load data into SQL Server from the HDInsight Service. Should I use SSIS? Should I use Sqoop? Should I do something else? We’ll answer these one at a time. Today’s use case: I have data in a SQL Server VM on Windows Azure and I want to load it into a Hive table on the HDInsight Service. There are a few configuration steps needed to set up connectivity between the two systems, and then we’ll load the data.
The data set I’m using is from the Research and Innovative Technology Administration’s Bureau of Transportation Statistics. It’s essentially a compilation of 20 years’ worth of airline on-time arrival statistics, a topic near and dear to many of us consultants.
To get started setting up a VM in Windows Azure, check out this post: http://www.windowsazure.com/en-us/manage/windows/common-tasks/install-sql-server/. It provides great guidance on opening up the proper firewall rules and creating a SQL user that you will need later when accessing the data remotely. If you can connect SQL Server Management Studio from your laptop to the VM in the cloud and browse the SQL Server objects, then Sqoop will be able to access it too.
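For reference, creating the SQL login and mapping it into the database might look something like the T-SQL below. This is a sketch: the login name and password are placeholders, and the user mapping assumes the AirStats database (created in the next step) already exists.

```sql
-- Run on the SQL Server VM. Login name and password are placeholders.
CREATE LOGIN sqoopuser WITH PASSWORD = 'YourStrongPassword!1';
GO
-- Map the login into the database once it exists.
USE AirStats;
CREATE USER sqoopuser FOR LOGIN sqoopuser;
-- db_datareader is sufficient for a read-only Sqoop import.
EXEC sp_addrolemember 'db_datareader', 'sqoopuser';
```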
On the SQL VM, I created a database called AirStats and then a table called ontime (below) and loaded two years’ worth of data (1.3 GB). I’ll do a larger test later, but I just wanted to get something working for this test.
create table ontime (
    Year int, Month int, DayofMonth int, DayOfWeek int,
    DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int,
    UniqueCarrier varchar(5), FlightNum int, TailNum varchar(8),
    ActualElapsedTime int, CRSElapsedTime int, AirTime int,
    ArrDelay int, DepDelay int,
    Origin varchar(3), Dest varchar(3), Distance int,
    TaxiIn int, TaxiOut int,
    Cancelled int, CancellationCode varchar(1), Diverted varchar(1),
    CarrierDelay int, WeatherDelay int, NASDelay int,
    SecurityDelay int, LateAircraftDelay int
);
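Loading the downloaded CSV files into this table can be done with BULK INSERT; a rough sketch is below. The file path is a placeholder for wherever you unzipped the BTS data, and you’d repeat the statement per yearly file.

```sql
-- Path is a placeholder; repeat for each yearly CSV file.
BULK INSERT dbo.ontime
FROM 'C:\data\ontime\2007.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2  -- skip the header row
);
```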
Next it’s time to run Sqoop from the head node of the HDInsight cluster. Log into the remote desktop console of HDInsight (click on the big square that says Remote Desktop) and launch the Hadoop Command Line prompt from the desktop. A couple of notes about the code below: I’ve starred out the username and password for obvious security reasons; you would include your actual username and password in the command. --table is the table that will be imported. --target-dir is the HDFS destination directory. The -m switch is the number of map tasks used to import in parallel.
c:\apps\dist\sqoop-1.4.2\bin>sqoop-import.cmd --connect "jdbc:sqlserver://brianwmitchell1.cloudapp.net;username=*****;password=*****;database=AirStats" --table ontime --target-dir asv://brianwmitchell/ontime -m 1
It took 4 minutes and 11 seconds to move the data over. We’ll worry about performance later by playing with the parallelism switch, adding a clustered index to the source table, and whatever else I can think of. Once the data is safely in Azure storage, you can create a Hive table over it. Here I create an external Hive table over the data:
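As a preview of the parallelism experiment, a multi-mapper import might look like the command below. Since the ontime table has no primary key, Sqoop needs --split-by to tell it which column to partition the work on; the column choice here is just illustrative, and an evenly distributed column will split better than a skewed one.

```
sqoop-import.cmd --connect "jdbc:sqlserver://brianwmitchell1.cloudapp.net;username=*****;password=*****;database=AirStats" --table ontime --target-dir asv://brianwmitchell/ontime2 -m 4 --split-by FlightNum
```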
create external table ontime (
    Year int, Month int, DayofMonth int, DayOfWeek int, DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int, UniqueCarrier string, FlightNum int, TailNum string,
    ActualElapsedTime int, CRSElapsedTime int, AirTime int, ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int,
    Cancelled int, CancellationCode string, Diverted string, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int
) row format delimited
fields terminated by ','
stored as textfile
location 'asv://brianwmitchell/ontime';
Now let’s check on the data. Running a quick query from the Hive interactive console shows that I now have data.
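For example, a sanity check from the Hive console might look like the queries below; the exact numbers will depend on which two years of data you loaded.

```sql
-- Confirm rows landed, then try a simple aggregate
select count(*) from ontime;

select UniqueCarrier, avg(ArrDelay) as avg_arr_delay
from ontime
group by UniqueCarrier;
```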
Now I’ve got to figure out something to do with it. That’s next.