Thursday, July 23, 2015

Hacking your Data Warehouse with Notebooks

Notebooks are great for doing exploratory analytics iteratively over large and complex data sets. The emphasis here is on iterative exploration  - letting you construct and run queries interactively over and over, building on previous queries and their results, till the exploration yields the insights you are after.

If you are using a next gen big data platform like Apache Spark you know the value of using notebook style analytics. But what if your data is in a data warehouse like dashDB? If you are still using traditional SQL query tools, good news is that you now can easily leverage IPython or Jupyter notebooks to hack your warehouse data interactively.

Let me first briefly describe the tools we'll be using for our demonstration and how to access them.

dashDB - is a next gen column-store based relational data warehousing as a service platform available on Bluemix. For those getting started with dashDB, it comes with an Entry plan that includes a perpetually free tier for up to 5GB of pre-load data - perfect for our little experiment. For serious users there are several Enterprise plans that let you manage and analyze multi-terabytes of data. There is now even an MPP version of dashDB which lets you scale your data over a cluster of several bare metal nodes.

To get dashDB, simply head over to bluemix.net and if you aren't already a user it takes just a minute to setup your free trial account.Once signed up go into the catalog, scroll down to Data and Anlytics category, click on dashDB and then Create an instance of the service using the Entry plan:



Data Scientist Workbench - provides a browser-based notebook as a service environment and 
supports R, Python, and Scala. It lets you import files, load data, explore data, develop analytics scripts and models, and plot & visualize results by leveraging a variety of pre-installed libraries and allowing you to easily import new ones as needed. Moreover, Data Scientist Workbench is a collaborative environment enabling you to easily share your notebooks and build on an abundant wealth of notebooks and analytics models built by others. And as a bonus, you also get a local Spark environment to help you get started with developing big data applications without having to deploy an expensive Spark cluster.

The Workbench is currently available as tech preview and you can sign-up for a free account at DataScientistWorkbench.com by clicking the Get started now button:



Once you've signed up and logged into Data Scientist Workbench click-on the My Notebooks tab on the top nav: 

There are several tutorials to get you up to speed quickly on Data Scientist Workbench and it is highly recommend that you follow through a first few, especially if you haven't had much experience with notebooks previously.


Lets now get to accessing your dashDB warehouse data from notebooks in Data Scientist Workbench. To make it easy, my colleague @bsteinfe has created a sample notebook that downloads and installs the Python driver for dashDB within Data Scientist Workbench, imports the relevant libraries and then runs a test query.

So the first thing is to import the sample notebook. In the right corner of the top nav bar of the Workbench is a search/import box. Simply copy and paste this URL in the box and press enter: https://share.knowledgeanyhow.org/#/api/v1/workbench/10.114.214.50/shares/oxF0Doaejcz7slI/Setup%20ibm-db%20python%20driver.ipynb


Once the driver notebook is imported, run  (or Ctrl + Enter) the first few cells of the notebook to download, extract and install the driver. You will see the progress and if any unexpected errors occur. Warnings can be ignored. Note that these steps of downloading and installing the driver need only be performed once on your Workbench and you do not need to do it in every notebook. In fact, going forward, this step will not be needed at all as the Wokbench will be updated to have the driver pre-installed.

Before you get to the next cell which imports the driver in the notebook:


if it is the first time you are doing the import after installing the driver, it is advisable to first restart the kernel (otherwise you might see an error). There is a Restart option under Kernel menu item in the notebook menu bar:
For the next step you will need credentials of your dashDB instance. Head over to Bluemix, and in the dashboard click on your instance of dashDB and then Launch the dashDB console:

In the dashDB console go into the Connect menu and select Connection Information


From here you will need to copy the  hostname, userid and password:


and paste them into the relevant cell in the notebook:


That's it. If it all went smoothly, executing the cell of the notebook will connect to dashDB and retrieve contents from a sample table into a data frame:


You are now ready to play with other data in your warehouse and manipulate it in Python notebooks.

In future posts we will look at some fun hacks of your warehouse data.

The forecast for dataville is partly cloudy.

The forecast for dataville is partly cloudy.