Thursday, July 23, 2015

Hacking your Data Warehouse with Notebooks

Notebooks are great for doing exploratory analytics iteratively over large and complex data sets. The emphasis here is on iterative exploration - letting you construct and run queries interactively over and over, building on previous queries and their results, until the exploration yields the insights you are after.

If you are using a next-gen big data platform like Apache Spark, you know the value of notebook-style analytics. But what if your data is in a data warehouse like dashDB? If you are still using traditional SQL query tools, the good news is that you can now easily leverage IPython or Jupyter notebooks to hack your warehouse data interactively.

Let me first briefly describe the tools we'll be using for our demonstration and how to access them.

dashDB - a next-gen, column-store based relational data-warehousing-as-a-service platform available on Bluemix. For those getting started with dashDB, it comes with an Entry plan that includes a perpetually free tier for up to 5GB of preloaded data - perfect for our little experiment. For serious users there are several Enterprise plans that let you manage and analyze multiple terabytes of data. There is now even an MPP version of dashDB that lets you scale your data across a cluster of several bare-metal nodes.

To get dashDB, simply head over to Bluemix; if you aren't already a user, it takes just a minute to set up your free trial account. Once signed up, go into the catalog, scroll down to the Data and Analytics category, click on dashDB, and then Create an instance of the service using the Entry plan:

Data Scientist Workbench - provides a browser-based notebook-as-a-service environment and supports R, Python, and Scala. It lets you import files, load data, explore data, develop analytics scripts and models, and plot and visualize results by leveraging a variety of pre-installed libraries, and it allows you to easily import new ones as needed. Moreover, Data Scientist Workbench is a collaborative environment, enabling you to easily share your notebooks and build on the wealth of notebooks and analytics models built by others. And as a bonus, you also get a local Spark environment to help you get started with developing big data applications without having to deploy an expensive Spark cluster.

The Workbench is currently available as a tech preview, and you can sign up for a free account by clicking the Get started now button:

Once you've signed up and logged into Data Scientist Workbench, click on the My Notebooks tab in the top nav:

There are several tutorials to get you up to speed quickly on Data Scientist Workbench, and it is highly recommended that you work through the first few, especially if you haven't had much experience with notebooks before.

Let's now get to accessing your dashDB warehouse data from notebooks in Data Scientist Workbench. To make it easy, my colleague @bsteinfe has created a sample notebook that downloads and installs the Python driver for dashDB within Data Scientist Workbench, imports the relevant libraries, and then runs a test query.

So the first thing to do is import the sample notebook. In the right corner of the top nav bar of the Workbench is a search/import box. Simply copy and paste this URL into the box and press Enter:

Once the driver notebook is imported, run (or Ctrl + Enter) the first few cells of the notebook to download, extract, and install the driver. You will see the progress, along with any unexpected errors; warnings can be ignored. Note that downloading and installing the driver only needs to be done once on your Workbench - you do not need to repeat it in every notebook. In fact, going forward this step will not be needed at all, as the Workbench will be updated to have the driver pre-installed.
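As a quick sanity check once the install cells finish, a cell like this (my own hypothetical snippet, not part of the sample notebook) confirms the driver is visible to Python; `ibm_db` is the package name the driver ships under:

```python
import importlib.util

# Check whether the ibm_db driver package is importable without actually
# loading it; find_spec returns None when the package is not installed.
driver_ok = importlib.util.find_spec("ibm_db") is not None
print("ibm_db driver installed:", driver_ok)
```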

Before you get to the next cell, which imports the driver into the notebook, note that if this is the first time you are doing the import after installing the driver, it is advisable to restart the kernel first (otherwise you might see an error). There is a Restart option under the Kernel menu item in the notebook menu bar:

For the next step you will need the credentials of your dashDB instance. Head over to Bluemix, and in the dashboard click on your instance of dashDB, then Launch the dashDB console:

In the dashDB console, go into the Connect menu and select Connection Information:

From here you will need to copy the hostname, userid, and password:

and paste them into the relevant cell in the notebook:

That's it. If it all went smoothly, executing the cell of the notebook will connect to dashDB and retrieve contents from a sample table into a data frame:
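For reference, the overall flow of that cell looks roughly like the sketch below. The DSN format is the standard DB2 one that dashDB uses; the hostname, userid, password, and sample table here are placeholders, and the exact cell in @bsteinfe's notebook may differ:

```python
def build_dsn(host, user, password, database="BLUDB", port=50000):
    """Assemble a DB2-style DSN string from the dashDB console credentials."""
    return (
        "DRIVER={{IBM DB2 ODBC DRIVER}};DATABASE={db};HOSTNAME={host};"
        "PORT={port};PROTOCOL=TCPIP;UID={user};PWD={pwd};"
    ).format(db=database, host=host, port=port, user=user, pwd=password)

# Placeholder credentials -- paste your own from the dashDB console.
dsn = build_dsn("your-host.bluemix.net", "your-userid", "your-password")

try:
    import ibm_db
    import ibm_db_dbi
    import pandas as pd

    # Wrap the low-level connection so pandas can use it.
    conn = ibm_db_dbi.Connection(ibm_db.connect(dsn, "", ""))
    # GOSALES.BRANCH is one of the sample tables that ships with dashDB.
    df = pd.read_sql("SELECT * FROM GOSALES.BRANCH", conn)
    print(df.head())
except ImportError:
    print("ibm_db driver not installed -- run the driver install cells first")
```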

You are now ready to play with other data in your warehouse and manipulate it in Python notebooks.
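From there, everything pandas offers is at your disposal. A tiny illustration, with made-up rows standing in for the result of a warehouse query:

```python
import pandas as pd

# Made-up rows standing in for data pulled from a warehouse table.
df = pd.DataFrame({
    "REGION": ["East", "West", "East", "West", "East"],
    "SALES":  [120.0, 90.0, 200.0, 75.0, 60.0],
})

# Aggregate in the notebook just as you would with SQL GROUP BY.
summary = df.groupby("REGION")["SALES"].sum()
print(summary)
```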

In future posts we will look at some fun hacks of your warehouse data.

Friday, May 22, 2015

Learn Spark or Die! Apache Spark Fundamentals.

What is Spark?
I would simply say - Spark is a big data analytics platform designed for speed. Here are a few other definitions for colour:
- Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
- Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.
- Apache Spark is a highly versatile open-source cluster computing framework with fast in-memory analytics performance.

How hot is Spark? 
Spark was pretty high on the hot trending list even a year ago, and you may have seen stats about code commit activity, the amount of Twitter chatter, posts on Stack Overflow, etc. But the way I gauge that a technology is red hot (and not just hyped) is when the skills for it are in high demand and job postings for it are offering good money. Don't take my word for it... take O'Reilly's. As per a recent survey (Nov 2014) by O'Reilly among data scientists, Spark users earn the highest median salary:


So why is Spark so hot? 
For one, it's easy to use and quite versatile, letting you run a variety of workloads under a single hood. You can leverage existing skills (e.g. Python, Java, Scala, R, SQL), and it works with diverse data sets (web, social, text, graph, NoSQL, IoT, etc.), whether saved previously in repositories or streamed in real time. And good news for Hadoop users: Spark runs on existing HDFS infrastructure.

But one of Spark's biggest claims to fame is the ability to run analytic applications up to 100 times faster than Hadoop MapReduce, thanks to in-memory processing. Hadoop has been the technology of choice for big data processing for almost a decade, and while Hadoop is good for batch-oriented MapReduce-type jobs, Spark works well with iterative algorithms that build upon each other, is good at interactive queries, and can even do real-time analysis.

Like MapReduce, Spark provides parallel distributed processing, fault tolerance on commodity hardware, scalability, and so on. Spark builds on these concepts with aggressively cached in-memory distributed computing, low latency, high-level APIs, and a stack of high-level tools. This saves time and money.
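To make the caching point concrete, here is a minimal PySpark sketch (a hypothetical example of my own, assuming a local Spark install such as the one bundled with Data Scientist Workbench). The `cache()` call is what keeps the parsed RDD in memory, so the second action reuses it instead of recomputing the pipeline:

```python
try:
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "cache-demo")
    lines = ["to be or not to be", "that is the question"]
    # cache() marks the RDD to be kept in memory after the first action,
    # so subsequent actions skip re-parsing the input.
    words = sc.parallelize(lines).flatMap(lambda l: l.split()).cache()
    total = words.count()              # first action: builds and caches the RDD
    distinct = words.distinct().count()  # second action: reuses the cached RDD
    print("total words:", total, "distinct words:", distinct)
    sc.stop()
    have_spark = True
except ImportError:
    have_spark = False
    print("pyspark not available in this environment")
```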

Apache Spark 101
Whether you are a developer, a data architect, or a data scientist, it is a good idea to add Spark to your toolbox. My favourite place to learn about big data technologies is Big Data University. Like the other free courses on the site, the Spark Fundamentals course is free to take and complete at your own leisure. It comes with lab exercises and a Docker image for Spark that you can download to your Mac/Windows/Linux desktop or laptop to apply theory to practice. There is a test at the end of the course; you get a certificate if you pass it, and you can also get a badge to display on your social profile.

 Learn Spark Now!

Learn Spark now or die (being trampled by the elephant in the room ;-)

The forecast for dataville is partly cloudy.
