Thursday, July 23, 2015

Hacking your Data Warehouse with Notebooks

Notebooks are great for doing exploratory analytics iteratively over large and complex data sets. The emphasis here is on iterative exploration - letting you construct and run queries interactively, over and over, building on previous queries and their results, until the exploration yields the insights you are after.

If you are using a next-gen big data platform like Apache Spark, you know the value of notebook-style analytics. But what if your data is in a data warehouse like dashDB? If you are still using traditional SQL query tools, the good news is that you can now easily leverage IPython or Jupyter notebooks to hack your warehouse data interactively.

Let me first briefly describe the tools we'll be using for our demonstration and how to access them.

dashDB - is a next-gen, column-store based relational data-warehousing-as-a-service platform available on Bluemix. For those getting started with dashDB, it comes with an Entry plan that includes a perpetually free tier for up to 5GB of pre-loaded data - perfect for our little experiment. For serious users there are several Enterprise plans that let you manage and analyze multiple terabytes of data. There is now even an MPP version of dashDB, which lets you scale your data over a cluster of several bare-metal nodes.

To get dashDB, simply head over to Bluemix, and if you aren't already a user it takes just a minute to set up your free trial account. Once signed up, go into the catalog, scroll down to the Data and Analytics category, click on dashDB, and then Create an instance of the service using the Entry plan:

Data Scientist Workbench - provides a browser-based notebook-as-a-service environment and supports R, Python, and Scala. It lets you import files, load data, explore data, develop analytics scripts and models, and plot and visualize results, leveraging a variety of pre-installed libraries and allowing you to easily import new ones as needed. Moreover, Data Scientist Workbench is a collaborative environment, enabling you to easily share your notebooks and build on the wealth of notebooks and analytics models built by others. And as a bonus, you also get a local Spark environment to help you get started with developing big data applications without having to deploy an expensive Spark cluster.

The Workbench is currently available as a tech preview, and you can sign up for a free account by clicking the Get started now button:

Once you've signed up and logged into Data Scientist Workbench, click on the My Notebooks tab in the top nav:

There are several tutorials to get you up to speed quickly on Data Scientist Workbench, and it is highly recommended that you work through the first few, especially if you haven't had much experience with notebooks before.

Let's now get to accessing your dashDB warehouse data from notebooks in Data Scientist Workbench. To make it easy, my colleague @bsteinfe has created a sample notebook that downloads and installs the Python driver for dashDB within Data Scientist Workbench, imports the relevant libraries, and then runs a test query.

So the first thing to do is import the sample notebook. In the right corner of the top nav bar of the Workbench is a search/import box. Simply copy and paste this URL into the box and press Enter:

Once the driver notebook is imported, run (Ctrl + Enter) the first few cells of the notebook to download, extract, and install the driver. You will see the progress, along with any unexpected errors; warnings can be ignored. Note that downloading and installing the driver needs to be done only once on your Workbench - you do not need to repeat it in every notebook. In fact, going forward this step will not be needed at all, as the Workbench will be updated to have the driver pre-installed.

Before you get to the next cell, which imports the driver into the notebook: if this is the first time you are doing the import after installing the driver, it is advisable to first restart the kernel (otherwise you might see an error). There is a Restart option under the Kernel menu item in the notebook menu bar:

For the next step you will need the credentials of your dashDB instance. Head over to Bluemix, and in the dashboard click on your instance of dashDB and then Launch the dashDB console:

In the dashDB console, go into the Connect menu and select Connection Information:

From here you will need to copy the hostname, userid, and password:

and paste them into the relevant cell in the notebook:

That's it. If all went smoothly, executing the cell of the notebook will connect to dashDB and retrieve the contents of a sample table into a data frame:
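For reference, that connection cell boils down to something like the sketch below, which uses the ibm_db driver (the Python driver for dashDB that the sample notebook installs). The hostname, userid, password, and table name here are placeholders, not real values - substitute what you copied from the dashDB console:

```python
def make_dsn(host, uid, pwd, database="BLUDB", port=50000):
    """Assemble an ibm_db connection string from dashDB console credentials."""
    return ("DRIVER={IBM DB2 ODBC DRIVER};"
            f"DATABASE={database};HOSTNAME={host};PORT={port};"
            f"PROTOCOL=TCPIP;UID={uid};PWD={pwd};")

def fetch_table(host, uid, pwd, query="SELECT * FROM MYSCHEMA.MYTABLE"):
    """Connect to dashDB and pull query results into a pandas DataFrame.

    Not executed here - it needs a live dashDB instance plus the ibm_db
    driver installed by the sample notebook.
    """
    import ibm_db
    import ibm_db_dbi
    import pandas as pd
    # Wrap the low-level ibm_db handle in a DB-API connection for pandas
    conn = ibm_db_dbi.Connection(ibm_db.connect(make_dsn(host, uid, pwd), "", ""))
    return pd.read_sql(query, conn)
```

Calling fetch_table with your host, userid, and password (and a query against one of your own tables) should give you the same data frame the sample notebook displays.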

You are now ready to play with other data in your warehouse and manipulate it in Python notebooks.

In future posts we will look at some fun hacks of your warehouse data.

Friday, May 22, 2015

Learn Spark or Die! Apache Spark Fundamentals.

What is Spark?
I would simply say - Spark is a big data analytics platform designed for speed. Here are a few other definitions for colour:
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.
Apache Spark is a highly versatile open-source cluster computing framework with fast in-memory analytics performance.

How hot is Spark? 
Spark was already high on the hot trending list a year ago, and you may have seen stats about code commit activity, Twitter chatter, posts on Stack Overflow, etc. But the way I gauge that a technology is red hot (and not just hyped) is when skills for it are in high demand and job postings for it are offering good money. But don't take my word for it ... take O'Reilly's. As per a recent survey (Nov 2014) by O'Reilly among data scientists, Spark users earn the highest median salary:


So why is Spark so hot?
For one, it's easy to use and quite versatile - letting you run a variety of workloads under a single hood. You can leverage existing skills (e.g. Python, Java, Scala, R, SQL), and it works with diverse data sets (web, social, text, graph, NoSQL, IoT, etc.), whether saved previously in repositories or streamed in real time. And the good news for Hadoop users is that Spark runs on existing HDFS infrastructure.

But one of Spark's biggest claims to fame is the ability to run analytic applications up to 100 times faster than Hadoop MapReduce, thanks to in-memory processing. Hadoop has been the technology of choice for big data processing for almost a decade, and while it is good for batch-oriented MapReduce-type jobs, Spark works well with iterative algorithms that build upon each other, is good at interactive queries, and can even do real-time analysis.

Like MapReduce, Spark provides parallel distributed processing, fault tolerance on commodity hardware, and scalability. Spark adds to the concept aggressively cached in-memory distributed computing, low latency, high-level APIs, and a stack of high-level tools. This saves time and money.
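To make the caching point concrete, here is a toy illustration in plain Python (deliberately not actual Spark code): an iterative computation that re-derives its working set on every pass, versus one that materializes it once and reuses it, which is essentially what Spark's in-memory caching of a distributed data set buys you across iterations.

```python
def load_dataset():
    # Stand-in for an expensive read from disk/HDFS
    return list(range(100_000))

# MapReduce-style: the working set is re-derived on every iteration
totals_slow = [sum(load_dataset()) for _ in range(3)]

# Spark-style: materialize ("cache") the data set once, reuse it in memory
cached = load_dataset()
totals_fast = [sum(cached) for _ in range(3)]

# Same answers; the cached version just skips the repeated reload
assert totals_slow == totals_fast
```

With a real Spark RDD the equivalent move is calling cache() before running the iterative job, so subsequent passes hit memory instead of disk.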

Apache Spark 101
Whether you are a developer, data architect, or data scientist, it is a good idea to add Spark to your toolbox. My favourite place to learn about big data technologies is Big Data University. Like the other free courses on the site, the Spark Fundamentals course is free to take and to complete at your own leisure. It comes with lab exercises and a Docker image for Spark that you can download to your Mac/Windows/Linux desktop or laptop and use for applying theory to practice. There is a test at the end of the course; if you pass, you get a certificate, and you can also get a badge to display on your social profile.

 Learn Spark Now!

Learn Spark now or die (being trampled by the elephant in the room ;-)

Friday, April 12, 2013

Future of Big Data in 3 Prezis

Once every few years comes along a revolutionary piece of technology that causes massive disruptions in the existing landscape because it caters to an unfulfilled need in a remarkable way at the right time and carves out a new market in the process. In my “relatively” short IT career I have been fortunate enough to be involved up-close with several of these game changing technologies including – MPP databases, Linux,  cloud computing, Hadoop, and something I am really excited about these days – Big SQL!

If you are already familiar with Big SQL - that's great ... you can skip ahead in this post. If not, all I'm going to say is that Big SQL is a natural extension to Hadoop and big data analytics, which Leon Katsnelson explains very succinctly in this first prezi:

(Click on Start Prezi, wait for it to load and use arrow buttons below the prezi to navigate forward ... better viewed fullscreen - use button on bottom right)

If your interest is piqued, you might as well get to know a bit more about Big SQL by going through this second prezi, titled Big SQL Overview:

And finally, this third prezi will tell you how to go about getting some hands-on experience with Big SQL:

Useful links:
Take the free online course on SQL for Hadoop

Wednesday, December 7, 2011

small data, BIG data on clouds

I have been talking about databases on the cloud for quite some time now. In earlier blog posts I've mentioned how cloud-o-nomics can accelerate testing and development of database solutions, how to get databases up and running quickly on public clouds, and how private clouds can be used for deploying enterprise-class database workloads in-house.

In a recent Chat with the Lab webinar, experts in data management and cloud technologies - Leon Katsnelson (@katsnelson) from IBM and Uri Budnik (@uribudnik) from RightScale - reviewed several options for running databases (specifically DB2) in both public and private clouds, and even mentioned no-cost options for doing so. They also presented an option for test-driving next-generation database technology in the cloud, with free credits thrown in by Amazon.

But the thrust of this webinar was around "Big Data". It's not surprising that analysts like IDC rate cloud computing and big data among the hottest technologies for 2012. Sure, they are hot on their own, but combine the two and even more magic happens. When IBM talks about big data you hear about the 3 Vs that characterize this space - Volume, Variety, and Velocity (learn more about these when you watch the webcast recording below). The presenters in this webcast went further and talked about a 4th V - the Value that is unleashed from big data by utilizing cloud economics.

The logic behind it is actually quite simple. Big data sets involve very large volumes of data that require dozens, sometimes hundreds or even thousands, of servers to process in parallel using paradigms like MapReduce. Traditional data center computing would require intensive capital investment to purchase and set up a cluster for processing big data - hard to justify if hardware utilization is low, e.g. running a Hadoop job for only a few hours a day. With cloud computing, by contrast, you can start up 100 servers costing 30 cents each per hour, so an hour-long big data job would cost only about $30, and after that hour you can shut the servers down without incurring further costs.
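That back-of-the-envelope calculation can be written down explicitly. The 100-server count and 30-cents-per-hour rate are the illustrative figures from the webinar era, not a price quote:

```python
def cloud_job_cost(servers, rate_per_hour, hours):
    """Cost of renting `servers` machines for `hours` hours at `rate_per_hour` each."""
    return servers * rate_per_hour * hours

# 100 servers at $0.30/hour for a one-hour Hadoop job: about $30,
# versus the capital cost of buying and racking a 100-node cluster.
print(round(cloud_job_cost(100, 0.30, 1), 2))
```

The same function makes the utilization argument obvious: at a few hours of use per day, the rental cost stays tiny while an owned cluster's capital cost is sunk whether it runs or not.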

The speakers also talked about BigInsights (IBM's Hadoop-powered solution for big data) and setting up a Hadoop cluster in minutes on the IBM or Amazon cloud using pre-built, pre-configured BigInsights cloud images and server templates. And if you are interested in big data but don't have the skills, or want to learn how to quickly run Hadoop clusters on the cloud, you can take free courses online at Big Data University.

One question that came up during this webcast was how you get your big data sets loaded into the cloud. So when you watch the recording, be sure to listen to the Questions and Answers part towards the end.

Below is the recording of this webinar titled Leveraging Clouds for Small and Big Data...

If you want to be informed about new big data webinars from IBM, do sign up for the Big Data Insights Newsletter.

Monday, June 20, 2011

Quickly deploying Private Clouds for Database workloads

While the IT industry at large is flocking towards cloud computing (as is evident from triple digit growth in cloud usage) it is a rarity (at least so far) to see Enterprise IT departments touting cloud usage for workloads such as databases. And there are good reasons for it.

The situation is not much different from the early 2000s, when I was championing the use of Linux for running database servers. While a large number of the people I talked to understood the benefits of Linux, and many were starting to use it for file and print servers, they were invariably hesitant to use Linux for database workloads.

After all, database systems are the backbone of virtually every enterprise application and hold sensitive data that needs to be secured, protected, backed up reliably, and available for recovery in the event of a disaster. Almost no enterprise IT decision maker wants to put their job on the line by deploying leading-edge technologies that have not yet been proven for enterprise-class workloads.

This attitude is quite prevalent when I talk to IT managers and DBAs at my company's clients, which tend to be some of the largest companies in the banking, insurance, and retail sectors. Most of these folks get Cloud Computing and understand its advantages, but shy away from being trailblazers. So public clouds are typically not an option (other than for skills building and experimentation).

But what if the data was on a cloud and yet within the confines of your company's IT security perimeter? I'm talking about Private Clouds residing in your company's datacenter, within your firewall, where you can enforce your company's IT policies such as access and authentication protocols. The reaction I tend to get to this question is that building Clouds from scratch is not an easy proposition: "We don't have the skills or the budget to hire cloud consulting services," or "do you know how hard it is to get any budget approved for new hardware these days?"

So what if you could cobble together a private cloud without any special skills, within a couple of days or even hours? And what if it could be done mostly by repurposing your existing hardware and adding only one small server that does the cloud automation, virtualization, and provisioning magic right out of the box? By this point in the conversation I can see the interest building up, and that's no surprise, because the recently launched Workload Deployer from IBM offers pretty compelling value pretty quickly. It is a follow-on offering to what was previously known as the WebSphere CloudBurst Appliance.

This 2U form-factor appliance comes pre-loaded with enterprise-class middleware packaged as virtual machine images and patterns that can be deployed onto virtualized servers in a private cloud with just a few mouse clicks. Examples of the pre-loaded middleware include WebSphere Application Server and the DB2 database server.

These pre-bundled DB2 images and patterns are well suited for rapidly provisioning standardized virtual database servers that can be used for development, test, and a variety of web and other workloads. Databases can also be deployed in pairs in a high availability (HADR) configuration.

If you want further details about how to build and rapidly deploy databases in a private cloud, be sure to attend this free webinar on June 29.

Thursday, April 28, 2011

Cloud turns software release cycle on its head

I recall when the iPad first came out it was only available in the US and those of us closer to the North Pole (i.e. Canada) had to wait a few months to get one in the stores here. In fact it is not uncommon to have products launched for core markets first and then released for secondary markets.

So I was quite surprised and impressed when a colleague pinged me to say that he had just published the DB2 Express-C 9.7.4 database in the Cloud - and that too, available in all of Amazon's regions, i.e. US, Europe, and AP. This is probably the first time in the history of IBM that a product has been released on the Cloud before being available through traditional channels.

At first I wondered how that could be ... the DB2 Express-C 9.7.4 install images weren't expected to be generally available for download for at least a few more days. So how could server templates for this newly updated version already be available for running in the cloud? After all, the Cloud is not where we get most of the usage of DB2 - it's not our primary market (at least not yet).

Well, I suppose Cloud-o-nomics changes everything. Before new install images of a product are made available for download, they typically go through several stages of testing and then some final sanity checking. In the case of DB2 Express-C, which is the free community edition of the enterprise-class DB2 database from IBM, the final stages of testing involve installing it on several free/low-cost operating systems that would typically not get coverage during the test cycles of the paid editions of DB2. In the normal test cycles for the paid editions, DB2 is exercised on operating systems that IBM officially supports and that are used in business environments. However, since DB2 Express-C is used by the community at large, and many members of this community prefer Windows Home editions and free community versions of Linux distributions, IBM does try to test it on some of those systems.

If you're wondering what the Cloud has to do with this: in the case of DB2 Express-C, it helped us cut out a significant amount of effort testing on operating systems that are not standard in an enterprise setting. While our test labs have lots of servers with enterprise-class operating systems, and we can grab any of those for testing, for the lower-cost/free OSes we would need to spend time setting up systems or VMs with those OSes and their newer versions. The Cloud saves us the effort of setting up such systems from scratch. We can just grab an existing OS image or server template on the cloud, launch it instantaneously, install our code, and make sure everything is okay - all in a matter of minutes. And doing this costs us only tens of cents, as opposed to scrounging for free machines or putting in budget requests many months in advance for new systems (requests which these days typically get rejected, because the systems would be utilized for only a very short duration).

Another reason we could get the DB2 Express-C 9.7.4 images out on the Cloud earlier than the regular download images was the use of the RightScale cloud management platform and its concept of Server Templates. We already had older versions of DB2 Express-C running on Amazon EC2 using RightScale Server Templates. DB2 was just an attachment to the template, and at launch time the server template would use an existing OS image to install and configure DB2. So for this new release, all we had to do was replace the attachment with the newer version of the DB2 code, and the template would then automatically install and configure the new DB2 release on the underlying operating system. Once this updated server template was launched, not only could we test the new version rather easily on that OS, but without any additional effort we also got a newer version of the server template that could be made public (with a click of just a few buttons) so others could also run the new DB2 Express-C code in the Cloud.

In conclusion, if you want to try out the latest release of DB2 Express-C, you can do so on the Amazon Cloud by running this DB2 Express-C 9.7.4 Database Server Template for Ubuntu 10.04** LTS, or you can wait a few days for the regular install images to become available for download. BTW, if you want to go the Cloud route you will need Amazon and (free) RightScale accounts, and this video from Bradley Steinfeld walks you through the simple process.

BTW, did I mention that there is no charge for using the DB2 Express-C database, Ubuntu Linux, or RightScale (with a free account)? You would still pay for the use of Amazon's cloud services, starting at 8.5 cents an hour.

** Yes, we would have liked to go out with Ubuntu 11.04, but the underlying OS image was not yet available on RightScale, so we fell back on an OS image that we had previously worked with.

Monday, March 21, 2011

Hypervisor Edition for Database on your Private Cloud

While Cloud Computing is all the rage these days, most enterprises are wary of putting their sensitive or confidential data on an internet accessible cloud. So how do you take advantage of Cloud Computing and reap its benefits while still keeping your valuable data protected in a secure vault-like environment that you've built over the years in your enterprise data-centers?

That's where Private Clouds come in, enabling your Enterprise IT department to take the efficiency of your in-house data-center resources to new levels and deliver IT infrastructure (server, networking, and storage), and possibly even enterprise middleware, as on-demand services.

A key component of a private cloud is a workload deployer that can provision these resources as virtual machines on-the-fly with custom configurations as and when required by project leads in the various lines of business.

As far as private cloud workload deployers are concerned, the IBM WebSphere CloudBurst Appliance (WCA) deserves special mention because it comes preloaded with a variety of middleware for provisioning enterprise-class applications. The software comes bundled in what are called Hypervisor Editions, which are essentially virtual images based on VMware ESX (for Linux and Windows) or PowerVM (for AIX). These images can be launched with custom parameters and options, and can be configured to connect with other middleware to create complete enterprise application environments with web, app, and database servers.

Preloaded middleware includes WebSphere Application Server and the DB2 database server. DB2 is a highly available and secure database management system that leading enterprises all over the world use for their most mission-critical needs. And DB2 pureScale is a highly scalable and continuously available edition of DB2 that can cluster dozens of servers for extreme-demand workloads.

I will likely do future posts delving into greater detail about the various items introduced here; in the interim you can learn more about Hypervisor Editions and the WebSphere CloudBurst Appliance, and watch this demo to see how to use WCA to do pattern-based workload deployments that provision application and database servers in your private cloud:

The forecast for dataville is partly cloudy.
