Friday, May 22, 2015

Learn Spark or Die! Apache Spark Fundamentals.

What is Spark?
I would simply say - Spark is a big data analytics platform designed for speed. Here are a few other definitions for colour:
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. -apache.org
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. -infoq.com
Apache Spark is a highly versatile open-source cluster computing framework with fast in-memory analytics performance. - ibm.com

How hot is Spark? 
Spark was already high on the hot-trending list a year ago, and you may have seen stats about code-commit activity, Twitter chatter, Stack Overflow posts, etc., but the way I gauge that a technology is red hot (and not just hyped) is when its skills are in high demand and job postings are offering good money. But don't take my word for it ... take O'Reilly's. According to a recent (Nov 2014) O'Reilly survey of data scientists, Spark users earn the highest median salary:

source: http://www.oreilly.com/data/free/files/2014-data-science-salary-survey.pdf


So why is Spark so hot?
For one, it's easy to use and quite versatile, letting you run a variety of workloads under a single hood. You can leverage existing skills (e.g. Python, Java, Scala, R, SQL), and it works with diverse data sets (web, social, text, graph, NoSQL, IoT, etc.), whether saved previously in repositories or streamed in real time. And the good news for Hadoop users is that Spark runs on existing HDFS infrastructure.
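To give a feel for the ease of use, here is a minimal PySpark sketch in the Spark 1.x style; the HDFS path and the word-count logic are purely illustrative, but it shows how plain Python carries over to a distributed job:

from pyspark import SparkConf, SparkContext

# Spark 1.x style setup; "local[*]" just runs on all local cores
conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The input path is a made-up example; it could equally be a local file
lines = sc.textFile("hdfs:///user/demo/input.txt")

# Classic word count: split lines, map each word to (word, 1), sum by key
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Pull the ten most frequent words back to the driver and print them
for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print("%s: %d" % (word, n))

sc.stop()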

But one of Spark's biggest claims to fame is its ability to run analytic applications up to 100 times faster than Hadoop MapReduce, thanks to in-memory processing. Hadoop has been the technology of choice for big data processing for almost a decade, and it is good at batch-oriented MapReduce-style jobs. Spark, on the other hand, works well with iterative algorithms that build on each other's results, handles interactive queries nicely, and can even do real-time analysis.
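Here is a rough sketch of why the in-memory part matters for iterative work. The file path and the toy update loop are made up for illustration; the point is that the parsed data is cached once and reused on every pass, instead of being re-read from disk each time as a MapReduce-style job would do:

from pyspark import SparkContext

sc = SparkContext("local[*]", "IterativeDemo")

# Parse a (hypothetical) file of numbers once and keep it in memory
values = (sc.textFile("hdfs:///user/demo/values.txt")
            .map(lambda line: float(line))
            .cache())

# Toy iterative loop: every pass reuses the cached RDD rather than
# going back to HDFS, which is where the big speed-up comes from
guess = 0.0
for i in range(10):
    error = values.map(lambda v: v - guess).mean()
    guess += 0.5 * error

print("converged guess: %f" % guess)
sc.stop()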

Like MapReduce, Spark provides parallel distributed processing, fault tolerance on commodity hardware, scalability, and so on. Spark adds to that with aggressively cached in-memory distributed computing, low latency, high-level APIs, and a stack of higher-level tools. This saves time and money.
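And as a small taste of that higher-level stack, the same SparkContext can feed Spark SQL, letting you mix RDD code with plain SQL (the table and column names below are invented for illustration, using the Spark 1.x SQLContext API):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[*]", "SqlDemo")
sqlContext = SQLContext(sc)

# Turn an ordinary RDD of Rows into a DataFrame (names/ages are made up)
people = sc.parallelize([Row(name="Ada", age=36), Row(name="Linus", age=45)])
df = sqlContext.createDataFrame(people)

# Register it as a temporary table and query it with plain SQL
df.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age > 40")
for row in adults.collect():
    print(row.name)

sc.stop()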

Apache Spark 101
Whether you are a developer, data architect, or data scientist, it is a good idea to add Spark to your toolbox. My favourite place to learn about big data technologies is Big Data University. Like the other free courses on the site, the Spark Fundamentals course is free to take and can be completed at your own leisure. It comes with lab exercises and a Docker image for Spark that you can download to your Mac/Windows/Linux desktop or laptop and use to put the theory into practice. There is a test at the end of the course; if you pass it you get a certificate, and you can also earn a badge to display on your social profile.

 Learn Spark Now!


Learn Spark now or die (being trampled by the elephant in the room ;-)

The forecast for dataville is partly cloudy.
