Some may argue that there are only three certainties in life: death, taxes and Hadoop. In its report on Hadoop predictions for 2015, tech analyst firm Forrester calls Hadoop a “rising star” in data analytics and points to “Hadooponomics,” the economics of storing and analyzing data in Hadoop, as the trigger feeding a robust ecosystem of new tools and new distributions. In short, the report predicts that the Hadoop ecosystem will continue to grow and become many things to many new people, from analytics to an applications platform. Maybe they’re right, but that’s one story. There’s another.
This second narrative posits that in our digital, distributed and connected world, which spews innovation and spits out old technologies, where a crowd-funded smartwatch project can raise more than $18M in record time and an exasperating game about a flappy bird can go from 0 to 50M downloads in 28 days, open source projects form, stretch and sometimes rip apart faster than a new version of anything from Microsoft needs to be patched. It is a story about a technology hype machine with attention deficit disorder, where all eyes seem to shift in a blink from Hadoop MapReduce to newer open source projects like Spark, Kafka and Ceph.
This story, which is reaching critical mass, posits a world after Hadoop as we know it. It asks a solemn question: has Hadoop already peaked in maturity? One thing seems certain: the batch-processing MapReduce foundation that gave rise to today’s Hadoop ecosystem may be on its last legs. I was somewhat apprehensive about MapReduce back in 2008, when I was at eBay. Back then I exchanged emails with industry watcher Curt Monash, who wrote at the time, “eBay doesn’t love MapReduce.” At eBay, we thoroughly evaluated Hadoop and came to the clear conclusion that MapReduce was absolutely the wrong long-term technology for big data analytics. MapReduce solved problems that couldn’t be solved without a parallel system, but there were already more powerful and more mature parallel systems on the market, Teradata for example. Now the whole industry agrees. For us at eBay, MapReduce didn’t challenge the status quo; at best, it was an incremental step in the right direction for open source technology.
What’s Spark? And Why the Hype?
Since then, the open source community has matured and taken newer, larger steps, with names like Spark and Ceph. The counter-narrative to the general Hadoop hype that I mentioned focuses primarily on those two projects.
First, Spark. The hype that Hadoop MapReduce generated for years is now switching fast to Spark, which is everything that MapReduce was meant to be: a general-purpose engine that keeps working data in memory rather than writing every intermediate step back to disk. Make no mistake about it: Spark was built as a competitor to the Hadoop tools ecosystem. Spark has been called “the next big thing” in big data, and you can see the Hadoop vendors shifting their posture to address the new kid in town.
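To make that concrete, here is a minimal sketch of the developer experience: a word count, the canonical MapReduce example, expressed in PySpark. The input path and sample output are hypothetical; the point is only that a multi-stage pipeline fits in a few lines and runs largely in memory, where the classic MapReduce version takes pages of Java and a disk write between stages.

```python
# A word count in PySpark. This is an illustrative sketch, not a benchmark;
# the input file name is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

counts = (sc.textFile("input.txt")                 # read lines from a hypothetical file
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word

for word, count in counts.take(10):                # bring a small sample back to the driver
    print(word, count)

sc.stop()
```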
For sure, it is immature, just as MapReduce and Hadoop were a few years ago. And, as such, and smartly so, Databricks, the company founded by Spark’s creators, has ducked the question of whether Spark will replace Hadoop. They say “we’re not going to replace Hadoop” and “we’re going to run in Hadoop.” But guess what: you don’t even need Hadoop to run Spark. In fact, some are running it in an OpenStack cluster rather than in Hadoop, and the commercial product from Databricks runs on Amazon’s AWS. The same is true for other open source projects, like the distributed messaging system Kafka. Can you run Kafka in a Hadoop cluster? Sure. But look at where many people actually run it: in something like OpenStack.
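As a hedged illustration of that point, the sketch below runs Spark against its own standalone cluster manager and reads from object storage instead of HDFS. The master URL and bucket name are hypothetical, and the s3a path assumes the appropriate hadoop-aws connector libraries are on the classpath; no HDFS or YARN cluster is involved.

```python
# Spark without a Hadoop cluster: a standalone master plus object storage.
# The master URL and bucket are hypothetical; the s3a connector assumes the
# hadoop-aws libraries are available, but no HDFS or YARN is required.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("SparkWithoutHadoop")
        .setMaster("spark://standalone-master:7077"))  # Spark's own cluster manager, not YARN

sc = SparkContext(conf=conf)

# Read logs from S3-style object storage instead of HDFS.
events = sc.textFile("s3a://example-bucket/events/*.log")
print("event count:", events.count())

sc.stop()
```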
Ceph and Red Hat’s Data Management Ambitions
Which brings me to the second project that could bring about an entirely new ecosystem of big data tools and data management options: Ceph. Ceph is an open source distributed storage system that also includes its own high-performance, POSIX-compatible file system, which means better compatibility with Linux and other operating systems. In practical terms, that means it can ingest, update and delete data in place on the system.
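To show what that POSIX compatibility buys you, here is a small sketch using ordinary Python file operations against a hypothetical CephFS mount point. Nothing in it is Ceph-specific code, which is exactly the point: the in-place update step is the kind of thing HDFS’s write-once, append-only model does not permit.

```python
# Ordinary POSIX file operations against a hypothetical CephFS mount.
# The mount point and file contents are illustrative only.
import os

path = "/mnt/cephfs/analytics/latest_scores.csv"   # hypothetical CephFS mount

# Ingest: create a new file.
with open(path, "w") as f:
    f.write("user_id,score\n1001,0.72\n")

# Update in place: seek to an offset and overwrite it.
with open(path, "r+") as f:
    data = f.read()
    offset = data.index("0.72")    # byte offset equals character offset for this ASCII content
    f.seek(offset)
    f.write("0.95")

# Delete: remove the file outright.
os.remove(path)
```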
In April 2014, Red Hat, a model of success in commercializing open source technologies, bought Inktank, the creators and providers of Ceph. Red Hat is a dominant force in the open source market and has a whole other level of experience and resources, the kind needed to popularize a project like Ceph. And all signs point to Red Hat not stopping at big data with Ceph. It could very well position Ceph as the standard file system of Red Hat Linux.
Read the tea leaves around the momentum of Spark and Ceph, open source tools that are clear alternatives to Hadoop MapReduce and HDFS respectively, and you start to understand this counter-narrative. For all the hype, Hadoop market growth has been relatively stagnant. InformationWeek recently wrote a story on just this topic, pointing out that Splunk, a machine learning and IT operations intelligence vendor, leads a smallish big data (read: Hadoop) market.
The question isn’t whether some critical Hadoop components are being replaced by new open source technologies. They are, and that’s a fact. The question is which path Hadoop will ultimately take.
Will the Hadoop ecosystem, as Forrester and others suggest, grow to encompass the newer open source technologies? Or will technologies like Spark, Ceph, Kafka and others evolve into something entirely new? When you pair Ceph with OpenStack, which is itself growing fast, and take analytics to the cloud, is that where the world of big data is heading? Is that where we will see the industry in the next 5-10 years?
There are certainly a lot of “what ifs” at this point, and I don’t purport to know the answers to these questions absolutely; time will answer them with certainty. But I do know it is important to seriously consider this counter-narrative to the deafening Hadoop MapReduce hype. Otherwise, some might find themselves stuck inside a withering ecosystem.
About the Author
Oliver Ratzesberger
Mr. Ratzesberger has a proven track record in executive management, as well as 20+ years of experience in analytics, large data processing and software engineering.
Oliver’s journey started with Teradata as a customer, driving innovation on its scalable technology base. His vision of how the technology could be applied to solve complex business problems led to him joining the company. At Teradata, he has been the architect of the strategy and roadmap aimed at transformation. Under Oliver’s leadership, the company has challenged itself to become a cloud-enabled, subscription business with a new flagship product. Teradata’s integrated analytical platform is the fastest-growing product in its history, achieving record adoption.
During Oliver’s tenure at Teradata he has held the roles of Chief Operating Officer and Chief Product Officer, overseeing various business units, including go-to-market, product, services and marketing. Prior to Teradata, Oliver worked for both Fortune 500 and early-stage companies, holding positions of increasing responsibility in technology and software development, including leading the expansion of analytics during the early days of eBay.
A pragmatic visionary, Oliver frequently speaks and writes about leveraging data and analytics to improve business outcomes. His book with co-author Professor Mohanbir Sawhney, “The Sentient Enterprise: The Evolution of Decision Making,” was published in 2017 and was named to the Wall Street Journal Best Seller List. Oliver’s vision of the Sentient Enterprise is recognized by customers, analysts and partners as a leading model for bringing agility and analytic power to enterprises operating in a digital world.
Oliver is a graduate of Harvard Business School’s Advanced Management Program and earned his engineering degree in Electronics and Telecommunications from HTL Steyr in Austria.
He lives in San Diego with his wife and two daughters.