View Praveen Gandluri's profile on LinkedIn

Big Data on Small Machines!!


Hadoop runs on commodity hardware: that's the most common phrase in the Hadoop world. Commodity doesn't mean cheap or unreliable hardware; rather, it means hardware that isn't specialized with massive CPU or memory. Commodity hardware is affordable and can typically be built from components from various hardware providers. Check out Cloudera's blog on selecting the right hardware for an idea of cluster setup. When we start exploring Hadoop, we usually install a single node in Local or Pseudo-Distributed Mode on our laptops or, to make it much simpler, just use a Cloudera VM or Hortonworks VM. Then we move on to POCs with on-premise clusters or the cloud (probably AWS EMR).
I have been running MapReduce algorithms on my laptop in Pseudo-Distributed Mode and thought it was high time to set up a cluster. So, as a test, I bought three refurbished PCs and set up a cluster with one master and two slave nodes. The use case is to run an analytic function (Java MapReduce): find the maximum close price for each stock from NYSE End-of-Day data for the 7 years from 2006 to 2013.

Cluster Hardware Details:

3 refurbished PCs ($89 each): HP DC7100 desktop PC, Pentium 4, 32-bit, 2 GB RAM, 40 GB HDD

[Image: personal cluster configuration]

Input Data: 1,957 CSV files covering all the days between 2006 and 2013, with about 4.9 million rows in total across the files.

MapReduce: A simple job that finds the maximum close price for each stock.
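The job's source is not reproduced in this post, so the following is only a rough sketch of the idea in plain Java, with no Hadoop dependency, so it can be run on any machine. It imitates the three stages: a map step that emits (symbol, close) pairs, a grouping step that stands in for Hadoop's shuffle, and a reduce step that keeps the maximum. The class name, the column layout (symbol, date, open, high, low, close, volume) and the sample rows are all assumptions made up for illustration. They are not taken from the actual data files or from the code that ran on the cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a max-close-price MapReduce job, written without Hadoop.
class MaxClosePrice {

    // Assumed column layout (not given in the post): symbol,date,open,high,low,close,volume
    static final int SYMBOL_COL = 0;
    static final int CLOSE_COL = 5;

    // Map step: turn one CSV row into a (symbol, close) pair.
    // Returns null for rows that cannot be parsed, such as a header row.
    static Map.Entry<String, Double> map(String line) {
        String[] fields = line.split(",");
        if (fields.length <= CLOSE_COL) {
            return null;
        }
        try {
            Double close = Double.valueOf(fields[CLOSE_COL].trim());
            return new SimpleEntry<String, Double>(fields[SYMBOL_COL].trim(), close);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Reduce step: keep the largest close price seen for a single symbol.
    static double reduce(List<Double> closes) {
        double max = Double.NEGATIVE_INFINITY;
        for (double close : closes) {
            max = Math.max(max, close);
        }
        return max;
    }

    // Driver: map every row, group the pairs by symbol (the "shuffle"), then reduce each group.
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<String, List<Double>>();
        for (String line : lines) {
            Map.Entry<String, Double> pair = map(line);
            if (pair == null) {
                continue;
            }
            List<Double> closes = grouped.get(pair.getKey());
            if (closes == null) {
                closes = new ArrayList<Double>();
                grouped.put(pair.getKey(), closes);
            }
            closes.add(pair.getValue());
        }
        Map<String, Double> result = new TreeMap<String, Double>();
        for (Map.Entry<String, List<Double>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real NYSE data.
        List<String> sample = Arrays.asList(
            "symbol,date,open,high,low,close,volume",
            "AAA,2006-01-03,10.0,11.0,9.5,10.5,1000",
            "AAA,2006-01-04,10.5,12.0,10.0,11.75,1200",
            "BBB,2006-01-03,20.0,21.0,19.0,20.25,500",
            "BBB,2006-01-04,20.25,20.5,18.0,18.5,700");
        for (Map.Entry<String, Double> e : run(sample).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real Hadoop job the map and reduce steps would live in `Mapper` and `Reducer` subclasses and the framework would do the grouping across nodes. The sketch keeps everything in one process so the logic is easy to follow.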

Conclusion: The total MapReduce processing took 50 minutes for about 5 million rows, which is not bad at all. The latest versions of Oracle or SQL Server can't even be installed on these PCs, and I am not sure how MySQL would perform. So if you have a few tables (files) with tens of millions of rows (which is on the lower end for a typical DW), join them and run a Hive query, it might take about 15 to 20 hours (guesstimate) on this cluster. This is awesome. A fully functional Data Warehouse for about 270 bucks.
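The numbers above can be checked with some quick arithmetic. The sketch below uses only figures from this post (3 PCs at $89 each, about 4.9 million rows, 50 minutes) to work out the hardware cost and the observed throughput. The `hoursFor` helper is a hypothetical addition that scales the runtime linearly with row count. That is a strong simplifying assumption: it describes a single scan like this job, not a multi-table Hive join, so it does not reproduce the 15 to 20 hour guesstimate.

```java
// Back-of-the-envelope figures for the three-PC cluster (class and method names are illustrative).
class ClusterEstimate {

    // Total hardware cost in dollars.
    static int hardwareCost(int pcs, int pricePerPc) {
        return pcs * pricePerPc;
    }

    // Observed throughput in rows per second.
    static double rowsPerSecond(long rows, double minutes) {
        return rows / (minutes * 60.0);
    }

    // Hours needed for a given row count, assuming runtime scales linearly (an assumption).
    static double hoursFor(long rows, double rowsPerSecond) {
        return rows / rowsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        int cost = hardwareCost(3, 89);
        double rate = rowsPerSecond(4_900_000L, 50.0);
        System.out.println("Hardware cost: $" + cost);
        System.out.printf("Throughput: %.0f rows/second%n", rate);
        // 50 million rows is an arbitrary example size, not a figure from the post.
        System.out.printf("Linear estimate for 50M rows: %.1f hours%n", hoursFor(50_000_000L, rate));
    }
}
```

The cost comes to $267, which is where the round figure of 270 bucks comes from, and the throughput works out to roughly 1,600 rows per second across the cluster.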

Here is a detailed video with Demo:

Disclaimer: This is a personal blog. Any views or opinions represented in this blog are personal and belong solely to the author and do not necessarily represent the author's employer or the clients the author works for. All content provided on this blog is for informational purposes only. The author will not be liable for any errors or omissions in this information nor for the availability of this information. All trademarks, logos, icons and images cited herein are the property of their respective owners.