Wednesday, December 26, 2012

Episode 23: Hadoop


Download

News
Tool of the Show
Book of the Show


Hadoop

History
  • Jeff Dean & Sanjay Ghemawat wrote the MapReduce paper at Google.
  • Created by Doug Cutting while he was at Yahoo!
  • Intended to support Nutch, the Lucene-based web search engine (inverted indexing).
  • Facebook announced that its Hadoop filesystem has grown to 100 petabytes.
Features
  • HDFS: Hadoop Distributed File System
  • HBase: A distributed, column-oriented database
  • ZooKeeper: Distributed coordination service
  • Crunch: Simplified API for creating MapReduce pipelines.
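
For readers who have not seen a MapReduce pipeline before, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop or Crunch code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the structure of the computation is the same.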

Strengths
  • Scales horizontally
  • Fault tolerant
  • Can add or remove hardware while the cluster is running.
Weaknesses
  • Long spin-up / spin-down time.
    • Worker pools
  • Excessive serialization/deserialization
  • Excessive materialization
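
The serialization and materialization weaknesses can be sketched with a toy two-stage pipeline in plain Python. This is an illustration under assumptions, not Hadoop code: a local JSON file stands in for HDFS, and the function and file names are hypothetical. In the chained version, the first stage's full output is written out and read back before the second stage starts, which is roughly what happens between separate MapReduce jobs.

```python
import json
import os
import tempfile


def stage_one(records):
    """First job: keep only the even numbers."""
    return [r for r in records if r % 2 == 0]


def stage_two(records):
    """Second job: square each surviving record."""
    return [r * r for r in records]


def run_materialized(records, workdir):
    """Chain the two jobs through storage, like separate MapReduce jobs."""
    # Hypothetical file name; a local file stands in for HDFS here.
    path = os.path.join(workdir, "stage_one_output.json")
    with open(path, "w") as f:
        json.dump(stage_one(records), f)  # serialize and materialize
    with open(path) as f:
        intermediate = json.load(f)  # deserialize before the next job
    return stage_two(intermediate)


def run_fused(records):
    """Same logic with no intermediate file and no serialization."""
    return [r * r for r in records if r % 2 == 0]


with tempfile.TemporaryDirectory() as workdir:
    materialized = run_materialized(range(10), workdir)
fused = run_fused(range(10))
print(materialized, fused)
```

Both versions produce the same result; the difference is the extra write, read, and encode/decode work on every hop between jobs.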

Tools
  • Avro: A serialization framework
  • Pig & Hive: Querying and storing large datasets

Uses
  • Storing/manipulating big data.