How the Hadoop Ecosystem Works

Arbaj Khan
3 min read · Oct 6, 2020

Introduction:

Big data is one of the most discussed topics in the technology world today. Its full potential is far from being realized, because the amount of data generated every day exceeds what traditional data management systems can handle. To solve the challenges of managing and processing big data, the Apache Software Foundation introduced Hadoop, an open-source framework for storing and processing big data in a distributed environment. The Hadoop ecosystem contains a large set of modules and tools for handling the different tasks involved in big data processing.

Let's understand how Hadoop and its ecosystem work.

Hadoop Ecosystem: It is a platform for solving different problems related to big data and includes various Apache and commercial projects. It works across four major stages of a big data solution.

1) Data storage: Data is simply stored on the Hadoop cluster as raw files. The core components of Hadoop themselves have no special capabilities for cataloging, indexing, or querying structured data (a short HDFS read/write sketch follows this list of stages).

2) Data processing: Hadoop has become the de facto platform for storing and processing large amounts of data and has found widespread application. In the Hadoop ecosystem, you store your data in one of the storage managers and then use a processing framework, such as MapReduce or Spark, to process it.

3) Data access: Query and access layers such as Apache Hive, Apache HBase, Apache Phoenix, and Apache Druid sit on top of the stored data, so applications can query it instead of reading raw files directly.

4) Data management: HDFS replicates each block of data across several machines in the cluster, and MapReduce work is scheduled on the nodes that hold the data. The replication provides fault tolerance, and the data locality increases processing speed.
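
Stage 1 above says data lives in HDFS as plain files. As a concrete illustration, here is a minimal Java sketch that writes a small file into HDFS and reads it back through the FileSystem API. The NameNode URI (hdfs://namenode:9000) and the path /user/demo/hello.txt are placeholder assumptions, and the sketch assumes the hadoop-client libraries are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this URI is a placeholder for this sketch.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; HDFS stores it as replicated blocks on DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same FileSystem API.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```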

The following components together form the Hadoop ecosystem and cover all of the above stages of a big data solution.

· HDFS [Hadoop Distributed File System]: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

· YARN [Yet Another Resource Negotiator]: YARN manages the cluster's resources and schedules jobs, which opens Hadoop up to batch processing, stream processing, interactive processing, and graph processing over the data stored in HDFS.

· MapReduce: Hadoop MapReduce is a software framework for distributed processing of large data sets on computing clusters (see the classic WordCount sketch after this list).

· Spark [Data processing]: Spark extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.

· Pig, Hive: The Hive component is used mainly by data analysts, whereas the Pig component is generally used by researchers and programmers.

· HBase: HBase is a distributed, column-oriented database built on top of the Hadoop file system (a short client sketch also follows this list).

· Solr, Lucene: Searching and indexing.

· ZooKeeper: Cluster management and coordination.
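
The MapReduce bullet above is easiest to see through the classic WordCount job. The sketch below is the conventional mapper/reducer pair rather than code from this article: the class names and input/output paths are illustrative, and it assumes the standard hadoop-mapreduce-client libraries.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it is launched with something like hadoop jar wordcount.jar WordCount /input /output; the map tasks run in parallel on the nodes holding the input blocks, and the reducers aggregate the per-word counts.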
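
Similarly, the HBase bullet can be made concrete with a short client sketch. The table name users, the column family info, and the row key are made-up examples, and the sketch assumes the hbase-client library plus a reachable cluster configured through hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // Picks up the ZooKeeper quorum and cluster settings from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```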

Hadoop Architecture

Hadoop has a master-slave architecture: HDFS provides distributed data storage, and MapReduce provides distributed data processing.

NameNode: The NameNode is the master of HDFS; it keeps the metadata for every file and directory in the namespace.

DataNode: DataNodes store the actual blocks of data, serve read and write requests, and report the state of their blocks to the NameNode.

Master node: The master node lets you conduct parallel processing of the data using Hadoop MapReduce.

Slave node: The slave nodes are the additional machines in the Hadoop cluster; they store the data and carry out the computations.
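
To make the NameNode/DataNode split concrete, the following sketch asks the NameNode for the block locations of a file, i.e. which DataNodes hold each block's replicas. The file path and filesystem URI are the same placeholder assumptions used in the earlier HDFS sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // placeholder file
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers from its metadata: one entry per block,
            // listing the DataNodes that store that block's replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}
```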

Conclusion:

· Hadoop addresses the big data challenges and has proven to be an efficient framework of tools.

· We live in the information era, where everything is connected and generates huge amounts of data. Such data, if well analyzed, can add real value to society.
