Hadoop

1 minute de lecture

Mis à jour :

When data gets too large to be dealt with in memory (most computers have up to 32 GB in RAM usually), it is possible to use a distributed system.

A distributed system send the data and the computations to multiple machines / computers.

Specifically:

  • Local process will use the computation resources of a single machine
  • Distributed process has access to the computational resources across a number of machines connected through a network

In practice, we use many lower CPU machines, than to try to scale up to a single machine with a high CPU. This provide tremendous scaling possibilities as you can always add more machines (except financially lol).

They also include fault tolerance, if one machine fails, the whole network can still go on.

Hadoop is a way to distribute very large files across multiple machines. It uses the Hadoop Distributed File System (HDFS) HDFS allows a user to work with large data sets HDFS also duplicates blocks of data for fault tolerance It also then uses MapReduce MapReduce allows computations on that data

HDFS

Let’s discuss the typical format of a distributed architecture that uses Hadoop

HDFS will use blocks of data, with a size of 128 MB by default Each of these blocks is replicated 3 times The blocks are distributed in a way to support fault tolerance

Smaller blocks provide more parallelization during processing Multiple copies of a block prevent loss of data due to a failure of a node

MapReduce is a way of splitting a computation task to a distributed set of files (such as HDFS) It consists of a Job Tracker and multiple Task Trackers

The Job Tracker sends code to run on the Task Trackers The Task trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes

What we covered can be thought of in two distinct parts: Using HDFS to distribute large data sets Using MapReduce to distribute a computational task to a distributed data set Next we will learn about the latest technology in this space known as Spark. Spark improves on the concepts of using distribution


Laisser un commentaire