1. Number of mapper/reducer tasks
-
The number of mapper and reducer tasks on
each node of a cluster depends directly on the total available memory and
on the other services running on the node (e.g., HBase).
-
By default, the memory allocated to each task JVM is
512MB, so every mapper and reducer task gets the same amount of memory.
-
The JVM memory for a Hadoop job can be
controlled through any of the following three options:
o Hadoop Cluster Configuration
It is accomplished by changing the mapred.map.child.java.opts and mapred.reduce.child.java.opts properties. The change affects every job executed in the cluster unless a job overrides these properties in its own configuration.
o Job Arguments
The most efficient way of allocating the necessary memory for a mapper/reducer is to pass the required value as a Hadoop job argument. (The Driver class of each job should implement the Tool interface and be launched through ToolRunner.)
For example, if the job requires each map task to be allocated 2GB of memory, the following is the argument that needs to be passed:
§ -Dmapred.map.child.java.opts=-Xmx2048m
o Set the configuration properties inside
the driver class of the Job
If the amount of memory required for the map/reduce tasks is expected to be consistent and is known beforehand, then setting it on the job configuration object inside the driver class is a good approach.
-
Blindly increasing the memory can have
adverse effects if the node running the task is unable to allocate the requested
memory. This causes 'out of memory' errors, in turn resulting in job
failures.
-
Always consider the nature of the job, the size
of the data flowing in, and the resources available in the cluster before deciding
on the memory to be used.
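For reference, the cluster-level option above corresponds to entries in mapred-site.xml such as the following (property names from the older mapred API; the heap sizes are illustrative, not recommendations):

```xml
<!-- mapred-site.xml: cluster-wide JVM heap for map and reduce tasks.
     Individual jobs can still override these properties. -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```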
2. Oversubscription
-
Oversubscription of resources is driven by
the fact that the tasks running in the cluster will rarely all require their
maximum allocated memory at the same time. This can be very helpful in that
we get a larger number of tasks running (with fewer resources than nominally required). It works
well when we have a large number of tasks performing less memory-intensive work.
-
However, oversubscription can cause memory
issues too. If the memory actually required by the running tasks exceeds the memory that
the system can allocate, tasks will be killed and jobs will fail with exit
code 137 (128 + 9, i.e. the process was killed with SIGKILL, typically by the
operating system when memory runs out).
-
The workaround for the above issue is to
ensure that the memory requested by the tasks does not exceed the total available
memory. This can be accomplished by any of the following:
o Reduce
the number of inputs. (Execute multiple jobs by splitting the input, if possible.)
o Control
the number of mapper/reducer tasks. (This can be achieved either by reducing the
amount of input or by limiting the number of tasks spawned by the task queues,
a cluster-level configuration.)
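The arithmetic behind oversubscription can be sketched as follows; every figure here is an assumed example value for a hypothetical node, not a Hadoop default:

```java
// Back-of-envelope check: does a node's slot count oversubscribe its memory?
// All numbers below are illustrative assumptions.
public class OversubscriptionCheck {
    public static void main(String[] args) {
        long nodeMemoryMb = 8192;   // memory the node can devote to tasks
        long heapPerTaskMb = 2048;  // -Xmx configured per task
        int taskSlots = 6;          // configured map + reduce slots

        // Worst case: every slot is busy and every task fills its heap.
        long worstCaseMb = heapPerTaskMb * taskSlots;
        System.out.println("Worst-case demand: " + worstCaseMb + " MB");
        System.out.println("Oversubscribed: " + (worstCaseMb > nodeMemoryMb));
    }
}
```

Here the worst case (12288 MB) exceeds the node's 8192 MB, which is safe only as long as tasks rarely use their full heap; if they do, expect kills with exit code 137.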
3. Hadoop job arguments and code compatibility
-
The most efficient way of setting job-specific
configuration is to pass it via Hadoop arguments, using the
'-D' option.
-
The important point to remember is that the
job must cooperate by actually taking in the Hadoop arguments. It is a common mistake
to create a new configuration object inside the job, which overrides the
configuration passed through the arguments.
-
The Tool interface needs to be implemented (and the
job launched through ToolRunner) for the Hadoop arguments to be read.
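A minimal driver along these lines might look like the following sketch (class name and job setup are hypothetical; the old mapred API is used to match the property names in this post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver that honours Hadoop generic options. ToolRunner
// parses arguments such as -Dmapred.map.child.java.opts=-Xmx2048m into
// the Configuration before run() is invoked.
public class ExampleDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Reuse the configuration populated by ToolRunner. Creating a
        // fresh JobConf() here would silently drop the -D options.
        JobConf job = new JobConf(getConf(), ExampleDriver.class);
        // ... set input/output formats and paths for the job here ...
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ExampleDriver(), args));
    }
}
```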
4. Speculative execution
-
If a particular node is taking a long time
to complete a task, Hadoop will launch a duplicate attempt of the same task on
another node. The attempt that finishes first is retained, and the attempts that
do not finish first are killed. The slow node will continue to be used for other
tasks; Hadoop does not have the ability to 'fail' a node here, it just stops
using it for that particular task.
-
Speculative execution is a smart idea in
that, if a task is running on a degraded node, the time taken for
completion might be longer, while a parallel attempt running on the same data
on a different node might succeed faster. The Hadoop framework is
intelligent enough to keep the faster-running attempt and kill the slower ones.
This helps improve overall job performance.
-
However, on the other hand, speculative
execution uses more cluster resources than are actually required for the completion
of a job. This may in turn cause slower performance and memory issues. As a
rule of thumb, it is good to have speculative execution ON for a large number of
small jobs and OFF for large jobs.
-
Always make sure that speculative
execution is turned OFF for any HBase-intensive jobs. If it is ON, it will
burden HBase with double the actual number of requests, which creates a
good chance of the region servers exceeding their allocated memory (by default 4GB)
and going into a dead state.
-
How to disable Speculative execution in
Hadoop
-
Speculative execution is enabled by default
in Hadoop. You can disable speculative execution for the mappers and reducers
in mapred-site.xml:
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
or set it in the job configuration (note: Configuration.set() takes string values, so use setBoolean instead of passing a boolean to set()):
jobconf.setBoolean("mapred.map.tasks.speculative.execution", false);
jobconf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
or by passing the same through the
Hadoop job arguments:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
mapreduce.map.speculative and mapreduce.reduce.speculative are the newer alternatives to mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution.
5. Multiple Outputs, Task attempts and Speculative Execution
-
The MultipleOutputs class simplifies writing
output data to multiple outputs. It helps in writing to additional outputs
other than the job's default output, and in writing data to different files
specified by the user.
-
When the JobTracker is informed that a
particular task has failed, the task is rescheduled for another attempt. By default,
4 attempts will be made.
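The attempt limit is controlled by the following properties (4 is the usual default, shown here only for reference):

```xml
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```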
-
In cases where the multiple outputs point
outside the default job output, only one task attempt can succeed: all
the subsequent attempts will fail with a 'File Exists' error. This is because
a failed attempt does not automatically clean up the files it has created.
-
A workaround is to write the multiple
outputs to separate directories inside the default job output and, when the job
completes successfully, move them to the required location.
-
Always take care to turn OFF speculative
execution when multiple outputs are written to directories other than the
default job output folder. Otherwise, it will cause failed tasks due to 'File
Exists' errors and might eventually lead to job failures.
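The workaround above can be sketched with the old mapred API as follows (class and named-output names are hypothetical). The named output is declared in the driver and written from the reducer; its files are created inside the job output directory, from where they can be moved after a successful run:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side (sketch): declare a named output in addition to the default:
//   MultipleOutputs.addNamedOutput(jobConf, "summary",
//       TextOutputFormat.class, Text.class, Text.class);

// Reducer side (sketch): write to both the default and the named output.
public class SummaryReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
    }

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);                                // default output
            mos.getCollector("summary", reporter).collect(key, value); // named output
        }
    }

    @Override
    public void close() throws IOException {
        mos.close(); // flush and close all named outputs
    }
}
```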