Wednesday, December 21, 2011

MapReduce Performance Tuning

Administration Lab 4: MapReduce Performance Tuning
Restore the Last State of VM
  1. Open the VirtualBox application
  2. Start the last VM state
  3. Bounce the Hadoop cluster:
for x in /etc/init.d/hadoop-* ; do sudo $x stop; done
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
  4. The VM is configured with 2 map slots and 2 reduce slots
  5. The VM has 1 CPU with 2 cores (see the check below)
In Windows, you can confirm the host hardware via All Programs => Accessories => System Tools => System Information.
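Inside the Linux VM itself, you can confirm the CPU/core count from /proc (a minimal sketch using standard Linux facilities; these commands are not part of the original lab):
# Count logical processors visible to the guest
grep -c '^processor' /proc/cpuinfo
# Cores per physical CPU (field name may vary by kernel)
grep 'cpu cores' /proc/cpuinfo | head -1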
Execute Waiting Job
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
  1. -m: number of mappers
  2. -r: number of reducers
  3. -mt: milliseconds to sleep at the map step
  4. -rt: milliseconds to sleep at the reduce step
Map Scheduling
  1. Only two mappers will be initialized at the same time
  2. Go to the JobTracker web UI on port 50030, click the running job, and open the map step
Reduce Scheduling
  1. Only two reducers will be initialized at the same time
  2. Go to the JobTracker web UI on port 50030, click the running job, and open the reduce step (a command-line alternative is sketched below)
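If you prefer the terminal to the browser, the same summary can be scraped from the JobTracker front page (a sketch; jobtracker.jsp is the 0.20-era UI path, and the page markup this grep relies on is an assumption):
# Pull the cluster summary and look for the running task counts
curl -s http://localhost:50030/jobtracker.jsp | grep -i -A2 'running'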
Overall Job Summary
  1. Take note of your job's running time (you can also time it from the shell, as sketched below)
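A simple way to capture the wall-clock time without the web UI is the shell's time builtin (a sketch, reusing the same CDH3 example jar as above):
# Prints real/user/sys time once the job finishes
time hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000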
How can we improve that?
Reduce the number of mappers and reducers (doubling each task's sleep time keeps the total work constant):
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 2 -r 2 -mt 20000 -rt 20000
Increase the number of mappers/reducers
  1. Go to /etc/hadoop-0.20/conf (use Tab for auto-completion)
  2. Open mapred-site.xml with sudo permissions
  3. Increase the maximum number of map and reduce tasks to 4
Config File Example
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
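After saving the file, a quick sanity check that both values took effect (a sketch using plain grep, nothing Hadoop-specific):
# Print each tasks.maximum property name together with the line that follows it (the value)
grep -A1 'tasks.maximum' /etc/hadoop-0.20/conf/mapred-site.xml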
FAQ: Where Are the Defaults?
  1. Defaults are located inside hadoop-core.jar
  2. Locate hadoop-core.jar; the default location is /usr/lib/hadoop-0.20/hadoop-core.jar
  3. Copy the jar to your home directory: cp /usr/lib/hadoop-0.20/hadoop-core.jar ~/
  4. Check its contents: jar tf hadoop-core.jar | grep default
  5. Extract the contents: jar xvf hadoop-core.jar
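For example, to pull out just the MapReduce defaults and look up the slot settings (a sketch; mapred-default.xml is the entry name the grep above should reveal in 0.20-era jars, but verify against your own listing):
cd ~
# Extract a single named entry from the jar
jar xf hadoop-core.jar mapred-default.xml
# Show the default map/reduce slot maximums with surrounding context
grep -B1 -A2 'tasks.maximum' mapred-default.xml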
Default files are a good source of information
<property>
  <name>hadoop.job.history.location</name>
  <value></value>
  <description>If the job tracker is static, the history files are stored in this single well-known place. If no value is set here, by default, it is in the local file system at ${hadoop.log.dir}/history.
  </description>
</property>

<property>
  <name>hadoop.job.history.user.location</name>
  <value></value>
  <description>User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in the output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none".
  </description>
</property>
Bounce the cluster and wait for safe mode to exit
for x in /etc/init.d/hadoop-0.20-*; do sudo $x stop; done
  1. Shortcut: history | grep stop
  2. Re-run by number: !<number of the command>
  3. Run the same loop with start
hadoop-0.20 dfsadmin -safemode wait
Note: You might see an error while stopping the cluster. This is related to a currently open bug that should be fixed in the next release of Hadoop. It is safe to ignore, and you can proceed.
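Putting the whole bounce together (a sketch; the status action is a common init-script convention and is assumed here, not taken from the lab):
for x in /etc/init.d/hadoop-0.20-*; do sudo $x stop; done
for x in /etc/init.d/hadoop-0.20-*; do sudo $x start; done
# Assumed: most init scripts accept a status action for a quick health check
for x in /etc/init.d/hadoop-0.20-*; do sudo $x status; done
# Blocks until the NameNode leaves safe mode
hadoop-0.20 dfsadmin -safemode wait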
New Capacity
  1. You should now see a map task capacity of 4 and a reduce task capacity of 4 on the JobTracker page (port 50030); a command-line check is sketched below
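To confirm the new capacity without the browser (a sketch; it scrapes the JobTracker summary page, whose exact markup is an assumption):
curl -s http://localhost:50030/jobtracker.jsp | grep -i 'task capacity'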
Let’s execute the same job
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
All four tasks now initialize at the same time
[Screenshots: mapper tasks, reducer tasks, and total summary pages from the JobTracker UI]
All Summary

         Req. Maps  Req. Reduces  Avail. Map Slots  Avail. Reduce Slots  Avg. Map  Avg. Reduce  Total Time
Case 1   4          4             2                 2                    33 sec    33 sec       100 sec
Case 2   2          2             2                 2                    26 sec    34 sec       69 sec
Case 3   4          4             4                 4                    123 sec   47 sec       172 sec
Case 4   1          1             4                 4                    42 sec    47 sec       94 sec
FAQ: How to kill a job?
  1. Retrieve the job id:
hadoop-0.20 job -list
  2. Kill the job:
hadoop-0.20 job -kill <job-id>
  3. Note: this is a hard kill; some additional cleanup might be required
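For example (the job id below is purely illustrative; substitute the id that hadoop-0.20 job -list prints for your job):
hadoop-0.20 job -list
# Hypothetical id for illustration only
hadoop-0.20 job -kill job_201112210000_0001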
