MapReduce
Writing even a simple MapReduce job in Hadoop means writing a lot of Java boilerplate. This blog post explains how to use the MRJob package in Python to write and execute a MapReduce job that breaks down movie ratings.
Introduction
MapReduce is natively Java-based, but the Hadoop streaming interface allows MapReduce jobs to be written in other languages such as C++ and Python.
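To make the streaming idea concrete, here is a minimal sketch (my own illustration, not from the course material) of what a streaming mapper and reducer do: the mapper turns input lines into tab-separated key/value pairs, Hadoop sorts them by key, and the reducer totals consecutive runs of the same key. The function names here are mine.

```python
def stream_mapper(lines):
    # A streaming mapper reads raw input lines and emits "key\tvalue"
    # strings; this one counts words.
    for line in lines:
        for word in line.split():
            yield '%s\t%d' % (word, 1)

def stream_reducer(sorted_pairs):
    # Hadoop delivers the mapper output sorted by key, so a reducer can
    # total consecutive runs of the same key.
    current, total = None, 0
    for pair in sorted_pairs:
        key, value = pair.split('\t')
        if key != current:
            if current is not None:
                yield '%s\t%d' % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield '%s\t%d' % (current, total)

def run_streaming_job(lines):
    # Chain mapper -> sort -> reducer, mimicking what the
    # hadoop-streaming.jar does between two separate scripts.
    return list(stream_reducer(sorted(stream_mapper(lines))))
```

In a real streaming job the mapper and reducer would be two separate scripts wired together by the streaming jar; `run_streaming_job` chains them in-process only for illustration.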
A good explanation of the model is given in section 4.3, “Understand how MapReduce works”1.
NOTE: HDFS examples are given in the HDFS Basics.
Prepare Python
I am using Hortonworks 2.6.5 for the following RatingsBreakdown.py2 example. First, download the script:
wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
Here is the Python source:
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
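Before touching the cluster, the mapper and reducer logic can be sanity-checked in plain Python. This quick check is my own addition: it re-implements the two methods as standalone functions and simulates the shuffle step with a dict, using the tab-separated u.data line format (userID, movieID, rating, timestamp).

```python
# Sanity-check the mapper/reducer logic from RatingsBreakdown.py without
# Hadoop or mrjob installed.

def mapper_get_ratings(line):
    # Same logic as the MRJob mapper: emit (rating, 1) per input line.
    userID, movieID, rating, timestamp = line.split('\t')
    yield rating, 1

def reducer_count_ratings(key, values):
    # Same logic as the MRJob reducer: total the counts for one rating.
    yield key, sum(values)

lines = [
    '196\t242\t3\t881250949',
    '186\t302\t3\t891717742',
    '22\t377\t1\t878887116',
]

# Group mapper output by key, as the MapReduce shuffle would.
grouped = {}
for line in lines:
    for key, value in mapper_get_ratings(line):
        grouped.setdefault(key, []).append(value)

counts = {}
for key, values in grouped.items():
    for k, total in reducer_count_ratings(key, values):
        counts[k] = total

print(counts)  # {'3': 2, '1': 1}
```

The sample lines above are made up for the check; the real u.data file has the same four-column layout.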
and the MovieLens 100k ratings data:
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
Install pip:
sudo yum install python-pip
Now upgrade the pip version:
sudo pip install --upgrade pip
Install MRJob:
pip install mrjob==0.5.11
When the installation is finished, run the MapReduce job as follows.
Run on Hadoop
In Hortonworks 2.6.5, the streaming jar is available at /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar.
First, copy u.data to the data directory in HDFS, as shown in the screenshot above. Then run the MapReduce job:
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs://172.18.0.2:8020/user/maria_dev/data/u.data
Investigate the Job
To investigate the job run above, first list it:
mapred job -list all
The above command gives you the job ID. Now find the tracking URL for the job by executing the following command:
mapred job -status job_<number>
REF