Hadoop MapReduce is notorious for the amount of Java code needed for even a simple job. This blog post explains how to use the MRJob package to write a MapReduce job in Python and run it against the MovieLens movie-ratings dataset.

Introduction

MapReduce is natively Java-based, but the Hadoop Streaming interface allows MapReduce jobs to be written in other languages such as C++ and Python.
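Under the hood, Hadoop Streaming simply runs your program as a child process: a streaming mapper reads raw input lines on stdin and writes tab-separated key/value pairs to stdout. Here is a rough, hypothetical sketch of a raw streaming mapper (mirroring the ratings mapper used later in this post), just to show the plumbing that mrjob will hide behind a Python class:

#!/usr/bin/env python
# Hypothetical raw Hadoop Streaming mapper: Hadoop pipes input lines
# to stdin and collects tab-separated key/value pairs from stdout.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')     # userID, movieID, rating, timestamp
    if len(fields) == 4:
        print('%s\t%s' % (fields[2], 1))  # key = rating, value = 1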

There is a good explanation in section 4.3, “Understand how MapReduce works”1:

[Figure: MapReduce explained]

NOTE: HDFS examples are given in the HDFS Basics post.

Prepare Python

I am using the Hortonworks 2.6.5 sandbox for the following RatingsBreakdown.py2 example. First, download the script:

wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py

Here is the Python source:

from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        # A single map/reduce step: extract ratings, then count them.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is tab-separated: userID, movieID, rating, timestamp.
        (userID, movieID, rating, timestamp) = line.split('\t')
        # Emit the rating as the key with a count of 1.
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Sum up all the 1s emitted for each rating value.
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
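Note that steps() returns a list, so a job is not limited to a single step: mrjob feeds the output of each MRStep into the next one. As a hypothetical sketch (not part of the downloaded script), a second step could sort the ratings by how often they occur:

from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsSortedByCount(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
            MRStep(reducer=self.reducer_sort_by_count)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Funnel everything under a single key so the next reducer
        # sees all (count, rating) pairs together.
        yield None, (sum(values), key)

    def reducer_sort_by_count(self, _, count_rating_pairs):
        # Sorting by the count (the first tuple element) orders
        # the ratings from least to most common.
        for count, rating in sorted(count_rating_pairs):
            yield rating, count

if __name__ == '__main__':
    RatingsSortedByCount.run()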

and the MovieLens 100k ratings file:

wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
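u.data holds one tab-separated record per line, in exactly the order the mapper above unpacks, i.e. userID, movieID, rating, timestamp. The first lines look something like this (illustrative values):

196	242	3	881250949
186	302	3	891717742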

Install pip:

sudo yum install python-pip

Now upgrade pip itself:

sudo pip install --upgrade pip

Install the mrjob package:

pip install mrjob==0.5.11

When the installation is finished, you can run the MapReduce job.
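Before going to the cluster, it is worth a quick local test. With no runner specified, mrjob executes the whole job in a single local Python process, no Hadoop required:

python RatingsBreakdown.py u.data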

Run on Hadoop

In Hortonworks 2.6.5, the Hadoop streaming JAR is available at /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar.

[Screenshot: u.data copied into the data directory under /user/maria_dev in HDFS]

First, copy u.data into a data directory in HDFS, as shown in the screenshot above.
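A minimal sketch of the copy, assuming you are logged in as maria_dev so that relative paths resolve under /user/maria_dev:

hadoop fs -mkdir -p data
hadoop fs -put u.data data/

With the data in place, run the MapReduce job: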

python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs://172.18.0.2:8020/user/maria_dev/data/u.data
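When the job finishes, mrjob fetches the results from HDFS and prints one key/value pair per line (both JSON-encoded by default, hence the quoted keys). For the MovieLens 100k data the breakdown should look something like this:

"1"	6110
"2"	11370
"3"	27145
"4"	34174
"5"	21201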

Investigate the Job

If you want to investigate the job run above, first find its ID as follows:

mapred job -list all

The above command gives you the job ID. Now get the job's status, including its tracking URL, by executing the following command:

mapred job -status job_<number>

REF

  1. Hadoop and Spark Fundamentals, Douglas Eadline, Addison-Wesley Professional, 2018

  2. The Ultimate Hands-On Hadoop, Frank Kane, Packt Publishing, 2017