COSC 416 - Special Topics in Databases
Assignment 2 - Hadoop, HDFS, and MapReduce

In this assignment we will experiment with Hadoop by creating Hadoop MapReduce programs.

Tutorial

To use Hadoop, log in to gpu1.ddl.ok.ubc.ca over SSH with your Novell account. Some useful commands:

Command	Purpose
hadoop	Verifies that you can run Hadoop; prints the usage message.
hadoop fs -put local target	Copies the directory called `local` into the DFS with directory name `target`.
hadoop fs -rmr mydir	Recursively deletes a directory called `mydir` from DFS.
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/data output 'c[a-z.]+'	Run grep example.
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar wordcount /user/hduser/data output	Run WordCount example.
hadoop fs -get output output	Copies the output files from DFS to the local file system.
cat output/* | more	Displays the contents of the files in the local output directory, one page at a time.

The Apache Hadoop 1.0.4 file system shell guide lists more commands.

The NameNode is gpu1.ddl.ok.ubc.ca. The HDFS cluster status and file system can be browsed at: http://gpu1.ddl.ok.ubc.ca:50070/.

The job tracker URL is http://gpu1.ddl.ok.ubc.ca:50030/jobtracker.jsp.

Steps to compile and run the test WordCount program from a terminal window:

Download the source code at WordCount.java.
Put the code in a directory like lab2 on gpu1.
Compile the code with the command: javac -classpath /usr/share/hadoop/hadoop-core-1.0.4.jar WordCount.java
Create a JAR file packaging your program: jar -cvf wordcount.jar *
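Submit the JAR file for execution (the last argument is the HDFS output directory, which must not already exist; use a directory of your own in place of /user/rlawrenc/outputwc): hadoop jar wordcount.jar WordCount /user/hduser/data /user/rlawrenc/outputwc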

Steps to compile and run the test WordCount program from Eclipse (client-side submission):

Download the Eclipse project 416Hadoop.zip.
Create a new Java project in Eclipse.
Copy all files in the zip into the new Java project.
Adjust the build path appropriately.
Run the program.

Task #1 (5 marks) - List all Games

Write a Hadoop MapReduce program that will list all the game records. The data set is available at /user/rlawrenc/416/lab2/small/games.txt. There should be 100 records printed.

Task #2 (5 marks) - Find a Game by its Id

Write a Hadoop MapReduce program that will take a game id as a run-time parameter and return the game record if found. The data set is available at /user/rlawrenc/416/lab2/small/games.txt.
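
As a starting point, the selection logic can be sketched in plain Java without the Hadoop libraries. This is only an illustration of the map-only filter pattern, not a finished job. The record layout (comma-separated, game id in the first field), the class name FindGameById, and the sample records are all assumptions, since the format of games.txt is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of a map-only "select by id" job (no Hadoop dependency). */
final class FindGameById {

    /**
     * Map step: emit the record unchanged if its id field equals targetId.
     * ASSUMPTION: records are comma-separated and the game id is the first field.
     */
    static List<String> map(String line, String targetId) {
        List<String> emitted = new ArrayList<>();
        String[] fields = line.split(",", -1);
        // Compare the whole field, so that id "2" does not match id "12".
        if (fields[0].trim().equals(targetId)) {
            emitted.add(line);
        }
        return emitted;
    }

    /** Driver: apply the map step to every input line; no reduce phase is needed. */
    static List<String> run(List<String> lines, String targetId) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            output.addAll(map(line, targetId));
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real layout of games.txt may differ.
        List<String> games = Arrays.asList("1,Chess", "2,Checkers", "12,Go");
        // The game id is a run-time parameter, with a default for this demo.
        String targetId = args.length > 0 ? args[0] : "2";
        for (String record : run(games, targetId)) {
            System.out.println(record);
        }
    }
}
```

In an actual Hadoop job, one common way to hand the id to the mapper is to store it in the job Configuration in the driver and read it back in the mapper; the property name you choose is up to you.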

Task #3 (10 marks) - Players over 18

Write a Hadoop MapReduce program that lists only the players over 18. The output should be sorted by age ascending. The data set is available at /user/rlawrenc/416/lab2/small/players.txt.
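
One way to think about the sorting requirement: if the map step emits the age as the key, the shuffle delivers keys to a reducer in ascending order, so a single reducer sees the players already sorted. The plain-Java sketch below imitates that with a TreeMap standing in for the shuffle. It assumes a hypothetical layout of playerId,name,age for players.txt, and the class name PlayersOver18 and sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of "filter in map, sort by key in shuffle" (no Hadoop dependency). */
final class PlayersOver18 {

    /**
     * ASSUMPTION: each record is "playerId,name,age" (comma-separated, age in the third field).
     * Returns the records with age > 18, ordered by age ascending.
     */
    static List<String> run(List<String> lines) {
        // The TreeMap plays the role of the shuffle: keys (ages) come out in ascending order.
        Map<Integer, List<String>> shuffled = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (fields.length < 3) {
                continue; // skip records that do not have an age field
            }
            int age = Integer.parseInt(fields[2].trim());
            if (age > 18) { // "over 18" is read as strictly greater than 18
                shuffled.computeIfAbsent(age, k -> new ArrayList<>()).add(line);
            }
        }
        // Reduce step: write out each group of records in key order.
        List<String> output = new ArrayList<>();
        for (List<String> group : shuffled.values()) {
            output.addAll(group);
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real layout of players.txt may differ.
        List<String> players = Arrays.asList("1,Ann,25", "2,Bob,18", "3,Cy,19", "4,Di,17");
        for (String record : run(players)) {
            System.out.println(record);
        }
    }
}
```

Note that with more than one reducer each output file is sorted on its own, but the files together are not globally sorted unless the partitioning is chosen with that in mind.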

Task #4 (10 marks) - Number of Players per Game

Write a Hadoop MapReduce program that will calculate the number of players per game. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
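
This task follows the same shape as WordCount: the map step emits (game id, 1) and the reduce step sums the ones for each key. A minimal plain-Java sketch of that pattern follows. It assumes a hypothetical layout of playerId,gameId,score for player_games.txt and assumes each player appears at most once per game; the class name PlayersPerGame and the sample data are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the "emit (key, 1), then sum" counting pattern (no Hadoop dependency). */
final class PlayersPerGame {

    /**
     * ASSUMPTION: each record is "playerId,gameId,score" and a player
     * appears at most once per game, so counting records counts players.
     */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (fields.length < 2) {
                continue; // skip records without a game id field
            }
            String gameId = fields[1].trim();
            // Map emits (gameId, 1); merge() does the reducer's summation.
            counts.merge(gameId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real layout of player_games.txt may differ.
        List<String> rows = Arrays.asList("p1,g1,10", "p2,g1,20", "p1,g2,5");
        for (Map.Entry<String, Integer> entry : count(rows).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because summation is associative and commutative, the same reduce logic could also serve as a combiner to cut down the data sent across the network.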

Task #5 (10 marks) - Given a Game Id - List the Top 10 Scores for the Game

Write a Hadoop MapReduce program that will output the top 10 scores in descending order for a given game id. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
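
A common way to approach a top-N query is to filter by game id in the map step and keep only the N largest scores in a bounded min-heap, which a single reducer can then emit in descending order. The plain-Java sketch below shows that heap technique on its own. The layout playerId,gameId,score, the class name TopScores, and the sample data are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of a bounded min-heap top-N selection (no Hadoop dependency). */
final class TopScores {

    /**
     * ASSUMPTION: each record is "playerId,gameId,score" with an integer score.
     * Returns up to n of the highest scores for gameId, in descending order.
     */
    static List<Integer> topScores(List<String> lines, String gameId, int n) {
        // Min-heap: the smallest retained score sits at the head and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (fields.length < 3 || !fields[1].trim().equals(gameId)) {
                continue; // the "map" filter: keep only records for the requested game
            }
            heap.add(Integer.parseInt(fields[2].trim()));
            if (heap.size() > n) {
                heap.poll(); // drop the current minimum so at most n scores remain
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real layout of player_games.txt may differ.
        List<String> rows = Arrays.asList(
                "p1,g1,5", "p2,g1,50", "p3,g1,20", "p4,g1,7", "p5,g2,99");
        String gameId = args.length > 0 ? args[0] : "g1";
        for (int score : topScores(rows, gameId, 10)) {
            System.out.println(score);
        }
    }
}
```

The heap never holds more than n + 1 values, so memory use stays constant no matter how many records the game has.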

Task #6 (10 marks) - Players in Common

Write a Hadoop MapReduce program that will output pairs of game ids and the number of players they have in common. For instance, if 2,000 players play both game X and game Y, then output X, Y, 2000. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
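
One possible decomposition uses two passes: first group the games by player, then emit every pair of games that a player plays and count how often each pair occurs. The plain-Java sketch below imitates both passes in memory; it is one approach among several, not a prescribed solution. The layout playerId,gameId,score, the class name PlayersInCommon, and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java sketch of a two-pass "pairs in common" count (no Hadoop dependency). */
final class PlayersInCommon {

    /**
     * ASSUMPTION: each record is "playerId,gameId,score".
     * Returns a map from "X,Y" (with X sorted before Y) to the number of players who play both.
     */
    static Map<String, Integer> commonPlayers(List<String> lines) {
        // Pass 1: map emits (playerId, gameId); reduce collects the distinct games per player.
        Map<String, TreeSet<String>> gamesByPlayer = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (fields.length < 2) {
                continue; // skip records without a game id field
            }
            gamesByPlayer.computeIfAbsent(fields[0].trim(), k -> new TreeSet<>())
                         .add(fields[1].trim());
        }
        // Pass 2: for each player, emit every pair of games once (smaller id first), then sum.
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (TreeSet<String> games : gamesByPlayer.values()) {
            List<String> sorted = new ArrayList<>(games);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairCounts.merge(sorted.get(i) + "," + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; the real layout of player_games.txt may differ.
        List<String> rows = Arrays.asList(
                "p1,g1,10", "p1,g2,20", "p2,g1,5", "p2,g2,7", "p2,g3,9", "p3,g3,1");
        for (Map.Entry<String, Integer> entry : commonPlayers(rows).entrySet()) {
            System.out.println(entry.getKey() + "," + entry.getValue());
        }
    }
}
```

Ordering each pair with the smaller id first means (X, Y) and (Y, X) are counted under the same key. Pairs of games with no players in common never appear in the output under this approach.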

Submission

Submit all code and files using Connect. You can demonstrate your work at any time for feedback and marking.
