COSC 416 - Special Topics in Databases
Assignment 2 - Hadoop, HDFS, and MapReduce

In this assignment we will experiment with Hadoop by creating Hadoop MapReduce programs.

Tutorial

To use Hadoop, log in using SSH to gpu1.ddl.ok.ubc.ca with your Novell account. Some useful commands:

Command
    Purpose

hadoop
    Verifies that you can run Hadoop; prints the usage message.

hadoop fs -put local target
    Copies the local directory called local into the DFS under the directory name target.

hadoop fs -rmr mydir
    Recursively deletes the directory called mydir from the DFS.

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/data output 'c[a-z.]+'
    Runs the grep example.

hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar wordcount /user/hduser/data output
    Runs the WordCount example.

hadoop fs -get output output
    Copies the output directory from the DFS to the local file system.

cat output/* | more
    Displays the contents of the files in the local output directory, one page at a time.

The Apache Hadoop 1.0.4 file system shell guide lists more commands.

The NameNode is gpu1.ddl.ok.ubc.ca. The HDFS cluster status and file system can be browsed at: http://gpu1.ddl.ok.ubc.ca:50070/.

The job tracker URL is http://gpu1.ddl.ok.ubc.ca:50030/jobtracker.jsp.

Steps to compile and run the test WordCount program (terminal window):

Steps to compile and run the test WordCount program from Eclipse (client-side submission):

Task #1 (5 marks) - List all Games

Write a Hadoop MapReduce program that will list all the game records. The data set is available at /user/rlawrenc/416/lab2/small/games.txt. There should be 100 records printed.

Task #2 (5 marks) - Find a Game by its Id

Write a Hadoop MapReduce program that will take a game id as a run-time parameter and return the game record if found. The data set is available at /user/rlawrenc/416/lab2/small/games.txt.
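
The lookup itself is a filter in the map step. The following is a minimal plain-Java sketch of that logic only, with no Hadoop classes. It assumes each line of games.txt is comma-separated with the game id as the first field; that layout is an assumption, as is the class name FindGameSketch. In an actual job, one common approach is for the driver to store the id in the job Configuration and for the mapper to read it back.

```java
import java.util.List;
import java.util.Optional;

// Plain-Java sketch of the Task 2 filter; not a Hadoop job.
// ASSUMPTION: each record is comma-separated and starts with the game id.
class FindGameSketch {

    // "Map" step: emit the record only when its id matches the parameter.
    static Optional<String> findGame(List<String> lines, String gameId) {
        for (String line : lines) {
            String id = line.split(",", 2)[0].trim();
            if (id.equals(gameId)) {
                return Optional.of(line);
            }
        }
        return Optional.empty(); // no record with that id
    }
}
```

Calling FindGameSketch.findGame(lines, "2") returns the matching record wrapped in an Optional, or an empty Optional when the id is not present.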

Task #3 (10 marks) - Players over 18

Write a Hadoop MapReduce program that lists only the players over 18. The output should be sorted by age ascending. The data set is available at /user/rlawrenc/416/lab2/small/players.txt.
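
One way to think about this task is to key each qualifying record by age and let the shuffle do the sorting. The sketch below imitates that in plain Java, with a TreeMap standing in for the sorted shuffle. It assumes each line of players.txt has the form playerId,name,age and reads "over 18" as age strictly greater than 18; both are assumptions, and PlayersOver18Sketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Task 3 logic; no Hadoop classes are used.
// ASSUMPTION: each line is "playerId,name,age" (the real layout may differ).
class PlayersOver18Sketch {

    static List<String> filterAndSort(List<String> lines) {
        // "Map" step: keep a record only if age > 18 and key it by age.
        // The TreeMap stands in for the shuffle, which sorts by key.
        Map<Integer, List<String>> byAge = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            int age = Integer.parseInt(fields[2].trim());
            if (age > 18) {
                byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(line);
            }
        }
        // "Reduce" step: identity; emit the records in ascending key order.
        List<String> out = new ArrayList<>();
        for (List<String> group : byAge.values()) {
            out.addAll(group);
        }
        return out;
    }
}
```

Note that in a real job the framework sorts keys within each reducer, so a globally sorted result needs either a single reducer or a partitioner that preserves key order across reducers.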

Task #4 (10 marks) - Number of Players per Game

Write a Hadoop MapReduce program that will calculate the number of players per game. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
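
This task follows the same pattern as WordCount: the map step emits (gameId, 1) and the reduce step sums the ones for each key. The plain-Java sketch below shows that pattern without any Hadoop classes. It assumes each line of player_games.txt has the form playerId,gameId,score and that each player appears at most once per game; both are assumptions, and PlayersPerGameSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the Task 4 logic; no Hadoop classes are used.
// ASSUMPTION: each line is "playerId,gameId,score", one line per (player, game).
class PlayersPerGameSketch {

    static Map<String, Integer> countPlayers(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // "Map" step: extract the game id, conceptually emitting (gameId, 1).
            String gameId = line.split(",")[1].trim();
            // "Reduce" step: sum the ones that share a key.
            counts.merge(gameId, 1, Integer::sum);
        }
        return counts;
    }
}
```

If a player can appear more than once for the same game, the reduce step would need to count distinct player ids instead of summing ones.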

Task #5 (10 marks) - Given a Game Id - List the Top 10 Scores for the Game

Write a Hadoop MapReduce program that will output the top 10 scores in descending order for a given game id. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
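
A possible approach is to filter by game id in the map step and keep only the N largest scores in a bounded min-heap. The plain-Java sketch below illustrates that idea without any Hadoop classes. It assumes each line has the form playerId,gameId,score with an integer score; that layout is an assumption, and TopScoresSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java sketch of the Task 5 logic; no Hadoop classes are used.
// ASSUMPTION: each line is "playerId,gameId,score" with an integer score.
class TopScoresSketch {

    static List<Integer> topScores(List<String> lines, String gameId, int n) {
        // Min-heap holding at most n scores; the smallest is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            // "Map" step: skip records that belong to other games.
            if (!fields[1].trim().equals(gameId)) {
                continue;
            }
            heap.add(Integer.parseInt(fields[2].trim()));
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest score
            }
        }
        // "Reduce" step: output the surviving scores in descending order.
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Calling TopScoresSketch.topScores(lines, gameId, 10) gives the ten highest scores for that game, or fewer if the game has fewer records. Bounding the heap keeps memory use proportional to N rather than to the number of scores.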

Task #6 (10 marks) - Players in Common

Write a Hadoop MapReduce program that will output pairs of game ids and the number of players they have in common. For instance, if game X and game Y have 2,000 players in common (play both games), then output X, Y, 2000. The data does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
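
One way to structure this task is as two phases: first group the games by player, then emit every pair of games a player plays and sum the pairs. The plain-Java sketch below walks through both phases without any Hadoop classes. It assumes each line has the form playerId,gameId,score; that layout is an assumption, and PlayersInCommonSketch and the "X,Y" key format are illustrative choices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java sketch of the Task 6 logic; no Hadoop classes are used.
// ASSUMPTION: each line is "playerId,gameId,score".
class PlayersInCommonSketch {

    static Map<String, Integer> commonPlayers(List<String> lines) {
        // Phase 1: key by player and collect that player's distinct games.
        Map<String, TreeSet<String>> gamesByPlayer = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            gamesByPlayer.computeIfAbsent(fields[0].trim(), k -> new TreeSet<>())
                         .add(fields[1].trim());
        }
        // Phase 2: emit ("X,Y", 1) for each pair with X < Y, then sum per pair.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (TreeSet<String> games : gamesByPlayer.values()) {
            List<String> sorted = new ArrayList<>(games);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    String pair = sorted.get(i) + "," + sorted.get(j);
                    pairCounts.merge(pair, 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }
}
```

Ordering each pair so that the smaller id comes first ensures that (X, Y) and (Y, X) are counted under the same key. Pairs with no players in common are never emitted in this sketch.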

Submission

Submit all code and files using Connect. You can demonstrate your work at any time for feedback and marking.

