In this assignment we will experiment with Hadoop by creating Hadoop MapReduce programs.
To use Hadoop, log in using SSH to gpu1.ddl.ok.ubc.ca with your Novell account. Some useful commands:
Command | Purpose |
---|---|
hadoop | Verify you can run Hadoop. Get welcome usage message. |
hadoop fs -put local target | Copies the directory called local into the DFS with directory name target. |
hadoop fs -rmr mydir | Recursively deletes a directory called mydir from DFS. |
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/data output 'c[a-z.]+' | Run grep example. |
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.4.jar wordcount /user/hduser/data output | Run WordCount example. |
hadoop fs -get output output | Copies the output directory from DFS to the local file system. |
cat output/* \| more | Displays the contents of the files in the local output directory, one page at a time. |
The Apache Hadoop 1.0.4 file system shell guide lists more commands.
The NameNode is gpu1.ddl.ok.ubc.ca. The HDFS cluster status and file system can be browsed at: http://gpu1.ddl.ok.ubc.ca:50070/.
The job tracker URL is http://gpu1.ddl.ok.ubc.ca:50030/jobtracker.jsp.
Steps to compile and run the test WordCount program (terminal window):
Steps to compile and run the test WordCount program from Eclipse (client-side submission):
Write a Hadoop MapReduce program that will list all the game records. The data set is available at /user/rlawrenc/416/lab2/small/games.txt. There should be 100 records printed.
Write a Hadoop MapReduce program that will take a game id as a run-time parameter and return the game record if found. The data set is available at /user/rlawrenc/416/lab2/small/games.txt.
Write a Hadoop MapReduce program that lists only the players over 18. The output should be sorted by age ascending. The data set is available at /user/rlawrenc/416/lab2/small/players.txt.
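The filter-and-sort pattern in this question can be sketched in plain Java, without the Hadoop API. In a real job, the mapper could emit the age as the output key so that the sort Hadoop performs on map output keys orders the records; here a `TreeMap` stands in for that step. The record layout `playerId,name,age` and the class and method names are assumptions made for illustration, so check the actual format of players.txt first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AdultPlayersSketch {
    // Assumed (hypothetical) record layout: playerId,name,age
    static List<String> playersOver18(List<String> lines) {
        // TreeMap keeps keys in ascending order, standing in for
        // the sort Hadoop applies to map output keys.
        Map<Integer, List<String>> byAge = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            int age = Integer.parseInt(fields[2].trim());
            if (age > 18) { // "map" step: drop players aged 18 or under
                byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(line);
            }
        }
        // "reduce" step: write the records out in key (age) order
        List<String> out = new ArrayList<>();
        for (List<String> group : byAge.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("1,Ann,25", "2,Bob,17", "3,Cy,19", "4,Di,18");
        playersOver18(sample).forEach(System.out::println);
    }
}
```

With more than one reducer, each output file is sorted only within itself, so a single reducer is the simplest way to get one globally sorted result.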
Write a Hadoop MapReduce program that will calculate the number of players per game. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
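The counting pattern here has the same shape as WordCount: the map step emits a (game id, 1) pair per record and the reduce step sums the values for each key. The sketch below simulates that flow in plain Java rather than with the Hadoop API. The record layout `playerId,gameId,score` and all names are assumptions made for illustration. It also counts records, so it matches the number of players only if each player appears at most once per game.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PlayersPerGameSketch {
    // Map step: emit (gameId, 1) for one input record.
    // Assumed (hypothetical) record layout: playerId,gameId,score
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[1].trim(), 1);
    }

    // Reduce step: sum all the counts emitted for one game id.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: groups map output by key (a stand-in for the shuffle),
    // then applies the reduce step to each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("p1,g1,100", "p2,g1,250", "p1,g2,75");
        System.out.println(run(sample)); // prints {g1=2, g2=1}
    }
}
```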
Write a Hadoop MapReduce program that will output the top 10 scores in descending order for a given game id. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
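One common way to approach a top-N question is to filter on the requested game id and keep only the N largest scores in a bounded min-heap, so memory use stays small regardless of input size. The plain Java sketch below shows that idea outside of Hadoop. The record layout `playerId,gameId,score`, integer scores, and the names used are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopScoresSketch {
    // Assumed (hypothetical) record layout: playerId,gameId,score
    static List<Integer> topScores(List<String> lines, String gameId, int n) {
        // Min-heap: the smallest retained score is always at the head,
        // so it can be evicted when a larger score arrives.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (!fields[1].trim().equals(gameId)) {
                continue; // record belongs to a different game
            }
            heap.add(Integer.parseInt(fields[2].trim()));
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest, keeping n scores
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // descending order
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("p1,g1,100", "p2,g1,250", "p3,g2,999", "p4,g1,50");
        System.out.println(topScores(sample, "g1", 2)); // prints [250, 100]
    }
}
```

In a MapReduce setting, the same bounded heap could be kept in each mapper and again in a single reducer, so that only a few candidates per mapper cross the network.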
Write a Hadoop MapReduce program that will output pairs of game ids and the number of players they have in common. For instance, if game X and game Y have 2,000 players in common (that is, 2,000 players play both games), then output X, Y, 2000. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
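One possible way to structure this question is as two chained stages: first group the games played by each player, then emit every pair of games for that player and count how often each pair occurs. The plain Java sketch below simulates both stages in memory and is not Hadoop code. The record layout `playerId,gameId,score` and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class CommonPlayersSketch {
    // Assumed (hypothetical) record layout: playerId,gameId,score
    static Map<String, Integer> commonPlayers(List<String> lines) {
        // Stage 1: collect the distinct games for each player.
        Map<String, TreeSet<String>> gamesByPlayer = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            gamesByPlayer.computeIfAbsent(fields[0].trim(), k -> new TreeSet<>())
                         .add(fields[1].trim());
        }
        // Stage 2: for each player, emit every pair (x, y) with x < y
        // and add 1 to that pair's count.
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (TreeSet<String> games : gamesByPlayer.values()) {
            List<String> g = new ArrayList<>(games); // already sorted
            for (int i = 0; i < g.size(); i++) {
                for (int j = i + 1; j < g.size(); j++) {
                    pairCounts.merge(g.get(i) + "," + g.get(j), 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "p1,g1,10", "p1,g2,20", "p2,g1,30", "p2,g2,40", "p3,g1,50");
        System.out.println(commonPlayers(sample)); // prints {g1,g2=2}
    }
}
```

Ordering each pair so that the smaller id comes first keeps (X, Y) and (Y, X) from being counted as separate keys. The number of pairs emitted per player grows quadratically with the number of games that player has played.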
Submit all code and files using Connect. You can demonstrate your work at any time for feedback and marking.