In this assignment we will experiment with Apache Hive to simplify executing Hadoop Map-Reduce programs. We will do all the questions from lab 3 and compare the difficulty with Apache Pig and regular MapReduce programs.
Sample outputs are below:
To use Hive, login using SSH to gpu1.ddl.ok.ubc.ca with your Novell account. Type /srv/hive/bin/hive to get a shell then enter Hive commands. Data files are available at:
I would recommend copying the files from my directory into your home directory rather than using the files in my home directory. Hadoop commands to copy files and create your own copies:
hadoop fs -get /user/rlawrenc/416/lab4/games/games.txt . hadoop fs -get /user/rlawrenc/416/lab4/players/players.txt . hadoop fs -get /user/rlawrenc/416/lab4/playergames/player_games.txt . hadoop fs -mkdir /user/yourid/416 hadoop fs -mkdir /user/yourid/416/lab4 hadoop fs -mkdir /user/yourid/416/lab4/games hadoop fs -mkdir /user/yourid/416/lab4/players hadoop fs -mkdir /user/yourid/416/lab4/playergames hadoop fs -put games.txt /user/yourid/416/lab4/games/games.txt hadoop fs -put players.txt /user/yourid/416/lab4/players/players.txt hadoop fs -put player_games.txt /user/yourid/416/lab4/playergames/player_games.txtHere are CREATE TABLE definitions for the files. Note that we are using EXTERNAL tables.
CREATE EXTERNAL TABLE Games (gid INT, gname STRING, pubname STRING, releaseDate STRING, rating DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LOCATION '/user/rlawrenc/416/lab4/games'; CREATE EXTERNAL TABLE Players (pid INT, fname STRING, lname STRING, bdate STRING, gender STRING, imagePath STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LOCATION '/user/rlawrenc/416/lab4/players'; CREATE EXTERNAL TABLE PlayerGames (pid INT, gid INT, score INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LOCATION '/user/rlawrenc/416/lab4/playergames';To output to a local file in a directory rather than the console you can use:
INSERT OVERWRITE LOCAL DIRECTORY 'q1' SELECT * FROM Games
Write a HiveQL query that will list all the game records. There should be 100 records printed.
Write a HiveQL query that given some game id (hard-coded constant) will return the game record if found.
Write a HiveQL query that lists only the players that are 18 and over. The output should be sorted by age ascending. You can do this question without a UDF if you look at the date function support of Hive.
Write a HiveQL query that will calculate the number of players per game. The output does not have to be sorted.
Write a HiveQL query that will output the top 10 scores in descending order for a given game id.
Write a HiveQL query that will output pairs of game ids and the number of players they have in common. For instance, if game X and game Y have 2,000 players in common (play both games), then output X, Y, 2000. The data does not have to be sorted.
Write a HiveQL query that will list all games along with the count of the number of players in each game with a score over 98,000. If a game does not have a player with a score over 98,000, it should still appear in the output with a count of 0.
Write a HiveQL query that will list all players that either have a score in some game over 90,000 or play a game published by 'Electronic Arts'.
Write a HiveQL query that for each publisher will list two records. The first will be the publisher id, "female", and the maximum number of women that play one of its games. The second row will be the publisher id, "male", and the maximum number of men that play one of its games. Note that the games may be different.
For each game, show the total number of players, the total number of female and male players, and the percentage of female and male players.
Submit a single HiveQL query file with answers to all questions using Connect or by email. You can demonstrate your work at any time for feedback and marking.