COSC 416 - Special Topics in Databases
Assignment 4 -  Apache Hive

In this assignment we will experiment with Apache Hive to simplify executing Hadoop Map-Reduce programs. We will do all the questions from lab 3 and compare the difficulty with Apache Pig and regular MapReduce programs.

Sample outputs are below:

Tutorial

To use Hive, login using SSH to gpu1.ddl.ok.ubc.ca with your Novell account. Type /srv/hive/bin/hive to get a shell then enter Hive commands. Data files are available at:

I would recommend copying the files from my directory into your home directory rather than using the files in my home directory. Hadoop commands to copy files and create your own copies:

hadoop fs -get /user/rlawrenc/416/lab4/games/games.txt .
hadoop fs -get /user/rlawrenc/416/lab4/players/players.txt .
hadoop fs -get /user/rlawrenc/416/lab4/playergames/player_games.txt .

hadoop fs -mkdir /user/yourid/416
hadoop fs -mkdir /user/yourid/416/lab4
hadoop fs -mkdir /user/yourid/416/lab4/games
hadoop fs -mkdir /user/yourid/416/lab4/players
hadoop fs -mkdir /user/yourid/416/lab4/playergames

hadoop fs -put games.txt /user/yourid/416/lab4/games/games.txt
hadoop fs -put players.txt /user/yourid/416/lab4/players/players.txt
hadoop fs -put player_games.txt /user/yourid/416/lab4/playergames/player_games.txt
Here are CREATE TABLE definitions for the files. Note that we are using EXTERNAL tables.
CREATE EXTERNAL TABLE Games (gid INT, gname STRING, pubname STRING, releaseDate STRING, rating DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011'
LOCATION '/user/rlawrenc/416/lab4/games';

CREATE EXTERNAL TABLE Players (pid INT, fname STRING, lname STRING, bdate STRING, gender STRING, imagePath STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011'
LOCATION '/user/rlawrenc/416/lab4/players';

CREATE EXTERNAL TABLE PlayerGames (pid INT, gid INT, score INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011'
LOCATION '/user/rlawrenc/416/lab4/playergames';
To output to a local file in a directory rather than the console you can use:
INSERT OVERWRITE LOCAL DIRECTORY 'q1'
SELECT * FROM Games

Task #1 (1 mark) - List all Games

Write a HiveQL query that will list all the game records. There should be 100 records printed.

Task #2 (1 mark) - Find a Game by its Id

Write a HiveQL query that given some game id (hard-coded constant) will return the game record if found.

Task #3 (4 marks) - Players 18 and over

Write a HiveQL query that lists only the players that are 18 and over. The output should be sorted by age ascending. You can do this question without a UDF if you look at the date function support of Hive.

Task #4 (2 marks) - Number of Players per Game

Write a HiveQL query that will calculate the number of players per game. The output does not have to be sorted.

Task #5 (2 marks) - Given a Game Id - List the Top 10 Scores for the Game

Write a HiveQL query that will output the top 10 scores in descending order for a given game id.

Task #6 (2 marks) - Players in Common

Write a HiveQL query that will output pairs of game ids and the number of players they have in common. For instance, if game X and game Y have 2,000 players in common (play both games), then output X, Y, 2000. The data does not have to be sorted.

Task #7 (3 marks) - Count Top Players for Each Game

Write a HiveQL query that will list all games along with the count of the number of players in each game with a score over 98,000. If a game does not have a player with a score over 98,000, it should still appear in the output with a count of 0.

Task #8 (4 marks) - List All Players with Certain Properties

Write a HiveQL query that will list all players that either have a score in some game over 90,000 or play a game published by 'Electronic Arts'.

Task #9 (3 marks) - List Most Popular Games for Each Publisher by Gender

Write a HiveQL query that for each publisher will list two records. The first will be the publisher id, "female", and the maximum number of women that play one of its games. The second row will be the publisher id, "male", and the maximum number of men that play one of its games. Note that the games may be different.

Task #10 (3 marks) - Show Game Breakdown by Gender

For each game, show the total number of players, the total number of female and male players, and the percentage of female and male players.

Submission

Submit a single HiveQL query file with answers to all questions using Connect or by email. You can demonstrate your work at any time for feedback and marking.


*Home