COSC 416 - Special Topics in Databases
Midterm

You will have 90 minutes to complete the midterm. Submit your answer via email to ramon.lawrence@ubc.ca. You may also have it checked during the midterm time.

Data Files

The data files describe a standard order schema called TPC-H. Here is the DDL in the relational model. The files are on HDFS at:

/user/rlawrenc/416/tpch/region/region.tbl
/user/rlawrenc/416/tpch/nation/nation.tbl
/user/rlawrenc/416/tpch/part/part.tbl
/user/rlawrenc/416/tpch/supplier/supplier.tbl
/user/rlawrenc/416/tpch/partsupp/partsupp.tbl
/user/rlawrenc/416/tpch/customer/customer.tbl
/user/rlawrenc/416/tpch/orders/order.tbl
/user/rlawrenc/416/tpch/lineitem/lineitem.tbl

Quick Links

Task #1 (5 marks) - Apache Hive

  1. (2 marks) Return the top 10 customers by total number of items ordered for products manufactured by Manufacturer#3. Sample output.
  2. (3 marks) Return the number and percentage of customers in each nation with an order over $400,000 (o_totalprice). Sample output.

Apache Hive create table statements:

CREATE EXTERNAL TABLE region (r_regionkey INT, r_name STRING, r_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/region';

CREATE EXTERNAL TABLE nation (n_nationkey INT, n_name STRING, n_regionkey INT, n_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/nation';

CREATE EXTERNAL TABLE part (p_partkey INT, p_name STRING, p_mfgr STRING, p_brand STRING, p_type STRING, p_size INT, p_container STRING, p_retailprice DOUBLE, p_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/part';

CREATE EXTERNAL TABLE supplier (s_suppkey INT, s_name STRING, s_address STRING, s_nationkey INT, s_phone STRING, s_acctbal DOUBLE, s_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/supplier';

CREATE EXTERNAL TABLE partsupp (ps_partkey INT, ps_suppkey INT, ps_availqty INT, ps_supplycost DOUBLE, ps_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/partsupp';

CREATE EXTERNAL TABLE customer (c_custkey INT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DOUBLE, c_mktsegment STRING, c_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/customer';

CREATE EXTERNAL TABLE orders (o_orderkey INT, o_custkey INT, o_orderstatus STRING, o_totalprice DOUBLE, o_orderdate STRING, o_orderpriority STRING, o_clerk STRING, o_shippriority INT, o_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/orders';

CREATE EXTERNAL TABLE lineitem (l_orderkey INT, l_partkey INT, l_suppkey INT, l_linenumber INT, l_quantity INT, l_extendedprice DOUBLE, l_discount DOUBLE, l_tax DOUBLE, l_returnflag STRING, l_linestatus STRING, l_shipdate STRING,l_commitdate STRING,l_receiptdate STRING, l_shipinstruct STRING, l_shipmode STRING, l_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/rlawrenc/416/tpch/lineitem';

Task #2 (10 marks) - Apache Pig

  1. (4 marks) Return the average number of lineitems per order with a quantity over 5. Sample output.
  2. (6 marks) Return the top 5 part suppliers by dollar amount (l_quantity*l_extendedprice) to Canadian customers. Sample output.

Task #3 (10 marks) - Hadoop Map-Reduce

  1. (10 marks) Write a Hadoop MapReduce program that given a partkey (command-line parameter), returns the total number of times it was in an order, the total quantity ordered over all orders, and the average quantity ordered per order. Sample output.

*Home