All errors are my own! Though if you find any, an email would be appreciated ....
Be aware that some of these questions may not make a lot of sense outside of the taught course.
-Claudia Hauff

Pig and Pig Latin

Which of the following statements is correct?
  1. Pig is an execution engine that replaces the MapReduce core in Hadoop.
    Incorrect.
  2. Pig is an execution engine that utilizes the MapReduce core in Hadoop.
    Correct!
  3. Pig is an execution engine that compiles Pig Latin scripts into database queries.
    Incorrect.
  4. Pig is an execution engine that compiles Pig Latin scripts into HDFS.
    Incorrect.

Which of the following statements about Pig are not correct?

Let's take the following file dataset.txt:
Frank,19,44,1st_year,12
John,23,,2nd_year,-1
Tom,21,,,0
and the following Pig Latin script:
A = load 'dataset.txt' using PigStorage(',');
B = filter A by $1>20;
C = group B by $2;
dump C;
How many records will be generated as output when running this script?
  1. 0
    Incorrect.
  2. 1
    Correct!
  3. 2
    Incorrect.
  4. 3
    Incorrect.

Let's consider the file above once more. You are tasked with writing a Pig Latin script that outputs the unique names (first column) occurring in this file. Which Pig Latin operators do you use (choose the minimum number)?
  1. foreach, distinct
    Correct!
  2. filter, distinct
    Incorrect.
  3. foreach, filter
    Incorrect.
  4. foreach
    Incorrect.
  5. filter
    Incorrect.

Which of the following definitions of complex data types in Pig are correct?
  1. Tuple: a set of key/value pairs
    Incorrect.
  2. Tuple: an ordered set of fields.
    Correct!
  3. Bag: a collection of key/value pairs.
    Incorrect.
  4. Bag: an ordered set of fields.
    Incorrect.
  5. Map: an ordered set of fields.
    Incorrect.
  6. Map: a collection of tuples.
    Incorrect.
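
For intuition, this is roughly how the three complex types look in Pig's textual notation (values taken from the files used on this page):
(Frank,19,44,1st_year,12)                           -- tuple: an ordered set of fields
{(computer science),(mathematics),(electronics)}    -- bag: a collection of tuples
[adres#Mekelweg,number#4,buildingColor#redblue]     -- map: a set of key/value pairs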

Which guarantee that Hadoop provides does Pig break?
  1. Calls to the Reducer's reduce() method only occur after the last Mapper has finished running.
    Incorrect.
  2. All values associated with a single key are processed by the same Reducer.
    Correct!
  3. The Combiner (if defined) may run multiple times, on the Map-side as well as the Reduce-side.
    Incorrect.
  4. Task stragglers due to slow machines (not data skew) can be sped up through speculative execution.
    Incorrect.

The file 'complex.txt' contains the following two lines (tab delimited):
TUDELFT   EWI  [adres#Mekelweg,number#4,buildingColor#redblue]    {(computer science), (mathematics), (electronics)}
TUDELFT    3ME [number#2,adres#Mekelweg,postcode#2628CD] {(mechanical engineering), (maritime engineering), (materials engineering)}
What is the output of the following Pig script?
complex = load 'complex.txt' as (uni:chararray, faculty:chararray, location:map[], departments:bag{dlist:(d:chararray)});
A = foreach complex generate uni, flatten(location#'street');
dump A;
  1. ()
    Incorrect.
  2. ()
    ()
    Incorrect.
  3. (TUDELFT,)
    (TUDELFT,)
    Correct!
  4. (TUDELFT)
    Incorrect.
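
Neither map contains the key 'street' (the address is stored under 'adres'), so location#'street' evaluates to null for both rows; flatten applied to such a scalar value is effectively a no-op, and each row contributes one tuple with an empty second field. A sketch of a lookup that does hit an existing key (same relation as above):
A2 = foreach complex generate uni, location#'adres';   -- (TUDELFT,Mekelweg) for both rows
dump A2;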

Assume you want to join two datasets within a Pig script: Data set 1 consists of all Wikipedia edits (information about how a single Wikipedia page is edited) captured for all languages in one log file across all the years of Wikipedia's existence (billions of lines of log data); one line contains the following fields: $[Unique\ ID, Wikipedia\ URL, Edit\ Timestamp, Editing\ UserID, Number\ of\ Words\ Added]$. The lines are ordered in ascending order by $Editing\ UserID$. Data set 2 consists of information about Wikipedia articles written in Danish (fewer than 100,000 articles overall): $[Unique\ ID, Wikipedia\ URL, Wikipedia\ Title]$. The join should be performed on $[Wikipedia\ URL]$ and the generated data set should look as follows: $[Edit\ Timestamp, Wikipedia\ URL, Wikipedia\ Title]$. Which join is the most efficient one to use here (assuming a Hadoop cluster with 20 machines, each with about 4GB of memory and 1TB of disk space)?
  1. sort-merge join
    Incorrect.
  2. skew join
    Incorrect.
  3. fragment-replicate join
    Correct!
  4. default join
    Incorrect.
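
Since the Danish article set (fewer than 100,000 rows) easily fits into a mapper's memory while the edit log is huge, the small relation can be replicated to every map task. A rough sketch of such a join in Pig (relation and field names are illustrative, not part of the quiz):
edits = load 'edits.log' as (id, url, ts, userid, words);
danish = load 'danish_articles' as (id, url, title);
J = join edits by url, danish by url using 'replicated';  -- the relation listed last must fit in memory
result = foreach J generate edits::ts, edits::url, danish::title;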

Assume you want to join two datasets within a Pig script: Data set 1 has one entry per Wikipedia article with the following information: $[Wikipedia\ URL, Last\ Edit\ Timestamp, Editing\ UserID, Number\ of\ Words\ Added]$. The lines are ordered in ascending order by URL. Data set 2 also has one line per Wikipedia article and contains the following: $[Unique\ ID, Wikipedia\ URL, Wikipedia\ Title]$. The lines are ordered in ascending order by URL. The join should be performed on $[Wikipedia\ URL]$ and the generated data set should look as follows: $[Last\ Edit\ Timestamp, Wikipedia\ URL, Wikipedia\ Title]$. Which join is the most efficient one to use here (assuming a Hadoop cluster with 20 machines, each with about 4GB of memory and 1TB of disk space)?
  1. sort-merge join
    Correct!
  2. skew join
    Incorrect.
  3. fragment-replicate join
    Incorrect.
  4. default join
    Incorrect.
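
Both inputs have one line per article and are already sorted by URL, so a merge (sort-merge) join can stream the two files side by side without an expensive shuffle. A rough sketch (relation and field names are illustrative):
d1 = load 'edits_per_article' as (url, lastts, userid, words);
d2 = load 'articles' as (id, url, title);
J = join d1 by url, d2 by url using 'merge';   -- requires both inputs to be sorted on the join key
result = foreach J generate d1::lastts, d1::url, d2::title;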

Which of the following statements about Pig is correct?
  1. Pig always generates the same number of Hadoop jobs given a particular script, independent of the amount/type of data that is being processed.
    Incorrect.
  2. Pig replaces the MapReduce core with its own execution engine.
    Incorrect.
  3. Pig may generate a different number of Hadoop jobs given a particular script, dependent on the amount/type of data that is being processed.
    Correct!
  4. When doing a default join, Pig will detect which join-type is probably the most efficient.
    Incorrect.

Certain static Java functions can be used in Pig as if they were UDFs. Take a look at the following Pig script:
define hex InvokeForString('java.lang.Integer.toHexString', 'int');
nums = load 'numbers' as (n:int);
inHex = foreach nums generate hex(n);
Apart from these three lines of code, what additional coding responsibilities do we have as developers here?
  1. We need to write InvokeForString().
    Incorrect.
  2. We need to register the jar containing java.lang.Integer.
    Incorrect.
  3. We need to write the toHexString() functionality, extending java.lang.Integer.
    Incorrect.
  4. There is nothing else to be done.
    Correct!
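
InvokeForString (together with siblings such as InvokeForInt and InvokeForLong) ships as a built-in invoker UDF in Pig, and java.lang.Integer is part of the JDK Pig already runs on, so no jar needs to be registered and no extra Java code has to be written; the script works as given.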
