Quiz

Which of the following statements is correct?

Pig is an execution engine that replaces the MapReduce core in Hadoop.

Incorrect.
Pig is an execution engine that utilizes the MapReduce core in Hadoop.

Correct!
Pig is an execution engine that compiles Pig Latin scripts into database queries.

Incorrect.
Pig is an execution engine that compiles Pig Latin scripts into HDFS.

Incorrect.

Which of the following statements about Pig are not correct?

In general, to implement a task, the number of lines of code in Pig and Hadoop are roughly the same. ✓ ✗

This option is correct.

Pig makes use of Hadoop job chaining. ✓ ✗

This option is incorrect.

Code written for the Pig engine is compiled into Hadoop jobs. ✓ ✗

This option is incorrect.

Code written for the Pig engine is directly compiled into machine code. ✓ ✗

This option is correct.

Lets take the following file dataset.txt:

Frank,19,44,1st_year,12
John,23,,2nd_year,-1
Tom,21,,,0

and the following Pig Latin script:

A = load 'dataset.txt' using PigStorage(',');
B = filter A by $1>20;
C = group B by $2;
dump C;

How many records will be generated as output when running this script?

0

Incorrect.
1

Correct!
2

Incorrect.
3

Incorrect.

Let's consider the file above once more. You are tasked with writing a Pig Latin script that outputs the unique names (first column) occurring in this file. Which Pig Latin operators do you use (choose the minimum number)?

foreach, distinct

Correct!
filter, distinct

Incorrect.
foreach, filter

Incorrect.
foreach

Incorrect.
filter

Incorrect.

Which of the following definitions of complex data types in Pig are correct?

Tuple: a set of key/value pairs

Incorrect.
Tuple: an ordered set of fields.

Correct!
Bag: a collection of key/value pairs.

Incorrect.
Bag: an ordered set of fields.

Incorrect.
Map: an ordered set of fields.

Incorrect.
Map: a collection of tuples.

Incorrect.

Which guarantee that Hadoop provides does Pig break?

Calls to the Reducer's reduce() method only occur after the last Mapper has finished running.

Incorrect.
All values associated with a single key are processed by the same Reducer.

Correct!
The Combiner (if defined) may run multiple times, on the Map-side as well as the Reduce-side.

Incorrect.
Task stragglers due to slow machines (not data skew) can be sped up through speculative execution.

Incorrect.

The file 'complex.txt' contains the following two lines (tab delimited):

TUDELFT   EWI  [adres#Mekelweg,number#4,buildingColor#redblue]    {(computer science), (mathematics), (electronics)}
TUDELFT    3ME [number#2,adres#Mekelweg,postcode#2628CD] {(mechanical engineering), (maritime engineering), (materials engineering)}

What is the output of the following Pig script?

complex = load 'complex.txt' as (uni:chararray, faculty:chararray, location:map[], departments:bag{dlist:(d:chararray)});
A = foreach complex generate uni, flatten(location#'street');
dump A;

()

Incorrect.
()
()

Incorrect.
(TUDELFT,)
(TUDELFT,)

Correct!
(TUDELFT)

Incorrect.

Assume you want to join two datasets within a Pig script: Data set 1 consists of all Wikipedia edits (information about how a single Wikipedia page is edited) captured for all languages in one log file across all the year's of Wikipedia's existance (billions of lines of log data); one line contains the following fields $[Unique\ ID, Wikipedia\ URL, Edit\ Timestamp, Editing\ UserID, Number\ of\ Words\ Added]$. The lines are ordered in ascending order by the $Editing\ UserID$. Data set 2 consists of information about Wikipedia articles written in Danish (less than 100,000 articles overall): $[Unique\ ID, Wikipedia\ URL, Wikipedia\ Title]$. The join should be performed on $[Wikipedia\ URL]$ and the generated data set should look as follows: $[Edit\ Timestamp, Wikipedia\ URL, Wikipedia\ Title]$. Which join is the most efficient one to use here (assuming a Hadop cluster with 20 machines, each one with about 4GB of memory and 1TB of disk space)?

sort-merge join

Incorrect.
skew join

Incorrect.
fragment-replicate join

Correct!
default join

Incorrect.

Assume you want to join two datasets within a Pig script: Data set 1 has an entry per Wikipedia article with the following information: $[Wikipedia\ URL, Last\ Edit\ Timestamp, Editing\ UserID, Number\ of\ Words\ Added]$. The lines are ordered in ascending order by URL. Data set 2 consists also has one line per Wikipedia article and contains the following: $[Unique\ ID, Wikipedia\ URL, Wikipedia\ Title]$. The lines are ordered in ascending order by URL. The join should be performed on $[Wikipedia\ URL]$ and the generated data set should look as follows: $[Last\ Edit\ Timestamp, Wikipedia\ URL, Wikipedia\ Title]$. Which join is the most efficient one to use here (assuming a Hadop cluster with 20 machines, each one with about 4GB of memory and 1TB of disk space)?

sort-merge join

Correct!
skew join

Incorrect.
fragment-replicate join

Incorrect.
default join

Incorrect.

Which of the following statements about Pig is correct?

Pig always generates the same number of Hadoop jobs given a particular script, independent of the amount/type of data that is being processed.

Incorrect.
Pig replaces the MapReduce core with its own execution engine.

Incorrect.
Pig may generate a different number of Hadoop jobs given a particular script, dependent on the amount/type of data that is being processed.

Correct!
When doing a default join, Pig will detect which join-type is probably the most efficient.

Incorrect.

Specific static Java functions can be used in Pig like UDFs. Take a look at the following Pig script:

define hex InvokeForString(‘java.lang.Integer.toHexString’,’int’);nums = load 'numbers' as (n:int);
inHex = foreach nums generate hex(n);

Apart from these three lines of code, what additional coding responsibilities do we have as developer here?

We need to write InvokeForString().

Incorrect.
We need to register the jar containing java.lang.Integer.

Incorrect.
We need to write the toHexString() functionality, extending java.lang.Integer.

Incorrect.
There is nothing else to be done.

Correct!

Which of the following definitions of complex data types in Pig are correct?

Tuple: a set of key/value pairs. ✓ ✗

This option is incorrect.

Tuple: an ordered set of fields. ✓ ✗

This option is correct.

Bag: a collection of tuples. ✓ ✗

This option is correct.

Bag: an ordered set of fields. ✓ ✗

This option is incorrect.

Map: a set of key/value pairs. ✓ ✗

This option is correct.

Map: a collection of tuples. ✓ ✗

This option is incorrect.

Pig and Pig Latin