All errors are my own! Though if you find any, an email would be appreciated ....
Be aware that some of these questions may not make a lot of sense outside of the taught course.
-Claudia Hauff

MapReduce and Hadoop

Which of the following statements are true about key/value pairs in Hadoop?

Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the frequency of phrases consisting of 3 words each instead of determining the frequency of single words. Which part of the (pseudo-)code do you need to adapt?
  1. Only map()
    Correct!
  2. Only reduce()
    Incorrect.
  3. map() and reduce()
    Incorrect.
  4. The code does not have to be changed.
    Incorrect.
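
For reference, a minimal sketch of the adapted map() (only the Mapper changes; the standard WordCount reducer keeps summing the emitted 1s). Class and variable names are illustrative, not part of the quiz:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit every 3-word phrase (trigram) of a line instead of every single word.
public class PhraseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text phrase = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().trim().split("\\s+");
        for (int i = 0; i + 2 < words.length; i++) {
            phrase.set(words[i] + " " + words[i + 1] + " " + words[i + 2]);
            context.write(phrase, ONE);   // the unchanged reduce() sums these per phrase
        }
    }
}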

Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the average amount of words per sentence. Which part of the (pseudo-)code do you need to adapt?
  1. Only map()
    Incorrect.
  2. Only reduce()
    Incorrect.
  3. map() and reduce()
    Correct!
  4. The code does not have to be changed.
    Incorrect.
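
A rough sketch of how both methods change (illustrative names; with the single key "*" all pairs end up in one reduce() call, which then computes the global average):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageWordsPerSentence {

    // map() now emits one (dummy key, #words) pair per sentence instead of (word, 1) per word.
    public static class SentenceLengthMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text STAR = new Text("*");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String sentence : value.toString().split("[.!?]")) {
                String trimmed = sentence.trim();
                if (!trimmed.isEmpty()) {
                    context.write(STAR, new IntWritable(trimmed.split("\\s+").length));
                }
            }
        }
    }

    // reduce() now averages the sentence lengths instead of summing word counts.
    public static class AverageReducer
            extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sentences = 0, words = 0;
            for (IntWritable v : values) {
                sentences++;
                words += v.get();
            }
            context.write(key, new DoubleWritable((double) words / sentences));
        }
    }
}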

Which of the following statements about Hadoop's partitioner are true?

Consider Hadoop's WordCount program: for a given text, compute the frequency of each word in it. The input is read line by line. As input, you are given one file that contains a single line of text:
A Ram Sam Sam
How many Mapper objects and Reducer objects are created? How many calls to map() and reduce() are made?
  1. 3 Mapper objects, 1 Reducer object, 3 calls of map(), 1 call to reduce()
    Incorrect.
  2. 3 Mapper objects, 3 Reducer objects, 1 call of map(), 1 call to reduce()
    Incorrect.
  3. 1 Mapper object, 3 Reducer objects, 3 calls of map(), 3 calls to reduce()
    Incorrect.
  4. 1 Mapper object, 1 Reducer object, 1 call of map(), 3 calls to reduce()
    Correct!

GFS/HDFS
Which of the following scenarios fits the GFS design assumptions?

Let's assume we have a Hadoop cluster with 12 Petabytes of disk space and replication factor 4. What can you say about the maximum possible file size?
  1. The maximum size of a file is restricted to the disk size of the largest DataNode.
    Incorrect.
  2. The maximum size of a file is restricted by the physical disk space available on the NameNode.
    Incorrect.
  3. The maximum size of a file cannot exceed 3 Petabytes.
    Correct!
  4. Files of any size can be processed in the cluster.
    Incorrect.
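
A quick check of the arithmetic behind the correct answer: with replication factor 4, every block is stored four times, so (ignoring non-HDFS overhead) the maximum file size is approximately $\frac{12\ \text{Petabytes}}{4} = 3\ \text{Petabytes}$.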

When adding $n$ GB of disk space to a DataNode in a Hadoop cluster, how many GB of additional space become available to the HDFS volume? Assume that replication has been switched off.
  1. Almost $n$ GB.
    Correct!
  2. Approximately $n^2$ GB.
    Incorrect.
  3. About $0.25n$ GB.
    Incorrect.
  4. No additional space is gained.
    Incorrect.

When adding $n$ GB of disk space to a NameNode in a Hadoop cluster, how many GB of additional space become available to the HDFS volume? Assume that replication has been switched off.
  1. Almost $n$ GB.
    Incorrect.
  2. Approximately $n^2$ GB.
    Incorrect.
  3. About $0.25n$ GB.
    Incorrect.
  4. No additional space is gained.
    Correct! This assumes that the NameNode does not double as DataNode!

Which of the following statements is correct about Heartbeat messages in a Hadoop cluster?
  1. A heartbeat message is sent every 5 to 30 seconds by every active DataNode.
    Correct!
  2. A heartbeat message is sent at most once a day by each active DataNode.
    Incorrect.
  3. A group of DataNodes together sends a single heartbeat message to save network bandwidth.
    Incorrect.
  4. Heartbeat messages are not sent by DataNodes; they are only sent by the NameNode, once every 5 seconds.
    Incorrect.

For which of the following operations is NO communication with the NameNode required?
  1. A client writing a file to HDFS.
    Incorrect.
  2. A client requesting the filename of a given block of data.
    Incorrect.
  3. A client reading a block of data from the cluster.
    Correct!
  4. A client reading a file from the cluster.
    Incorrect.

The operation log (called the edit log in Hadoop) is essential to ensure that
  1. the HeartBeat system can recover after a crash.
    Incorrect.
  2. the NameNode can recover after a crash.
    Correct!
  3. the DataNodes can recover after a crash.
    Incorrect.
  4. the data on a cluster can easily be replicated to another location.
    Incorrect.

Which of the following components reside on a NameNode?
  1. filenames, blocks and checksums
    Incorrect.
  2. blocks and heartbeat messages
    Incorrect.
  3. filenames, block locations
    Correct!
  4. blocks and block locations
    Incorrect.

What is the purpose of the setup/cleanup methods in a Hadoop job?
  1. To enable combiners to initialize a global resource.
    Incorrect.
  2. To enable both mappers and reducers to initialize a global resource.
    Correct!
  3. To enable mappers and reducers to execute some code before every map()/reduce() call.
    Incorrect.
  4. To configure a Hadoop job.
    Incorrect.
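
A minimal sketch of the idea (illustrative names; the same setup()/cleanup() hooks exist on Mapper as well): setup() runs once per task before the first reduce() call and cleanup() once after the last one, which makes them the right place to create and release a per-task resource.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StopwordFilterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Set<String> stopwords;   // the "global" (per-task) resource

    @Override
    protected void setup(Context context) {
        // runs once per task, e.g. to load a stopword list (hard-coded here for brevity)
        stopwords = new HashSet<>();
        stopwords.add("the");
        stopwords.add("a");
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        if (stopwords.contains(key.toString())) {
            return;                  // skip stopwords entirely
        }
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        stopwords = null;            // runs once per task, after the last reduce() call
    }
}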

For which of the following scenarios can we employ Hadoop's user-defined Counters?
  1. To count the number of users currently using the cluster.
    Incorrect.
  2. To count the number of keys appearing in reducers with more than 10 values "attached" to them.
    Correct!
  3. To count the number of heartbeat messages sent by the user's map tasks.
    Incorrect.
  4. To count the number of machines in the cluster processing the user's job.
    Incorrect.
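
A sketch of how the correct option could be implemented (counter and class names are illustrative); the framework aggregates the counter across all reduce tasks, so the final value is cluster-wide:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HeavyKeyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    enum MyCounters { KEYS_WITH_MORE_THAN_10_VALUES }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0, numValues = 0;
        for (IntWritable v : values) {
            sum += v.get();
            numValues++;
        }
        if (numValues > 10) {
            context.getCounter(MyCounters.KEYS_WITH_MORE_THAN_10_VALUES).increment(1);
        }
        context.write(key, new IntWritable(sum));
    }
}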

For which of the following scenarios can we employ Hadoop's user-defined Counters?
  1. To log the job status (block usage, network usage, etc.) in real-time.
    Incorrect.
  2. To set global variables that can be read by each mapper/reducer currently active.
    Incorrect.
  3. To count the number of unique keys appearing across all reducers for a given job.
    Correct!
  4. To enable direct communication between map tasks running on different machines.
    Incorrect.

Bob has a Hadoop cluster with 20 machines, set up as follows: replication factor 2, 128MB input split size. Each machine has 500GB of HDFS disk space. The cluster is currently empty (no jobs, no data). Bob intends to upload 4 Terabytes of plain text (in 4 files of approximately 1 Terabyte each), followed by running Hadoop’s standard WordCount job. What is going to happen?
  1. The data upload fails at the first file: it is too large to fit onto a DataNode.
    Incorrect.
  2. The data upload fails at a later stage: the disks are full.
    Incorrect.
  3. WordCount fails: too many input splits to process.
    Incorrect.
  4. WordCount runs successfully.
    Correct!
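
A rough capacity check (ignoring non-HDFS overhead): the cluster offers $20 \times 500\,\text{GB} = 10\,\text{TB}$ of raw disk, and the upload occupies $4\,\text{TB} \times 2 = 8\,\text{TB}$ with replication factor 2, so everything fits. A single 1 Terabyte file is no problem either, since it is stored as 128MB blocks spread across the DataNodes. WordCount then processes roughly $4\,\text{TB} / 128\,\text{MB} \approx 32{,}768$ input splits, which Hadoop handles without issue.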

Bob has a Hadoop cluster with 20 machines under the default setup (replication factor 3, 128MB input split size). Each machine has 500GB of HDFS disk space. The cluster is currently empty (no jobs, no data). Bob intends to upload 5 Terabytes of plain text (in 10 files of approximately 500GB each), followed by running Hadoop’s standard WordCount job. What is going to happen?
  1. The data upload fails at the first file: it is too large to fit onto a DataNode.
    Incorrect.
  2. The data upload fails at a later stage: the disks are full.
    Correct!
  3. WordCount fails: too many input splits to process.
    Incorrect.
  4. WordCount runs successfully.
    Incorrect.
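
The same capacity check for this setup: the raw disk space is still $20 \times 500\,\text{GB} = 10\,\text{TB}$, but with replication factor 3 the upload would require $5\,\text{TB} \times 3 = 15\,\text{TB}$, so the disks fill up partway through the upload (after roughly two thirds of the data) and WordCount never runs on the full dataset.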

In Hadoop, the optimal input split size is the same as the
  1. block size
    Correct!
  2. average file size in the cluster
    Incorrect.
  3. minimum hard disk size in the cluster
    Incorrect.
  4. number of DataNodes
    Incorrect.

In Hadoop, a Counter is attached to a specific ...
  1. Mapper
    Incorrect.
  2. Reducer
    Incorrect.
  3. machine in the cluster
    Incorrect.
  4. job
    Correct!

Which of the following operations require communication with the NameNode?

You have a search engine's query log of the form:
[userid],[timestamp],[query],[url],[click],[dwell-time]
where the userid is the IP address of the user and the timestamp is the time at which the action took place (in epoch). The user either submits a query to the search engine (in which case query contains the query string submitted, otherwise this value is empty) or views a result list and clicks on a URL (stored in click, otherwise this value is empty). If the user clicked a URL, the log also shows the amount of time in seconds the user spent on the URL (the dwell-time). A concrete toy example of a query log is the following:
312121,1417789177,britney spears,,,,
312121,1417789245,britnay spears,,,,
324325,1417712245,tom jones,,,,
312121,1417789247,,http://en.wikipedia.org/wiki/Britney_Spears,25
324325,1417712111,tom jones singer,,,,
324325,1417712121,,http://en.wikipedia.org/wiki/Tom_Jones_%28singer%29,1
324325,1417712240,,http://www.tomjones.com/,987
Here, user 312121 searches twice for Britney Spears (different spellings) and later clicks on the Wikipedia link and spends 25 seconds on it. The user 324325 searches for Tom Jones and visits Wikipedia for 1 second and then spends a lot of time on the official Tom Jones web site. The log is neither sorted by userid nor by time. We want to know the number of unique queries submitted. To do this, we need to write a Hadoop job that contains:
  1. only a Mapper
    Incorrect.
  2. only a Mapper and a Reducer
    Correct!
  3. only a Mapper and a Counter
    Incorrect.
  4. only a Reducer and a Counter
    Incorrect.
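
One way to realise this with just a Mapper and a Reducer (a sketch; field positions follow the toy log above, class names are illustrative). The number of reducer output records, summed over all reducers, is the number of unique queries:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueQueries {

    // Emit the query string as key for every log line that actually contains a query.
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", -1);
            if (fields.length > 2 && !fields[2].isEmpty()) {   // field 2 = query
                context.write(new Text(fields[2]), NullWritable.get());
            }
        }
    }

    // Each distinct query reaches exactly one reduce() call; emit it once.
    public static class UniqueReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}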

Consider the search engine log again. We want to learn how much time users spend on average on clicked URLs. We use a resolution of one minute (i.e. all URLs viewed between 1 and 60 seconds by a user are counted together, all URLs viewed between 61 and 120 seconds are counted together, etc.). Dwell times above 24 hours are counted together (as data noise). A Hadoop job is written: the mapper outputs the key/value pair (*,[dwell-time]) for each query log line that contains a click (the value is the actual dwell time). The reducer uses local aggregation:
setup():
    H = associative_array;
reduce(key k, values [v1, v2, ...]):
    foreach value v in values:
        H{v} = H{v} + 1;
cleanup():
    foreach value v in H:
        EmitIntermediate(v, count H{v});
What happens if this Hadoop job is started with a query log containing 10 billion lines?
  1. The Hadoop job crashes and reports an out-of-memory-error.
    Incorrect.
  2. The Hadoop job runs without an error and outputs the expected results.
    Correct!
  3. The Hadoop job does not compile - hashmaps cannot be used in the Reducer.
    Incorrect.
  4. The Hadoop job runs without an error but outputs nothing.
    Incorrect.

Consider the search engine log again. We now want to know for how many queries users did not find what they were looking for. A user did not find what he was looking for if, after submitting a query (logged in our query log), his next action was to submit another query. If the action after submitting a query is a click, we assume the user found what he was looking for. Which of the introduced design patterns offers the best way to implement a program that answers this question?
  1. Local aggregation
    Incorrect.
  2. Pairs and stripes
    Incorrect.
  3. Order inversion
    Incorrect.
  4. Secondary sorting
    Correct!
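
The crux of secondary sorting here is a composite key, so that within each user's group the log lines arrive at reduce() already ordered by time; a rough sketch under that assumption (class and driver names are illustrative, and a complete job would also supply the partitioner, grouping comparator, hashCode()/equals(), and the reduce logic):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key (userid, timestamp): Hadoop sorts on the full key, while a custom
// partitioner and grouping comparator (not shown) look only at the userid, so one
// reduce() call sees all of a user's actions in time order and can check whether
// each query is followed by another query (not found) or by a click (found).
public class UserTimeKey implements WritableComparable<UserTimeKey> {
    public Text userId = new Text();
    public LongWritable timestamp = new LongWritable();

    @Override
    public void write(DataOutput out) throws IOException {
        userId.write(out);
        timestamp.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId.readFields(in);
        timestamp.readFields(in);
    }

    @Override
    public int compareTo(UserTimeKey other) {
        int c = userId.compareTo(other.userId);
        return c != 0 ? c : timestamp.compareTo(other.timestamp);
    }
}

// Driver settings (illustrative):
//   job.setPartitionerClass(UserIdPartitioner.class);               // partition on userid only
//   job.setGroupingComparatorClass(UserIdGroupingComparator.class); // group on userid only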

And again ... the search engine log. We now want to know how many queries overall led to clicks on Wikipedia pages. The simplest Hadoop job that answers this question consists of:
  1. Only a Mapper and a Counter
    Correct!
  2. Only a Mapper and a Reducer
    Incorrect.
  3. Only a Mapper, Reducer and a Counter
    Incorrect.
  4. Only a Reducer
    Incorrect.
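
A sketch of that simplest job (map-only, illustrative names; field positions follow the toy log lines above): the mapper increments a counter for every log line recording a click on a Wikipedia URL, and the counter value reported after the job is the answer.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: set job.setNumReduceTasks(0) in the driver; the framework
// aggregates the counter across all map tasks.
public class WikipediaClickMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    enum ClickCounters { WIKIPEDIA_CLICKS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        // a click line has an empty query field and the clicked URL in the next field
        if (fields.length > 3 && fields[2].isEmpty()
                && fields[3].contains("wikipedia.org")) {
            context.getCounter(ClickCounters.WIKIPEDIA_CLICKS).increment(1);
        }
    }
}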

For the last time ... the search engine log. We now also have a second file (the "document log"), containing two columns of data:
[url] [number of words]
For each URL on the Web we know the number of words contained in the Web page the URL refers to. For instance,
http://en.wikipedia.org/wiki/Tom_Jones_%28singer%29  4643
http://en.wikipedia.org/wiki/Britney_Spears 34221
http://www.tomjones.com/ 424
We want to derive a new data set (the "click log"), that only contains the log entries showing a click on a URL as follows:
[user id] [timestamp] [url] [number of words]
What should be the intermediate key space of the Mapper to make this join as efficient as possible?
  1. When processing the query log, the Mapper emits ([url],-1) as intermediate key. When processing the document log, the Mapper emits ([url],Integer.MAX) as intermediate key.
    Incorrect.
  2. When processing the query log, the Mapper emits [url] as intermediate key. When processing the document log, the Mapper emits ([url],[num. words]) as intermediate key.
    Incorrect.
  3. When processing the query log, the Mapper emits ([url],-1) as intermediate key. When processing the document log, the Mapper emits ([url],[num. words]) as intermediate key.
    Correct!
  4. When processing the query log, the Mapper emits [url] as intermediate key. When processing the document log, the Mapper emits [url] as intermediate key.
    Incorrect.

Counters can be used to:
  1. set cluster-wide values (e.g. the maximum found key value) in order to communicate between machines.
    Incorrect.
  2. monitor Hadoop jobs during their execution (e.g. the number of Java exceptions thrown).
    Correct!
  3. monitor individual machines during job execution.
    Incorrect.
  4. replace the Mapper.
    Incorrect.

Which of the following database operations implemented as Hadoop jobs require the use of a Mapper and a Reducer (instead of only a Mapper)?

The time it takes for a Hadoop job's Map task to finish mostly depends on:
  1. the placement of the blocks required for the Map task
    Correct!
  2. the duration of the job's shuffle & sort phase
    Incorrect.
  3. the placement of the NameNode in the cluster
    Incorrect.
  4. the duration of the job's Reduce task
    Incorrect.

Which of the following database operations – implemented as Hadoop jobs – require the use of a Mapper and a Reducer (instead of only a Mapper)? Assume that the dataset(s) to be used do not fit into the main memory of a single node in the cluster.

Which of the following are fallacies of distributed computing?