$\newcommand{\ones}{\mathbf 1}$ Back to HP More quizzes: Hadoop Graph Pig Streaming
All errors are my own! Though if you find any, an email would be appreciated ....
Be aware that some of these questions may not make a lot of sense outside of the taught course.
-Caudia Hauff

# Big data, streams and samples

Big data in general
What are two differences between large-scale computing and big data processing?

What is an often occurring phenomenon when comparing simple/complex algorithms on small/big data?
1. On small data simple algorithms work very well.
Incorrect.
2. On large data simple algorithms work very well.
Correct!
3. On small data complex algorithms fail.
Incorrect.
4. On large data complex algorithms perform much better than simple algorithms.
Incorrect.

What does it mean when an algorithm is said to 'scale well'?

Which of the following scenarios is a scenario where real-time processing is required?

Streams and samples
Which attribute is _not_ indicative for data streaming?
1. Limited amount of memory
Incorrect.
2. Limited amount of processing time
Incorrect.
3. Limited amount of input data
Correct!
4. Limited amount of processing power
Incorrect.

Which of the following statements about data streaming is true?
1. Stream data is always unstructured data.
Incorrect.
2. Stream data often has a high velocity.
Correct!
3. Stream elements cannot be stored on disk.
Incorrect.
4. Stream data is always structured data.
Incorrect.

Which of the following statements about sampling are correct?

What is the main difference between standard reservoir sampling and min-wise sampling?
1. Reservoir sampling makes use of randomly generated numbers whereas min-wise sampling does not.
Incorrect.
2. Min-wise sampling makes use of randomly generated numbers whereas reservoir sampling does not.
Incorrect.
3. Reservoir sampling requires a stream to be processed sequentially, whereas min-wise does not.
Correct!
4. For larger streams, reservoir sampling creates more accurate samples than min-wise sampling.
Incorrect.

Which of the following statements about the Misra-Gries algorithm is correct?
1. It never produces false negatives, i.e. all items that are frequent will be found by the algorithm.
Correct!
2. It never produces false positives, i.e. all items found by the algorithm will be frequent.
Incorrect.
3. It does multiple passes over the input-data.
Incorrect.
4. It produces an exact answer.
Incorrect.
5. When run with $k=3$, the output contains the 3 most frequent items.
Incorrect.

A Bloom filter guarantees no
1. false positives
Incorrect.
2. false negatives
Correct!
3. false positives and false negatives
Incorrect.
4. false positives or false negatives, depending on the Bloom filter type
Incorrect.

Which of the following statements about standard Bloom filters is correct?
1. It is possible to delete an element from a Bloom filter.
Incorrect.
2. A Bloom filter always returns the correct result.
Incorrect.
3. It is possible to alter the hash functions of a full Bloom filter to create more space.
Incorrect.
4. A Bloom filter always returns TRUE when testing for a previously added element.
Correct!

The "Twitter datastream" contains tuples of the form:
(messageID, message, userID_of_posting_user, in_reply_to_messageID, time_of_posting, language_of_message).

You can assume that $messageID$ and $userID$ are unique, i.e. every message has a unique identifier and every user has a unique identifier. If the message is not posted in reply to any other message, we have $in\_reply\_to\_messageID=null$. Examples of tuples in that stream are:
(124324234324, "@Nelly: I had breakfast just now!", 33523232, 122192225674, "23/11/2014", "English").
(435345332432, "Sitting in Paris, drinking a coffee", null, 122198435674, "24/11/2014", "English").

We want to answer queries by sampling roughly 1/10th of the data. What is a good sampling strategy to answer the following query: What is the fraction of messages written in English?
1. Sample $userID$s and include all messages by a user
Incorrect.
2. Sample $language\_of\_message$ and include all messages of a language
Incorrect.
3. Generate a random number $r$ between 0 and 9 and sample the tuple if $r==0$.
Correct!
4. Sample $messageIDs$ and include all messages with a particular $messageID$
Incorrect.

Consider the same data as in the previous question. What is a good sampling strategy to answer the following query: What fraction of users post only in English?
1. Sample $userID$s and include all messages by a user
Correct!
2. Sample $language\_of\_message$ and include all messages of a language
Incorrect.
3. Sample by the combined key $(userID,language\_of\_message)$
Incorrect.
4. Generate a random number $r$ between 0 and 9 and sample the tuple if $r==0$
Incorrect.

Consider the same data as in the previous two questions. What is a good sampling strategy to answer the following query: How many replies does a tweet have on average?
1. Sample $(userID, in\_reply\_to\_messageID)$
Incorrect.
2. Sample $userID$s and include all messages by a user
Incorrect.
3. Sample $in\_reply\_to\_messageID$s
Correct!
4. Generate a random number $r$ between 0 and 9 and sample the tuple if $r==0$
Incorrect.

The FM-sketch algorithm uses the number of zeros the binary hash value ends in to make an estimation. Which of the following statements is true about the hash tail?
1. Any specific bit pattern is equally suitable to be used as hash tail.
Correct!
2. Only bit patterns with more 0's than 1's are equally suitable to be used as hash tails.
Incorrect.
3. Only the bit patterns 0000000..00 (list of 0s) or 111111..11 (list of 1s) are suitable hash tails.
Incorrect.
4. Only the bit pattern 0000000..00 (list of 0s) is a suitable hash tail.
Incorrect.

The FM-sketch algorithm can be used to:
1. Estimate the number of distinct elements.
Correct!
2. Sample data with a time-sensitive window.
Incorrect.
3. Estimate the frequent elements.
Incorrect.
4. Determine whether an element has already occurred in previous stream data.
Incorrect.

The DGIM algorithm was developed to estimate the counts of 1's occur within the last $k$ bits of a stream window $N$. Which of the following statements is true about the estimate of the number of 0's based on DGIM?
1. The number of 0's cannot be estimated at all.
Incorrect.
2. The number of 0's can be estimated with a maximum guaranteed error.
Correct!
3. To estimate the number of 0s and 1s with a guaranteed maximum error, DGIM has to be employed twice, one creating buckets based on 1's, and once created buckets based on 0's.
Incorrect.

Which of the following statements about the standard DGIM algorithm are false?

What are DGIM’s maximum error boundaries?
1. DGIM always underestimates the true count; at most by 25%
Incorrect.
2. DGIM either underestimates or overestimates the true count; at most by 50%
Correct!
3. DGIM always overestimates the count; at most by 50%
Incorrect.
4. DGIM either underestimates or overestimates the true count; at most by 25%
Incorrect.

Which algorithm should be used to approximate the number of distinct elements in a data stream?
1. Misra-Gries
Incorrect.
2. Alon-Matias-Szegedy
Incorrect.
3. DGIM
Incorrect.
4. None of the above
Correct!

Which of the following statements about Bloom filters are correct?

Which of the following streaming windows show valid bucket representations according to the DGIM rules?
1. 1 0 1 1 1 0 1 0 1 1 1 1 0 1 0 1
Incorrect.
2. 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 1 0 0 1
Incorrect.
3. 1 1 1 1 0 0 1 1 1 0 1 0 1
Incorrect.
4. 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1
Correct!

For which of the following streams is the second-order moment $F_2$ greater than 45?

What is the space complexity of the FREQUENT algorithm? Recall that it aims to find all elements in a sequence whose frequency exceeds $\frac{1}{k}$ of the total count. In the equations below, $n$ is the maximum value of each key and $m$ is the maximum value of each counter.
1. $O(k(\log m + \log n))$
Correct!
2. $o(k(\log m + \log n))$
Incorrect.
3. $O(\log k(m+n))$
Incorrect.
4. $o(\log k(m+n))$
Incorrect.