The "Twitter datastream" contains tuples of the form:
(messageID, message, userID_of_posting_user, in_reply_to_messageID, time_of_posting, language_of_message).
You can assume that $messageID$ and $userID$ are unique, i.e. every message has a unique identifier and every user has a unique identifier. If the message is not posted in reply to any other message, we have $in\_reply\_to\_messageID=null$.
Examples of tuples in that stream are:
(124324234324, "@Nelly: I had breakfast just now!", 33523232, 122192225674, "23/11/2014", "English").
(435345332432, "Sitting in Paris, drinking a coffee", 122198435674, null, "24/11/2014", "English").
We want to answer queries by sampling roughly 1/10th of the data. What is a good sampling strategy to answer the following query: What is the fraction of messages written in English?
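One possible strategy, sketched here under stated assumptions rather than as a definitive answer: since the query is about individual messages, it is enough to sample at the level of messages. The sketch below keeps a tuple when its messageID hashes into one of 10 buckets, then estimates the English fraction from the kept tuples. The names Tweet, in_sample and english_fraction are hypothetical, and the use of SHA-256 as the hash function is an assumption. An independent random choice per tuple with probability 1/10 would serve equally well for this query. Hashing only makes the sample reproducible.

```python
import hashlib
from typing import Iterable, NamedTuple, Optional


class Tweet(NamedTuple):
    """Hypothetical record type mirroring the tuple layout of the stream."""
    message_id: int
    message: str
    user_id: int
    in_reply_to_message_id: Optional[int]
    time_of_posting: str
    language: str


def in_sample(message_id: int, buckets: int = 10) -> bool:
    """Keep a tuple if its messageID hashes into bucket 0 of `buckets`.

    Assumption: SHA-256 spreads messageIDs roughly uniformly over the
    buckets, so about 1/buckets of all messages are kept.
    """
    digest = hashlib.sha256(str(message_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0


def english_fraction(stream: Iterable[Tweet]) -> Optional[float]:
    """Estimate the fraction of English messages from the sampled tuples.

    Returns None if no tuple was kept, since the estimate is undefined.
    """
    kept = 0
    english = 0
    for tweet in stream:
        if in_sample(tweet.message_id):
            kept += 1
            if tweet.language == "English":
                english += 1
    return english / kept if kept else None
```

Because every message is kept with the same probability regardless of its language, the fraction of English messages in the sample is an unbiased estimate of the fraction in the full stream.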