You have a search engine's query log of the form:
[userid],[timestamp],[query],[url],[click],[dwell-time]
where userid is the IP address of the user and timestamp is the time at which the action took place (in seconds since the Unix epoch).
Each log entry records one of two actions. Either the user submits a query to the search engine, in which case query contains the submitted query string (otherwise this field is empty), or the user views a result list and clicks on a URL, which is stored in click (otherwise this field is empty). If the user clicked a URL, the log also records the dwell-time, that is, the number of seconds the user spent on that URL.
A concrete toy example of a query log is the following:
312121,1417789177,britney spears,,,,
312121,1417789245,britnay spears,,,,
324325,1417712245,tom jones,,,,
312121,1417789247,,http://en.wikipedia.org/wiki/Britney_Spears,25
324325,1417712111,tom jones singer,,,,
324325,1417712121,,http://en.wikipedia.org/wiki/Tom_Jones_%28singer%29,1
324325,1417712240,,http://www.tomjones.com/,987
Here, user 312121 searches twice for Britney Spears (with different spellings) and later clicks on the Wikipedia link and spends 25 seconds on it. User 324325 searches for Tom Jones, visits Wikipedia for 1 second, and then spends a lot of time (987 seconds) on the official Tom Jones website. The log is sorted neither by userid nor by time.
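As an illustration of how such a log could be read, the following Python sketch parses the toy rows above and rebuilds each user's actions in time order, since the log itself is unsorted. It is not part of the exercise, and the names LogRecord, parse_line, and timeline_by_user are hypothetical. Because the toy rows do not all contain the same number of commas, the sketch assumes that the first four fields are userid, timestamp, query, and clicked URL, and that the dwell time is the last non-empty field after the URL. It also assumes that queries and URLs contain no commas.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional

# The toy log from the text, copied verbatim.
TOY_LOG = """\
312121,1417789177,britney spears,,,,
312121,1417789245,britnay spears,,,,
324325,1417712245,tom jones,,,,
312121,1417789247,,http://en.wikipedia.org/wiki/Britney_Spears,25
324325,1417712111,tom jones singer,,,,
324325,1417712121,,http://en.wikipedia.org/wiki/Tom_Jones_%28singer%29,1
324325,1417712240,,http://www.tomjones.com/,987
"""


class LogRecord(NamedTuple):
    userid: str
    timestamp: int
    query: Optional[str]       # None for click rows
    url: Optional[str]         # None for query rows
    dwell_time: Optional[int]  # seconds; None for query rows


def parse_line(line: str) -> LogRecord:
    """Parse one log row (assumed layout, see the lead-in)."""
    fields = line.strip().split(",")
    # Pad short rows so the first four positions always exist.
    fields += [""] * max(0, 5 - len(fields))
    userid, timestamp, query, url = fields[:4]
    # Assumption: the dwell time is the last non-empty field after the URL.
    trailing = [f for f in fields[4:] if f]
    dwell = int(trailing[-1]) if url and trailing else None
    return LogRecord(userid, int(timestamp), query or None, url or None, dwell)


def timeline_by_user(lines: Iterable[str]) -> Dict[str, List[LogRecord]]:
    """Group records by userid and sort each group by timestamp."""
    grouped: Dict[str, List[LogRecord]] = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            record = parse_line(line)
            grouped[record.userid].append(record)
    return {
        user: sorted(records, key=lambda r: r.timestamp)
        for user, records in grouped.items()
    }


if __name__ == "__main__":
    for user, records in timeline_by_user(TOY_LOG.splitlines()).items():
        for r in records:
            action = f"query {r.query!r}" if r.query else f"click {r.url} ({r.dwell_time}s)"
            print(user, r.timestamp, action)
```

On the toy log, sorting by timestamp places the query "tom jones singer" (1417712111) first among the actions of user 324325 and the query "tom jones" (1417712245) last.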
We want to know the number of unique queries submitted. To do this, we need to write a Hadoop job that contains: