DataEngineerChallenge - Solution
By Marvin Manananghaya
This is an interview challenge for PayPay. Please feel free to fork. Pull Requests will be ignored.
The challenge is to make analytical observations about the data using the distributed tools below.
Processing & Analytical goals:
- Sessionize the web log by IP. Sessionize = aggregate all page hits by visitor/IP during a session. https://en.wikipedia.org/wiki/Session_(web_analytics)
- Determine the average session time.
- Determine unique URL visits per session. To clarify, count a hit to a unique URL only once per session.
- Find the most engaged users, i.e. the IPs with the longest session times.
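
The four goals above can be illustrated with a small, single-machine sketch in plain Python. It is not the distributed solution, only a reference for the logic under stated assumptions: hits arrive as hypothetical `(ip, timestamp, url)` tuples, and a session ends after 15 minutes of inactivity. The challenge text does not fix that timeout, so `SESSION_TIMEOUT` and all function names here are illustrative choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: 15 minutes of inactivity closes a session (not specified by the challenge).
SESSION_TIMEOUT = timedelta(minutes=15)


def _summarize(ip, session_hits):
    """Reduce one session's (timestamp, url) hits to the metrics the goals ask for."""
    start, end = session_hits[0][0], session_hits[-1][0]
    return {
        "ip": ip,
        "start": start,
        "duration_s": (end - start).total_seconds(),
        # A set counts each URL once per session.
        "unique_urls": len({url for _, url in session_hits}),
    }


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) hits into per-IP sessions, split on inactivity gaps."""
    by_ip = defaultdict(list)
    for ip, ts, url in hits:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, ip_hits in by_ip.items():
        ip_hits.sort(key=lambda h: h[0])  # order each visitor's hits by time
        current = [ip_hits[0]]
        for hit in ip_hits[1:]:
            if hit[0] - current[-1][0] > timeout:
                # Gap exceeds the timeout: close the session and start a new one.
                sessions.append(_summarize(ip, current))
                current = [hit]
            else:
                current.append(hit)
        sessions.append(_summarize(ip, current))
    return sessions


def average_session_time(sessions):
    """Mean session duration in seconds (0.0 for an empty input)."""
    if not sessions:
        return 0.0
    return sum(s["duration_s"] for s in sessions) / len(sessions)


def most_engaged(sessions, n=10):
    """The n sessions with the longest duration, longest first."""
    return sorted(sessions, key=lambda s: s["duration_s"], reverse=True)[:n]


# Made-up sample data, for illustration only.
t0 = datetime(2015, 7, 22, 9, 0, 0)
sample_hits = [
    ("1.1.1.1", t0, "/a"),
    ("1.1.1.1", t0 + timedelta(minutes=5), "/b"),
    ("1.1.1.1", t0 + timedelta(minutes=6), "/a"),
    ("1.1.1.1", t0 + timedelta(minutes=40), "/c"),  # new session after a 34-minute gap
    ("2.2.2.2", t0, "/a"),
    ("2.2.2.2", t0 + timedelta(minutes=2), "/a"),
]
sample_sessions = sessionize(sample_hits)
```

Session duration is measured here as the time between the first and last hit, so a single-hit session has a duration of zero. Whether such sessions should count toward the average is a design decision this sketch leaves open.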