This ticket is a "heads up" about a possible future request, and is not a direct request yet.
I'm working on developing some canonical user segments for desktop. The goal is to find segments that include/exclude sets of similar clients, so that we can reduce the impact of confounding variables when analysing data.
For example, we'll likely want to isolate "activated" users from non-activated users, for some definition of activated. By studying the retention of "activated" users only, we can remove the effects of bots and computer labs that create short-term profiles. Or we might want to study the properties of heavy users.
We would like these segments to be available on GUD, as well as in mozanalysis, and have example queries on DTMO to help people use the common set of segments in manual queries too.
The most eccentric part of this is that I would like the freedom to iterate on the segments, so that we can start using segments soon and tweak them as we learn what's useful. I imagine having version numbers in the names of segments until they're stable, and I imagine old versions being replaced by new versions (so presumably GUD would only need the most recent version or two, but the version number should be included in the segment name).
I am starting my search for segments by using features from clients_last_seen, and building heuristics that can be represented in a SQL SELECT expression. In the long term we might want to move past this and involve ML in deciding which clients fit which segments, but that feels a long way away and there's a lot of value we can unlock in the meantime.
Here are some example segments that give a flavour of what we might want to look at:
- Users who visited/didn't visit 5 uris on a day 7-13 days before submission_date
- Users who visited at least 213 URIs on submission_date
- Users who visited x URIs in period y before submission_date
- Users from Tier 1 countries
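
To make the "on a day 7-13 days before submission_date" idea concrete, here is a minimal Python sketch. It assumes per-client activity is available as an integer bit pattern in which bit i is set when the client met the criterion i days before submission_date (bit 0 being submission_date itself). That encoding and the function name are assumptions for illustration, not a description of the actual clients_last_seen schema.

```python
def active_in_window(bits: int, start_days_ago: int, end_days_ago: int) -> bool:
    """Return True if any bit in [start_days_ago, end_days_ago] is set.

    Assumed encoding (hypothetical): bit i set means the client met the
    criterion i days before submission_date; bit 0 is submission_date.
    """
    if start_days_ago < 0 or end_days_ago < start_days_ago:
        raise ValueError("expected 0 <= start_days_ago <= end_days_ago")
    width = end_days_ago - start_days_ago + 1
    # Build a mask with `width` ones, shifted up to the start of the window.
    mask = ((1 << width) - 1) << start_days_ago
    return (bits & mask) != 0


# A client that met the criterion today (bit 0) and 10 days ago (bit 10).
example_bits = (1 << 10) | 1
print(active_in_window(example_bits, 7, 13))  # True
```

The "didn't visit" half of the pair is simply the negation of the same check, so both segments can share one definition.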
For each segment, we want to be able to plot MAU, DAU, retention rates, etc - the full range of metrics.
Describe the solution you'd like
When I provide some segment definitions (e.g. as a PR to bigquery-etl), I would like the GUD front end to allow people to filter graphs to include or exclude users that fit a certain segment. Some segments will come in pairs ("included by the criteria"/"excluded by the criteria"). Others may have multiple levels (e.g. "low usage"/"medium usage"/"high usage"). Comparing included/excluded users will be a common use case.
It seems like some segments will fit under "Product / usage criteria", some might fit under "Country", and others might require their own dropdown?
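
As a rough sketch of what "included/excluded" comparisons over versioned segments could look like, the Python below stores each segment as a SQL boolean expression keyed by a versioned name and computes DAU for both sides of one segment. Everything specific here is an assumption for illustration: the toy table, the column names (uri_count, country), the segment names, and the placeholder country list. It runs against an in-memory SQLite database rather than BigQuery.

```python
import sqlite3

# Hypothetical segment definitions: versioned name -> SQL boolean expression.
# Column names and the country list are placeholders, not real definitions.
SEGMENTS = {
    "heavy_user_v1": "uri_count >= 213",
    "tier1_country_v1": "country IN ('US', 'CA', 'DE', 'FR', 'GB')",
}


def segment_select(table: str) -> str:
    """Build a SELECT that adds one 0/1 column per segment."""
    cols = ", ".join(f"({expr}) AS {name}" for name, expr in SEGMENTS.items())
    return f"SELECT client_id, submission_date, {cols} FROM {table}"


def dau_by_segment(conn, table: str, segment: str, submission_date: str) -> dict:
    """Count distinct clients inside (True) and outside (False) a segment."""
    if segment not in SEGMENTS:
        raise KeyError(f"unknown segment: {segment}")
    query = (
        f"SELECT {segment} AS in_segment, COUNT(DISTINCT client_id) AS dau "
        f"FROM ({segment_select(table)}) "
        "WHERE submission_date = ? GROUP BY in_segment"
    )
    rows = conn.execute(query, (submission_date,))
    return {bool(flag): dau for flag, dau in rows}


# Toy data standing in for a per-client daily table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_clients "
    "(client_id TEXT, submission_date TEXT, uri_count INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT INTO toy_clients VALUES (?, ?, ?, ?)",
    [
        ("a", "2020-01-01", 300, "US"),
        ("b", "2020-01-01", 10, "DE"),
        ("c", "2020-01-01", 250, "BR"),
        ("d", "2020-01-02", 500, "US"),
    ],
)

print(dau_by_segment(conn, "toy_clients", "heavy_user_v1", "2020-01-01"))
```

Keeping the version in the key (heavy_user_v1) means a tweaked definition can be added as heavy_user_v2 alongside the old one, and the old entry dropped once nothing depends on it.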
Describe alternatives you've considered
Still working out the main proposal, haven't got to the point of multiple alternatives yet!
Additional context
Proposal document where I guessed that implementing segments like "visited 5 uris on a day 7-13 days before submission_date" might take 1-2 weeks from the day I submit a PR to bigquery-etl, and I pointed out that "time estimations are plucked from my gut and involved no consultations"
Mentions
@hamilton, @jmccrosky, @klukas, and @openjck.