aggregate-reporting-api's People

Contributors

csharrison, dcthetall, samdutton


aggregate-reporting-api's Issues

"Enough identical reports" threshold could prevent DSPs from abiding by the law

As described in this post, the obligation to reach a certain number k of "identical reports" before they become accessible would quite dramatically impact reporting for advertisers and publishers, and it may even prevent DSPs from complying with the law.
In France (this may also apply to other countries), DSPs are legally required to disclose to their clients the comprehensive list of publishers on which they displayed ads (cf. here). This would not be possible using this API, since a significant share of publishers would be hidden from the DSPs, even with a low threshold.
How do you plan to tackle this use case?
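To make the concern concrete, here is a small sketch (all numbers hypothetical, assuming a Zipf-like long tail of impressions across publishers) of how many publishers fall below a "k identical reports" threshold and would therefore be invisible to the DSP:

```javascript
// Illustrative only: a Zipf-like long tail of impressions across
// publishers. Publishers whose report count falls below the threshold
// k never surface in the aggregate output.

function hiddenPublisherFraction(impressionsPerPublisher, k) {
  const hidden = impressionsPerPublisher.filter((n) => n < k).length;
  return hidden / impressionsPerPublisher.length;
}

// 10,000 publishers; the publisher at rank r receives ~1,000,000 / r impressions.
const impressions = Array.from({ length: 10000 }, (_, i) =>
  Math.floor(1000000 / (i + 1))
);

// Even a modest threshold hides the majority of the long tail.
console.log(hiddenPublisherFraction(impressions, 1)); // 0
console.log(hiddenPublisherFraction(impressions, 250)); // 0.6
```

Under these assumptions a threshold of 250 already hides 60% of publishers, which is exactly the "comprehensive list" problem described above.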

Undocumented proposals and statements

Hi @csharrison,

Many of the topics we discussed in the recent "WICG Conversion Measurement API" meetings are not documented in publicly available GitHub repositories.
It includes the app-to-web attribution proposal that you presented on Monday (2021-03-22) and the aggregation proposal that you presented the week before (2021-03-15). It also includes information you gave us when answering questions (willingness to support lift measurement, etc.).
While most of this information is accessible in the meeting minutes, I think that we would benefit a lot from having all this formalized in an actual GitHub proposal.

Do you intend to write something down to that effect?

Thank you very much!

Guidance on possible range of values for T

In evaluating this proposal, and providing feedback about how well it supports our various use-cases, it would be very helpful to have at least some guidance on the value of “T”.

I assume that Google is still doing research on this topic, and the value isn’t settled yet. As such, I don’t expect a precise answer. But a range would be helpful. Perhaps just a “90% confidence interval”, to be interpreted as “I, Charlie, estimate that the ultimate value of T will probably wind up lying between A and B with 90% confidence.”

I’m assuming there is a 100% probability that T will lie between 1 and 1,000,000.

I’m trying to understand if we should imagine T in the 100 to 1000 range, or if it’s more like the 10,000 to 100,000 range.

Questions about the billing auditing capabilities

Hello,

The auditing of billing by both parties (supply & demand) is a key consideration to take into account.

Based on this proposal, there is no way to have fully accurate reporting of the impressions served. From my understanding there are two sources of mismatch:

  • entries that never accumulate enough identical reports and are therefore not reported;
  • entries that expired before they were reported.

This raises a serious concern about billing. How can advertisers fairly compensate publishers without fully accurate reporting of impressions (and their associated cost)?

How can smaller publishers (for whom a bigger share of their served ads won't be reported) expect to be compensated fairly?

Integration with trust tokens?

To help reduce fraudulent aggregate reports, it seems like it would be useful to be able to require a trust token redemption in order to send an aggregate report for a document. This requirement would likely need to be expressed in the document response headers.

Temporal information in the reports

Hi @csharrison and thanks for this proposal.
I was wondering whether, in the current state of this proposal, any aggregated information about the "temporal aspect" of users' browsing history would be transmitted via the API (e.g. whether the user visited a certain website more often than average in the last month).
This is of particular importance for certain ML models that need information about the browsing history, obtained today via 3rd party cookies, to dynamically predict bids for each user.
Thanks in advance,
Luca

About Brand and Ad Safety

Hello,

Brand safety is paramount in advertising. Even when running campaigns for performance, you don't want to hurt your overall brand. This is true on both the publisher side and the advertiser side, and both are very sensitive to it:

  • Publisher Side - Ad safety: the ability for a publisher to enforce a policy on the type of ads that can be displayed on its properties. (E.g. NSFW policy, Advertising from the competition).
  • Advertiser Side - Brand safety: the ability to select publishing websites according to their policy. (E.g. only publish on widely considered "high quality" publishers).

Advertisers and/or publishers want strict control over this and the ability to audit it. Approximation is not an option here, since even a few instances can damage a brand. Aggregated reporting only works when there are enough identical reports. This seems to be a clear flaw in the reporting capabilities.

What could be done to ensure Brand Safety in the framework of this proposal?

Potential deanonymization attack

This is per the discussion on AdRoll/privacy#1, but is (hopefully) spelled out more clearly here.

A malicious actor opts to abuse the Aggregate Reporting API for deanonymization. They start by selecting a subset of users to track, for example by flat bidding on each user, with each user assigned a unique flat bid. For instance, they could select flat bids from a superincreasing knapsack of the form {1, 2, 4, 8, 16, ..., 2^(N-1)}, which permits tracking up to N distinct users, as these bids are effectively user IDs. Each user is frequency-capped to one impression only.

On impression delivery, the malicious actor calls:

const entryHandle = window.writeOnlyReport.get(hash('publisher.example'));
entryHandle.set('spend', userID);

Spend here is just an example; any summable quantity will work. With no noise added, the malicious actor can call the API and get back sum(spend) across all users. Because each user is effectively represented as a bit in an N-bit binary number, this sum perfectly identifies which users have been to publisher.example. (Note that a binary representation isn't necessary for the attack; any superincreasing knapsack will work.)

In theory, differential privacy adds noise to prevent any revelation that a particular user was included in the dataset. However, the malicious actor attempts to circumvent this. On impression delivery, the malicious actor does:

const entryHandle1 = window.writeOnlyReport.get(hash('publisher.example1'));
entryHandle1.set('spend', userID);
const entryHandle2 = window.writeOnlyReport.get(hash('publisher.example2'));
entryHandle2.set('spend', userID);
const entryHandle3 = window.writeOnlyReport.get(hash('publisher.example3'));
entryHandle3.set('spend', userID);
...
const entryHandleM = window.writeOnlyReport.get(hash('publisher.exampleM'));
entryHandleM.set('spend', userID);

The attacker then queries the API for each of these M reports and receives s1 = sum(spend1), s2 = sum(spend2), ..., sM = sum(spendM). They then compute int((s1 + s2 + ... + sM) / M), effectively canceling out the mean-zero noise added by differential privacy. Finally, they convert the resultant integer to binary and read out which users have been to publisher.example.

Note that limiting the number of queries by the malicious attacker is not sufficient: they could use multiple identities with different access policies to the reports, or collude with other malicious actors. An increase in the noise added by differential privacy can be countered by increasing M. Also, M doesn't need to be large enough to completely recover the least-significant bit; even revealing only the most-significant bit results in an epsilon of infinity.

Instead, it seems that the browser needs some mechanism to detect that an attack like this is occurring, which could be difficult given that the different hashes obfuscate the fact that a record for publisher.example is being repeated M times (and these hashes could correspond to a legitimate use case, for example tracking spend for each of creative, interest group, campaign, advertiser, publisher, etc., which are likely to appear as hashes on their own).
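The averaging step of the attack described above can be sketched as follows. All API names are hypothetical, and an approximately Gaussian, mean-zero noise merely stands in for whatever mean-zero mechanism a real aggregation service would use:

```javascript
// Small deterministic PRNG (numerical-recipes LCG) so the sketch is
// reproducible; not part of any real API.
function makeLcg(seed) {
  let state = seed >>> 0;
  return () => {
    state = (1664525 * state + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Approximate mean-zero Gaussian noise via the sum of 12 uniforms.
function noise(rand, scale) {
  let s = 0;
  for (let i = 0; i < 12; i++) s += rand();
  return (s - 6) * scale;
}

function simulateAttack(visitedBits, M, noiseScale, rand) {
  // Each tracked user's "spend" is a distinct power of two, so the
  // true sum is a bitmask of which users visited the publisher.
  const trueSum = visitedBits.reduce((acc, bit, i) => acc + bit * 2 ** i, 0);
  // The attacker records the same value under M hashed keys, then
  // averages the M noisy sums, canceling the mean-zero noise.
  let total = 0;
  for (let j = 0; j < M; j++) total += trueSum + noise(rand, noiseScale);
  const recovered = Math.round(total / M);
  return visitedBits.map((_, i) => (recovered >> i) & 1);
}

const visited = [1, 0, 1, 1, 0, 0, 1, 0];
const result = simulateAttack(visited, 100000, 10.0, makeLcg(42));
console.log(result.join('') === visited.join(''));
```

Even with per-report noise comparable to the signal itself (scale 10 against a true sum of 77), averaging over M = 100,000 hashed copies recovers the visit bitmask exactly, which is the point of the attack.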

Campaign Piloting Use Case

Hello,

Advertisers currently have very tight control over the budget they spend on the various online marketing channels they run simultaneously. Many use cases demand the ability to start/stop/increase/decrease spend on a campaign with low latency (seasonal sales, urgency management, campaign ramp-ups, ...). Most of the time the latency requirement is on the order of a day, but in some cases, such as seasonal sales, advertisers want even lower latency, on the order of an hour.

The aggregated reporting proposal mentions reports with delays of up to a day. Observing a trend would then mean reacting one order of magnitude above that: advertisers would pilot their campaigns weekly at best. This seems quite out of touch with current use cases.

Could a much more frequent aggregated reporting (hourly basis) be considered to enable a form of campaign piloting?

Examples of Aggregates

Thanks @csharrison & @michaelkleber for your comments last week at TPAC re cohort assembly. I know this is a passionate subject with lots of strongly held opinions. I appreciate your professionalism.

Is it possible to see a schema, or even better an example, of the "aggregate" file that you envision as input for cohort assembly? I need something concrete to look at and evaluate. ~thanks

User-viewable reports

This is promising.
There should be browser UI so users can see the reports and the aggregating services they were sent to, along with the ability to opt out of at least some or all of them.
The reports should obviously not be available to JS (though perhaps browser extensions with the appropriate permission could have access).

ML challenge inspired from the aggregate reporting API proposal

Hi everyone,

We are delighted to announce that we will be organising a challenge with adKDD inspired by the aggregate measurement API, tackling the optimisation use case.

Criteo will provide a dataset and some prize money, and let researchers and data scientists from around the world compete on learning performant bidding models from differentially private reports. Link to the challenge here.

We will be happy to work with the Chrome team to set appropriate parameters for the differential privacy function, so that it is as close as possible to what real-life operations could look like (e.g. a realistic epsilon level would be beneficial).

We hope to kickstart the challenge in early May, so if you are interested in solving the “optimisation” use case using the aggregate reporting API, please do participate!

Best,

Basile on behalf of the Criteo team

Third-party Reporting on Non-Targetable Verification Segments / Browser Interest Groups

Use Case
Independent verification & transparency are common and necessary requirements in digital advertising. According to the World Federation of Advertisers Sep 2020 Cross-Media Measurement Technical Blueprint, advertisers foundationally require the ability to report on basic_segments (link) which must be independently verified. Stated concretely, browsers should have the ability to report on some strict list of basic_segments (i.e. cohorts / interest groups) that were not necessarily targeted.

The current aggregate reporting API alludes to the ability for demographic slices to be reported on (as indicated in the snippet below).

const entryHandle = window.writeOnlyReport.get('campaign-123');

// Add any demographic slices you want or know in the current
// context.
entryHandle.set('country', 'usa');

One could imagine the following:

entryHandle.set('basic_segment', 'age' , 'gender');

...however, the way in which this use case would be achieved is not entirely clear, as basic_segments should be a non-targetable class of browser interest groups.

Proposal
Advertisers and/or publishers should have an option for enabling verification basic_segment reporting. Perhaps this could be accomplished via an extension to the .joinInterestGroup() API, thus allowing verification companies to execute JS that adds basic_segment metadata to the browser:

const verificationGroup = {'owner': 'https://first-verification-company.com',
                 'basic_segment': {'age' : '18-24', 'gender': 'F'},
                };
window.navigator.joinAdInterestGroup(verificationGroup, 30 * kSecsPerDay);

This should allow for demographic reporting similar to the following:

{
 'entryName': 'campaign-123',
 'country': 'usa',
 'age': '18-24',
 'gender': 'F',
 'visits': '1'
}

The generic nature of a basic_segment makes these attributes well-suited to differentially private aggregation. Once this data is collected server-side, some service can/should ensure that the necessary privacy thresholds are met before releasing any aggregates for reporting (as already indicated in the Aggregation Service proposal).

The aggregated data should be forwarded to some .well-known location of the third-party verification company, when basic_segment verification reporting is requested by the publisher and/or advertiser.

Please let me know if this is unclear in any way!

Clear privacy sandbox design rules

This question doesn't only relate to this API but also the other exposed APIs (such as the Conversion Measurement API), other reporting schemes (such as another API mentioned in #6 but never detailed further) and the entire TURTLEDOVE framework.

In this repository and others (such as here, for example), you lay out several reporting frameworks, even considering different gradations of the user-privacy target we are aiming to protect (differential privacy, local differential privacy, k-anonymity, etc.).

Could you specify the exact requirements you expect from the Privacy Sandbox? Would it also be possible to have at least an order of magnitude for each of the related variables (minimal cohort size, minimum number of identical reports, epsilon if we are to consider differential privacy, etc.)?

I understand that protecting user privacy requires the entire system to be attack-proof, and that you can't fix one variable without fixing the others. I also understand that these numbers are not set in stone and are up for discussion. However, having a rough idea is necessary for us to get a clear picture and propose relevant amendments. Taking an extreme example, a differentially private world would appear radically different to advertisers depending on whether the value of epsilon is 2 or 200.

Do you also intend to provide a POC we could play around with at some point? Do you have any ETA in mind?
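The epsilon-2-versus-200 point can be made concrete: for a standard Laplace mechanism applied to a count of sensitivity 1, the noise scale is 1/epsilon, so the two values imply noise differing by two orders of magnitude. The function names below are ours, not the API's:

```javascript
// Laplace mechanism: noise scale b = sensitivity / epsilon.
function laplaceNoiseScale(sensitivity, epsilon) {
  return sensitivity / epsilon;
}

// Sample Laplace(0, b) by inverse-transform from a uniform u in (0, 1).
function sampleLaplace(b, u) {
  return u < 0.5 ? b * Math.log(2 * u) : -b * Math.log(2 * (1 - u));
}

console.log(laplaceNoiseScale(1, 2)); // 0.5
console.log(laplaceNoiseScale(1, 200)); // 0.005
```

With epsilon = 2, a single-user count of 1 is buried under noise of comparable magnitude; with epsilon = 200 the noise is negligible, which is why the order of magnitude matters so much for evaluating the proposal.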

Rapid Aggregation Proposal

Use Case

Accurate campaign pacing is a common concern for the Aggregate Reporting API. As stated during the W3C web-adv working group, we've observed that a 24-hour delay in feedback can result in a decline of ~10-11% in spend on publishers in our experimental group, and that is with overspend safety mechanisms in place, which do not appear readily available for implementation under the Aggregate Reporting API proposal.

Perhaps this delay of 24 hours can be adjusted to aid in this, but this is a complementary proposal that may also help.

Proposal

The fundamental issue is maintaining some level of differential privacy before reporting. For more sophisticated reporting, we of course want to be able to do things such as:

const entryHandle = window.writeOnlyReport.get('campaign-123');
entryHandle.set('country', 'usa');
entryHandle.append('visits', '1');
// ...
// any number of other dimensions of interest

But most of these data are not particularly relevant for pacing considerations. I propose a more limited API call:

const entryHandle = window.writeOnlyRapidReport.get('campaign-123');
// only cost/spend entries permitted

From here, the report could be subjected to a much, much shorter delay, as differential privacy is likely to be achieved that much sooner. Additionally, because the data aggregates more quickly, we could use less noise/quantization in the reporting to help pace more accurately (or, conversely, if we wanted to increase the speed of reporting back, the noise could be increased). Perhaps it's worthwhile considering an explicit API function for incrementing cost values, to enforce that it is the only mechanism for writing into this rapid report.

Essentially, I'm arguing that there are two distinct purposes for reporting:

  • Collecting an array of information to provide insights into how often, where, etc. ads are displayed, so advertisers can make tuning adjustments.
  • Dynamically pacing campaigns to keep performance in check, avoid critical overspend failures, and also to observe underspend issues (which may be due to configuration issues).

Given these distinct use cases, I think it makes sense to have separate API calls that reflect them.
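The tradeoff behind the rapid-report idea can be sketched with hypothetical numbers: with a fixed noise scale, the *relative* error of a noisy spend total shrinks as spend aggregates, so a spend-only report that funnels a whole campaign into one bucket reaches a usable accuracy target much sooner than a many-dimensional report that splits the same spend across hundreds of buckets:

```javascript
// Relative error of a noisy total under additive noise of fixed scale.
function relativeError(totalSpend, noiseScale) {
  return noiseScale / totalSpend;
}

// Spend-only rapid report: the campaign's whole spend lands in one bucket.
console.log(relativeError(10000, 100)); // 0.01

// The same spend sliced across 100 dimension combinations leaves ~1/100th
// per bucket, so the per-bucket relative error is 100x worse.
console.log(relativeError(100, 100)); // 1
```

This is the quantitative sense in which a narrowly-scoped pacing report could tolerate shorter delays than a full insights report.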

Reporting: will it need to be based on differential privacy?

The need for a differentially private reporting scheme was mentioned several times during the W3C discussions and in some issues (e.g. by you in issue #3).

Can you please clarify if any reporting system will have to be based on differential privacy?

If the answer is yes, could you please share the level of differential privacy you have in mind? In other words, what order of magnitude of epsilon are you considering?

Ad Server Use-Cases: Ad Impression- and Viewability Measurement

Problem:
As an advertiser I want independent, transparent and accurate measurement of my advertising activity.
Consistent and accurate measurement of advertising activity is critical for the internet as a marketplace where value is exchanged between publishers and advertisers and an important factor in the growth of advertising spending.

Impression measurement is a basic requirement for all brands and advertisers, either for direct IO buys with publishers or executed via a DSP.
Delivery and measurement of ad impressions is core functionality of creative ad servers, which act as the single source of truth for reporting and procurement of brands' and advertisers' media activity.
In order to provide accurate measurement and transparency, ad servers are undertaking independent audits to have their measurement methodologies verified and accredited by the likes of MRC and IAB.

What is an ad impression?
Per MRC Desktop Display Impression Measurement Guidelines, an ad impression across all display marketing channels is the measurement of a response from an ad delivery system to an ad request from the user’s browser, which is filtered for invalid traffic and recorded at a point as late as possible in the process of delivering the creative material to the user’s browser. The ad must be loaded and at minimum begin to render in order to count as a valid impression. For simplicity we’re not considering mobile web, in-app, and video impression measurement guidelines here.

What is a viewable ad impression?
Per MRC Viewable Ad Impression Measurement Guidelines, a served ad can be classified as viewable if the ad was contained in the viewable space of the browser window, on an in-focus tab, based on pre-established criteria such as the percentage of ad pixels within the viewable space and the length of time the ad is in the viewable space of the browser (a display impression can be classified as viewable when 50% of the ad is in view for 1 second; for video it’s 50% for 2 seconds). It is recognized that an “opportunity to see” the ad exists with viewable impressions, which may not be the case with a served ad impression.

From the several browser proposals and explainers it is not clear if the use case of event level impression measurement will be supported in the new “privacy by design” web.

There seems to be consensus around cross-domain identifiers, attribution data (Click Through Conversion Measurement – Chrome, Private Click Measurement – Safari) and this Aggregate Reporting API for reach measurement, but the use case of event-level impression measurement using on-page JavaScript is not currently addressed.

Is the browsers’ intent to move impression and viewability measurement to browser (APIs) similar to IAB Tech Lab’s Open Measurement standard?
Such measurement would need to be adopted universally by all browsers in order to provide consistency to advertisers, brands, and publishers, and is fundamental to the ad-funded open web.

If impression measurement would be done by the browser instead of the ad server, this would end independent measurement and transparency which is critical in an ecosystem controlled by few.

Will the browsers (like the walled gardens) allow independent auditing of their APIs and measurement methodologies to provide the transparency and assurance the industry needs?
Will the browser control invalid traffic (IVT) filtration and adhere to the same standards and guidelines as set out by e.g. the MRC, or work with these bodies to (re)set them?

The main question here is whether the status quo of event-level impression measurement using on-page JS is (a) untouched or (b) a supported use case in the new privacy-by-design web.

Usable for non-ads use-cases?

Most of the examples (with the possible exception of the “widget”) begin with an ad impression.

Do you plan to constrain this API in some fashion so much that it must be tied to an “ad impression” or other type of “ad” use-case? Or will it be possible to write to this “write only data store” at any time, for any reason?
