pigin's Introduction

PIGIN: Private Interest Groups, Including Noise

This proposal has been withdrawn. Please see its successor, TURTLEDOVE.

Introduction

Some online advertising today is based on showing an ad to a person who has previously interacted with the advertiser. Today this works by the advertiser recognizing a specific person as they browse across web sites, a core privacy concern with today's web.

We propose an API in which the browser, not the advertiser, holds onto the information about what the advertiser thinks a person is interested in. Shifting this data to the browser instead of servers lets us offer many advantages: clear privacy properties, time limits, transparency into how interest groups are built and used, and granular or global controls over this type of ad targeting.

This type of ad targeting can be of value to people browsing the web, who often prefer ads for things they are interested in, and to advertisers, who wish to show their ads to people more likely to be interested in them. That added value is a significant part of why publishers earn much more money today when cookies are available for ad targeting (more details). Therefore an API that supports this practice and offers privacy guarantees is an important part of creating a web in which publishers can flourish without cross-site tracking.

This proposal will not support all related use cases — for example, we require an advertiser's interest group to be larger than some threshold size, to avoid "microtargeting" in which the API itself could become a tracking mechanism. Nevertheless it shows how an important web advertising use case can be made compatible with strong privacy protections for people's browsing behavior.

Motivating Use Case

An "interest group" is a collection of people whom an advertiser believes will be interested in seeing some type of ad. As people browse and interact with a web site, the site operator can add people to any number of interest groups. Advertisers then use those interest groups for ad targeting, especially to entice past visitors to return to the site. The advertising industry today uses a variety of terms to refer to variations on the idea of what we're calling an interest group, including "user list", "remarketing list", "custom audience", and "behavioral market segment".

As a concrete example, consider the retailer WeReallyLikeShoes.com, which sells shoes on the web. When a visitor first shows up at the web site, they will be associated with a "WeReallyLikeShoes-shopper" interest group. As they view different sections of the site, they will be associated with the "WeReallyLikeShoes-athletic-shoes" or "WeReallyLikeShoes-dress-shoes" interest groups. When they view particular shoes, they will be associated with the "WeReallyLikeShoes-shoe-00123-viewer" interest group. Finally, purchasing the shoe or leaving the website will associate them with either the "WeReallyLikeShoes-buyer" group or the "WeReallyLikeShoes-cart-abandoner" group.

Later, WeReallyLikeShoes.com wants to run an ad campaign for potential shoe buyers. They could use their interest groups for more precise targeting, such as offering a discount to members of their "cart-abandoner" group or advertising to their "athletic-shoes" group at the beginning of summer.

The goal of the API is to support this use case with the following outcomes:

  • People who like ads that remind them of sites they're interested in can keep seeing those sorts of ads.
  • People who don't can avoid seeing those sorts of ads.
  • People who wonder "how the ad knew" what they were interested in can get a clear, accurate answer.
  • People who wish can sever their association with the interest group, and can expect to stop seeing ads targeting the group.
  • Advertisers cannot learn the browsing habits of any specific people, even ones who have joined multiple interest groups.

All details of the UI would, of course, be up to the browser. The API provides the infrastructure that enables this ad campaign targeting capability while enforcing privacy rules that make transparency and control possible.

Design Elements

In this proposed API there is no server that keeps track of which people are in what interest groups. Instead, when an advertiser sees an interesting action, we let them ask the browser "Please join my WeReallyLikeShoes-athletic-shoes interest group for the next 30 days". Then on some future web pages with ads, the browser might choose to tell the ad server "By the way, I'm in the WeReallyLikeShoes-athletic-shoes interest group."

Interest group memberships remain private unless the browser chooses to disclose one. The browser has a way to be sure that a group membership is "anonymous enough" before choosing to disclose it. There is no way to ask the browser "Are you in my buyer group?" or "What are all the interest groups you're in?"

Browsers Joining Interest Groups

There is a straightforward JS API for an advertiser asking a browser to join a particular interest group for some amount of time.

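// Seconds in one day; the duration argument below is expressed in seconds.
const kSecsPerDay = 24 * 60 * 60;

// The group is owned by the calling origin and readable only by the listed ad networks.
const myGroup = {'owner' : 'www.wereallylikeshoes.com',
                 'name' : 'athletic-shoes',
                 'readers' : ['first-ad-network.com',
                              'second-ad-network.com']
                };
// Ask the browser to join the group for the next 30 days.
window.navigator.joinPrivateInterestGroup(myGroup, 30 * kSecsPerDay);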

The API must be called from a window (top-level or iframe) whose origin matches the owner. This could be on WeReallyLikeShoes.com, or could be a cross-domain iframe — maybe RunningShoeReviews.com writes articles about shoes sold by WeReallyLikeShoes.com, and the review site has an agreement which lets the retailer add people to an interest group with 'name' : 'reads-reviews'. It should also be possible for a site owner to include a cross-domain iframe without giving it this capability.
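The origin rule above could be enforced by a check along the following lines. This is only a sketch: the function name `callerMayJoin` and the `embedderAllowsJoin` flag are hypothetical, and it assumes that "origin matches the owner" means the calling frame's host name equals the group's `owner` string, which the proposal does not spell out.

```javascript
// Hypothetical browser-side check before honoring a join request.
// Assumption: the owner is compared against the calling frame's host name.
function callerMayJoin(callerOrigin, group, embedderAllowsJoin = true) {
  // A top-level site can withhold the capability from an embedded iframe.
  if (!embedderAllowsJoin) return false;
  return new URL(callerOrigin).hostname === group.owner;
}

const reviewsGroup = {
  owner: 'www.wereallylikeshoes.com',
  name: 'reads-reviews',
  readers: ['first-ad-network.com'],
};

// An iframe served by the retailer may add people to the retailer's group...
const fromRetailerFrame = callerMayJoin('https://www.wereallylikeshoes.com', reviewsGroup);
// ...but script running in the review site's own origin may not.
const fromReviewSite = callerMayJoin('https://runningshoereviews.com', reviewsGroup);
```

How a site owner would actually express "this iframe may not join groups" is left open by the proposal; the boolean flag here just marks where that decision would enter.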

The browser will only consider revealing group membership on requests to reader domains. This is meant to protect the owner's business interests, and is an extra limit on top of whatever the browser imposes to protect privacy. (Perhaps we should add support for pass-through domains: a way for first-ad-network.com to indicate that it buys ad-displaying opportunities from some-other-ad-platform.com, including a public encryption key so that the browser has a way to pass the group membership information through untrusted channels.)

If an interest group owner needs to know multiple pieces of information to decide whether they'd like a person to join their interest group, then the owner is responsible for tracking all the membership conditions on their own (on their server or in browser storage).
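As an illustration of owner-side bookkeeping, the sketch below derives group names from state that the site tracks for itself. The shape of the `visit` object and the function name `groupNamesFor` are assumptions made for this example; the group names follow the WeReallyLikeShoes example earlier in this document.

```javascript
// Owner-side decision logic; the browser API never sees these conditions.
// `visit` is a hypothetical record the site keeps on its server or in
// first-party storage.
function groupNamesFor(visit) {
  const names = ['shopper'];
  if (visit.viewedSections.includes('athletic')) names.push('athletic-shoes');
  if (visit.viewedSections.includes('dress')) names.push('dress-shoes');
  // Two conditions combined: something was put in the cart, nothing was bought.
  if (visit.cartItems > 0 && !visit.purchased) names.push('cart-abandoner');
  if (visit.purchased) names.push('buyer');
  return names;
}

const abandoned = groupNamesFor({
  viewedSections: ['athletic'],
  cartItems: 2,
  purchased: false,
});
```

Each resulting name would then be passed to the join call shown above, one group at a time.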

Browsers Disclosing Interest Groups — the tricky part

When making an HTTPS request to a domain that is one of the readers of any of the interest groups a browser has joined, the browser may choose to include information about one or more of those interest groups in an HTTP request header.

GET https://first-ad-network.com/serve_ad.html?width=300&height=250
Referer: https://somelocalnewspaper.com/big-story.html
Sec-CH-PIGIN: www.wereallylikeshoes.com:athletic-shoes
Sec-CH-PIGIN: www.rundontwalk.com:repeat-customer

Those interest group memberships can then be used by the ad network in picking which ad to show.
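On the ad network side, reading these headers might look like the sketch below. The proposal does not define a grammar for the header value, so this assumes the `owner:name` form shown in the example, with the owner a bare host name (no port) so that the first colon is the separator. The function name `parsePiginHeaders` is hypothetical.

```javascript
// Parse interest-group header values of the form "owner:name".
// Assumption: the first ':' separates the owner host from the group name.
function parsePiginHeaders(headerValues) {
  const groups = [];
  for (const raw of headerValues) {
    // Repeated headers are often folded into one comma-separated value.
    for (const item of raw.split(',')) {
      const value = item.trim();
      const sep = value.indexOf(':');
      // Skip entries with a missing owner or a missing name.
      if (sep <= 0 || sep === value.length - 1) continue;
      groups.push({ owner: value.slice(0, sep), name: value.slice(sep + 1) });
    }
  }
  return groups;
}

const disclosed = parsePiginHeaders([
  'www.wereallylikeshoes.com:athletic-shoes',
  'www.rundontwalk.com:repeat-customer',
]);
```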

The browser's key responsibility is figuring out what set of group memberships it can disclose while preventing tracking and respecting privacy. Its secondary goal should be to send the "most valuable" interest group information that meets its privacy threshold.

At a minimum, the set of interest groups that the browser chooses to disclose should be the same for many different people who might visit a website — that is, a k-anonymity threshold. This is necessary so that nobody can use a person's disclosed interest groups as a way to recognize them and track their browsing behavior across sites. Research in differential privacy offers a variety of guarantees to consider that are stronger than simple k-anonymity.
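To make the k-anonymity condition concrete, the sketch below states it as a check over a whole population of disclosed sets. It is purely definitional: it assumes one party can see everyone's disclosures, which is exactly what a real design must avoid. The function names are hypothetical.

```javascript
// Canonical key for a disclosed set, independent of the order of its groups.
function disclosureKey(groups) {
  return groups.map((g) => `${g.owner}:${g.name}`).sort().join('|');
}

// True if every distinct disclosed set is shared by at least k people.
function isKAnonymous(disclosuresByPerson, k) {
  const counts = new Map();
  for (const groups of disclosuresByPerson) {
    const key = disclosureKey(groups);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.values()].every((n) => n >= k);
}

const shoes = { owner: 'www.wereallylikeshoes.com', name: 'athletic-shoes' };
const run = { owner: 'www.rundontwalk.com', name: 'repeat-customer' };

// Two people disclose the same pair; a third discloses a set nobody shares.
const population = [[shoes, run], [run, shoes], [shoes]];
const safeAtTwo = isKAnonymous(population, 2);
```

Note that the condition applies to the whole disclosed set, not to each group separately: two individually popular groups can still form a rare combination.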

In any case, ensuring that a collection of interest groups is sufficiently private is likely to involve communication with some privacy infrastructure servers. The browser's communication with them should again protect privacy and not enable tracking — preferably, even by the operators of the privacy infrastructure. (For example, a browser sending a complete list of interest groups to a "private subset picking service" could leak someone's identity and full browsing history to the service.)

Picking or approximating the "most valuable" sufficiently-private set of interest groups is a key challenge. This topic should be the subject of discussions between browsers and the advertising industry — an industry with a wealth of experience at picking the most valuable ad.

Understanding which interest groups are the most valuable might involve periodic out-of-band interactions between the browser and additional servers run by various ad tech companies. Any such communication with them must again be designed to protect privacy.

Note that privacy should be preserved even if all the reader domains that co-appear on a site collaborate with each other.

Naive list-picking example

For example, a naive implementation could consist of a daily round of decision-making:

  • Contacting a server run by each reader domain to learn the value of each interest group, using Private Information Retrieval techniques to avoid the need to disclose any list memberships to the reader domain.
  • Sorting interest groups by highest declared value.
  • Querying a server, using Threshold Cryptography, to learn whether at least 1000 other people have the same most-valuable interest group. If not, discard that group and repeat.
  • If so, accept the top interest group, and consider whether it is OK to send the next most valuable group as well. Query the Threshold Crypto server again, this time asking about the top two groups together.
  • Repeat until the browser has chosen a maximum of 5 interest groups or checked all groups.
  • Ad requests during the following day get the subset of the chosen interest groups for which the ad network is a permitted reader domain.
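The steps above can be sketched as a greedy loop. Here `isAnonymousEnough` is a hypothetical callback standing in for the Threshold Cryptography query, and each group is assumed to already carry the `value` learned from the reader domains. One detail is an assumption rather than something the steps state: when a group fails the check together with the groups already accepted, this sketch discards it and moves on to the next candidate.

```javascript
// Greedy daily selection, following the naive steps listed above.
// `isAnonymousEnough(groups)` answers "do enough other people share this
// whole set?"; a real query would be a privacy-preserving network protocol.
function pickGroupsForDay(groups, isAnonymousEnough, maxGroups = 5) {
  // Sort by declared value, highest first, without mutating the input.
  const byValue = [...groups].sort((a, b) => b.value - a.value);
  const chosen = [];
  for (const group of byValue) {
    if (chosen.length >= maxGroups) break;
    // Ask about the already-accepted groups plus this candidate together.
    if (isAnonymousEnough([...chosen, group])) chosen.push(group);
  }
  return chosen;
}

// Per-request filter: only groups that list the ad network as a reader.
function groupsForReader(chosen, readerDomain) {
  return chosen.filter((g) => g.readers.includes(readerDomain));
}

// Toy data and a toy oracle that rejects any set containing group 'z'.
const joined = [
  { owner: 'a.example', name: 'x', value: 3, readers: ['r1.example'] },
  { owner: 'b.example', name: 'y', value: 5, readers: ['r1.example', 'r2.example'] },
  { owner: 'c.example', name: 'z', value: 4, readers: ['r2.example'] },
];
const todaysGroups = pickGroupsForDay(joined, (set) => !set.some((g) => g.name === 'z'));
const sentToR2 = groupsForReader(todaysGroups, 'r2.example');
```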

This can be improved on in many ways. Simple k-anonymity is an exploitable notion of privacy, and the value of an interest group to a reader domain may vary depending on what web site the browser is visiting. An implementation in which the browser may send a different set of interest groups while visiting different sites could improve both privacy and advertising value. The naive approach addresses the threat of reader domain collaboration by using a single global ordering of interest group value, rather than permitting a different value for different readers; the drawback is that a reader might declare false values to harm other readers.

User Interface controls

This API means browsers can offer people a UI that provides insight into what interest groups they are in and how they got there, as well as control over both past and future group memberships. Some ideas for controls that a browser might choose to offer include:

  • "What interest groups am I in?" — Show the owners, names, and expiration dates of interest groups. Perhaps browsers should also require interest group owners to give a URL of a sample image from an ad campaign targeting any group.
  • "How did I get on this list?" — The display of any interest group should include the advertiser's domain name, and also the day and domain of the top-level page the browser was visiting when it joined that group.
  • "Remove me from this list" — Tell the browser to leave any or all interest groups.
  • "Block this advertiser" or "Block this site" — The browser could leave all interest groups owned by WeReallyLikeShoes.com or all interest groups joined while on RunningShoeReviews.com, and disallow such list additions in the future.
  • "Disable Private Interest Groups" — This control is analogous to disabling 3rd-party cookies today, but without the disadvantage of breaking other unrelated use cases.
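The controls above suggest what a browser would need to remember about each membership. The sketch below shows one possible record and the operations behind three of the controls; every field and function name is hypothetical, and the real storage format would be up to the browser.

```javascript
// Hypothetical per-membership record kept by the browser.
function makeMembership(owner, name, joinedOnSite, joinedAtMs, durationSecs) {
  return {
    owner,
    name,
    joinedOnSite, // top-level site being visited when the group was joined
    joinedAtMs,
    expiresAtMs: joinedAtMs + durationSecs * 1000,
  };
}

// "What interest groups am I in?": unexpired memberships only.
function activeMemberships(memberships, nowMs) {
  return memberships.filter((m) => m.expiresAtMs > nowMs);
}

// "Block this advertiser": leave all groups owned by one domain.
function leaveByOwner(memberships, owner) {
  return memberships.filter((m) => m.owner !== owner);
}

// "Block this site": leave all groups joined while on one top-level site.
function leaveBySite(memberships, site) {
  return memberships.filter((m) => m.joinedOnSite !== site);
}

const DAY_SECS = 24 * 60 * 60;
const memberships = [
  makeMembership('www.wereallylikeshoes.com', 'athletic-shoes', 'wereallylikeshoes.com', 0, 30 * DAY_SECS),
  makeMembership('www.wereallylikeshoes.com', 'reads-reviews', 'runningshoereviews.com', 0, 1 * DAY_SECS),
  makeMembership('www.rundontwalk.com', 'repeat-customer', 'rundontwalk.com', 0, 30 * DAY_SECS),
];
```

Blocking future additions would additionally need a persistent deny list consulted at join time, which this sketch leaves out.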

Privacy and Security Considerations

This API allows one site very limited information about a visitor's off-site interests.

Interest group membership indicating a sensitive category

If a person visits a web site and that site learns/decides that they are in some sensitive category, the site could build an interest group of sensitive category members, and could perhaps make that information available even when the person is visiting a different site.

This is partly mitigated by the k-anonymity (or stronger) requirements on what interest groups the browser reveals, but some sensitive categories may be large. It is also partly mitigated by only revealing the "high-value" group memberships, but a well-resourced malicious site could genuinely run a high-priced ad campaign to encourage the browser to disclose this sensitive-category signal.

Differential privacy properties for the revealed interest groups could be chosen to offer some measure of plausible deniability — noise of the "false positive" variety, in addition to the false negatives from only revealing some interest groups. However, note that this would cause some advertisers to incorrectly target their ads.
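One simple way to produce both kinds of noise is randomized response, sketched below. This is an illustrative technique chosen for this example, not something the proposal specifies, and on its own it does not establish any particular differential privacy guarantee for the full set of disclosures. It assumes a known list of `candidates` (groups the browser could plausibly claim), and the random source is injectable so the behavior can be checked deterministically.

```javascript
// Randomized response over a list of candidate group keys.
// For each candidate: with probability `truthProbability` report the real
// membership bit, otherwise report the outcome of a fair coin flip.
function noisyDisclosure(candidates, memberships, truthProbability, random = Math.random) {
  return candidates.filter((groupKey) => {
    const isMember = memberships.has(groupKey);
    if (random() < truthProbability) return isMember;
    return random() < 0.5; // may yield a false positive or a false negative
  });
}

const realMemberships = new Set(['www.wereallylikeshoes.com:athletic-shoes']);
const candidateGroups = [
  'www.wereallylikeshoes.com:athletic-shoes',
  'www.rundontwalk.com:repeat-customer',
];
const reported = noisyDisclosure(candidateGroups, realMemberships, 0.9);
```

A lower `truthProbability` gives more deniability and, as noted above, more mistargeted ads.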

Collection of a repeat visitor's profile over time

If a person visits a particular first-party site over a long period of time, then that site may have a stable ID for the person, e.g. a first-party cookie or a logged-in account. Interest groups are only revealed to reader domains, but the first party could also be in the ads business, or could collaborate with the readers who provide its ads to record the history of the person's interest groups revealed by the browser. Even if each day's ad requests offer appropriate privacy guarantees, a first-party site can build knowledge over time.

If the publisher, advertiser, and ad networks are all willing to collaborate, this seems technically difficult to prevent. Even if browsers invented a way to render certain ads on a site while making it impossible for a collaborating ad to tell the site which ad was shown, many server-to-server communication schemes could ultimately share the same information.

Tracking a person's browsing across sites

This is partly mitigated by the k-anonymity (or stronger) requirements on what interest groups the browser reveals. However, the "collection of a repeat visitor's profile over time" strategy will cut down on that anonymity. Two different sites that the same person visits frequently could use this to make guesses about their matching visitors.

This would be partly mitigated by the browser sending different choices of interest groups to different sites. The timeline for this attack would be extended by browsers rotating interest groups more slowly. But this cannot be solved completely: If a person visits two different sites often enough for long enough, this is just one of many signals which the sites could use to correlate behavior patterns and try to pick out matching visitors.

pigin's People

Contributors

michaelkleber

pigin's Issues

Details on withdrawal

PIGIN is far less complex and solves many more of the use-cases for online advertising than TURTLEDOVE.

Can you provide details on why PIGIN was withdrawn?

How would programmatic advertising work?

Hello, and sorry for jumping in out of nowhere... But the approach taken in PIGIN is something I've been thinking about too (shifting where "segments" are stored, from the DMP to the browser itself).
In fact, I think there are two "possible" levels (note: I didn't say "feasible" levels):

  • Store the segments in the browser, and send them to a third party, which does the matching server side.
  • Do the whole "matching" in the browser: the browser does not send interest groups to third parties; instead, the third parties tell the browser which interest groups they are interested in, and the browser responds with the matching interest groups.

But I always fail to figure out how this would apply to programmatic advertising. In the example shown in the documentation, there's a direct relationship between WeReallyLikeShoes and RunningShoeReviews.com. They have an agreement. And if both parties know each other, in many cases there's no need for cookies or ID matching at all.

But how does this work in a programmatic environment?
Scenario:

  • I buy something from a foreign e-commerce site, A.
  • A adds me to its own private interest group.
  • Then I visit a local newspaper site, B, which allows its ad server to read B's own private interest group.
  • I guess that all interest groups that list the ad server as a "reader" are eligible to be sent to the ad server.
  • That ad server was not included by A as an allowed reader of its interest groups (because that would mean that A would have to add every possible ad server as a reader).
  • If some sort of delegation is in place (and the whole chain of delegations is resolved before requesting an ad), the private interest group that A added me to is eligible to be sent to the ad server (and to follow the delegation chain up to wherever site A is buying inventory).
  • But to have its interest group included in the list of headers sent to the ad server, site A must compete with every site I've ever visited that allowed the ad server to be a "reader" (and most of them are not running any campaign, since they used the ad server just to request ads, so nobody is targeting those interest groups).

I may well be wrong here, as I don't think I grasp all the details of the PIGIN proposal yet.
Feel free to correct me wherever needed.

But I think a key goal of any proposal should be that the proposed method is the only method available. That is, if PIGIN exists but cookies also exist, it would never succeed. It also means that PIGIN should work in all use cases, even those outside the scope of preventing user profiling (for example, in ad tech, frequency capping). These kinds of use cases should also be described.

But let me share a possible different approach (which is also very, very rough).
A possible answer to the question "why is user profiling even possible?" is that every actor involved in rendering a web page is allowed to store lots of information, and that information can be mapped between different actors (cookie matching).

What would happen if the amount of information each actor may store is not the same and, for any third party, even in the best scenario, that amount isn't enough to store a reliable "user ID"?
So, for example (again, very rough):

  • Any first-party domain is allowed to store any amount of data as cookies.
  • An iframe of a third-party domain, included directly inside a first-party domain, is only allowed to store at most X bytes. Those bytes are tied to certain "vendor IDs", and that storage should be so small that its most reasonable use would be storing "flags". What those flags mean depends entirely on who stored them.
    That is, an e-commerce site could use bit 1 of its allowed storage as "Visited shoes" and bit 2 as "Visited electronics", but not much more. Those bits are only meaningful to this e-commerce site, and from time to time it would change its strategy for assigning bits (for some months it is interested in people who buy shoes, so it uses all its storage bytes to save info about shoes; a few months later that changes, and the whole strategy changes).
    If the information stored is limited and changes over time, I guess profiling would be very difficult, as you cannot simply mix and match data from different sites, or even match it with the same data from some time ago.
    By limiting the amount of data stored by each vendor, maybe hashing it with vendor-provided keys, etc., the bitset representing all the data that all vendors have stored in my browser could always be shared, as it would be encrypted, and even if decrypted, what each bit means is something only each vendor, at a given point in time, knows.
    Also, the browser could impose limits: only send data from the last (say) 100 vendors seen.

A possible extension would be that, as we descend deeper and deeper into the third-party iframe structure, the number of bits "visible" to vendors gets smaller.

So e-commerce site A, when I'm visiting site A, can write any traditional cookie, and at the same time has access, via the API, to its vendor-ID-related storage.
If e-commerce site A shows me an ad directly from my ad server, it would still have access to the complete vendor-ID-related storage.
If e-commerce site A shows me an ad after my ad server redirected to another one, A has access to X - Y bits of its storage (losing precision as it goes deeper in the chain).
Maybe requests to ad servers would be authorized to receive the full storage data (so matching is always done over the full storage data), but any third-party code would receive the clipped data. So matching capabilities are not lost, but profiling capabilities get narrower.
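To make the clipping idea a bit more concrete, here is a rough sketch. The sizes are arbitrary assumptions (16 bits for X, 4 bits lost per redirect for Y), the function names are made up, and keeping the low-order bits is just one possible choice of which bits survive.

```javascript
// Hypothetical sizes: X bits of flag storage per vendor, Y bits lost per hop.
const VENDOR_BITS = 16;
const BITS_LOST_PER_HOP = 4;

// Set or test a single flag; only the vendor knows what each bit means.
function setFlag(storage, bit) {
  return (storage | (1 << bit)) >>> 0;
}
function hasFlag(storage, bit) {
  return (storage & (1 << bit)) !== 0;
}

// Storage as seen at a given depth in the redirect chain.
// Depth 0 is a direct ad request; each further hop hides more high bits.
function visibleStorage(storage, depth) {
  const visibleBits = Math.max(0, VENDOR_BITS - BITS_LOST_PER_HOP * depth);
  const mask = visibleBits === 0 ? 0 : 2 ** visibleBits - 1;
  return (storage & mask) >>> 0;
}

// Bit 0 might mean "visited shoes", bit 13 "visited electronics".
const stored = setFlag(setFlag(0, 0), 13);
const seenDirectly = visibleStorage(stored, 0);
const seenAfterOneRedirect = visibleStorage(stored, 1);
```

With these numbers a vendor one redirect deep still sees the "shoes" flag but no longer sees the "electronics" flag.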

Very rough, a different approach... Who asked for my opinion anyway?
But, well, let me just waste some of my time! :-D
