Re: [dev] [surf] [patch] 13 patches from my Universal Same-Origin Policy branch from Markus Teich on 2015-03-28 (dev mail list archive)

From: Markus Teich <markus.teich_AT_stusta.mhn.de>
Date: Sat, 28 Mar 2015 18:31:58 +0100

Heyho Ben,

tautolog_AT_gmail.com wrote:
> That is a very good point. The reason why I wanted to try this approach is
> because, even with being in a very large anonymous set in HTTP headers, the IP
> address network or region may be used to split large sets down to individuals. I
> saw it in a paper, but I don't have it off hand. I thought, what about adding
> noise?

The IP address is totally unrelated to the HTTP headers. If you can be
identified by your IP address, changing the headers will not help at all.

> The nice thing about adding noise is that, no matter how much signal is picked
> up, all of the noise that is not filtered out is anonymizing. Yes, the noise
> patterns may become signal, but other noise can override that signal, too.
> Also, the process of filtering out can drop a lot of signal.

This seems to be coming from an electrical engineer's point of view on data
transmission? From a cryptographer's point of view it would be best to choose
fully pseudorandom (indistinguishable from true randomness for a polynomially
bounded adversary, except with negligible probability) data for the headers.
However, servers use the data from the headers to adapt their response, so the
values should not be chosen randomly, but rather adhere to some kind of policy
or common values (I don't know if there is any specification for the contents of
these two headers and how clients and servers should behave?).

> It is nearly impossible to hide from an active, targeted, sophisticated
> surveillance, but "full-take", passive collection could be significantly
> hindered by small amounts of sophistication that breaks naive assumptions.

No, this is also not true for passive collection. When an adversary tries to
cluster requests into groups belonging to single users, they obviously use all
the identifying information they have. A very easy way to do this is to
calculate set intersections, e.g. of the set of requests from users with a
screen resolution of 1024x768 and the set of requests from users with a UA of
"browzeit v4.2". The resulting set is always smaller than or equal in size to
each of the combined sets. Therefore, using another piece of identifying
information can only lead to more exact results:

|IP intersect UA intersect language| <= |UA intersect language|

where IP, UA and language are sets of one specific value for the respective type
(e.g. IP = {requests | request.srcIP == 1.2.3.4}).

Anonymity is better if many different users end up in the same result set.
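
The inequality above can be checked on a toy example. The following is only a
minimal sketch: the request log, its field names and all values are made up for
illustration and do not come from any real traffic.

```python
# Hypothetical request log; field names and values are invented for illustration.
requests = [
    {"id": 1, "src_ip": "1.2.3.4", "ua": "browzeit v4.2", "lang": "en-US"},
    {"id": 2, "src_ip": "1.2.3.4", "ua": "browzeit v4.2", "lang": "de-DE"},
    {"id": 3, "src_ip": "5.6.7.8", "ua": "browzeit v4.2", "lang": "en-US"},
    {"id": 4, "src_ip": "5.6.7.8", "ua": "otherbrowser v1.0", "lang": "en-US"},
    {"id": 5, "src_ip": "9.9.9.9", "ua": "browzeit v4.2", "lang": "en-US"},
]


def matching(field, value):
    """Return the set of request ids whose `field` equals `value`."""
    return {r["id"] for r in requests if r[field] == value}


ip_set = matching("src_ip", "1.2.3.4")
ua_set = matching("ua", "browzeit v4.2")
lang_set = matching("lang", "en-US")

# Intersecting with one more set can only shrink the result or leave it as is.
two_fields = ua_set & lang_set
three_fields = ip_set & ua_set & lang_set

assert len(three_fields) <= len(two_fields)
print(len(two_fields), len(three_fields))  # prints "3 1"
```

In this made-up log, adding the IP address narrows three candidate requests
down to a single one.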

Since additional fields cannot improve our anonymity (they can only throw other
users out of „our“ pool), we can ignore the combination of multiple pieces of
identifying information and just assume users can only be identified by their
UA. Now if every user chooses the exact same UA, then the adversary has not
learned a single bit of identifying information. If all the surf users choose
random UA strings for each request, but the Firefox users always choose the same
UA string, then the attacker has clearly learned which users are not using
Firefox. There are now many anonymity sets, all of which are smaller than the
single one in the previous scenario. Even worse, if the adversary also has
access to the IP address (or any other piece of information), he can easily
detect that some requests end up in a bucket with no other request in it. He can
then just ignore the UA string for those requests. Now the randomization of the
UA string has no benefit at all anymore.
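
The two scenarios can be sketched in a few lines. This is a toy model under
stated assumptions, not a measurement: the user names, IP addresses and UA
strings are all hypothetical, every user has exactly one IP address, and the
adversary simply falls back to the IP address whenever a UA value occurs only
once.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)  # seeded so the sketch is reproducible

surf_users = ["s1", "s2", "s3"]
firefox_users = ["f1", "f2", "f3"]
all_users = surf_users + firefox_users
# Assumption: one hypothetical IP address per user.
ips = {u: "10.0.0.%d" % i for i, u in enumerate(all_users, 1)}


def random_ua():
    """Return a made-up random UA string."""
    return "ua-%016x" % rng.getrandbits(64)


def users_per_ua(requests):
    """Map each UA value to the set of users who sent it."""
    groups = defaultdict(set)
    for r in requests:
        groups[r["ua"]].add(r["user"])
    return groups


# Scenario A: everybody sends the same UA on each of two requests.
scenario_a = [{"user": u, "ip": ips[u], "ua": "common-ua/1.0"}
              for u in all_users for _ in range(2)]

# Scenario B: surf users randomize per request, Firefox users do not.
scenario_b = (
    [{"user": u, "ip": ips[u], "ua": random_ua()}
     for u in surf_users for _ in range(2)]
    + [{"user": u, "ip": ips[u], "ua": "firefox-ua/1.0"}
       for u in firefox_users for _ in range(2)]
)

sizes_a = sorted(len(s) for s in users_per_ua(scenario_a).values())
sizes_b = sorted(len(s) for s in users_per_ua(scenario_b).values())
print(sizes_a)  # one anonymity set containing all six users
print(sizes_b)  # six sets of size one, plus the Firefox set of size three

# Adversary who also sees the IP: ignore any UA that occurs only once.
ua_counts = Counter(r["ua"] for r in scenario_b)


def adversary_key(request):
    if ua_counts[request["ua"]] == 1:
        return (request["ip"],)
    return (request["ip"], request["ua"])


clusters = {adversary_key(r) for r in scenario_b}
print(len(clusters))  # one cluster per user, randomization bought nothing
```

In this model the randomized UA strings only mark the surf users as "not
Firefox", and the IP fallback links each user's requests together again.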

In case anyone wants to know: For IP addresses this is a bit different, since we
cannot send all the HTTP requests from one single IP address. Therefore Tor
actually uses a (relatively small) set of exit nodes, and since they are public
and it is known how they work, they all work as one anonymity set for the Tor
users. So if only a few people use Tor, they would not be surfing anonymously at
all. Tor (and any other IP address related anonymizing software) heavily depends
on a big user base. The more users, the more other requests among which you can
blend in. Funny story: Some secret services once used their own anonymizing
software, but this was soon discovered to be very bad: Every user of
„NSA-anonymizer“ was clearly someone working for the NSA, so they switched to
other, more widespread anonymizing software.

> Your suggestion is very good, and I am trying to build something like that,
> but with little effect on compatibility. Maybe collect the set of valid
> headers with large anonymity sets, and select a subset of headers that match
> the real configuration in only the most important features. That way, only
> obscure compatibility tests will fail. And have an option to provide the real
> user-agent string, when an issue happens. After all, if you only use the real
> one rarely, how can it be profiled? I suppose you could trick people to turn
> that on, but that is a fairly targeted action, not a full-take action, which
> is the primary issue.
>
> I can collect a set of common user-agent strings, and can find a subset that
> are webkit, and use those. Since compatibility tests are usually about
> rendering engine, that would avoid most compatibility issues with a random
> user-agent. Maybe provide the set of common user agent strings by
> rendering engine as a separate open source project. I deal with enough traffic
> to collect this myself.

Cryptographically, I don't think randomly choosing one out of - let's say 10 -
common UA strings for every request would be a problem. However, it would also
not bring any observable benefit and it adds code complexity, so I advise leaving
it out and just picking one common, sane UA string, which users can change in
config.h, as is currently the case.

> Websites rarely need to know if you are running Linux, but if you are going to
> download software, you can enable the correct OS to be sent. Besides, the
> correct API for websites should be that they request the browser to identify
> the OS, like they request device location, and the user accepts the request
> explicitly. It is such a rare need that not every website needs to know the
> operating system. Browsers do this for credit card data (not that I
> would use it), they can have a form fill for operating system, too. When input
> name="operating_system", prompt to fill it.

I cannot imagine a case where a surf user would need OS detection to download
software. Package managers already know which system they are running on. ;)

> The "noise" I add to the accept-language header is easily identified as a new
> signal, so I am leaning toward abandoning it, but there are some interesting
> opportunities there. For example, it can be used when active surveillance is
> not an issue, but passive surveillance is an issue, to add friction to the
> passive surveillance machine.

As explained above, this does not work. Just use en-US as default and let users
change it to their preferred language in config.h if they want to.

--Markus
Received on Sat Mar 28 2015 - 18:31:58 CET
