Re: [dev] [surf] [patch] 13 patches from my Universal Same-Origin Policy branch

From: Ben Woolley <tautolog_AT_gmail.com>
Date: Sat, 28 Mar 2015 16:25:59 -0700

Hi Markus,

Thanks again for the reply.

On 3/28/15, Markus Teich <markus.teich_AT_stusta.mhn.de> wrote:
> Heyho Ben,
>
> tautolog_AT_gmail.com wrote:
>> That is a very good point. The reason why I wanted to try this approach
>> is because, even when you are in a very large anonymity set in the HTTP
>> headers, the IP address network or region may be used to split large
>> sets down to individuals. I saw it in a paper, but I don't have it off
>> hand. I thought, what about adding noise?
>
> The IP address is totally unrelated to the HTTP headers. If you can be
> identified by your IP address, changing the headers will not help at all.
>

Yeah. I figure that since the browser doesn't control the IP address,
that issue will need to be addressed by some other solution. However,
those solutions typically still involve a range or set of IPs that can
split the anonymity sets, and since the browser *does* have control over
the headers, there could be something it could do there.

>> The nice thing about adding noise is that, no matter how much signal is
>> picked up, all of the noise that is not filtered out is anonymizing.
>> Yes, the noise patterns may become signal, but other noise can override
>> that signal, too. Also, the process of filtering out can drop a lot of
>> signal.
>
> This seems to be coming from an electrical engineer's point of view on
> data transmission? From a cryptographer's point of view it would be best
> to choose fully pseudorandom (indistinguishable from true randomness for
> a polynomially bounded adversary with a negligible chance of failure)
> data for the headers.

Yes, both concepts apply to the problem. I am not an expert in
cryptography, but I am familiar with it. I play with SDR stuff and do a
lot of music synth, so I often think of problems with the connotation
of "signal" rather than "symbol".

> However, servers use the data from the header to adapt their response,
> so they should not be chosen randomly, but rather adhere to some kind of
> policy or common values (I don't know if there is any specification for
> the contents of these two headers and how clients and servers should
> behave?).
>

Accept-Language is probably standardized, but the standard for
User-Agent, as far as I know, only says what format the parameters
should be in; it doesn't specify which parameters need to be included.
The way I look at the User-Agent string is that the browser doesn't
*need* to report anything, but if it doesn't, then it should be very
good at following standards, or at least at degrading gracefully.
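
For reference, the grammar (RFC 7231, if I am reading it right) only
constrains the shape, a list of product tokens optionally followed by
comments, not which ones have to appear:

    User-Agent = product *( RWS ( product / comment ) )

    e.g.  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.15
          (KHTML, like Gecko) Safari/538.15

So any string of that shape is formally valid, however much or little it
reveals.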

>> It is nearly impossible to hide from active, targeted, sophisticated
>> surveillance, but "full-take", passive collection could be
>> significantly hindered by small amounts of sophistication that break
>> naive assumptions.
>
> No, this is also not true for passive collection. When an adversary
> tries to cluster requests into groups of single users, they obviously
> use all the identifying information they have. A very easy way to do
> this is to calculate set intersections, e.g. the set of requests from
> users with a screen resolution of 1024x768 and the set of users with a
> UA of "browzeit v4.2". The resulting set is always smaller than or equal
> in size to every one of the combined sets. Therefore using another piece
> of identifying information can only lead to more exact results:
>
> |IP intersect UA intersect language| <= |UA intersect language|
>
> where IP, UA and language are sets of one specific value for the
> respective type (e.g. IP = {requests | request.srcIP == 1.2.3.4}).
>
> Anonymity is better if many different users end up in the same result set.
>
> Since additional fields cannot improve our anonymity (they can only
> throw out other users from „our“ pool), we can ignore the composition of
> multiple pieces of identifying information and just assume users can
> only be identified by their UA. Now if every user chooses to use the
> exact same UA, then the adversary has not learned a single bit of
> identifying information. If all the surf users choose random UA strings
> for each request, but the firefox users always choose the same UA
> string, then the attacker clearly has learned which users are not using
> firefox. There are now many anonymity sets which are all smaller than
> the only one in the previous scenario. Even worse, if the adversary
> actually also has access to the IP address (or any other piece), he can
> easily detect that some requests are thrown in a bucket with no other
> request in there. He can then just ignore the UA string for those
> requests. Now the randomization of the UA string has no benefit at all
> anymore.
>
> In case anyone wants to know: For IP addresses this is a bit different,
> since we cannot send all the http requests from one single IP address.
> Therefore tor actually uses a (relatively small) set of exit nodes, and
> since they are public and it is known how they work, they all work as
> one anonymity set for the tor users. So if only a few people used tor,
> they would not be surfing anonymously at all. Tor (and any other IP
> address related anonymizing software) heavily depends on a big user
> base. The more users, the more other requests in which you can blend in.
> Funny story: Some secret services once used their own anonymizing
> software, but this was soon discovered to be very bad: Every user of
> „NSA-anonymizer“ was clearly someone working for the NSA, so they
> switched to other more widespread anonymizing software.
>

Yes, I agree completely with that theory, and you explained it really
well. The good thing is that, with theory, you can prove the smallest
anonymity set you can have with a particular set of information
(right?). The concern that I have is that, as long as semi-fixed IP
addresses, or even ranges of IP addresses, are used, anonymity sets are
fairly small in practice.
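
For example (illustrative numbers only): even if a million requests
share the UA "browzeit v4.2", but only 50 of them come from my ISP's
/24, then

    |subnet intersect UA| <= |subnet| = 50

so the IP side alone caps my anonymity set at 50, no matter how common
my headers are.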

So, if I were using Tor, or something similar, I would be reducing the
information in requests as much as possible, because the lack of
information is the technique. However, when not using Tor, the
information in requests is already too much, so the next angle is to
block cookie tracking and hinder profiling, so that at least the
browsing is obscured from the view of commercial systems.

As you pointed out with your NSA-specific product example, randomizing
the IP address is related to randomizing the HTTP headers. Say we have a
list of Tor exit nodes. That list can be correlated with the headers to
attack the headers, just as we would with any other range of IPs, since
it is their property as a set of IPs that allows the set intersection.

There is another angle. Even though the information in a full-take
system can be used to track the requests, at least theoretically, and
probably even in practice, that information is often tied to the data in
commercial systems. If the data in the commercial systems is messed up,
then the commercial data cannot be integrated through that connection,
and the legal loophole of "business correspondence" no longer applies.

One thing that I am noticing is that surveillance systems are
productized. The issue you mentioned above is probably still very
common. Their methods for browser profiling may still be naive. I
imagine that there is an internal research group that has very
advanced methods already, developed by cryptographers, but it wouldn't
surprise me if it weren't productized that well, even now, especially
since they need a way to link the profiles against commercial systems.
Commercial systems integrate by using common hashes, so I imagine
their first system would probably use the common commercial hashes.

Granted, I am pretty much "talking out of my ass", since there is
little theory, and little actual knowledge about the specific
techniques that are used.

>> Your suggestion is very good, and I am trying to build something like
>> that, but with little effect on compatibility. Maybe collect the set of
>> valid headers with large anonymity sets, and select a subset of headers
>> that match the real configuration in only the most important features.
>> That way, only obscure compatibility tests will fail. And have an
>> option to provide the real user-agent string when an issue happens.
>> After all, if you only use the real one rarely, how can it be profiled?
>> I suppose you could trick people into turning that on, but that is a
>> fairly targeted action, not a full-take action, which is the primary
>> issue.
>>
>> I can collect a set of common user-agent strings, and can find a subset
>> that are webkit, and use those. Since compatibility tests are usually
>> about the rendering engine, that would avoid most compatibility issues
>> with a random user-agent. Maybe provide the set of common user agent
>> strings by rendering engine as a separate open source project. I deal
>> with enough traffic to collect this myself.
>
> Cryptographically I don't think choosing one out of - let's say 10 -
> common UA strings for every request randomly would be a problem. However
> it would also not bring any observable benefit and adds code complexity,
> so I advise to leave it out and just pick one common, sane UA string,
> which can be changed by users in config.h as is the case currently.

I concede to that recommendation 100%. Even if I tracked the common
user-agents and provided a patch, it would be easier to distribute it
separately anyway.

A nice side effect of using random common headers is what it does to the
identifier graph. Whenever there is a match, a link is made. If you use
only one header, you have less of a chance of being confused with other
identities than you would with many headers. When a node splits off into
multiple identities, it lowers the value of the links through those
nodes of the graph.

>
>> Websites rarely need to know if you are running Linux, but if you are
>> going to download software, you can enable the correct OS to be sent.
>> Besides, the correct API for websites should be that they request the
>> browser to identify the OS, like they request device location, and the
>> user accepts the request explicitly. It is such a rare need that not
>> every website needs to know the operating system. Browsers do this for
>> credit card data (not that I would use it); they can have a form fill
>> for operating system, too. When input name="operating_system", prompt
>> to fill it.
>
> I cannot imagine a case where a surf user would need OS detection to
> download software. Package managers already know where they operate on. ;)
>

Very true. I suggest defaulting to reporting like Safari on a Mac. Even
if I just had a URL that sent back the most common Safari-on-Mac
User-Agent I have seen recently, I could add a target to the Makefile
that patches config.h with the update.
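
On surf's side that would just be the one string in config.h (a sketch;
the variable name is what I believe config.def.h already uses, and the
UA value below is only a plausible current Safari-on-Mac string, not one
I have verified against real traffic):

    /* Report a common Safari-on-Mac UA by default; a Makefile target
     * could fetch the currently most common value from the web service
     * and sed it in here before building. */
    static char *useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
                             "AppleWebKit/600.4.10 (KHTML, like Gecko) "
                             "Version/8.0.4 Safari/600.4.10";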

>> The "noise" I add to the accept-language header is easily identified as a
>> new
>> signal, so I am leaning toward abandoning it, but there are some
>> interesting
>> opportunities there. For example, it can be used when active surveillance
>> is
>> not an issue, but passive surveillance is an issue, to add friction to
>> the
>> passive surveillance machine.
>
> As explained above, this does not work. Just use en-US as default and
> let users change it to their preferred language in config.h if they
> want to.
>

I can submit a separate patch that adds an Accept-Language header with a
default of en-US in the config file. If it is set to NULL, it can use a
locale setting. Currently, no Accept-Language header is sent. I can also
provide a web service that reports the locale that MaxMind's country
database gives for your IP, so that you would blend in with your
neighboring IPs. However, websites that understand an Accept-Language
header often fall back to MaxMind's database anyway when no
Accept-Language header is sent, so it may not be so useful. But if we
don't send an Accept-Language header, and send a User-Agent that reports
as a browser that normally sends one, we have now isolated ourselves. If
we mimic a webkit browser, the hash may line up, since the header will
likely be sent in the same order. Not sure, though. I will need to check
whether the order in which the header is set matters, and see if I can
get a bit-for-bit mimic.
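
To make the intent concrete, roughly this is what I have in mind (only a
sketch against the WebKit1/libsoup stack surf uses; the acceptlanguage
name and the place it gets called from are my assumptions, not existing
surf code):

    #include <libsoup/soup.h>
    #include <webkit/webkit.h>

    /* config.h addition (sketch): NULL would mean "derive from the locale". */
    static const char *acceptlanguage = "en-US";

    /* Run once during setup, next to the other SoupSession tweaks. */
    static void
    setacceptlanguage(void)
    {
            SoupSession *s = webkit_get_default_session();

            if (acceptlanguage)
                    g_object_set(G_OBJECT(s), "accept-language",
                                 acceptlanguage, NULL);
            else
                    g_object_set(G_OBJECT(s), "accept-language-auto",
                                 TRUE, NULL);
    }

Whether libsoup then emits the header in the same position as a stock
webkit browser (the bit-for-bit concern above) is something I would
still have to check with a packet capture.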

> --Markus
>
>

Ben
Received on Sun Mar 29 2015 - 00:25:59 CET

This archive was generated by hypermail 2.3.0 : Sun Mar 29 2015 - 00:36:07 CET