Step into the cookie jar
I’ve written my fair share of Raku distributions, but I think the one that has been the most successful, in that it appears to get the most use, is HTTP::Tiny, a port of the Perl library of the same name. It was released a couple of years ago already, so you might be familiar with it. You can read the post where it was first announced, but as a summary, here are some of the things that it can do:
- it’s a fully HTTP/1.1 compliant library
- it has no dependencies, although some of its functionality (e.g. HTTPS support) depends on optional libraries
- it supports both HTTP and HTTPS proxies
- it supports streaming requests and responses, form uploads, and ranged requests
I had a lot of fun working on it, and I learned loads while doing so as well. But perhaps more importantly, it filled a gap that I think needed filling, and judging by its usage, I’m not the only one who thinks so.
Overall, I’m pretty happy with how that one turned out.
However, for all the things it does support, there are some things that it can’t do which I really want. And some of those, specifically timeouts and cookie support, have been on the to-do list from the very beginning…
Timeouts are missing because of some limitations with the handles that it uses under the hood, so this will be a hard problem to crack. But from HTTP::Tiny v0.2.0, cookie support is no longer a limitation, and I wanted to talk a little bit about how that came to be.
A cookie primer
I’ve been thinking about HTTP cookies quite a bit while working on this, but I know that not everybody has had the pleasure of reading the RFC. So just in case, let’s do a very quick cookie primer.
The core idea of HTTP cookies is to give the server the ability to store state in the client. It’s a simple idea, but as you can imagine, it has all sorts of ramifications and security implications, some of which we’ll discuss below.
They’ve been used for a long time. They started being used experimentally in 1994, and since then there have been several RFCs specifying what you might call different “versions” of cookies: RFC 2109 published in 1997, RFC 2965 published in 2000, and RFC 6265 published in 2011, which is the most current one (at the time of writing). The next version is currently being drafted and has seen several revisions, so it’s very likely that it will end up being published Real Soon Now.
But since RFC 6265 is the one that is currently active, that’s the one that I’m targeting and the one we’ll be focusing on.
Their shape is very simple: it’s just a key/value pair with a set of additional attributes. They’re sent by the server in a special header called `Set-Cookie`, with the key and the value followed by the cookie’s attributes.
Every RFC so far has specified a list of recognised attributes with specific meanings, but this list is constantly changing, with new attributes being added and old ones being removed. Users have also always used attributes to extend the specification and to experiment with additional features. Sometimes these become well received and spread, becoming de-facto standards (like the `SameSite` attribute, which has yet to be specified in any RFC).
All of the attributes that are currently recognised by RFC 6265 are there to govern what the scope of the cookie should be. Some examples of existing attributes are:
- `Secure` and `HttpOnly`, which restrict the protocols of the cookie
- `Domain` and `Path`, which restrict the URLs the cookie is applicable to
- `Expires` and `Max-Age`, which restrict the lifespan of the cookie
Using these attributes, the client determines which cookies are applicable to a request, and then sends them back to the server in the `Cookie` header (although the cookies that the client sends back to the server do not contain any attributes, only the key and the value).
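For illustration, a hypothetical exchange might look like this (the name and value are made up). The server sets the cookie in its response:

Set-Cookie: session=38afes7a8; Path=/; Secure; HttpOnly

And for any later request that is in scope, the client echoes back only the key and value:

Cookie: session=38afes7a8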
Domains and paths
Back to the cookie attributes: the `Domain` and `Path` attributes are interesting because there are some risks involved.
Let’s take the following four cookies as an example:
A=1; Domain=domain.com;     Path=/
B=1; Domain=domain.com;     Path=/path
C=1; Domain=sub.domain.com; Path=/
D=1; Domain=sub.domain.com; Path=/path
Each of these cookies has a different combination of `Domain` and `Path`, and they will serve to illustrate how these attributes interact. The table below shows which cookies the client can send (in the `Cookie` header) for any given request, and which cookies the server can set (in the `Set-Cookie` header) in response to the same requests.
| Domain         | Path  | Cookie     | Set-Cookie |
|----------------|-------|------------|------------|
| domain.com     | /     | A          | A, B       |
| domain.com     | /path | A, B       | A, B       |
| sub.domain.com | /     | A, C       | A, B, C, D |
| sub.domain.com | /path | A, B, C, D | A, B, C, D |
So, for example, with the cookies above, a client making a request to `domain.com/` can only send cookie A along with the request. All the other cookies apply either to a subdomain of the domain the request is for, or to a path that is under the one that is being requested.
On the other hand, a request to `sub.domain.com/path` can be sent by a client with cookies A, B, C, and D, because they all apply either to the domain being requested or to a domain above it, and to either the path being requested or to one above it.
Note, however, that the rules that specify which cookies the client can send with a given request are not the same as those that specify which cookies the server can set for the same request.
Taking the same requests as in the previous examples, in response to a request to `domain.com/` the server can set cookies A and B, even though B applies to a path that sits below the one that was requested. And the same goes for a response to a request to `sub.domain.com/path`: in that case, the server can set all four of the example cookies, including cookies that apply to `domain.com`, which sits above the domain that was requested.
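To make the client-side rules concrete, here is a minimal sketch of the domain-match and path-match checks that RFC 6265 describes, written in Raku. The function names and structure are my own (this is not Cookie::Jar’s code), and it glosses over details such as IP-address hosts and host-only cookies:

sub domain-matches (Str $request-host, Str $cookie-domain --> Bool) {
    # Identical hosts match; otherwise the request host must end
    # in ".<cookie-domain>", e.g. 'sub.domain.com' matches 'domain.com'
    $request-host eq $cookie-domain
        or $request-host.ends-with('.' ~ $cookie-domain);
}

sub path-matches (Str $request-path, Str $cookie-path --> Bool) {
    # Identical paths match; otherwise the cookie path must be a prefix
    # of the request path that ends on a '/' boundary
    return True  if $request-path eq $cookie-path;
    return False unless $request-path.starts-with($cookie-path);
    $cookie-path.ends-with('/')
        or $request-path.substr($cookie-path.chars, 1) eq '/';
}

# With the example cookies above, for a request to sub.domain.com/path:
say domain-matches('sub.domain.com', 'domain.com'); # True:  A and B are in scope
say path-matches('/path', '/');                     # True:  so are A and C
say path-matches('/', '/path');                     # False: B and D do not apply to a request for '/'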
Supercookies
And here’s where we bump into the first real security consideration around cookies.
If we go back to the table above, we can see that a cookie that applies to `domain.com` applies also to any domain under `domain.com`. But this means that there’s technically nothing that stops you from setting a cookie with a `Domain` set to plain `com`. And if you did, that cookie would technically apply to any domain that ended with `.com`.
Because of their ability to apply so widely, any cookie with a `Domain` set to a top-level domain is called a “supercookie”, and they present a risk that needs to be accounted for.
Before we get too far down the rabbit-hole, let’s put a pin in this and get back to our main topic. We’ll come back to supercookies later.
Finding a cookie jar
Throughout the discussion of cookie applicability above I was careful to use the word “can”: “the client can send such-and-such cookie”. Why is that? This is because cookies are entirely optional (indeed, this is why we can have HTTP user agents that are compliant with HTTP/1.1 but do not support cookies, like HTTP::Tiny).
And even if the client supports cookies, there are good reasons why the client can choose not to send them, or the server not to set them: for the client, this requires cookies to be stored somewhere, which uses memory that may be limited; and for the server, every cookie it sets is extra data that will be sent in headers, which have their own limits and add overhead to every exchange.
The goal of this project was to create a library that would take on that responsibility on the client side. It would retrieve cookies from a response, determine whether they were being legitimately set by the server, find storage space for them, and determine also which of the cookies that have been stored are applicable to any new requests. This is traditionally called a “cookie jar”.
Furthermore, since the idea was for it to be usable with HTTP::Tiny, I wanted this cookie jar to follow the same principles as that library: minimal dependencies, and compliance with the specification, in this case RFC 6265.
Prior art
When I looked in the Raku ecosystem I could find two cookie jars: HTTP::Cookies, which seems to be a port of a Perl library, and Cro::HTTP::Client::CookieJar. But both of these had issues that made them unattractive from my perspective.
Poor interoperability
The first issue was that both of these are actually distributed with an HTTP client. This might not sound like a big problem, but remember the goal here is to have a cookie jar that can be used with HTTP::Tiny, and that is a distribution which is supposed to have no dependencies. If in order to use cookies with it you need to install another user agent, then the exercise becomes a little pointless.
A related issue is that, because they are distributed with these user agents, they rely on those frameworks to work. For example, the HTTP::Cookies module has an `add-cookie-header` method that will generate a `Cookie` header to send with a request. But in order to use it, you need to give it an HTTP::Request object, and the method will inject the header into the request directly: there is no way to access that header without using that object.
The Cro::HTTP::Client::CookieJar `add-to-request` method has a similar requirement, but this time using a different class to represent a request: this one needs a Cro::HTTP::Request object, as well as a Cro::Uri object.
What this means is that in order to use these libraries, you have to buy in to the rest of their frameworks, which makes interoperability more difficult (and makes them unsuitable for this particular project).
Non-compliant behaviours
But even if I had decided that this was not a problem, or had worked around it with some sort of wrapper code, neither of these jars was as compliant with RFC 6265 as I would have liked.
Some of these compliance issues were relatively minor. For example, the spec specifies an order in which the cookies should be serialised in the `Cookie` header, but Cro::HTTP::Client::CookieJar does not follow this order. This is “relatively minor” because, even though it goes against the spec, this point is only a “should”, and the spec also states that servers “should not” rely on the order of the cookies.
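For reference, the recommended order is longer `Path` values first and, for paths of equal length, earlier creation times first. Here is a rough sketch of that sort in Raku; the hash structure is made up for the example and is not taken from any of the jars mentioned:

# Serialise cookies in the order RFC 6265 recommends for the Cookie header:
# longer paths first, then earlier creation times.
my @cookies =
    { name => 'A', value => '1', path => '/',     created => 3 },
    { name => 'B', value => '2', path => '/path', created => 2 },
    { name => 'C', value => '3', path => '/path', created => 1 };

my @ordered = @cookies.sort({ -$_<path>.chars, $_<created> });

say @ordered.map({ $_<name> ~ '=' ~ $_<value> }).join('; ');
# OUTPUT: C=3; B=2; A=1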
But some of these can have more serious consequences. For example, HTTP::Cookies does not do path restriction for cookies, and in this case this is a “must”.
And although not in the spec, perhaps their most serious limitation in my eyes was that neither of them had a way to protect the user against the supercookies mentioned above (we’ll get back to this point later).
Introducing Cookie::Jar
Since none of the existing tools were suitable for my goals, this project resulted in the release of a distribution called Cookie::Jar. In the spirit of HTTP::Tiny, Cookie::Jar is a minimalist HTTP cookie jar which takes inspiration from existing Perl libraries¹, and is compliant with RFC 6265. It does have one dependency on IDNA::Punycode to support international domains, but it has no other functional dependencies, by which I mean that that is the only distribution it needs in order to function.
Other than that, the main difference between this cookie jar and the ones discussed above is that its interface relies only on Raku built-in types, to make it usable in the largest possible set of contexts.
But in order to protect against supercookies there is still one more problem we need to solve.
The Public Suffix List
We said supercookies are cookies that have their `Domain` attribute set to a top-level domain. But this begs the question: what is a top-level domain?
Some of them are pretty easy to spot: `.com`, `.org`, `.net`. Those we all know.
In recent years, however, the list has grown significantly with the release of newer top-level domains: we now have `.pizza` and `.香港`. And new ones get added all the time.
More seriously, you cannot simply assume that the last component of the domain will be the top-level domain, since there are combined top-level domains, like `.co.uk` and `.ne.jp`, and even cases like `.pvt.k12.ma.us`.
I do not want to get into the discussion about how close to the top a domain needs to be to be a “top-level domain”. Not only because it’s difficult, but mainly because, luckily for us, that question has already been answered by a thing called the Public Suffix List.
The Public Suffix List is a project initially started by Mozilla, and currently maintained by a community of volunteers who are constantly keeping an eye on what new top-level domains appear. It’s available online, and it’s used very widely by a large number of different projects.
The list includes both domains that have been registered by ICANN, as well as “private” domains that have been submitted by certain domain owners who want to treat their domains as a top-level domain. This includes cases like `.blogspot.com`, `.github.io`, or `.googleapis.com` (at time of writing).
This list has become the way to protect against supercookies, so if I wanted to be able to identify supercookies, I needed a way to get access to the list.
Like with the cookie jar, I tried looking for prior art in the Raku ecosystem, but surprisingly there was nothing. This explains why none of the other cookie jars offer this feature.
Using the list
The list has three types of entries. There are literal top-level domains, like `kumamoto.jp`; patterns that match all domains under a given literal, like `*.yokohama.jp`; and exceptions that restrict those patterns, like `!city.yokohama.jp`. Since the list has been designed to be machine-readable, parsing it and using it is not actually the hardest part of this problem.
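To give a rough idea of how those three kinds of entries interact, here is a minimal sketch in Raku of how a host could be matched against them. The rule set is hard-coded for illustration, and this is not PublicSuffix’s actual implementation:

# A hand-rolled rule set standing in for the real list
my @exceptions = <city.yokohama.jp>;   # from '!city.yokohama.jp'
my @wildcards  = <yokohama.jp>;        # from '*.yokohama.jp'
my @literals   = <kumamoto.jp jp>;     # plain entries

sub suffix-for (Str $host) {
    my @labels = $host.split('.');

    # Walk the candidate suffixes from longest to shortest
    for ^@labels -> $i {
        my $candidate = @labels[$i .. *].join('.');
        my $parent    = @labels[$i + 1 .. *].join('.');

        return $parent    if $candidate (elem) @exceptions; # exception wins
        return $candidate if $parent    (elem) @wildcards;  # wildcard match
        return $candidate if $candidate (elem) @literals;   # literal match
    }
    return Nil;
}

say suffix-for 'www.city.yokohama.jp'; # yokohama.jp     (the exception applies)
say suffix-for 'foo.bar.yokohama.jp';  # bar.yokohama.jp (matches the wildcard)
say suffix-for 'sub.kumamoto.jp';      # kumamoto.jp     (a literal entry)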
The real challenge comes from the fact that this list is constantly being updated, sometimes several times per week, which raised the question of how a library that uses this list can be kept up to date.
One possible solution would be to curate the changes, so that every month or so I could look at the list and decide whether the most recent changes were worthy of a release. However, I did not want to commit to that, and I also did not want to take on the responsibility of making that decision.
Another possibility would be to shift that responsibility to the user, and either require or allow them to provide a list that the library can then use. This is indeed the approach taken by some of the existing Perl libraries I looked at as a reference. However, this meant that the list could not be pre-compiled into the module, and had to be parsed every time the module was loaded. This was something I was trying to avoid.
The third possibility was an idea I got from the `psl` Rust library, which every day checks for changes in the upstream list, and mints a new release if anything has changed. The first time I saw this I thought it was madness, but this is actually the method I ended up going with.
Introducing PublicSuffix
This resulted in a new distribution called PublicSuffix which was released very soon after Cookie::Jar. This is a very small distribution that parses the list, stores it on compilation, and gets updated automatically whenever a new version of the list is made available.
It is not clear to me whether this is the best solution to this problem, but it did mean that I could strike a good balance between allowing users to control what version of the list they are on and still supporting fast execution times by compiling the list into the module². That said, I am open to the idea of adding support for users providing their own lists if this proves to be a wanted feature.
Putting it all together
We’ve now delved several levels deep, so it’s a good idea to try to go back to the surface and see how all these pieces fit together.
The PublicSuffix module is now available on the Raku ecosystem, and can be used on its own for whatever you think might benefit from it.
use PublicSuffix;
say public-suffix '福.個人.香港';
# OUTPUT: 個人.香港
say public-suffix 'test.pvt.k12.ma.us';
# OUTPUT: pvt.k12.ma.us
say registrable-domain 'raku.land';
# OUTPUT: raku.land
Importing it exports two functions. The main one is `public-suffix`, which takes a string representing a valid host and returns a string with the top-level domain of that host, or the type object if the host is an IP address. It also exports the `registrable-domain` function, which likewise takes a valid host as a string and returns the “registrable domain” of the host, or “the host’s public suffix and the domain label preceding it, if any”.
Both of them support hosts in either ASCII “punycode” or in Unicode, and they will attempt to return a value in the same format the host was provided in.
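As a hedged illustration of why this matters for cookies, this is roughly the kind of check a jar can perform with it; the function below is mine, not part of either distribution:

use PublicSuffix;

# If the Domain attribute of a cookie is itself a public suffix, accepting it
# would create a supercookie, so a jar should refuse it.
sub acceptable-cookie-domain (Str $domain --> Bool) {
    $domain ne (public-suffix($domain) // '');
}

say acceptable-cookie-domain 'foo.com'; # True
say acceptable-cookie-domain 'com';     # False: 'com' is a public suffix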
In order to use it with Cookie::Jar, all you have to do is have the module installed. As long as it is usable, Cookie::Jar will load it and use it to reject any supercookies that may have been received in a response.
use Cookie::Jar;
my $c = Cookie::Jar.new;
$c.add: 'https://foo.com', 'foo=123; Domain=foo.com';
$c.add: 'https://foo.com', 'bar=234';
# Rejected if PublicSuffix is available
$c.add: 'https://foo.com', 'super=1; Domain=com';
say "{ .name } -> { .value }"
for $c.get: 'GET', 'https://foo.com';
# OUTPUT:
# foo -> 123
# bar -> 234
# The 'bar' cookie does not apply to this subdomain
say $c.header: 'GET', 'https://www.foo.com';
# OUTPUT:
# foo=123
As for the cookie jar itself, as you can see in the code snippet above, you can add cookies by giving the jar the URL that the response came from and the value of the `Set-Cookie` header. The jar knows how to parse that value, it will determine which cookies are valid and can be set by this response, and it will know where to store them for future use.
To allow for inspection or for any extra processing you may want to do on the cookies, you can retrieve them from the jar. In this case, however, you have access to read-only versions of the cookies, to ensure that the cookies in the jar remain the ones that were received from the server.
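For example, a hypothetical continuation of the snippet above, using the same `.name` and `.value` accessors, might dump the jar’s contents for a given request before sending it:

# Read-only inspection: modifying these copies does not change the jar
for $c.get('GET', 'https://foo.com') -> $cookie {
    say "Stored cookie: { $cookie.name } = { $cookie.value }";
}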
Alternatively, you can give it the request method and the URL the request is for, and the jar will return the value for the `Cookie` header that can be sent along with that request.
And coming all the way back to the top: starting from HTTP::Tiny version 0.2.0, the constructor accepts a new `cookie-jar` parameter, which can be set to an object to use to process cookies. Importantly, the client does not internally check that the cookie jar object is an instance of Cookie::Jar: as long as the class of the cookie jar provides a compatible interface, HTTP::Tiny is happy to use it.
use HTTP::Tiny;
use Cookie::Jar;
my $cookie-jar = Cookie::Jar.new;
$cookie-jar.load: 'cookie.jar' if 'cookie.jar'.IO.e;
my $client = HTTP::Tiny.new: :$cookie-jar;
... # Client will use the cookie jar
$cookie-jar.save: 'cookie.jar';
Parting thoughts
Before wrapping up, I thought it would be good to go over a couple of things that this project brought to mind.
The first is the now almost commonplace idea that Raku is “a young language with a long youth”. A couple of times now, when starting to use the language in earnest for a serious task, I’ve bumped into holes in surprising places. For a specific example, when setting out to implement cookie support, I did not expect to find that there was no tool to check the Public Suffix List, and to have to implement my own. It caught me by surprise to have to go that low-level.
The good thing, as I think has also been said by others before me, is that Raku is very versatile, so the tools you need to fill in those holes are often available. I find that experiences like this one have a value not only in their specific outcomes, but also in the holes that get covered along the way.
A related but different aspect is that even when the tools are available, it is not always easy to find them, or to evaluate which of them are fit for purpose. The process of finding existing cookie jars, and then realising why they were not useful for my task, took a considerable amount of effort. And there may be other cookie jars that I couldn’t find, which only makes this more serious.
I think the Raku ecosystem is good: there’s plenty of value in there. But to a large extent because of that “long youth”, there’s also a lot of stuff that we need to curate, and building the tools that allow us to find the real gems in that ecosystem is going to be hard work, but very valuable work.
Thank you.
1. Especially HTTP::CookieJar, HTTP::Cookies, and Mojo::UserAgent::CookieJar. My gratitude to the authors and maintainers of those distributions. 🙇 ↩
2. Note that this is just my expectation, and has not been verified by a benchmark. ↩