In Postgres you can write recursive queries that refer to themselves. This is
a very powerful construct, and among other use cases it can greatly simplify
fetching hierarchical data: any data where an attribute refers to another row
in the same table.
For example, a graph structure where a parent identifier refers to an ID in
the same table. Another example can be seen in audit-trail style tables,
containing prior and new versions of some value.
Let’s prepare an example to showcase recursive querying in Postgres.
First, a table holding old and new email address values:
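Something like the following, where the table, columns and sample data are illustrative:

CREATE TABLE email_versions (
    id        serial PRIMARY KEY,
    old_email text,
    new_email text NOT NULL
);

INSERT INTO email_versions (old_email, new_email) VALUES
    (NULL, 'foo@example.com'),
    ('foo@example.com', 'bar@example.com'),
    ('bar@example.com', 'baz@example.com');

We can then walk the full history of the address with a recursive query along these lines:

WITH RECURSIVE all_emails AS (
    -- non-recursive term: the first version of the address
    SELECT old_email, new_email
    FROM email_versions
    WHERE old_email IS NULL
    UNION
    -- recursive term: rows whose old value matches a previously seen new value
    SELECT ev.old_email, ev.new_email
    FROM email_versions ev
    JOIN all_emails ae ON ev.old_email = ae.new_email
)
SELECT * FROM all_emails;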
There are a few potentially new concepts in here. First of all, we have WITH
queries, formally called Common Table Expressions (CTEs). A CTE can be used
like a view, but CTEs have the important property of acting as optimization
fences: Postgres is guaranteed to execute each CTE in a query independently.
CTEs are much more than a view, however: they can return data from a DELETE
operation that can subsequently be operated on, they can define updatable
datasets, and, more to the point, they can be self-referential, or recursive.
Recursive CTEs are those that refer to themselves in their own body, such as
the query above. They’re not really recursive but rather iterative; the SQL
standard is the law around here, though, so we live with the name.
Breaking apart the above query, here’s what happens:
SELECT * FROM all_emails; invokes the all_emails CTE.
1. all_emails has two parts, the non-recursive and the recursive terms,
which are separated by the UNION keyword. It could also be UNION ALL;
the former removes duplicates and the latter doesn’t. At this stage, the
non-recursive term is executed and its result placed in a data structure called the working table.
The working table now contains one row:
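With the illustrative data above, that row is:

 old_email | new_email
-----------+-----------------
           | foo@example.com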
2. The contents of the working table are appended to the result of the
recursive query at this point.
3. Then the recursive term is executed, with the self-reference
substituted by the contents of the working table, and the result placed
in another data structure called the intermediate table. Thus, the
intermediate table looks like so:
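Again with the illustrative data:

 old_email       | new_email
-----------------+-----------------
 foo@example.com | bar@example.com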
4. The contents of the working table are now replaced by those of the
intermediate table, and the intermediate table is emptied.
5. Because the working table is not empty, steps 2 through 4 are repeated.
6. When the working table is empty, the query results are returned.
Whether it’s for OLTP/application usage or for ad-hoc reporting, recursive
CTEs can help optimize and simplify the code required to get at the data you
need. To learn more about other powerful operations that can be done with
CTEs, see the official Postgres documentation on WITH queries.
Postgres has a few handy primitives for dealing with distributed process level
locking. Because these locks are atomic and handled by the database, they are
well suited for coordinating concurrent processes’ access to shared resources.
For example, a checkout endpoint on an ecommerce site should be placed behind
such an advisory lock, so that if the user initiates two checkout processes
consecutively, only one occurs. Even if the two requests make it to different
web processes, the second attempt should be gracefully rolled back.
Another example can be seen in worker processes that perform some sort of
action in a loop or on a schedule, but for which only one process should be
executing at a time. You may spin up two such processes to ensure high
availability, but they only operate once a lock is acquired. In this way, the
first process acquires the lock, and the second blocks until the first one
releases it, whether because it crashed or because it gracefully exited.
Postgres has various functions that can be used for creating locks; the topic
of this article is advisory locks. Regular locks are taken implicitly by
database operations such as concurrent writes, and they are guaranteed to be
obeyed no matter what. For example, one process updating a column while
another tries to drop that very same column is a situation that cannot be
handled concurrently, and one will inevitably have to wait for the other to
complete in order to do its job.
In the advisory lock case, the application requests a lock in order to perform
some operation, and releases it when it’s done. It’s up to the application
itself to obey this lock, hence the name. It’s entirely possible for a misbehaved
application to simply ignore the lock.
Advisory lock functions can be categorized as follows:
Session level: those that affect a session or connection. When taking a
session-level lock, if a transaction fails and rolls back, the lock will
continue to be held until it is explicitly released.
Transaction level: locks that are automatically released at the end of a
transaction. In many cases the semantics offered by transaction-level locking
are a better fit for the underlying problem. In general, if the action being
performed is wrapped in a transaction, this is likely the best-suited locking mode.
Beyond the above, locks can be blocking or non-blocking. In the blocking case,
the request for a lock waits until the lock can be obtained, whereas in the
non-blocking case the request returns immediately, indicating with a boolean
whether the lock was obtained.
Let’s look at session level locks first. Opening up two psql sessions we can
experiment with the locking semantics:
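In the sketches below, 23 stands in for an arbitrary application-chosen lock key.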
psql session 1
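SELECT pg_advisory_lock(23);
-- returns immediately; this session now holds the lock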
psql session 2
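SELECT pg_advisory_lock(23);
-- blocks, waiting for session 1 to release the lock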
Here, session 2 just blocks until the lock is acquired. On a third session we
can check in pg_locks and pg_stat_activity to verify this:
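For example, a query along these lines (pg_stat_activity column names vary a bit across Postgres versions):

SELECT l.pid, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory';
-- expect one row with granted = t (session 1) and one with granted = f (session 2)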
Because this is a session-level lock, failing or rolling back a transaction
while holding the lock does not release it at all. Only explicitly unlocking
or disconnecting the client releases the lock.
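Explicit release looks like this:

SELECT pg_advisory_unlock(23);
-- returns t if this session held the lock and has now released it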
Now let’s take a look at how transaction level locks work:
psql session 1
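BEGIN;
SELECT pg_try_advisory_xact_lock(23);
-- returns t; the lock is held until this transaction ends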
Note also that we’re using the try variant of the locking functions. Instead
of blocking until a lock can be obtained, the try variants return immediately
with true or false, depending on whether the lock could be obtained.
On session 2, we can try grabbing a lock as well:
psql session 2
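SELECT pg_try_advisory_xact_lock(23);
-- returns f; session 1's transaction still holds the lock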
It returned false, because the first session already holds said lock. If we
complete that transaction, whether by failure, a COMMIT, or a ROLLBACK, the
lock can then be acquired. Let’s try that:
psql session 1
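COMMIT;
-- ends the transaction, automatically releasing the advisory lock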
After exiting the transaction on one session, the other ought to be able to grab the lock:
psql session 2
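SELECT pg_try_advisory_xact_lock(23);
-- now returns t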
All Postgres locking functions take an application-defined key in the form of
one 64-bit integer or two 32-bit integers. An application may derive CRC
values from higher-level abstractions to supply as locking keys. For example,
a suitable integer for a lock on user ID 4232 may be the CRC of the string
“user4232”. This pattern is implemented in Ruby and Python locking libraries,
and works well as an easy-to-understand application-level abstraction.
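A sketch of that pattern in Ruby (the helper name is hypothetical; Zlib.crc32 is in the standard library and exec_params is part of the pg gem):

require 'zlib'
require 'pg'

# Hypothetical helper: derive a 32-bit advisory lock key from a name
def advisory_key(name)
  Zlib.crc32(name)
end

conn = PG.connect(dbname: 'mydb')
conn.exec_params('SELECT pg_try_advisory_lock($1)', [advisory_key('user4232')])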
Applications in modern software development and delivery are distributed in
nature. Distributed mutexes and locks are an important primitive to help
synchronize and ensure correct behavior in concurrent environments.
Many projects and products with distributed lock managers exist (Zookeeper,
Google Chubby, Doozer, etc), but Postgres provides a lightweight mechanism that
is suitable for many locking needs, without incurring the cost of additional
infrastructure dependencies to an environment that already makes use of
Postgres for data storage.
More information about advisory locks can be found in the official Postgres documentation.
At Heroku we have APIs for pretty much everything. Need logs for an app? Is
that database available? You just beat someone at ping pong? There’s an API for
that. Having such rich datasets available is great. It allows us to build
dashboards with mashups of different datasets and serve them from a web app.
Here are some of the techniques implemented in order to wire up a Backbone.js
application speaking to remote hosts in a secure manner. We will explore
Cross-Origin Resource Sharing (CORS) as well as HMAC based auth tokens with
cryptographically timestamped data that an attacker wouldn’t be able to
auto-replay. The end goal is to have an application running on a browser, and
securely request data from an API running on a remote host.
The first problem when issuing XHR requests across hosts will be the
same-origin policy violation. Go ahead, issue an AJAX request against a remote
host. The browser should fail with an error similar to the following:
XMLHttpRequest cannot load https://some.remote.com. Origin https://your.site.com is not allowed by Access-Control-Allow-Origin
This is where Cross Origin Resource Sharing (CORS)
comes in. The way it works is that the Origin (the client) will issue what’s
called a pre-flight request, asking the server “hey, can I make a request with
HTTP verb foo to path /bar with headers x-baz?”, to which the server responds,
“Sure, bring it!”, or “No, you may not”. This pre-flight request is made to the
same path as the actual request, but the HTTP OPTIONS verb is used instead.
The server responds with the following headers, should the request be allowed:
Access-Control-Allow-Origin: Specifies which Origins are allowed to make
remote XHR requests against this server. Allowed values include a single
origin URL (eg: https://your.site.com) or an asterisk, indicating that all
origins are allowed.
Access-Control-Allow-Headers: Specifies a comma separated list of headers
that the Origin is allowed to include in requests to this server. There are
many reasons to include custom headers - we’ll see an example of this further down.
Access-Control-Max-Age: This is optional, but it allows the browser to
cache this response for the given number of seconds, so browsers will save
themselves the pre-flight request on subsequent requests. Feel free to set it
to a large number, like 30 days (2592000).
There are more headers that allow you to whitelist and otherwise control access to the resource. Be sure to read up on CORS.
Thus, a Sinatra app acting as the remote end of the system can respond to pre-flight OPTIONS requests like so:
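# A sketch; the allowed origin and header values are illustrative
require 'sinatra'

options '/api/*' do
  headers 'Access-Control-Allow-Origin'  => 'https://your.site.com',
          'Access-Control-Allow-Headers' => 'X-App-Auth-Token, Content-Type',
          'Access-Control-Max-Age'       => '2592000'
  200
end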
Inclusion of the Allow Origin and Allow Headers headers is also necessary on responses to any other remote XHR requests. We can extract the headers directive to a helper and use it on both pre-flight and other requests:
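Again a sketch with illustrative values:

helpers do
  def cross_origin!
    headers 'Access-Control-Allow-Origin'  => 'https://your.site.com',
            'Access-Control-Allow-Headers' => 'X-App-Auth-Token, Content-Type'
  end
end

options '/api/*' do
  cross_origin!
  headers 'Access-Control-Max-Age' => '2592000'
  200
end

get '/api/data' do
  cross_origin!
  content_type :json
  '{"hello":"world"}'
end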
And just like that, browsers can now issue XHR requests against remote APIs. Of
course, there is no authentication in place yet.
We will implement an HMAC based auth token mechanism. Both the remote server
and your app share a secret. This secret is used to generate a token containing
a timestamp that is used for validating token recency. The HMAC digest is a
signature that is generated with the shared secret, and it can be used to
verify the authenticity of the entity that generated the token. It answers
the question of whether the client making the API request is authentic.
To generate the token, we create a JSON document containing an issued_at
timestamp, and we calculate its SHA-256 HMAC signature using the secret known
to both parties. Finally, we append this signature to the JSON document and
base64 encode it to make it safe to send over the wire. Here’s an example in Ruby:
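# A sketch; AUTH_SECRET is an illustrative name for the shared secret
require 'json'
require 'time'
require 'base64'
require 'openssl'

secret = ENV['AUTH_SECRET']
data   = { issued_at: Time.now.utc.iso8601 }.to_json
digest = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha256'), secret, data)
token  = Base64.urlsafe_encode64({ data: data, signature: digest }.to_json)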
This token is used on the API server to authenticate requests. The client can
be made to send it in a custom header, let’s call it X_APP_AUTH_TOKEN, from
which the server reconstructs the signature from the JSON data and then
validates that the request is recent enough. For example, in a Sinatra application:
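A sketch of what that verification might look like, mirroring the token format above:

require 'sinatra'
require 'json'
require 'time'
require 'base64'
require 'openssl'

helpers do
  def authenticate!
    raw    = Base64.urlsafe_decode64(env['HTTP_X_APP_AUTH_TOKEN'].to_s)
    token  = JSON.parse(raw)
    digest = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha256'),
                                     ENV['AUTH_SECRET'], token['data'])
    # a real app should use a constant-time comparison here
    halt 401 unless digest == token['signature']
    issued_at = Time.parse(JSON.parse(token['data'])['issued_at'])
    halt 401 if Time.now.utc - issued_at > 120  # older than 2 minutes
  end
end

get '/data' do
  authenticate!
  # ... serve the protected data
end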
In the above code, we consider a token invalid if it was issued more than 2
minutes ago. Real applications will probably include more data in the auth
token, such as the email address or some user identifier that can be used for
auditing and whitelisting. All of the above token generation and
verification has been extracted to a handy little gem called fernet.
Do not reimplement this, just use fernet. In addition to HMAC signatures,
fernet also makes it easy to encrypt the message’s payload, which opens it up
for other interesting use cases.
The authenticate! method must be invoked before serving any request. This
means that the auth token must be included on every request the client makes.
There are many ways of doing this. One approach, if you’re using jQuery to back
Backbone.sync(), is to use its $.ajax beforeSend hook to include the header, as
can be seen in the following CoffeeScript two-liner:
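# Assumes authToken holds the most recently fetched token
$.ajaxSetup
  beforeSend: (xhr) -> xhr.setRequestHeader "X-App-Auth-Token", authToken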
The /auth_token server-side endpoint simply responds with a new valid token.
The fernet token expires every minute by default. I decided to have the client
refresh it every 29 seconds, so that it is renewed at least twice per expiry
window and never has a chance to hold and use an expired token against a remote API.
In this app, the server side is used for one thing only: user authentication.
The way it works is that when a request is made, the sinatra app performs oauth
authentication against our Google Apps domain. Once the oauth dance has
succeeded, the app generates a token that is handed to the client for
authenticating against the backend, remote APIs.
This whole setup has worked great for some months now.
Some developers like writing tests first, letting the tests drive the implementation and essentially become the first clients of the software. Others like writing production code first, figuring out how it works as they go, and applying some tests after the fact. I am convinced that writing tests in this way is far less effective, but this is not an article on the merits of TDD. Still others don’t like writing tests at all, so it’s a varying scale.
You can write software from the bottom up, where you are forced to figure out what the data model and base objects should be and how they are to support all known use cases. This is hard, particularly in this day and age where software requirements are unpredictable and meant to change very rapidly - at least in the fun industries.
You can also write software from the top down. In this case you let the outer shell of your software dictate what the inner layers need. For example, start all the way out at the user interactions, via some sketches and HTML/CSS in a web app, or think through the CLI that you want. Readme driven development can help with this, too.
Outside-in development puts you in a great position to practice top-down design.
The advantage of outside-in development is twofold. Not only are you left with acceptance tests that will catch regressions and help you refactor later, but the top layers of your software also become a client of the code you are about to write, helping define its interface in much the same way that practicing TDD for your unit tests helps guide the design of software components at a very granular level.
These practices will help you define internal APIs that feel good, faster, because you will notice cumbersome interfaces sooner and are therefore given the opportunity to fix them while they have the fewest possible dependencies.
I know of many developers who prefer writing no tests, or who prefer a bottom-up strategy. This does not mean that their software is of poor quality, by any means. But I will observe that I can’t ship software of the quality standard I hold myself to unless I write tests first and follow outside-in methodologies. Indeed, that seems to indicate they are smarter than me.
Boston.io took place yesterday at the Microsoft NERD Center. The event is aimed at students in the Greater Boston area who are interested in entrepreneurship and coding, whether that’s design, development, or ambidextrous.
There were technical talks on every aspect of the stack, from the metal all the way up to serving and consuming APIs and user experience design. There were also some not-so-technical talks about topics like the Boston startup scene, open source, and hacker culture.
My talk was about PostgreSQL, and it focused on how to use SQL to mine data and get information out of it. I stored the twitter stream for about 24 hours prior to the conference and showed how to look at that data, showcasing CTEs, Window Functions, and other Postgres features. Among the insights: the tweeps who post hashtags and URLs the most are spam accounts, and the most-posted URLs come from shortener services, though I was surprised to see livingsocial.com among them. We found other fun facts during the talk, too. It went great.
Hopefully the talk gave a good taste of what you can do with Postgres and SQL. It was SQL heavy, but that did not come without warning:
boston.io talk ready to go. SQL, do you speak it? After this you will!
Other talks worth mentioning included Mike Burns’ classic UNIX talk. This time around he used curl, sed and grep to automate a SQL injection attack on a hilarious page he staged for this purpose. Erik Michaels-Ober’s talk was also great and surely inspired a few students to put code out there for the world to see. So many great talks altogether though, by people like Nick Quaranto on TDD, Ben Orenstein on vim, Bryan Lyles on OOP and more.
Oh, and there was a dude at the afterparty with a heroku shirt. Here’s a photo.
Yesterday was a good day. @pvh and I paired for a few
hours, even though we’re on opposite coasts.
Here’s what you need:
A server somewhere that both of you have access to. We used an EC2 instance.
We’ll use it to meet up and create the tunnel.
SSH client libs - you should already have these unless you’re on Windows, in
which case you probably want PuTTY.
Skype for audio.
As you know, The Internet is made of nodes connected by Tubes. Unfortunately,
there is no tube from your machine to your pair’s machine. What we’ll do here is
use a third node that has tubes to both of your machines to relay traffic
through, in essence creating an Internet Tube from your machine to your pair’s.
This kind of Internet Tube is called a Tunnel. Since we’re using SSH to do the
traffic encryption and forwarding, it’s an SSH Tunnel. Yes, that’s somewhat made
up, but sounds legit!
As it turns out, setting up a tunnel is fairly simple. For example, let’s set
up a tunnel between you and Jane using a remote server saturn.
You: Open up a shell and forward traffic arriving on your local port 9999 over
to saturn’s port 1235:
ssh -L 9999:localhost:1235 saturn_user@saturn
Jane: Open up a shell and forward traffic arriving on saturn’s port 1235 to her
local port 22:
ssh -R 1235:localhost:22 saturn_user@saturn
You: Open up another shell, and ssh into your local port 9999 specifying a
username on Jane’s machine.
ssh jane_user@localhost -p 9999
And you’re good to go. Create a shared screen session, open up $EDITOR, use
Skype, Google Hangouts, FaceTime or whatever for audio, and start ping-ponging.
The latency was surprisingly minimal. We left this tunnel open most of the day.
It sat idle for periods of time and the connection was left active. All in all,
a great setup.
If this is something you’ll do often, you might as well add a more permanent
configuration to ~/.ssh/config. For example, you might add:
~/.ssh/config on your machine
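A sketch, using the host and user names from the example above:

Host tunnel_to_jane
    HostName saturn
    User saturn_user
    LocalForward 9999 localhost:1235

Host jane
    HostName localhost
    Port 9999
    User jane_user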
Then you’d do, on one terminal, ssh tunnel_to_jane, and on the
other ssh jane.
And Jane might add:
~/.ssh/config on Jane’s machine
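Again a sketch:

Host tunnel_from_you
    HostName saturn
    User saturn_user
    RemoteForward 1235 localhost:22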
And she’d just do ssh tunnel_from_you.
This can be used not only for remote pairing, but to forward any
traffic on a port, over an SSH encrypted channel, to a remote host. For more
see ssh_config(5) and ssh(1), and happy pairing!
In case you’ve missed it so far, PostgreSQL West will take place next week in San Jose, California. You should go.
This weekend I worked on the slides and content for my Redis talk. Why would I decide to speak about Redis at a PostgreSQL conference? As it turns out, we’ve had great success using Redis to complement our architecture, not to replace our main data store. I will speak about our experience scaling out write-heavy apps with Redis.
I will first introduce Redis from a general perspective, and then dig into the data types it offers and the operations it can do on each data type. During the course of the talk, I will demonstrate how we’ve used certain data types to solve some rather tricky scaling problems: real use cases solving real problems.
I hope this talk will be entertaining and informative to both DBAs and application developers. I will be sharing the slides here and on twitter afterwards, so stay tuned.
Additionally, on Tuesday I will be teaching a one day workshop on Ruby and Rails with a PostgreSQL focus. This is the second time I will do this at PostgreSQL conf, the last time being at PostgreSQL conf east in New York City. The class size was small, and the feedback I received was very positive in that attendees got a good grasp of the Ruby programming language and how Rails and Postgres fit in the ecosystem. I hope this time around is even better.
pg_basebackup - this can be used alongside Streaming Replication to perform a backup or clone of your database. I can imagine Heroku adopting this as an even faster and more reliable way to sync your production and staging databases, for example (when and if they upgrade to 9.1). However, it can also be used to create plain old tarballs and standalone backups.
Another replication goodie: synchronous replication. In Postgres 9.0, replication was asynchronous. By enabling synchronous replication, you are basically telling the master database to only report a transaction as committed when the slave(s) have successfully written it to their own journal. You can also control this for a specific session by doing SET synchronous_commit TO off.
The new pg_stat_replication view displays the replication status of all slaves from a master node.
Unlogged tables. What? Postgres supports unlogged tables now? Yes, it does. They can be used for data you don’t necessarily care about, since a crash will truncate all of their data. They are much faster to write to, and could be useful for caching data or keeping stats that can be rebuilt. You can create them like so:
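-- a minimal, hypothetical example
CREATE UNLOGGED TABLE page_hits (
    path text,
    hits integer
);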
SQL/MED gets the ability to define foreign tables. This is rich. It means that you can define a foreign data wrapper for a different database and access it from Postgres seamlessly. Some hackers have already built some nifty foreign data wrappers for mysql, oracle, and even redis and couchdb. Although I’m of the mind that if you’re actually using any of these databases to supplement your app’s data needs, just talk to them directly from your app. However, it may be possible to write some higher performance reports that use different data sources, and you let Postgres do the heavy lifting of munging all the data together.
You no longer need to include every selected column in your GROUP BY clause: functionally dependent columns are inferred by the planner, so specifying the primary key is sufficient. I talked about this before, and it’s a cause of great frustration for users coming from MySQL.
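For example, with hypothetical users and orders tables:

SELECT users.id, users.name, count(orders.id)
FROM users
JOIN orders ON orders.user_id = users.id
GROUP BY users.id;
-- users.name need not be listed: it is functionally dependent on the primary key users.id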
Today I made a few design tweaks to this blog. The goal is to move a bit away from the stock default octopress theme - but my design chops aren’t all that great and I can’t really budget the time to do a design from scratch.
A few simple changes and I’m OK with how it looks for now:
I created the basic font logo that you now see on the header. This was the idea from the beginning, but I never had a chance to include it here. Commit: 27d0107f.
Changed the typography on the site, as the original seemed a bit heavy for my taste. I started out using Helvetica Neue on headers, but decided to go with Antic from the Google web font service. Commits: 2dd8f46b4 and 84bad437.
I wanted something different for that dark background. Went hunting for tiles and patterns. There’s good stuff out there, but I settled for the dark wood background found here. Commit: 9593a00431. At this point, it also made sense to make the header and footer have a light background and dark font.
Finally, I added a gradient to the header. Commit: fe918b0e.
I’m fine with the result for now, but will probably revisit soon. Here’s a before and after.