Redis Talk at PostgreSQL Conf West

In case you’ve missed it so far, PostgreSQL West will take place next week in San Jose, California. You should go.

This weekend I worked on the slides and content for my Redis talk. Why would I decide to speak about Redis at a PostgreSQL conference? As it turns out, we’ve had great success in using Redis to complement our architecture, not to replace a main data store. I will speak about our experience in scaling out write-heavy apps with Redis.

I will first introduce Redis from a general perspective, and then dig into the data types it offers and the operations it can do on each data type. During the course of the talk, I will demonstrate how we’ve used certain data types to solve some rather tricky scaling problems: real use cases solving real problems.

I hope this talk will be entertaining and informative to both DBAs and application developers. I will be sharing the slides here and on Twitter afterwards, so stay tuned.

Additionally, on Tuesday I will be teaching a one-day workshop on Ruby and Rails with a PostgreSQL focus. This is the second time I will do this at PostgreSQL Conf, the last time being at PostgreSQL Conf East in New York City. The class was small, and the feedback I received was very positive: attendees came away with a good grasp of the Ruby programming language and of how Rails and Postgres fit into the ecosystem. I hope this time around is even better.

Hope to see you there.

PostgreSQL 9.1 Released

Version 9.1 of PostgreSQL was released yesterday.

Among the exciting new features are:

  • pg_basebackup - this can be used alongside Streaming Replication to perform a backup or clone of your database. I can imagine Heroku adopting this as an even faster and more reliable way to sync your production and staging databases, for example (when and if they upgrade to 9.1). However, it can also be used to create plain old tarballs and standalone backups.

  • Another replication goodie: synchronous replication. In Postgres 9.0, replication was asynchronous. By enabling synchronous replication, you are basically telling the master database to only report a transaction as committed once the slave(s) have successfully written it to their own journal. You can also control this for a specific session by running SET synchronous_commit TO off.

  • The new pg_stat_replication view displays the replication status of all slaves from a master node.

  • Unlogged tables. What? Postgres supports unlogged tables now? Yes, it does. They can be used for data you don’t necessarily care about, and a crash will truncate all data. They are much faster to write to and could be useful for caching data or keeping stats that can be rebuilt. You can create them like so:

CREATE UNLOGGED TABLE some_stats(value int)
  • SQL/MED gets the ability to define foreign tables. This is rich: it means you can define a foreign data wrapper for a different database and access it from Postgres seamlessly. Some hackers have already built nifty foreign data wrappers for MySQL, Oracle, and even Redis and CouchDB. I’m of the mind that if you’re actually using any of these databases to supplement your app’s data needs, you should just talk to them directly from your app. Still, foreign tables could make it possible to write higher-performance reports that pull from different data sources, letting Postgres do the heavy lifting of munging all the data together. A rough sketch of the syntax appears after this list.

  • You no longer need to list every selected column in your GROUP BY clause: functionally dependent columns are inferred by the planner, so grouping by the primary key is sufficient. I talked about this before, and it’s a cause of great frustration for users coming from MySQL.
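
For example, assuming a categories table with primary key id and a questions table with a category_id column (illustrative names, not taken from the release notes), this query is now valid on 9.1:

select categories.id, categories.name, count(questions.id)
from categories
inner join questions on questions.category_id = categories.id
group by categories.id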

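And here is a rough, hedged sketch of the foreign table syntax mentioned above, using the file_fdw module that ships in 9.1’s contrib; the file name and columns are made up for illustration:

-- install the bundled CSV wrapper and point a foreign table at a file
CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE visits (
  visited_on date,
  hits       int
) SERVER csv_files
  OPTIONS (filename '/tmp/visits.csv', format 'csv');

-- queried like any other table
SELECT * FROM visits WHERE hits > 100;
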
There’s much more in this release. Here are the release notes.

Huge thanks, Postgres hackers!

Design Tweaks

Today I made a few design tweaks to this blog. The goal is to move a bit away from the stock Octopress theme - but my design chops aren’t all that great and I can’t really budget the time to do a design from scratch.

A few simple changes and I’m OK with how it looks for now:

  1. I created the basic font logo that you now see on the header. This was the idea from the beginning, but I never had a chance to include it here. Commit: 27d0107f.

  2. Changed the typography on the site, as the original seems a bit heavy for my taste. I started out using Helvetica Neue on headers, but decided to go with Antic from the Google web font service. Commits: 2dd8f46b4 and 84bad437.

  3. I wanted something different for that dark background, so I went hunting for tiles and patterns. There’s good stuff out there, but I settled on the dark wood background found here. Commit: 9593a00431. At this point, it also made sense to give the header and footer a light background and dark text.

  4. Finally, I added a gradient to the header. Commit: fe918b0e.

I’m fine with the result for now, but will probably revisit soon. Here’s a before and after.

Before (screenshot)

After (screenshot)

You Should Work for These Guys

For the last few months we’ve been working for an awesome company on a greenfield project here in Boston/Cambridge.

The industry: Healthcare. The goal: Improve patients’ lives by changing the way the entire system works. It’s exciting, it is happening, and you can be a part of it.

The stack

Rails 3.1, Backbone.js, CoffeeScript, Faye, PostgreSQL, Cucumber, RSpec and Jasmine.

The process

Daily standups, TDD, code reviews via GitHub pull requests.

The result

A highly responsive non-trivial app with a very clean code base and beautiful design.

The future holds a mobile app, web service integrations and ongoing maintenance to the current code base.

If you live in the Boston area, you should apply. If you don’t, you should move here. Right here.

New Domain Name, New Blog Engine

I haven’t touched my blog for a while. Part of it is that I just didn’t identify myself with “Awesomeful” any more. On the other hand, have I got a deal for you! Both awesomeful.net and awesomeful.org are for sale, so hit me up if you’re interested - I’m talking to you mister whois awesomeful.com.

Welcome to the new blog: Practice Over Theory. I hope that the new name and engine inspire me to post more often.

Migrating was not a huge task at all. I decided to give Octopress a try. It prescribes a really weird method of deploying to GitHub Pages which involves cloning yourself into a subdirectory (!!), but now I have a pretty neat setup. It’s backed by Jekyll and has a few nice addons, the most useful of which is its code-highlighting theme, which is based on Solarized.

Speaking of code highlighting, let me show you a little rack app that redirects the old awesomeful.net posts to their new warm locations:

require 'rubygems'
require 'sinatra'

REDIRECTS = {
  'awesomeful-post-1' => 'new-location-1',
  'awesomeful-post-2' => 'new-location-2'
}.freeze

get '/*' do |path|
  one_year_in_seconds = 31536000
  headers 'Cache-Control' => "public, max-age=#{one_year_in_seconds}",
          'Expires' => (Time.now + one_year_in_seconds).httpdate
  redirect to("https://practiceovertheory.com/#{REDIRECTS[path]}"), 301
end

Pretty neat, huh? The syntax highlighting, I mean.

Regarding the above Sinatra app, I just have a dictionary[1] mapping the old paths to the new ones, and respond with an HTTP 301 Moved Permanently. The interesting bit is the HTTP caching employed: Heroku’s (awesome) Varnish servers will remember that response for one year. Try it here.

[1] it’s a hash!

Machine Learning - Who’s the Boss?

In the Machine Learning field, there are two types of algorithms that can be applied to a set of data to solve different kinds of problems: Supervised and Unsupervised learning algorithms. Both aim to extract information or gain knowledge from raw data that would otherwise be very hard and impractical to obtain. This is because we live in very dynamic environments with changing parameters and vast amounts of data being gathered. This data hides important patterns and correlations that are sometimes impossible to deduce manually, and that is where computing power and smart algorithms excel. Both types of algorithms are also heavily dependent on the quantity and quality of the input data, and as such their output and accuracy evolve as more and better input data becomes available.

In this article we will walk through what constitutes Supervised and Unsupervised Learning. An overview of the language and terms is presented, as well as the general workflow used for machine learning tasks.

Supervised Learning

In supervised machine learning we have a set of data points or observations for which we know the desired output, class, target variable or outcome. The outcome may take one of many values, called classes or labels. A classic example: given a few thousand emails for which we know whether they are spam or ham (their labels), the idea is to create a model that is able to deduce whether new, unseen emails are spam or not. In other words, we are creating a mapping function where the inputs are the email’s sender, subject, date, time, body, attachments and other attributes, and the output is a prediction as to whether the email is spam or ham. The target variable in fact provides some level of supervision: it is used by the learning algorithm to adjust parameters or make decisions that allow it to predict labels for new data. Finally, when the algorithm predicts the labels of observations, we call it a classifier. Some classifiers are also capable of providing the probability of a data point belonging to a class, in which case it is often referred to as a probabilistic model or a regression - not to be confused with a statistical regression model.

Let’s work through an example of a supervised learning problem. Given the following dataset, we want to predict whether new emails are spam or not. In the dataset below, note that the last column, Spam?, contains the labels for the examples.

Subject | Date | Time | Body | Spam?
I has the viagra for you | 03/12/1992 | 12:23 pm | Hi! I noticed that you are a software engineer so here’s the pleasure you were looking for… | Yes
Important business | 05/29/1995 | 01:24 pm | Give me your account number and you’ll be rich. I’m totally serial | Yes
Business Plan | 05/23/1996 | 07:19 pm | As per our conversation, here’s the business plan for our new venture. Warm regards… | No
Job Opportunity | 02/29/1998 | 08:19 am | Hi! I am trying to fill a position for a PHP … | Yes
[A few thousand rows omitted]
Call mom | 05/23/2000 | 02:14 pm | Call mom. She’s been trying to reach you for a few days now | No

A common workflow, and the one I’ve taken for supervised learning analysis, is outlined in the steps below.

The process is:

  1. Scale and prepare training data: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.
  2. Create a training set and a validation set by randomly splitting the universe of data. The training set is the data the classifier uses to learn how to classify, whereas the validation set is used to feed the already trained model in order to get an error rate (or other measures and techniques) that can help us assess the classifier’s performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data. Note that there is also cross-validation, but that is beyond the scope of this article. (A small Ruby sketch of this split follows the list.)
  3. Train the model. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.
  4. Validation and tuning: After we’ve created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why in step 2 we separated out a subset of the data that was not used for training. We are really testing our model’s generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed in the training data, and we can achieve a very low error in doing that, but how do those same rules or mappings perform on new data that may have different input-to-output mappings? If the classification error on the validation set is much larger than on the training set, we have to go back and adjust model parameters. The model has essentially memorized the answers seen in the training data, losing its generalization capabilities. This is called overfitting, and there are various techniques for overcoming it.
  5. Validate the model’s performance. There are numerous techniques for achieving this, such as ROC analysis, among others. The model’s accuracy can be improved by changing its structure or the underlying training data. If the model’s performance is not satisfactory, change the model’s parameters, inputs and/or scaling, then go back to step 3 and try again.
  6. Use the model to classify new data. In production. Profit!
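
To make step 2 concrete, here is a small, hedged Ruby sketch of the random split and a naive error-rate check. The 80/20 ratio, the hash layout of the examples, and the model object responding to predict are all assumptions made up for illustration:

# Randomly split labeled examples into a training set and a validation set.
# Each example is assumed to be a hash like { :features => [...], :label => 'spam' }.
def split_dataset(examples, training_ratio = 0.8)
  shuffled = examples.shuffle
  cutoff   = (shuffled.size * training_ratio).floor
  [shuffled[0...cutoff], shuffled[cutoff..-1]]
end

# Naive error rate: the fraction of validation examples the model gets wrong.
# `model` is anything that responds to #predict(features).
def error_rate(model, validation_set)
  wrong = validation_set.count { |ex| model.predict(ex[:features]) != ex[:label] }
  wrong.to_f / validation_set.size
end

# assuming all_examples holds the full labeled dataset loaded elsewhere
training, validation = split_dataset(all_examples)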

Unsupervised Learning

The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different from those suited to supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations from one another. Viewed from 20,000 feet, the main difference is that we are not providing a target variable like we did in supervised learning.

This marks a fundamental difference in how the two types of algorithms operate. Supervised algorithms try to minimize the error in classifying observations, while unsupervised learning algorithms don’t have that luxury because there are no outcomes or labels. Instead, unsupervised algorithms try to create clusters of data that are inherently similar. In some cases we don’t necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and grouping them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data in which the points belonging to a cluster are as similar to each other as possible, while making the differences between clusters as large as possible.

Another main difference is that in a clustering problem, the concept of a “training set” does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships in the data that bucket the observations into different clusters. We could of course apply the same clustering model to new data, but unless it is impractical to do so (perhaps for performance reasons), we will most certainly want to rerun the algorithm on new data, as it will typically find new relationships that surface given the new observations.

As a simple example, you could imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age range, gender and other attributes, in such a way that we can develop targeted marketing programs. Another example may be to cluster patients by their chronic diseases and comorbidities in such a way that targeted interventions can be developed to help manage their diseases and improve their lifestyles.

For unsupervised learning, the process is:

  1. Scale and prepare raw data: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.
  2. Build model: We run the unsupervised algorithm on the scaled dataset to get groups of like observations (a bare-bones k-means sketch follows this list).
  3. Validate: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.
  4. Once we are satisfied with the clusters created, there is no need to run the model on new data (although you can). Profit!
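
To make the clustering workflow concrete, here is a bare-bones, hedged k-means sketch in Ruby. It assumes plain arrays of numeric features, Euclidean distance and a fixed number of iterations - a toy illustration, not a production implementation:

# Minimal k-means: groups numeric feature vectors into k clusters.
def distance(a, b)
  Math.sqrt(a.zip(b).map { |x, y| (x - y)**2 }.reduce(:+))
end

def mean(points)
  size = points.size.to_f
  points.transpose.map { |dim| dim.reduce(:+) / size }
end

def k_means(points, k, iterations = 20)
  centroids = points.sample(k)
  clusters  = []
  iterations.times do
    # Assignment step: put each point in the cluster of its nearest centroid.
    clusters = Array.new(k) { [] }
    points.each do |p|
      nearest = (0...k).min_by { |i| distance(p, centroids[i]) }
      clusters[nearest] << p
    end
    # Update step: move each centroid to the mean of its cluster.
    centroids = clusters.each_with_index.map { |c, i| c.empty? ? centroids[i] : mean(c) }
  end
  clusters
end

# Within-cluster sum of squares, one of the statistics mentioned in step 3.
def within_ss(clusters)
  clusters.reduce(0.0) do |sum, cluster|
    next sum if cluster.empty?
    center = mean(cluster)
    sum + cluster.reduce(0.0) { |s, p| s + distance(p, center)**2 }
  end
end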

Step zero

A common step that I have not outlined above, and that should be performed when working on any such problem, is getting a strong understanding of the characteristics of the data. This should be a combination of visual analysis (for which I prefer the excellent ggplot2 library) and basic descriptive statistics and data profiling such as quartiles, means, standard deviations, frequencies and others. R’s Hmisc package has a great function for this purpose called describe.

I am convinced that skipping this step is a non-starter for any data mining project. Performing it will allow you to identify missing values, the general distribution of the data and early outliers, among many other characteristics that drive the selection of attributes for your models.

Wrapping up

This is certainly quite a bit of info, especially if these terms are new to you. To summarize:

 | Supervised Learning | Unsupervised Learning
Objective | Classify or predict a class. | Find patterns inherent to the data, creating clusters of like data points. Dimensionality reduction.
Example Implementations | Neural Networks (Multilayer Perceptrons, RBF Networks and others), Support Vector Machines, Decision Trees (ID3, C4.5 and others), Naive Bayes Classifiers | K-Means (and variants), Hierarchical Clustering, Kohonen Self Organizing Maps
Who’s the Boss? | The target variable or outcome. | The relationships inherent to the data.

Hopefully this article shows the main differences between Supervised and Unsupervised Learning. In follow-up posts we will dig into some of the specific implementations of these algorithms, with examples in R and Ruby.

Experiences Porting a Helper Plugin to Rails 3

Today I spent a few minutes porting truncate_html to Rails 3. This gem/plugin provides you with the truncate_html() helper method, which is very similar to Rails’ truncate(), but it takes care of closing open HTML tags and other peculiarities of truncating HTML. It works with regular expressions and does not have any dependencies. I use this gem on this blog, as well as on the bostonrb.org site. Some other people have found it to be useful too.

One of the promises of Rails 3 is that there is an API for plugin developers that will allow you to hook into the right parts of Rails to add the functionality that your plugin provides. This means that you should not be mixing in or monkeypatching Rails core willy-nilly. In fact, it is now expected for you as a plugin developer to figure out how to hook into the right parts of Rails using the new API, as opposed to doing something like the following:

ActionView::Base.class_eval do
  include TruncateHtmlHelper
end

At this stage, there isn’t much documentation around what the API actually is. But this shouldn’t stop you from investigating and finding out. In this case, cloning the rails repo and using ack pointed me towards actionpack/lib/action_controller/metal/helpers.rb, where I found all the info I needed to remove the now outdated meta-programmed mixin technique of the dark Rails 2 days. From the docs:

In addition to using the standard template helpers provided in the Rails framework,
creating custom helpers to extract complicated logic or reusable functionality is strongly
encouraged. By default, the controller will include a helper whose name matches that of
the controller, e.g., MyController will automatically include MyHelper.
Additional helpers can be specified using the helper class method in
ActionController::Base or any controller which inherits from it.

Perfect. All I need to do in this case is call the helper class method with my helper’s module: ActionController::Base.helper(TruncateHtmlHelper). A quick run through the app demonstrates, however, that we now need to mark strings as html_safe. Fine, let’s do that: (TruncateHtml::HtmlTruncator.new(html).truncate(options)).html_safe!
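
For a gem, that call typically lives in a Railtie initializer so it runs once the framework has loaded. Here is a minimal, hedged sketch of what I mean - the initializer name and module layout are illustrative, not the gem’s actual code:

require 'rails/railtie'

module TruncateHtml
  # Illustrative Railtie: registers the helper once ActionController loads.
  class Railtie < Rails::Railtie
    initializer 'truncate_html.register_helper' do
      ActiveSupport.on_load(:action_controller) do
        # self is ActionController::Base inside this block
        helper TruncateHtmlHelper
      end
    end
  end
end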

Finally, let’s run the test suite - and facepalm. The way this plugin is set up is that RSpec must be installed in the containing app for it to run the spec suite. Here’s where I ran into the first real issue with the upgrade: I have not been able to install RSpec on a Rails 3 app. I also can’t find any obvious way to do it by browsing its source code. For now I seem to be stuck in limbo land until the the RSpec/Rails 3 affair is all sorted out.

Backward Compatibility

The bigger question is how to maintain backward compatibility. One way to accomplish this is to maintain two git branches, for Rails 2 and Rails 3 (master), and cherry-pick any bug fixes or enhancements from the master branch into the Rails 2 branch. However, how would we manage gem bundling and distribution of two gems built for two versions of Rails? I’d like to know how you are planning on maintaining backward compatibility. In this particular case, I almost don’t care: users will simply have to know that version 0.2.2 of the gem is the latest working Rails 2 version, and must install that specific version when running under Rails 2.
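
Concretely, a Rails 2 app would just pin the gem to that version, for example with Bundler (or the equivalent config.gem call):

# Gemfile of a Rails 2 app: lock to the last Rails 2 compatible release
gem 'truncate_html', '0.2.2'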

Spec Your Yields in RSpec

Message expectations in RSpec’s mocking/stubbing framework provide a means of spec’ing the objects a method yields. For example, consider the following spec, where we expect the here_i_am method to yield self:

describe Triviality do
  describe '#here_i_am' do
    let(:triviality) { Triviality.new }

    it 'yields self' do
      triviality.should_receive(:here_i_am).and_yield(triviality)
      triviality.here_i_am { }
    end
  end
end

Nice and easy. First we set the expectation and then we exercise the method so that the expectation is met, passing it a “no op” block - {}.

Here’s the method to make it pass.

class Triviality
  def here_i_am
    yield self
  end
end

Furthermore, we can test many yielded values by chaining the and_yield method on the expectation. Let’s add a spec for a method that yields many times and see how that would play out:

describe Triviality do
  describe '#one_two_three' do
    let(:triviality) { Triviality.new }

    it 'yields the numbers 1, 2 and 3' do
      triviality.should_receive(:one_two_three).and_yield(1).and_yield(2).and_yield(3)
      triviality.one_two_three { }
    end
  end
end

And the method to make that pass:

class Triviality
  def one_two_three
    yield 1
    yield 2
    yield 3
  end
end

This is kind of ugly, though. What if the method yields many more times, or if you just want to test that it yields every item of an array? A good example of this is Enumerable’s each method. In such cases we can store the MessageExpectation object and call and_yield on it many times, in a loop. Take a look at the following example, where we yield each letter of the alphabet:

describe Triviality do
  describe '#alphabet' do
    let(:triviality) { Triviality.new }

    it 'yields all letters of the alphabet' do
      expectation = triviality.should_receive(:alphabet)
      ('A'..'Z').each { |letter| expectation.and_yield(letter) }
      triviality.alphabet { }
    end
  end
end

And finally, the method to make it pass:

class Triviality
  def alphabet
    ('A'..'Z').each { |letter| yield letter }
  end
end

and_yield is not only useful for message expectations. You can also use it on your stubs, just like you’d use and_return.
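
For example, stubbing the one_two_three method from above and yielding through it looks like this:

triviality = Triviality.new
triviality.stub(:one_two_three).and_yield(1).and_yield(2).and_yield(3)

yielded = []
triviality.one_two_three { |n| yielded << n }
yielded # => [1, 2, 3]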

An Object Quacks Like a Duck

I’ve been toying around with the idea of spec’ing mixins: that a class includes a module. Suppose the following class:

class FooList
  include Enumerable

  attr_accessor :some_array

  def initialize(opts = {})
    @some_array = opts[:the_array] || []
  end

  def each
    @some_array.each { |item| yield item }
  end
end

We can test the behavior of the each method using RSpec, but we can also make sure that FooList actually acts like an Enumerable. Here’s a quick RSpec matcher just for that (require it from your spec_helper.rb):

Spec::Matchers.define :quack_like do |mod|
  match do |instance|
    mod.instance_methods.inject(true) { |accum, method| accum && instance.respond_to?(method) }
  end

  failure_message_for_should do |instance|
    "expected the class #{instance.class.name} to include the module #{mod}"
  end

  failure_message_for_should_not do |instance|
    "expected the class #{instance.class.name} not to include the module #{mod}"
  end

  description do
    "expected the class to behave like a module by responding to all of its instance methods"
  end
end

This allows us to spec some quacking:

describe FooList do
  def foo_list
    @foo_list ||= FooList.new
  end

  it "quacks like an Enumerable" do
    foo_list.should quack_like Enumerable
  end
end

I am still experimenting with this. In a way it is not really testing behavior, but it’s not really testing the implementation either. In other words, if every method in Enumerable is implemented in FooList and we remove the include Enumerable line, the spec still passes.

I’ve discussed this over IRC with some other smart folks, but I want more input . Do you think this is appropriate? Useless?

PostgreSQL’s Group By

Last night I noticed a user on IRC complain on two different channels (#heroku and #rubyonrails), claiming something along the lines of “PostgreSQL sucks: I have this code

named_scope :with_questions,
  :joins => :questions,
  :group => "categories.id, categories.name, categories.created_at, categories.updated_at"

because of the way postgresql handles group by. It should only be "categories.id".”

The user was surprised that this query works on MySQL. Sure enough, the user was getting the PostgreSQL error: ERROR: column "categories.name" must appear in the group by clause or be used in an aggregate function. It turns out that this is not a bug, and PostgreSQL does not suck as this user initially thought. Furthermore, I tried a similar query on MS SQL Server, and it rightfully complains: Column 'categories.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Let’s look at solutions.

Alternative Queries

The first thing that’s wrong about this query is that what the user really wanted was a distinct list of categories that had questions. This is the requirement. To that end, the query should look something like the following two options.

  • Option 1: Drop the join and group by, and just use a condition checking whether a question exists for the category:
Category.all(:conditions =>
  'exists (select 1 from questions where categories.id = questions.category_id)')

A variation of this can be achieved with the in operator:

Category.all(:conditions =>
  'categories.id in (select category_id from questions)')
  • Option 2: Again, drop the group by, and use a distinct instead:
Category.all(:select => 'distinct categories.*',
  :joins => :questions)

Why PostgreSQL doesn’t like the original query

The group by clause is used to collect data from multiple records having common values in a select statement, and project the result based on some aggregate function. It really does not make any sense to add a group by to a query that does not have an aggregate such as sum(), avg(), min(), max(), count(). There is an exception, but we’ll talk about that later.

As an example, we could retrieve every item along with a count of categories per item:

select items.id, items.name, count(categories.id)
from items
inner join categories
  on items.id = categories.item_id
group by items.id, items.name

Note that every non-aggregated column in the select list must appear in the group by list. This is necessary for PostgreSQL to know which items to count (or sum, or calculate the max) over. Let’s walk through a simplified example of what happens if we don’t include one of these columns in the group by list.

Suppose the following table, which we’ll call cities:

code | city
-----+----------
   0 | Cambridge
   0 | Boston
   1 | Foxboro

What happens if we run the following query:

select code, city
from cities
group by code

What would you expect PostgreSQL to return for the rows with a code equal to 0? Cambridge or Boston? When PostgreSQL is presented with an ambiguous query such as the above, it will stop and report an error. Some other databases may go on and make their own decision as to what to return. To me, that is a broken spec. Furthermore, the result set may be inconsistent and unpredictable across DBMSes, or even across queries on the same DB.
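
If you really do want one row per code, you have to tell Postgres how to collapse the cities yourself, for example with an aggregate (using the cities table from the example above):

select code, min(city) as a_city, count(*) as city_count
from cities
group by code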

Exception to the rule

In previous versions of PostgreSQL (pre 8.2), the query plan for a group by was much more efficient than that of a select distinct. In some older Rails apps, we wrote things like the following to optimize performance:

Question.find(:all,
  :group => Question.column_names.join(', '),
  :conditions => '...')

Instead of the more natural:

Question.find(:all,
  :select => 'distinct questions.*',
  :conditions => '...')

This was an optimization that was specific to our environment and helped us avoid the relatively poor query plan and expensive Seq Scan that was slowing our app down.

I hope that after reading this you realize that this error is helping you, as a user, write better SQL. Complaining that the example query doesn’t run on PostgreSQL is like complaining that your new Fender Strat sucks because when you play Here Comes the Sun the very same way you played it on your Beatles Rock Band guitar, it doesn’t sound the same. /endrant