New Domain Name, New Blog Engine

I haven’t touched my blog for a while. Part of it is that I just didn’t identify myself with “Awesomeful” any more. On the other hand, have I got a deal for you! Both awesomeful.net and awesomeful.org are for sale, so hit me up if you’re interested - I’m talking to you mister whois awesomeful.com.

Welcome to the new blog: Practice Over Theory. I hope that the new name and engine inspire me to post more often.

Migrating was not a huge task at all. I decided to give octopress a try. It prescribes a really weird method of deploying to github pages which involves cloning yourself into a subdirectory (!!), but now I have a pretty neat setup. It’s backed by jekyll and has a few nice addons, the most useful of which is its code highlighting theme, which is based on Solarized.

Speaking of code highlighting, let me show you a little rack app that redirects the old awesomeful.net posts to their new warm locations:

require 'rubygems'
require 'sinatra'

REDIRECTS = {
  'awesomeful-post-1' => 'new-location-1',
  'awesomeful-post-2' => 'new-location-2'
}.freeze

get '/*' do |path|
  one_year_in_seconds = 31536000
  headers 'Cache-Control' => "public, max-age=#{one_year_in_seconds}",
          'Expires' => (Time.now + one_year_in_seconds).httpdate
  redirect to("http://practiceovertheory.com/#{REDIRECTS[path]}"), 301
end

Pretty neat, huh? The syntax highlighting, I mean.

Regarding the above sinatra app, I just have a dictionary[1] mapping the old paths to the new ones, and respond with an HTTP 301 Moved Permanently. The interesting bit is the HTTP caching employed: Heroku’s (awesome) varnish servers will remember that response for one year. Try it here.
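
Running this as its own app on Heroku only takes a rackup file. Here’s a minimal sketch; the file name redirect_app.rb is my assumption, not part of the original setup:

# config.ru - a sketch; assumes the sinatra app above lives in redirect_app.rb
require './redirect_app'
run Sinatra::Application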

[1] it’s a hash!

Machine Learning - Who’s the Boss?

In the Machine Learning field, there are two types of algorithms that can be applied to a set of data to solve different kinds of problems: supervised and unsupervised learning algorithms. Both aim to extract information or gain knowledge from raw data that would otherwise be very hard and impractical to obtain. We live in very dynamic environments with changing parameters and vast amounts of data being gathered, and this data hides important patterns and correlations that are sometimes impossible to deduce manually; this is where computing power and smart algorithms excel. Both types of algorithms are also heavily dependent on the quantity and quality of the input data, and as such, their output and accuracy evolve as more and better input data becomes available.

In this article we will walk through what constitutes supervised and unsupervised learning. An overview of the language and terms is presented, as well as the general workflow used for machine learning tasks.

Supervised Learning

In supervised machine learning we have a set of data points or observations for which we know the desired output, class, target variable or outcome. The outcome may take one of many values called classes or labels. A classic example: given a few thousand emails for which we know whether they are spam or ham (their labels), the idea is to create a model that is able to deduce whether new, unseen emails are spam or not. In other words, we are creating a mapping function where the inputs are the email’s sender, subject, date, time, body, attachments and other attributes, and the output is a prediction as to whether the email is spam or ham. The target variable is in fact providing some level of supervision, in that it is used by the learning algorithm to adjust parameters or make decisions that will allow it to predict labels for new data. Finally, note that when an algorithm predicts labels of observations we call it a classifier. Some classifiers are also capable of providing a probability of a data point belonging to a class, in which case they are often referred to as probabilistic models or regressions - not to be confused with a statistical regression model.

Let’s use this as an example of supervised learning. Given the following dataset, we want to predict whether new emails are spam or not. In the dataset below, note that the last column, Spam?, contains the labels for the examples.

| Subject | Date | Time | Body | Spam? |
| --- | --- | --- | --- | --- |
| I has the viagra for you | 03/12/1992 | 12:23 pm | Hi! I noticed that you are a software engineer, so here’s the pleasure you were looking for… | Yes |
| Important business | 05/29/1995 | 01:24 pm | Give me your account number and you’ll be rich. I’m totally serial | Yes |
| Business Plan | 05/23/1996 | 07:19 pm | As per our conversation, here’s the business plan for our new venture. Warm regards… | No |
| Job Opportunity | 02/29/1998 | 08:19 am | Hi! I am trying to fill a position for a PHP … | Yes |
| … | … | … | [A few thousand rows omitted] | … |
| Call mom | 05/23/2000 | 02:14 pm | Call mom. She’s been trying to reach you for a few days now | No |

A common workflow approach, and one that I’ve taken for supervised learning analysis, is outlined below.

The process is:

  1. Scale and prepare training data: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.
  2. Create a training set and a validation set by randomly splitting the universe of data. The training set is the data that the classifier uses to learn how to classify, whereas the validation set is used to feed the already trained model in order to get an error rate (or other measures and techniques) that can help us identify the classifier’s performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data; see the sketch after this list. Note that there is also cross-validation, but that is beyond the scope of this article.
  3. Train the model. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.
  4. Validation and tuning: After we’ve created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why in step 2 we separated out a subset of the data that was not used for training: we are testing the model’s generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed in the training data, and we can achieve a very low error in doing so, but how do those same rules or mappings perform on new data that may have different input-to-output mappings? If the classification error on the validation set is much larger than on the training set, we have to go back and adjust model parameters. The model has essentially memorized the answers seen in the training data, losing its generalization capabilities. This is called overfitting, and there are various techniques for overcoming it.
  5. Validate the model’s performance. There are numerous techniques for achieving this, such as ROC analysis and many others. The model’s accuracy can be improved by changing its structure or the underlying training data. If the model’s performance is not satisfactory, change the model parameters, inputs and/or scaling, go to step 3 and try again.
  6. Use the model to classify new data. In production. Profit!
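
To make step 2 concrete, here is a minimal sketch in plain Ruby of the random 80/20 split and the validation error check. The emails array and the classifier object are hypothetical stand-ins, not a real library:

# A sketch of the train/validate split from step 2. `emails` is a
# hypothetical array of [feature_vector, label] pairs.
def split_dataset(examples, training_fraction = 0.8)
  shuffled = examples.shuffle
  cutoff = (shuffled.size * training_fraction).floor
  [shuffled[0...cutoff], shuffled[cutoff..-1]]
end

training_set, validation_set = split_dataset(emails)

# Assuming a hypothetical classifier object with #train and #predict:
classifier.train(training_set)
misses = validation_set.count { |features, label| classifier.predict(features) != label }
error_rate = misses.fdiv(validation_set.size)

If the validation error rate comes out much larger than the training error rate, revisit step 4: the model is likely overfitting.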

Unsupervised Learning

The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different from those suited for supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations in one way or another. Viewed from 20 thousand feet, the main difference is that we are not providing a target variable like we did in supervised learning.

This marks a fundamental difference in how the two types of algorithms operate. Supervised algorithms try to minimize the error in classifying observations, while unsupervised algorithms don’t have that luxury because there are no outcomes or labels; instead, they try to create clusters of inherently similar data. In some cases we don’t necessarily know what makes the observations similar, but the algorithms are capable of finding these relationships between data points and grouping them in significant ways. The aim is to create groups or subsets of the data where the points within a cluster are as similar to each other as possible, while the differences between clusters are as large as possible.

Another main difference is that in a clustering problem the concept of a “training set” does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships that bucket observations into different clusters. We could of course apply the same clustering model to new data, but unless it is impractical to do so (perhaps for performance reasons), we will most likely want to rerun the algorithm on new data, as new relationships may surface given the new observations.

As a simple example, you could imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age range, gender and other attributes, in such a way that we can develop targeted marketing programs. Another example may be to cluster patients by their chronic diseases and comorbidities, in such a way that targeted interventions can be developed to help manage their diseases and improve their lifestyles.

For unsupervised learning, the process is:

  1. Scale and prepare raw data: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.
  2. Build model: We run the unsupervised algorithm on the scaled dataset to get groups of like observations (a minimal k-means sketch follows this list).
  3. Validate: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.
  4. Once we are satisfied with the clusters created there is no need to run the model with new data (although you can). Profit!
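
As an illustration of step 2, here is a minimal k-means sketch in plain Ruby. The two-dimensional points and the choice of k are assumptions for the example, not a prescription:

# A toy k-means: `points` is an array of [x, y] pairs. Illustrative only.
def distance_squared(a, b)
  (a[0] - b[0])**2 + (a[1] - b[1])**2
end

def kmeans(points, k, iterations = 10)
  centroids = points.sample(k)
  clusters = {}
  iterations.times do
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = points.group_by do |p|
      (0...k).min_by { |i| distance_squared(p, centroids[i]) }
    end
    # Update step: move each centroid to the mean of its members.
    (0...k).each do |i|
      members = clusters[i]
      next if members.nil? || members.empty?
      centroids[i] = [
        members.map { |p| p[0] }.inject(:+).fdiv(members.size),
        members.map { |p| p[1] }.inject(:+).fdiv(members.size)
      ]
    end
  end
  clusters
end

# Hypothetical usage: bucket customers into three groups by two attributes.
clusters = kmeans(customer_points, 3)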

Step zero

A common step that I have not outlined above, and that should be performed when working on any such problem, is to get a strong understanding of the characteristics of the data. This should be a combination of visual analysis (for which I prefer the excellent ggplot2 library) as well as some basic descriptive statistics and data profiling such as quartiles, means, standard deviations and frequencies. R’s Hmisc package has a great function for this purpose called describe.

I am convinced that skipping this step is a non-starter for any datamining project. Performing it will allow you to identify missing values, the general distributions of the data and early outliers, among many other characteristics that drive the selection of attributes for your models.
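
If you are profiling data in Ruby rather than R, a quick-and-dirty stand-in for describe might look like this sketch, where values is a hypothetical numeric column:

# Rough descriptive statistics for one numeric column (nearest-rank quartiles).
def describe(values)
  sorted = values.sort
  n = sorted.size
  mean = sorted.inject(:+).fdiv(n)
  variance = sorted.inject(0.0) { |sum, v| sum + (v - mean)**2 } / n
  quartile = lambda { |q| sorted[(q * (n - 1)).round] }
  { :n => n, :mean => mean, :sd => Math.sqrt(variance),
    :min => sorted.first, :q1 => quartile.call(0.25),
    :median => quartile.call(0.5), :q3 => quartile.call(0.75),
    :max => sorted.last }
end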

Wrapping up

This is certainly quite a bit of info, especially if these terms are new to you. To summarize:

| | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Objective | Classify or predict a class. | Find patterns inherent to the data, creating clusters of like data points. Dimensionality reduction. |
| Example Implementations | Neural Networks (Multilayer Perceptrons, RBF Networks and others), Support Vector Machines, Decision Trees (ID3, C4.5 and others), Naive Bayes Classifiers | K-Means (and variants), Hierarchical Clustering, Kohonen Self-Organizing Maps |
| Who’s the Boss? | The target variable or outcome. | The relationships inherent to the data. |

Hopefully this article shows the main differences between unsupervised and supervised learning. In follow-up posts we will dig into some of the specific implementations of these algorithms, with examples in R and Ruby.

Experiences Porting a Helper Plugin to Rails 3

Today I spent a few minutes porting truncate_html to Rails 3. This gem/plugin provides you with the truncate_html() helper method, which is very similar to rails’ truncate(), but it takes care of closing open html tags and other peculiarities of truncating HTML. It works by using regular expressions and does not have any dependencies. I use this gem on this blog, as well as on the bostonrb.org site. Some other people have found it to be useful.

One of the promises of Rails 3 is that there is an API for plugin developers that will allow you to hook into the right parts of Rails to add the functionality that your plugin provides. This means that you should not be mixing in or monkeypatching Rails core willy-nilly. In fact, it is now expected for you as a plugin developer to figure out how to hook into the right parts of Rails using the new API, as opposed to doing something like the following:

ActionView::Base.class_eval do
  include TruncateHtmlHelper
end

At this stage, there isn’t much documentation around what the API actually is. But this shouldn’t stop you from investigating and finding out. In this case, cloning the rails repo and using ack pointed me towards actionpack/lib/action_controller/metal/helpers.rb, where I found all the info I needed to remove the now outdated meta-programmed mixin technique of the dark Rails 2 days. From the docs:

In addition to using the standard template helpers provided in the Rails framework,
creating custom helpers to extract complicated logic or reusable functionality is strongly
encouraged. By default, the controller will include a helper whose name matches that of
the controller, e.g., MyController will automatically include MyHelper.

Additional helpers can be specified using the helper class method in
ActionController::Base or any controller which inherits from it.

Perfect. All I need to do in this case is call the helper class method with my helper’s module: ActionController::Base.helper(TruncateHtmlHelper). A quick run through the app demonstrates, however, that we now need to mark strings as html_safe. Fine, let’s do that: (TruncateHtml::HtmlTruncator.new(html).truncate(options)).html_safe!
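
For a gem, the natural Rails 3 home for that call is a Railtie. Here’s a minimal sketch of what that could look like; the file and initializer names are my assumptions, not the gem’s actual code:

# lib/truncate_html/railtie.rb - a sketch, not the gem's actual source.
require 'rails/railtie'

module TruncateHtml
  class Railtie < Rails::Railtie
    initializer 'truncate_html.register_helper' do
      # Defer until ActionController is loaded, then register the helper.
      ActiveSupport.on_load(:action_controller) do
        ActionController::Base.helper(TruncateHtmlHelper)
      end
    end
  end
end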

Finally, let’s run the test suite - and facepalm. The way this plugin is set up, RSpec must be installed in the containing app for the spec suite to run. Here’s where I ran into the first real issue with the upgrade: I have not been able to install RSpec in a Rails 3 app, and I can’t find any obvious way to do it by browsing its source code. For now I seem to be stuck in limbo until the RSpec/Rails 3 affair is all sorted out.

Backward Compatibility

The bigger question is how to maintain backward compatibility. One way to accomplish this is to maintain two git branches for Rails 2 and Rails 3 (master), cherry-picking any bug fixes or enhancements from the master branch into the Rails 2 branch. However, how could we manage gem bundling and distribution of two gems built for two versions of Rails? I’d like to know how you are planning on maintaining backward compatibility. In this particular case, I almost don’t care: users will simply have to know that version 0.2.2 of the gem is the latest working Rails 2 version, and must install that specific version when running under Rails 2.
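
For what it’s worth, pinning that last compatible release in a Rails 2 app is a one-liner; a sketch using Rails 2’s config.gem:

# config/environment.rb in a Rails 2 app - pin the last Rails 2 release:
config.gem 'truncate_html', :version => '0.2.2'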

Spec Your Yields in RSpec

Message expectations in RSpec’s mocking/stubbing framework provide a means of spec’ing the objects that a method yields. For example, consider the following spec where we expect the here_i_am method to yield self:

describe Triviality do
  describe '#here_i_am' do
    let(:triviality) { Triviality.new }

    it 'yields self' do
      triviality.should_receive(:here_i_am).and_yield(triviality)
      triviality.here_i_am { }
    end
  end
end

Nice and easy. First we set the expectation and then we exercise the method so that the expectation is met, passing it a “no op” block - {}.

Here’s the method to make it pass.

class Triviality
  def here_i_am
    yield self
  end
end

Furthermore, we can test many yielded values by chaining the and_yield method on the expectation. Let’s add a spec for a method that yields many times and see how that would play out:

describe Triviality do
  describe '#one_two_three' do
    let(:triviality) { Triviality.new }

    it 'yields the numbers 1, 2 and 3' do
      triviality.should_receive(:one_two_three).and_yield(1).and_yield(2).and_yield(3)
      triviality.one_two_three { }
    end
  end
end

And the method to make that pass:

class Triviality
  def one_two_three
    yield 1
    yield 2
    yield 3
  end
end

This is kind of ugly though. What if the method yields many more times, or if you just want to test that it yields every item of an array? A good example of this is Enumerable’s each method. In such cases we can store the MessageExpectation object and call and_yield on it many times, in a loop. Take a look at the following example where we yield each letter of the alphabet:

describe Triviality do
  describe '#alphabet' do
    let(:triviality) { Triviality.new }

    it 'yields all letters of the alphabet' do
      expectation = triviality.should_receive(:alphabet)
      ('A'..'Z').each { |letter| expectation.and_yield(letter) }
      triviality.alphabet { }
    end
  end
end

And finally, the method to make it pass:

class Triviality
  def alphabet
    ('A'..'Z').each { |letter| yield letter }
  end
end

and_yield is not only useful for message expectations. You can also use it on your stubs, just like you’d use and_return.
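
For example, here’s a quick sketch of and_yield on a stub (the fake file object is hypothetical), so the code under test never touches the filesystem:

# Stub File.open so it yields a fake handle instead of opening a real file.
fake_file = mock('file', :read => 'contents')
File.stub!(:open).and_yield(fake_file)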

An Object Quacks Like a Duck

I’ve been toying around with the idea of spec’ing mixins: that a class includes a module. Suppose the following class:

class FooList
  include Enumerable

  attr_accessor :some_array

  def initialize(opts = {})
    @some_array = opts[:the_array] || []
  end

  def each
    @some_array.each { |item| yield item }
  end
end

We can test the behavior of the each method using RSpec, but we can also make sure that FooList actually acts like an Enumerable. Here’s a quick RSpec matcher just for that (require it in your spec_helper.rb):

Spec::Matchers.define :quack_like do |mod|
  match do |instance|
    mod.instance_methods.inject(true) { |accum, method| accum && instance.respond_to?(method) }
  end

  failure_message_for_should do |instance|
    "expected the class #{instance.class.name} to include the module #{mod}"
  end

  failure_message_for_should_not do |instance|
    "expected the class #{instance.class.name} not to include the module #{mod}"
  end

  description do
    "expected the class to behave like a module by responding to all of its instance methods"
  end
end

This allows us to spec some quacking:

describe FooList do
  def foo_list
    @foo_list ||= FooList.new
  end

  it "quacks like an Enumerable" do
    foo_list.should quack_like Enumerable
  end
end

I am still experimenting with this. In a way it is not really testing behavior, but it’s not really testing the implementation either. In other words, if every method in Enumerable is implemented in FooList and we remove the include Enumerable line, the spec still passes.

I’ve discussed this over IRC with some other smart folks, but I want more input. Do you think this is appropriate? Useless?

PostgreSQL’s Group By

Last night I noticed a user on IRC complaining on two different channels (#heroku and #rubyonrails), claiming something along the lines of “PostgreSQL sucks: i have this code

named_scope :with_questions,
  :joins => :questions,
  :group => "categories.id, categories.name, categories.created_at, categories.updated_at"

because of the way postgresql handles group by. It should only be "categories.id".”

The user was surprised that this query works on MySQL. Sure enough, the user was getting the PostgreSQL message: ERROR: column "categories.name" must appear in the GROUP BY clause or be used in an aggregate function. It turns out that this is not a bug, and PostgreSQL does not suck as this user initially thought. Furthermore, I tried a similar query on MS SQL Server, and it rightfully complains: Column 'categories.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Let’s look at solutions.

Alternative Queries

The first thing that’s wrong about this query is that what the user really wanted was a distinct list of categories that had questions. This is the requirement. To that end, the query should look something like the following two options.

  • Option 1: Drop the join and group by, and just use a condition checking whether a question exists for the category:
Category.all(:conditions =>
  'exists (select 1 from questions where categories.id = questions.category_id)')

A variation of this can be achieved with the in operator:

Category.all(:conditions =>
  'categories.id in (select category_id from questions)')
  • Option 2: Again, drop the group by, and use a distinct instead:
Category.all(:select => 'distinct categories.*',
             :joins => :questions)

Why PostgreSQL doesn’t like the original query

The group by clause is used to collect data from multiple records having common values in a select statement, and project the result based on some aggregate function. It really does not make any sense to add a group by to a query that does not have an aggregate such as sum(), avg(), min(), max(), count(). There is an exception, but we’ll talk about that later.

As an example, we could retrieve every item along with a count of categories per item:

select items.id, items.name, count(categories.id)
from items
inner join categories
  on items.id = categories.item_id
group by items.id, items.name

Note that every non-aggregated column in the select list must appear in the group by list. This is necessary for PostgreSQL to know which items to count (or sum, or calculate the max of). Let’s walk through a simplified example of what happens if we don’t include one of these columns in the group by list.

Suppose the following table:

code | city
-----|----------
0    | Cambridge
0    | Boston
1    | Foxboro

What happens if we run the following query:

select code, city
from table
group by code

What would you expect PostgreSQL to return for the row with a code equal to 0? Cambridge or Boston? When PostgreSQL is presented with an ambiguous query such as the above, it will stop and report an error. Some other databases may go on and make their own decision as to what to return. To me, this is a broken spec. Furthermore, the result set may be inconsistent and unpredictable across DBMSes, or even across queries on the same DB.

Exception to the rule

In previous versions of PostgreSQL (pre 8.2), the query plan for a group by was much more efficient than that of a select distinct. In some older Rails apps, we wrote things like the following to optimize performance:

Question.find(:all,
  :group => Question.column_names.join(', '),
  :conditions => '...')

Instead of the more natural:

Question.find(:all,
  :select => 'distinct questions.*',
  :conditions => '...')

This was an optimization that was specific to our environment and helped us avoid the relatively poor query plan and expensive Seq Scan that was slowing our app down.

I hope that after reading this you realize that this error is helping you as a user write better SQL. Complaining that the example query doesn’t run on PostgreSQL is like complaining that your new Fender Strat sucks because when you play Here Comes the Sun the very same way you played it on your Beatles Rock Band guitar, it doesn’t sound the same. /endrant

Three ActiveRecord Model Utility Methods

Lately I’ve found myself implementing at least two of these three utility methods in most of the ActiveRecord models in my Rails apps. These methods leverage functionality of either Ruby itself or Rails that becomes helpful in other parts of your application.

The labeler: #to_s

Defining a to_s method enables a very clean way to display a label that represents your model instance. Suppose you have a Post model. You could define a #to_s method similar to the following:

def to_s
  title
end

Now your views can simply do something like the following HAML:

.post
  .h2= @post

The quantifier: #<=>

This method is useful when you consistently need to be able to respond to the question “Which of these is bigger/better/taller/fatter/purtier?”. Which metric defines the quantity is completely up to you, but what’s important is that defining this method will allow you to call the Enumerable goodies #min, #max, and #sort. For example, and following the Post model example above, you could imagine the following method definition.

def <=>(other)
  rating <=> other.rating
end

Assuming you have a recent named_scope that returns the latest n posts, you could do things like Post.recent.max or Post.recent.sort.
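
One small extension worth noting: if you also include Comparable in the model, the same #<=> buys you <, >, ==, and between? for free. A sketch:

# Comparable builds its operators on top of #<=>.
class Post < ActiveRecord::Base
  include Comparable

  def <=>(other)
    rating <=> other.rating
  end
end

# post_a > post_b and post_a.between?(post_b, post_c) now just work.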

The SEOer: #to_param

Finally, this method is leveraged by Rails itself. You can take advantage of the fact that the :id parameter in Rails URLs can be any sequence of digits followed by any non-whitespace characters. Rails will ignore the characters after the leading digits and use just the integer prefix as the :id. It is customary to use the post’s title as part of the URL. An easy way to achieve this in Rails is to define the following method on your model:

def to_param
  "#{id}-#{title.parameterize}"
end

Rails will call the to_param method when generating post_url(@post), and your URLs shall become slugs, in SEO terms.
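
The reason lookups keep working is that the slug’s leading digits survive the integer cast; a quick sketch:

# String#to_i stops at the first non-digit, so the record id is preserved:
"42-three-activerecord-model-utility-methods".to_i # => 42

# The standard controller code therefore works unchanged, since ActiveRecord
# casts the id to an integer for the lookup:
Post.find(params[:id]) # params[:id] => "42-three-activerecord-model..."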

On the Subject of Enterprise Reporting

Apparently this is how reporting works in big corporate environments.

  1. We have a bunch of data, terabytes. There’s got to be something in there.
  2. After a few months of cranking, the analyst comes up with a one page report with significant statistics and a couple of charts. This is the first, alpha version of the report.
  3. Manager receives report. Loves it, since it’s the first time they can see the deficiencies in the overall system, and sees opportunities for savings, optimization, improvement.
  4. Manager passes it along to another set of managers, maybe by region, for constructive feedback.
  5. A few weeks later, all feedback is taken in, and the analyst implements all of the new requirements into the new, 40 page report.
  6. Report is sent out quarterly, where it sits on top of a pile labeled TO READ.


Small Collection of Useful Vim Tricks

I’ve been using vim for the past couple of years. Vim is a cross-platform programmer’s editor that is highly customizable and has a ton of well known and not-so-well-known features. As time goes by, I keep coming across new or better ways of doing things that increase productivity. What’s most important is to force yourself to start using them so that they become part of your workflow and, eventually, muscle memory.

Here’s a small collection of tips that I’ve picked up along the way:

Moving around files

  • I never bought into NERDTree or any of the other explorer plugins. I usually jump around using gf, using Rails.vim’s :R, :A, etc, or you can open up vim’s built in explorer using :Ex. :Sex opens it up in a split window.

  • Use vim marks to jump around any number of files. Here’s a quick reference

  • gf will open the filename under the cursor. Rails.vim extends this by being smart about view partials. You can also use <C-w>gf to open the file in a new tab.

  • Use ctags. <C-]> on any method name or attribute jumps to the declaration point (or <C-w>] to open it in a new window). Use <C-t> to come back; the tag stack is maintained. Then, :ta to jump to a tag by name. I use F5 to rebuild my ctags database: map <F5> :!ctags -R --exclude=.svn --exclude=.git --exclude=log *<CR>

  • Learn how to use windows and tabs: it can really improve your work flow. I usually have multiple tabs open with different areas of the project. On each tab, I may have up to four windows open. I use tabs when it’s a different context, for example, a CSS and a view file on one tab, and a controller, cucumber feature and step definitions on another tab, each in its own window. When working with many windows, it’s useful to quickly resize them: <C-w>_ maximizes the viewing area of the active window (for vertical windows, <C-w>| does the same). You can come back to equally sized windows with <C-w>=. The easiest way for me to move around tabs is using gt and gT. To move around windows, I typically use <C-w> along with one of h, j, k, l.

More efficient editing

  • Automate simple tasks with macros: q{reg} starts recording and q stops. Then invoke the macro n times with n@{reg}. Learn about recursive macros to run one until EOF.

  • Use your registers. Yanking (and putting) and recording macros to different registers is most useful. Use :reg to view contents of your registers.

  • Quickly change text surrounded by something using ci followed by the delimiter. For example, to change a string in double quotes use ci", or to change parenthesized parameters, use ci(.

  • When working with large number of files, you can run the same command on all buffers/windows/tabs by using :bufdo/:windo/:tabdo. Such a time saver.

  • Learn how to use vim’s substitute command with regular expressions and the grouping operators \( and \). You can morph a file at will using the matched parts of the pattern via \1, \2. Throw c at the end of the command for interactive (confirm each match) substitution. Here’s a good vim regex guide

  • Vim’s built in autocompletion rocks. Use <C-n> or <C-p> in insert mode to activate it and move through the options.

  • When moving within a file, f{char} moves the cursor forward to the next occurrence of {char}; ; repeats the action. You can move to the second occurrence of {char} by using 2f{char}. F does the same, but backwards.

  • Learn the different ways to enter insert mode. The ones I use the most, other than i are

    • a to enter insert mode at the next character (as opposed to i, which enters insert mode before the current character).
    • A to enter insert mode at the end of the line.
    • I to enter insert mode at the beginning of the line.
    • O to create a new line above the current line and enter insert mode.
    • o to create a new line below the current line and enter insert mode.
  • Block selection is magical. Use it when you want to select a vertical block or column to yank it, delete it, etc. To enter visual block select, do <C-v>. You can also insert arbitrary text before the block with I{text}<Esc>. For example, this is very useful for inserting the same text on a set of aligned <li>s (like adding a class=”foo” attribute) or after hash rockets or equal signs in your Ruby code - yet another reason to align these properly.

  • <C-e> and <C-y> scroll the buffer without moving the cursor. Make them scroll three lines at a time with: nnoremap <C-e> 3<C-e> and nnoremap <C-y> 3<C-y>

  • Use % to jump to the matching open or closing bracket/parenthesis. Use the matchit plugin to have it match more than single characters; for example, def matches end in Ruby code, and an opening tag matches its closing tag in HTML.

Other miscellanea

  • For bash style command tab completion, put set wildmode=list:longest on your vimrc. Then tab away when running commands on ex mode.

  • View unprintable characters with :set list. Cancel out with :set nolist.

  • Analyzing log files with vim is also a win. For example, you may be in a rails app and used Rails.vim’s :Rlog command. Now you can use the global command to surface up the specific pattern you’re looking for using :g/{pattern}/p, or you may remove any useless lines with :g/{useless}/d.

  • When pasting something, vim will autoindent which is usually not what you want. Use :set paste before the paste to avoid that.

  • Use gg=G to apply proper indentation to a messy source file. I define 2 space “tabs” for all file types by using set tabstop=2 shiftwidth=2 expandtab.

  • Saving and reopening a vim session is easy using :mksession! and :source. I have the following two maps in my vimrc to make this even easier with F2 and F3:

        map <F2> :mksession! ~/vim_session<CR>  " Quick write session with F2
        map <F3> :source ~/vim_session<CR>      " And load session with F3

  • Ctrl-L redraws the screen (to fix broken syntax highlighting). The following mapping will also remove search highlighting on redraw: nnoremap <C-l> :nohls<CR><C-l>

  • About those darn backup files infecting your project: keep them, but in a different location - you may come to a crash, and you’ll need them. I created a directory called ~/.vim_backups, and vim puts them there. The relevant lines are:

        silent execute '!mkdir -p ~/.vim_backups'
        set backupdir=~/.vim_backups//
        set directory=~/.vim_backups//

  • I usually write blog posts in Vim using markdown. I use vim’s integrated spell checking by setting spelllang=en_us in my vimrc. To invoke it, use :set spell, which will underline misspelled words, and z= brings up suggestions for the misspelled word under the cursor.

  • Word wrapping is sluggish by default. I added these lines to my .vimrc which do a few things: w toggles between wrap and no wrap. It adds a > character to new lines, making it obvious that you’re on a wrapping line. It tries to wrap on word boundaries, and finally, it remaps k and j so that you’re moving up/down on visual lines, not actual lines.

  • Use set showmatch and set mat=5 in your .vimrc to blink matching parentheses or brackets. Unobtrusive and useful.

I am sure I’m missing many, but this is what I use most often. Got any tips worth sharing or even better, improvements to the above?

Resources

The following are books, tutorials or articles that I’ve found useful in learning and improving my vim workflow.

On the Subject of E-voting

May sound bitter, but this is exactly how I feel about e-voting. Not only are there many documented issues, but it is just too easy to under/over count. Draw your own conclusions.
