Formatting with Vim Scripts

Written by Pete Corey on Oct 16, 2017.

Over the years my brain has become wired to think in Vim when it comes to editing text. For example, when I want to change an if guard, I don’t want to click and then hammer <Del><Del><Del><Del><Del><Del>. Instead, I simply change (c) inside (i) parentheses (().

The Vim Way™ of doing things has become so completely ingrained into me that I find myself reaching for Vim when it comes to any and every task involving editing text.

I’ve even incorporated Vim into my writing workflow. Check out how I use Vim scripts to turn Markdown and Git logs into the articles on this site.

Formatting Markdown

I do the majority of my writing in Ulysses. While it’s a fantastic writing application, it’s very stubborn about its Markdown formatting. Combined with my even more stubborn Jekyll setup, I usually have to extensively massage every article before it’s formatted well enough to post.

This usually means doing things like decorating inline code blocks with language declarations, wrapping paragraph-style code blocks in <pre><code> tags, and other general formatting changes.

Originally, I did all of this work manually. Needless to say, that was a very time-consuming and error-prone process.

While researching ways of automating the process, I had a recurring thought:

I know how to make all of these changes in Vim. Can’t I just script them?

It turns out I could! A Vim “script” can be created by writing Vim commands directly into a file and then running that script file against the file you want to modify. Armed with the global and normal Vim commands, anything is possible.


As a quick introduction to this idea, imagine we have a script file called script.vim:

:%s/cat/dog/g

And now imagine we have a file we want to modify:

I love cats!
Don't you love cats?

Our script will replace all instances of cat with dog. We can run this script against our file by executing the following command and saving the resulting Vim buffer:

vim -s script.vim text

Now that we have the basics under our belt, let’s take things up a notch. Here’s the Vim script I use to transform the markdown produced by Ulysses into a format accepted by Jekyll:

:%s/\t/    /g
:%s/’/'/g
:%s/“/"/g
:%s/”/"/g
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>"
:g/pre class/exe "normal! }O</code></pre>"
:g/pre class/exe "normal! V}2<"
:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g

The first line replaces all tab characters with four spaces. The next three lines replace curly Unicode quotes with their straight ASCII equivalents.

The fifth line is especially interesting. It matches every blank line whose next line starts with four spaces, switches into normal mode, opens a new line below the match, and inserts my <pre ...><code ...> tags.

The next line in the script finds where each code block ends and closes those tags.
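As a rough before-and-after sketch (glossing over the exact dedenting and the specific language classes), an indented block in the Ulysses output like this:

    defmodule Example do
    end

ends up wrapped and dedented into something like this:

<pre ...><code ...>
defmodule Example do
end
</code></pre>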

If this script is saved in a file called clean-post.vim, I can run it on a markdown file with the following command:

vim -s ./clean-post.vim ./_posts/post.markdown

As an added benefit, I can review the results of the transformation in Vim before saving the changes.

Formatting Git Logs

Similarly, I have a script dedicated to formatting my literate commit posts. This script is a bit different.

Instead of formatting Markdown generated by Ulysses, it formats an entire Git log (specifically, the output of git log --reverse --full-diff --unified=1 -p .) into a human-readable article:

:g/^Author:/d
:g/^Date:/d
:g/^commit/exe "normal! dwdwjjwwi[\<Esc>A](/commit/\<Esc>pA)\<Esc>"
:g/^---/d
:g/^index/d
:g/^diff/d
:g/^new file/d
:g/^\\ No newline/d
:g/^@@.*@@$/d
:g/^@@/exe "normal! dt@..dwc$ ...\<Esc>"
:g/^+++/exe "normal! dwd2li\<CR><pre class='language-vimDiff'><p class='information'>\<Esc>"
:g/pre class/exe "normal! A</p><code class='language-vimDiff'>\<Esc>"
:g/^    /normal! 0d4l
:g/^ # \[/normal! dl
:g/^\n    /exe "normal! o<pre class='language-vim'><code class='language-vim'>\<Esc>V}2<"
:g/pre class/exe "normal! }O</code></pre>"
gg:%s/`.\{-}`{:.language-vim}/\0{:.language-vim}/g

While this script seems intimidating, the majority of the work it’s doing is simple string replacement. Combined with some fancy work done in normal mode, it transforms the output of a full Git log into a beautifully formatted markdown article.
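For reference, the raw input looks roughly like this (a fabricated, heavily abbreviated excerpt of git log -p output); the Author:, Date:, diff, index, and @@ lines are exactly what the script’s first batch of commands deletes or rewrites:

commit 1a2b3c4d...
Author: Jane Doe <jane@example.com>
Date:   Mon Oct 16 12:00:00 2017 -0400

    Add first chapter

diff --git a/chapter_01.md b/chapter_01.md
new file mode 100644
index 0000000..1a2b3c4
--- /dev/null
+++ b/chapter_01.md
@@ -0,0 +1,2 @@
+# Chapter One
+Hello, world.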

Talk about power!

Final Thoughts

Vim scripts are not pretty by any means, but they get the job done. I continually find myself amazed by the power and utility of Vim when it comes to nearly every aspect of text editing.

If you use Vim and you find yourself repeatedly making similar edits to files, I encourage you to consider scripting those edits. We should always be striving to improve our process, and writing Vim scripts is an incredibly powerful tool that we often overlook.

Learning to Crawl - Building a Bare Bones Web Crawler with Elixir

Written by Pete Corey on Oct 9, 2017.

I’ve been cooking up a side project recently that involves crawling through a domain, searching for links to specific websites.

While I’m keeping the details of the project shrouded in mystery for now, building out a web crawler using Elixir sounds like a fantastic learning experience.

Let’s roll up our sleeves and dig into it!

Let’s Think About What We Want

Before we start writing code, it’s important to think about what we want to build.

We’ll kick off the crawling process by handing our web crawler a starting URL. The crawler will fetch the page at that URL and look for links (<a> tags with href attributes) to other pages.

If the crawler finds a link to a page on the same domain, it’ll repeat the crawling process on that new page. This second crawl is considered one level “deeper” than the first. Our crawler should have a configurable maximum depth.

Once the crawler has traversed all pages on the given domain up to the specified depth, it will return a list of all internal and external URLs that the crawled pages link to.

To keep things efficient, let’s try to parallelize the crawling process as much as possible.

Hello, Crawler

Let’s start off our crawler project by creating a new Elixir project:


mix new hello_crawler

While we’re at it, let’s pull in two dependencies we’ll be using in this project: HTTPoison, a fantastic HTTP client for Elixir, and Floki, a simple HTML parser.

We’ll add them to our mix.exs file:


defp deps do
  [
    {:httpoison, "~> 0.13"},
    {:floki, "~> 0.18.0"}
  ]
end

And pull them into our project by running a Mix command from our shell:


mix deps.get

Now that we’ve laid out the basic structure of our project, let’s start writing code!

Crawling a Page

Based on the description given above, we know that we’ll need some function (or set of functions) that takes in a starting URL, fetches the page, and looks for links.

Let’s start sketching out that function:


def get_links(url) do
  [url]
end

So far our function just returns a list holding the URL that was passed into it. That’s a start.

We’ll use HTTPoison to fetch the page at that URL (being sure to specify that we want to follow 301 redirects), and pull the resulting HTML out of the body field of the response:


with {:ok, %{body: body}} <- HTTPoison.get(url, [], follow_redirect: true) do
  [url]
else
  _ -> [url]
end

Notice that we’re using a with block and pattern matching on a successful call to HTTPoison.get. If a failure happens anywhere in our process, we abandon ship and return the current url in a list.

Now we can use Floki to search for any <a> tags within our HTML and grab their href attributes:


with {:ok, %{body: body}} <- HTTPoison.get(url, [], [follow_redirect: true]),
     tags                 <- Floki.find(body, "a"),
     hrefs                <- Floki.attribute(tags, "href") do
  [url | hrefs]
else
  _ -> [url]
end

Lastly, let’s clean up our code by replacing our with block with a new handle_response function that uses multiple function heads to handle success and failure:


def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")]
end

def handle_response(_response, url) do
  [url]
end

def get_links(url) do
  headers = []
  options = [follow_redirect: true]
  url
  |> HTTPoison.get(headers, options)
  |> handle_response(url)
end


This step is purely a stylistic choice. I find with blocks helpful for sketching out solutions, but prefer the readability of branching through pattern matching.

Running our get_links function against a familiar site should return a handful of internal and external links:


iex(1)> HelloCrawler.get_links("http://www.east5th.co/")
["http://www.east5th.co/", "/", "/blog/", "/our-work/", "/services/", "/blog/",
 "/our-work/", "/services/",
 "https://www.google.com/maps/place/Chattanooga,+TN/", "http://www.amazon.com/",
 "https://www.cigital.com/", "http://www.tapestrysolutions.com/",
 "http://www.surgeforward.com/", "/blog/", "/our-work/", "/services/"]

Quick, someone take a picture; it’s crawling!

Crawling Deeper

While it’s great that we can crawl a single page, we need to go deeper. We want to scrape links from our starting page and recursively scrape any other pages we’re linked to.

The most naive way of accomplishing this would be to recurse on get_links for every new list of links we find:


[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&get_links/1) # Recursively call get_links
       |> List.flatten]

While this works conceptually, it has a few problems:

  • Relative URLs, like /services, don’t resolve correctly in our call to HTTPoison.get.
  • A page that links to itself, or a set of pages that form a loop, will cause our crawler to enter an infinite loop.
  • We’re not restricting crawling to our original host.
  • We’re not counting depth, or enforcing any depth limit.

Let’s address each of these issues.

Handling Relative URLs

Many links within a page are relative; they don’t specify all of the necessary information to form a full URL.

For example, on http://www.east5th.co/, there’s a link to /blog/. Passing "/blog/" into HTTPoison.get will return an error, as HTTPoison doesn’t know the context for this relative address.

We need some way of transforming these relative links into full-blown URLs given the context of the page we’re currently crawling. Thankfully, Elixir’s standard library ships with the fantastic URI module that can help us do just that!
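For example, merging our starting URL with a relative link gives us back a fully qualified URL. A quick iex session illustrates the idea:

iex(1)> URI.merge("http://www.east5th.co/", "/blog/") |> to_string
"http://www.east5th.co/blog/"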

Let’s use URI.merge and to_string to transform all of the links scraped by our crawler into well-formed URLs that can be understood by HTTPoison.get:


def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")
         |> Enum.map(&URI.merge(url, &1)) # Merge our URLs
         |> Enum.map(&to_string/1)        # Convert the merged URL to a string
         |> Enum.map(&get_links/1)
         |> List.flatten]
end

Now our link to /blog/ is transformed into http://www.east5th.co/blog/ before being passed into get_links and subsequently HTTPoison.get.

Perfect.

Preventing Loops

While our crawler is correctly resolving relative links, this leads directly to our next problem: our crawler can get trapped in loops.

The first link on http://www.east5th.co/ is to /. This relative link is translated into http://www.east5th.co/, which is once again passed into get_links, causing an infinite loop.

To prevent looping in our crawler, we’ll need to maintain a list of all of the pages we’ve already crawled. Before we recursively crawl a new link, we need to verify that it hasn’t been crawled previously.

We’ll start by adding a new, private, function head for get_links that accepts a path argument. The path argument holds all of the URLs we’ve visited on our way to the current URL.


defp get_links(url, path) do
  headers = []
  options = [follow_redirect: true]
  url
  |> HTTPoison.get(headers, options)
  |> handle_response(url, path)
end

We’ll call our new private get_links function from our public function head:


def get_links(url) do
  get_links(url, []) # path starts out empty
end

Notice that our path starts out as an empty list.


While we’re refactoring get_links, let’s take a detour and add a third argument called context:


defp get_links(url, path, context) do
  url
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(url, path, context)
end

The context argument is a map that lets us cleanly pass around configuration values without having to drastically increase the size of our function signatures. In this case, we’re passing in the headers and options used by HTTPoison.get.

To populate context, let’s change our public get_links function to build up the map by merging user-provided options with defaults provided by the module:


def get_links(url, opts \\ []) do
  context = %{
    headers: Keyword.get(opts, :headers, @default_headers),
    options: Keyword.get(opts, :options, @default_options)
  }
  get_links(url, [], context)
end

If the user doesn’t provide a value for headers, or options, we’ll use the defaults specified in the @default_headers and @default_options module attributes:


@default_headers []
@default_options [follow_redirect: true]

This quick refactor will help keep things clean down the road.


Whenever we crawl a URL, we’ll add that URL to our path, like a breadcrumb marking where we’ve been.


Next, we’ll filter out any links we find that we’ve already visited, and add the current url to path before recursing on get_links:


path = [url | path] # Add a breadcrumb
[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&URI.merge(url, &1))
       |> Enum.map(&to_string/1)
       |> Enum.reject(&Enum.member?(path, &1)) # Avoid loops
       |> Enum.map(&get_links(&1, [&1 | path], context))
       |> List.flatten]

It’s important to note that this doesn’t completely prevent repeated fetching of the same page. It simply prevents the same page from being refetched within a single recursive chain, or path.

Imagine that page A links to pages B and C. If page B or page C links back to page A, page A will not be refetched. However, if both page B and page C link to page D, page D will be fetched twice.

We won’t worry about this problem in this article, but caching our calls to HTTPoison.get might be a good solution.
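If you wanted to experiment with that, a minimal sketch might look something like the following (HelloCrawler.Cache is a hypothetical module, not part of the article’s codebase, and it ignores cache expiry and the race where two parallel tasks fetch the same URL at once):

defmodule HelloCrawler.Cache do
  # Holds a simple map of url -> HTTPoison result.
  def start_link do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Returns a cached result if we've already fetched this URL,
  # otherwise fetches the page and caches the result.
  def get(url, headers, options) do
    case Agent.get(__MODULE__, &Map.get(&1, url)) do
      nil ->
        result = HTTPoison.get(url, headers, options)
        Agent.update(__MODULE__, &Map.put(&1, url, result))
        result

      result ->
        result
    end
  end
end

Swapping HTTPoison.get out for HelloCrawler.Cache.get inside get_links (after starting the cache) would make repeated visits to page D essentially free.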

Restricting Crawling to a Host

If we try to run our new and improved HelloCrawler.get_links function, we’ll notice that it takes a long time to run. In fact, in most cases, it’ll never return!

The problem is that we’re not limiting our crawler to crawl only pages within our starting point’s domain. If we crawl http://www.east5th.co/, we’ll eventually get linked to Google and Amazon, and from there, the crawling never ends.

We need to detect the host of the starting page we’re given, and restrict crawling to only that host.

Thankfully, the URI module once again comes to the rescue.

We can use URI.parse to pull out the host of our starting URL, and pass it into each call to get_links and handle_response via our context map:


def get_links(url, opts \\ []) do
  url = URI.parse(url)
  context = %{
    ...
    host: url.host # Add our initial host to our context
  }
  get_links(url, [], context)
end

Using the parsed host, we can check if the url passed into get_links is a crawlable URL:


defp get_links(url, path, context) do
  if crawlable_url?(url, context) do # check before we crawl
    ...
  else
    [url]
  end
end

The crawlable_url? function simply verifies that the host of the URL we’re attempting to crawl matches the host of our initial URL, passed in through our context:


defp crawlable_url?(%{host: host}, %{host: initial}) when host == initial, do: true
defp crawlable_url?(_, _), do: false

This guard prevents us from crawling external URLs.


Because we passed the result of URI.parse into get_links, our url is now a URI struct rather than a string. We need to convert it back into a string before passing it into HTTPoison.get and handle_response:


if crawlable_url?(url, context) do
  url
  |> to_string # convert our %URI to a string
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(url, path, context)
else
  [url]
end

Additionally, you’ve probably noticed that our collected list of links is no longer a list of URL strings. Instead, it’s a list of URI structs.

After we finish crawling through our target site, let’s convert all of the resulting structs back into strings, and remove any duplicates:


def get_links(url, opts \\ []) do
  ...
  get_links(url, [], context)
  |> Enum.map(&to_string/1) # convert URI structs to strings
  |> Enum.uniq # remove any duplicate urls
end

We’re crawling deep now!

Limiting Our Crawl Depth

Once again, if we let our crawler loose on the world, it’ll most likely take a long time to get back to us.

Because we’re not limiting how deep our crawler can traverse into our site, it’ll explore all possible paths that don’t involve retracing its steps. We need a way of telling it not to continue crawling once it’s already followed a certain number of links and reached a specified depth.

Due to the way we structured our solution, this is incredibly easy to implement!

All we need to do is add another guard in our get_links function that makes sure that the length of path hasn’t exceeded our desired depth:


if continue_crawl?(path, context) and crawlable_url?(url, context) do

The continue_crawl? function verifies that the length of path hasn’t grown past the max_depth value in our context map:


defp continue_crawl?(path, %{max_depth: max}) when length(path) > max, do: false
defp continue_crawl?(_, _), do: true

In our public get_links function we can add max_depth to our context map, pulling the value from the user-provided opts or falling back to the @default_max_depth module attribute:


@default_max_depth 3

context = %{
  ...
  max_depth: Keyword.get(opts, :max_depth, @default_max_depth)
}

As with our other opts, the default max_depth can be overridden by passing a custom max_depth value into get_links:


HelloCrawler.get_links("http://www.east5th.co/", max_depth: 5)

If our crawler tries to crawl deeper than the maximum allowed depth, continue_crawl? returns false and get_links returns the url being crawled, preventing further recursion.

Simple and effective.

Crawling in Parallel

While our crawler is fully functional at this point, it would be nice to improve upon it a bit. Instead of waiting for every path to be exhausted before crawling the next path, why can’t we crawl multiple paths in parallel?

Amazingly, parallelizing our web crawler is as simple as swapping our map over the links we find on a page with a parallelized map:


|> Enum.map(&(Task.async(fn -> get_links(URI.parse(&1), [&1 | path], context) end)))
|> Enum.map(&Task.await/1)

Instead of passing each scraped link into get_links and waiting for it to fully crawl every sub-path before moving onto the next link, we pass all of our links into asynchronous calls to get_links, and then wait for all of those asynchronous calls to return.

For every batch of crawling we do, we only need to wait as long as the slowest page takes to crawl, rather than waiting on every page one by one.
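One caveat worth calling out: Task.await/2 defaults to a five second timeout, which a slow page or a deep sub-crawl can easily exceed. Passing an explicit timeout to Task.await keeps those crawls from crashing (the one minute below is an arbitrary choice on my part):

|> Enum.map(&(Task.async(fn -> get_links(URI.parse(&1), [&1 | path], context) end)))
|> Enum.map(&Task.await(&1, :timer.minutes(1)))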

Efficiency!

Final Thoughts

It’s been a process, but we’ve managed to get our simple web crawler up and running.

While we accomplished everything we set out to do, our final result is still very much a bare bones web crawler. A more mature crawler would come equipped with more sophisticated scraping logic, rate limiting, caching, and other optimizations.

That being said, I’m proud of what we’ve accomplished. Be sure to check out the entire HelloCrawler project on Github.

I’d like to thank Mischov for his incredibly helpful suggestions for improving this article and the underlying codebase. If you have any other suggestions, let me know!

User Authentication Kata with Elixir and Phoenix

Written by Pete Corey on Oct 2, 2017.

It’s no secret that I’m a fan of code katas. Over the years I’ve spent quite a bit of time on CodeWars and other kata platforms, and I’ve written a large handful of Literate Commit style articles outlining kata solutions:

Delete Occurrences of an Element

Point in Polygon

Molecule to Atoms

Nesting Structure Comparison

The Captain’s Distance Request

Not Quite Lisp

I’m a huge fan of the idea behind code katas, and I believe they provide valuable exercise for your programming muscles.

We Should be Training

That being said, the common form of code kata is just that: exercise. Something to get your heart pumping and your muscles moving, but devoid of any real goal.

Exercise is physical activity for its own sake, a workout done for the effect it produces today, during the workout or right after you’re through.

Does my career benefit from reimplementing search algorithms? Probably not. How much do my clients benefit from my skill at coming up with clever solutions to contrived puzzles? Resoundingly little. What’s the objective value of being able to write a non-recursive fibonacci sequence generator? Zero dollars.

Instead of exercising, we should be training.

Our practice time should be structured around and directed toward a particular goal. To make the most of our time, this goal should be practical and valuable.

Training is physical activity done with a longer-term goal in mind, the constituent workouts of which are specifically designed to produce that goal.

We should be working towards mastery of skills that bring value to ourselves and those around us.

Practical Code Katas

Lately I’ve been working on what I call “practical code katas”; katas designed to objectively improve my skills as a working, professional web developer.

The first kata I’ve been practicing is fairly simple. Here it is:

Build a web application that lets new users register using a username and a password, and that lets existing users log in and log out. Protect at least one route in the application from unauthenticated users.

The rules for practicing this kata are simple: you can do whatever you want.

You can use any languages you want. You can use any frameworks you want. You can use any tools you want. Nothing is off limits, as long as you correctly, securely, and honestly complete the kata as described.

I encourage you to go try the kata!

How long did it take you? Are you satisfied with the result? Are you satisfied with your process? Where did you get stuck? What did you have to look up? Where can you improve?


When I first completed this practical code kata, I was following along with the first several chapters in the excellent Programming Phoenix book (affiliate link) by Chris McCord, Bruce Tate, and José Valim. Copying nearly every example verbatim (and trying to convert to a Phoenix 1.3 style) as I read and followed along, it took me roughly four hours to first complete the kata.

Later, I tried practicing the kata again. This time, I had a rough lay of the land, and I knew vaguely how to accomplish the task. Two hours and many references back to Programming Phoenix later, I finished the kata.

And then I did it again. This time it only took me forty-five minutes. I found myself stumbling on a few pieces of trivia (“Where does assign live? Oh right, Plug.Conn…”), but the process went much more smoothly than before.

And I did it again. Now I was down to thirty minutes. I still found myself forgetting small details (remember: *_path helpers live in <YourProjectWeb>.Router.Helpers).

So I did it again, and again, and again…

The Benefits of Training

As I keep intentionally practicing this kata and others like it, I continue to find more and more benefits. Some benefits are obvious, while others are more subtle.

An obvious benefit of this type of routine, intentional practice of a small set of common tasks is that implementing those tasks in a real-world project becomes as natural as breathing. Even better, if I ever need to implement a task similar to any of my katas, I’ll immediately know where to start and what to change.

Repetition breeds familiarity.
Familiarity breeds confidence.
Confidence breeds success.

Practicing this type of kata has not only improved my ability to know what code I need to write, but also my ability to write it.

As I repeatedly find myself writing and rewriting identical sections of code time after time, I start looking for shortcuts. I start to notice places where I can use new Vim tricks I’ve been trying to learn. I build up my snippet library with generic, repeated blocks of code. I get more comfortable with the nuts and bolts of what I do for a living: writing code.

Final Thoughts

If it wasn’t obvious before, it’s probably obvious now. I love code katas.

That being said, I think we haven’t been using code katas up to their full potential. Intentional, repeated practice of a set of practical, real-world problems can help us greatly improve our skills as professional web developers.

Stop exercising, and start training!