GenServers and Memory Images: A Match Made in Heaven

On Jun 19, 2017 by Pete Corey

My current project, Inject Detect, is being built with Elixir and makes heavy use of Martin Fowler-style Memory Images. After working with this setup for several months, I’ve come to realize that Elixir GenServers and a Memory Image architecture are a match made in heaven.

Let’s dive into what Memory Images are, and why GenServers are the perfect tool for building out a Memory Image in your application.

What is a Memory Image?

In my opinion, the best introduction to the Memory Image concept is Martin Fowler’s article on the subject. If you haven’t read it already, be sure to read through the entire article.

For brevity, I’ll try to summarize as quickly as possible. Martin comments that most developers’ first question when starting a new project is, “what database will I use?” Unfortunately, answering this question requires making many upfront decisions about things like data shape and usage patterns that are often unknowable at that early stage.

Martin flips the question on its head. Instead of asking which database you should use, he suggests you ask yourself, “do I need a database at all?”

Mind blown.

The idea of a Memory Image is to keep the entire state of your application within your server’s memory, rather than keeping it in a database. At first, this seems absurd. In reality, it works very well for many projects.

I’ll defer an explanation of the pros, cons, and my experiences with Memory Images to a later post. Instead of going down that rabbit hole, let’s take a look at how we can efficiently implement a Memory Image in Elixir!

Backed By an Event Log

The notion that a Memory Image architecture doesn’t use a database at all isn’t entirely true. In Inject Detect, I use a database to persist a log of events that describes every change that has happened to the system since the beginning of time.

This event log isn’t particularly useful in its raw format. It can’t be queried in any meaningful way, and it can’t be used to make decisions about the current state of the system.

To get something more useful out of the system, the event log needs to be replayed. Each event affects the system’s state in some known way. By replaying these events and their corresponding effects in order, we can rebuild the current state of the system. We effectively reduce all of the events in our event log down into the current state of our system.

This is Event Sourcing.


We can implement this kind of simplified Event Sourced system fairly easily:


defmodule State do

  import Ecto.Query

  # Convert a raw event row into its corresponding event struct. Given
  # the example below, %{type: "SignedUp", data: %{"email" => ...}}
  # becomes %SignedUp{email: ...}.
  defp to_struct(event) do
    data = Map.new(event.data, fn {key, value} -> {String.to_existing_atom(key), value} end)
    struct(Module.concat([event.type]), data)
  end

  def get do
    InjectDetect.Model.Event
    |> order_by([event], event.id)
    |> InjectDetect.Repo.all
    |> Enum.map(&to_struct/1)
    |> Enum.reduce(%{}, &State.Reducer.apply/2)
  end

end

Each event in our event log has a type field that points to a specific event struct in our application (like SignedUp), and a data field that holds a map of all the information required to replay the effects of that event on the system.
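For reference, here’s a minimal sketch of what the InjectDetect.Model.Event model might look like as an Ecto schema. The table name and the use of a :map column are assumptions for illustration:


defmodule InjectDetect.Model.Event do
  use Ecto.Schema

  schema "events" do
    field :type, :string # names the event struct to build, e.g. "SignedUp"
    field :data, :map    # everything needed to replay the event's effects
  end
end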

For example, a SignedUp event might look like this when saved to the database:


%{id: 123, type: "SignedUp", data: %{"email" => "user@example.com"}}

To get the current state of the system, we grab all events in our event log, convert them into structs, and then reduce them down into a single state object by applying their changes, one after the other, using the State.Reducer Elixir protocol that all event structs are required to implement.
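The State.Reducer protocol itself isn’t shown here, but a minimal sketch might look like the following. The SignedUp struct and the shape of the state map are assumptions for illustration:


defprotocol State.Reducer do
  @doc "Applies an event to the current state, returning the new state."
  def apply(event, state)
end

defmodule SignedUp do
  defstruct [:email]
end

defimpl State.Reducer, for: SignedUp do
  # Fold a SignedUp event into the state by adding the newly
  # registered user:
  def apply(%SignedUp{email: email}, state) do
    Map.update(state, :users, [%{email: email}], &[%{email: email} | &1])
  end
end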

While this is a fairly simple concept, it’s obviously inefficient. Imagine having to process your entire event log every time you want to inspect the state of your system!

There has to be a better way.

GenServer, Meet Memory Image

Memory Image, meet GenServer.

Rather than reprocessing our entire event log every time we want to inspect our application’s state, what if we could just keep the application state in memory?

GenServers (and Elixir processes in general) are excellent tools for persisting state in memory. Let’s refactor our previous solution to calculate our application’s state and then store it in memory for future use.

To manage this, our GenServer will need to store two pieces of information. It will need to store the current state of the system, and the id of the last event that was processed. Initially, our current application state will be an empty map, and the last id we’ve seen will be 0:


  def start_link, do:
    GenServer.start_link(__MODULE__, { %{}, 0 }, name: __MODULE__)

Next, rather than fetching all events from our event log, we want to fetch only the events that have happened after the last event id that we’ve processed:


  defp get_events_since(id) do
    events = InjectDetect.Model.Event
    |> where([event], event.id > ^id)
    |> order_by([event], event.id)
    |> InjectDetect.Repo.all
    {convert_to_structs(events), get_last_event_id(id, events)}
  end

This function returns a tuple of the fetched events, along with the id of the last event in that list (or the id that was passed in, if there are no new events).

When get_events_since is first called, it will return all events currently in the event log. Any subsequent calls will only return the events that have happened after the last event we’ve processed. Because we’re storing the system’s state in our GenServer, we can apply these new events to the old state to get the new current state of the system.

Tying these pieces together, we get something like this:


defmodule State do
  use GenServer

  import Ecto.Query

  def start_link, do:
    GenServer.start_link(__MODULE__, {%{}, 0}, name: __MODULE__)

  def get, do:
    GenServer.call(__MODULE__, :get)

  # Convert raw event rows into their corresponding event structs:
  def convert_to_structs(events), do:
    Enum.map(events, &to_struct/1)

  defp to_struct(event) do
    data = Map.new(event.data, fn {key, value} -> {String.to_existing_atom(key), value} end)
    struct(Module.concat([event.type]), data)
  end

  # Return the id of the last event in the list, falling back to the
  # previous id if no new events were found:
  def get_last_event_id(id, events) do
    case List.last(events) do
      nil   -> id
      event -> event.id
    end
  end

  defp get_events_since(id) do
    events = InjectDetect.Model.Event
    |> where([event], event.id > ^id)
    |> order_by([event], event.id)
    |> InjectDetect.Repo.all
    {convert_to_structs(events), get_last_event_id(id, events)}
  end

  def handle_call(:get, _from, {state, last_id}) do
    {events, last_id} = get_events_since(last_id)
    state = Enum.reduce(events, state, &State.Reducer.apply/2)
    {:reply, {:ok, state}, {state, last_id}}
  end

end

At first this solution may seem complicated, but when we break it down, there’s not a whole lot going on.

Our State GenServer stores:

  1. The current state of the system.
  2. The id of the last event it has processed.

Whenever we call State.get, it checks for new events in the event log and applies them, in order, to the current state. The GenServer saves this state and the id of the last new event and then replies with the new state.

That’s it!
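As a quick usage sketch, assuming nothing else has started the State process yet:


{:ok, _pid} = State.start_link()

# The first call replays the entire event log:
{:ok, state} = State.get()

# Subsequent calls only replay events created since the last call:
{:ok, state} = State.get()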

Final Thoughts

GenServers and Memory Images in Elixir truly are a match made in heaven. When working with these tools and techniques, it honestly feels like solutions effortlessly fall into place.

The Memory Image architecture, especially when combined with Event Sourcing, perfectly lends itself to a functional approach. Additionally, using GenServers to implement these ideas opens the doors to building fast, efficient, fault-tolerant, and consistent distributed systems with ease.

While Memory Images are an often overlooked solution to the problem of maintaining state, the flexibility and speed they bring to the table should make them serious contenders in your next project.

GraphQL NoSQL Injection Through JSON Types

On Jun 12, 2017 by Pete Corey

One year ago today, I wrote an article discussing NoSQL Injection and GraphQL. I praised GraphQL for eradicating the entire possibility of NoSQL Injection.

I claimed that because GraphQL forces you to flesh out the entirety of your schema before you ever write a query, it’s effectively impossible to succumb to the incomplete argument checking that leads to a NoSQL Injection vulnerability.

Put simply, this means that an input object will never have any room for wildcards, or potentially exploitable inputs. Partial checking of GraphQL arguments is impossible!

I was wrong.

NoSQL Injection is entirely possible when using GraphQL, and can creep into your application through the use of “partial scalar types”.

In this article, we’ll walk through how the relatively popular GraphQLJSON scalar type can open the door to NoSQL Injection in applications using MongoDB.

Custom Scalars

In my previous article, I explained that GraphQL requires that you define your entire application’s schema all the way down to its scalar leaves.

These scalars can be grouped and nested within objects, but ultimately every field sent down to the client, or passed in by the user, is of a known type:

Scalars and Enums form the leaves in request and response trees; the intermediate levels are Object types, which define a set of fields, where each field is another type in the system, allowing the definition of arbitrary type hierarchies.

Normally, these scalars are simple primitives: String, Int, Float, or Boolean. However, sometimes these four primitive types aren’t enough to fully flesh out the input and output schema of a complex web application.

Custom scalar types to the rescue!

Your application can define a custom scalar type, along with the set of functionality required to serialize and deserialize that type into and out of a GraphQL request.

A common example of a custom type is the Date type, which can serialize JavaScript Date objects into strings to be returned as part of a GraphQL query, and parse date strings into JavaScript Date objects when provided as GraphQL inputs.
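As a rough sketch of the idea, here’s what such a Date scalar might look like in Absinthe, the Elixir GraphQL library used elsewhere on this blog (graphql-js custom scalars follow the same serialize/parse pattern). The module name is hypothetical:


defmodule MyApp.Schema.Types do
  use Absinthe.Schema.Notation

  scalar :date do
    # Serialize Date structs into ISO 8601 strings on the way out:
    serialize &Date.to_iso8601/1

    # Parse incoming date strings back into Date structs:
    parse fn
      %Absinthe.Blueprint.Input.String{value: value} ->
        case Date.from_iso8601(value) do
          {:ok, date} -> {:ok, date}
          {:error, _} -> :error
        end

      _ ->
        :error
    end
  end
end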

Searching with JSON Scalars

This is all well and good. Custom scalars obviously are a powerful tool for building out more advanced GraphQL schemas. Unfortunately, this tool can be abused.

Imagine we’re building a user search page. In our contrived example, the page lets users search for other users based on a variety of fields: username, full name, email address, etc…

Being able to search over multiple fields creates ambiguity, and ambiguity is hard to work with in GraphQL.

To make our lives easier, let’s accept the search criteria as a JSON object using the GraphQLJSON custom scalar type:


type Query {
    users(search: JSON!): [User]
}

Using Apollo and a Meteor-style MongoDB driver, we could write our users resolver like this:


{
    Query: {
        users: (_root, { search }, _context) => {
            return Users.find(search, {
                fields: {
                    username: 1, 
                    fullname: 1, 
                    email: 1
                }
            });
        }
    }
}

Great!

But now we want to paginate the results and allow the user to specify the number of results per page.

We could add skip and limit fields separately to our users query, but that would be too much work. We’ve already seen how well using the JSON type worked, so let’s use that again!


type Query {
    users(search: JSON!, options: JSON!): [User]
}

We’ve extended our users query to accept an options JSON object.


{
    Query: {
        users: (_root, { search, options }, _context) => {
            return Users.find(search, _.extend({
                fields: {
                    _id: 1,
                    username: 1, 
                    fullname: 1, 
                    email: 1
                }
            }, options));
        }
    }
}

And we’ve extended our users resolver to merge the client-provided options object (holding skip and limit) into the options we pass to Users.find.

Now, for example, our client can make a query to search for users based on their username or their email address:


{
    users(search: "{\"username\": {\"$regex\": \"sue\"}, \"email\": {\"$regex\": \"sue\"}}",
          options: "{\"skip\": 0, \"limit\": 10}") {
        _id
        username
        fullname
        email
    }
}

This might return a few users with "sue" as a part of their username or email address.

But there are problems here.

Imagine a curious or potentially malicious user making the following GraphQL query:


{
    users(search: "{\"email\": {\"$gte\": \"\"}}",
          options: "{\"skip\": 0, \"limit\": 10}") {
        _id
        username
        fullname
        email
    }
}

The entire search JSON object is passed directly into the Users.find query. Because every string is greater than or equal to the empty string, the {"$gte": ""} matcher matches every document, and this query will return all users in the collection.

Thankfully, a malicious user would only receive our users’ usernames, full names, and email addresses. Or would they?

The options JSON input could also be maliciously modified:


{
    users(search: "{\"email\": {\"$gte\": \"\"}}",
          options: "{\"fields\": {}}") {
        _id
        username
        fullname
        email
    }
}

By passing in their own fields object, an attacker could overwrite the fields specified by the server. This combination of search and options would return all fields (specified in the GraphQL schema) for all users in the system.

These fields might include sensitive information like their hashed passwords, session tokens, purchase history, etc…

Fixing the Vulnerability

In this case, and in most cases, the solution here is to be explicit about what we expect to receive from the client. Instead of receiving our flexible search and options objects from the client, we’ll instead ask for each field individually:


type Query {
    users(fullname: String,
          username: String,
          email: String,
          skip: Int!,
          limit: Int!): [User]
}

By making the search fields (fullname, username, and email) optional, the querying user can omit any of the fields they don’t wish to search on.

Now we can update our resolver to account for this explicitness:


{
    Query: {
        users: (_root, args, _context) => {
            let search = _.extend({}, args.fullname ? { fullname: args.fullname } : {},
                                      args.username ? { username: args.username } : {},
                                      args.email ? { email: args.email } : {});
            return Users.find(search, {
                fields: {
                    _id: 1,
                    username: 1, 
                    fullname: 1, 
                    email: 1
                },
                skip: args.skip,
                limit: args.limit
            });
        }
    }
}

If any of fullname, username, or email are passed into the query, we’ll add them to our MongoDB query. We can safely dump this user-provided data into our query because we know each value is a String at this point, thanks to GraphQL.

Lastly, we’ll set skip and limit on our MongoDB query to whatever was passed in from the client. We can be confident that our fields can’t possibly be overridden.
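For completeness, a search query from the client against this stricter schema might look something like this:


{
    users(username: "sue", skip: 0, limit: 10) {
        _id
        username
        fullname
        email
    }
}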

Final Thoughts

Custom scalar types, and the JSON scalar type specifically, aren’t all bad. As we discussed, they’re a powerful and important tool for building out your GraphQL schema.

However, when using JSON types, or any other sufficiently expressive custom scalar types, it’s important to remember to make assertions about the type and shape of user-provided data. If you’re assuming that the data passed in through a JSON field is a string, check that it’s a string.

If a more primitive GraphQL type, like an Int, fulfills the same functional requirements as a JSON type, even at the cost of some verbosity, use the primitive type.

Behold the Power of GraphQL

On Jun 5, 2017 by Pete Corey

Imagine you’re building out a billing history page. You’ll want to show the current user’s basic account information along with the most recent charges made against their account.

Using Apollo client and React, we can wire up a simple query to pull down the information we need:


export default graphql(gql`
    query {
        user {
            id
            email
            charges {
                id
                amount
                created
            }
        }
    }
`)(Account);

Question: Where are the user’s charges being pulled from? Our application’s data store? Some third party service?

Follow-up question: Does it matter?

A Tale of Two Data Sources

In this example, we’re resolving the current user from our application’s data store, and we’re resolving all of the charges against that user with Stripe API calls.

We’re not storing any charges in our application.

If we take a look at our Elixir-powered Absinthe schema definition for the user type, we’ll see what’s going on:


object :user do
  field :id, :id
  field :email, :string
  field :charges, list_of(:stripe_charge) do
    resolve fn
      (user, _, _) ->
        case Stripe.get_charges(user.customer_id) do
          {:ok, charges} -> {:ok, charges}
          _ -> InjectDetect.error("Unable to resolve charges.")
        end
    end
  end
end

The id and email fields on the user type are being automatically pulled out of the user object.

The charges field, on the other hand, has a custom resolver function that queries the Stripe API with the user’s customer_id and returns all charges that have been made against that customer.

Keep in mind that this is just an example. In a production system, it would be wise to add a caching and rate limiting layer between the client and the Stripe API calls to prevent abuse…
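As a rough sketch of what the caching half of that layer might look like (the ETS table, the five minute TTL, and the Stripe.get_charges/1 wrapper are assumptions for illustration; rate limiting is left out entirely):


defmodule ChargesCache do
  @table :charges_cache
  @ttl :timer.minutes(5) # hypothetical time-to-live for cached charges

  # Create the cache table; call once at application startup:
  def start, do:
    :ets.new(@table, [:named_table, :public, :set])

  def get_charges(customer_id) do
    now = System.monotonic_time(:millisecond)
    case :ets.lookup(@table, customer_id) do
      # Serve a cached result while it's still fresh:
      [{^customer_id, charges, cached_at}] when now - cached_at < @ttl ->
        {:ok, charges}
      _ ->
        fetch_and_cache(customer_id, now)
    end
  end

  defp fetch_and_cache(customer_id, now) do
    with {:ok, charges} <- Stripe.get_charges(customer_id) do
      :ets.insert(@table, {customer_id, charges, now})
      {:ok, charges}
    end
  end
end

The resolver above would then call ChargesCache.get_charges/1 instead of hitting the Stripe API directly.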

Does It Matter?

Does it matter that the user and the charges against that user are resolved from different data sources? Not at all.

This is the power of GraphQL!

From the client’s perspective, the source of the data is irrelevant. All that matters is the data’s shape and the connections that can be made between that data and the rest of the data in the graph.