Molecule to Atoms

This post is written as a set of Literate Commits. The goal of this style is to show you how this program came together from beginning to end.

Each commit in the project is represented by a section of the article. Click each section's header to see the commit on Github, or check out the repository and follow along.

Written by Pete Corey on Jul 27, 2016.

Project Setup

Today we’re going to be tackling the Molecule to Atom kata. The goal of this kata is to parse a string representation of a molecule into its component elements, or atoms.

As always, the first step is to get our project set up. We’ll be using Mocha as our test runner and Babel for ES6 support.

.babelrc

+{ + "presets": ["es2015"] +}

.gitignore

+node_modules/

package.json

+{ + "main": "index.js", + "scripts": { + "test": "mocha ./test --compilers js:babel-register" + }, + "dependencies": { + "babel-preset-es2015": "^6.9.0", + "babel-register": "^6.9.0", + "chai": "^3.5.0", + "lodash": "^4.12.0", + "mocha": "^2.4.5" + } +}

test/index.js

+import { expect } from "chai"; + +describe("index", function() { + + it("works"); + +});

Laying the Groundwork

The best place to start is at the beginning. We’ll begin solving this kata by writing the most basic test we can think of. We expect an empty string to be parsed into an empty object:


expect(parseMolecule("")).to.deep.equal({});

After writing this test, it fails. parseMolecule is not defined. We can quickly remedy this by importing parseMolecule into our test file and then exporting it from our main module.

Lastly, we need parseMolecule to return an empty object. Just like that, our tests are green.

index.js

+export function parseMolecule(formula) { + return {}; +}

test/index.js

import { expect } from "chai"; +import { parseMolecule } from "../index"; -describe("index", function() { +describe("Molecule to Atoms", function() { - it("works"); + it("it parses a molecule", () => { + expect(parseMolecule("")).to.deep.equal({}); + });

Introducing Our Abstractions

Knowing this solution is going to get complex fairly quickly, we’d like to leverage some abstractions to make it more testable and maintainable.

The most obvious abstraction we can think of is a molecule. We create a molecule by passing in a formula. From there, we can call parse to get an object-based representation of the molecule.

We can rewrite our parseMolecule function to use molecule in a clean way.

index.js

+export function molecule(formula) { + function parse() { + return {}; + } + + return { + parse + }; +} + export function parseMolecule(formula) { - return {}; + return molecule(formula).parse(); }

Testing Hydrogen

Let’s add a new test case:


expect(parseMolecule("H")).to.deep.equal({ H: 1 });

This test fails. It’s expecting {} to equal { H: 1 }.

The fastest way we can get ourselves to green is to return an object that maps the provided formula to 1.


return {
    [formula]: 1
};

After making this change, our first test fails. We now need to explicitly handle this base case:


if (!formula) {
    return {};
}

And now we’re back to green again.

index.js

... function parse() { - return {}; + if (!formula) { + return {}; + } + return { + [formula]: 1 + }; }

test/index.js

... expect(parseMolecule("")).to.deep.equal({}); + expect(parseMolecule("H")).to.deep.equal({ H: 1 }); });

Multiple Elements

Next we’ll move onto parsing a molecule with multiple types of elements. Let’s try to parse "HMg":


expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 });

As expected, this test fails. parse is returning { HMg: 1 }, rather than correctly splitting "H" and "Mg".

To fix this, we’ll need to implement a new method on molecule that will split a formula into a number of parts. We’ll start things off by splitting formulas into its component elements:


formula.match(/[A-Z][a-z]*/g) || [];

The || [] default is required to keep our base case of parsing "" happy.

After rewriting parse to use this new parts method to assemble the molecule object, our tests flip back to green.

index.js

export function molecule(formula) { + function parts() { + return formula.match(/[A-Z][a-z]*/g) || []; + } + function parse() { - if (!formula) { - return {}; - } - return { - [formula]: 1 - }; + return parts().reduce((result, part) => { + result[part] = 1; + return result; + }, {}); } ... return { - parse + parse, + parts };

test/index.js

import { expect } from "chai"; -import { parseMolecule } from "../index"; +import { parseMolecule, molecule } from "../index"; ... expect(parseMolecule("H")).to.deep.equal({ H: 1 }); + expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 }); + }); + + describe("molecule", () => { + it("splits a formula into parts", () => { + expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]); + }); });

Compiling Parts

Things are getting more complicated. Instead of just breaking apart our formula and looking for elements, we need to look for pairs of elements and their corresponding count, or ratio in the molecule.

A nice way to do this is using regular expression match groups and ES6 pattern matching. We can define our regex, and wrap the element and optional count in match groups:


let regex = /([A-Z][a-z]*)(\d*)/g;

As we iterate over our regex matches, we can destructure the result into its corresponding element and count pair.

Finally, we refactored parts to return an object with part and count keys to keep things explicit. Thinking ahead, I think this might come in handy when we need to deal with nested expressions.

index.js

... function parts() { - return formula.match(/[A-Z][a-z]*/g) || []; + let parts, results = []; + let regex = /([A-Z][a-z]*)(\d*)/g; + while (parts = regex.exec(formula)) { + let [_, part, count] = parts; + results.push({ + part, + count: parseInt(count) || 1 + }); + } + return results; } ... function parse() { - return parts().reduce((result, part) => { - result[part] = 1; + return parts().reduce((result, {part, count}) => { + result[part] = count; return result;

test/index.js

... expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 }); + expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 }); }); ... it("splits a formula into parts", () => { - expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]); + expect(molecule("HMg").parts()).to.deep.equal([ + { part: "H", count: 1 }, + { part: "Mg", count: 1 }, + ]); });

Fixing an Edge Case

One thing I noticed while looking over our solution is that it did not support multiple instances of the same element in a molecule. For example, with "HMgH", the last "H" element would override the previous.

I added a test to confirm my suspicions, and quickly fixed the problem. Instead of overriding the element in the result object with count, increment it (if it exists) by count:


result[part] = ~~result[part] + count;

The ~~ operator is rarely seen, but it’s an incredibly useful operator. In this case, if result[part] is undefined, it will be converted to 0, preventing a NaN result from our addition.

index.js

... return parts().reduce((result, {part, count}) => { - result[part] = count; + result[part] = ~~result[part] + count; return result;

test/index.js

... expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 }); + expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 }); });

Beginning Nested Expressions

Next up on our feature todo list is adding support for nested expressions. A simple nested expressions to get us going could be "[H]Mg". Let’s set our expectations with a test:


expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 });

Now each “part” in our parts method could be either an element, or a nested expressions within square brackets. Let’s update our regular expression to reflect that:


let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;

Thanks to the magic of match groups and destructing, we can assign the nested expression (if it exists) to square:


let [_, __, square, part, count] = parts;

Finally, if square exists, let’s recursively process the nested expression and append its parts onto our list of results.

With those changes, our tests flip back to green. Beautiful.

index.js

... let parts, results = []; - let regex = /([A-Z][a-z]*)(\d*)/g; + let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; while (parts = regex.exec(formula)) { - let [_, part, count] = parts; - results.push({ - part, - count: parseInt(count) || 1 - }); + let [_, __, square, part, count] = parts; + if (square) { + let nested = molecule(square).parts(); + results = results.concat(nested); + } + else { + results.push({ + part, + count: parseInt(count) || 1 + }); + } }

test/index.js

... expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 }); + expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 }); });

Refactoring

Looking forward, the next feature we’ll want to support is using ratios, or count on nested expressions. This means that we’ll need some notion of “multiplying” molecules by some count.

I imagine calling molcule(...).multiply(2) would return a new molecule. This means that multiply will need access to our parsed interpretation of the formula.

We could have multiply call parts, multiply each element by count, serialize the result back into a formula and then create and return a new molecule, but that’s a very roundabout way of doing things.

Instead, let’s define an internal _parts list that’s created every time a molecule is created. The parts method will simply return _parts, and multiply will be able to modify _parts directly.

After doing our refactoring, all of our tests are still passing. We’re safe.

index.js

export function molecule(formula) { - function parts() { - let parts, results = []; - let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; - while (parts = regex.exec(formula)) { - let [_, __, square, part, count] = parts; - if (square) { - let nested = molecule(square).parts(); - results = results.concat(nested); - } - else { - results.push({ - part, - count: parseInt(count) || 1 - }); - } + let _parts = []; + + let matches; + let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; + while (matches = regex.exec(formula)) { + let [_, __, square, part, count] = matches; + if (square) { + let nested = molecule(square).parts(); + _parts = _parts.concat(nested); + } + else { + _parts.push({ + part, + count: parseInt(count) || 1 + }); } - return results; + } + + function parts() { + return _parts; }

Multiplying Nested Expressions

With that refactor out of the way, we can easily implement multiply. To give ourselves an explicit objective, let’s write a test first:


expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 });

Implementing multiply is straight forward. We just loop over each of the _parts of the molecule, multiplying its count by the provided count. Finally, we return this, so calls to molecule methods can be chained together.

Next, let’s write a test for handling ratios on nested expressions:


expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 });

This test fails, as expected, but it’s not difficult to get our suite back to green. We just need to build our nested molecule, multiply is by its corresponding count, and finally append its parts onto our results array.

index.js

... let [_, __, square, part, count] = matches; + count = parseInt(count) || 1; if (square) { - let nested = molecule(square).parts(); - _parts = _parts.concat(nested); + let nested = molecule(square).multiply(count); + _parts = _parts.concat(nested.parts()); } ... part, - count: parseInt(count) || 1 + count }); ... + function multiply(count) { + _parts.forEach((part) => { + part.count *= count; + }); + return this; + } + function parts() { ... parse, - parts + parts, + multiply };

test/index.js

... expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 }); + expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 }); }); ... }); + it("multiplies an object", () => { + expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 }); + }); });

Regex Woes

Unfortunately, there’s a big problem with our solution. Our regex (\[.*\]) is looking for opening and closing brackets and assuming that everything within them are a nested expression.

This assumption breaks down when we have two nested expressions in the same formula, like "[H]O[H]". Our regex will match on the first and very last square brackets, returning "H]O[H" as the nested expression. We can write this as a test to see the failure in action:


expect(molecule("[H]2O[H]").parts()).to.deep.equal([
    { part: "H", count: 2 },
    { part: "O", count: 1 },
    { part: "H", count: 1 }
]);

Switching to a non-greedy regex (\[.*?\]) would still fail, but this time on nested subexpressions, like "[H[O]]".

We need a better way to parse out our formula parts.

Our new solution uses a regex to match on all opening brackets, closing brackets, and elements with a trailing count:


let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g;

Every time we encounter an opening bracket, we push a new formula onto our new stack:


if (open) {
    stack.push({
        formula: ""
    });
}

A non-empty stack means that we’re collecting the pieces of a nested expression. That means we should just append any elements and counts we find to that sub-expression’s formula:


else if (stack.length) {
    stack[stack.length - 1].formula += part + count;
}

Finally, whenever we encounter a closing bracket, we want to create a new molecule out of the nested expression we just collected, and appends its parts onto our list of parts:


else if (close) {
    let nested = molecule(stack.pop().formula).multiply(count);
    _parts = _parts.concat(nested.parts());
}

Now we’re properly processing sub-expressions and our tests return to green.

index.js

... let matches; - let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; + let stack = []; + let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g; while (matches = regex.exec(formula)) { - let [_, __, square, part, count] = matches; + let [_, __, open, close, part, count] = matches; count = parseInt(count) || 1; - if (square) { - let nested = molecule(square).multiply(count); + if (open) { + stack.push({ + formula: "" + }); + } + else if (close) { + let nested = molecule(stack.pop().formula).multiply(count); _parts = _parts.concat(nested.parts()); } + else if (stack.length) { + stack[stack.length - 1].formula += part + count; + } else {

test/index.js

... ]); + expect(molecule("[H]2O[H]").parts()).to.deep.equal([ + { part: "H", count: 2 }, + { part: "O", count: 1 }, + { part: "H", count: 1 } + ]); });

Alternate Groupings and Bug Fixes

Last up on our feature list is the ability to use parentheses and curly brackets in addition to square brackets as sub-expression dividers.

We can add a test for this:


expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4});

The most straight-forward way to accomplish this is to just replace all "{" and "(" characters with "[", and "}" and ")" characters with "]" in our formula:


formula.replace(/[{(]/g, "[").replace(/[})]/g, "]");

Unfortunately, this test also exposed another bug in our solution. it looks like counts on sub-expressions aren’t being applied to other nested expressions.

We can fix this by giving our stack a bit more state. In addition to building up our current expressions’s formula as we go, we also want to keep track of any nested molecules we encounter, so we can multiply them by our count.

After making this change, all of our tests pass.

index.js

... + formula = formula.replace(/[{(]/g, "[").replace(/[})]/g, "]"); + let matches; ... stack.push({ - formula: "" + formula: "", + molecules: [] }); ... else if (close) { - let nested = molecule(stack.pop().formula).multiply(count); - _parts = _parts.concat(nested.parts()); + let popped = stack.pop(); + popped.molecules.push(molecule(popped.formula)); + popped.molecules.forEach((molecule) => { + molecule.multiply(count); + }); + if (!stack.length) { + popped.molecules.forEach((molecule) => { + _parts = _parts.concat(molecule.parts()); + }); + } + else { + let last = stack[stack.length - 1]; + last.molecules = last.molecules.concat(popped.molecules); + } }

test/index.js

... expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 }); + expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4}); + expect(parseMolecule("{[Co(NH3)4(OH)2]3Co}(SO4)3")).to.deep.equal({ + Co: 4, + N: 12, + H: 42, + O: 18, + S: 3 + }); });

Final Thoughts

This was a doozy of a kata.

All in all, there are a lot of ideas flying around here and I fear I didn’t explain myself as well as I could have. Balancing trying to keep things as concise, while staying true to the intent of explaining every change is definitely a challenge.

The limitations of regular expressions are explored in this problem was well. Initially, a totally regex based solution seemed like a good approach, but the nested structure of our data made regexes difficult to use.

When using regexes, it’s always important to keep in mind that there are entire classes of problem where they’re simply not suitable.

Finally, my final solution isn’t as clean and clear as I would have liked. By the time I landed on a working solution, I feared that more refactor commits would have muddied the water. In future problems, I need to focus on refactoring sooner and more often.