You can find EmParsec on GitHub: https://github.com/mavnn/EmParsec
Type providers, by their very nature, tend to access data external to the .net ecosystem. It can also be very awkward technically to make use of dependencies during the actual type generation process.
This is rather a pity, because accessing all of that external data is much nicer and easier when you have a decent parser to do it with. And F# has very, very nice parser support via the FParsec library. Instead, many (most?) type providers end up creating mini-one shot parsers which can be a bit slow to write and don't tend to have features that come for free in a more complete solution such as nice error reporting.
Writing yet an other parser (YAOP) this week I decided that enough was enough. What I needed was a shared resource that people could pool improvements for that could be easily embedded in projects like type providers were it isn't desirable (or sometimes possible) to take external binary dependencies.
So I built it.
EmParsec is a single file parser combinator "library", inspired by both FParsec library and Scott's excellent series on building parser combinators.
It consists of a single fs file that can be loaded in the editor of your choice without any requirement for a project file or similar. When you want to use it, you can just reference it as a Paket GitHub dependency (which you'll be wanting to do for the ProvidedTypes.fs file if you're creating a type provider anyway) or even just copy the file across.
If you are compiling EmParsec into a larger project, it marks itself as "internal" so that you don't pollute the end users name space, and so that if two projects you reference have embedded different versions of EmParsec there are no collisions.
How do I use it?
So, you've added EmParsec.fs to your project (manually or with Paket) and now you're wondering how to use it. Let's build some simple examples.
Matching an exact string
Let's start with something simple: what if I just want to match a string?
Parser combinator libraries allow you to combine parsers from simpler parsers (hence the name), but in this case pstring
(the 'p' is there to avoid clashing with the existing string
function) is provided for us by EmParsec.
Let's try it out!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Not bad! It even marks the unexpected character in the error output.
Unfortunately:
1 2 |
|
That probably isn't the behaviour you were hoping for. There's still input left after the parser has finished, but that isn't being seen as an error. EmParsec includes the eof
parser for just this type of occasion - a parser that checks the input is exhausted.
So we want a parser that parses "thing" and then ends.
Let's go:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
That's more like it. The only issue now is that we've combined two parser, so we're getting back a tuple of two results.
A simple tweak tells EmParsec to throw away the unit result returned by eof
.
1 2 3 4 |
|
"Impressive," I hear you say: "You can parse static strings!"
Parsing a simple template language
You have a point. Let's tackle a simple template language. You know the kind of thing:
Welcome {name}! Please spend money here.
That kind of thing. I'm going to start building up a set of helper parsers for this, applying some type annotations both to make the example code clearer and to avoid the value restriction errors that crop up until you actually use the parsers (those occur because these parsers can carry generic user state, but we're not going to go into using that here).
We have two "types" of token that can exist in our template language: values to be replaced, and text to be preserved. Let's start by creating a union type to contain our parse results:
1 2 3 |
|
Then, we'll have a parser that will parse individual characters which are not an opening bracket:
1 2 |
|
satisfy
is a function built into EmParsec which takes a predicate for whether or not it will consume the next character in the input stream. The final string argument is a name for the parser, which will be used in error messages.
Then we'll use that parser to create one that consumes as many "not open bracket" characters as it can, combines them into a string and then counts that string as a Text
part.
1 2 3 4 5 6 7 8 |
|
There's a new function here and a couple of new operators (all taken from FParsec, by the way). |>>
is a map operator; it allows you to transform the result of a parser and then rewrap everything back up into a new parser. This is really at the heart of the power of parser combinator libraries.
The <?>
operator is much simpler: it allows you to name a parser rather than its name being some combination of the parsers it's built of.
The many1
function says "match one or more instances of the parser that follows". There is also a many
, which matches 0 or more repeats.
So that's good - we can capture the text in between our replacable values. Let's go with a parser for the bracketed value names next!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
So we now have parsers for white space and our "valueName" (which we are saying must be at least one character long, and can consist of any character which is not whitespace or a closing curly bracket). We can then use pchar ("parse char") and whitespace to allow for minor variations in syntax (some people like {name}
, others like { name }
).
Finally we build our value parser, using the between
function, which does pretty much what you'd expect: it takes an opening parser, a closing parser, and captures what's in between with third parser.
Our final step is just to combine our parsers for value and text sections. We want to capture "many" of one or the other, until we run out of input. We'll put an explicit eof
on there as well, otherwise things like (for example) an unclosed }
at the end of the string will not show up as an error - the parser will just stop at the character before the opening {
as the last matching input.
Our final parser introduces the <|>
(orElse) operator, and looks like this:
1 2 3 4 |
|
Let's try it out!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
A couple of helpers: showTemplate
knows how to build up a string from a list of template parts and a value map, run'
is just a simple wrapper around run
that throws if parsing is not successful.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
And finally our templates in action. You can see that even with a simple parser like this, it's already reaching a complexity that would be painful to match with a hand rolled creation.
If you want to know more about parser combinators, and especially how to use them to create recursive grammars, do check out the FParsec documentation which is excellent. It is also more complete and much more performant than EmParsec.
But if you need a small, single file parser where performance is not the primary concern - maybe EmParsec is your friend. Anyone who wants to join in making it better is more than welcome! Of particular note is that EmParsec does not yet support controlling when backtracking does or doesn't happen (it will always backtrack) which can make for some pretty confusing error messages.