This post is part of a series! If you haven't already, check out the introduction so you know what's going on.
It's fairly obvious how dependencies work in Shake when all of the files are known while you're writing your rules.
And if a build rule creates a single file matching a pattern, or even a known set of files based on a pattern, that's pretty simple too: just add a rule (%> for building a single file, &%> for a list), and then when you need one of the outputs, Shake knows how to make sure it's up to date.
Life becomes a little more interesting when you have a rule that takes multiple inputs (detected at run time) and creates multiple outputs (depending on what was found).
Let's look at an example. We're writing a computer game, and the game designers want to be able to quickly specify new types of characters that can exist. The designers and developers settle on a compromise; they'll use Yaml with a few simple type names the developers will teach the designers.
So the designers start churning out character types, which look like this:
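Something along these lines, say a fighter.yaml (the field names here are illustrative, not the post's exact schema):

```yaml
# fighter.yaml - a designer-authored character type (illustrative fields)
name: Fighter
health: 100
weapon: Sword
```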
or this:
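Or a shorter one, say a rogue.yaml (again illustrative):

```yaml
# rogue.yaml - a smaller character definition (illustrative fields)
name: Rogue
weapon: Dagger
```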
The developers, on the other hand, want to be able to consume nice safe Haskell types like so:
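For example, the generated code might take a shape like this (a sketch; the type and field names are my own, not the post's actual listing):

```haskell
-- Illustrative generated output: a sum type for the "simple type names"
-- the developers taught the designers, a record per character, and one
-- value per Yaml file, so a renamed or deleted field breaks compilation.
data Weapon = Sword | Dagger | Bow
  deriving (Show, Eq)

data CharacterType = CharacterType
  { characterName :: String
  , maxHealth     :: Int
  , weapon        :: Weapon
  } deriving (Show, Eq)

fighter :: CharacterType
fighter = CharacterType
  { characterName = "Fighter"
  , maxHealth     = 100
  , weapon        = Sword
  }
```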
And we want our code to break at compile time if, for any reason, the Yaml files get changed and we start relying on things that no longer exist. So we're going to set up a build rule that builds a directory full of nice type safe code from a directory full of nice concise and easy to edit Yaml.
Let's see what we can come up with to build this safely. Our first shot at a replacement build rule looks like this:
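A minimal sketch of the shape such a rule could take (the src and _build/generated directory layout is an assumption of this sketch):

```haskell
-- Sketch of the main build rule. We need the "stamp" file first so that
-- the generated sources exist before we list the directory contents.
"_build/main" %> \out -> do
  need ["_build/haskell_generation.log"]
  srcs <- getDirectoryFiles "" ["src//*.hs", "_build/generated//*.hs"]
  need srcs
  cmd_ "ghc" ["-o", out] srcs
```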
This looks very similar to the previous build rule, with just the addition of a few lines to account for the generated files. The only slightly quirky moment is need ["_build/haskell_generation.log"]; we need this because Shake has no concept of a rule for a directory. Instead, the rule for _build/haskell_generation.log creates all of our generated files, so that we can then "get" them on the line below. We also need to add the rules for _build/haskell_generation.log and for files in the generated directory, to make sure they're generated before they are used.
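These two extra rules might be sketched like so (same assumed layout; the characters directory for the Yaml inputs is also an assumption):

```haskell
-- The "stamp" rule: regenerate whenever the Yaml inputs change.
"_build/haskell_generation.log" %> \out -> do
  yamls <- getDirectoryFiles "" ["characters//*.yaml"]
  need yamls
  createHaskellFiles yamls "_build/generated"  -- our generator logic
  writeFile' out (unlines yamls)               -- record what we saw

-- Any generated file is "built" by making sure the generator has run.
"_build/generated//*.hs" %> \_ ->
  need ["_build/haskell_generation.log"]
```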
createHaskellFiles here is the logic that writes the generated files, but it could just as easily be some external tool called via a script.
Then you run shake, and … the code works! Awesome, we're done, right?
Well, maybe not. The first sign something might be wrong is in the docs. The docs for getDirectoryFiles state: "As a consequence of being tracked, if the contents change during the build (e.g. you are generating .c files in this directory) then the build will not reach a stable point, which is an error - detected by running with --lint. You should normally only call this function returning source files."
That doesn't sound good. Maybe we should check the behaviour of our code.
Let's delete one of the generated files, and run Shake again to check it detects that:
[Shake output: the deleted file is detected and the generation rule runs again to recreate it]
Whew! Maybe we're okay. We'll just run it once more:
[Shake output: ghc is invoked again, despite nothing having changed]
Oh. That's not good: nothing has changed, so why have we invoked ghc?
Here we hit something very, very important to understand about getDirectoryFiles (and other Shake Rules and Oracles): they only run once per invocation of Shake. Let's step through the implications for each of the build runs.
Run 1 (from clean)
- We ask for the _build/main executable to be built; it doesn't exist, so the Action in the Rule runs
- Among other things, we ask for _build/haskell_generation.log; it also doesn't exist, so we run its Action. Several files (let's say, fighter.hs and rogue.hs) get written to the generated file directory
- We call getDirectoryFiles, telling Shake that we depend on the generated files directory having fighter.hs and rogue.hs and no other Haskell files
- We need the content of all the source files and build the executable
Run 2 (deleted rogue.hs)
- We ask for the _build/main executable to be built; it exists, so Shake starts checking whether its dependencies have changed
- Among other things, it calls getDirectoryFiles on the generated file directory, and records that there's now only fighter.hs in there: the file list has changed
- _build/main has changed dependencies, so we run its Action
- During that Action, getDirectoryFiles is called on the generated file directory. It has already been run (see above), so Shake does not run it again: it records that only fighter.hs is depended on, even though rogue.hs has now been recreated
- We build the executable
Run 3 (no change)
- We ask for the _build/main executable to be built; it exists, so Shake starts checking whether its dependencies have changed
- Among other things, it calls getDirectoryFiles on the generated file directory, and records that there's now both fighter.hs and rogue.hs in there: the file list has changed again!
- _build/main has changed dependencies, so we run its Action
In fact, it turns out that if we turn on linting in Shake it will tell us about this problem:
[Shake --lint output: an error reporting that the directory contents changed during the build after getDirectoryFiles tracked them]
Back to the drawing board
So: what do we want from our rules here? Let's actually put down the end effect we're aiming for:
- If any Yaml files are added, removed or changed, we should regenerate
- If any of the generated files have been removed or changed, we should regenerate
- If a generated file is needed, we should check we have an up-to-date set of generated files
- If the input and output files are unchanged since the last run, we should not regenerate
We can't call getDirectoryFiles on the generated Haskell files, for the reason given above; and we can't call need on the Haskell files after generating them in the _build/haskell_generation.log rule to rebuild if they change, because they themselves need the Haskell generation.
We're going to have to break out some bigger guns.
Firstly, we're going to want to encode some custom logic for when to rebuild based on the environment. We model this in Shake by setting up an "Oracle"; this allows us to store a value in the Shake database, and if it changes between one run and the next, anything which depends on it is considered dirty and needs rebuilding.
Secondly, _build/haskell_generation.log is going to stop being just a "stamp" file to get around the fact that Shake doesn't know about directories, and we're going to start storing some useful information in there.
Of course, we still need to be careful: just like running getDirectoryFiles, our Oracle is only going to be evaluated once for the whole run of Shake, and it will be evaluated to check dependencies before the actual rules which depend on it are run.
Let's go with a model where we assign each run of the generator a unique ID, which we'll use in our Oracle and stash in our output file so that we can return the same ID if nothing has changed on disk.
We'll create some reusable code to do this; we'll take a list of patterns for generated files this rule controls, an output file, and an action to generate the files. I'll show you the code in full, and there's some explanation underneath:
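A hedged sketch of the shape such reusable code could take. The names (GeneratorRun, computeRunId, generatedFiles) and the log-file layout (run ID on the first line, patterns on the second, then one generated file per line) are assumptions of this sketch, not the post's exact code:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE TypeFamilies #-}

import Development.Shake
import Development.Shake.Classes (Binary, Hashable, NFData)
import Data.List (sort)
import Data.Time.Clock (getCurrentTime)
import System.Directory (doesFileExist)

-- Boilerplate for storing the unique identifier in the Shake database:
-- the Oracle question is keyed by the log file; the answer is a run ID.
newtype GeneratorRun = GeneratorRun FilePath
  deriving (Show, Eq, Hashable, Binary, NFData)
type instance RuleResult GeneratorRun = String

-- Work out the run ID: if the log file exists and the files on disk
-- still match the list recorded in it, reuse the old ID; otherwise
-- mint a fresh one (any unique value will do - here, the clock).
computeRunId :: FilePath -> IO String
computeRunId logFile = do
  exists <- doesFileExist logFile
  if not exists
    then freshId
    else do
      oldId : patternLine : oldFiles <- lines <$> readFile logFile
      current <- getDirectoryFilesIO "" (read patternLine)
      if sort oldFiles == sort current then return oldId else freshId
  where
    freshId = show <$> getCurrentTime

-- Register the rules for one generator: the file patterns it controls,
-- its log file, and the Action that actually writes the files.
generatedFiles :: (GeneratorRun -> Action String)
               -> [FilePattern] -> FilePath -> Action () -> Rules ()
generatedFiles oracle patterns logFile generate = do
  -- A generated file is up to date as soon as the log file is.
  patterns |%> \_ -> need [logFile]

  logFile %> \out -> do
    runId <- oracle (GeneratorRun out)   -- dirty iff the ID changed
    liftIO $ removeFiles "" patterns     -- no stale outputs survive
    generate                             -- the user's generation logic
    files <- liftIO $ getDirectoryFilesIO "" patterns
    -- Record the run ID, the patterns and what was created, so the
    -- next run can tell whether anything changed behind our back.
    writeFile' out (unlines (runId : show patterns : files))
```

Note that the log file is read and the directory listed with plain IO (getDirectoryFilesIO) rather than the tracked getDirectoryFiles, which is exactly what lets us avoid the instability from before.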
The file starts with some boilerplate code needed for storing the unique identifier in the Shake database.
Then we have the logic for creating a run ID:
- Check if the output file already exists.
- If it does:
  - we'll read the last UID and the list of files created from it
  - we'll read the list of files on disk that match the generated patterns
  - if the two lists don't match, a new UID is created
  - if they do match, we return the same UID as last time
- If it doesn't, we'll create a new UID
That means that if the list of generated files has changed, we know we need to run the generator.
Then we have a rule that matches all of the patterns for files which will be generated, and depends on the output file.
Finally, we have the rule for the output file:
- acquires a run ID
- deletes any files that match the generated patterns (this ensures that we don't end up with "stale" generated files that no longer have a creator)
- runs the generation Action the user provided
- and finally writes the output file with the run ID used and the list of files created
This completes the loop and lets us check next time around if the list of generated files has changed.
What does it look like to use? Something like this:
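Assuming a reusable helper shaped like the generatedFiles/GeneratorRun sketch (the names are mine; createHaskellFiles is the post's generator hook, with an assumed signature), usage could look like:

```haskell
-- Wiring it up: the Oracle is added once for the whole build; each
-- generator then gets its own call to generatedFiles.
buildRules :: Rules ()
buildRules = do
  oracle <- addOracle $ \(GeneratorRun logFile) ->
    liftIO (computeRunId logFile)
  generatedFiles oracle
    ["_build/generated//*.hs"]       -- files this generator owns
    "_build/haskell_generation.log"  -- its log / stamp file
    (do yamls <- getDirectoryFiles "" ["characters//*.yaml"]
        need yamls                   -- regenerate when the Yaml changes
        createHaskellFiles yamls "_build/generated")
```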
We have to add the Oracle to our rules (only once, not per generator). Then we just call our reusable code, specifying the output file, the pattern of files that will be generated, and the logic to generate them (including the dependencies of the process).
We're nearly there, but we still have a problem. We called getDirectoryFiles on the Haskell source files in our Haskell compile build rule! It turns out that it's not just in the rules for the generated files themselves that you need to be careful: you can't reliably call getDirectoryFiles on generated files anywhere in your build specification.
We can get around that in two ways. The first is to separate depending on source files (calling getDirectoryFiles with a pattern that doesn't include any of the generated files) from depending on generated files, and add a helper like the one below to get the list of files that have been generated:
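Such a helper might look like this (a sketch; it assumes the log file stores bookkeeping on its first two lines and one generated file per line after that):

```haskell
-- Building the log file guarantees the generator has run; then the
-- log itself tells us what was made, with no directory listing needed.
getGeneratedFiles :: FilePath -> Action [FilePath]
getGeneratedFiles logFile = do
  need [logFile]
  contents <- readFile' logFile
  let files = drop 2 (lines contents)  -- skip run ID and patterns
  need files
  return files
```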
Usefully, this also ensures that if you ask for the list of generated files the file generation rule will be called!
Alternatively, if we're happy that all of our input files have now been created, we can often get our tools themselves to tell us what they used. Shake allows us to call the needed function here to record a dependency that we've already used. Be aware, though, that this will error if anything changes the needed file after you used it!
As an example, we can combine ghc's dependency generation flag with Shake's makefile parser to rewrite our Haskell rule as follows:
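A sketch of what this could look like. The paths, the getGeneratedFiles helper name, and the use of withTempFile are assumptions; neededMakefileDependencies comes from Development.Shake.Util:

```haskell
-- Compile rule: getDirectoryFiles only ever sees true source files;
-- generated files come via the log, and what the compiler actually
-- read is recorded afterwards from ghc's own dependency output.
"_build/main" %> \out -> do
  srcs <- getDirectoryFiles "" ["src//*.hs"]  -- source files only
  need srcs
  gens <- getGeneratedFiles "_build/haskell_generation.log"
  cmd_ "ghc" ["-o", out] (srcs ++ gens)
  -- Ask ghc to write the dependencies it used as a makefile, then
  -- record them with Shake after the fact.
  withTempFile $ \depFile -> do
    cmd_ "ghc" ["-M", "-dep-makefile", depFile] (srcs ++ gens)
    neededMakefileDependencies depFile
```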
This runs the compile process, and then calls ghc, telling it to write all of the dependencies it used to a temporary makefile. We then use neededMakefileDependencies to record that we did use those files, even if we didn't know we were going to before building. Just make sure that you've needed anything that the build system needs to create or update before you run your compile action!