Capture only the values of referenced variables in formula namespace #25

njsmith · 2013-10-06T13:35:07Z

Right now when creating a formula, we capture the namespace itself.

This can pin large variables in memory, and presents an obstacle to serializiing model designs (#26).

What we should do is to figure out which variables from the enclosing namespace are actually used, and then capture only those.

The klugey way to do this is to observe which variables are accessed when evaluating the formula the first time, and then save only those.

The more principle, reliable, and modular way is to use ast to parse the formula, and then extract all bare variable references. (Or maybe we should just re-use the token-based implementation of this.) Those which are found not in the data, and not in the builtins, but in the environment, should get stashed to use for actual evaluation.

This isn't just an optimization, it does produce a user-visible effect: if some variable name referenced in the formula is rebound after the formula is created, then previously the new value would be used in future predictions, but after this change, the old value will continue to be used.

This is unaesthetic (I think PHP's so-called "closures" work this way?), but in our case it's actually for the best -- ideally we'd save a read-only snapshot of the environment, period, which is not tractable. But this moves us slightly in that direction, so, okay.

The text was updated successfully, but these errors were encountered:

chrish42 · 2014-11-28T19:50:36Z

Any plans from you to work on this soon, or otherwise do you think this is something that someone else could take on (with some coaching from your part)? I'd really like to use patsy, but I need DesignMatrixBuilders to be pickleable.

njsmith · 2015-04-14T23:06:05Z

Based on discussion with @chrish42 at PyCon, I think the basic idea is: Inside design_matrix_builders, which "sniffing" the data, we should check which global variables are referenced inside each factor (e.g. "x" references x, and "np.log(y)" references np and y), and we should make a note for each variable whether it is found in (a) the 'training data', (b) the captured environment, or (c) the builtins. And in case (b) (and possibly also c) we should write down what the value actually was. (This would all go into the factor state.) Then later, at the build_design_matrices stage, we should only do lookups using this state, not the actual environment.

(One side-effect is that this would mean that some corner cases would be handled in a nicer way. E.g. if you have both a global variable called "x" and a variable in your data frame called "x", then right now the initial building will look at the x in the data frame -- but if later when predicting you accidentally forget to give an x variable, you will silently get the global variable x instead. With this change, it will instead error out, insisting that anything that was in the data frame can only come from the data frame.)

A slightly trickier question is how to actually rearrange to accomplish this in practice. Right now the eval environment is captured inside the parsing code, and handed to each EvalFactor. In this new world, the eval environment isn't referenced until design_matrix_builders, and should not be part of the EvalFactor object state. I guess what we have to do then is just take the eval environment out of EvalFactor, and make it an argument to design_matrix_builders, and extend the factor interface so that the evaluation state gets passed in to the various memorization methods. (This is slightly unfortunate since otherwise design_matrix_builders has no coupling at all to EvalFactor -- e.g. right now if you use LookupFactor everywhere then you can do formula evaluation without ever capturing an environment at all. But I guess this doesn't matter too much in practice -- if you want to do formula evaluation without an environment you could just pass {} or whatever and it will be ignored and everything will work fine.)

NB: the data on which factors refer to which variables in which namespaces would also make it trivial to fix #13, and enables interesting new features like #64.

chrish42 · 2015-04-15T16:02:33Z

I've started working on this in my repo, on branch capture-vars-not-env.

chrish42 · 2015-05-20T02:19:18Z

I think this can be closed now...

njsmith mentioned this issue Oct 6, 2013

Make DesignMatrixBuilders pickleable and saveable #26

Open

njsmith mentioned this issue Apr 14, 2015

Add "partial prediction" (initially at the factor level) #64

Open

chrish42 mentioned this issue Apr 16, 2015

Capture vars not env #66

Merged

7 tasks

njsmith closed this as completed May 20, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Capture only the values of referenced variables in formula namespace #25

Capture only the values of referenced variables in formula namespace #25

njsmith commented Oct 6, 2013

chrish42 commented Nov 28, 2014

njsmith commented Apr 14, 2015

chrish42 commented Apr 15, 2015

chrish42 commented May 20, 2015

Capture only the values of referenced variables in formula namespace #25

Capture only the values of referenced variables in formula namespace #25

Comments

njsmith commented Oct 6, 2013

chrish42 commented Nov 28, 2014

njsmith commented Apr 14, 2015

chrish42 commented Apr 15, 2015

chrish42 commented May 20, 2015