Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture only the values of referenced variables in formula namespace #25

Closed
njsmith opened this issue Oct 6, 2013 · 4 comments
Closed

Comments

@njsmith
Copy link
Member

njsmith commented Oct 6, 2013

Right now when creating a formula, we capture the namespace itself.

This can pin large variables in memory, and presents an obstacle to serializiing model designs (#26).

What we should do is to figure out which variables from the enclosing namespace are actually used, and then capture only those.

The klugey way to do this is to observe which variables are accessed when evaluating the formula the first time, and then save only those.

The more principle, reliable, and modular way is to use ast to parse the formula, and then extract all bare variable references. (Or maybe we should just re-use the token-based implementation of this.) Those which are found not in the data, and not in the builtins, but in the environment, should get stashed to use for actual evaluation.

This isn't just an optimization, it does produce a user-visible effect: if some variable name referenced in the formula is rebound after the formula is created, then previously the new value would be used in future predictions, but after this change, the old value will continue to be used.

This is unaesthetic (I think PHP's so-called "closures" work this way?), but in our case it's actually for the best -- ideally we'd save a read-only snapshot of the environment, period, which is not tractable. But this moves us slightly in that direction, so, okay.

@chrish42
Copy link
Contributor

Any plans from you to work on this soon, or otherwise do you think this is something that someone else could take on (with some coaching from your part)? I'd really like to use patsy, but I need DesignMatrixBuilders to be pickleable.

@njsmith
Copy link
Member Author

njsmith commented Apr 14, 2015

Based on discussion with @chrish42 at PyCon, I think the basic idea is: Inside design_matrix_builders, which "sniffing" the data, we should check which global variables are referenced inside each factor (e.g. "x" references x, and "np.log(y)" references np and y), and we should make a note for each variable whether it is found in (a) the 'training data', (b) the captured environment, or (c) the builtins. And in case (b) (and possibly also c) we should write down what the value actually was. (This would all go into the factor state.) Then later, at the build_design_matrices stage, we should only do lookups using this state, not the actual environment.

(One side-effect is that this would mean that some corner cases would be handled in a nicer way. E.g. if you have both a global variable called "x" and a variable in your data frame called "x", then right now the initial building will look at the x in the data frame -- but if later when predicting you accidentally forget to give an x variable, you will silently get the global variable x instead. With this change, it will instead error out, insisting that anything that was in the data frame can only come from the data frame.)

A slightly trickier question is how to actually rearrange to accomplish this in practice. Right now the eval environment is captured inside the parsing code, and handed to each EvalFactor. In this new world, the eval environment isn't referenced until design_matrix_builders, and should not be part of the EvalFactor object state. I guess what we have to do then is just take the eval environment out of EvalFactor, and make it an argument to design_matrix_builders, and extend the factor interface so that the evaluation state gets passed in to the various memorization methods. (This is slightly unfortunate since otherwise design_matrix_builders has no coupling at all to EvalFactor -- e.g. right now if you use LookupFactor everywhere then you can do formula evaluation without ever capturing an environment at all. But I guess this doesn't matter too much in practice -- if you want to do formula evaluation without an environment you could just pass {} or whatever and it will be ignored and everything will work fine.)

NB: the data on which factors refer to which variables in which namespaces would also make it trivial to fix #13, and enables interesting new features like #64.

@chrish42
Copy link
Contributor

I've started working on this in my repo, on branch capture-vars-not-env.

@chrish42 chrish42 mentioned this issue Apr 16, 2015
7 tasks
@chrish42
Copy link
Contributor

I think this can be closed now...

@njsmith njsmith closed this as completed May 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants