Capture only the values of referenced variables in formula namespace #25
Comments
Do you have any plans to work on this soon? Otherwise, do you think this is something that someone else could take on (with some coaching on your part)? I'd really like to use patsy, but I need DesignMatrixBuilders to be pickleable.
Based on discussion with @chrish42 at PyCon, I think the basic idea is: Inside

(One side-effect is that this would mean that some corner cases would be handled in a nicer way. E.g. if you have both a global variable called "x" and a variable in your data frame called "x", then right now the initial building will look at the x in the data frame -- but if later when predicting you accidentally forget to give an x variable, you will silently get the global variable x instead. With this change, it will instead error out, insisting that anything that was in the data frame can only come from the data frame.)

A slightly trickier question is how to actually rearrange to accomplish this in practice. Right now the eval environment is captured inside the parsing code, and handed to each

NB: the data on which factors refer to which variables in which namespaces would also make it trivial to fix #13, and enables interesting new features like #64.
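The corner case in the parenthetical above (a name that exists both in the data frame and as a global) can be illustrated with a minimal sketch. This is not patsy's code: `split_origins` and `evaluate_strict` are hypothetical helpers, and plain dicts stand in for the data frame and the enclosing namespace. The idea is only that once a name has been resolved from the data, later evaluations refuse to fall back to the environment.

```python
def split_origins(names, data, env):
    """Record, at build time, where each referenced name was found.

    Hypothetical helper: data takes priority over the environment.
    """
    from_data = {n for n in names if n in data}
    from_env = {n: env[n] for n in names if n not in data and n in env}
    return from_data, from_env


def evaluate_strict(code, from_data, from_env, new_data):
    """Evaluate code, insisting that data names come from new_data."""
    missing = sorted(n for n in from_data if n not in new_data)
    if missing:
        raise KeyError(f"variables must be supplied in the data: {missing}")
    namespace = dict(from_env)
    namespace.update({n: new_data[n] for n in from_data})
    return eval(code, {}, namespace)


# "x" exists both as a global and as a data column.
env = {"x": 100, "offset": 1}
train_data = {"x": 2}
from_data, from_env = split_origins({"x", "offset"}, train_data, env)

ok = evaluate_strict("x + offset", from_data, from_env, {"x": 5})

# Forgetting "x" at prediction time raises instead of using the global x.
try:
    evaluate_strict("x + offset", from_data, from_env, {})
    fell_back_silently = True
except KeyError:
    fell_back_silently = False
```

Here the global `x` is never captured at all, so there is nothing to fall back to silently.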
I've started working on this in my repo, on branch capture-vars-not-env.
I think this can be closed now...
Right now when creating a formula, we capture the namespace itself.
This can pin large variables in memory, and presents an obstacle to serializing model designs (#26).
What we should do is to figure out which variables from the enclosing namespace are actually used, and then capture only those.
The klugey way to do this is to observe which variables are accessed when evaluating the formula the first time, and then save only those.
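A rough sketch of that observe-on-first-evaluation approach, assuming the formula term is a plain Python expression and that the data and the enclosing namespace are dict-like. `RecordingNamespace` and `evaluate_and_capture` are hypothetical names for illustration, not part of patsy.

```python
from collections.abc import Mapping


class RecordingNamespace(Mapping):
    """Read-only view over data and env that records env lookups."""

    def __init__(self, data, env):
        self._data = data
        self._env = env
        self.used_from_env = {}

    def __getitem__(self, name):
        if name in self._data:
            return self._data[name]
        # A KeyError here lets eval() fall through to the builtins.
        value = self._env[name]
        self.used_from_env[name] = value
        return value

    def __iter__(self):
        return iter(set(self._data) | set(self._env))

    def __len__(self):
        return len(set(self._data) | set(self._env))


def evaluate_and_capture(code, data, env):
    """Evaluate once and return (result, env variables actually touched)."""
    namespace = RecordingNamespace(data, env)
    result = eval(code, {}, namespace)
    return result, namespace.used_from_env


env = {"scale": 10, "big_unused": list(range(1000))}
data = {"x": [1, 2, 3]}
result, captured = evaluate_and_capture("scale * sum(x)", data, env)
```

Only `scale` ends up in `captured`, so `big_unused` is no longer pinned. The weakness is that this sees only the names touched on that particular evaluation, which is part of why it is the klugey option.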
The more principled, reliable, and modular way is to use the ast module to parse the formula, and then extract all bare variable references. (Or maybe we should just re-use the token-based implementation of this.) Those which are found not in the data, and not in the builtins, but in the environment, should get stashed to use for actual evaluation.

This isn't just an optimization; it does produce a user-visible effect: if some variable name referenced in the formula is rebound after the formula is created, then previously the new value would be used in future predictions, but after this change, the old value will continue to be used.
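As a minimal sketch of the ast-based approach and the snapshot behaviour it produces, assuming each formula term is a Python expression and using plain dicts for the data and the environment (`referenced_names` and `capture_variables` are hypothetical names, not patsy's API):

```python
import ast
import builtins


def referenced_names(code):
    """Return every bare name that is read anywhere in the expression."""
    tree = ast.parse(code, mode="eval")
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


def capture_variables(code, data, env):
    """Stash names that are not in data, not builtins, but are in env."""
    captured = {}
    for name in referenced_names(code):
        if name in data or hasattr(builtins, name):
            continue
        if name in env:
            captured[name] = env[name]
    return captured


env = {"scale": 10, "big_unused": list(range(1000))}
data = {"x": [1, 2, 3]}
formula_code = "scale * sum(x)"

captured = capture_variables(formula_code, data, env)

# Rebinding after capture does not affect later evaluation:
env["scale"] = 1000
result = eval(formula_code, {}, {**captured, **data})
```

`captured` holds only `scale`, and `result` is computed with the old value 10 rather than the rebound 1000, which is the user-visible change described above.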
This is unaesthetic (I think PHP's so-called "closures" work this way?), but in our case it's actually for the best -- ideally we'd save a read-only snapshot of the environment, period, which is not tractable. But this moves us slightly in that direction, so, okay.