Make DesignMatrixBuilders pickleable and saveable #26

Open
njsmith opened this issue Oct 6, 2013 · 24 comments

Comments

@njsmith
Member

njsmith commented Oct 6, 2013

Use cases:

  • It should be possible to pickle a DesignMatrixBuilder (and/or DesignInfo, same issue)
  • Checking if two designs are the same: this comes up for rERPy -- it's only valid to form a grand average across multiple analyses if the underlying regressions were the same. In particular it would be good to be able to check for subtle gotchas like use of center(...) with different means across the different analyses.

The easy part of this is reviewing the inner structure of DesignMatrixBuilder (column builders and all that) to make sure it's sensible, and similarly for factor state dicts.

The more complicated part is capturing the evaluation environment in a reasonable way.

Precondition: #25
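
To make the center(...) gotcha from the second use case concrete, here is a minimal sketch in plain Python. ToyCenter and same_design are hypothetical names made up for illustration, not patsy APIs; the assumption is only that a centering transform memorizes a mean from the data it was first run on.

```python
# Toy illustration only: ToyCenter is a hypothetical stand-in, not patsy's center().
class ToyCenter:
    """Memorize the mean of the training data, then subtract it."""

    def __init__(self):
        self.mean = None

    def memorize(self, values):
        self.mean = sum(values) / len(values)

    def transform(self, values):
        return [v - self.mean for v in values]

    def state(self):
        # The memorized state that a design comparison would need to inspect.
        return {"mean": self.mean}


def same_design(formula_a, state_a, formula_b, state_b):
    """Two designs match only if both the formula and the memorized state agree."""
    return formula_a == formula_b and state_a == state_b


center_a, center_b = ToyCenter(), ToyCenter()
center_a.memorize([1.0, 2.0, 3.0])  # analysis A: mean 2.0
center_b.memorize([4.0, 5.0, 6.0])  # analysis B: mean 5.0

formula = "y ~ center(x)"
print(same_design(formula, center_a.state(), formula, center_b.state()))  # False
print(center_a.transform([2.0]), center_b.transform([2.0]))  # [0.0] [-3.0]
```

The formula strings are identical, yet the same input value maps to different columns, which is why comparing formulas alone would not be enough.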

@egrublyte

Hi there,

I was wondering if there is any news regarding this issue.
I also saw some discussions in:
https://groups.google.com/forum/#!topic/pydata/kcy79nrcFf4
Are there any existing workarounds to overcome this issue?

Many thanks in advance!

@ghost

ghost commented Apr 15, 2015

Hey folks,
The dill module (https://pypi.python.org/pypi/dill) is able to pickle DesignMatrixBuilders.
Just tried it!

@njsmith
Member Author

njsmith commented Apr 15, 2015

Really? That's bad!

The problem isn't so much making it work at all, it's making it work in
such a way that a DesignMatrixBuilder pickled with patsy version X will
unpickle correctly on patsy version Y. If you use dill now it will probably
break later. (We should just have an error message, but I was lazy about
adding this because pickle didn't really work anyway.)

Also, the way dill "works" at the moment is basically to pickle the whole
universe just accidentally (b/c it will pickle every local variable that
happens to be sitting around in the environment where you used patsy, which
may well include your gigabyte sized dataset or whatever).

Nathaniel J. Smith -- http://vorpus.org
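
A rough sketch of the "pickle the whole universe" problem, using only the standard library. Everything here is hypothetical (build_naive, build_pruned, names_used and the dict layout are invented for illustration) and it says nothing about how patsy or dill work internally; it only shows why capturing every variable in scope is costly compared with keeping the names an expression actually uses.

```python
import pickle


def names_used(code):
    """Names referenced by the expression, found by compiling it."""
    return set(compile(code, "<formula>", "eval").co_names)


def build_naive(code, namespace):
    # Keeps every variable that happened to be in scope.
    return {"code": code, "namespace": dict(namespace)}


def build_pruned(code, namespace):
    # Keeps only the variables that the expression actually mentions.
    needed = names_used(code)
    kept = {k: v for k, v in namespace.items() if k in needed}
    return {"code": code, "namespace": kept}


scope = {
    "scale": 2.0,
    "dataset": list(range(100_000)),  # unrelated, but sitting in the same scope
}
naive = build_naive("x * scale", scope)
pruned = build_pruned("x * scale", scope)

naive_size = len(pickle.dumps(naive))
pruned_size = len(pickle.dumps(pruned))
print(naive_size, pruned_size)
print(sorted(pruned["namespace"]))  # ['scale']
```

The naive version drags the unrelated dataset into the pickle; the pruned version stores only scale.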

@chrish42
Contributor

Yeah. Don't use dill to pickle v0.3 DesignMatrixBuilder objects. Patsy v0.4 will support pickling. (The harder bits are done.)

@njsmith I guess now that #25 is taken care of, the remaining bit is adding __setstate__ and friends to the right classes, no? Or do you see more to it? Which classes should that include? Just DesignMatrixBuilder? DesignInfo also? Or more? I'll do a first pass at a branch to solve this (and submit a pull request) once I have a bit of a better idea of what you have in mind.
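
For readers wondering what "adding __setstate__ and friends" might look like, here is a minimal sketch of a versioned pickle payload. ToyDesignInfo, STATE_VERSION and the version-1 layout are all hypothetical; this is not the approach taken in any patsy branch, just one common pattern for keeping pickles readable across releases.

```python
import pickle


class ToyDesignInfo:
    """Hypothetical stand-in; not patsy's DesignInfo."""

    STATE_VERSION = 2

    def __init__(self, column_names, term_names):
        self.column_names = list(column_names)
        self.term_names = list(term_names)

    def __getstate__(self):
        # Write an explicit, versioned payload instead of the raw __dict__.
        return {
            "version": self.STATE_VERSION,
            "column_names": self.column_names,
            "term_names": self.term_names,
        }

    def __setstate__(self, state):
        version = state.get("version")
        if version == 1:
            # Pretend version 1 stored only column names; upgrade it.
            state = {
                "column_names": state["column_names"],
                "term_names": list(state["column_names"]),
            }
        elif version != self.STATE_VERSION:
            raise ValueError("cannot unpickle state version %r" % (version,))
        self.column_names = list(state["column_names"])
        self.term_names = list(state["term_names"])


info = ToyDesignInfo(["Intercept", "x"], ["Intercept", "x"])
restored = pickle.loads(pickle.dumps(info))
print(restored.column_names)  # ['Intercept', 'x']

# Simulate loading a payload written by an older release.
old = ToyDesignInfo.__new__(ToyDesignInfo)
old.__setstate__({"version": 1, "column_names": ["Intercept", "x"]})
print(old.term_names)  # ['Intercept', 'x']
```

Because the payload is an explicit dict rather than the instance's internals, the class can be refactored later without invalidating old pickles, as long as __setstate__ keeps an upgrade path.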

@chrish42
Contributor

Created pull request #67 to work on this.

njsmith added a commit that referenced this issue Jun 14, 2015
…ignInfo

Notable changes:
- DesignInfo now exposes lots more metadata about how exactly different
  factors and terms are coded.
- In fact, it exposes enough more metadata that you can now reconstruct
  a design matrix entirely from a DesignInfo, so DesignMatrixBuilder
  becomes redundant and is removed in favor of DesignInfo.
- DesignInfo's constructor is very different; in particular, removed the
  option of specifying terms as strings, which was only useful for
  interoperability with competing formula libraries. Four years later, no
  such competitors have appeared, so I can't be bothered to keep maintaining
  this. Will re-add later if someone actually wants to use it.

This version works and passes tests, but a bunch more tests need to be
added.

This fixes #61, and sets us up to implement pickling support (#26, #67).
@alexdamour

Hey folks (particularly @chrish42), how's the progress on this issue? I saw PR #67 is still open and has been stalled for a while. How much work is left here for this feature to be functional?

@chrish42
Contributor

The remaining step is to write unit tests for the serialization objects, to make sure that patsy doesn't (unknowingly) break support for formulas, etc. pickled with past versions. I've been kept busy with other things, but my goal is to get this finished before PyCon 2016, so I'm starting work on this piece in the coming weekends.

@njsmith
Member Author

njsmith commented Mar 16, 2016

@chrish42: that would be great! Of course feel free to ask for help as well if you are stalled -- maybe @alexdamour wants to help, for example ;-)

@louispotok louispotok mentioned this issue Apr 30, 2016
11 tasks
@chrish42
Contributor

chrish42 commented Jun 4, 2016

I'm using the PyCon sprints to start working on this again. As a first step, I'm cleaning up the description of pull request #67 to have an actual list of tasks that must be done to close this. My next step is to update the pull request with enough code so people can see what the approach would look like.

@yongcho822

@chrish42 hey guys, wondering if there's any new update on this front? thanks!

@chrish42
Contributor

chrish42 commented Jun 9, 2016

Sure. I've had a very productive sprint at PyCon. I know the "0 of 11 tasks complete" hasn't moved, but if you go look at the "code" tab of the pull request, you'll now see a pretty fleshed out testing framework for pickling. Once @njsmith is happy with that part, I can start implementing __getstate__ and __setstate__ for all the patsy objects that we want to support saving to disk (very easy to do) and adding a bunch of pickling testcases (easy too, with a proper framework for it).

If you want to follow the progress, have a look at the pull request, as this bug report will stay pretty quiet until we close it.

@elexira

elexira commented Dec 7, 2016

pleaseee fix this issue for the love of god !

NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help.
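
For context, one common way for a library to produce an error like this is to point a class's __getstate__ at a function that raises. The sketch below uses hypothetical names (refuse_pickling, ToyBuilder) and is an assumption about the general technique, not a copy of patsy's code.

```python
import pickle


def refuse_pickling(*args, **kwargs):
    # Called by pickle in place of the default state lookup.
    raise NotImplementedError("pickling is not supported for this object")


class ToyBuilder:
    """Hypothetical class that opts out of pickling."""

    __getstate__ = refuse_pickling


try:
    pickle.dumps(ToyBuilder())
except NotImplementedError as exc:
    message = str(exc)

print(message)  # pickling is not supported for this object
```

So the error is raised deliberately, before any state is written; it does not mean the pickle call itself was wrong.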

@datascientette

+1 Also need to pickle

@christang

+1

For anyone interested, I've made a fork of patsy and merged this branch into the fork. I needed to make changes to the test cases so they worked properly for me, but this is fully working for me and we are using it in production.

I'm continuing to follow this thread so that we can switch back to using the patsy main repository once the finalized solution gets merged in.

@saroele

saroele commented Mar 5, 2018

@christang I tried your branch, but it does not work for me. Can you confirm that this is supposed to work:

import pickle
from patsy import ModelDesc, Term, LookupFactor
response_term = [Term([LookupFactor('test')])]
# pickle writes bytes, so the file must be opened in binary mode
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(response_term, f)

I still get the NotImplementedError.

@christang

@saroele Thanks for the note. I believe this branch only adds pickling support for the design matrix and design info, so your other objects may still lack it. I can confirm that your code does not work for me either.

@ciberger

ciberger commented Sep 19, 2018

@christang, can you help me out with this, please? I'm currently getting a NotImplementedError even though I believe I'm doing the pickling right. Any advice would be greatly appreciated.

y, X = dmatrices(formula_like, df_model, return_type="dataframe")

with open(models_path + filename, 'wb') as file:
    pickle.dump(X.design_info, file)

@bertomartin

This is really needed, guys. What's blocking this implementation?

@njsmith
Member Author

njsmith commented May 8, 2019

@bertomartin Patsy maintenance is done on a purely-volunteer basis, and I haven't really had time to work on it (or even review PRs) in several years now. If someone needs this and has funding to spend on it, we could talk about some kind of consulting contract...

@bertomartin

bertomartin commented May 10, 2019

@njsmith thanks for the great work so far. Ok, I'll take a stab at it.

@insperatum

@bertomartin any news?

@petrhrobar

I have also tried looking into it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the design_info_ of a patsy step into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_coefficients(patsy_step, filename):
    """Attach the saved design_info_ to a patsy step."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


save_patsy(pipe['patsy'], "clf.h5")

Perhaps something simple like this?

However, it is still not working.

@matthewwardrop
Collaborator

Hi @petrhrobar . I recommend you check out formulaic if you are wanting support for pickling.

@kyle-pena-nlp

kyle-pena-nlp commented May 11, 2023

Here is a partial solution.

Before you first import patsy:

def fixed_factorinfo_repr(self, p, cycle):
    # Print the real factor state so that the repr can be eval'ed back.
    assert not cycle
    kwlist = [
        ("factor", self.factor),
        ("type", self.type),
        ("state", self.state),
    ]
    if self.type == "numerical":
        kwlist.append(("num_columns", self.num_columns))
    else:
        kwlist.append(("categories", self.categories))
    patsy.util.repr_pretty_impl(p, self, [], kwlist)

def fake_evalenvironment_repr(self):
    # Drops whatever namespaces the environment had captured.
    return "EvalEnvironment([])"

import patsy
patsy.FactorInfo._repr_pretty_ = fixed_factorinfo_repr
patsy.EvalEnvironment.__repr__ = fake_evalenvironment_repr

Then, to "serialize" a DesignInfo, do:

serialized_design_info = repr(my_design_info)

To "deserialize" a DesignInfo, do:

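from collections import OrderedDict
# eval looks up every class name that appears in the repr, so all of them
# (including FactorInfo and EvalEnvironment) must be imported here
from patsy import (DesignInfo, EvalEnvironment, EvalFactor, FactorInfo,
                   Term, SubtermInfo, ContrastMatrix)
from numpy import array
design_info_instance = eval(serialized_design_info)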

To get a design matrix from the restored DesignInfo, you can call patsy.build_design_matrices([design_info_instance], X, return_type="matrix"), where X is your new data.

Depending on your usage of patsy, there may be other __repr__ methods you need to monkey-patch like I did here. Obviously this implementation of __repr__ for EvalEnvironment is going to be insufficient if your formulas reference variables from the surrounding Python environment, because those variables are not saved. None of this works if patsy gets imported before the monkey-patching happens. Buyer beware.

The author of this library deserves major credit for almost completely implementing __repr__, which is what made this workaround possible.
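
The general technique behind this workaround can be sketched without patsy at all: if a class's repr is a valid constructor call, then repr plus eval is a crude round trip. ToyTerm, serialize and deserialize are hypothetical names for illustration only.

```python
class ToyTerm:
    """Hypothetical class whose repr is a valid constructor call."""

    def __init__(self, factors):
        self.factors = tuple(factors)

    def __repr__(self):
        return "ToyTerm(%r)" % (list(self.factors),)

    def __eq__(self, other):
        return isinstance(other, ToyTerm) and self.factors == other.factors


def serialize(obj):
    return repr(obj)


def deserialize(text):
    # Only the names listed here are visible to eval; builtins are hidden.
    # This limits accidents but is NOT a security boundary: never eval
    # text that comes from an untrusted source.
    allowed = {"__builtins__": {}, "ToyTerm": ToyTerm}
    return eval(text, allowed)


term = ToyTerm(["a", "b"])
text = serialize(term)
print(text)                       # ToyTerm(['a', 'b'])
print(deserialize(text) == term)  # True
```

The same caveats apply as with the patsy version: anything the repr leaves out is lost, and eval will run whatever the string contains, so this is only reasonable for files you wrote yourself.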
