Make DesignMatrixBuilders pickleable and saveable #26

njsmith · 2013-10-06T13:35:29Z

Use cases:

It should be possible to pickle a DesignMatrixBuilder (and/or DesignInfo, same issue)
Checking if two designs are the same: this comes up for rERPy -- it's only valid to form a grand average across multiple analyses if the underlying regressions were the same. In particular it would be good to be able to check for subtle gotchas like use of center(...) with different means across the different analyses.
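To make the center(...) gotcha concrete, here is a minimal sketch in plain Python. The `Center` class and the `same_design` helper are hypothetical stand-ins invented for illustration, not patsy's implementation; the point is only that two analyses using the same formula text can still carry different memorized state.

```python
class Center:
    """Hypothetical stand-in for a stateful transform such as center(...)."""

    def __init__(self):
        self.mean = None

    def memorize(self, values):
        # Remember the mean of the data seen at fit time.
        self.mean = sum(values) / len(values)

    def transform(self, values):
        return [v - self.mean for v in values]

    def state(self):
        return {"transform": "center", "mean": self.mean}


def same_design(a, b):
    """Two designs match only if their memorized state matches too."""
    return a.state() == b.state()


subject_1 = Center()
subject_1.memorize([1.0, 2.0, 3.0])   # mean 2.0

subject_2 = Center()
subject_2.memorize([2.0, 3.0, 4.0])   # mean 3.0

subject_3 = Center()
subject_3.memorize([3.0, 2.0, 1.0])   # mean 2.0 again

print(same_design(subject_1, subject_2))  # False
print(same_design(subject_1, subject_3))  # True
```

Comparing only the formula string would treat all three as identical, which is exactly the case where a grand average would be invalid.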

The easy part of this is reviewing the inner structure of DesignMatrixBuilder (column builders and all that) to make sure it's sensible, and similarly for factor state dicts.

The more complicated part is capturing the evaluation environment in a reasonable way.

Precondition: #25

egrublyte · 2014-10-23T09:26:38Z

Hi there,

I was wondering if there is any news regarding this issue.
I also saw some discussions in:
https://groups.google.com/forum/#!topic/pydata/kcy79nrcFf4
Are there any existing workarounds to overcome this issue?

Many thanks in advance!

ghost · 2015-04-15T19:25:56Z

Hey folks,
The dill module (https://pypi.python.org/pypi/dill) is able to pickle DesignMatrixBuilders.
Just tried it!

njsmith · 2015-04-15T19:39:15Z

Really? That's bad!

The problem isn't so much making it work at all, it's making it work in
such a way that a DesignMatrixBuilder pickled with patsy version X will
unpickle correctly on patsy version Y. If you use dill now it will probably
break later. (We should just have an error message, but I was lazy about
adding this because pickle didn't really work anyway.)

Also, the way dill "works" at the moment is basically to pickle the whole
universe just accidentally (b/c it will pickle every local variable that
happens to be sitting around in the environment where you used patsy, which
may well include your gigabyte sized dataset or whatever).
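The size problem can be illustrated without patsy at all. The sketch below assumes a formula factor is just a Python expression string; `referenced_names` and `subset_namespace` are hypothetical helpers written for this example, not patsy functions. It contrasts pickling a whole namespace with pickling only the names the expression actually uses, which is roughly the idea behind #25.

```python
import ast
import pickle


def referenced_names(expression):
    """Return the set of variable names used in a Python expression."""
    tree = ast.parse(expression, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def subset_namespace(namespace, expression):
    """Keep only the entries of `namespace` that the expression refers to."""
    names = referenced_names(expression)
    return {k: v for k, v in namespace.items() if k in names}


# A stand-in for the caller's local variables, including a large dataset.
namespace = {
    "offset": 10,
    "big_dataset": bytes(5_000_000),  # 5 MB of zeros
}

expression = "x + offset"  # "x" would come from the data, not the namespace

whole = pickle.dumps(namespace)
trimmed = pickle.dumps(subset_namespace(namespace, expression))

print(len(whole) > 5_000_000)   # True: the dataset came along for the ride
print(len(trimmed) < 100)       # True: only "offset" was captured
```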

chrish42 · 2015-05-20T02:22:37Z

Yeah. Don't use dill to pickle v0.3 DesignMatrixBuilder objects. Patsy v0.4 will support pickling. (The harder bits are done.)

@njsmith I guess now that #25 is taken care of, the remaining bit is adding __setstate__ and friends to the right classes, no? Or do you see more to it? Which classes should that include? Just DesignMatrixBuilder? DesignInfo also? Or more? I'll do a first pass at a branch to solve this (and submit a pull request) once I have a bit of a better idea of what you have in mind.
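For readers unfamiliar with the hooks being discussed, here is a rough sketch of what version-tagged `__getstate__`/`__setstate__` could look like. `BuilderLike`, its attributes, and the version numbers are all hypothetical and chosen for illustration; this is not the code from the pull request.

```python
import pickle


class BuilderLike:
    """Hypothetical class, used only to illustrate versioned pickling."""

    _PICKLE_VERSION = 2

    def __init__(self, column_names, term_names):
        self.column_names = list(column_names)
        self.term_names = list(term_names)

    def __getstate__(self):
        # Store an explicit dict plus a version tag rather than relying on
        # whatever attributes the instance happens to have.
        return {
            "version": self._PICKLE_VERSION,
            "column_names": self.column_names,
            "term_names": self.term_names,
        }

    def __setstate__(self, state):
        version = state.get("version")
        if version == 1:
            # Pretend version 1 lacked term_names; fill in a default.
            state = dict(state, term_names=[], version=2)
        elif version != 2:
            raise ValueError("unsupported pickle version: %r" % (version,))
        self.column_names = state["column_names"]
        self.term_names = state["term_names"]


# pickle calls these hooks itself; they are invoked by hand here so the
# round trip is easy to follow.
original = BuilderLike(["Intercept", "x"], ["Intercept", "x"])
state_bytes = pickle.dumps(original.__getstate__())
restored = BuilderLike.__new__(BuilderLike)
restored.__setstate__(pickle.loads(state_bytes))
print(restored.column_names)  # ['Intercept', 'x']

# An older payload is upgraded instead of silently misread.
upgraded = BuilderLike.__new__(BuilderLike)
upgraded.__setstate__({"version": 1, "column_names": ["Intercept", "x"]})
print(upgraded.term_names)  # []
```

The explicit version tag is what lets a later release recognize, upgrade, or reject an older payload, which is the cross-version concern raised earlier in the thread.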

chrish42 · 2015-05-21T02:52:29Z

Created pull request #67 to work on this.

…ignInfo

Notable changes:

- DesignInfo now exposes lots more metadata about how exactly different factors and terms are coded.
- In fact, it exposes enough more metadata that you can now reconstruct a design matrix entirely from a DesignInfo, so DesignMatrixBuilder becomes redundant and is removed in favor of DesignInfo.
- DesignInfo's constructor is very different; in particular, removed the option of specifying terms as strings, which was only useful for interoperability with competing formula libraries. Four years later, no such competitors have appeared, so I can't be bothered to keep maintaining this. Will re-add later if someone actually wants to use it.

This version works and passes tests, but a bunch more tests need to be added. This fixes #61, and sets us up to implement pickling support (#26, #67).

alexdamour · 2016-03-15T20:07:38Z

Hey folks (particularly @chrish42), how's the progress on this issue? I saw PR #67 is still open and has been stalled for a while. How much work is left here for this feature to be functional?

chrish42 · 2016-03-16T02:08:33Z

The remaining step is to write unit tests for the serialization objects, to make sure that patsy doesn't (unknowingly) break support for formulas, etc. pickled with past versions. I've been kept busy with other things, but my goal is to get this finished before PyCon 2016, so I'm starting work on this piece in the coming weekends.
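One common way to guard against that kind of breakage is a "golden payload" test: keep bytes written by an older release and assert that current code still loads them. The sketch below is an assumption about what such a test might look like, not the framework from the pull request; `GOLDEN_V1` and `load_state` are hypothetical names, and the payload is a hand-written protocol-2 pickle of a tiny dict.

```python
import pickle

# Hypothetical golden payload: a state dict as an older release might have
# written it (here a protocol-2 pickle of {"version": 1}). In a real test
# suite this would live in a file checked into the repository.
GOLDEN_V1 = b"\x80\x02}q\x00X\x07\x00\x00\x00versionq\x01K\x01s."


def load_state(payload):
    """Unpickle a stored state and check that its version is understood."""
    state = pickle.loads(payload)
    if state.get("version") not in (1, 2):
        raise ValueError("unsupported pickle version: %r" % (state.get("version"),))
    return state


def test_old_pickle_still_loads():
    assert load_state(GOLDEN_V1) == {"version": 1}


test_old_pickle_still_loads()
```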

njsmith · 2016-03-16T04:47:59Z

@chrish42: that would be great! Of course feel free to ask for help as well if you are stalled -- maybe @alexdamour wants to help, for example ;-)

chrish42 · 2016-06-04T00:07:12Z

I'm using the PyCon sprints to start working on this again. As a first step, I'm cleaning up the description of pull request #67 to have an actual list of tasks that must be done to close this. My next step is to update the pull request with enough code so people can see what the approach would look like.

yongcho822 · 2016-06-08T21:42:22Z

@chrish42 hey guys, wondering if there's any new update on this front? thanks!

chrish42 · 2016-06-09T02:25:32Z

Sure. I've had a very productive sprint at PyCon. I know the "0 of 11 tasks complete" hasn't moved, but if you go look at the "code" tab of the pull request, you'll now see a pretty fleshed out testing framework for pickling. Once @njsmith is happy with that part, I can start implementing __getstate__ and __setstate__ for all the patsy objects that we want to support saving to disk (very easy to do) and adding a bunch of pickling testcases (easy too, with a proper framework for it).

If you want to follow the progress, have a look at the pull request, as this bug report will stay pretty quiet until we close it.

elexira · 2016-12-07T01:15:51Z

pleaseee fix this issue for the love of god !

NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help.

datascientette · 2017-01-23T18:58:32Z

+1 Also need to pickle

christang · 2017-11-20T14:39:11Z

+1

For anyone interested, I've made a fork of patsy and merged this branch into the fork. I needed to make changes to the test cases so they worked properly for me, but this is fully working for me and we are using it in production.

I'm continuing to follow this thread so that we can switch back to using the patsy main repository once the finalized solution gets merged in.

saroele · 2018-03-05T15:17:52Z

@christang I tried your branch, but it does not work for me. Can you confirm that this is supposed to work:

import pickle
from patsy import ModelDesc, Term, LookupFactor

response_term = [Term([LookupFactor('test')])]
# Pickle files must be opened in binary mode.
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(response_term, f)

I still get the NotImplementedError.

christang · 2018-03-06T11:58:38Z

@saroele Thanks for the note. I believe this branch only adds support for design matrix/info objects, so your other objects may still lack pickling support. I can confirm that your code does not work for me either.

ciberger · 2018-09-19T16:06:58Z

@christang, can you help me out with this, please? I'm currently facing a NotImplementedError even though I believe I'm doing the pickling right. Any advice would be greatly appreciated.

y, X = dmatrices(formula_like, df_model, return_type="dataframe")

with open(models_path + filename, 'wb') as file:
    pickle.dump(X.design_info, file)

bertomartin · 2019-05-08T21:50:13Z

This is really needed, guys. What's blocking this implementation?

njsmith · 2019-05-08T23:53:49Z

@bertomartin Patsy maintenance is done on a purely-volunteer basis, and I haven't really had time to work on it (or even review PRs) in several years now. If someone needs this and has funding to spend on it, we could talk about some kind of consulting contract...

bertomartin · 2019-05-10T15:38:58Z

@njsmith thanks for the great work so far. Ok, I'll take a stab at it.

insperatum · 2020-07-11T23:32:21Z

@bertomartin any news?

petrhrobar · 2022-04-24T19:02:04Z

I have also tried looking into it:

import h5py

def save_patsy(patsy_step, filename):
    """Save the patsy step's design_info_ into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_coefficients(patsy_step, filename):
    """Attach the saved design_info_ back onto a patsy step."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


save_patsy(pipe['patsy'], "clf.h5")

Perhaps something simple like this?

However, it is still not working.

matthewwardrop · 2022-04-27T03:34:14Z

Hi @petrhrobar. I recommend you check out formulaic if you want support for pickling.

kyle-pena-nlp · 2023-05-11T17:04:26Z

Here is a partial solution.

Before you first import patsy:

def fixed_factorinfo_repr(self,p,cycle):
    assert not cycle
    kwlist = [("factor", self.factor),
              ("type", self.type),
              ("state", self.state)
              ]
    if self.type == "numerical":
        kwlist.append(("num_columns", self.num_columns))
    else:
        kwlist.append(("categories", self.categories))
    patsy.util.repr_pretty_impl(p, self, [], kwlist)
    
def fake_evalenvironment_repr(self):
    return "EvalEnvironment([])"

import patsy
patsy.FactorInfo._repr_pretty_ = fixed_factorinfo_repr
patsy.EvalEnvironment.__repr__ = fake_evalenvironment_repr

Then, to "serialize" a DesignInfo, do:

serialized_design_info = repr(my_design_info)

To "deserialize" a DesignInfo, do:

from collections import OrderedDict
from patsy import DesignInfo, FactorInfo, EvalEnvironment, EvalFactor, Term, SubtermInfo, ContrastMatrix
from numpy import array
design_info_instance = eval(serialized_design_info)

To get a design matrix from a design info, you can use the method patsy.build_design_matrices([design_info], X, return_type = "matrix")

Depending on your usage of patsy, there may be other __repr__ methods you may need to monkey-patch like I did here. Obviously the implementation of __repr__ for EvalEnvironment is going to be insufficient if you reference environment variables in your formulas. None of this works if patsy gets imported before the monkey-patching happens. Buyer beware.

The author of this library deserves major credit for almost completely implementing __repr__, which is what made this workaround possible.
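The workaround above relies on repr() producing text that is itself a valid constructor expression. Here is a minimal sketch of that round trip in plain Python, using a hypothetical `TermLike` class rather than any patsy object. It also shows passing the needed names to eval() explicitly instead of relying on whatever happens to be imported.

```python
class TermLike:
    """Hypothetical class whose repr is a valid constructor call."""

    def __init__(self, factors):
        self.factors = list(factors)

    def __repr__(self):
        return "TermLike(%r)" % (self.factors,)

    def __eq__(self, other):
        return isinstance(other, TermLike) and self.factors == other.factors


term = TermLike(["a", "b"])

# "Serialize": the repr is plain text and can be written to a file.
serialized = repr(term)
print(serialized)  # TermLike(['a', 'b'])

# "Deserialize": every name that appears in the text must be resolvable,
# so pass them in explicitly. eval() runs arbitrary code, so only use it
# on strings you produced yourself.
restored = eval(serialized, {"__builtins__": {}}, {"TermLike": TermLike})
print(restored == term)  # True
```

The same constraint explains why any object whose repr is a placeholder (rather than a constructor call) breaks the approach and needs patching first.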

njsmith mentioned this issue Oct 6, 2013

Capture only the values of referenced variables in formula namespace #25

Closed

louispotok mentioned this issue Apr 30, 2016

Patsy pickling #67

Open

11 tasks

saroele mentioned this issue Mar 5, 2018

save and load of multivariable analysis models opengridcc/opengrid#35

Open

This was referenced Sep 20, 2020

How to save/serialize model: WeibulAFTFitter CamDavidsonPilon/lifelines#1138

Closed

Fix pickle issues CamDavidsonPilon/lifelines#845

Merged

tomicapretto mentioned this issue Aug 24, 2021

Coarse-grained multiprocessing with bambi bambinos/bambi#400

Open

roeggealissa mentioned this issue Jan 25, 2022

OrderedResults.save: pickling not supported statsmodels/statsmodels#8040

Open

koaning mentioned this issue Apr 20, 2022

PatsyTransformer in inference koaning/scikit-lego#505

Closed

matthewwardrop added the fixed in formulaic label Apr 27, 2022

lorentzenchr mentioned this issue Oct 24, 2022

Support initializing matrices with Patsy? Quantco/tabmat#145

Closed

lucazav mentioned this issue Nov 25, 2023

Error when pickling an object obtained by draw() has2k1/plotnine#729

Closed
