Universal table transformer combining univariate transformations dispatched on schema #288

Open
ablaom opened this issue Aug 4, 2020 · 3 comments

@ablaom
Member

ablaom commented Aug 4, 2020

It has been proposed on Slack that it be possible to have a single table transformer that transforms individual columns according to user-specified univariate transformations. This sounds like a good idea, which would also force some uniformity that's a little bit lacking in the current collection of table transformers.

  1. In the most general case I can imagine implementing, the univariate transformer that applies to a particular column is determined by a function of both the name and the scitype of the column (as encoded in the table schema). This has the disadvantage that the user must specify a function of two arguments, or interact through some other complicated interface.

  2. The alternative would be a compositional approach. Each tabular transformer applies a single univariate transformer to all columns with the specified names and scitypes (or to all columns *without* them, through a Boolean `ignore` parameter). Columns not referred to are left alone. Composing such transformers would cover all conceivable use-cases. However, as we are currently locked into Tables.jl (whose tables are non-mutable in general), we get a lot more copying of data.
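
To make option 2 concrete, here is a minimal sketch using only Tables.jl. The helper `transform_columns` and its keywords `names` and `ignore` are hypothetical names chosen for illustration, not an existing MLJ API. Selection by scitype is omitted and would need the schema from ScientificTypes.jl.

```julia
using Tables

# Hypothetical helper (not part of MLJ): apply the univariate function `f`
# to every column whose name is in `names`, or, if `ignore=true`, to every
# column whose name is NOT in `names`. All other columns are passed through.
function transform_columns(f, X; names=Symbol[], ignore=false)
    Tables.istable(X) || throw(ArgumentError("X must be a Tables.jl table"))
    cols = Tables.columns(X)
    allnames = collect(Tables.columnnames(cols))
    newcols = map(allnames) do name
        col = Tables.getcolumn(cols, name)
        selected = (name in names) != ignore   # `ignore` flips the selection
        selected ? f(col) : col
    end
    # A named tuple of vectors is itself a Tables.jl column table
    return NamedTuple{Tuple(allnames)}(Tuple(newcols))
end

X = (a = [1.0, 2.0, 3.0], b = [10, 20, 30], c = ["x", "y", "z"])

center(v) = v .- sum(v) / length(v)

# Two single-purpose transformations, composed
Y = transform_columns(center, X; names=[:a])
Z = transform_columns(v -> 2 .* v, Y; names=[:b])
```

Each call builds a new table rather than mutating `X`, which is the copying cost mentioned in option 2.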

Thoughts anyone?

@ablaom
Member Author

ablaom commented Oct 13, 2020

I'm inclined to go with option 2, which is more user-friendly. The other issue ought to be solved on the tables interface side, in my opinion.

@ParadaCarleton

What about the opposite: a way to limit a multivariate transform to a subset of columns? This seems more general, since any univariate transform can be expressed as a multivariate transform restricted to a single column, but not vice versa (e.g. PCA).

I'm not sure how I'd go about implementing this, though (given only MLJ primitives). Is there a package or interface used by MLJ for messing about with tables?

@ablaom
Member Author

ablaom commented Oct 29, 2023

> Is there a package or interface used by MLJ for messing about with tables?

In MLJ a "table" is anything implementing the Tables.jl interface and satisfying Tables.istable(X) == true. Unfortunately, the generality of Tables.jl makes it less than ideal for our purposes, as it aims to include out-of-memory tables and tables with an unknown number of rows (e.g., lazily iterated). The maintainers are very thoughtful, but reluctant to add any complexity. The API has no method to mutate columns in-place. There is now a Tables.subset method for random access of rows, but this took a very long time to get. Maybe a new specialised package is needed, but no-one has ventured to write one.

The method MLJModelInterface.nrows(X) will get you the number of rows, by basically materialising an entire column if necessary (see also "Aside" below).
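
As a rough illustration of the primitives mentioned above, assuming a recent Tables.jl release that includes `Tables.subset`: the row count below is obtained by materialising one column, which mirrors the idea described here but is not necessarily how MLJModelInterface.nrows is implemented.

```julia
using Tables

X = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])   # a named tuple of vectors is a table

Tables.istable(X)                           # the check MLJ relies on

# Random access to rows; for a range of indices the result is again a table
first_two = Tables.subset(X, 1:2)

# Row count using only Tables.jl primitives: materialise a single column
cols = Tables.columns(X)
n = length(Tables.getcolumn(cols, first(Tables.columnnames(cols))))
```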

MLJModelInterface has methods selectrows and selectcols, based on Tables.jl primitives, but I'd now recommend Tables.subset over selectrows. I expect TableTransforms.jl is your best bet for general table manipulations, although it's probably too heavy a dep for MLJModels.jl.
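
For the original question of restricting a multivariate transform to a subset of columns, something along the following lines may be enough. It is only a sketch: `transform_subset` and the toy transform `g` are hypothetical. It also materialises the whole table as a named tuple of vectors via `Tables.columntable`, so it assumes the table fits in memory.

```julia
using Tables

# Hypothetical helper: apply the table-to-table function `g` only to the
# columns listed in `names`, then splice its output back with the rest.
function transform_subset(g, X, names)
    cols = Tables.columntable(X)                # NamedTuple of vectors
    selected = NamedTuple{Tuple(names)}(cols)   # the sub-table that `g` sees
    rest_names = Tuple(n for n in keys(cols) if !(n in names))
    rest = NamedTuple{rest_names}(cols)         # untouched columns
    return merge(rest, Tables.columntable(g(selected)))
end

X = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0], c = ["x", "y", "z"])

# Toy multivariate transform standing in for something like PCA:
# it mixes the two input columns and returns two new ones.
g(t) = (s = t.a .+ t.b, d = t.a .- t.b)

Z = transform_subset(g, X, [:a, :b])
```

Here the untouched columns come first and the transformed ones are appended. Preserving the original column order would need a little extra bookkeeping.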

A package called TableOperations.jl provided some useful tools for tables, but is no longer maintained, as far as I can tell.

Aside: Another interface, DataAPI.jl, provides DataAPI.nrow (not DataAPI.nrows), which is implemented by DataFrames.jl and, more recently, by some of the table types actually owned by Tables.jl, such as the matrix table wrapper. I'd consider restricting MLJ's definition of a table to require an implementation of DataAPI.nrow, but that would be breaking. One reason for doing so is that tables with this feature also fit into the MLUtils.jl API.
